API for ClinVar Variants

This API provides access to information about simple genetic variants taken from the ClinVar dataset, restricted to records whose assembly is GRCh37 or "na". For details about how we have processed the data, see the data construction section.

Source files: The ClinVar data files variant_summary.txt.gz, hgvs4variation.txt.gz, and variation_allele.txt.gz.

API Demo

The following demo shows how this API might be used with an autocompleter we've developed. (Example: Try typing FAM.)

For further experimentation with the autocompleter and this API, try the autocompleter demo page.

API Documentation

API Base URL: https://clinicaltables.nlm.nih.gov/api/variants/v4/search (+ query string parameters)

This data set may also be accessed through the FHIR ValueSet $expand operation.

In addition to the base URL, you will need to specify other parameters. See the query string parameters section below for details.

Query String Parameters and Default Values

At a minimum, when using the above base URL, you will need to specify the "terms" parameter containing a word or partial word to match.

Parameter Name | Default Value | Description
terms | (none) | (Required.) The search string (e.g., just a part of a word) for which to find matches in the list. More than one partial word can be present in "terms", in which case there is an implicit AND between them.
maxList | 7 | Optional, with a default of 7. Specifies the number of results requested, up to the upper limit of 500. If present but the value is empty, 500 will be used.
q | (none) | An optional, additional query string used to further constrain the results returned by the "terms" field. Unlike the terms field, "q" is not automatically wildcarded, but can include wildcards and can specify field names. See the Elasticsearch query string page for documentation of supported syntax.
df | VariationID,Name | A comma-separated list of display fields (from the fields section below) which are intended for the user to see when looking at the results.
sf | All fields | A comma-separated list of fields to be searched.
cf | VariationID | A field to regard as the "code" for the returned item data.
ef | (none) | A comma-separated list of additional fields to be returned for each retrieved list item. (See the Output format section for how the data for fields is returned.) If you wish the keys in the returned data hash to be something other than the field names, you can specify an alias for the field name by separating it from its field name with a colon, e.g., "ef=field_name1:alias1,field2,field_name3:alias3", etc. Note that not every field specified in the ef parameter needs to have an alias.
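
As an illustration of how these parameters combine into a request, here is a minimal Python sketch that assembles a query URL. The function name build_search_url and its keyword arguments are hypothetical helpers, not part of the API; only the base URL and the parameter names come from the table above.

```python
from urllib.parse import urlencode

BASE_URL = "https://clinicaltables.nlm.nih.gov/api/variants/v4/search"


def build_search_url(terms, max_list=None, q=None, df=None, sf=None, cf=None, ef=None):
    """Assemble a search URL; only parameters that were given are included."""
    params = {"terms": terms}  # "terms" is the only required parameter
    optional = {"maxList": max_list, "q": q, "df": df, "sf": sf, "cf": cf, "ef": ef}
    for name, value in optional.items():
        if value is None:
            continue
        # df, sf and ef take comma-separated lists, so join sequences here
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        params[name] = value
    # Keep "," and ":" unescaped so field lists and aliases stay readable
    return BASE_URL + "?" + urlencode(params, safe=",:")


# "dbSNP:rs" asks for the dbSNP field under the alias "rs"
url = build_search_url("FAM", max_list=5, ef=["GeneSymbol", "dbSNP:rs"])
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is a JSON array as described in the Output format section.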

Variants Field Descriptions

Field | Field Description
AlternateAllele | The value of the AlternateAllele field in the source file.
AlleleID | The ID of the allele as taken from the AlleleID column of the source file.
AminoAcidChange | This is the amino acid change (starting with "p.") parsed from the Name field.
Chromosome | The chromosome number, taken from the Chromosome field in the source file.
ChromosomeAccession | The chromosome accession number, taken from the ChromosomeAccession field in the source file.
Cytogenetic | The cytogenetic location of the allele, taken from the "Cytogenetic" field in the source file.
dbSNP | The "rs" ID number from dbSNP, taken from the "RS# (dbSNP)" field in the source file.
GeneID | The gene ID from NCBI's gene database.
GeneSymbol | This is the GeneSymbol field listed in the source file. It is the symbol for the gene that overlaps the variant.
GenomicLocation | This is an HL7-style concatenation of the Start and Stop fields, i.e., Start^Stop.
hgnc_id | A unique ID provided by the HGNC for each gene with an approved symbol. Although standard HGNC IDs are of the format HGNC:n, where n is a number, we have removed the "HGNC:" prefix, so that these values are just numbers.
hgnc_id_num | This is the hgnc_id with the "HGNC:" prefix removed, just in case some apps may want the autocomplete to work on the numeric part of the id as well.
HGVS_c | The NucleotideExpression field from the source file. (The "RefSeq cDNA-based HGVS expression".)
HGVS_exprs | The list of all NucleotideExpression and ProteinExpression values for the variant.
HGVS_p | The ProteinExpression field from the source file. (The "RefSeq protein-based HGVS expression".)
Name | This is the "Name" field (a description of the allele) from the source file.
NucleotideChange | This is the nucleotide change (usually starting with "c.") parsed from the Name field.
phenotypes | This list only includes phenotypes that have a MedGen ID. It contains disease name and MedGen code pairs. The data is stored as an array of objects, where each object has a "text" property (with the disease name) and a "code" property. See PhenotypeIDS and PhenotypeList for more details.
phenotype | This contains the shortest disease name from the phenotypes field above. (There are some situations where one is enough.) Like the phenotypes field, its value is an object with a "text" property for the disease name and a "code" field. For the "df" (display field) API parameter, the code by itself or the disease name by itself can be requested as phenotype.code or phenotype.text. When just "phenotypes" is specified with "df", the code and disease name values will be returned as a combined string.
PhenotypeIDS | The list of phenotype ids for the variant.
PhenotypeList | The list of phenotypes (disease names) for the variant.
RefSeqID | This is the RefSeq accession number parsed out of the "Name" field.
ReferenceAllele | The value of the ReferenceAllele field in the source file.
Start | The starting position of the allele, taken from the Start field in the source file.
Stop | The ending position of the allele, taken from the Stop field in the source file.
Type | The type of the variant, taken from the Type field in the source file.
VariationID | The preferred id for the ClinVar variants data. It's taken from the VariationID field in the source file.
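
Two of the fields above carry structured values: GenomicLocation is a Start^Stop string, and phenotypes is an array of objects with "text" and "code" properties. The Python sketch below shows one way a client might unpack them. The function names and the sample values are made up for illustration, and accepting either a parsed list or a JSON array string for phenotypes is an assumption, since the exact shape returned by the API is not spelled out here.

```python
import json


def split_genomic_location(value):
    """Split an HL7-style 'Start^Stop' string into two integers."""
    start, stop = value.split("^")
    return int(start), int(stop)


def phenotype_pairs(phenotypes):
    """Return (code, text) tuples from a phenotypes value.

    Assumption: the value is either a list of objects or a JSON array string.
    """
    if isinstance(phenotypes, str):
        phenotypes = json.loads(phenotypes)
    return [(p["code"], p["text"]) for p in phenotypes]


# Illustrative values only, not taken from a real record
print(split_genomic_location("100^200"))
print(phenotype_pairs('[{"text": "Example disease", "code": "C0000000"}]'))
```

Values that were combined from differing records may not follow the simple Start^Stop shape, so a real client would probably want to guard the split with error handling.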

Output format

Output for an API query is an array of the following elements:

  1. The total number of results on the server (which can be more than the number returned). For APIs in which there are millions of records, this number might be a lower bound due to early termination if there are more than a hundred thousand results.
  2. An array of codes for the returned items. (This is the field specified with the cf query parameter above.)
  3. A hash of the "extra" data requested via the "ef" query parameter above. The keys on the hash are the fields (or their requested aliases) named in the "ef" parameter, and the value for a field is an array of that field's values in the same order as the returned codes.
  4. An array, with one element for each returned code, where each element is an array of the display strings specified with the "df" query parameter.
  5. An array, with one element for each returned code, where each element is the "code system" for the returned code. Note that only code-system aware APIs will return this array.
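
Because the codes, the "ef" hash, and the display strings are parallel arrays, a client usually has to zip them back together. The following Python sketch shows one way to do that, assuming the default display fields (VariationID,Name). The function name rows_from_response is hypothetical, and the sample response is a shortened copy of the second sample query in this document (first two results only).

```python
def rows_from_response(response, display_fields=("VariationID", "Name")):
    """Turn the positional response array into one dict per returned item."""
    total, codes, extra, display = response[:4]
    extra = extra or {}  # element 3 is null when no "ef" fields were requested
    rows = []
    for i, code in enumerate(codes):
        row = {"code": code}
        # Display strings come back in the order given by the "df" parameter
        row.update(zip(display_fields, display[i]))
        # Each "ef" value array is in the same order as the codes
        for field, values in extra.items():
            row[field] = values[i]
        rows.append(row)
    return total, rows


sample = [
    41738,
    ["359712", "336720"],
    {"GeneSymbol": ["FAM126A", "FAM161A"]},
    [["359712", "NM_032581.4(FAM126A):c.*4329del"],
     ["336720", "NM_001201543.2(FAM161A):c.*815del"]],
]
total, rows = rows_from_response(sample)
print(total, rows[0]["GeneSymbol"], rows[0]["Name"])
```

If a different "df" list is requested, the same names should be passed as display_fields so the keys line up with the returned display strings.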

Sample API Queries

Query: https://clinicaltables.nlm.nih.gov/api/variants/v4/search?terms=FAM
Result: [41738, ["359712","336720","336728","21725","336713","336729","336732"], null, [ ["359712","NM_032581.4(FAM126A):c.*4329del"], ["336720","NM_001201543.2(FAM161A):c.*815del"], ["336728","NM_001201543.2(FAM161A):c.*248del"], ["21725","FAM126A:c.627-439_831+348del"], ["336713","NM_001201543.2(FAM161A):c.*1345G>A"], ["336729","NM_001201543.2(FAM161A):c.*196A>C"], ["336732","NM_001201543.2(FAM161A):c.*26C>T"]]]
Description: Returns the first 7 matches for terms starting with FAM (which in this case matches only gene names).

Query: https://clinicaltables.nlm.nih.gov/api/variants/v4/search?terms=FAM&ef=GeneSymbol
Result: [41738, ["359712","336720","336728","21725","336713","336729","336732"], {"GeneSymbol":["FAM126A","FAM161A","FAM161A","FAM126A","FAM161A","FAM161A","FAM161A"]}, [ ["359712","NM_032581.4(FAM126A):c.*4329del"], ["336720","NM_001201543.2(FAM161A):c.*815del"], ["336728","NM_001201543.2(FAM161A):c.*248del"], ["21725","FAM126A:c.627-439_831+348del"], ["336713","NM_001201543.2(FAM161A):c.*1345G>A"], ["336729","NM_001201543.2(FAM161A):c.*196A>C"], ["336732","NM_001201543.2(FAM161A):c.*26C>T"]]]
Description: This is the same query as above but with a request to return the GeneSymbol field as an "extra field" (ef).

Data Construction Details

This section describes in detail the steps we followed in processing the source files, to provide more specifics about the content we are providing and also to allow someone to reproduce our processing. This is not quite an algorithm, but details the changes made to the data.

Highlights (summary)

  1. There are 3 files to work with: variant_summary.txt.gz, hgvs4variation.txt.gz, and variation_allele.txt.gz
  2. Only records which are for assembly "GRCh37" or "na", and which have a Name field containing an NCBI RefSeq and a DNA HGVS expression are included.
  3. The Name field is parsed and used to populate new fields AminoAcidChange and NucleotideChange.
  4. Besides the PhenotypeIDS and PhenotypeList fields that come with the data, two new fields, phenotype and phenotypes, have been created from them, to include only the phenotypes that have MedGen codes. The field values are JSON object strings for phenotype and JSON array strings for phenotypes. For more details, please see the "Processing details" section below.
  5. Records for the same VariationID are combined into a single record, with the values for each field being combined either with tildes or brackets, depending on the field.

Processing details:

  1. Only include records that have the "Assembly" field set to "GRCh37" or "na"
  2. A record's Name field value usually begins with an NCBI RefSeq accession number, possibly followed by a gene symbol, followed by a DNA HGVS expression, possibly followed by a protein HGVS expression. The regular expression used to check this is:
          /^(N\S_\d+\.?\d+)(\(([A-Z0-9]+)\))?:(\S\.\S+)( \((\S.\S+)\))?/
  3. Parse the Name field using the above regular expression, and then: store the protein HGVS piece in a new field, AminoAcidChange; store the RefSeq accession number in a new field, RefSeqID; and use the DNA HGVS piece as the value for the NucleotideChange field. When any of these pieces does not exist in the Name, the fields are backfilled with information from the following fields of the data files: the ProteinChange field, the NucleotideChange field, and the NucleotideExpression field.
  4. Store all the values of the record's NucleotideExpression and ProteinExpression fields as HGVS_exprs, and select a "best" pair of NucleotideExpression and ProteinExpression to populate HGVS_c and HGVS_p. The rules for selecting this best pair are as follows: the first pair whose NucleotideExpression starts with "NM" and whose ProteinExpression is populated is selected; if no such pair exists, the first pair whose NucleotideExpression starts with "NM" is selected; otherwise, the first pair seen is selected.
  5. If there is a value in the record's Chromosome field, create a GenomicLocation field as the value of the Start field, plus a '^' character, plus the value of the Stop field.
  6. If the value in the record's "RS# (dbSNP)" field is not "-1", create a "dbSNP" field equal to "rs" plus that value.
  7. The field "nsv/esv (dbVar)" is renamed to nsv_esv.
  8. For each phenotype (based on PhenotypeIDS and PhenotypeList) in the record that has a MedGen code, a JSON object is created with two properties: "code" that is the MedGen code, and "text" that is the phenotype name. These JSON objects are then stored in the "phenotypes" field as a JSON array string. The one such phenotype that has the shortest phenotype name (text) is stored in the "phenotype" field as a JSON object string.
  9. Combine data for records sharing the same VariationID. For the ReferenceAllele and AlternateAllele fields, the combined value is whatever values are in the variant's records joined with a tilde, regardless of whether those values are the same or not. For other fields, if the values are the same across the variant's records, the combined value is simply that shared value. If the values are not all the same, then the combined value is created by wrapping each value (unique or not) in brackets and concatenating the result.
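
To make a few of these steps concrete, here is a rough Python sketch of the Name parsing (steps 2 and 3), the "best" pair selection (step 4), and the per-field combination rule (step 9). It is an interpretation of the description above, not the code used to build the data. The function names are hypothetical, square brackets are assumed for the "brackets" in step 9, and the Name and HGVS values with placeholder accessions are invented to exercise the rules.

```python
import re

# Regular expression from step 2, written as a Python raw string
NAME_RE = re.compile(r"^(N\S_\d+\.?\d+)(\(([A-Z0-9]+)\))?:(\S\.\S+)( \((\S.\S+)\))?")

# Fields whose values are always joined with a tilde (step 9)
TILDE_FIELDS = {"ReferenceAllele", "AlternateAllele"}


def parse_name(name):
    """Split a Name value into the pieces used in step 3, or return None."""
    m = NAME_RE.match(name)
    if m is None:
        return None
    return {
        "RefSeqID": m.group(1),
        "gene_symbol_in_name": m.group(3),  # None when no gene symbol is present
        "NucleotideChange": m.group(4),
        "AminoAcidChange": m.group(6),      # None when no protein part is present
    }


def select_best_pair(pairs):
    """Pick the (NucleotideExpression, ProteinExpression) pair per step 4."""
    for nuc, prot in pairs:
        if nuc.startswith("NM") and prot:
            return nuc, prot
    for nuc, prot in pairs:
        if nuc.startswith("NM"):
            return nuc, prot
    return pairs[0] if pairs else None


def combine_field(field, values):
    """Combine one field's values across records sharing a VariationID."""
    if field in TILDE_FIELDS:
        return "~".join(values)
    if len(set(values)) == 1:
        return values[0]
    # Assumption: "brackets" means square brackets
    return "".join("[" + v + "]" for v in values)


print(parse_name("NM_032581.4(FAM126A):c.*4329del"))
print(parse_name("NM_000000.0(GENE1):c.1A>G (p.Met1Val)"))  # invented name
print(combine_field("ReferenceAllele", ["A", "A"]), combine_field("Start", ["10", "12"]))
```

A Name without a leading RefSeq accession, such as "FAM126A:c.627-439_831+348del" from the sample results, does not match the pattern, so parse_name returns None for it.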