API for ClinVar Alleles
This API provides access to information about alleles in simple genetic variants taken from the ClinVar dataset, restricted to the GRCh37 assembly. Unlike the "Variants" data table, which has a single record per variant ID, this data table has been processed to have a single record per allele ID. For details about how we have processed the data, see the data construction section.
Source file: The ClinVar "variant_summary.txt.gz" file.
This service is provided "as is" and free of charge. Please see the Frequently Asked Questions page for more details on terms of service, etc.
API Demo
The following demo shows how this API might be used with an autocompleter we've developed. (Example: Try typing FAM.)
For further experimentation with the autocompleter and this API, try the autocompleter demo page.
API Documentation
API Base URL: https://clinicaltables.nlm.nih.gov/api/alleles/v3/search (+ query string parameters)
This data set may also be accessed through the FHIR ValueSet $expand operation.
In addition to the base URL, you will need to specify other parameters. See the query string parameters section below for details.
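As a minimal sketch of how a request URL might be assembled from the base URL and its query string parameters, the Python below uses only the standard library. The helper names `build_search_url` and `fetch` are hypothetical and not part of the API; only the base URL and the parameter names come from this page.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as given in this documentation.
BASE_URL = "https://clinicaltables.nlm.nih.gov/api/alleles/v3/search"


def build_search_url(terms, **params):
    """Return a search URL; `terms` is required, other parameters are optional."""
    query = {"terms": terms}
    query.update(params)  # e.g. ef="VariantID", maxList=10
    return BASE_URL + "?" + urlencode(query)


def fetch(url):
    """Hypothetical helper: perform the request and decode the JSON array."""
    with urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch(url) to actually contact the service.
    print(build_search_url("FAM", ef="VariantID"))
```

Keeping URL construction separate from the network call makes it easy to check the generated query string before sending anything.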
Query String Parameters and Default Values
At a minimum, when using the above base URL, you will need to specify the "terms" parameter containing a word or partial word to match.
| Parameter Name | Default Value | Description |
|---|---|---|
| terms | | (Required.) The search string (e.g., just a part of a word) for which to find matches in the list. More than one partial word can be present in "terms", in which case there is an implicit AND between them. |
| maxList | 7 | Optional, with a default of 7. Specifies the number of results requested, up to the upper limit of 500. If present but the value is empty, 500 will be used. Note that this parameter does not support pagination; see "count" and "offset" below for details on pagination support. |
| count | 7 | The number of results to retrieve (page size). The maximum count allowed is 500. See "offset" below for pagination support. |
| offset | 0 | The starting result number (0-based) to retrieve. Use offset and count together for pagination. Note that the current limit on the total number of results that can be retrieved (offset + count) is 7,500. We reserve the right to decrease or increase this limit based on system capacity and/or other factors. Please see the FAQ page on how to sign up to our email list to be notified of any changes or new features. |
| q | | An optional, additional query string used to further constrain the results returned by the "terms" field. Unlike the terms field, "q" is not automatically wildcarded, but can include wildcards and can specify field names. See the Elasticsearch query string page for documentation of supported syntax. |
| df | AlleleID, Chromosome, GeneSymbol, HGVS_c, AminoAcidChange, Cytogenetic, dbSNP, TypeAbbr | A comma-separated list of display fields (from the fields section below) which are intended for the user to see when looking at the results. The parameter "ef" (see below) may also be used to specify the data fields to retrieve. The main difference is that the value of "df" is always a string (for display), while the value for "ef" could be a JSON object when the field value has a complex structure. |
| sf | All fields | A comma-separated list of fields to be searched. |
| cf | AlleleID | A field to regard as the "code" for the returned item data. |
| ef | | A comma-separated list of additional fields to be returned for each retrieved list item. (See the Output format section for how the data for fields is returned.) If you wish the keys in the returned data hash to be something other than the field names, you can specify an alias for the field name by separating it from its field name with a colon, e.g., "ef=field_name1:alias1,field2,field_name3:alias3", etc. Note that not every field specified in the ef parameter needs to have an alias. The parameter "df" (see above) may also be used to specify the data fields to retrieve. The main difference is that the value of "df" is always a string (for display), while the value for "ef" could be a JSON object when the field value has a complex structure. |
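The pagination limits and the "ef" alias syntax described in the table can be sketched in Python as follows. This is only an illustration under the limits stated above (count of at most 500, offset + count of at most 7,500, both of which may change); the names `page_params` and `ef_value` are hypothetical helpers, not part of the API.

```python
# Limits as currently documented above; the service may change them.
MAX_COUNT = 500
MAX_OFFSET_PLUS_COUNT = 7500


def page_params(total_wanted, page_size=MAX_COUNT):
    """Yield {"offset", "count"} dicts covering up to `total_wanted` results."""
    if not 1 <= page_size <= MAX_COUNT:
        raise ValueError("page_size must be between 1 and %d" % MAX_COUNT)
    limit = min(total_wanted, MAX_OFFSET_PLUS_COUNT)
    offset = 0
    while offset < limit:
        count = min(page_size, limit - offset)
        yield {"offset": offset, "count": count}
        offset += count


def ef_value(fields):
    """Build an "ef" value from field names or (field, alias) pairs."""
    parts = []
    for field in fields:
        if isinstance(field, tuple):
            name, alias = field
            parts.append(name + ":" + alias)  # aliased field
        else:
            parts.append(field)  # field without an alias
    return ",".join(parts)


if __name__ == "__main__":
    for params in page_params(1200):
        print(params)
    print(ef_value([("VariantID", "vid"), "Name"]))
```

Each yielded dictionary can be merged into the query string parameters of one request, so a client never asks for more than the documented limits allow.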
Alleles Field Descriptions
| Field | Field Description |
|---|---|
| AlternateAllele | The value of the AlternateAllele field in the source file. |
| AlternateAllele_lbl | The value of the AlternateAllele field in the source file, but prefixed with "Alt=". |
| AlleleID | The ID of the allele as taken from the AlleleID column of the source file. |
| AminoAcidChange | This is the amino acid change (starting with "p.") parsed from the Name field. |
| Chromosome | The chromosome number, taken from the Chromosome field in the source file, but prefixed with "chr". |
| ChromosomeAccession | The chromosome accession number, taken from the ChromosomeAccession field in the source file. |
| Cytogenetic | The cytogenetic location of the allele, taken from the "Cytogenetic" field in the source file. |
| dbSNP | The "rs" ID number from dbSNP, taken from the "RS# (dbSNP)" field in the source file. |
| GeneID | The gene ID from NCBI's gene database. |
| GeneSymbol | This is the GeneSymbol field listed in the source file. It is the symbol for the gene that overlaps the variant. |
| GenomicLocation | This is an HL7-style concatenation of the Start and Stop fields, i.e., Start^Stop. |
| hgnc_id | A unique ID provided by the HGNC for each gene with an approved symbol. Although standard HGNC IDs are of the format HGNC:n, where n is a number, we have removed the "HGNC:" prefix, so that these values are just numbers. |
| HGVS_c | The "HGVS (c.)" field from the source file. (The "RefSeq cDNA-based HGVS expression".) |
| HGVS_p | The "HGVS (p.)" field from the source file. (The "RefSeq protein-based HGVS expression".) |
| Name | This is the "Name" field (a description of the allele) from the source file. |
| NucleotideChange | This is the nucleotide change (usually starting with "c.") parsed from the Name field. |
| phenotypes | This contains disease name and MedGen code pairs, where the MedGen codes were pulled from the PhenotypeIDs column in the source file. The data is stored as an array of objects, where each object has a "text" property (with the disease name) and a "code" property. |
| phenotype | This contains the shortest disease name from the phenotypes field above. (There are some situations where one is enough.) This field is not searchable. Like the phenotypes field, its value is an object with a "text" property for the disease name and a "code" property. For the "df" (display field) API parameter, the code by itself or the disease name by itself can be requested as phenotype.code or phenotype.text. When just "phenotypes" is specified with "df", the code and disease name values will be returned as a combined string. |
| RefSeqID | This is the RefSeq accession number parsed out of the "Name" field. |
| ReferenceAllele | The value of the ReferenceAllele field in the source file. |
| ReferenceAllele_lbl | The value of the ReferenceAllele field in the source file, but prefixed with "Ref=". |
| Start | The starting position of the allele, taken from the Start field in the source file, but prefixed with "start=". This field is not searchable. |
| Stop | The ending position of the allele, taken from the Stop field in the source file, but prefixed with "stop=". This field is not searchable. |
| Type | The type of the variant, taken from the Type field in the source file. |
| TypeAbbr | An abbreviated version of the Type field. |
| VariantID | The ClinVar variant ID, taken from the VariantID field in the source file. |
Output format
Output for an API query is an array of the following elements:
- The total number of results on the server, which can be more than the number of results returned. The reported total is capped at 10,000, so it may be significantly less than the actual number of results; this cap may significantly improve the service response time.
- An array of codes for the returned items. (This is the field specified with the cf query parameter above.)
- A hash of the "extra" data requested via the "ef" query parameter above. The keys on the hash are the fields (or their requested aliases) named in the "ef" parameter, and the value for a field is an array of that field's values in the same order as the returned codes.
- An array, with one element for each returned code, where each element is an array of the display strings specified with the "df" query parameter.
- An array, with one element for each returned code, where each element is the "code system" for the returned code. Note that only code-system aware APIs will return this array.
Sample API Queries
| Query | Result | Description |
|---|---|---|
| https://clinicaltables.nlm.nih.gov/api/alleles/v3/search?terms=FAM | [154,["180165","15077","15811","15816","16063","16068","16253"],null,[["180165","chr4","FAM175A","NM_139076.2:c.-4T>C","","","rs202166386","SNV"],["15077","chr2","FAM161A","NM_001201543.1:c.1567C>T","p.Arg523Ter","2p15","rs202193201","SNV"],["15811","chr8","FAM83H","NM_198488.3:c.1243G>T","p.Glu415Ter","8q24.3","rs137854437","SNV"],["15816","chr8","FAM83H","NM_198488.3:c.860C>A","p.Ser287Ter","8q24.3","rs137854442","SNV"],["16063","chr7","FAM20C","NM_020223.3:c.1163T>G","p.Leu388Arg","","rs796051849","SNV"],["16068","chr7","FAM20C","NM_020223.3:c.956+5G>C","","","rs796051854","SNV"],["16253","chr7","FAM126A","NM_032581.3:c.51+1G>A","","7p15.3","rs72549405","SNV"]]] | Returns the first 7 matches for terms starting with FAM (which in this case matches only gene names.) |
| https://clinicaltables.nlm.nih.gov/api/alleles/v3/search?terms=FAM&ef=VariantID | [154,["180165","15077","15811","15816","16063","16068","16253"],{"VariantID":["182463","38","772","777","1024","1029","1214"]},[["180165","chr4","FAM175A","NM_139076.2:c.-4T>C","","","rs202166386","SNV"],["15077","chr2","FAM161A","NM_001201543.1:c.1567C>T","p.Arg523Ter","2p15","rs202193201","SNV"],["15811","chr8","FAM83H","NM_198488.3:c.1243G>T","p.Glu415Ter","8q24.3","rs137854437","SNV"],["15816","chr8","FAM83H","NM_198488.3:c.860C>A","p.Ser287Ter","8q24.3","rs137854442","SNV"],["16063","chr7","FAM20C","NM_020223.3:c.1163T>G","p.Leu388Arg","","rs796051849","SNV"],["16068","chr7","FAM20C","NM_020223.3:c.956+5G>C","","","rs796051854","SNV"],["16253","chr7","FAM126A","NM_032581.3:c.51+1G>A","","7p15.3","rs72549405","SNV"]]] | This is the same query as above but with a request to return the VariantID field as an "extra field" (ef). |
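To make the output format concrete, here is a small Python sketch that turns the returned array into one dictionary per item. The sample data is a shortened copy (first two items only) of the second sample result above, and the display field names are the documented default for "df". The function name `parse_response` is a hypothetical helper, not something the API provides.

```python
import json

# Default "df" fields, as listed in the parameter table.
DEFAULT_DF = ["AlleleID", "Chromosome", "GeneSymbol", "HGVS_c",
              "AminoAcidChange", "Cytogenetic", "dbSNP", "TypeAbbr"]

# Shortened copy of the second sample result (first two items only).
SAMPLE = json.loads(
    '[154,["180165","15077"],{"VariantID":["182463","38"]},'
    '[["180165","chr4","FAM175A","NM_139076.2:c.-4T>C","","",'
    '"rs202166386","SNV"],'
    '["15077","chr2","FAM161A","NM_001201543.1:c.1567C>T","p.Arg523Ter",'
    '"2p15","rs202193201","SNV"]]]'
)


def parse_response(payload, display_fields=DEFAULT_DF):
    """Return (total, records) from the array described in "Output format"."""
    total, codes, extra, display = payload[0], payload[1], payload[2], payload[3]
    extra = extra or {}  # null when no "ef" parameter was sent
    records = []
    for i, code in enumerate(codes):
        record = {"code": code}
        record.update(zip(display_fields, display[i]))  # "df" strings
        for key, values in extra.items():  # "ef" values, in code order
            record[key] = values[i]
        records.append(record)
    return total, records


if __name__ == "__main__":
    total, records = parse_response(SAMPLE)
    print(total, records[1]["GeneSymbol"], records[1]["VariantID"])
```

If a request uses a custom "df" list, the same list would need to be passed as `display_fields` so the names line up with the returned display strings.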
Data Construction Details
This section describes in detail the steps we followed in processing the source file, to provide more specifics about the content we are providing and also to allow someone to reproduce our processing. This is not quite an algorithm, but details the changes made to the data.
Highlights (summary)
- Only records which are for assembly GRCh37 and which have a Name field containing an NCBI RefSeq and a DNA HGVS expression are included.
- The Name field is parsed and used to populate new fields AminoAcidChange and NucleotideChange.
- The MedGen codes from the PhenotypeIDs column are used to find disease names via the disease_names API. These are put into code/text JSON pairs, and stored in the phenotypes field.
- The GeneSymbol field is used to look up the HGNC ID via the genes API, and the result is stored in the hgnc_id field.
- Records for the same allele ID are combined into a single record, with the values for each field being combined either with tildes or brackets, depending on the field.
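The last highlight, combining the values of records that share an allele ID, can be sketched in Python as below, following the rule spelled out in the processing details. This is an illustration rather than the actual processing code: the function name `combine_values` is hypothetical, and the use of square brackets is an assumption, since the text says only "brackets".

```python
# Fields whose differing values are joined with a tilde.
TILDE_FIELDS = {"ReferenceAllele", "AlternateAllele"}


def combine_values(field, values):
    """Combine one field's values across the records of a single allele."""
    if len(set(values)) == 1:
        return values[0]  # all records agree: keep the shared value
    if field in TILDE_FIELDS:
        return "~".join(values)  # every value, unique or not
    # Assumption: "brackets" means square brackets.
    return "".join("[" + value + "]" for value in values)


if __name__ == "__main__":
    print(combine_values("ReferenceAllele", ["A", "G", "A"]))
    print(combine_values("GeneSymbol", ["GENE1", "GENE2"]))
```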
Processing details:
- Only include records that have the "Assembly" field set to "GRCh37"
- Only include records whose Name field begins with an NCBI RefSeq accession number, possibly followed by a gene symbol, followed by a DNA HGVS expression, possibly followed by a protein HGVS expression. The regular expression used to check this is:
  /^(N\S_\d+\.?\d+)(\(([A-Z0-9]+)\))?:(\S\.\S+)( \((\S.\S+)\))?/
- Parse the Name field using the above regular expression, and store the protein HGVS piece in a new field, AminoAcidChange, if such a piece exists. Likewise, store the DNA HGVS piece in a new field, NucleotideChange.
- Store the record's "HGVS(c.)" field as "HGVS_c", and the record's "HGVS(p.)" field as "HGVS_p". Sometimes these fields are blank even though the information is present in the Name field, which is why we also parse the Name field.
- If there is a value in the record's Chromosome field, create a GenomicLocation field as the value of the Start field, plus a '^' character, plus the value of the Stop field. Also create a "TypeAbbr" field based on the record's Type field, using the following mapping:
  - copy number gain: gain
  - copy number loss: loss
  - deletion: del
  - duplication: dup
  - fusion: fusion
  - indel: indel
  - insertion: inser
  - inversion: inver
  - NT expansion: NT exp
  - protein only: protein only
  - short repeat: shrt rpt
  - single nucleotide variant: SNV
  - undetermined variant: undet
- If the value in the record's "RS# (dbSNP)" field is not "-1", create a "dbSNP" field equal to "rs" plus that value.
- From the record's PhenotypeIDs column, extract the MedGen codes, and use the disease_names API to look up the codes and get the corresponding names. Create an array of JSON objects for each, where each object has a "code" (the MedGen code) and a "text" (the disease name) property. Store this array as a JSON string in a new "phenotypes" field. Take the object for the shortest disease name and store it in a new "phenotype" field.
- Combine data for records sharing the same AlleleID. If the values are the same across the allele's records, the combined value is simply that shared value. Otherwise, for the ReferenceAllele and AlternateAllele fields, the combined value is whatever values are in the allele's records (unique or not) joined with a tilde. For other fields, if the values are not all the same, then the combined value is created by wrapping each value (unique or not) in brackets and concatenating the result.
- Create a ReferenceAllele_lbl field from ReferenceAllele by prefixing it with "Ref=". Likewise, create an AlternateAllele_lbl field from AlternateAllele by prefixing it with "Alt=".
- Prefix the Start field with "start=", the Stop field with "stop=", and the Chromosome field with "chr".
- Use the GeneSymbol field to find a match in the genes API, and from that find the HGNC ID for the gene symbol. Store that ID in a new field "hgnc_id", handling cases of multiple values (due to multiple genes for the record) with brackets as described above.
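Several of the per-record steps above can be sketched in Python. This is a rough reading of the description, not the code actually used to build the data set. The regular expression is the one quoted above; the function name `derive_fields`, the mapping of capture groups to fields, the use of empty strings for missing values, and the sample record (including its made-up Start and Stop positions) are assumptions for illustration.

```python
import re

# The regular expression quoted in the processing details.
NAME_RE = re.compile(
    r"^(N\S_\d+\.?\d+)(\(([A-Z0-9]+)\))?:(\S\.\S+)( \((\S.\S+)\))?"
)

# Type to TypeAbbr mapping, copied from the list above.
TYPE_ABBR = {
    "copy number gain": "gain", "copy number loss": "loss",
    "deletion": "del", "duplication": "dup", "fusion": "fusion",
    "indel": "indel", "insertion": "inser", "inversion": "inver",
    "NT expansion": "NT exp", "protein only": "protein only",
    "short repeat": "shrt rpt", "single nucleotide variant": "SNV",
    "undetermined variant": "undet",
}


def derive_fields(record):
    """Return derived fields for one source row, or None if it is excluded."""
    match = NAME_RE.match(record["Name"])
    if record.get("Assembly") != "GRCh37" or match is None:
        return None
    derived = {
        "RefSeqID": match.group(1),          # assumed: first capture group
        "NucleotideChange": match.group(4),  # DNA HGVS piece
        "AminoAcidChange": match.group(6) or "",  # protein piece, if any
        "TypeAbbr": TYPE_ABBR.get(record["Type"], ""),
    }
    if record.get("Chromosome"):
        derived["GenomicLocation"] = record["Start"] + "^" + record["Stop"]
    rs_number = record.get("RS# (dbSNP)", "-1")
    if rs_number != "-1":
        derived["dbSNP"] = "rs" + rs_number
    return derived


if __name__ == "__main__":
    # Illustrative record; Start and Stop are made-up positions.
    sample = {
        "Assembly": "GRCh37",
        "Name": "NM_001201543.1(FAM161A):c.1567C>T (p.Arg523Ter)",
        "Type": "single nucleotide variant",
        "Chromosome": "2", "Start": "100", "Stop": "100",
        "RS# (dbSNP)": "202193201",
    }
    print(derive_fields(sample))
```

The later steps (combining records per AlleleID, adding the "Ref=", "Alt=", "start=", "stop=", and "chr" prefixes, and the API lookups for phenotypes and hgnc_id) are left out of this sketch.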