# Data Curation

## ***Is the data in OncoKB™ manually curated?***

Yes. All data in OncoKB is manually curated by members of the Scientific Content Management Team (SCMT) and data curators, both under the leadership of the OncoKB Lead Scientist. Manually curated data includes:

1. Gene assignment as an oncogene and/or tumor suppressor
2. Gene Background
3. Variant Oncogenic Effect
4. Variant Biological Effect
5. Variant Drug Sensitivity and Resistance (utilizing the [OncoKB Levels of Evidence](https://www.oncokb.org/levels))

For information about the primary data sources we use to identify and curate cancer variants and their biological and therapeutic implications, please refer to [Section I.C of the OncoKB Curation Standard Operating Procedure v4](https://sop.oncokb.org/).

## ***Does OncoKB™ use any automated methods to predict variant effect or drug sensitivity?***

All variants in OncoKB are manually researched by a member of the Scientific Content Management Team (SCMT) to determine their oncogenic and biological effects, as well as drug sensitivity (if any). The only automated prediction method that OncoKB uses is the assignment of an oncogenic effect to variants listed on [cancerhotspots.org](https://www.cancerhotspots.org/#/home). Even so, each variant identified as a hotspot on this website is researched and reviewed by an SCMT member. Per [Chapter 1: Sub-Protocol 2.5: Assertion of the oncogenic effect of a VPS of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org), hotspot variants with supporting scientific literature are classified as “Oncogenic”, while variants considered hotspots based purely on statistical recurrence per [Chang et al., 2018](https://pubmed.ncbi.nlm.nih.gov/29247016/) are classified as “Likely Oncogenic”. The Cancer Hotspots website maintains a static list of variants based on the 2018 publication, and OncoKB’s curation of cancer hotspots is based on this list.

## ***Who curates OncoKB™ data and what is their educational background?***

The OncoKB staff consists of the following:

* The OncoKB Lead Scientist (Ph.D)
* The Lead Scientist, Knowledge Systems (Ph.D)
* The Scientific Content Management Team (SCMT), which includes two senior scientists (Ph.Ds) and three scientific writers/editors (ranging from Bachelor's-level to Ph.D-level scientists)
* The Lead Software Engineer (MS)
* Software Engineers
* The Data and Software Liaison (MS)

For more information about the OncoKB Staff, please refer to [Section I.B of the OncoKB Curation Standard Operating Procedure v4](https://sop.oncokb.org/).

## ***What is the source of the variant annotations (e.g. in-house database, automatic aggregation of public domain databases; which databases are included?)***

Four primary data sources are used to identify and curate cancer variants and their biological and therapeutic implications:

* Public cancer variant databases of alterations identified in tumor sequencing studies, e.g., [cBioPortal](https://www.cbioportal.org/)
* Statistically significant and recurrent variants identified based on 24,592 sequenced tumors using methods described in [Chang et al., 2018](https://pubmed.ncbi.nlm.nih.gov/29247016/)
* Disease-specific treatment guidelines such as those provided by the National Comprehensive Cancer Network (NCCN) and proceedings of major scientific and/or clinical conferences such as those of the American Society of Clinical Oncology (ASCO) and the American Association for Cancer Research (AACR)
* General scientific literature, accessed through PubMed

For more information about the external databases we use as references for curation, please refer to [Section I.C of the OncoKB Curation Standard Operating Procedure v4](https://sop.oncokb.org/).

## ***What are the criteria for defining a gene as an oncogene or tumor suppressor?***

Please refer to [*Chapter 1: Table 1.3: Assertion of the function of a cancer gene* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org) for a detailed protocol on the criteria we use to categorize a gene as an oncogene and/or tumor suppressor.

## ***What are the criteria for defining the oncogenic effect (oncogenic, likely oncogenic, likely neutral, inconclusive) of a variant?***

Please refer to [*Chapter 1: Sub-Protocol 2.5: Assertion of the oncogenic effect of a VPS* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org) for a detailed protocol on the criteria we use to define the oncogenic effect of a variant.

## ***What are the criteria for defining the biological effect (gain/loss/switch of function, likely gain/loss/switch of function, neutral, inconclusive) of a variant?***

Please refer to [*Chapter 1: Sub-Protocol 2.4: Assertion of the biological effect of a VPS* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org) for a detailed protocol on the criteria we use to define the biological effect of a variant.

## ***Where can I find the Gene ID / RefSeq for all genes in OncoKB™?***

The OncoKB [*Cancer Genes Page*](https://www.oncokb.org/cancerGenes) contains a downloadable file (*Cancer Gene List*) that includes the Entrez Gene ID and RefSeq for all genes included in OncoKB. Additionally, at the top of every *Gene Page*, the Entrez Gene ID and RefSeq for that gene are displayed.

## ***What are the rules for mutation syntax in OncoKB™?***

OncoKB uses standardized syntax for naming different mutation types, including missense mutations (mis), duplications (dup), and deletions (del). For more information about OncoKB mutation syntax, please refer to [*Chapter 6: Table 3.1: OncoKB alteration nomenclature, style and formatting* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org/).

## ***What cancer type ontology is used in OncoKB™?***

We use [OncoTree](http://oncotree.mskcc.org) as our ontology. OncoTree provides mappings to the NCI Thesaurus and [UMLS](https://www.nlm.nih.gov/research/umls/index.html). UMLS includes [SNOMED CT](https://www.nlm.nih.gov/healthit/snomedct/us_edition.html) as one of its sources. For details, please refer to [*Chapter 1: Protocol 3: Tumor type assignment* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org/).

## ***Do “delins” alterations map to deletions or insertions?***

“delins” alterations are in-frame alterations that map to either “insertions” or “deletions” based on the number of amino acids changed. For example, V600\_K601delinsE (two residues replaced by one) would be interpreted as an in-frame deletion, while R435\_K436delinsKKR (two residues replaced by three) would be interpreted as an in-frame insertion.
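
The two examples above are consistent with a simple rule: compare the number of residues removed with the number inserted. The sketch below illustrates that reading in Python. It is not OncoKB's implementation; the function name `classify_delins`, the one-letter amino acid assumption, and the handling of the equal-length case (which this FAQ does not address) are all assumptions made for illustration.

```python
import re

# Assumes one-letter amino acid codes, e.g. "V600_K601delinsE".
DELINS_PATTERN = re.compile(
    r"^(?P<start_aa>[A-Z])(?P<start>\d+)"
    r"(?:_(?P<end_aa>[A-Z])(?P<end>\d+))?"
    r"delins(?P<inserted>[A-Z*]+)$"
)


def classify_delins(alteration: str) -> str:
    """Classify a protein-level delins by net residue change (illustrative only)."""
    match = DELINS_PATTERN.match(alteration)
    if match is None:
        raise ValueError(f"not a recognised delins alteration: {alteration!r}")

    start = int(match["start"])
    # A single-residue delins (e.g. "L747delinsAP") has no end position.
    end = int(match["end"]) if match["end"] else start
    if end < start:
        raise ValueError(f"end position precedes start position: {alteration!r}")

    removed = end - start + 1
    added = len(match["inserted"])

    if added < removed:
        return "inframe deletion"
    if added > removed:
        return "inframe insertion"
    # Equal counts are not covered by the FAQ text; this label is an assumption.
    return "undetermined"


print(classify_delins("V600_K601delinsE"))    # inframe deletion
print(classify_delins("R435_K436delinsKKR"))  # inframe insertion
```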

## ***Some genes have an alteration called “Oncogenic Mutations.” What does this mean?***

“Oncogenic Mutations” is used when there is tumor-specific information (e.g., a therapeutic implication) that applies to ALL functional (oncogenic/likely oncogenic) alterations of a gene. Please note that if a gene has “Amplification” curated as “Oncogenic” or “Likely Oncogenic”, this alteration will NOT be associated with the tumor-type specific information under “Oncogenic Mutations.”

## ***How does OncoKB™ handle atypical variants such as EGFR vIII?***

Alterations that do not follow the pre-specified OncoKB nomenclature are curated and hard-coded in the system so that their annotations are retrieved correctly. Please refer to [*Chapter 6: Table 3.1: OncoKB alteration nomenclature, style and formatting* of the OncoKB Standard Operating Procedure v4](https://sop.oncokb.org/). Examples of such alterations include:

* FLT3: internal tandem duplication
* EGFR: vIII
* EGFR: Kinase domain duplication
* EGFR: C-terminal domain

## ***What does “Switch-of-function” mean?***

Mutations that are classified as “switch-of-function” have evidence-based data demonstrating that the alteration causes the protein to acquire a new function, such as the neomorphic ability of the IDH1 R132H-mutant protein to produce D-2-hydroxyglutarate.

## ***How can I distinguish between a VUS and a variant that was not reviewed by the OncoKB™ team?***

A VUS is a variant that has been investigated by a member of the OncoKB team and for which no known data was identified. These variants will be represented with a hollow, dark grey circular icon and will include the sentence: “As of \[date], there was no available functional data about the \[variant] mutation”. Variants that have not been investigated by the OncoKB team will be represented by a hollow, light grey circular icon and will include the sentence: “The \[variant] has not specifically been reviewed by the OncoKB team, and its oncogenic function is considered unknown.”

## ***Mutations in the RAS genes (HRAS, KRAS, NRAS) are curated as “gain-of-function” when other knowledgebases classify them as “loss-of-function”. Why?***

Most oncogenic RAS mutations do indeed cause the RAS proteins to lose their GTPase catalytic activity. However, these mutations lock HRAS, KRAS and NRAS in their constitutively active form, which is associated with increased downstream pathway activity. For this reason, they are classified in OncoKB as gain-of-function mutations.

## ***Do all therapeutics propagate as Level 3B in other indications?***

No. While most associations in a specified cancer type that are OncoKB Levels 1, 2 or 3A will propagate as Level 3B in other cancer types, there are several exceptions to this rule. These include but are not limited to:

1. Therapeutic levels 1, 2 or 3A and associated drugs in solid cancer types do not propagate to hematologic cancers, or vice versa.
2. Levels for resistance (R1 and R2) and associated drugs do not propagate to other cancer types.
3. There are several therapeutic regimens in our system that are tissue-specific and would not be an appropriate recommendation in other cancer types (e.g. Selumetinib + Iodine I 131-6-Beta-Iodomethyl-19-Norcholesterol in NRAS mutant thyroid cancer).
4. Level 4 alterations do not propagate to other indications.
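
As a rough illustration, the four listed exceptions can be written as a single check. The sketch below is a simplification under stated assumptions, not OncoKB's actual propagation logic: the `Association` class, its field names, and the `tissue_specific` flag are hypothetical, and because the list above is explicitly not exhaustive, a `True` result here does not guarantee propagation in OncoKB.

```python
from dataclasses import dataclass

# Levels that can propagate as Level 3B, per the rule stated above.
PROPAGATING_LEVELS = {"1", "2", "3A"}


@dataclass(frozen=True)
class Association:
    """Hypothetical record for a curated therapeutic association."""
    level: str                     # e.g. "1", "2", "3A", "4", "R1", "R2"
    is_hematologic: bool           # True if curated in a hematologic cancer type
    tissue_specific: bool = False  # True for regimens tied to one tissue


def may_propagate_as_3b(assoc: Association, target_is_hematologic: bool) -> bool:
    """Apply only the four exceptions listed in this FAQ (illustrative)."""
    # Exceptions 2 and 4: R1, R2 and Level 4 never propagate.
    if assoc.level not in PROPAGATING_LEVELS:
        return False
    # Exception 1: no propagation between solid and hematologic cancers.
    if assoc.is_hematologic != target_is_hematologic:
        return False
    # Exception 3: tissue-specific regimens stay in their own cancer type.
    if assoc.tissue_specific:
        return False
    return True


solid_level_1 = Association(level="1", is_hematologic=False)
print(may_propagate_as_3b(solid_level_1, target_is_hematologic=False))  # True
print(may_propagate_as_3b(solid_level_1, target_is_hematologic=True))   # False
```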

## ***ClinVar calls a specific variant benign, but you call it likely oncogenic. Why?***

Beginning with our December 2025 release (OncoKB™ v6.0), OncoKB™ includes both germline and somatic variant annotations (Note: germline annotations were initially available only on [OncoKB.org](http://oncokb.org); as of the April 2026 data release, they are also available through the OncoKB™ API). For germline content, OncoKB™ incorporates only pathogenic or likely pathogenic variants identified in patients sequenced at MSK and interpreted by MSK Diagnostic Molecular Genetics according to the Standards and Guidelines for the interpretation of sequence variants, a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology ([Richards et al. Genetics in Medicine, 2015](https://www.acmg.net/docs/standards_guidelines_for_the_interpretation_of_sequence_variants.pdf)).

ClinVar also evaluates variants in the germline setting and assigns variants one of the following pathogenicity categories: pathogenic, likely pathogenic, benign, likely benign, and VUS. Variants that ClinVar classifies as benign or likely benign will therefore not appear in the germline section of OncoKB™.

Importantly, OncoKB™ annotates the oncogenic effect of somatic variants in cancer by assigning one of the following oncogenicity categories: oncogenic, likely oncogenic, likely neutral or neutral. The oncogenic effect assigned to a somatic variant cannot be directly compared to the pathogenic classification assigned to the same variant in the germline setting. A variant that does not definitively predispose to inherited cancer in the germline setting (and is therefore classified as benign, likely benign, inconclusive, or a VUS in ClinVar) may still function as an oncogenic driver when acquired somatically.

## ***Why is this specific hotspot a VUS but called likely oncogenic?***

Mutations that occur at hotspots as per [Chang et al.](https://aacrjournals.org/cancerdiscovery/article/8/2/174/6249/Accelerating-Discovery-of-Functional-Mutant) are considered likely oncogenic based on the statistical significance of their recurrence in cancer. A variant may therefore lack functional characterization, making it a variant of unknown significance (VUS), and still be annotated as likely oncogenic because it occurs at a statistically significant hotspot. Functional characterization always supersedes this designation, however, so some hotspot mutations may be called likely neutral or inconclusive when functional characterization demonstrates that they are not oncogenic in vitro or in vivo.
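
The precedence described above (functional characterization first, hotspot status second) can be sketched as follows. Curation in OncoKB is manual, so this is only an illustration of the ordering, not a description of any OncoKB code. The function name, its parameters, and the use of `None` for "no call made by this sketch" are assumptions.

```python
from typing import Optional


def oncogenic_effect_sketch(
    is_hotspot: bool,
    functional_call: Optional[str] = None,
) -> Optional[str]:
    """Illustrate the precedence of functional data over hotspot status.

    functional_call is the effect supported by functional characterization
    (e.g. "Oncogenic", "Likely Neutral", "Inconclusive"), or None if no
    functional data exist for the variant.
    """
    # Functional characterization always supersedes the hotspot designation.
    if functional_call is not None:
        return functional_call
    # No functional data: statistical recurrence alone yields "Likely Oncogenic".
    if is_hotspot:
        return "Likely Oncogenic"
    # Neither functional data nor hotspot status: this sketch makes no call.
    return None


print(oncogenic_effect_sketch(is_hotspot=True))                                    # Likely Oncogenic
print(oncogenic_effect_sketch(is_hotspot=True, functional_call="Likely Neutral"))  # Likely Neutral
```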

## ***Where can I access germline data in OncoKB™?***

As of the April 2026 data release, germline data is available on our website, [oncokb.org](http://oncokb.org), as well as through the API.

## ***How does OncoKB™ annotate variants - at the DNA level, the protein level, or both?***

Somatic variants in OncoKB™ are predominantly annotated at the protein level, since most clinically relevant somatic alterations result in a protein change. A key exception is TERT promoter mutations, which are annotated at the DNA level, using both genomic (g.) and complementary DNA (c.) notation, because these alterations occur outside the gene’s protein-coding region.

Germline variants in OncoKB™ are annotated primarily at the cDNA level (c.). Multiple distinct DNA changes can produce the same protein change, yet represent different variants with different interpretations; therefore, cDNA-level annotation is required for accurate biological and clinical interpretation. Many germline variants also occur outside protein-coding regions (e.g. intronic regions), and therefore do not result in any protein-level alteration. When a germline variant does result in a protein alteration, OncoKB™ provides both the DNA-level (c.) and protein-level (p.) nomenclature.

## ***Does OncoKB™ curate both standard care and investigational germline therapeutic biomarkers?***

Currently, OncoKB™ curates only standard care germline therapeutic biomarkers (Levels 1 and 2). We may expand in the future to include investigational germline biomarkers (Levels 3A and 4).

## ***Do germline therapeutic implications propagate to other tumor types (e.g., as Level 3B)?***

No.

Because germline variants are present in all cells of the body (not just tumor cells), their presence in a given tumor type does not necessarily indicate that the tumor is biologically driven by that variant. As a result, germline variants are more likely to represent passenger alterations outside of their established disease context.

Therefore, germline therapeutic implications in OncoKB™ are restricted to tumor types with direct clinical evidence and are not propagated to other tumor types (e.g., as Level 3B).

## ***Do germline biomarkers have an FDA level of evidence?***

No.

Germline alterations and their therapeutic implications are outside the scope of the OncoKB™ partial FDA recognition and are not assigned an FDA level of evidence.

## ***Are germline therapeutic associations curated using the same methods as somatic associations?***

Germline therapeutic associations are curated using a dedicated section of the OncoKB™ Standard Operating Procedure (SOP) (Part II, Chapter 9, Protocols 3–6), separate from somatic curation.

While the overall review and approval process remains consistent, all germline therapeutic associations are evaluated by a germline-focused arm of the Clinical Genomics Annotation Committee (CGAC) with disease-specific expertise in inherited cancer predisposition and management.

## ***Can germline protein changes be annotated through the API?***

As of April 2026, the OncoKB™ API supports annotation of germline variants provided in HGVSg (gDNA) or HGVSc (cDNA) format.

Protein-level (HGVSp) queries are not currently supported. This is because multiple distinct DNA alterations can result in the same protein change, and OncoKB™ curates these DNA-level variants separately. As a result, a protein-level query would not uniquely map to a single OncoKB™ entry.
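
A client could check the coordinate type of a germline query before sending it. The sketch below does only that: it reads the HGVS coordinate prefix (`g.`, `c.` or `p.`) and rejects protein-level input. It makes no calls to the OncoKB™ API and assumes nothing about its endpoints; the function name and the example variant strings are hypothetical.

```python
import re

# Matches the HGVS coordinate-type prefix, either at the start of the string
# or directly after a reference sequence and colon (e.g. "NM_000000.0:c.").
_COORDINATE_TYPE = re.compile(r"(?:^|:)([gcp])\.")


def germline_query_format(query: str) -> str:
    """Return "HGVSg" or "HGVSc"; raise ValueError for anything else."""
    match = _COORDINATE_TYPE.search(query.strip())
    if match is None:
        raise ValueError(f"no g., c. or p. coordinate prefix found: {query!r}")

    kind = match.group(1)
    if kind == "p":
        # Several DNA changes can yield the same protein change, so a
        # protein-level query would not map to a single curated entry.
        raise ValueError("protein-level (HGVSp) germline queries are not supported")
    return {"g": "HGVSg", "c": "HGVSc"}[kind]


# Hypothetical example strings, used only to exercise the prefix check.
print(germline_query_format("NM_000000.0:c.100A>G"))  # HGVSc
print(germline_query_format("1:g.100000A>G"))         # HGVSg
```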


