Title: | Calculate Phenotype Risk Scores |
---|---|
Description: | Use phenotype risk scores based on linked clinical and genetic data to study Mendelian disease and rare genetic variants. See Bastarache et al. 2018 <doi:10.1126/science.aal4043>. |
Authors: | Jake Hughey [aut, cre], Layla Aref [aut] |
Maintainer: | Jake Hughey <[email protected]> |
License: | GPL-2 |
Version: | 1.0.2 |
Built: | 2025-01-30 03:52:25 UTC |
Source: | https://github.com/hugheylab/phers |
The data are artificial and do not correspond to real patients.
demoSample
demoSample
A data table with the following columns:
person_id
: Character vector of the identifier for each person in the
cohort
sex
: Character vector indicating biological sex
This table provides a mapping between 27 Mendelian diseases and the corresponding ICD-9 and ICD-10 codes that indicate a genetic diagnosis.
diseaseDxIcdMap
diseaseDxIcdMap
A data.table with the following columns:
disease_id
: Numeric vector of OMIM disease identifiers
disease_name
: Character vector of disease names
icd
: Character vector of ICD codes indicating a genetic diagnosis
flag
: Integer vector of the vocabulary of the ICD code
(9: ICD-9-CM, 10: ICD-10-CM)
icd_name
: Character vector containing the description of each ICD code
getPhecodeOccurrences()
, getDxStatus()
This table provides a mapping between Mendelian diseases and their clinical features, represented as Human Phenotype Ontology (HPO) terms. The mapping is based on annotations from Online Mendelian Inheritance in Man (OMIM).
diseaseHpoMap
diseaseHpoMap
A data.table with the following columns:
disease_id
: Numeric vector of OMIM disease identifiers
disease_name
: Character vector of disease names
hpo_term_id
: Character vector of HPO identifiers corresponding to each
disease's clinical features
hpo_term_name
: Character vector of HPO term descriptions
https://hpo.jax.org/app/download/annotation
This function is useful for verifying that raw or residual phenotype risk scores of diagnosed individuals (cases) tend to be higher than scores of undiagnosed individuals (controls).
getDxStatus( demos, icdOccurrences, minUniqueAges = 2L, diseaseDxIcdMap = phers::diseaseDxIcdMap )
getDxStatus( demos, icdOccurrences, minUniqueAges = 2L, diseaseDxIcdMap = phers::diseaseDxIcdMap )
demos |
A data.table having one row per person in the cohort. Must have
a column |
icdOccurrences |
A data.table of occurrences of ICD codes for each
person in the cohort. Must have columns |
minUniqueAges |
Integer indicating the minimum number of unique
ICD code entry ages required to classify a person as a case. Persons with
at least one, but fewer than |
diseaseDxIcdMap |
A data.table of the mapping between diseases and
the corresponding ICD codes that indicate a diagnosis. Must have columns
|
A data.table with columns person_id
, disease_id
, and dx_status
(1 indicates a case, 0 indicates a control, -1 indicates neither).
library('data.table') icdSample1 = merge(icdSample, demoSample[, .(person_id, dob)], by = 'person_id') icdSample1[, occurrence_age := as.numeric((entry_date - dob)/365.25)] icdSample1[, `:=`(entry_date = NULL, dob = NULL)] dxStatus = getDxStatus(demoSample, icdSample1)
library('data.table') icdSample1 = merge(icdSample, demoSample[, .(person_id, dob)], by = 'person_id') icdSample1[, occurrence_age := as.numeric((entry_date - dob)/365.25)] icdSample1[, `:=`(entry_date = NULL, dob = NULL)] dxStatus = getDxStatus(demoSample, icdSample1)
The association test for each disease-variant pair is based on a linear model, with the phenotype risk score as the dependent variable.
getGeneticAssociations( scores, genotypes, demos, diseaseVariantMap, lmFormula, modelType = c("genotypic", "additive", "dominant", "recessive"), level = 0.95, dopar = FALSE )
getGeneticAssociations( scores, genotypes, demos, diseaseVariantMap, lmFormula, modelType = c("genotypic", "additive", "dominant", "recessive"), level = 0.95, dopar = FALSE )
scores |
A data.table of phenotype risk scores. Must have columns
|
genotypes |
A matrix or 'BEDMatrix' object containing genetic data, with
rownames corresponding to |
demos |
A data.table of characteristics for each person in the cohort.
Must have column |
diseaseVariantMap |
A data.table indicating which genetic variants to
test for association with phenotype risk scores for which diseases. Must
have columns |
lmFormula |
A formula representing the linear model (excluding the term
for genotype) to use for the association tests. All terms in the formula
must correspond to columns in |
modelType |
A string indicating how to encode genotype in the model. |
level |
A number indicating the level of the confidence interval. Default is 0.95. |
dopar |
Logical indicating whether to run calculations in parallel if
a parallel backend is already set up, e.g., using
|
A data.table of statistics for the association tests (if a model fails to converge, NAs will be reported):
disease_id
: Disease identifier
variant_id
: Variant identifier
n_total
: Number of persons with non-missing genotype data for the
given variant.
n_wt
: Number of persons homozygous for the wild-type allele.
n_het
: Number of persons having one copy of the alternate allele.
n_hom
: Number of persons homozygous for the alternate allele.
beta
: Coefficient for the association of genotype with score
se
: Standard error for beta
pval
: P-value for beta
being non-zero
ci_lower
: Lower bound of the confidence interval for beta
ci_upper
: Upper bound of the confidence interval for beta
If modelType
is "genotypic", the data.table will include separate
statistics for heterozygous and homozygous genotypes.
stats::lm()
, stats::confint()
, getScores()
library('data.table') library('BEDMatrix') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # map diseases to genetic variants nvar = 10 diseaseVariantMap = data.table(disease_id = diseaseId, variant_id = paste0('snp', 1:nvar)) # load sample genetic data npop = 50 genoSample = BEDMatrix(system.file('extdata', 'geno_sample.bed', package = 'phers')) colnames(genoSample) = paste0('snp', 1:nvar) rownames(genoSample) = 1:npop # run genetic association tests genoStats = getGeneticAssociations( scores, genoSample, demoSample, diseaseVariantMap, lmFormula = ~ sex, modelType = 'additive')
library('data.table') library('BEDMatrix') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # map diseases to genetic variants nvar = 10 diseaseVariantMap = data.table(disease_id = diseaseId, variant_id = paste0('snp', 1:nvar)) # load sample genetic data npop = 50 genoSample = BEDMatrix(system.file('extdata', 'geno_sample.bed', package = 'phers')) colnames(genoSample) = paste0('snp', 1:nvar) rownames(genoSample) = 1:npop # run genetic association tests genoStats = getGeneticAssociations( scores, genoSample, demoSample, diseaseVariantMap, lmFormula = ~ sex, modelType = 'additive')
This is typically the first step of an analysis using phenotype risk scores,
the next is getWeights()
.
getPhecodeOccurrences( icdOccurrences, icdPhecodeMap = phers::icdPhecodeMap, dxIcd = phers::diseaseDxIcdMap )
getPhecodeOccurrences( icdOccurrences, icdPhecodeMap = phers::icdPhecodeMap, dxIcd = phers::diseaseDxIcdMap )
icdOccurrences |
A data.table of occurrences of ICD codes for each
person in the cohort. Must have columns |
icdPhecodeMap |
A data.table of the mapping between ICD codes and
phecodes. Must have columns |
dxIcd |
A data.table of ICD codes to exclude from mapping to phecodes.
Must have columns |
A data.table of phecode occurrences for each person.
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
The residual score indicates to what extent a person's phenotype risk score for a given disease deviates from the expected score, after adjusting for the person's characteristics in a linear model.
getResidualScores(demos, scores, lmFormula)
getResidualScores(demos, scores, lmFormula)
demos |
A data.table of characteristics for each person in the cohort.
Must have column |
scores |
A data.table containing the phenotype risk score for each
person for each disease. Must have columns |
lmFormula |
A formula representing the linear model to use for
calculating residual scores. All terms in the formula must correspond to
columns in |
A data.table, based on scores
, with an additional column
resid_score
. Residual scores for each disease are standardized to have
unit variance.
stats::rstandard()
, getScores()
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
A person's phenotype risk score for a given disease corresponds to the sum of the weights of the disease-relevant phecodes that the person has received.
getScores(weights, diseasePhecodeMap)
getScores(weights, diseasePhecodeMap)
weights |
A data.table of phecodes and their corresponding weights.
Must have columns |
diseasePhecodeMap |
A data.table of the mapping between diseases and
phecodes. Must have columns |
A data.table containing the phenotype risk score for each person for each disease.
mapDiseaseToPhecode()
, getPhecodeOccurrences()
, getWeights()
,
getResidualScores()
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
library('data.table') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights (using the prevalence method) weights = getWeights(demoSample, phecodeOccurrences) # OMIM disease IDs for which to calculate phenotype risk scores diseaseId = 154700 # map diseases to phecodes diseasePhecodeMap = mapDiseaseToPhecode() # calculate scores scores = getScores(weights, diseasePhecodeMap[disease_id == diseaseId]) # calculate residual scores rscores = getResidualScores(demoSample, scores, lmFormula = ~ sex)
This is typically the second step of an analysis using phenotype risk scores,
the next is getScores()
.
getWeights( demos, phecodeOccurrences, method = c("prevalence", "logistic", "cox", "loglinear", "prevalence_precalc"), methodFormula = NULL, negativeWeights = FALSE, dopar = FALSE )
getWeights( demos, phecodeOccurrences, method = c("prevalence", "logistic", "cox", "loglinear", "prevalence_precalc"), methodFormula = NULL, negativeWeights = FALSE, dopar = FALSE )
demos |
A data.table having one row per person in the cohort. Must have
a column |
phecodeOccurrences |
A data.table of phecode occurrences for each person
in the cohort. Must have columns |
method |
A string indicating the statistical model for calculating weights. |
methodFormula |
A formula representing the right-hand side of the model
corresponding to |
negativeWeights |
Logical indicating whether to allow negative weights for individuals with no occurrences of a phecode. This option is not required for the "loglinear" method since under this method, individuals with a nonzero phecode occurrence can also have negative weights. |
dopar |
Logical indicating whether to run calculations in parallel if
a parallel backend is already set up, e.g., using
|
A data.table with columns person_id
, phecode
, pred
, and w
.
The column pred
represents a different quantity depending on method
.
Under the "prevalence" method
, it is fraction of the cohort that has
at least one occurrence of the given phecode. The "prevalence_precalc"
method
is similar to the "prevalence" method
but pred
is calculated
based on EHR data from the Vanderbilt University Medical Center.
Under "logistic" or "cox" method
, it is the predicted probability of
given individual having a given phecode based on methodFormula
.
Under the "loglinear" method
, it is the predicted
log2(num_occurrences + 1)
of a given phecode for a given individual
based on methodFormula
. For the "prevalence", "prevalence_precalc",
"cox", and "logistic" method
s, weight is calculated as -log10(pred)
when an individual has non-zero phecode occurrence and log10(1 - pred)
when an individual has zero phecode occurrence. For the "loglinear" method
weight is calculated as the difference between the observed
log2(num_occurrences + 1)
and pred
.
getPhecodeOccurrences()
, getScores()
library('data.table') library('survival') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights using the prevalence method weightsPrev = getWeights(demoSample, phecodeOccurrences) # calculate weights using the prevalence method # (assign negative weights to those with zero phecode occurrence) weightsPrevNeg = getWeights( demoSample, phecodeOccurrences, negativeWeights = TRUE) # calculate weights using the logistic method weightsLogistic = getWeights( demoSample, phecodeOccurrences, method = 'logistic', methodFormula = ~ sex) # calculate weights using the loglinear method phecodeOccurrences2 = phecodeOccurrences[, .( num_occurrences = uniqueN(entry_date)), by = .(person_id, phecode)] weightsLoglinear = getWeights( demoSample, phecodeOccurrences2, method = 'loglinear', methodFormula = ~ sex) # calculate weights using the cox method phecodeOccurrences3 = phecodeOccurrences[, .( first_occurrence_date = min(entry_date)) , by = .(person_id, phecode)] phecodeOccurrences3 = merge( phecodeOccurrences3, demoSample[, .(person_id, dob)], by = 'person_id') phecodeOccurrences3[, occurrence_age := as.numeric((first_occurrence_date - dob)/365.25)] phecodeOccurrences3[, `:=`(first_occurrence_date = NULL, dob = NULL)] demoSample3 = demoSample[, .( person_id, sex, first_age = as.numeric((first_visit_date - dob)/365.25), last_age = as.numeric((last_visit_date - dob)/365.25))] weightsCox = getWeights( demoSample3, phecodeOccurrences3, method = 'cox', methodFormula = ~ sex) # calculate weights using pre-calculated weights based on data from # Vanderbilt University Medical Center weightsPreCalc = getWeights( demoSample, phecodeOccurrences, method = 'prevalence_precalc')
library('data.table') library('survival') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights using the prevalence method weightsPrev = getWeights(demoSample, phecodeOccurrences) # calculate weights using the prevalence method # (assign negative weights to those with zero phecode occurrence) weightsPrevNeg = getWeights( demoSample, phecodeOccurrences, negativeWeights = TRUE) # calculate weights using the logistic method weightsLogistic = getWeights( demoSample, phecodeOccurrences, method = 'logistic', methodFormula = ~ sex) # calculate weights using the loglinear method phecodeOccurrences2 = phecodeOccurrences[, .( num_occurrences = uniqueN(entry_date)), by = .(person_id, phecode)] weightsLoglinear = getWeights( demoSample, phecodeOccurrences2, method = 'loglinear', methodFormula = ~ sex) # calculate weights using the cox method phecodeOccurrences3 = phecodeOccurrences[, .( first_occurrence_date = min(entry_date)) , by = .(person_id, phecode)] phecodeOccurrences3 = merge( phecodeOccurrences3, demoSample[, .(person_id, dob)], by = 'person_id') phecodeOccurrences3[, occurrence_age := as.numeric((first_occurrence_date - dob)/365.25)] phecodeOccurrences3[, `:=`(first_occurrence_date = NULL, dob = NULL)] demoSample3 = demoSample[, .( person_id, sex, first_age = as.numeric((first_visit_date - dob)/365.25), last_age = as.numeric((last_visit_date - dob)/365.25))] weightsCox = getWeights( demoSample3, phecodeOccurrences3, method = 'cox', methodFormula = ~ sex) # calculate weights using pre-calculated weights based on data from # Vanderbilt University Medical Center weightsPreCalc = getWeights( demoSample, phecodeOccurrences, method = 'prevalence_precalc')
This table provides a mapping between Human Phenotype Ontology (HPO) terms and phecodes, and is useful for using phecodes to represent the clinical features of Mendelian diseases (version 1.2).
hpoPhecodeMap
hpoPhecodeMap
A data.table with the following columns:
hpo_term_id
: Character vector of HPO term identifiers
hpo_term_name
: Character vector of HPO term descriptions
phecode
: Character vector of phecodes
phecode_name
: Character vector of phecode descriptions
This table provides a mapping between International Classification of Diseases 9th and 10th revisions (ICD-9-CM and ICD-10-CM) and phecodes (version 1.2).
icdPhecodeMap
icdPhecodeMap
A data.table with the following columns:
icd
: Character vector of ICD codes
flag
: Integer vector of the vocabulary of the ICD code
(9: ICD-9-CM, 10: ICD-10-CM)
icd_name
: Character vector of ICD code descriptions
phecode
: Character vector of phecodes
phecode_name
: Character vector of phecode descriptions
https://phewascatalog.org/phecodes
The data are artificial and do not correspond to real patients.
icdSample
icdSample
A data.table with the following columns:
person_id
: Character vector of the identifier for each person
icd
: Character vector of the ICD codes recorded for each person
flag
: Integer vector of the vocabulary of the ICD code
(9: ICD-9-CM, 10: ICD-10-CM)
entry_date
: Vector of type Date
indicating the date each ICD code was
recorded.
getPhecodeOccurrences()
, getWeights()
, getScores()
A mapping of diseases to their clinical features, represented as phecodes, is required for calculating phenotype risk scores.
mapDiseaseToPhecode( diseaseHpoMap = phers::diseaseHpoMap, hpoPhecodeMap = phers::hpoPhecodeMap )
mapDiseaseToPhecode( diseaseHpoMap = phers::diseaseHpoMap, hpoPhecodeMap = phers::hpoPhecodeMap )
diseaseHpoMap |
A data.table containing the mapping between diseases and
HPO terms. Must have columns |
hpoPhecodeMap |
A data.table containing the mapping between HPO terms
and phecodes. Must have columns |
A data.table with columns disease_id
and phecode
.
library('data.table') library('survival') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights using the prevalence method weightsPrev = getWeights(demoSample, phecodeOccurrences) # calculate weights using the prevalence method # (assign negative weights to those with zero phecode occurrence) weightsPrevNeg = getWeights( demoSample, phecodeOccurrences, negativeWeights = TRUE) # calculate weights using the logistic method weightsLogistic = getWeights( demoSample, phecodeOccurrences, method = 'logistic', methodFormula = ~ sex) # calculate weights using the loglinear method phecodeOccurrences2 = phecodeOccurrences[, .( num_occurrences = uniqueN(entry_date)), by = .(person_id, phecode)] weightsLoglinear = getWeights( demoSample, phecodeOccurrences2, method = 'loglinear', methodFormula = ~ sex) # calculate weights using the cox method phecodeOccurrences3 = phecodeOccurrences[, .( first_occurrence_date = min(entry_date)) , by = .(person_id, phecode)] phecodeOccurrences3 = merge( phecodeOccurrences3, demoSample[, .(person_id, dob)], by = 'person_id') phecodeOccurrences3[, occurrence_age := as.numeric((first_occurrence_date - dob)/365.25)] phecodeOccurrences3[, `:=`(first_occurrence_date = NULL, dob = NULL)] demoSample3 = demoSample[, .( person_id, sex, first_age = as.numeric((first_visit_date - dob)/365.25), last_age = as.numeric((last_visit_date - dob)/365.25))] weightsCox = getWeights( demoSample3, phecodeOccurrences3, method = 'cox', methodFormula = ~ sex) # calculate weights using pre-calculated weights based on data from # Vanderbilt University Medical Center weightsPreCalc = getWeights( demoSample, phecodeOccurrences, method = 'prevalence_precalc')
library('data.table') library('survival') # map ICD codes to phecodes phecodeOccurrences = getPhecodeOccurrences(icdSample) # calculate weights using the prevalence method weightsPrev = getWeights(demoSample, phecodeOccurrences) # calculate weights using the prevalence method # (assign negative weights to those with zero phecode occurrence) weightsPrevNeg = getWeights( demoSample, phecodeOccurrences, negativeWeights = TRUE) # calculate weights using the logistic method weightsLogistic = getWeights( demoSample, phecodeOccurrences, method = 'logistic', methodFormula = ~ sex) # calculate weights using the loglinear method phecodeOccurrences2 = phecodeOccurrences[, .( num_occurrences = uniqueN(entry_date)), by = .(person_id, phecode)] weightsLoglinear = getWeights( demoSample, phecodeOccurrences2, method = 'loglinear', methodFormula = ~ sex) # calculate weights using the cox method phecodeOccurrences3 = phecodeOccurrences[, .( first_occurrence_date = min(entry_date)) , by = .(person_id, phecode)] phecodeOccurrences3 = merge( phecodeOccurrences3, demoSample[, .(person_id, dob)], by = 'person_id') phecodeOccurrences3[, occurrence_age := as.numeric((first_occurrence_date - dob)/365.25)] phecodeOccurrences3[, `:=`(first_occurrence_date = NULL, dob = NULL)] demoSample3 = demoSample[, .( person_id, sex, first_age = as.numeric((first_visit_date - dob)/365.25), last_age = as.numeric((last_visit_date - dob)/365.25))] weightsCox = getWeights( demoSample3, phecodeOccurrences3, method = 'cox', methodFormula = ~ sex) # calculate weights using pre-calculated weights based on data from # Vanderbilt University Medical Center weightsPreCalc = getWeights( demoSample, phecodeOccurrences, method = 'prevalence_precalc')
The weights are based on EHR data from the Vanderbilt University Medical Center Synthetic Derivative (SD) and ICD-phecode map version 1.2 and are calculated using the "prevalence" method.
preCalcWeights
preCalcWeights
A data.table with the following columns:
phecode
: Character vector of phecodes
prev
: Numeric vector of prevalences, i.e., fraction of subjects in the SD
that have at least one occurrence of the given phecode
w
: Numeric vector of weights, calculated as -log10(prev)