Getting Started
We developed the novel dnenrich framework to estimate enrichment of de novo mutations within pre-defined groups of genes. dnenrich performs one-sided pathway enrichment and recurrence analysis, calculating the p-value under a binomial model of greater-than-expected hits per gene set; that is, the rate at which random mutations hit a particular pathway (or recur) at least as many times as observed. For example, to calculate the enrichment of LoF ("loss-of-function") mutations in a gene set/pathway, dnenrich calculates the observed statistic:
Nobs = Number of LoF mutations observed to fall in a gene in the set of interest
To model exome-wide dispersion of de novo variation under a null hypothesis that accounts for gene size, tri-nucleotide context and functional effect of a mutation, and per-trio effective gene coverage, random permutations are used to place mutations in the "exome", while the number of mutations, their base context, and functional impact are held fixed to that of the observed data. Conditioning on the observed number of mutations is critical due to the role of sequencing coverage on detectable absolute mutation rates, particularly for data from published studies where access to exome-wide coverage statistics may not be possible. We define the "exome" here such that each gene takes up a relative fraction proportional to its total coding length, and this per-gene length can be adjusted based on the joint sequencing coverage for each particular trio (though in benchmarking data, this did not have a noticeable effect on results). Then, at permutation i, dnenrich calculates the permuted statistic:
Npermi = Number of randomly placed LoF mutations that fall in a gene in the set of interest
The p-value is then calculated as:
p = (1 + number of permutations in which Npermi >= Nobs) / (1 + number of permutations)
Fold-enrichment statistics (observed-to-expected ratio = O/E) are calculated as the ratio between the statistic of observed hits (Nobs) and the average of this statistic over all permutations, i.e., the estimated expected value of the statistic under the null, E[Npermi].
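The p-value and O/E calculations above can be sketched in a few lines of Python. This is a minimal illustration, not the dnenrich implementation: the helper names (permute_hits, empirical_p_and_fold) and the toy gene sizes are hypothetical, and the permutation step matches mutations only on gene size, ignoring tri-nucleotide context, functional class, and per-trio coverage.

```python
import random

def permute_hits(gene_sizes, gene_set, n_mutations, rng):
    """Place n_mutations at random, choosing each gene with probability
    proportional to its size, and count how many land in gene_set."""
    genes = list(gene_sizes)
    weights = [gene_sizes[g] for g in genes]
    drawn = rng.choices(genes, weights=weights, k=n_mutations)
    return sum(1 for g in drawn if g in gene_set)

def empirical_p_and_fold(n_obs, permuted):
    """Return (p-value, O/E) given Nobs and the list of permuted statistics."""
    n_ge = sum(1 for n in permuted if n >= n_obs)
    p_value = (1 + n_ge) / (1 + len(permuted))
    expected = sum(permuted) / len(permuted)  # estimate of E[Npermi]
    fold = n_obs / expected if expected > 0 else float("inf")
    return p_value, fold

rng = random.Random(0)  # fixed seed so the sketch is reproducible
gene_sizes = {"geneA": 1000, "geneB": 2000, "geneC": 3000}  # toy values
permuted = [permute_hits(gene_sizes, {"geneA"}, 10, rng) for _ in range(1000)]
p_value, fold = empirical_p_and_fold(5, permuted)
print(p_value, fold)
```

Adding 1 to both numerator and denominator keeps the estimated p-value strictly above zero, however many permutations are run.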
And, to test for nominal significance of recurrence of any particular gene, dnenrich counts the number of permutations in which 2 or more mutations fall in that gene.
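As an illustration of the recurrence test, the sketch below counts permutations in which a given gene receives 2 or more mutations. The function name recurrence_p_value is hypothetical; mutations are again placed by gene size only, and reusing the (1 + count) / (1 + permutations) form for the recurrence p-value is an assumption rather than something stated above.

```python
import random
from collections import Counter

def recurrence_p_value(gene, gene_sizes, n_mutations, n_perm, seed=0):
    """Estimate how often `gene` is hit 2 or more times by chance."""
    rng = random.Random(seed)
    genes = list(gene_sizes)
    weights = [gene_sizes[g] for g in genes]
    n_recurrent = 0
    for _ in range(n_perm):
        # One permutation: scatter all mutations, then tally hits per gene.
        counts = Counter(rng.choices(genes, weights=weights, k=n_mutations))
        if counts[gene] >= 2:
            n_recurrent += 1
    # Assumption: same +1 correction as the enrichment p-value.
    return (1 + n_recurrent) / (1 + n_perm)

toy_sizes = {"geneA": 1000, "geneB": 2000, "geneC": 3000}  # toy values
print(recurrence_p_value("geneA", toy_sizes, n_mutations=20, n_perm=500))
```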
There is also an option in dnenrich for comparative analysis using as a test statistic the relative proportion of hits of one set of mutations among that set and a "contrast" set of mutations (e.g., between diseases, or functional classes of mutations), where the properties of each class of mutation are separately matched during permutation.
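One plausible reading of this comparative statistic is sketched below: the fraction of gene-set hits that come from the primary mutations (flag 1) out of all gene-set hits from the primary and contrast (flag 0) mutations combined. The function name and the handling of the zero-hit case are assumptions for illustration only.

```python
def comparative_statistic(mutations, gene_set):
    """mutations: iterable of (gene, flag) pairs, flag 1 = primary, 0 = contrast.
    Returns primary hits / (primary hits + contrast hits) within gene_set."""
    primary = sum(1 for gene, flag in mutations if flag == 1 and gene in gene_set)
    contrast = sum(1 for gene, flag in mutations if flag == 0 and gene in gene_set)
    total = primary + contrast
    # Assumption: report 0.0 when neither class hits the gene set.
    return primary / total if total else 0.0

example = [("geneA", 1), ("geneA", 1), ("geneB", 0), ("geneC", 1)]
print(comparative_statistic(example, {"geneA", "geneB"}))
```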
When running dnenrich with a particular set of genes (e.g., a pathway), there is the option to assign a weight to each gene within that pathway, for example, to correspond to the relative certainty that genes in fact belong to a pathway derived from high-throughput experiments. In our paper, we used this gene-weighting feature when constructing sets of genes bearing de novo mutations from the various diseases such that a gene hit recurrently would have a weight equal to the number of times it was hit by a mutation. That is, when testing overlap of mutations from one disease sample with genes from another disease sample, this ensured the raw test statistic (# of mutations in first disease, multiplied by the # of mutations in second disease) would be symmetrically computed for calculating the overlap of genes from one disease with mutations from a second disease, and vice versa.
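The symmetry argument can be checked with a small sketch. Here each disease is represented as a hypothetical dictionary mapping gene to the number of mutations hitting it; one dictionary plays the role of the mutation list and the other the role of the weighted gene set.

```python
def weighted_overlap(mutation_counts, gene_set_weights):
    """Sum over genes of (# mutations in the gene) x (weight of the gene in the set)."""
    return sum(n * gene_set_weights.get(gene, 0)
               for gene, n in mutation_counts.items())

disease1 = {"geneA": 2, "geneB": 1}  # toy per-gene mutation counts
disease2 = {"geneA": 3, "geneC": 1}

# With weights equal to hit counts, the raw statistic is the same either way.
print(weighted_overlap(disease1, disease2), weighted_overlap(disease2, disease1))
```

Without the weights (every gene weighted 1), the two directions would give 2 and 3 here, which is the asymmetry the weighting removes.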
Requisite input files (see program usage and example runs)
- Alias list:
Official gene names with aliases that map back to the gene name. Genes with no aliases should still be included, lest that official name be treated as an alias of another gene (e.g., TCF4). Format, one gene per line:
    geneA geneA_alias_1 geneA_alias_2 geneA_alias_3
    geneB
    geneC geneC_alias_1 geneC_alias_2
    ...
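To show why listing alias-free genes matters, here is a minimal sketch of turning such a file into a name-to-official-name lookup. The function load_alias_map is hypothetical, and the rule that an official name always wins over the same string used as an alias is an assumption about how ambiguity would be resolved.

```python
def load_alias_map(lines):
    """Map every official name and alias to its official gene name."""
    rows = [line.split() for line in lines if line.strip()]
    lookup = {}
    # First pass: every official name maps to itself.
    for row in rows:
        lookup[row[0]] = row[0]
    # Second pass: aliases map to their gene unless the name is already taken.
    for row in rows:
        for alias in row[1:]:
            lookup.setdefault(alias, row[0])
    return lookup

alias_lines = [
    "geneA geneA_alias_1 geneB",  # geneB also appears as an alias of geneA
    "geneB",                      # listed on its own, so it stays official
    "geneC geneC_alias_1",
]
print(load_alias_map(alias_lines)["geneB"])
```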
- Gene size matrix:
Matrix of gene sizes, where each row gives the relative sizes of each gene for the corresponding base composition and functional impact of mutation.
Format:
    Group        Type   geneA  geneB  geneC
    *            *:*    1000   2000   3000
    *            *:LoF  50     75     200
    *            *:NS   100    100    110
    individual1  *:*    900    1800   2999
    individual1  *:LoF  45     55     180
    individual1  *:NS   90     97     105
    individual2  *:*    800    1900   2950
    individual2  *:LoF  49     65     199
    individual2  *:NS   85     100    108
Column 1: Group. The group is either the generic group ("*") used for trios that do not have per-gene sequencing depths listed in the file, or the name of the child for whom effective gene sizes will be given. The generic group does not take into account per-trio sequencing coverage statistics. The per-trio gene sizes should be used if one has calculated the effective gene sizes after requiring that there be sufficient sequencing coverage in all 3 members of the trio at the corresponding bases in the respective genes.
Column 2: Class of mutation. For example, "*:LoF", "*:NS"; the labels are described in the mutation naming guide below.
Columns 3 onward: Genes. After the first two columns, each column represents a gene, and each value gives the number of coding base pairs that the gene spans, effectively its "gene size". In the case of overlapping genes, the overlap region gets its own column holding the number of overlapping base pairs (so that a gene is modeled disjointly as the sum of its sizes across all of the columns in which it appears); columns for regions containing multiple genes are named by the gene names delimited by "+", for example: "geneA+geneB+geneC".
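A minimal sketch of reading this matrix and recovering a gene's total size across overlap columns follows. The function names are hypothetical, the file is assumed to be white-space-delimited with a header row, and the toy rows include a "geneA+geneB" overlap column for illustration.

```python
def parse_size_matrix(lines):
    """Return {(group, mutation_class): {column_name: size}}."""
    rows = [line.split() for line in lines if line.strip()]
    columns = rows[0][2:]  # names after the Group and Type headers
    sizes = {}
    for row in rows[1:]:
        group, mutation_class = row[0], row[1]
        sizes[(group, mutation_class)] = dict(zip(columns, map(float, row[2:])))
    return sizes

def total_gene_size(row_sizes, gene):
    """Sum a gene's size over every column it appears in, including
    overlap columns named like 'geneA+geneB'."""
    return sum(size for column, size in row_sizes.items()
               if gene in column.split("+"))

matrix_lines = [
    "Group Type geneA geneB geneA+geneB",
    "* *:* 1000 2000 50",
    "individual1 *:LoF 45 55 5",
]
matrix = parse_size_matrix(matrix_lines)
print(total_gene_size(matrix[("*", "*:*")], "geneA"))
```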
- Gene set:
White-space-delimited file listing all genes in the gene set the user is testing.
Format:
    geneA set1 weight_geneA_set1
    geneB set1 weight_geneB_set1
    geneC set1 weight_geneC_set1
    geneA set2 weight_geneA_set2
    geneD set2 weight_geneD_set2
Column 1: Gene ID
Column 2: Gene set. Name of the gene set to which the gene belongs (NOTE: the gene set name cannot contain white space).
Column 3: Weight of gene in that gene set. Weights (integer or floating point value) are optional and can be omitted or set to 1 if not available.
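The gene set format, including the optional weight column, can be read as in the sketch below. The function name parse_gene_sets is hypothetical; a missing weight is treated as 1, following the description of Column 3.

```python
from collections import defaultdict

def parse_gene_sets(lines):
    """Return {set_name: {gene: weight}} from white-space-delimited lines."""
    gene_sets = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene, set_name = fields[0], fields[1]
        # The weight column is optional; default to 1 when it is absent.
        weight = float(fields[2]) if len(fields) > 2 else 1.0
        gene_sets[set_name][gene] = weight
    return dict(gene_sets)

gene_set_lines = ["geneA set1 2", "geneB set1", "geneA set2 0.5"]
print(parse_gene_sets(gene_set_lines))
```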
- Mutation list:
File listing the mutations in the user's dataset.
Format:
    individual1 geneA mutation_class mutation_weight comparative
    individual1 geneC mutation_class mutation_weight comparative
    individual2 geneD mutation_class mutation_weight comparative
Columns 1 and 2 are required; columns 3 to 5 are optional and fall back to the defaults given below.
Column 1: Individual ID. Person in which the mutation occurs.
Column 2: Gene. Gene in which mutation occurs.
Column 3: Class of mutation. The class of mutation should match a class listed in the gene size matrix. The labeling convention is described in the mutation naming guide on the Getting Started page. This field is optional and will be set to a default of '*:*' (the generic class).
Column 4: Weight of mutation. The weight assigned to a particular mutation. One example where this would be used is if a gene was hit recurrently, and the weight was equal to the number of times it was hit. This field is optional and will be set to a default of 1.
Column 5: Binary column used for comparative analysis mode. The fifth column indicates whether comparative analysis will be carried out. If all values are "1", then no comparative analysis is performed; otherwise, a one-sided test is performed comparing the mutations marked by "1" in this column against the mutations marked by "0" in this column. This field is optional and will be set to a default of "1".
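The column defaults described above can be sketched as a small parser. The Mutation type and parse_mutation function are hypothetical, and the sketch assumes that optional fields are omitted only from the end of a line.

```python
from typing import NamedTuple

class Mutation(NamedTuple):
    individual: str
    gene: str
    mutation_class: str
    weight: float
    comparative: int

def parse_mutation(line):
    """Parse one mutation-list line, filling in defaults for columns 3 to 5."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("need at least an individual ID and a gene")
    defaults = ["*:*", "1", "1"]  # class, weight, comparative flag
    # Append only the defaults for the columns that are missing.
    fields = (fields + defaults[len(fields) - 2:])[:5]
    individual, gene, mutation_class, weight, comparative = fields
    return Mutation(individual, gene, mutation_class, float(weight), int(comparative))

print(parse_mutation("individual1 geneA"))
print(parse_mutation("individual2 geneD *:LoF 2 0"))
```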
Reiterating: if not relevant, the "mutation_weight" and "comparative" fields can be omitted or set to 1.
- Background list (optional input file):
List of genes to be "conditioned" on during permutation, listed in a single column (i.e., one gene per row). One example where the use of conditioning on a background set of genes is useful is if we are looking at enrichment in differentially expressed brain genes; the background list could be a list of brain expressed genes.
If multiple background list files are given, then the intersection of all lists is used as the final background set to which all observed mutations are subsetted, and for which all null permutations are generated.
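The intersection rule for multiple background lists, and the subsetting of observed mutations to that background, can be illustrated as follows. Both function names are hypothetical, and mutations are represented only by their gene names for brevity.

```python
def combine_backgrounds(gene_lists):
    """Intersect several background gene lists into one background set."""
    sets = [set(genes) for genes in gene_lists]
    if not sets:
        return None  # no background given: nothing to condition on
    return set.intersection(*sets)

def subset_to_background(mutated_genes, background):
    """Keep only mutations whose gene lies in the background set."""
    if background is None:
        return list(mutated_genes)
    return [gene for gene in mutated_genes if gene in background]

brain_expressed = ["geneA", "geneB", "geneC"]  # toy background lists
well_covered = ["geneB", "geneC", "geneD"]
background = combine_backgrounds([brain_expressed, well_covered])
print(subset_to_background(["geneA", "geneB", "geneB", "geneD"], background))
```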
Typically, input files (3), (4), and (5), that is, the gene set, mutation list, and background list, will represent your own data.
Mutation classes in the mutation list should be labeled according to the following guide so that they match the mutation types in the gene size matrix we distribute online (its format is described under "Gene size matrix" above). This convention allows the user to specify both the base change and the functional effect of the mutation.
Mutations are labeled in the following way:
    trinucleotide_base_context_change:functional_annotation
Some examples:
    *:*              | Any base change and any functional effect
    *:LoF            | Any base change and a loss of function mutation (used for LoF indels)
    *:NS             | Any base change and a non synonymous mutation (used for NS indels)
    *:esplice        | Any base change and an esplice mutation
    *:missense       | Any base change and a missense mutation
    *:nonsense       | Any base change and a nonsense mutation
    *:silent         | Any base change and a silent mutation
    AAA/ACA:*        | Base change from AAA to ACA and any functional mutation
    TTT/TAT:NS       | Base change from TTT to TAT resulting in a nonsynonymous mutation
    TTT/TAT:missense | Base change from TTT to TAT resulting in a missense mutation
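As a rough illustration of the labeling convention, the sketch below splits a label into its base-context change and functional annotation, with "*" read as "any". The function name parse_mutation_class and the use of None for a wildcard are choices made for this example only.

```python
def parse_mutation_class(label):
    """Split 'REF/ALT:effect' into (ref_context, alt_context, effect).
    A '*' on either side of the colon is returned as None (meaning 'any')."""
    context, sep, effect = label.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in mutation class {label!r}")
    if context == "*":
        ref = alt = None
    else:
        ref, alt = context.split("/")  # e.g. 'TTT/TAT' -> 'TTT', 'TAT'
    return ref, alt, (None if effect == "*" else effect)

for label in ["*:*", "*:LoF", "AAA/ACA:*", "TTT/TAT:missense"]:
    print(label, parse_mutation_class(label))
```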