Getting Started

We developed the novel dnenrich framework to estimate enrichment of de novo mutations within pre-defined groups of genes. dnenrich performs one-sided pathway enrichment and recurrence analysis. The p-value is calculated under a binomial model of greater-than-expected hits per gene set; that is, it is the rate at which random mutations hit a particular pathway (or recur) at least as many times as observed. For example, to calculate the enrichment of LoF ("loss-of-function") mutations in a gene set/pathway, dnenrich calculates the observed statistic:

N_obs = Number of LoF mutations observed to fall in a gene in set of interest

To model exome-wide dispersion of de novo variation under a null hypothesis that accounts for gene size, tri-nucleotide context and functional effect of a mutation, and per-trio effective gene coverage, random permutations are used to place mutations in the "exome", while the number of mutations, their base context, and functional impact are held fixed to that of the observed data. Conditioning on the observed number of mutations is critical due to the role of sequencing coverage on detectable absolute mutation rates, particularly for data from published studies where access to exome-wide coverage statistics may not be possible. We define the "exome" here such that each gene takes up a relative fraction proportional to its total coding length, and this per-gene length can be adjusted based on the joint sequencing coverage for each particular trio (though in benchmarking data, this did not have a noticeable effect on results). Then, at permutation i, dnenrich calculates the permuted statistic:

N_{perm_i} = Number of randomly generated LoF mutations in some gene in set

The p-value is then calculated as:

(1+ number of times N_{perm_i} >= N_obs) / (1+ number of permutations)

Fold-enrichment statistics (observed-to-expected ratio = O/E) are calculated as the ratio between the statistic of observed hits (N_obs) and the average of this statistic over all permutations, i.e., the estimated expected value of the statistic under the null, E[N_{perm_i}].
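
The permutation p-value and O/E ratio described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the dnenrich implementation: it uses a single mutation class, ignores base context and per-trio coverage, and the function name `permutation_test` and its parameters are hypothetical.

```python
import random

def permutation_test(n_obs, gene_sizes, gene_set, n_mutations, n_perm=1000, seed=0):
    """Empirical one-sided p-value and O/E ratio for hits in gene_set.

    gene_sizes is a dict gene -> relative size (a stand-in for one row of the
    gene size matrix). Each permutation places n_mutations mutations at random,
    with probability proportional to gene size, so the number of mutations is
    held fixed to that of the observed data.
    """
    rng = random.Random(seed)
    genes = list(gene_sizes)
    weights = [gene_sizes[g] for g in genes]
    n_at_least_obs = 0   # permutations with N_perm_i >= N_obs
    total_perm_hits = 0  # running sum of N_perm_i, used for E[N_perm_i]
    for _ in range(n_perm):
        placed = rng.choices(genes, weights=weights, k=n_mutations)
        n_perm_i = sum(1 for g in placed if g in gene_set)
        total_perm_hits += n_perm_i
        if n_perm_i >= n_obs:
            n_at_least_obs += 1
    p_value = (1 + n_at_least_obs) / (1 + n_perm)
    expected = total_perm_hits / n_perm
    # Assumption: report infinity when no permutation ever hits the set.
    fold = n_obs / expected if expected > 0 else float("inf")
    return p_value, expected, fold
```

The "+1" in the numerator and denominator means the smallest attainable p-value is 1 / (1 + number of permutations), so more permutations are needed to resolve smaller p-values.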

To test for exome-wide recurrence of LoF mutations, dnenrich similarly counts the number of permutations for which 2 or more genes are hit by LoF mutations.
And, to test for nominal significance of recurrence of any particular gene, dnenrich counts the number of permutations in which 2 or more mutations fall in that gene.
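
As a rough sketch of the per-gene recurrence test, the following Python counts the permutations in which a given gene receives 2 or more mutations. It assumes (this is not stated explicitly above) that the count is turned into a p-value with the same (1 + count) / (1 + permutations) formula; the function name and parameters are hypothetical, and mutation class and base context are ignored.

```python
import random

def gene_recurrence_p(gene, gene_sizes, n_mutations, n_perm=1000, seed=0):
    """Fraction of permutations in which `gene` is hit at least twice."""
    rng = random.Random(seed)
    genes = list(gene_sizes)
    weights = [gene_sizes[g] for g in genes]
    n_recurrent = 0
    for _ in range(n_perm):
        # Place mutations with probability proportional to gene size.
        placed = rng.choices(genes, weights=weights, k=n_mutations)
        if placed.count(gene) >= 2:
            n_recurrent += 1
    # Assumption: same "+1" correction as the enrichment p-value.
    return (1 + n_recurrent) / (1 + n_perm)
```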

There is also an option in dnenrich for comparative analysis using as a test statistic the relative proportion of hits of one set of mutations among that set and a "contrast" set of mutations (e.g., between diseases, or functional classes of mutations), where the properties of each class of mutation are separately matched during permutation.
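
One plausible reading of the comparative test statistic is sketched below: the number of gene-set hits from the primary mutations divided by the number of gene-set hits from the primary and contrast mutations combined. The function name and the convention of returning 0.0 when neither set hits the gene set are assumptions made for illustration.

```python
def comparative_statistic(primary_genes, contrast_genes, gene_set):
    """Proportion of gene-set hits that come from the primary mutations.

    primary_genes / contrast_genes: genes hit by each class of mutation
    (one entry per mutation, so recurrently hit genes appear repeatedly).
    """
    primary_hits = sum(1 for g in primary_genes if g in gene_set)
    contrast_hits = sum(1 for g in contrast_genes if g in gene_set)
    total = primary_hits + contrast_hits
    # Assumption: define the statistic as 0.0 when there are no hits at all.
    return primary_hits / total if total else 0.0
```

In a permutation, each class would be re-placed separately (matching its own mutation properties) and the same statistic recomputed.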

When running dnenrich with a particular set of genes (e.g., a pathway), there is the option to assign a weight to each gene within that pathway, for example, to correspond to the relative certainty that genes in fact belong to a pathway derived from high-throughput experiments. In our paper, we used this gene-weighting feature when constructing sets of genes bearing de novo mutations from the various diseases such that a gene hit recurrently would have a weight equal to the number of times it was hit by a mutation. That is, when testing overlap of mutations from one disease sample with genes from another disease sample, this ensured the raw test statistic (# of mutations in first disease, multiplied by the # of mutations in second disease) would be symmetrically computed for calculating the overlap of genes from one disease with mutations from a second disease, and vice versa.
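
The symmetry argument can be checked with a small Python sketch. Here the weighted statistic is taken to be the sum, over the mutations of one disease, of the weight of the gene each mutation hits, with gene weights set to the hit counts from the other disease. The function name is hypothetical and the gene lists are made-up toy data.

```python
from collections import Counter

def weighted_overlap(mutated_genes, gene_weights):
    """Sum of gene-set weights over all mutations (genes not in the set add 0)."""
    return sum(gene_weights.get(g, 0) for g in mutated_genes)

# Toy example: one list entry per de novo mutation.
disease1 = ["g1", "g1", "g2"]
disease2 = ["g1", "g3", "g1", "g2"]

# Weight of each gene = number of times it was hit in the other disease.
stat_1_in_2 = weighted_overlap(disease1, Counter(disease2))
stat_2_in_1 = weighted_overlap(disease2, Counter(disease1))
```

Both directions reduce to the sum over genes of (hits in first disease) x (hits in second disease), which is why the two values agree.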

Requisite input files (see program usage and example runs)

Alias list:
Official gene names, each followed by the aliases that map back to that gene name. Genes with no aliases should still be included, lest their official name be treated as an alias of another gene (e.g., for TCF4). Format:
```
geneA	geneA_alias_1	geneA_alias_2	geneA_alias_3
geneB
geneC	geneC_alias_1	geneC_alias_2
...
```
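
A minimal Python sketch of how such a file could be turned into an alias-to-official-name lookup is shown below. The function name is hypothetical, and the rule that an official name always wins over an alias of the same spelling is an assumption consistent with the note above, not a description of dnenrich internals.

```python
def build_alias_map(lines):
    """Map every official name and alias to its official gene name."""
    alias_to_official = {}
    officials = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        official, aliases = fields[0], fields[1:]
        officials.add(official)
        # An official name always maps to itself, overriding any alias entry.
        alias_to_official[official] = official
        for alias in aliases:
            # Never let an alias shadow a name already known to be official.
            if alias not in officials:
                alias_to_official.setdefault(alias, official)
    return alias_to_official
```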

Gene size matrix:
Matrix of gene sizes, where each row gives the relative sizes of each gene for the corresponding base composition and functional impact of mutation.
Format:
```
Group	Type	geneA	geneB	geneC
*	*:*	1000	2000	3000
*	*:LoF	50	75	200
*	*:NS	100	100	110
individual1	*:*	900	1800	2999
individual1	*:LoF	45	55	180
individual1	*:NS	90	97	105
individual2	*:*	800	1900	2950
individual2	*:LoF	49	65	199
individual2	*:NS	85	100	108
```
Column 1: Group. The group is either the generic group ("*") used for trios that do not have per-gene sequencing depths listed in the file, or the name of the child for whom effective gene sizes will be given. The generic group does not take into account per-trio sequencing coverage statistics. The per-trio gene sizes should be used if one has calculated the effective gene sizes after requiring that there be sufficient sequencing coverage in all 3 members of the trio at the corresponding bases in the respective genes.

Column 2: Class of mutation. For example, "*:LoF", "*:NS"; the labels are described in the mutation naming guide below.

Columns 3 and onward: Genes. After the first two columns, each column represents a gene and holds the number of coding base pairs that the gene spans, effectively its "gene size". For overlapping genes, the overlap region gets its own column holding the number of overlapping base pairs, so that a gene is modeled as the sum of its sizes across all of the disjoint columns in which it appears. Columns for regions containing multiple genes are named by the gene names delimited by "+", for example: "geneA+geneB+geneC".
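
The following Python sketch shows one way to read this matrix, fall back to the generic group ("*") for individuals without their own rows, and sum a gene's size over the "+"-delimited overlap columns. All function names are hypothetical; this illustrates the file format and is not the parser used by dnenrich.

```python
def parse_gene_size_matrix(lines):
    """Return {(group, mutation_class): {column_name: size}}."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    columns = header[2:]  # gene names, or "geneA+geneB" overlap regions
    matrix = {}
    for fields in body:
        group, mutation_class = fields[0], fields[1]
        sizes = [float(x) for x in fields[2:]]
        matrix[(group, mutation_class)] = dict(zip(columns, sizes))
    return matrix

def lookup_row(matrix, individual, mutation_class):
    """Per-trio sizes if present, otherwise the generic "*" group."""
    key = (individual, mutation_class)
    return matrix[key] if key in matrix else matrix[("*", mutation_class)]

def gene_total_size(row, gene):
    """Sum a gene's size over every column in which it appears."""
    return sum(size for col, size in row.items() if gene in col.split("+"))
```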

Gene set:
White-space-delimited file listing all genes in the gene set the user is testing.
Format:
```
geneA	set1	weight_geneA_set1
geneB	set1	weight_geneB_set1
geneC	set1	weight_geneC_set1
geneA	set2	weight_geneA_set2
geneD	set2	weight_geneD_set2
```
Column 1: Gene ID

Column 2: Gene set. Name of gene set to which gene belongs (NOTE: CANNOT CONTAIN WHITE SPACE IN THE GENE SET NAME.).

Column 3: Weight of gene in that gene set. Weights (integer or floating point value) are optional and can be omitted or set to 1 if not available.
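
As an illustration of the format, the Python sketch below reads a gene set file into a nested dictionary and applies a weight of 1 when the third column is absent. The function name and the returned structure are choices made for this example only.

```python
from collections import defaultdict

def parse_gene_sets(lines):
    """Return {set_name: {gene: weight}} from white-space-delimited lines."""
    gene_sets = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene, set_name = fields[0], fields[1]
        # The weight column is optional; default to 1 when it is omitted.
        weight = float(fields[2]) if len(fields) > 2 else 1.0
        gene_sets[set_name][gene] = weight
    return dict(gene_sets)
```

Because fields are split on any white space, a gene set name containing a space would be misread as two columns, which is the reason for the restriction noted above.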

Mutation list:
File listing the mutations in the user's dataset.
Format:
```
individual1	geneA	mutation_class	mutation_weight	comparative
individual1	geneC	mutation_class	mutation_weight	comparative
individual2	geneD	mutation_class	mutation_weight	comparative
```
The first two fields are required; the last three are optional and take the defaults described below.
Column 1: Individual ID. Person in which the mutation occurs.

Column 2: Gene. Gene in which mutation occurs.

Column 3: Class of mutation. The class of mutation should match a class listed in the gene size matrix. Correct labeling is described in the mutation naming guide below. This field is optional and will be set to a default of '*:*' (the generic class).

Column 4: Weight of mutation. The weight assigned to a particular mutation. One example where this would be used is if a gene was hit recurrently, and the weight was equal to the number of times it was hit. This field is optional and will be set to a default of 1.

Column 5: Binary column used for comparative analysis mode. The fifth column indicates whether comparative analysis will be carried out. If all values are "1", then no comparative analysis is performed; otherwise, a one-sided test is performed comparing the mutations marked by "1" in this column against the mutations marked by "0" in this column. This field is optional and will be set to a default of "1".
Reiterating: if not relevant, the "mutation_weight" and "comparative" fields can be omitted or set to 1.
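
The defaults given in the column descriptions can be made concrete with a short Python sketch. It assumes that optional fields are omitted only from the end of a line (so a line has 2 to 5 fields); the `Mutation` type and function names are hypothetical.

```python
from typing import NamedTuple

class Mutation(NamedTuple):
    individual: str
    gene: str
    mutation_class: str = "*:*"  # generic class
    weight: float = 1.0
    comparative: str = "1"       # "1" = primary set, "0" = contrast set

def parse_mutation(line):
    """Parse one mutation list line, filling in defaults for missing fields."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("need at least individual ID and gene: %r" % line)
    individual, gene = fields[0], fields[1]
    mutation_class = fields[2] if len(fields) > 2 else "*:*"
    weight = float(fields[3]) if len(fields) > 3 else 1.0
    comparative = fields[4] if len(fields) > 4 else "1"
    return Mutation(individual, gene, mutation_class, weight, comparative)

def is_comparative(mutations):
    """Comparative mode applies only if some mutation is not marked "1"."""
    return any(m.comparative != "1" for m in mutations)
```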

Background list (optional input file):
List of genes to be "conditioned" on during permutation, listed in a single column (i.e., one gene per row). One example where the use of conditioning on a background set of genes is useful is if we are looking at enrichment in differentially expressed brain genes; the background list could be a list of brain expressed genes.

If multiple background list files are given, then the intersection of all lists is used as the final background set to which all observed mutations are subsetted, and for which all null permutations are generated.
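
The intersection and subsetting steps can be sketched as follows in Python. The function names are hypothetical, and each background list is represented simply as an iterable of gene names.

```python
def intersect_backgrounds(gene_lists):
    """Final background set = intersection of all background lists."""
    sets = [set(g for g in genes if g) for genes in gene_lists]
    if not sets:
        return None  # no background given: nothing to condition on
    return set.intersection(*sets)

def restrict_to_background(mutated_genes, background):
    """Keep only observed mutations that fall in background genes."""
    if background is None:
        return list(mutated_genes)
    return [g for g in mutated_genes if g in background]
```

Null permutations would then be generated over the same restricted set of genes, so that observed and permuted mutations are drawn from the same background.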

Typically, the gene set, mutation list, and background list (input files 3, 4, and 5 in the order listed above) will represent your own data.

Mutation naming guide

The mutations in the mutation list should be labeled according to the following guide, in order to match up with the mutation types in the gene size matrix we distribute online (format defined under "Gene size matrix" above). This convention allows the user to specify both the base change and the functional effect of the mutation.

Mutations are labeled in the following way:

trinucleotide_base_context_change:functional_annotation

Some examples:

*:*   | Any base change and any functional effect

*:LoF   | Any base change and a loss of function mutation (used for LoF indels)

*:NS   | Any base change and a nonsynonymous mutation (used for NS indels)

*:esplice   | Any base change and an esplice mutation

*:missense   | Any base change and a missense mutation

*:nonsense   | Any base change and a nonsense mutation

*:silent   | Any base change and a silent mutation

AAA/ACA:*   | Base change from AAA to ACA and any functional effect

TTT/TAT:NS   | Base change from TTT to TAT resulting in a nonsynonymous mutation

TTT/TAT:missense   | Base change from TTT to TAT resulting in a missense mutation
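
To make the labeling convention concrete, here is a small Python sketch that splits a label into its base-context change and functional annotation. The function name and the use of `None` for the "*" wildcard context are assumptions made for this example.

```python
def parse_mutation_label(label):
    """Split 'REF/ALT:effect' (or '*:effect') into (ref, alt, effect)."""
    context, sep, effect = label.partition(":")
    if not sep or not effect:
        raise ValueError("expected 'context:annotation', got %r" % label)
    if context == "*":
        return None, None, effect  # any base change
    ref, slash, alt = context.partition("/")
    if not slash or not ref or not alt:
        raise ValueError("expected 'REF/ALT' base context, got %r" % context)
    return ref, alt, effect
```

For instance, "TTT/TAT:missense" yields the trinucleotide change TTT to TAT with a missense effect, while "*:LoF" yields no specific base change and a LoF effect.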