QTLtools trans - trans QTL analysis
QTLtools trans --vcf [in.vcf|in.vcf.gz|in.bcf|in.bed.gz] --bed
quantifications.bed.gz [--nominal | --permute | --sample integer |
--adjust in.txt] --out output.txt [OPTIONS]
This mode maps trans (distal) quantitative trait loci (QTLs) that affect the
phenotypes, using linear regression. The method is detailed in
<https://www.nature.com/articles/ncomms15452>. We first regress out the
provided covariates from the phenotype data, then run the linear regression
between the phenotype residuals and the genotype. If --normal and --cov are
provided at the same time, the residuals after the covariate correction are
rank normal transformed. The mode incorporates an efficient permutation
scheme. You can run a nominal pass (--nominal) listing all genotype-phenotype
associations below a certain threshold, a permutation pass (--permute or
--sample no_genes_to_sample) to empirically characterize the null
distribution of associations, or adjust the nominal p-values based on
permutations (--adjust).
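The rank normal transformation mentioned above can be sketched as follows. This is a minimal illustration, not the QTLtools implementation: the function name rank_normal_transform, the averaging of tied ranks, and the (rank - 0.5) / n offset are all assumptions made for the example.

```python
from statistics import NormalDist


def rank_normal_transform(values):
    """Map values to standard-normal quantiles according to their ranks.

    Assumption: ties share their average rank and quantiles are taken at
    (rank - 0.5) / n. QTLtools may use a different convention.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    normal = NormalDist()
    # (rank - 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined.
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because only ranks matter, an extreme outlier such as 1e6 in a sample of four values ends up at the same quantile as any other largest value, which is why the transformation protects the regression against outliers.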
In the full permutation scheme (--permute) we permute all phenotypes
using the same random number sequence to preserve the correlation structure.
By doing so, the only association we actually break in the data is between the
genotype and the phenotype data. Then, we proceed with a standard association
scan identical to the one used in the nominal pass. In practice, we repeat
this for 100 permutations of the phenotype data. Subsequently, we can proceed
with FDR correction by ranking all the nominal p-values in ascending order and
by counting how many p-values in the permuted data sets are smaller. This
provides an FDR estimate: if we have 500 p-values in the permuted data sets
that are smaller than the 100th smallest nominal p-value, we can then assume
that the FDR for the first 100 associations is around 5% (=500/(100 × 100)).
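The FDR arithmetic in the paragraph above can be reproduced with a short sketch. The function permutation_fdr and its argument names are hypothetical and only illustrate the counting rule described here; they are not part of QTLtools.

```python
from bisect import bisect_left


def permutation_fdr(nominal_pvalues, permuted_pvalues, n_permutations, k):
    """Estimate the FDR of the k smallest nominal p-values.

    permuted_pvalues pools the p-values from all n_permutations permuted
    data sets (hypothetical inputs, e.g. collected from the --permute runs).
    """
    # The k-th smallest nominal p-value acts as the significance threshold.
    threshold = sorted(nominal_pvalues)[k - 1]
    # Count permuted p-values strictly smaller than the threshold.
    n_smaller = bisect_left(sorted(permuted_pvalues), threshold)
    # Average number of false positives expected in one data set.
    expected_false = n_smaller / n_permutations
    return expected_false / k
```

With 500 permuted p-values below the 100th smallest nominal p-value and 100 permutations, this returns 500 / (100 × 100) = 0.05, matching the example in the text.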
To enable fast screening in trans, we also designed an approximation of the
method described just above, based on what we already do in cis. To make it
possible, we assume that the phenotypes are independent and normally
distributed (which can be enforced with --normal). The idea is that since all
phenotypes are normally distributed, they are effectively the same, and the
cis region removed from each phenotype is so small compared to the rest of
the genome that its phenotype-specific impact is negligible. Hence the number
of variants and the correlation amongst them are approximately the same for
each phenotype, and each phenotype is approximately the same. We can
therefore run permutations with a small number of phenotypes rather than all
of them, which drastically decreases the computational burden, and the null
distribution generated can be applied to all phenotypes. The implementation
draws from the null by permuting some randomly chosen phenotypes, testing for
associations with all variants in trans and storing the smallest p-value. We
repeat this many times (typically 1000), effectively building a null
distribution of the strongest associations for a single phenotype. We then
make it continuous by fitting a beta distribution, as we do in cis, and use
it to adjust every nominal p-value coming from the initial pass for the
number of variants being tested. To correct for the number of phenotypes
being tested, we estimate FDR as we do in cis, that is, from the best
adjusted p-values per phenotype (one per phenotype). This also gives an
adjusted p-value threshold that we use to identify all phenotype-variant
pairs that are whole-genome significant. In our experiments, this approach
gives similar results to the full permutation scheme both in terms of FDR
estimates and number of discoveries, while running faster.
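The adjustment step can be illustrated with a deliberately simplified sketch. It assumes the null distribution of the smallest p-value is Beta(1, b), which holds exactly for the minimum of b independent uniform p-values, and fits only b by maximum likelihood. QTLtools fits a beta distribution, but how it estimates the parameters is not described here, so the functions fit_beta_shape2 and adjust_pvalue are hypothetical stand-ins rather than the tool's method.

```python
import math


def fit_beta_shape2(null_best_pvalues):
    """Maximum-likelihood estimate of b for a Beta(1, b) null.

    null_best_pvalues holds the smallest p-value from each permuted
    phenotype; every value must lie strictly between 0 and 1.
    Simplification: the first shape parameter is fixed to 1.
    """
    m = len(null_best_pvalues)
    # Setting the derivative of m*log(b) + (b-1)*sum(log(1-p)) to zero
    # gives b = -m / sum(log(1 - p)).
    return -m / sum(math.log1p(-p) for p in null_best_pvalues)


def adjust_pvalue(nominal_p, shape2):
    """Beta(1, shape2) CDF at nominal_p, i.e. 1 - (1 - p)^shape2.

    This is the probability that the best null association for one
    phenotype is at least as strong as the observed one.
    """
    return -math.expm1(shape2 * math.log1p(-nominal_p))
```

Under this simplification the fitted b behaves like an effective number of independent tests: a nominal p-value of 1e-6 with b close to 1000 is adjusted to roughly 1e-3.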
Since linear regression assumes normally distributed data, we highly
recommend using the --normal option to rank normal transform the phenotype
quantifications in order to avoid false positive associations due to
outliers. If you are using the approximate permutation scheme (--sample),
you MUST use the --normal option or make sure that your phenotypes are
normally distributed.
- --vcf [in.vcf|in.bcf|in.vcf.gz|in.bed.gz]
- Genotypes in VCF/BCF format, or another molecular phenotype
in BED format. If there is a DS field in the genotype FORMAT of a variant
(dosage of the genotype calculated from genotype probabilities, e.g. after
imputation), then this is used as the genotype. If there is only the GT
field in the genotype FORMAT then this is used and it is converted to a
dosage. REQUIRED.
- --bed quantifications.bed.gz
- Molecular phenotype quantifications in BED format.
REQUIRED.
- --out output.txt
- Output file. REQUIRED.
- --cov covariates.txt
- Covariates to correct the phenotype data with.
- --normal
- Rank normal transform the phenotype data so that each
phenotype is normally distributed. RECOMMENDED.
- --window integer
- Size of the cis window to remove flanking each phenotype's
start position. DEFAULT=5000000.
- --threshold float
- P-value threshold below which hits are reported. Give 1.0
to print everything, which may generate a huge file. When --adjust
is provided, this threshold applies to the adjusted p-values.
DEFAULT=1e-5.
- --bins integer
- Number of bins to use to categorize all p-values above
--threshold. DEFAULT=1000.
- --nominal
- Calculate the nominal p-value for the genotype-phenotype
associations and print out the ones that pass the provided threshold.
Mutually exclusive with --permute, --sample and --adjust.
- --permute
- Permute all phenotypes together, once. For multiple
permutations you need to change the random seed using --seed for
each permutation. Mutually exclusive with --nominal, --sample and
--adjust.
- --sample integer
- Permute randomly chosen phenotypes integer times.
Mutually exclusive with --nominal, --permute, --adjust, and --chunk.
- --adjust filename
- Test and adjust p-values using the null distribution in
filename. Mutually exclusive with --nominal, --permute, and
--sample.
- --chunk integer1 integer2
- For parallelization. Divide the data into integer2 chunks and
process chunk number integer1. The number of chunks must be at least
the number of chromosomes in the --bed file.
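As an illustration of how --chunk might be used, the sketch below builds one nominal-pass command line per chunk. The helper chunk_commands, the per-chunk output naming, and the assumption that chunk numbers run from 1 to integer2 are not taken from the QTLtools documentation; only the flags themselves are.

```python
import shlex


def chunk_commands(vcf, bed, out_prefix, n_chunks, n_chromosomes):
    """Build one 'QTLtools trans --nominal' command line per chunk.

    Assumption: chunks are numbered 1..n_chunks. n_chromosomes is the
    number of chromosomes present in the --bed file.
    """
    if n_chunks < n_chromosomes:
        raise ValueError("need at least as many chunks as chromosomes")
    commands = []
    for chunk in range(1, n_chunks + 1):
        args = [
            "QTLtools", "trans",
            "--vcf", vcf,
            "--bed", bed,
            "--nominal", "--normal",
            "--chunk", str(chunk), str(n_chunks),
            # Hypothetical naming scheme: one output prefix per chunk.
            "--out", f"{out_prefix}.chunk{chunk}",
        ]
        commands.append(shlex.join(args))
    return commands
```

Each returned string can then be submitted to a job scheduler. Note that --chunk cannot be combined with --sample.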
- .hits.txt.gz
- Space separated results output file detailing the
variant-phenotype pairs that pass the threshold with the following
columns:
1 | The phenotype ID
2 | The phenotype chromosome
3 | Start position of the phenotype
4 | The variant ID
5 | The variant chromosome
6 | The start position of the variant
7 | The nominal p-value of the association between the variant and the phenotype
8 | The adjusted p-value of the association between the variant and the phenotype. Requires --adjust
9 | Correlation coefficient
- .best.txt.gz
- Space separated output file listing the most significant
variant per phenotype.
1 | The phenotype ID
2 | The adjusted p-value of the association between the variant and the phenotype. Requires --adjust
3 | The nominal p-value of the association between the variant and the phenotype
4 | The variant ID
- .bins.txt.gz
- Space separated output file containing the binning of all
hits with a p-value above the specified --threshold (see --bins).
1 | The index of the bin
2 | The lower bound of the correlation coefficient for this bin
3 | The upper bound of the correlation coefficient for this bin
4 | The upper bound of the p-value for this bin
5 | The lower bound of the p-value for this bin
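A minimal sketch of reading the .hits.txt.gz file using the nine columns listed above. The key names, the skipping of any line whose third field is not an integer (for example a header, if one is present), and the handling of a non-numeric adjusted p-value when --adjust was not used are assumptions of this example, not documented QTLtools behaviour.

```python
import gzip


def _float_or_none(text):
    """Convert a numeric field; a non-numeric placeholder becomes None."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_hits(lines):
    """Parse space separated hit lines into dictionaries (column order as documented)."""
    hits = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and anything that does not look like a data row.
        if len(fields) != 9 or not fields[2].isdigit():
            continue
        hits.append({
            "phenotype_id": fields[0],
            "phenotype_chr": fields[1],
            "phenotype_start": int(fields[2]),
            "variant_id": fields[3],
            "variant_chr": fields[4],
            "variant_pos": int(fields[5]),
            "nominal_pvalue": _float_or_none(fields[6]),
            "adjusted_pvalue": _float_or_none(fields[7]),
            "correlation": _float_or_none(fields[8]),
        })
    return hits


def read_hits(path):
    """Read a gzip-compressed hits file, e.g. the one written for --out trans.nominal."""
    with gzip.open(path, "rt") as handle:
        return parse_hits(handle)
```

For instance, read_hits("trans.nominal.hits.txt.gz") would return one dictionary per reported variant-phenotype pair, assuming the output prefix from the nominal example was used.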
- 1
- Run a nominal analysis, rank normal transforming the
phenotypes and outputting all associations with a p-value below 1e-5:
-
- QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --nominal --normal --out trans.nominal
- 2
- To run a full permutation analysis as 100 jobs on a compute
cluster, run the following, making sure that you change the seed for
each permutation iteration (replace qsub with the job submission
command of your system [bsub, psub, etc.]):
-
- for j in $(seq 1 100); do
  echo "QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --permute --normal --out trans.perm$j.txt --seed $j" | qsub
done
- 1
- Build the null distribution randomly selecting 1000
phenotypes, and rank normal transforming the phenotypes:
-
- QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --sample 1000 --normal --out trans.sample
- 2
- Run the nominal pass adjusting the p-values with the given
null distribution, rank normal transforming the phenotypes, and printing
out associations with an adjusted p-value less than 0.1:
-
- QTLtools trans --vcf genotypes.chr22.vcf.gz --bed genes.simulated.chr22.bed.gz --adjust trans.sample.best.txt.gz --threshold 0.1 --normal --out trans.adjust
QTLtools(1)
QTLtools website: <https://qtltools.github.io/qtltools>
- Versions up to and including 1.2 suffer from a bug in
reading missing genotypes in VCF/BCF files. This bug affects variants that
have a DS field in their genotype's FORMAT and a missing genotype (DS
field is .) in one of the samples, in which case genotypes for all the
samples are set to missing, effectively removing this variant from the
analyses.
Please submit bugs to <https://github.com/qtltools/qtltools>
Delaneau, O., Ongen, H., Brown, A. et al. A complete tool set for molecular QTL
discovery and analysis. Nat Commun 8, 15452 (2017).
<https://doi.org/10.1038/ncomms15452>
Halit Ongen ([email protected]), Olivier Delaneau ([email protected])