QTLtools cis - cis QTL analysis
QTLtools cis --vcf
[in.vcf|
in.vcf.gz|
in.bcf|
in.bed.gz] --bed
quantifications.bed.gz [
--nominal float |
--permute integer | --mapping in.txt]
--out
output.txt [OPTIONS]
This mode maps cis (proximal) quantitative trait loci (QTLs) that affect the
phenotype, using linear regression. The method is detailed in
<
https://www.nature.com/articles/ncomms15452>. We first regress out the
provided covariates from the phenotype data, followed by running the linear
regression between the phenotype residuals and the genotype. If
--normal and
--cov are provided at the same time, then the
residuals after the covariate correction are rank normal transformed. It
incorporates an efficient permutation scheme to control for differential
multiple testing burden of each phenotype. You can run a nominal pass (
--nominal threshold) listing all genotype-phenotype associations
below a certain threshold, a permutation pass (
--permute
no_of_permutations) to empirically characterize the null distribution
of associations for each phenotype separately, thus adjusting the nominal
p-value of the best association for a phenotype, or a conditional analysis
pass (
--mapping filename) to discover multiple proximal QTLs
with independent effects on a phenotype.
As multiple molecular phenotypes can belong to higher order biological entities,
e.g. exons of genes, QTLtools cis allows grouping of phenotypes to maximize
the discoveries in such particular cases. Specifically, QTLtools can either
aggregate multiple phenotypes in a given group into a single phenotype via PCA
(
--grp-pca1) or by taking their mean (
--grp-mean), or directly
use all individual phenotypes in an extended permutation scheme that accounts
for their number and correlation structure (
--grp-best). In our
experience,
--grp-best outperforms the other options for expression
QTLs (eQTLs).
The conditional analysis pass first uses permutations to derive a nominal
p-value threshold per phenotype that varies and reflects the number of
independent tests per cis-window. Then, it uses a forward-backward stepwise
regression to learn the number of independent signals per phenotype, determine
the best candidate variant per signal and assign all significant hits to the
independent signal they relate to.
Since linear regressions assumes normally distributed data, we
highly
recommend using the
--normal option to rank normal transform the
phenotype quantifications in order to avoid false positive associations due to
outliers.
- --vcf
[in.vcf|in.bcf|in.vcf.gz|
in.bed.gz]
- Genotypes in VCF/BCF format, or another molecular phenotype
in BED format. If there is a DS field in the genotype FORMAT of a variant
(dosage of the genotype calculated from genotype probabilities, e.g. after
imputation), then this is used as the genotype. If there is only the GT
field in the genotype FORMAT then this is used and it is converted to a
dosage. REQUIRED.
- --bed quantifications.bed.gz
- Molecular phenotype quantifications in BED format.
REQUIRED.
- --out output.txt
- Output file. REQUIRED.
- --cov covariates.txt
- Covariates to correct the phenotype data with.
- --normal
- Rank normal transform the phenotype data so that each
phenotype is normally distributed. RECOMMENDED.
- --std-err
- Calculate and output the standard error of the beta
(slope).
- --window integer
- Size of the cis window flanking each phenotype's start
position. DEFAULT=1000000. RECOMMENDED=1000000.
- --nominal float
- Calculate the nominal p-value for the genotype-phenotype
associations and print out the ones that pass the provided threshold. Give
1.0 to print everything. Mutually exclusive with --permute and
--mapping.
- --permute integer
- Adjust the best nominal p-value for this phenotype
accounting for the number of variants and the linkage disequilibrium in
its cis-window. We recommend at least 1000 permutation for the final
analysis, and in most cases you will see diminishing returns when going
over 5000. However, if you are doing exploratory analyses like which/how
many covariates to include, you can go as low as 100. Mutually exclusive
with --nominal and --mapping. RECOMMENDED=1000.
- --mapping thresholds_filename
- The conditional analysis. First you need to run a
permutation analysis, then generate a thresholds file using the
runFDR_cis.R script in the script directory. Mutually exclusive with
--nominal and --permute.
- --grp-best
- Correct for multiple phenotypes within a phenotype group.
Mutually exclusive with --grp-pca1 and --grp-mean.
- --grp-pca1
- Run PCA on phenotypes within a phenotype group and use PC1
for association testing. Mutually exclusive with --grp-best and
--grp-mean.
- --grp-mean
- Average phenotypes within a group and use the results for
association testing. Mutually exclusive with --grp-best and
--grp-pca1.
- --chunk integer1 integer2
- For parallelization. Divide the data into integer2
number of chunks and process chunk number integer1. Chunk 0 will
print a header. Mutually exclusive with --region and --region-pair.
Minimum number of chunks has to be at least the same number of
chromosomes in the --bed file.
- --region chr:start-end
- Genomic region to be processed. E.g.
chr4:12334456-16334456, or chr5 Mutually exclusive with --chunk and
--region-pair.
- --region-pair chr:start-end
chr:start-end
- Genomic region for genotypes followed by region for
phenotypes to be processed. Mutually exclusive with --chunk and --region.
- --nominal output file
- Space separated output file with the following columns
(certain columns are only printed based on options). We recommend
including chunk 0 to print out a header in order to avoid confusion.
1 |
phe_id | grp_id |
The phenotype ID or if one of the grouping options is provided, then
phenotype group ID |
2 |
phe_chr |
The phenotype chromosome |
3 |
phe_from |
Start position of the phenotype |
4 |
phe_to |
End position of the phenotype |
5 |
phe_strd |
The phenotype strand |
5.1 |
phe_id | ve_by_pc1 | n_phe_in_grp |
Only printed if --group-best | --group-pca1 |
--group-mean. The phenotype ID, variance explained by PC1, or
number of phenotypes in the phenotype group for --group-best,
--group-pca1, and --group-mean, respectively. |
5.2 |
n_phe_in_grp |
Only printed if --group-pca1 | --group-mean. The
number of phenotypes in the phenotype group. |
6 |
n_var_in_cis |
The number variants in the cis window for this phenotype. |
7 |
dist_phe_var |
The distance between the variant and the phenotype start
positions. |
8 |
var_id |
The variant ID. |
9 |
var_chr |
The variant chromosome. |
10 |
var_from |
The start position of the variant. |
11 |
var_to |
The end position of the variant. |
12 |
nom_pval |
The nominal p-value of the association between the variant and the
phenotype. |
13 |
r_squared |
The r squared of the linear regression. |
14 |
slope |
The beta (slope) of the linear regression. |
14.1 |
slope_se |
The standard error of the beta. Only printed if --std-err is
provided. |
15 |
best_hit |
Whether this varint was the best hit for this phenotype. |
- --permute output file
- Space separated output file with the following columns
(certain columns are only printed based on options). We recommend
including chunk 0 to print out a header in order to avoid confusion.
1 |
phe_id | grp_id |
The phenotype ID or if one of the grouping options is provided, then
phenotype group ID |
2 |
phe_chr |
The phenotype chromosome |
3 |
phe_from |
Start position of the phenotype |
4 |
phe_to |
End position of the phenotype |
5 |
phe_strd |
The phenotype strand |
5.1 |
phe_id | ve_by_pc1 | n_phe_in_grp |
Only printed if --group-best | --group-pca1 |
--group-mean. The phenotype ID, variance explained by PC1, or
number of phenotypes in the phenotype group for --group-best,
--group-pca1, and --group-mean, respectively. |
5.2 |
n_phe_in_grp |
Only printed if --group-pca1 | --group-mean. The
number of phenotypes in the phenotype group. |
6 |
n_var_in_cis |
The number variants in the cis window for this phenotype. |
7 |
dist_phe_var |
The distance between the variant and the phenotype start
positions. |
8 |
var_id |
The most significant variant ID. |
9 |
var_chr |
The most significant variant's chromosome. |
10 |
var_from |
The start position of the most significant variant. |
11 |
var_to |
The end position of the most significant variant. |
12 |
dof1 |
The number of degrees of freedom used to compute the p-values. |
13 |
dof2 |
Estimated number of degrees of freedom used in beta approximation
p-value calculations. |
14 |
bml1 |
The first shape parameter of the fitted beta distribution (alpha
parameter). These should be close to 1. |
15 |
bml2 |
The second shape parameter of the fitted beta distribution (beta
parameter). This corresponds to the effective number of independent
tests in the region. |
16 |
nom_pval |
The nominal p-value of the association between the most significant
variant and the phenotype. |
17 |
r_squared |
The r squared of the linear regression. |
18 |
slope |
The beta (slope) of the linear regression. |
18.1 |
slope_se |
The standard error of the beta. Only printed if --std-err is
provided. |
19 |
adj_emp_pval |
Adjusted empirical p-value from permutations. This is the adjusted
p-value not using the beta approximation. Simply calculated as:
(number of p-values observed during permutations that were smaller
than or equal to the nominal p-value + 1) / (number of permutations +
1). The most significant p-value achievable would be 1 / (number of
permutations + 1). |
20 |
adj_beta_pval |
Adjusted empirical p-value given by the fitted beta distribution.
We strongly recommend using this adjusted p-value in any downstream
analysis. |
- --mapping output file
- Space separated output file with the following columns
(certain columns are only printed based on options). We recommend
including chunk 0 to print out a header in order to avoid confusion.
1 |
phe_id | grp_id |
The phenotype ID or if one of the grouping options is provided, then
phenotype group ID |
2 |
phe_chr |
The phenotype chromosome |
3 |
phe_from |
Start position of the phenotype |
4 |
phe_to |
End position of the phenotype |
5 |
phe_strd |
The phenotype strand |
5.1 |
phe_id | ve_by_pc1 | n_phe_in_grp |
Only printed if --group-best | --group-pca1 |
--group-mean. The phenotype ID, variance explained by PC1, or
number of phenotypes in the phenotype group for --group-best,
--group-pca1, and --group-mean, respectively. |
5.2 |
n_phe_in_grp |
Only printed if --group-pca1 | --group-mean. The
number of phenotypes in the phenotype group. |
6 |
n_var_in_cis |
The number variants in the cis window for this phenotype. |
7 |
dist_phe_var |
The distance between the variant and the phenotype start
positions. |
8 |
var_id |
The most significant variant ID. |
9 |
var_chr |
The most significant variant's chromosome. |
10 |
var_from |
The start position of the most significant variant. |
11 |
var_to |
The end position of the most significant variant. |
12 |
rank |
The rank of the association. This tells you if the variant has been
mapped as belonging to the best signal (rank=0), the second best
(rank=1), etc ... As a consequence, the maximum rank value for a given
phenotype tells you how many independent signals there are (e.g.
rank=2 means 3 independent signals). |
13 |
fwd_pval |
The nominal forward p-value of the association between the most
significant variant and the phenotype. |
14 |
fwd_r_squared |
The r squared of the forward linear regression. |
15 |
fwd_slope |
The beta (slope) of the forward linear regression. |
15.1 |
fwd_slope_se |
The standard error of the forward beta. Only printed if
--std-err is provided. |
16 |
fwd_best_hit |
Whether or not this variant was the forward most significant
variant. |
17 |
fwd_sig |
Whether this variant was significant. Currently all variants are
significant so this is redundant. |
18 |
bwd_pval |
The nominal backward p-value of the association between the most
significant variant and the phenotype. |
19 |
bwd_r_squared |
The r squared of the backward linear regression. |
20 |
bwd_slope |
The beta (slope) of the backward linear regression. |
20.1 |
bwd_slope_se |
The standard error of the backward beta. Only printed if
--std-err is provided. |
21 |
bwd_best_hit |
Whether or not this variant was the backward most significant
variant. |
22 |
bwd_sig |
Whether this variant was significant. Currently all variants are
significant so this is redundant. |
- o
- Nominal pass, correcting for technical covariates, rank
normal transforming the phenotype, and printing out associations with a
p-value <= 0.01 on chromosome 22 between 17000000 and 18000000 bp, and
excluding some samples (see QTLtools (1)):
-
- QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal
0.01 --region chr22:17000000-18000000 --normal --out nominals.txt
--exclude-samples sample_names_to_exclude.txt
- o
- Nominal pass with parallelization correcting for technical
covariates, rank normal transforming the phenotype, and printing out
associations with a p-value <= 0.01. To facilitate parallelization on
compute cluster, we developed an option to run the analysis into chunks of
molecular phenotypes. For instance, to run analysis on chunk 12 when
splitting the example data set into 20 chunks, run:
-
- QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal
0.01 --chunk 12 20 --normal --out nominals_12_20.txt
- o
- If you want to submit the whole analysis with 20 jobs on a
compute cluster, just run (qsub needs to be changed to the job submission
system used [bsub, psub, etc...]):
-
- for j in $(seq 0 20); do
echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --nominal
0.01 --chunk $j 20 --normal --out nominals_$j\_20.txt" | qsub
done
- o
- Permutation pass, correcting for technical covariates, rank
normal transforming the phenotype, and running 1000 permutations with a
specific random seed on chromosome 22 between 17000000 and 18000000
bp:
-
- QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute
1000 --region chr22:17000000-18000000 --normal --seed 1354145 --out
permutation.txt
- o
- Permutation pass with parallelization correcting for
technical covariates, rank normal transforming the phenotype, and running
5000 permutations. To facilitate parallelization on compute cluster, we
developed an option to run the analysis into chunks of molecular
phenotypes. For instance, to run analysis on chunk 12 when splitting the
example data set into 20 chunks, run:
-
- QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute
5000 --chunk 12 20 --normal --out permutations_12_20.txt
- o
- If you want to submit the whole analysis with 20 jobs on a
compute cluster, just run (qsub needs to be changed to the job submission
system used [bsub, psub, etc...]):
-
- for j in $(seq 0 20); do
echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute
5000 --chunk $j 20 --normal --out permutations_$j\_20.txt" | qsub
done
- o
- When your phenotypes in the BED file are grouped, you can
perform a permutation pass at the phenotype group level in order to
discover group-level QTLs:
-
- QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
exons.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --permute
1000 --normal --grp-best --out permutation.group.txt
Conditional analysis to discover independent signals.
- 1
- First we need to run a permutation analysis (see previous
section), then calculate nominal p-value threshold for each gene. Here an
FDR of 5% is given as an example:
-
- cat permutations_*.txt | gzip -c >
permutations_all.txt.gz
Rscript ./script/qtltools_runFDR_cis.R permutations_all.txt.gz 0.05
permutations_all
- 2
- Now you can proceed with the actual conditional analysis.
Here splitting into 20 chunks, and when all complete concatenate the
results:
-
- for j in $(seq 0 20); do
echo "QTLtools cis --vcf genotypes.chr22.vcf.gz --bed
genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --mapping
permutations_all.thresholds.txt --chunk $j 20 --normal --out
conditional_$j\_20.txt" | qsub
done
cat conditional_*.txt > conditional_all.txt
- 3
- If you are interested in the most significant variants per
independent signal, you can filter the results, using the backward
p-value:
-
- awk '{ if ($21 == 1) print $0 }' conditional_all.txt >
conditional_top_variants.txt
QTLtools(1)
QTLtools website: <
https://qtltools.github.io/qtltools>
- o
- Versions up to and including 1.2, suffer from a bug in
reading missing genotypes in VCF/BCF files. This bug affects variants with
a DS field in their genotype's FORMAT and have a missing genotype (DS
field is .) in one of the samples, in which case genotypes for all the
samples are set to missing, effectively removing this variant from the
analyses.
Please submit bugs to <
https://github.com/qtltools/qtltools>
Delaneau, O., Ongen, H., Brown, A. et al. A complete tool set for molecular QTL
discovery and analysis.
Nat Commun 8, 15452 (2017).
<
https://doi.org/10.1038/ncomms15452>
Olivier Delaneau (
[email protected]), Halit Ongen
(
[email protected])