NAME

QTLtools rtc - Regulatory Trait Concordance score analysis

SYNOPSIS

QTLtools rtc --vcf [in.vcf|in.vcf.gz|in.bcf|in.bed.gz] --bed quantifications.bed.gz --hotspots hotspots_b37_hg19.bed [--gwas-cis | --gwas-trans | --mergeQTL-cis | --mergeQTL-trans] variants_external.txt qtls_in_this_dataset.txt --out output.txt [OPTIONS]

DESCRIPTION

The RTC algorithm assesses the likelihood of a shared functional effect between a GWAS variant and an molQTL by quantifying the change in the statistical significance of the molQTL after correcting the molQTL phenotype for the genetic effect of the GWAS variant and comparing its correction impact to that of all other SNPs in the interval. The method is detailed in <https://www.nature.com/articles/ng.3981>. When assessing tissue specificity of molQTLs we use the same method, however in that case the GWAS variant becomes an molQTL in a different tissue. The RTC method is as follows: for a GWAS variant falling into the same region flanked by recombination hotspots (coldspot) with an molQTL, with N number of variants in a given coldspot:
1
Correct the phenotype for each of the variants in the region separately by linear regression, resulting in N number of pseudo-phenotypes (residuals).
2
Redo the molQTL variant association with all of these pseudo-phenotypes.
3
Sort (decreasing) the resulting p-values and find the rank of the molQTL to GWAS-pseudo-phenotype among all molQTL to pseudo-phenotype associations.
4
RTC = (N - Rank of GWAS) / N
This results in the RTC score which ranges from 0 to 1 where higher values indicate a more likely shared functional effect between the GWAS and the molQTL variants. An RTC score greater than or equal to 0.9 is considered a shared functional effect. If there are multiple independent molQTLs for a given phenotype, RTC for each independent molQTL is assessed after correcting the phenotype with all the other molQTL variants for that phenotype. This correction is done using linear regression and taking the residuals after regressing the phenotype with the other molQTLs.
In order to convert RTC score into a probability of sharing, we employ two simulations per coldspot region, H0 and H1. The H0 scenario is when two variants in a coldspot are tagging different functional effects. For a coldspot that harbours colocalized GWAS and molQTL variants, we pick two random hidden causal variants. We then find two variants (GWAS and molQTL) that are linked (default r-squared ≥ 0.5) to the hidden causal variants. We generate a pseudo phenotype for molQTL based on the slope and intercept of the observed molQTL and randomly distributed residuals of the observed molQTL. Subsequently we rerun the RTC analysis with this new pseudo-phenotype and using the GWAS and molQTL variants.
The H1 scenario is when the two variants are tagging the same functional variant. The scheme here is exactly the same as the H0 scheme, except there is only one hidden causal variant and both GWAS and molQTL variants are randomly selected from variants that are linked to the same hidden causal variant.

OPTIONS

--vcf [in.vcf|in.bcf|in.vcf.gz| in.bed.gz]
Genotypes in VCF/BCF format, or another molecular phenotype in BED format. If there is a DS field in the genotype FORMAT of a variant (dosage of the genotype calculated from genotype probabilities, e.g. after imputation), then this is used as the genotype. If there is only the GT field in the genotype FORMAT then this is used and it is converted to a dosage. REQUIRED.
--bed quantifications.bed.gz
Molecular phenotype quantifications in BED format. REQUIRED.
--out output.txt
Output file. REQUIRED.
--hotspots recombination_hotspots.bed
Recombination hotspots in BED format. REQUIRED.
--cov covariates.txt
Covariates to correct the phenotype data with.
--stats-vcf [in.vcf|in.bcf|in.vcf.gz ]
Calculate D' and r-squared from this file. Defaults to the --vcf file. Needs to have phased genotypes for D' calculations.
--stats-vcf-include-samples samples.txt
Samples to include from the --stats-vcf file. One sample ID per line.
--stats-vcf-exclude-samples samples.txt
Samples to exclude from the --stats-vcf file. One sample ID per line.
--normal
Rank normal transform the phenotype data so that each phenotype is normally distributed. RECOMMENDED.
--conditional
molQTLs contain independent signals so execute the conditional analysis.
--debug
Print out debugging info to stderr. DON'T USE.
--warnings
Print all encountered individual warnings to stdout.
--header
Add a header to the output file when --chunk or --region is provided.
--individual-Dprime
Calculate D' on an individual variant basis. If not provided D' will not be calculated after first unphased genotype is encountered.
--mem-est
Estimate memory usage and exit.
--mem [0|1|2|3]
Keep results of calculations that may be used multiple times in memory. 0 = nothing in mem, 1 = only basic, 2 = all in mem but clean after unlikely to be reused, 3 = all in mem no cleaning. DEFAULT=0. RECOMMENDED=2.
--window integer
Size of the cis window flanking each phenotype's start position. DEFAULT=1000000. RECOMMENDED=1000000.
--sample integer
Number of simulated RTC values to try to achieve for each coldspot, for converting RTC to a probability. At each iteration we try to pick a unique combination of variants, thus the actual number of sample iterations may be less than this value, due to the number variants in a region. If you want to run this analysis, please provide at least 100. DEFAULT=0.
--max-sample integer
Max number of sample iterations trying to reach --sample before quitting. Provide the actual number not the multiplier. DEFAULT=--sample * 50.
--R2-threshold float
The minimum r-squared required when picking a variant that is linked to the hidden causal variant(s) when running simulations using --sample. DEFAULT=0.5.
--D-prime-threshold float
If the pairs of variants fall into different coldspots and have a D' greater than this, the RTC calculation is extended to multiple coldspot regions including both variants. Assumes D' can be calculated. DEFAULT=OFF. NOT RECOMMENDED.
--grp-best
Correct for multiple phenotypes within a phenotype group.
--pheno-col integer
1-based phenotype id column number. DEFAULT=1 or 5 when --grp-best
--geno-col integer
1-based genotype id column number. DEFAULT=8 or 10 when --grp-best
--grp-col integer
1-based phenotype group id column number. Only relevant if --grp-best is in effect. DEFAULT=1
--rank-col integer
1-based conditional analysis rank column number. Only relevant if --conditional is in effect. DEFAULT=12 or 14 when --grp-best
--best-col integer
1-based phenotype column number Only relevant if --conditional is in effect. DEFAULT=21 or 23 when --grp-best
--gwas-cis variants_external.txt qtls_in_this_dataset.txt
Run RTC for GWAS and cis-molQTL colocalization analysis. Takes two file names as arguments. The first is the file with GWAS variants of interest with one variant ID per line. These should match the variants IDs in the --vcf file. The second is the QTLtools output for the cis run that was ran using the same --vcf, --bed, and --cov files. REQUIRED unless (and mutually exclusive with) --gwas-trans, --mergeQTL-cis, --mergeQTL-trans.
--gwas-trans variants_external.txt qtls_in_this_dataset.txt
Run RTC for GWAS and trans-molQTL colocalization analysis. Takes two file names as arguments. The first is the file with GWAS variants of interest with one variant ID per line. These should match the variants IDs in the --vcf file. The second is the QTLtools output for the trans run that was ran using the same --vcf, --bed, and --cov files. You will need to adjust *-col options. REQUIRED unless (and mutually exclusive with) --gwas-cis, --mergeQTL-cis, --mergeQTL-trans.
--mergeQTL-cis variants_external.txt qtls_in_this_dataset.txt
Run RTC for cis-molQTL and cis-molQTL colocalization analysis. Takes two file names as arguments. The first is the file with cis-molQTL variants of interest discovered in a different dataset, e.g. different tissue, with one variant ID per line. These should match the variants IDs in the --vcf file. The second is the QTLtools output for the cis run that was ran using the same --vcf, --bed, and --cov files. REQUIRED unless (and mutually exclusive with) --gwas-trans, --mergeQTL-cis, --mergeQTL-trans.
--mergeQTL-trans variants_external.txt qtls_in_this_dataset.txt
Run RTC for trans-molQTL and trans-molQTL colocalization analysis. Takes two file names as arguments. The first is the file with trans-molQTL variants of interest discovered in a different dataset, e.g. different tissue, with one variant ID per line. These should match the variants IDs in the --vcf file. The second is the QTLtools output for the trans run that was ran using the same --vcf, --bed, and --cov files. You will need to adjust *-col options. REQUIRED unless (and mutually exclusive with) --gwas-trans, --gwas-cis, --mergeQTL-cis.
--chunk integer1 integer2
For parallelization. Divide the data into integer2 number of chunks and process chunk number integer1. Chunk 0 will print a header. Mutually exclusive with --region. Minimum number of chunks has to be at least the same number of chromosomes in the --bed file.
--region chr:start-end
Genomic region to be processed. E.g. chr4:12334456-16334456, or chr5. Mutually exclusive with --chunk.

OUTPUT FILE

--out output file
Space separated output file with the following columns. Columns after the 22nd are only printed if --sample is provided. We recommend including chunk 0 to print out a header in order to avoid confusion.
1 other_variant The variant ID that is external to this dataset, could be the GWAS variant or another molQTL
2 our_variant The molQTL variant ID that is internal to this dataset
3 phenotype The phenotype ID
4 phenotype_group The phenotype group ID
5 other_variant_chr The external variant's chromosome
6 other_variant_start The external variant's start position
7 other_variant_rank Rank of the external variant. Only relevant if the external variants are part of an conditional analysis
8 our_variant_chr The internal variant's chromosome
9 our_variant_start The internal variant's start position
10 our_variant_rank Rank of the internal variant. Only relevant if the internal variants are part of an conditional analysis
11 phenotype_chr The phenotype's chromosome
12 phenotype_start The start position of the phenotype
13 distance_between_variants The distance between the two variants
14 distance_between_other_variant_and_pheno The distance between the external variant and the phenotype
15 other_variant_region_index The region index of the external variant
16 our_variant_region_index The region index of the internal variant
17 region_start The start position of the region
18 region_end The end position of the region
19 variant_count_in_region The number of variants in the region
20 RTC The RTC score
21 D' The D' of the two variants. Only calculated if there are phased genotypes
22 r^2 The r squared of the two variants
22 p_value The p-value of the RTC score
23 unique_picks_H0 The number of unique combinations of variants in the H0 simulations
24 unique_picks_H1 The number of unique combinations of variants in the H1 simulations
25 rtc_bin_start Lower bound of the RTC bin, based on the observed RTC score
26 rtc_bin_end Upper bound of the RTC bin, based on the observed RTC score
27 rtc_bin_H0_proportion The proportion of H0 simulated values that are between rtc_bin_start and rtc_bin_end
28 rtc_bin_H1_proportion The proportion of H1 simulated values that are between rtc_bin_start and rtc_bin_end
29 median_r^2 The median r-squared in the region
30 median_H0 The median RTC score in the H0 simulations
31 median_H1 The median RTC score in the H1 simulations
32 H0 The RTC scores observed in the H0 simulations
33 H1 The RTC scores observed in the H1 simulations

EXAMPLES

o
Run RTC with GWAS variants and cis-eQTLs correcting for technical covariates and rank normal transforming the phenotype:
QTLtools rtc --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --hotspot hotspots_b37_hg19.bed --gwas-cis GWAS.b37.txt permutations_all.significant.txt --normal --out rtc_results.txt
o
RTC with GWAS variants and cis-eQTLs and simulations, correcting for technical covariates, rank normal transforming the phenotype, and running conditional analysis while keeping data in memory. To facilitate parallelization on compute cluster, we developed an option to run the analysis into chunks of molecular phenotypes. For instance, to run analysis on chunk 12 when splitting the example data set into 20 chunks, run:
QTLtools rtc --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --hotspot hotspots_b37_hg19.bed --gwas-cis GWAS.b37.txt conditional_all.significant.txt --normal --conditional --mem 2 --chunk 12 20 --sample 200 --out rtc_results_12_20.txt
o
If you want to submit the whole analysis with 20 jobs on a compute cluster, just run (qsub needs to be changed to the job submission system used [bsub, psub, etc...]):
for j in $(seq 0 20); do
echo "QTLtools rtc --vcf genotypes.chr22.vcf.gz --bed genes.50percent.chr22.bed.gz --cov genes.covariates.pc50.txt.gz --hotspot hotspots_b37_hg19.bed --gwas-cis GWAS.b37.txt conditional_all.significant.txt --normal --conditional --mem 2 --chunk $j 20 --sample 200 --out rtc_results_$j_20.txt" | qsub
done

SEE ALSO

QTLtools(1)
QTLtools website: <https://qtltools.github.io/qtltools>

BUGS

Please submit bugs to <https://github.com/qtltools/qtltools>

CITATION

Ongen H, Brown AA, Delaneau O, et al. Estimating the causal tissues for complex traits and diseases. Nat Genet. 2017;49(12):1676-1683. doi:10.1038/ng.3981 <https://doi.org/10.1038/ng.3981>

AUTHORS

Halit Ongen ([email protected]), Olivier Delaneau ([email protected])

Recommended readings

Pages related to QTLtools-rtc you should read also: