samtools-consensus - produce a consensus FASTA/FASTQ/PILEUP
samtools consensus [-saAMq] [-r region] [-f format] [-l line-len]
[-d min-depth] [-C cutoff] [-c call-fract] [-H het-fract] in.bam
Generate consensus from a SAM, BAM or CRAM file based on the contents of the
alignment records. The consensus is written as FASTA, FASTQ, or a
pileup-oriented format, selected using the -f FORMAT option.
The default output for the FASTA and FASTQ formats includes one base per
non-gap consensus position. Hence insertions with respect to the aligned
reference will be included and deletions removed. This behaviour can be
controlled with the --show-ins and --show-del options. This could be used to
compute a new reference from sequence assemblies to realign against.
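As a rough illustration of how these two options shape the FASTA/FASTQ output, the sketch below renders a list of consensus columns. The column representation and the render_consensus helper are hypothetical, invented for this example; they are not part of samtools.

```python
def render_consensus(columns, show_ins=True, show_del=False):
    """Render consensus columns to a sequence string.

    columns is a list of (kind, base) tuples (a made-up representation):
      "ref" - a column aligned to a reference position
      "ins" - an inserted column, absent from the reference
      "del" - a reference position whose consensus is a deletion
    """
    out = []
    for kind, base in columns:
        if kind == "ins":
            if show_ins:          # insertions are shown by default
                out.append(base)
        elif kind == "del":
            if show_del:          # deletions are omitted by default
                out.append("*")
        else:
            out.append(base)
    return "".join(out)


cols = [("ref", "A"), ("ins", "T"), ("ref", "C"), ("del", None), ("ref", "G")]
print(render_consensus(cols))                                 # ATCG
print(render_consensus(cols, show_ins=False, show_del=True))  # AC*G
```

With insertions hidden and deletions shown, the output has exactly one character per reference position in this toy model.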
The pileup-style format strictly adheres to one row per consensus location,
differing from the one row per reference base used in the related
"samtools mpileup" command. This means the base quality values for
inserted columns are reported. The base quality value of a gap (either within
an insertion or otherwise) is determined as the average of the surrounding
non-gap bases. The columns shown are the reference name, position, nth base at
that position (zero if not an insertion), consensus call, consensus
confidence, sequences and quality values.
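The gap-quality rule can be sketched as follows for a single read. This is a minimal sketch under stated assumptions: it takes "surrounding" to mean the nearest non-gap base on each side and uses integer averaging; the exact rule used by samtools may differ. The fill_gap_quals name is hypothetical.

```python
def fill_gap_quals(quals):
    """Return a copy of quals with gaps (None) replaced by an average.

    Assumption: a gap takes the mean of the nearest non-gap quality on
    each side, or the single available neighbour at the end of a read.
    """
    filled = list(quals)
    for i, q in enumerate(quals):
        if q is not None:
            continue
        left = next((quals[j] for j in range(i - 1, -1, -1)
                     if quals[j] is not None), None)
        right = next((quals[j] for j in range(i + 1, len(quals))
                      if quals[j] is not None), None)
        neighbours = [x for x in (left, right) if x is not None]
        # A read consisting only of gaps has no neighbours; use 0 here.
        filled[i] = sum(neighbours) // len(neighbours) if neighbours else 0
    return filled


print(fill_gap_quals([30, None, 20]))  # [30, 25, 20]
```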
Two consensus calling algorithms are offered. The default computes a
heterozygous consensus in a Bayesian manner, derived from the "Gap5"
consensus algorithm. Quality values are also adjusted to take into account
other nearby low quality values. This adjustment can be disabled using the
--no-adj-qual option.
This method also utilises the mapping qualities, unless the --no-use-MQ
option is used. Mapping qualities are also auto-scaled to take into account
the local reference variation by processing the MD:Z tag, unless
--no-adj-MQ is used. Mapping qualities can be capped between a minimum
(--low-MQ) and maximum (--high-MQ), although the defaults are
liberal and trust the data to be true. Finally an overall scale on the
resulting mapping quality can be supplied (--scale-MQ, defaulting to
1.0). Values greater than 1.0 favour making more calls at the cost of a
higher false positive rate; values less than 1.0 are more cautious, giving a
higher false negative rate and a lower false positive rate.
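The capping and scaling steps can be pictured with the small sketch below. It assumes the cap is applied before the scale, following the order in which the text describes them, and it leaves out the MD:Z based adjustment entirely. The adjust_mapq function is hypothetical; the defaults mirror the documented ones (0, 60 and 1.0).

```python
def adjust_mapq(mapq, low_mq=0, high_mq=60, scale_mq=1.0):
    """Clamp a mapping quality to [low_mq, high_mq], then scale it.

    The bounds are caps, not filters: a read below low_mq is kept and
    treated as having quality low_mq.
    """
    capped = max(low_mq, min(high_mq, mapq))
    return capped * scale_mq


print(adjust_mapq(70))                # 60.0
print(adjust_mapq(10, low_mq=20))     # 20.0
print(adjust_mapq(30, scale_mq=0.5))  # 15.0
```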
The second method is a simple frequency counting algorithm, summing either +1
for each base type or +qual if the --use-qual option is specified. This is
enabled with the --mode simple option.
The summed share of a specific base type is then compared against the total
possible and if this is above the --call-fract fraction parameter then the
most likely base type is called, or "N" otherwise (or absent if it is a gap).
The --ambig option permits generation of ambiguity codes instead of "N",
provided the ratio of the second most common base type to the most common is
above the --het-fract fraction.
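The frequency counting method can be sketched for a single column as shown below. This is an illustrative approximation, not the samtools implementation: it handles only A, C, G and T (no gaps), tests the heterozygous condition before the --call-fract check, and requires het_fract as an argument because no default is stated here. The simple_call function is hypothetical.

```python
from collections import Counter

# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def simple_call(bases, het_fract, call_fract=0.75, quals=None, ambig=False):
    """Call one consensus column from observed bases (A, C, G, T only).

    Each base scores +1, or +qual when quals is given (as with --use-qual).
    """
    weights = quals if quals is not None else [1] * len(bases)
    score = Counter()
    for base, weight in zip(bases, weights):
        score[base] += weight
    total = sum(score.values())
    if total == 0:
        return "N"
    ranked = score.most_common()
    best, best_score = ranked[0]
    if ambig and len(ranked) > 1:
        second, second_score = ranked[1]
        # Heterozygous if the runner-up is a large enough share of the best.
        if second_score >= het_fract * best_score:
            return IUPAC[frozenset((best, second))]
    return best if best_score / total >= call_fract else "N"


print(simple_call("AAAAAAAC", het_fract=0.5))              # A
print(simple_call("AAAACCCC", het_fract=0.5))              # N
print(simple_call("AAAACCCC", het_fract=0.5, ambig=True))  # M
```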
General options that apply to both algorithms:
-
-r REG, --region REG
- Limit the query to region REG. This requires an
index.
-
-f FMT, --format FMT
- Produce format FMT, with "fastq",
"fasta" and "pileup" as permitted options.
-
-l N, --line-len N
- Sets the maximum line length of line-wrapped fasta and
fastq formats to N.
-
-o FILE, --output FILE
- Output consensus to FILE instead of stdout.
-
-m STR, --mode STR
- Select the consensus algorithm. Valid modes are
"simple" frequency counting and the "bayesian" (Gap5)
method, with Bayesian being the default. (Note case does not matter, so
"Bayesian" is accepted too.)
- -a
- Outputs all bases, from start to end of reference, even
when the aligned data does not extend to the ends. This is most useful for
construction of a full length reference sequence.
-
--rf, --incl-flags STR|INT
- Only include reads with at least one FLAG bit set. Defaults
to zero, which filters no reads.
-
--ff, --excl-flags STR|INT
- Exclude reads with any FLAG bit set. Defaults to
"UNMAP,SECONDARY,QCFAIL,DUP".
-
--min-MQ INT
- Filters out reads with a mapping quality below INT.
This defaults to zero.
-
--show-del yes/no
- Whether to show deletions as "*" (yes) or to omit them
from the output (no). Defaults to no.
-
--show-ins yes/no
- Whether to show insertions in the consensus. Defaults to
yes.
-
-A, --ambig
- Enables IUPAC ambiguity codes in the consensus output.
Without this the output will be limited to A, C, G, T, N and *.
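The read filters above (--incl-flags, --excl-flags and --min-MQ) combine roughly as in this sketch. The FLAG bit values are those defined by the SAM specification; the keep_read and parse_flags helpers are hypothetical and only understand the four flag names used in the default.

```python
# SAM FLAG bits for the names used in the default exclusion list.
FLAGS = {"UNMAP": 0x4, "SECONDARY": 0x100, "QCFAIL": 0x200, "DUP": 0x400}


def parse_flags(spec):
    """Accept an integer or a comma-separated list of flag names."""
    if isinstance(spec, int):
        return spec
    value = 0
    for name in spec.split(","):
        value |= FLAGS[name]
    return value


def keep_read(flag, mapq, incl_flags=0,
              excl_flags="UNMAP,SECONDARY,QCFAIL,DUP", min_mq=0):
    """Return True if a read passes all three filters."""
    incl = parse_flags(incl_flags)
    excl = parse_flags(excl_flags)
    if incl and not (flag & incl):   # zero means "no inclusion filter"
        return False
    if flag & excl:                  # any excluded bit rejects the read
        return False
    return mapq >= min_mq


print(keep_read(0x0, 30))    # True
print(keep_read(0x400, 30))  # False: duplicate
```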
- The following options apply only to the simple consensus
mode:
-
-q, --use-qual
- For the simple consensus algorithm, this enables use of
base quality values. Instead of summing 1 per base called, it sums the
base quality. These sums are also used for the --call-fract
and --het-fract parameters. Quality values are always used for
the "Gap5" consensus method, where this option has no effect. Note
currently quality values only affect SNPs and not inserted sequences,
which are still scored with a fixed +1 per base type occurrence.
-
-d D, --min-depth D
- The minimum depth required to make a call. Defaults to 1.
Failing this depth check will produce consensus "N", or absent
if it is an insertion.
-
-H H, --het-fract H
- For consensus columns containing multiple base types, if
the second most frequent type is at least H fraction of the most
common type then a heterozygous base type will be reported in the
consensus. Otherwise the most common base is used, provided it meets the
--call-fract parameter (otherwise "N"). The fractions
computed may be modified by the use of quality values if the -q
option is enabled. Note although IUPAC has ambiguity codes for A,C,G,T vs
any other A,C,G,T it does not have codes for A,C,G,T vs gap (such as in a
heterozygous deletion). Given the lack of any official code, we use a
lower-case letter to symbolise a half-present base type.
-
-c C, --call-fract C
- Only used for the simple consensus algorithm. Require at
least C fraction of bases agreeing with the most likely consensus
call to emit that base type. This defaults to 0.75. Failing this check
will output "N".
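Two details from the simple-mode options, the lower-case convention for a base versus a gap and the --min-depth check, are sketched below. Both helpers are hypothetical; in particular, writing a gap as "*" and an absent insertion as an empty string are choices made for this example.

```python
IUPAC_HET = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def het_symbol(first, second):
    """Combine the two most frequent column types into one symbol.

    A gap is written "*". Base vs base gives an IUPAC code; base vs gap
    has no official code, so the base is written in lower case.
    """
    if first == second:
        return first
    if "*" in (first, second):
        base = first if second == "*" else second
        return base.lower()
    return IUPAC_HET[frozenset((first, second))]


def depth_gate(call, depth, min_depth=1, is_insertion=False):
    """Apply the minimum depth check to an already computed call."""
    if depth >= min_depth:
        return call
    return "" if is_insertion else "N"   # absent for insertions


print(het_symbol("A", "G"))  # R
print(het_symbol("*", "C"))  # c
print(depth_gate("A", 0))    # N
```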
- The following options apply only to the Bayesian consensus mode, which is
the default and can also be selected explicitly with the -5 option or
--mode bayesian.
- -5
- Enable Bayesian consensus algorithm.
-
-C C, --cutoff C
- Only used for the Gap5 consensus mode, which produces a
Phred style score for the final consensus quality. If this is below
C then the consensus is called as "N".
-
--use-MQ, --no-use-MQ
- Enable or disable the use of mapping qualities. Defaults to
on.
-
--adj-MQ, --no-adj-MQ
- If mapping qualities are used, this controls whether they
are scaled by the local number of mismatches to the reference. The
reference is not known to this tool, so this data is obtained from the MD:Z
auxiliary tag (or ignored if not present). Defaults to on.
-
--NM-halo INT
- Specifies the distance either side of the base call being
considered for computing the number of local mismatches.
-
--low-MQ MIN, --high-MQ MAX
- Specifies a minimum and maximum value of the mapping
quality. These are not filters and instead simply put upper and lower caps
on the values. The defaults are 0 and 60.
-
--scale-MQ FLOAT
- This is a general multiplicative mapping quality scaling
factor. The effect is to globally raise or lower the quality values used
in the consensus algorithm. Defaults to 1.0, which leaves the values
unchanged.
-
--P-het FLOAT
- Controls the likelihood of any position being a
heterozygous site. Defaults to 1e-4. Smaller numbers make the algorithm
more likely to call a pure base type. Note the algorithm will always
compute the probability of the base being homozygous vs heterozygous,
irrespective of whether the output is reported as ambiguous (it will be
"N" if deemed to be heterozygous without --ambig mode
enabled).
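To give a feel for how a heterozygosity prior and a Phred-style cutoff interact, here is a generic textbook diploid genotype model. It is explicitly not the Gap5 algorithm: it ignores mapping qualities, neighbouring-quality adjustment and gaps, and the even split of the prior across genotypes is an assumption. The bayes_call function is hypothetical; cutoff is a required argument because no default is stated here.

```python
import math
from itertools import combinations

BASES = "ACGT"


def bayes_call(bases, quals, cutoff, p_het=1e-4):
    """Return (call, phred) for one column; p_het must lie in (0, 1).

    call is one base (homozygous), two bases such as "AC" (heterozygous),
    or "N" when the Phred-scaled confidence is below cutoff.
    """
    genotypes = [(b, b) for b in BASES] + list(combinations(BASES, 2))
    log_post = {}
    for g in genotypes:
        is_het = g[0] != g[1]
        # Assumed prior: p_het shared by 6 hets, the rest by 4 homs.
        prior = p_het / 6 if is_het else (1 - p_het) / 4
        total = math.log(prior)
        for base, qual in zip(bases, quals):
            err = min(0.75, 10 ** (-qual / 10))  # Phred to error rate
            # Each allele is sampled with probability 1/2.
            p = sum((1 - err) if base == allele else err / 3
                    for allele in g) / 2
            total += math.log(p)
        log_post[g] = total
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    best = max(weights, key=weights.get)
    p_wrong = 1 - weights[best] / sum(weights.values())
    phred = 99.0 if p_wrong <= 0 else min(99.0, -10 * math.log10(p_wrong))
    call = "".join(sorted(set(best)))
    return (call if phred >= cutoff else "N"), phred


print(bayes_call("AAAAAAAAAA", [30] * 10, cutoff=10)[0])  # A
print(bayes_call("AAAAACCCCC", [30] * 10, cutoff=10)[0])  # AC
```

A caller mimicking the documented behaviour would then map a two-base call to an IUPAC code when ambiguity codes are enabled, and to "N" otherwise.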
- Create a modified FASTA reference that has a 1:1 coordinate
correspondence with the original reference used in alignment.
samtools consensus -a --show-ins no in.bam -o ref.fa
- Create a FASTQ file for the contigs with aligned data,
including insertions.
samtools consensus -f fastq in.bam -o cons.fq
Written by James Bonfield from the Sanger Institute.
samtools(1), samtools-mpileup(1)
Samtools website: <http://www.htslib.org/>