samtools-stats - produces comprehensive statistics from alignment file
samtools stats [options] in.sam|in.bam|in.cram [region...]
samtools stats collects statistics from SAM, BAM and CRAM files and outputs them in a text format.
The output can be visualized graphically using plot-bamstats.
A summary of output sections is listed below, followed by more detailed
descriptions.
CHK | Checksum
SN  | Summary numbers
FFQ | First fragment qualities
LFQ | Last fragment qualities
GCF | GC content of first fragments
GCL | GC content of last fragments
GCC | ACGT content per cycle
GCT | ACGT content per cycle, read oriented
FBC | ACGT content per cycle for first fragments only
FTC | ACGT raw counters for first fragments
LBC | ACGT content per cycle for last fragments only
LTC | ACGT raw counters for last fragments
BCC | ACGT content per cycle for BC barcode
CRC | ACGT content per cycle for CR barcode
OXC | ACGT content per cycle for OX barcode
RXC | ACGT content per cycle for RX barcode
QTQ | Quality distribution for BC barcode
CYQ | Quality distribution for CR barcode
BZQ | Quality distribution for OX barcode
QXQ | Quality distribution for RX barcode
IS  | Insert sizes
RL  | Read lengths
FRL | Read lengths for first fragments only
LRL | Read lengths for last fragments only
ID  | Indel size distribution
IC  | Indels per cycle
COV | Coverage (depth) distribution
GCD | GC-depth
Not all sections will be reported as some depend on the data being coordinate
sorted while others are only present when specific barcode tags are in use.
Some of the statistics are collected for “first” or
“last” fragments. Records are put into these categories using
the PAIRED (0x1), READ1 (0x40) and READ2 (0x80) flag bits, as follows:
- Unpaired reads (i.e. PAIRED is not set) are all “first” fragments. For
  these records, the READ1 and READ2 flags are ignored.
- Reads where PAIRED and READ1 are set, and READ2 is not set are “first”
  fragments.
- Reads where PAIRED and READ2 are set, and READ1 is not set are “last”
  fragments.
- Reads where PAIRED is set and either both READ1 and READ2 are set or
  neither is set are not counted in either category.
Information on the meaning of the flags is given in the SAM specification
document <https://samtools.github.io/hts-specs/SAMv1.pdf>.
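The first/last fragment rules can be written as a small function. This is an illustrative sketch only, not the samtools source; the function name fragment_category and the return value "neither" are made up for the example.

```python
# Flag bits as defined in the SAM specification.
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80


def fragment_category(flag: int) -> str:
    """Classify a SAM FLAG value as 'first', 'last' or 'neither'."""
    if not flag & PAIRED:
        # Unpaired reads are always "first"; READ1/READ2 are ignored.
        return "first"
    read1 = bool(flag & READ1)
    read2 = bool(flag & READ2)
    if read1 and not read2:
        return "first"
    if read2 and not read1:
        return "last"
    # PAIRED with both or neither of READ1/READ2: counted in no category.
    return "neither"


for flag in (0x0, 0x80, 0x41, 0x81, 0xC1, 0x1):
    print(hex(flag), fragment_category(flag))
```

Note that an unpaired record with READ2 set (0x80) is still reported as a first fragment, because the PAIRED test comes first.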
The CHK row contains distinct CRC32 checksums of read names, sequences and
quality values. The checksums are computed per alignment record and summed,
meaning the checksum does not change if the input file has the sort-order
changed.
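The reason a summed checksum is independent of sort order can be shown with a short sketch. It uses Python's zlib.crc32 and assumes the running sum wraps at 32 bits; both the helper name and the wrap-around width are assumptions made for the example, not a statement about the samtools implementation.

```python
import zlib


def order_independent_checksum(values):
    """Sum the CRC32 of each value, wrapping at 32 bits (assumed width)."""
    total = 0
    for value in values:
        total = (total + zlib.crc32(value)) & 0xFFFFFFFF
    return total


names = [b"read1", b"read2", b"read3"]
forward = order_independent_checksum(names)
backward = order_independent_checksum(reversed(names))
# Addition is commutative, so any ordering gives the same sum.
print(forward == backward)
```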
The SN section contains a series of counts, percentages, and averages, in a
similar style to samtools flagstat, but more comprehensive.
raw total sequences - total number of reads in a file, excluding supplementary
and secondary reads. Same number reported by samtools view -c -F 0x900.
filtered sequences - number of discarded reads when using -f or -F
option.
sequences - number of processed reads.
is sorted - flag indicating whether the file is coordinate sorted (1) or
not (0).
1st fragments - number of first fragment reads (flags 0x01 not set; or flags
0x01 and 0x40 set, 0x80 not set).
last fragments - number of last fragment reads (flags 0x01 and 0x80 set, 0x40
not set).
reads mapped - number of reads, paired or single, that are mapped (flag
0x4 or 0x8 not set).
reads mapped and paired - number of mapped paired reads (flag 0x1 is set
and flags 0x4 and 0x8 are not set).
reads unmapped - number of unmapped reads (flag 0x4 is set).
reads properly paired - number of mapped paired reads with flag 0x2 set.
paired - number of paired reads, mapped or unmapped, that are neither
secondary nor supplementary (flag 0x1 is set and flags 0x100 (256) and 0x800
(2048) are not set).
reads duplicated - number of duplicate reads (flag 0x400 (1024) is set).
reads MQ0 - number of mapped reads with mapping quality 0.
reads QC failed - number of reads that failed the quality checks (flag
0x200 (512) is set).
non-primary alignments - number of secondary reads (flag 0x100 (256)
set).
supplementary alignments - number of supplementary reads (flag 0x800
(2048) set).
total length - number of processed bases from reads that are neither
secondary nor supplementary (flags 0x100 (256) and 0x800 (2048) are not set).
total first fragment length - number of processed bases that belong to
first fragments.
total last fragment length - number of processed bases that belong to
last fragments.
bases mapped - number of processed bases that belong to reads mapped.
bases mapped (cigar) - number of mapped bases filtered by the CIGAR
string corresponding to the read they belong to. Only alignment matches (M),
inserts (I), sequence matches (=) and sequence mismatches (X) are counted.
bases trimmed - number of bases trimmed by bwa, that belong to non-secondary
and non-supplementary reads. Enabled by -q option.
bases duplicated - number of bases that belong to reads duplicated.
mismatches - number of mismatched bases, as reported by the NM tag
associated with a read, if present.
error rate - ratio between mismatches and bases mapped (cigar).
average length - ratio between total length and sequences.
average first fragment length - ratio between total first fragment length and
1st fragments.
average last fragment length - ratio between total last fragment length and
last fragments.
maximum length - length of the longest read (includes hard-clipped
bases).
maximum first fragment length - length of the longest first fragment read
(includes hard-clipped bases).
maximum last fragment length - length of the longest last fragment read
(includes hard-clipped bases).
average quality - ratio between the sum of base qualities and total length.
insert size average - the average absolute template length for paired and
mapped reads.
insert size standard deviation - standard deviation for the average
template length distribution.
inward oriented pairs - number of paired reads with flag 0x40 (64) set
and flag 0x10 (16) not set or with flag 0x80 (128) set and flag 0x10 (16) set.
outward oriented pairs - number of paired reads with flag 0x40 (64) set
and flag 0x10 (16) set or with flag 0x80 (128) set and flag 0x10 (16) not set.
pairs with other orientation - number of paired reads that don't fall in
any of the above two categories.
pairs on different chromosomes - number of pairs where one read is on one
chromosome and the pair read is on a different chromosome.
percentage of properly paired reads - percentage of reads properly paired out
of sequences.
bases inside the target - number of bases inside the target region(s)
(when a target file is specified with -t option).
percentage of target genome with coverage > VAL - percentage of target
bases with a coverage larger than VAL. By default, VAL is 0, but a custom
value can be supplied by the user with -g option.
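Several SN values are ratios of other SN values, so they can be recomputed from the counts. The sketch below assumes each SN row is tab-separated in the form "SN", a name ending in a colon, a value, and optionally a trailing comment field; that layout and the sample numbers are assumptions made for illustration.

```python
def parse_sn(text):
    """Collect SN rows into a dict mapping the field name to a float."""
    summary = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 3 and fields[0] == "SN":
            name = fields[1].rstrip(":")
            summary[name] = float(fields[2])
    return summary


# Made-up sample rows, not output from a real file.
sample = "\n".join([
    "SN\tsequences:\t200",
    "SN\ttotal length:\t20000",
    "SN\tbases mapped (cigar):\t19000",
    "SN\tmismatches:\t38\t# trailing comment field",
])

sn = parse_sn(sample)
# error rate = mismatches / bases mapped (cigar)
error_rate = sn["mismatches"] / sn["bases mapped (cigar)"]
# average length = total length / sequences
average_length = sn["total length"] / sn["sequences"]
print(error_rate, average_length)
```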
The FFQ and LFQ sections report the quality distribution per first/last fragment
and per cycle number. They have one row per cycle (reported as the first
column after the FFQ/LFQ key) with remaining columns being the observed
integer counts per quality value, starting at quality 0 in the left-most column
and ending at the largest observed quality. Thus each row forms its own
quality distribution and any cycle specific quality artefacts can be observed.
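Because each FFQ/LFQ row is a histogram indexed by quality value, a per-cycle mean quality follows directly from it. The sketch below assumes tab-separated fields and uses a made-up row; the function name is hypothetical.

```python
def mean_quality(row):
    """Return (cycle, mean quality) for one FFQ or LFQ row."""
    fields = row.rstrip("\n").split("\t")
    cycle = int(fields[1])
    # Column i after the cycle number holds the count for quality value i.
    counts = [int(value) for value in fields[2:]]
    total = sum(counts)
    if total == 0:
        return cycle, 0.0
    weighted = sum(quality * count for quality, count in enumerate(counts))
    return cycle, weighted / total


# Made-up row: cycle 1, two bases at quality 1 and six bases at quality 3.
print(mean_quality("FFQ\t1\t0\t2\t0\t6"))
```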
GCF and GCL report the total GC content of each fragment, separated into first
and last fragments. The columns show the GC percentage (between 0 and 100) and
an integer count of fragments at that percentage.
GCC, FBC and LBC report the nucleotide content per cycle either combined (GCC)
or split into first (FBC) and last (LBC) fragments. The columns are cycle
number (integer), and percentage counts for A, C, G, T, N and other (typically
containing ambiguity codes) normalised against the total counts of A, C, G and
T only (excluding N and other).
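The normalisation described above means the A, C, G and T percentages sum to 100, while N and other are expressed relative to that same denominator. A minimal sketch, with a hypothetical helper name and made-up counts for a single cycle:

```python
def base_percentages(counts):
    """Express each count as a percentage of the A+C+G+T total."""
    denominator = sum(counts.get(base, 0) for base in "ACGT")
    if denominator == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * value / denominator for name, value in counts.items()}


# Made-up counts for one cycle.
cycle_counts = {"A": 30, "C": 20, "G": 20, "T": 30, "N": 10, "other": 0}
pct = base_percentages(cycle_counts)
print(pct)
```

With these counts N comes out at 10%, on top of the 100% shared by A, C, G and T.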
GCT offers a similar report to GCC, but whereas GCC counts nucleotides as they
appear in the SAM output (in reference orientation), GCT takes into account
whether a nucleotide belongs to a reverse complemented read and counts it in
the original read orientation. If there are no reverse complemented reads in a
file, the GCC and GCT reports will be identical.
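The difference between the two orientations can be illustrated by restoring the original read from a SAM record. The sketch assumes the reverse-strand flag 0x10 from the SAM specification marks reverse complemented reads; the helper name is hypothetical.

```python
REVERSE = 0x10  # SAM flag: sequence is stored reverse complemented

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def original_orientation(seq, flag):
    """Return the sequence as it was read by the instrument."""
    if flag & REVERSE:
        # Complement each base, then reverse the order of the cycles.
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


# As stored in SAM (what GCC counts) versus read oriented (what GCT counts).
print(original_orientation("AACG", 0x10))
print(original_orientation("AACG", 0x0))
```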
FTC and LTC report the total numbers of nucleotides for first and last
fragments, respectively. The columns are the raw counters for A, C, G, T and N
bases.
BCC, CRC, OXC and RXC are the barcode equivalent of GCC, showing nucleotide
content for the barcode tags BC, CR, OX and RX respectively. Their quality
values distributions are in the QTQ, CYQ, BZQ and QXQ sections, corresponding
to the BC/QT, CR/CY, OX/BZ and RX/QX SAM format sequence/quality tags. These
quality value distributions follow the same format used in the FFQ and LFQ
sections. All these section names are followed by a number (1 or 2),
indicating that the stats figures below them correspond to the first or second
barcode (in the case of dual indexing). Thus, these sections will appear as
BCC1, CRC1, OXC1 and RXC1, accompanied by their quality correspondents QTQ1,
CYQ1, BZQ1 and QXQ1. If a separator is present in the barcode sequence
(usually a hyphen), indicating dual indexing, then sections ending in
"2" will also be reported to show the second tag statistics (e.g.
both BCC1 and BCC2 are present).
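The split into first and second barcode can be sketched as below. Only a hyphen separator is handled here; treating the hyphen as the sole separator is an assumption for the example, and the helper name is hypothetical.

```python
def split_barcode(tag_value, separator="-"):
    """Return (first, second); second is None for single indexing."""
    first, found, second = tag_value.partition(separator)
    return first, (second if found else None)


# Dual indexing: both BCC1 and BCC2 style sections would have data.
print(split_barcode("ACGTAC-TTGACA"))
# Single indexing: only the "1" sections apply.
print(split_barcode("ACGTAC"))
```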
IS reports insert size distributions with one row per size, reported in the
first column, with subsequent columns for the frequency of total pairs, inward
oriented pairs, outward oriented pairs and other orientation pairs. The
-i option specifies the maximum insert size reported.
RL reports the distribution for all read lengths, with one row per observed
length (up to the maximum specified by the -l option). Columns are read
length and frequency. FRL and LRL contain the same information separated into
first and last fragments.
ID reports the distribution of indel sizes, with one row per observed size. The
columns are size, frequency of insertions at that size and frequency of
deletions at that size.
IC reports the frequency of indels occurring per cycle, broken down by both
insertion / deletion and by first / last read. Note for multi-base indels this
only counts the first base location. Columns are cycle, number of insertions
in first fragments, number of insertions in last fragments, number of
deletions in first fragments, and number of deletions in last fragments.
COV reports a distribution of the alignment depth per covered reference site.
For example an average depth of 50 would ideally result in a normal
distribution centred on 50, but the presence of repeats or copy-number
variation may reveal multiple peaks at approximate multiples of 50. The first
column is an inclusive coverage range in the form of [min-max]. The next
columns are a repeat of the maximum portion of the depth range (now as a single
integer) and the frequency that depth range was observed. The minimum, maximum
and range step size are controlled by the -c option. Depths below the minimum
and above the maximum are reported with ranges [<min] and [max<], respectively.
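The three range forms in the first COV column can be parsed with a few regular expressions. This is a sketch that covers only the label text; the function name and the use of None for an open end are choices made for the example.

```python
import re


def parse_cov_label(label):
    """Return (low, high) for a COV range label; None marks an open end.

    For [min-max] both bounds are inclusive. For [<min] and [max<] the
    returned number is the bound quoted in the label.
    """
    match = re.fullmatch(r"\[(\d+)-(\d+)\]", label)
    if match:
        return int(match.group(1)), int(match.group(2))
    match = re.fullmatch(r"\[<(\d+)\]", label)
    if match:
        return None, int(match.group(1))
    match = re.fullmatch(r"\[(\d+)<\]", label)
    if match:
        return int(match.group(1)), None
    raise ValueError(f"unrecognised coverage range: {label!r}")


for label in ("[<1]", "[1-1]", "[6-10]", "[1000<]"):
    print(label, parse_cov_label(label))
```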
GCD reports the GC content of the reference data aligned against per alignment
record, with one row per observed GC percentage reported as the first column
and sorted on this column. The second column is a total sequence percentile,
as a running total (ending at 100%). The first and second columns may be used
to produce a simple distribution of GC content. Subsequent columns list the
coverage depth at 10th, 25th, 50th, 75th and 90th GC percentiles for this
specific GC percentage, revealing any GC bias in mapping. These columns are
averaged depths, so are floating point with no maximum value.
- -c, --coverage MIN,MAX,STEP
- Set coverage distribution to the specified range (MIN, MAX,
STEP all given as integers) [1,1000,1]
- -d, --remove-dups
- Exclude from statistics reads marked as duplicates
- -f, --required-flag STR|INT
- Required flag, 0 for unset. See also `samtools flags` [0]
- -F, --filtering-flag STR|INT
- Filtering flag, 0 for unset. See also `samtools flags` [0]
- --GC-depth FLOAT
- The size of GC-depth bins (decreasing bin size increases
memory requirement) [2e4]
- -h, --help
- This help message
- -i, --insert-size INT
- Maximum insert size [8000]
- -I, --id STR
- Include only listed read group or sample name []
- -l, --read-length INT
- Include in the statistics only reads with the given read
length [-1]
- -m, --most-inserts FLOAT
- Report only the main part of inserts [0.99]
- -P, --split-prefix STR
- A path or string prefix to prepend to filenames output when
creating categorised statistics files with -S/--split. [input filename]
- -q, --trim-quality INT
- The BWA trimming parameter [0]
- -r, --ref-seq FILE
- Reference sequence (required for GC-depth and
mismatches-per-cycle calculation). []
- -S, --split TAG
- In addition to the complete statistics, also output
categorised statistics based on the tagged field TAG (e.g., use
--split RG to split into read groups).
Categorised statistics are written to files named
<prefix>_<value>.bamstat, where prefix is
as given by --split-prefix (or the input filename by default) and
value has been encountered as the specified tagged field's value in
one or more alignment records.
- -t, --target-regions FILE
- Do stats in these regions only. Tab-delimited file
chr,from,to, 1-based, inclusive. []
- -x, --sparse
- Suppress outputting IS rows where there are no
insertions.
- -p, --remove-overlaps
- Remove overlaps of paired-end reads from coverage and base
count computations.
- -g, --cov-threshold INT
- Only bases with coverage above this value will be included
in the target percentage computation [0]
- -X
- If this option is set, it allows the user to specify customized index file
location(s) if the data folder does not contain any index file. Example usage:
samtools stats [options] -X /data_folder/data.bam /index_folder/data.bai chrM:1-10
- -@, --threads INT
- Number of input/output compression threads to use in
addition to main thread [0].
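When calling samtools stats from a script, it can help to assemble the argument list in one place. The sketch below only builds the list and does not run anything; the function and parameter names are hypothetical, and only options listed above are used.

```python
def stats_command(input_path, regions=(), coverage=None, remove_dups=False,
                  filtering_flag=None, threads=None):
    """Build an argument list for 'samtools stats' (nothing is executed)."""
    cmd = ["samtools", "stats"]
    if coverage is not None:
        # MIN,MAX,STEP given as three integers.
        cmd += ["--coverage", ",".join(str(value) for value in coverage)]
    if remove_dups:
        cmd.append("--remove-dups")
    if filtering_flag is not None:
        cmd += ["--filtering-flag", str(filtering_flag)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd.append(input_path)
    cmd.extend(regions)
    return cmd


cmd = stats_command("in.bam", ["chr1:1-1000"], coverage=(1, 500, 5),
                    remove_dups=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a system where samtools is installed.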
Written by Petr Danecek with major modifications by Nicholas Clarke, Martin
Pollard, Josh Randall, and Valeriu Ohan, all from the Sanger Institute.
samtools(1), samtools-flagstat(1), samtools-idxstats(1)
Samtools website: <http://www.htslib.org/>