samtools-ampliconstats - produces statistics from amplicon sequencing alignment
file
samtools ampliconstats [
options]
primers.bed
in.sam|
in.bam|
in.cram...
samtools ampliconstats collects statistics from one or more input alignment
files and produces tables in text format. The output can be visualized
graphically using plot-ampliconstats.
The alignment files should have previously been clipped of primer sequence, for
example by "samtools ampliconclip" and the sites of these primers
should be specified as a bed file in the arguments. Each amplicon must be
present in the bed file with one or more LEFT primers (direction
"+") followed by one or more RIGHT primers. For example:
MN908947.3 1875 1897 nCoV-2019_7_LEFT 60 +
MN908947.3 1868 1890 nCoV-2019_7_LEFT_alt0 60 +
MN908947.3 2247 2269 nCoV-2019_7_RIGHT 60 -
MN908947.3 2242 2264 nCoV-2019_7_RIGHT_alt5 60 -
MN908947.3 2181 2205 nCoV-2019_8_LEFT 60 +
MN908947.3 2568 2592 nCoV-2019_8_RIGHT 60 -
Ampliconstats will identify which read belongs to which amplicon. For purposes
of computing coverage statistics for amplicons with multiple primer choices,
only the innermost primer locations are used.
A summary of output sections is listed below, followed by more detailed
descriptions.
- SS
- Amplicon and file counts. Always comes first
- AMPLICON
- Amplicon primer locations
- FSS
- File specific: summary stats
- FRPERC
- File specific: read percentage distribution between
amplicons
- FDEPTH
- File specific: average read depth per amplicon
- FVDEPTH
- File specific: average read depth per amplicon, full length
only
- FREADS
- File specific: numbers of reads per amplicon
- FPCOV
- File specific: percent coverage per amplicon
- FTCOORD
- File specific: template start,end coordinate frequencies
per amplicon
- FAMP
- File specific: amplicon correct / double / treble length
counts
- FDP_ALL
- File specific: template depth per reference base, all
templates
- FDP_VALID
- File specific: template depth per reference base, valid
templates only
- CSS
- Combined summary stats
- CRPERC
- Combined: read percentage distribution between
amplicons
- CDEPTH
- Combined: average read depth per amplicon
- CVDEPTH
- Combined: average read depth per amplicon, full length
only
- CREADS
- Combined: numbers of reads per amplicon
- CPCOV
- Combined: percent coverage per amplicon
- CTCOORD
- Combined: template coordinates per amplicon
- CAMP
- Combined: amplicon correct / double / treble length
counts
- CDP_ALL
- Combined: template depth per reference base, all
templates
- CDP_VALID
- Combined: template depth per reference base, valid
templates only
File specific sections start with both the section key and the filename basename
(minus directory and .sam, .bam or .cram suffix).
Note that the file specific sections are interleaved, ordered first by file and
secondly by the file specific stats. To collate them together, use
"grep" to pull out all data of a specific type.
The combined sections (C*) follow the same format as the file specific sections,
with a different key. For simplicity of parsing they also have a filename
column which is filled out with "COMBINED". These rows contain stats
aggregated across all input files.
This section is once per file and includes summary information to be utilised
for scaling of plots, for example the total number of amplicons and files
present, tool version number, and command line arguments. The second column is
the filename or "COMBINED". This is followed by the reference name
(unless single-ref mode is enabled), and the summary statistic name and value.
The AMPLICON section is a reformatting of the input BED file. Each line consists
of the reference name (unless single-ref mode is enable), the amplicon number
and the
start-
end coordinates of the left and right primers.
Where multiple primers are available these are comma separated, for example
10-30,15-40 in the left primer column indicates two primers have been
multiplex together covering genome coordinates 10-30 inclusive and 14-40
inclusively.
This section consists of summary counts for the entire set of input files. These
may be useful for automatic scaling of plots.
Number of amplicons |
Total number of amplicons listed in primer.bed |
Number of files |
Total number of SAM, BAM or CRAM files |
End of summary |
Always the last item. Marker for end of CSS block. |
This lists summary statistics specific to an individual input file. The values
reported are:
raw total sequences |
Total number of sequences found in the file |
filtered sequences |
Number of sequences filtered with -F option |
failed primer match |
Number of sequences that did not correspond to |
|
a known primer location |
matching sequences |
Number of sequences allocated to an amplicon |
For each amplicon, this simply reports the count of reads that have been
assigned to it. A read is assigned to an amplicon if the start and/or end of
the read is within a specified number of bases of the primer sites listed in
the bed file. This distance is controlled via the -m option.
For each amplicon, this lists what percentage of reads were assigned to this
amplicon out of the total number of assigned reads. This may be used to
diagnose how uniform this distribution is.
Note this is a pure read count and has no relation to amplicon size.
Using the reads assigned to each amplicon and their start / end locations on
that reference, computed using the POS and CIGAR fields, we compute the total
number of bases aligned to this amplicon and corresponding the average depth.
The VDEPTH variants are filtered to only include templates with end-to-end
coverage across the amplicon. These can be considered to be "valid"
or "usable" templates and give an indication of the minimum depth
for the amplicon rather than the average depth.
To compute the depth the length of the amplicon is computed using the innermost
set of primers, if multiple choices are listed in the bed file.
Similar to the FDEPTH section, this is a binary status of covered or not covered
per position in each amplicon. This is then expressed as a percentage by
dividing by the amplicon length, which is computed using the innermost set of
primers covering this amplicon.
The minimum depth necessary to constitute a position as being
"covered" is specifiable using the -d option.
It is possible for an amplicon to be produced using incorrect primers, giving
rise to extra-long amplicons (typically double or treble length).
The FTCOORD field holds a distribution of observed template coordinates from the
input data. Each row consists of the file name, the amplicon number in
question, and tab separated tuples of start, end, frequency and status (0 for
OK, 1 for skipping amplicon, 2 for unknown location). Each template is only
counted for one amplicon, so if the read-pairs span amplicons the count will
show up in the left-most amplicon covered.
Th COORD data may indicate which primers are being utilised if there are
alternates available for a given amplicon.
For COORD lines amplicon number 0 holds the frequency data for data that reads
that have not been assigned to any amplicon. That is, they may lie within an
amplicon, but they do not start or end at a known primer location. It is not
recorded for BED files containing multiple references.
The FAMP / CAMP section is a simple count per amplicon of the number of
templates coming from this amplicon. Templates are counted once per amplicon,
but and like the FTCOORD field if a read-pair spans amplicons it is only
counted in the left-most amplicon. Each line consists of the file name,
amplicon number and 3 counts for the number of templates with both ends within
this amplicon, the number of templates with the rightmost end in another
amplicon, and the number of templates where the other end has failed to be
assigned to an amplicon.
Note FAMP / CAMP amplicon number 0 is the summation of data for all amplicons (1
onwards).
These are for depth plots per base rather than per amplicon. They distinguish
between all reads in all templates, and only reads in templates considered to
be "valid". Such templates have both reads (if paired) matching
known primer locations from he same amplicon and have full length coverage
across the entire amplicon.
This FDP_VALID can be considered to be the minimum template depth across the
amplicon.
The difference between the VALID and ALL plots represents additional data that
for some reason may not be suitable for producing a consensus. For example an
amplicon that skips a primer, pairing 10_LEFT with 12_RIGHT, will have
coverage for the first half of amplicon 10 and the last half of amplicon 12.
Counting the number of reads or bases alone in the amplicon does not reveal
the potential for non-uniformity of coverage.
The lines start with the type keyword, file / sample name, reference name
(unless single-ref mode is enabled), followed by a variable number of tab
separated tuples consisting of
depth,length. The length field is a
basic form of run-length encoding where all depth values within a specified
fraction of each other (e.g. >= (1-fract)*midpoint and <=
(1+fract)*midpoint) are combined into a single run. This fraction is
controlled via the
-D option.
-
-f, --required-flag INT|STR
- Only output alignments with all bits set in INT
present in the FLAG field. INT can be specified in hex by beginning
with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
/^0[0-7]+/) [0], or in string form by specifying a comma-separated list of
keywords as listed by the "samtools flags" subcommand.
-
-F, --filter-flag INT|STR
- Do not output alignments with any bits set in INT
present in the FLAG field. INT can be specified in hex by beginning
with `0x' (i.e. /^0x[0-9A-F]+/) or in octal by beginning with `0' (i.e.
/^0[0-7]+/) [0], or in string form by specifying a comma-separated list of
keywords as listed by the "samtools flags" subcommand.
-
-a, --max-amplicons INT
- Specify the maximum number of amplicons permitted.
-
-b, --tcoord-bin INT
- Bin the template start,end positions into multiples of
NT prior to counting their frequency and reporting in the FTCOORD /
CTCOORD lines. This may be useful for technologies with higher errors
rates where the alignment ends will vary slightly. Defaults to 1, which is
equivalent to no binning.
-
-c, --tcoord-min-count INT
- In the FTCOORD and CTCOORD lines, only record template
start,end coordinate combination if they occur at least INT times.
-
-d, --min-depth INT
- Specifies the minimum base depth to consider a reference
position to be covered, for purposes of the FRPERC and CRPERC sections.
-
-D, --depth-bin FRACTION
- Controls the merging of neighbouring similar depths for the
FDP_ALL and FDP_VALID plots. The default FRACTION is 0.01, meaning depths
within +/- 1% of a mid point will be aggregated together as a run of the
same value. This merging is useful to reduce the file size. Use -D
0 to record every depth.
-
-l, --max-amplicon-length INT
- Specifies the maximum length of any individual amplicon.
-
-m, --pos-margin INT
- Reads are compared against the primer start and end
locations specified in the BED file. An aligned sequence should start
precisely at these locations, but sequencing errors may cause the primer
clipping to be a few bases out or for the alignment to add a few extra
bases of soft clip. This option specifies the margin of error permitted
when matching a read to an amplicon number.
- -o FILE
- Output stats to FILE. The default is to write to stdout.
- -s, --use-sample-name
- Instead of using the basename component of the input path
names, use the SM field from the first @RG header line.
- -S, --single-ref
- Force the output format to match the older single-reference
style used in Samtools 1.12 and earlier. This removes the reference names
from the SS, AMPLICON, DP_ALL and DP_VALID sections. It cannot be enabled
if the input BED file has more than one reference present. Note that
plot-ampliconstats can process both output styles.
-
-t, --tlen-adjust INT
- Adjust the TLEN field by +/- INT to compensate for
primer clipping. This defaults to zero, but if the primers have been
clipped and the TLEN field has not been updated using samtools fixmate
then the template length will be wrong by the sum of the forward and
reverse primer lengths.
This adjustment does not have to be precise as the --pos-margin field
permits some leeway. Hence if required, it should be set to approximately
double the average primer length.
-
-@ INT
- Number of BAM/CRAM (de)compression threads to use in
addition to main thread [0].
To run ampliconstats on a directory full of CRAM files and then produce a series
of PNG images named "mydata*.png":
samtools ampliconstats V3/nCoV-2019.bed /path/*.cram > astats
plot-ampliconstats -size 1200,900 mydata astats
Written by James Bonfield from the Sanger Institute.
samtools(1),
samtools-ampliconclip(1) samtools-stats(1),
samtools-flags(1)
Samtools website: <
http://www.htslib.org/>