bamfeaturecount - evaluate alignments produce by an RNA-seq aligner
bamfeaturecount annotation.gtf mapped.bam [options]
bamfeaturecount evaluates the alignments produced by an RNA-seq aligner (like
e.g. STAR) and outputs the average coverage found for each transcript and
gene. It requires an annotation file in the GTF format in addition to a BAM
file containing RNA-seq alignments. The annotation file needs to have been
processed by the filtergtf program to ensure a sorting which is compatible
with bamfeaturecounts expectations.
The output contains two types of lines, one starting with [transcript] and
another starting with [gene]. In both cases the information provided is given
as a tab separated set of columns.
The following is an example of a transcript line:
[transcript] chr1:C1orf159 ENST00000472741.5 (0.0618557,5.38144,97)
(0.0496454,11.305,564) (0.0514372,10.4357,661) [1116059,1116087)
[1091990,1092103) [1091045,1091565)
transcript lines have always at least 6 columns. The second column provides the
reference sequence name (as given in the BAM header and GTF file) and the name
of the gen concerned separated by a colon (here chr1 and C1orf159). The third
column contains the transcript identifier (transcript_id in the GTF file).
Column 4, 5 and 6 each contain either a triplet of numbers (A,B,C) or the
symbol *. Column 4 contains regions unique to this transcript (i.e. stretches
on the genome not shared with any other transcript). Column 5 contains regions
shared by at least one other transcript. Column 6 contains the information for
all regions covered by the transcript (unique and non unique). If a transcript
does not have any respective intervals (i.e. if every base is also covered by
at least one other transcript or all bases are unique to this transcript) then
the column contains the symbol *. Otherwise a triplet (A,B,C) is given where A
denotes the fraction of bases not covered by any alignment (in the example
0.0618557 or 6% of the bases unique to this transcript are not covered), B
contains the average coverage (in the example the average sequencing depth on
the unique bases for this transcript is 5.38) and C the total number of bases
(in the example 97 bases of this transcript are not shared with any other
transcript). The rest of the columns (> 6) contain the zero bases intervals
of exons for this transcript on the reference sequence.
The gene lines should be disregarded in the current version of the program.
The following key=value pairs can be given:
T=<filename>: set the prefix for temporary file names
verbose=<1>: print some progress report while processing
threads=<1>: number of threads used for processing. Set this to 0
to use all cores detected on the machine.
mapqmin=<255>: Minimum mapping quality allowed for alignments
considered. By default this is 255 (the value used by STAR to mark uniquely
mapped reads).
mapqmax=<255>: Maximum mapping quality allowed for alignments
considered. By default this is 255 (the value used by STAR to mark uniquely
mapped reads).
uncoveredthres=<0.1>: maximum fraction of bases allowed to be
uncovered in a transcript so the transcript will be reported (i.e. minimum
value allowed for the first number given in column 6 of the transcript
output).
uniqueuncoveredthres=<0.1>: maximum fraction of bases allowed to be
uncovered in the unique region of a transcript so the transcript will be
reported (i.e. minimum value allowed for the first number given in column 4 of
the transcript output).
exclude=<SECONDARY>: Do not include reads in the output that have
any of the given flags set. The flags are given separated by commas. Valid
flags are:
- PAIRED:
- read was paired in sequencing
- PROPER_PAIR:
- read has been mapped as part of a proper pair
- UNMAP:
- read was not mapped
- MUNMAP:
- mate of read was not mapped
- REVERSE:
- read was mapped to the reverse strand
- MREVERSE:
- mate of read was mapped to the reverse strand
- READ1:
- read was first read of a pair during sequencing
- READ2:
- read was second read of a pair during sequencing
- SECONDARY:
- alignment is secondary, i.e. an alternative mapping to the
primary alignment in the same file
- QCFAIL:
- read as marked as having failed quality control
- DUP:
- read is marked as a duplicate of another read in the same
file (see bammarkduplicates)
- SUPPLEMENTARY:
- read is marked as supplementary alignment
exportcdna=<0>: instead of feature counting generate a FastA file
containing the CDNA as designated by the GTF annotation file. The second
parameter (BAM file) needs to be specified, but will not be read. This option
requires a reference FastA file suitable for the GTF file to be provided via
the reference key.
reference=<>: name of a reference FastA file. This is required for
the exportcdna option.
Written by German Tischler-Höhle.
Report bugs to <
[email protected]>
Copyright © 2009-2019 German Tischler-Höhle, © 2011-2014
Genome Research Limited. License GPLv3+: GNU GPL version 3
<
http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.