bamtofastq - convert SAM, BAM or CRAM files to FastQ
bamtofastq [options]
bamtofastq reads a SAM, BAM or CRAM file from standard input and converts it to
the FastQ format. The output can be split into multiple files according to the
pair flags of the reads involved. bamtofastq can collate the source reads
according to their read names, i.e. place pairs of reads next to each other in
the output. bamtofastq writes its output to the standard output channel by
default. All output channels can be compressed using gzip.
The following key=value pairs can be given:
F=<stdout>: output file for the first mates of pairs if collation
is active.
F2=<stdout>: output file for the second mates of pairs if collation
is active.
S=<stdout>: output file for single end reads if collation is
active.
O=<stdout>: output file for unmatched (orphan) first mates if
collation is active.
O2=<stdout>: output file for unmatched (orphan) second mates if
collation is active.
collate=<0|1>: Valid values are
- 1:
- collate read pairs
- 0:
- output reads to standard output in the order in which they
appear in the BAM file
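The routing of reads to the F, F2, S, O and O2 channels under collate=1 can be sketched as follows. This is only an illustration of the channel semantics described above, assuming the standard SAM flag bits; it groups reads in memory and is not the hash table based method bamtofastq itself uses. The function name route_reads is hypothetical.

```python
from collections import defaultdict

# Standard SAM flag bits (from the SAM specification).
PAIRED, READ1, READ2 = 0x1, 0x40, 0x80


def route_reads(reads):
    """Assign (name, flag) records to the F, F2, S, O and O2 channels."""
    by_name = defaultdict(list)
    for name, flag in reads:
        by_name[name].append(flag)

    channels = {"F": [], "F2": [], "S": [], "O": [], "O2": []}
    for name, flags in by_name.items():
        firsts = [f for f in flags if f & PAIRED and f & READ1]
        seconds = [f for f in flags if f & PAIRED and f & READ2]
        singles = [f for f in flags if not f & PAIRED]
        for _ in singles:
            channels["S"].append(name)      # single end read
        if firsts and seconds:
            channels["F"].append(name)      # complete pair, first mate
            channels["F2"].append(name)     # complete pair, second mate
        elif firsts:
            channels["O"].append(name)      # orphan first mate
        elif seconds:
            channels["O2"].append(name)     # orphan second mate
    return channels
```

Here a read name that occurs with both a first and a second mate counts as a complete pair; a name with only one of the two is treated as an orphan.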
combs=<0|1>: print some counts after finishing collation-based
output
filename=<stdin>: input file name (data is read from standard input
if this option is not given)
inputformat=<bam>: input file format. All versions of bamtofastq
come with support for the BAM input format. If the program is additionally
linked to the io_lib package, then the following values are valid:
- bam:
- BAM (see http://samtools.sourceforge.net/SAM1.pdf)
- sam:
- SAM (see http://samtools.sourceforge.net/SAM1.pdf)
- cram:
- CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)
reference=: file name of the reference for CRAM input files. If this key
is unset, then the CRAM file header will be scanned to obtain a reference
file name.
exclude=<SECONDARY>: Do not include reads in the output that have
any of the given flags set. The flags are given separated by commas. Valid
flags are:
- PAIRED:
- read was paired in sequencing
- PROPER_PAIR:
- read has been mapped as part of a proper pair
- UNMAP:
- read was not mapped
- MUNMAP:
- mate of read was not mapped
- REVERSE:
- read was mapped to the reverse strand
- MREVERSE:
- mate of read was mapped to the reverse strand
- READ1:
- read was first read of a pair during sequencing
- READ2:
- read was second read of a pair during sequencing
- SECONDARY:
- alignment is secondary, i.e. an alternative mapping to the
primary alignment in the same file
- QCFAIL:
- read is marked as having failed quality control
- DUP:
- read is marked as a duplicate of another read in the same
file (see bammarkduplicates)
- SUPPLEMENTARY:
- read is marked as a supplementary alignment
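The flag names above correspond to bits of the SAM FLAG field. A minimal sketch of how a comma separated exclude value could be turned into a bit mask and applied to a record is shown below. The bit values are those of the SAM specification; the helper names exclude_mask and is_excluded are hypothetical and not part of bamtofastq.

```python
# SAM FLAG bits for the names accepted by the exclude option.
SAM_FLAGS = {
    "PAIRED": 0x1,
    "PROPER_PAIR": 0x2,
    "UNMAP": 0x4,
    "MUNMAP": 0x8,
    "REVERSE": 0x10,
    "MREVERSE": 0x20,
    "READ1": 0x40,
    "READ2": 0x80,
    "SECONDARY": 0x100,
    "QCFAIL": 0x200,
    "DUP": 0x400,
    "SUPPLEMENTARY": 0x800,
}


def exclude_mask(spec):
    """Combine a comma separated list of flag names into one bit mask."""
    mask = 0
    for name in spec.split(","):
        mask |= SAM_FLAGS[name.strip()]
    return mask


def is_excluded(flag, mask):
    """A read is dropped if any of the masked bits is set in its flag."""
    return (flag & mask) != 0
```

For example, exclude=SECONDARY,SUPPLEMENTARY,QCFAIL yields the mask 0xB00, so a record with flag 0x141 (paired, first read, secondary) would be dropped while one with flag 0x41 would be kept.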
disablevalidation=<0>: Valid values are
- 0:
- run input file validation on alignments (this is the
default)
- 1:
- do not check the validity of the input file (this may help
for some broken input files, but it is a security risk as it can lead to
the execution of arbitrary code through a forged input file).
colhlog=<18>: base two logarithm of the size of the hash table used
for collation (the default value is 18 and should work reasonably well for
most input files. Please see the biobambam paper at arxiv.org/abs/1306.0836
for details).
colsbs=<128M>: size of hash table overflow list in bytes (the
default is 128MB and should work reasonably well for most input files. Please
see the biobambam paper at arxiv.org/abs/1306.0836 for details).
T=<bamtofastq_hostname_pid_time>: file name of temporary file used
for collation
ranges=<>: coordinate ranges selected from input. This option is
only available for input files in BAM and CRAM format which have a
corresponding index file (.bai for BAM, .crai for CRAM) and if input is via
file (i.e. the filename argument is set). A valid range is one of
- whole reference sequence:
- a whole reference sequence (e.g. "chr1")
- half open interval on reference sequence:
- an interval on a reference sequence half open on the right
(e.g. "chr1:50000" which means alignments overlapping chr1 from
position 50000 to the end of chr1)
- interval on reference sequence:
- an interval on a reference sequence (e.g.
"chr1:50000-60000" which means alignments overlapping positions
50000 to 60000 on chr1)
For BAM input multiple ranges are separated by space characters (e.g.
ranges="chr1:10000-20000 chr1:30000-40000"). CRAM input supports a
single range only.
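The three range forms can be illustrated with a small parser. This is a sketch of the syntax described above, not the parser used by bamtofastq; it assumes reference sequence names that do not themselves contain a colon, and the function names are hypothetical.

```python
def parse_range(text):
    """Parse one range into (reference, start, end); missing parts are None."""
    ref, sep, span = text.partition(":")
    if not sep:
        # Whole reference sequence, e.g. "chr1".
        return (ref, None, None)
    start, dash, end = span.partition("-")
    # "chr1:50000" is open on the right; "chr1:50000-60000" is bounded.
    return (ref, int(start), int(end) if dash else None)


def parse_ranges(value):
    """Split a space separated ranges value (BAM input) into parsed ranges."""
    return [parse_range(part) for part in value.split()]
```

An end of None stands for "to the end of the reference sequence".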
gz=<0|1>: compress output files using gzip. By default output is
uncompressed.
level=<-1|0|1|9|11>: set compression level of the output
FastQ/FastA files if gz=1. Valid values are
- -1:
- zlib/gzip default compression level
- 0:
- uncompressed
- 1:
- zlib/gzip level 1 (fast) compression
- 9:
- zlib/gzip level 9 (best) compression
If libmaus has been compiled with support for igzip (see
https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data)
then an additional valid value is
- 11:
- igzip compression
fasta=<0|1>: output FastA instead of FastQ if fasta=1.
outputperreadgroup=<0|1>: split output by read group if
outputperreadgroup=1 (default is 0). If splitting by read group is performed
then no output is written on standard output but all data is written to files.
The file names will be generated using the outputdir and
outputperreadgroupsuffix parameters and read group names.
outputdir=<>: output directory if outputperreadgroup=1. By default
the output files are generated in the current directory.
outputperreadgrouprgsm=<0|1>: include SM field of read group in
output filenames if outputperreadgroup=1 (default is 0)
outputperreadgroupprefix=: add given prefix ahead of file names if
outputperreadgroup=1 (default is to add no prefix)
outputperreadgroupsuffixF=<_1.fq>: output file name suffix for first
mates of complete pairs if outputperreadgroup=1. Default is _1.fq if gz=0 and
_1.fq.gz for gz=1.
outputperreadgroupsuffixF2=<_2.fq>: output file name suffix for
second mates of complete pairs if outputperreadgroup=1. Default is _2.fq if
gz=0 and _2.fq.gz for gz=1.
outputperreadgroupsuffixO=<_o1.fq>: output file name suffix for
first mates of incomplete pairs if outputperreadgroup=1. Default is _o1.fq if
gz=0 and _o1.fq.gz for gz=1.
outputperreadgroupsuffixO2=<_o2.fq>: output file name suffix for
second mates of incomplete pairs if outputperreadgroup=1. Default is _o2.fq if
gz=0 and _o2.fq.gz for gz=1.
outputperreadgroupsuffixS=<_s.fq>: output file name suffix for
single end reads if outputperreadgroup=1. Default is _s.fq if gz=0 and
_s.fq.gz for gz=1.
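The default suffixes for the five per read group outputs depend only on gz and can be tabulated as below. The second helper shows one way a file name might be assembled; its layout (output directory, then prefix, then read group ID, then suffix) is an assumption made for illustration, since the exact composition, including where the SM field is placed when outputperreadgrouprgsm=1, is not specified here. Both function names are hypothetical.

```python
import posixpath


def default_suffixes(gz):
    """Default outputperreadgroupsuffix* values for gz=0 or gz=1."""
    ext = ".fq.gz" if gz else ".fq"
    return {
        "F": "_1" + ext,
        "F2": "_2" + ext,
        "O": "_o1" + ext,
        "O2": "_o2" + ext,
        "S": "_s" + ext,
    }


def guess_file_name(outputdir, prefix, read_group_id, suffix):
    """ASSUMED layout: outputdir/prefix + read group ID + suffix."""
    return posixpath.join(outputdir, prefix + read_group_id + suffix)
```

Under that assumed layout, a read group with the hypothetical ID rg1 and outputdir=out would give out/rg1_1.fq for first mates of complete pairs when gz=0.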
tryoq=<0|1>: use content of OQ aux field if present instead of
quality field when converting to FastQ. By default the quality field is used.
This option is currently mutually exclusive with the tags option.
tags=<>: provide a comma separated list of aux fields which will be
copied from the input alignment records to the comment section of the output
FastQ records. By default no aux fields are copied. This option is currently
mutually exclusive with the tryoq option.
split=<0>: split named output files into chunks of this number of
reads. The output file names will be extended by _NNNNNN if gz=0 and by
_NNNNNN.gz if gz=1, where NNNNNN is the zero-based six digit index of the
output file (i.e. numbers start with 000000). The suffixes k, m, g, K, M and
G can be used to denote that the argument is to be multiplied by 1024,
1024^2, 1024^3, 1000, 1000^2 or 1000^3 respectively.
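The suffix arithmetic and the chunk file naming can be sketched as follows. The multipliers and the _NNNNNN naming are taken from the description of split; the function names parse_count and chunk_name are hypothetical, and the base file name in the example is made up.

```python
# Lower case suffixes are binary multiples, upper case ones are decimal.
MULTIPLIERS = {
    "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3,
    "K": 1000, "M": 1000 ** 2, "G": 1000 ** 3,
}


def parse_count(text):
    """Convert a value such as '4k' or '2M' into a number of reads."""
    if text and text[-1] in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)


def chunk_name(base, index, gz):
    """Name of the chunk with the given zero-based index."""
    return f"{base}_{index:06d}" + (".gz" if gz else "")
```

With these definitions split=4k means chunks of 4096 reads and split=4K means chunks of 4000 reads; the first chunk of a file named reads_1.fq is reads_1.fq_000000 for gz=0.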
cols=<>: If set to an unsigned number then wrap the sequence and
quality lines at this number of columns. By default no wrapping is performed.
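The effect of cols on a single FastQ record can be sketched as below. This only illustrates wrapping of the sequence and quality lines at a fixed width; the function names are hypothetical and the record layout is the usual four part FastQ form.

```python
def wrap(line, cols):
    """Split a line into pieces of at most cols characters."""
    if not cols:
        return [line]  # no wrapping requested
    return [line[i:i + cols] for i in range(0, len(line), cols)]


def fastq_record(name, seq, qual, cols=None):
    """Format one FastQ record, wrapping sequence and quality lines."""
    lines = ["@" + name] + wrap(seq, cols) + ["+"] + wrap(qual, cols)
    return "\n".join(lines) + "\n"
```

With cols=4, a six base read is written with its sequence and its quality string each split over two lines.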
splitprefix=<bamtofastq_split>: file prefix if split>0 and
collate=0.
casava18=<0>: produce read names as expected by the c18pe input
option of fastqtobam using the ne aux fields produced by fastqtobam.
maxoutput=<>: produce no more than this number of output records.
By default there is no limit. This option is only active for collate=0.
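Since every option is a key=value pair, an invocation can be assembled programmatically. The sketch below builds an argument list from a dictionary; it assumes that a bamtofastq executable is available on the search path, and the file names used in the example are made up. The helper name build_argv is hypothetical.

```python
def build_argv(options, program="bamtofastq"):
    """Turn a dict of options into an argument list of key=value strings."""
    return [program] + [f"{key}={value}" for key, value in options.items()]


# Example: collate pairs from a BAM file into two FastQ files.
argv = build_argv({
    "filename": "in.bam",   # hypothetical input file
    "collate": 1,
    "F": "r1.fq",           # first mates of pairs
    "F2": "r2.fq",          # second mates of pairs
    "gz": 0,
})
```

The resulting list can be handed to subprocess.run(argv, check=True). Passing a list rather than a shell string keeps values that contain spaces, such as a multi range ranges value, intact as one argument.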
Written by German Tischler.
Report bugs to <[email protected]>
Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research
Limited. License GPLv3+: GNU GPL version 3
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO
WARRANTY, to the extent permitted by law.