bbnorm.sh

NAME

bbnorm.sh - Kmer-based error-correction and normalization tool

SYNOPSIS

bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>

Normalizes read depth based on kmer counts. Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram. However, Tadpole has superior error-correction to BBNorm. Please read bbmap/docs/guides/BBNormGuide.txt for more information.
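As a minimal sketch of how the synopsis maps onto a scripted call, the following Python helper assembles the key=value argument list. The helper name and all file names are hypothetical, and nothing is executed; only option names documented on this page are used.

```python
def build_bbnorm_command(in_path, out_path, toss_path=None, hist_path=None, **options):
    """Return a bbnorm.sh argument list; this helper is illustrative, not part of BBTools."""
    args = ["bbnorm.sh", f"in={in_path}", f"out={out_path}"]
    if toss_path is not None:
        args.append(f"outt={toss_path}")   # reads excluded from primary output
    if hist_path is not None:
        args.append(f"hist={hist_path}")   # input kmer depth histogram
    # Any further key=value options from the OPTIONS section, e.g. target=100.
    for key, value in options.items():
        args.append(f"{key}={value}")
    return args


# Hypothetical file names, chosen only for the example.
cmd = build_bbnorm_command(
    "reads.fq", "kept.fq",
    toss_path="tossed.fq", hist_path="depth_hist.txt",
    target=100, passes=2,
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a system where bbnorm.sh is on the PATH.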

OPTIONS

Input parameters

in=null: Primary input. Use in2 for paired reads in a second file

in2=null: Second input file for paired reads in two files

extra=null: Additional files to use for input (generating hash table) but not for output

fastareadlen=2^31: Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes

tablereads=-1: Use at most this many reads when building the hashtable (-1 means all)

kmersample=1: Process every nth kmer, and skip the rest

readsample=1: Process every nth read, and skip the rest

interleaved=auto: May be set to true or false to override autodetection of whether the input file is paired and interleaved.

qin=auto: ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Output parameters

out=<file>: File for normalized or corrected reads. Use out2 for paired reads in a second file

outt=<file>: (outtoss) File for reads that were excluded from primary output

reads=-1: Only process this number of reads, then quit (-1 means all)

sampleoutput=t: Use sampling on output as well as input (not used if sample rates are 1)

keepall=f: Set to true to keep all reads (e.g. if you just want error correction).

zerobin=f: Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms. Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that, depending on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.

tmpdir=: This will specify a directory for temp files (only needed for multipass runs). If null, they will be written to the output directory.

usetempdir=t: Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory.

qout=auto: ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).

rename=f: Rename reads based on their kmer depth.

Hashing parameters

k=31: Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)

bits=32: Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^bits. Automatically reduced to 16 in 2-pass mode. Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture the highest-depth kmers.

hashes=3: Number of times each kmer is hashed and stored. Higher is slower. Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.

prefilter=f: True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells.

prehashes=2: Number of hashes for prefilter.

prefilterbits=2: (pbits) Bits per cell in prefilter.

prefiltersize=0.35: Fraction of memory to allocate to prefilter.

buildpasses=1: More passes can sometimes increase accuracy by iteratively removing low-depth kmers

minq=6: Ignore kmers containing bases with quality below this

minprob=0.5: Ignore kmers with overall probability of correctness below this

threads=auto: (t) Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads.

rdk=t: (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once.
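The effect of rdk can be illustrated with a small sketch. It uses exact counts in a Python Counter rather than a bloom filter and does not canonicalize kmers against their reverse complements; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(read_pairs, k, rdk=True):
    """Count kmers over read pairs; with rdk, each kmer counts once per pair."""
    counts = Counter()
    for read1, read2 in read_pairs:
        pair_kmers = list(kmers(read1, k)) + list(kmers(read2, k))
        if rdk:
            # Collapse repeats within the pair before incrementing.
            pair_kmers = set(pair_kmers)
        counts.update(pair_kmers)
    return counts


pairs = [("AAAA", "AAAT")]  # toy pair: AAA occurs three times across both reads
with_rdk = count_kmers(pairs, k=3, rdk=True)
without_rdk = count_kmers(pairs, k=3, rdk=False)
print(with_rdk["AAA"], without_rdk["AAA"])  # prints "1 3"
```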

Normalization parameters

fixspikes=f: (fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions.

target=100: (tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth. For kmer depth Dk, read depth Dr, read length R, and kmer size K: Dr=Dk*(R/(R-K+1))

maxdepth=-1: (max) Reads will not be downsampled when below this depth, even if they are above the target depth.

mindepth=5: (min) Kmers with depth below this number will not be included when calculating the depth of a read.

minkmers=15: (mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'.

percentile=54.0: (dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100.

uselowerdepth=t: (uld) For pairs, use the depth of the lower read as the depth proxy.

deterministic=t: (dr) Generate random numbers deterministically to ensure identical output between multiple runs. May decrease speed with a huge number of threads.

passes=2: (p) 1 pass is the basic mode. 2 passes (default) allows greater accuracy, error detection, and better control of output depth.
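Because the depth parameters in this section are expressed in kmer depth, it can help to convert to and from read depth using the relation given under target, Dr=Dk*(R/(R-K+1)). The following Python sketch implements that relation and its inverse; the function names are illustrative only.

```python
def read_depth_from_kmer_depth(kmer_depth, read_length, k):
    """Dr = Dk * (R / (R - K + 1)), as stated under the target option."""
    if read_length < k:
        raise ValueError("read_length must be at least k")
    return kmer_depth * read_length / (read_length - k + 1)


def kmer_depth_from_read_depth(read_depth, read_length, k):
    """Inverse of the relation above: Dk = Dr * ((R - K + 1) / R)."""
    if read_length < k:
        raise ValueError("read_length must be at least k")
    return read_depth * (read_length - k + 1) / read_length


# With the defaults target=100 and k=31, and an assumed read length of 150:
print(read_depth_from_kmer_depth(100, 150, 31))  # 125.0
print(kmer_depth_from_read_depth(125, 150, 31))  # 100.0
```

So, under these assumptions, a kmer-depth target of 100 corresponds to a read depth of about 125.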

Error detection parameters

hdp=90.0: (highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth.

ldp=25.0: (lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth.

tossbadreads=f: (tbr) Throw away reads detected as containing errors.

requirebothbad=f: (rbb) Only toss bad pairs if both reads are bad.

errordetectratio=125: (edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads.

highthresh=12: (ht) Threshold for high kmer. A high kmer at or above this is considered non-error.

lowthresh=3: (lt) Threshold for low kmer. Kmers at or below this are always considered errors.

Error correction parameters

ecc=f: Set to true to correct errors. NOTE: Tadpole is now preferred for ecc as it does a better job.

ecclimit=3: Correct up to this many errors per read. If more are detected, the read will remain unchanged.

errorcorrectratio=140: (ecr) Adjacent kmers with a depth ratio of at least this much between them will be classified as an error.

echighthresh=22: (echt) Threshold for high kmer. A kmer at this or above may be considered non-error.

eclowthresh=2: (eclt) Threshold for low kmer. Kmers at this and below are considered errors.

eccmaxqual=127: Do not correct bases with quality above this value.

aec=f: (aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3.

cec=f: (conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4.

meo=f: (markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything.

mue=t: (markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'.

overlap=f: (ecco) Error correct by read overlap.

Depth binning parameters

lowbindepth=10: (lbd) Cutoff for low depth bin.

highbindepth=80: (hbd) Cutoff for high depth bin.

outlow=<file>: Pairs in which both reads have a median below lbd go into this file.

outhigh=<file>: Pairs in which both reads have a median above hbd go into this file.

outmid=<file>: All other pairs go into this file.
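The binning rule described above can be sketched as a small Python function. It takes the two per-read median kmer depths as given (how BBNorm computes them is not shown), and the function name is hypothetical.

```python
def depth_bin(median1, median2, lbd=10, hbd=80):
    """Return which output a pair would go to under the rules described above."""
    if median1 < lbd and median2 < lbd:
        return "outlow"    # both reads below the low cutoff
    if median1 > hbd and median2 > hbd:
        return "outhigh"   # both reads above the high cutoff
    return "outmid"        # everything else, including mixed pairs


print(depth_bin(3, 7))      # outlow
print(depth_bin(120, 95))   # outhigh
print(depth_bin(5, 120))    # outmid
```

Note that a pair with one low-depth and one high-depth read lands in outmid, since neither "both" condition holds.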

Histogram parameters

hist=<file>: Specify a file to write the input kmer depth histogram.

histout=<file>: Specify a file to write the output kmer depth histogram.

histcol=3: (histogramcolumns) Number of histogram columns, 2 or 3.

pzc=f: (printzerocoverage) Print lines in the histogram with zero coverage.

histlen=1048576: Max kmer depth displayed in histogram. Also affects statistics displayed, but does not affect normalization.

Peak calling parameters

peaks=<file>: Write the peaks to this file. Default is stdout.

minHeight=2: (h) Ignore peaks shorter than this.

minVolume=5: (v) Ignore peaks with less area than this.

minWidth=3: (w) Ignore peaks narrower than this.

minPeak=2: (minp) Ignore peaks with an X-value below this.

maxPeak=BIG: (maxp) Ignore peaks with an X-value above this.

maxPeakCount=8: (maxpc) Print up to this many peaks (prioritizing height).
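As a rough sketch of how these filters combine, the following Python code applies them to a list of already-detected peaks. The Peak record layout is hypothetical, peak detection itself is not shown, and returning the survivors in descending height order is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    x: int       # kmer depth at the peak (the X-value)
    height: int
    volume: int  # area under the peak
    width: int


def filter_peaks(peaks, min_height=2, min_volume=5, min_width=3,
                 min_peak=2, max_peak=float("inf"), max_peak_count=8):
    """Drop peaks failing any threshold, then keep the tallest max_peak_count."""
    kept = [
        p for p in peaks
        if p.height >= min_height
        and p.volume >= min_volume
        and p.width >= min_width
        and min_peak <= p.x <= max_peak
    ]
    kept.sort(key=lambda p: p.height, reverse=True)  # prioritize height
    return kept[:max_peak_count]


# Invented values, for illustration only.
candidates = [
    Peak(x=1, height=500, volume=900, width=4),     # X-value below minPeak
    Peak(x=30, height=120, volume=2000, width=12),
    Peak(x=60, height=40, volume=700, width=9),
    Peak(x=200, height=1, volume=3, width=2),       # too short, small, and narrow
]
print([p.x for p in filter_peaks(candidates)])  # [30, 60]
```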

Java parameters

-Xmx: This will set Java's memory usage, overriding autodetection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom: This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da: Disable assertions.
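For choosing an explicit -Xmx value, the 85% guideline above can be turned into a flag with a short Python sketch. The function name is hypothetical, the physical memory size is passed in rather than detected, and rounding down to a whole unit is a choice made here, not documented behavior.

```python
def xmx_flag(physical_bytes, fraction=0.85):
    """Build an -Xmx flag from total RAM, rounding down to whole GiB or MiB."""
    usable = int(physical_bytes * fraction)
    gib = usable // 1024**3
    if gib >= 1:
        return f"-Xmx{gib}g"
    return f"-Xmx{usable // 1024**2}m"


print(xmx_flag(64 * 1024**3))   # -Xmx54g
print(xmx_flag(512 * 1024**2))  # -Xmx435m
```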

AUTHOR

Written by Brian Bushnell (Last modified October 19, 2017)

Please contact Brian Bushnell at [email protected] if you encounter any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057

This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.

April 2019

bbnorm.sh 38.43