NAME

grinder - Versatile omics shotgun and amplicon sequencing read simulator

DESCRIPTION

Usage:

grinder -rf <reference_file> | -reference_file <reference_file> | -gf <reference_file> | -genome_file <reference_file> [cli optional arguments]
grinder --help
grinder --man
grinder --usage
grinder --version

Cli required arguments:

-rf <reference_file> | -reference_file <reference_file> | -gf <reference_file> | -genome_file <reference_file>
FASTA file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '-' to read them from the standard input. See the README file for examples of databases you can use and where to get them from. Default: -

Cli optional arguments:

-tr <total_reads> | -total_reads <total_reads>
Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: 100
-cf <coverage_fold> | -coverage_fold <coverage_fold>
Desired fold coverage of the input reference sequences (the output FASTA length divided by the input FASTA length). Do not specify this if you specify the number of reads directly.
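The relation between fold coverage and read count can be sketched as follows. This is a minimal illustration that assumes coverage is the total length of the reads divided by the total length of the references, as described above; the function name and the rounding up are assumptions, not Grinder's actual code.

```python
import math

def reads_for_coverage(coverage_fold, reference_lengths, mean_read_length):
    """Estimate how many reads are needed to reach a given fold coverage.

    Hypothetical helper: assumes coverage = (reads * read length) / total
    reference length, and rounds the result up to a whole read.
    """
    total_reference = sum(reference_lengths)
    return math.ceil(coverage_fold * total_reference / mean_read_length)

# Three references totalling 10,000 bp, 100 bp reads, 0.5x coverage
print(reads_for_coverage(0.5, [5000, 3000, 2000], 100))  # 50
```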
-rd <read_dist>... | -read_dist <read_dist>...
Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation.
Only the first element is required. Examples:
All reads exactly 101 bp long (Illumina GA 2x): 101
Uniform read distribution around 100+-10 bp: 100 uniform 10
Reads normally distributed with an average of 800 and a standard deviation of 100 bp (Sanger reads): 800 normal 100
Reads normally distributed with an average of 450 and a standard deviation of 50 bp (454 GS-FLX Ti): 450 normal 50
Reference sequences smaller than the specified read length are not used. Default: 100
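To make the three elements of <read_dist> concrete, here is a small sketch of how a read length could be drawn from them. It assumes that 'uniform' means any length between average minus spread and average plus spread (following the 100+-10 example) and that 'normal' lengths are rounded to whole bases; the function name is hypothetical and this is not Grinder's implementation.

```python
import random

def sample_read_length(mean, model=None, spread=0.0, rng=random):
    """Draw one read length from a <read_dist>-style specification."""
    if model is None or spread == 0:
        # Only the first element was given: all reads have the same length
        return int(mean)
    if model == "uniform":
        # Assumption: uniform over [mean - spread, mean + spread]
        return rng.randint(int(mean - spread), int(mean + spread))
    if model == "normal":
        # Assumption: round to whole bases and never go below 1 bp
        return max(1, round(rng.gauss(mean, spread)))
    raise ValueError(f"unknown distribution: {model}")

rng = random.Random(42)
print(sample_read_length(101))                      # always 101
print(sample_read_length(100, "uniform", 10, rng))  # between 90 and 110
print(sample_read_length(450, "normal", 50, rng))   # around 450
```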
-id <insert_dist>... | -insert_dist <insert_dist>...
Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them:
0: off, or
insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs).
Two distinct reads are generated whether or not the mate pair overlaps. Default: 0
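Because the insert length includes both reads, the stretch of DNA between the two reads follows by subtraction. The helper below is only an illustration of that definition (its name is hypothetical); a negative result means the two reads overlap.

```python
def inner_distance(insert_length, read1_length, read2_length):
    """Length of the unsequenced DNA between the two reads of a pair.

    Follows the 'biological' insert definition: insert = read 1 + gap + read 2.
    A negative value means the reads overlap by that many bases.
    """
    return insert_length - read1_length - read2_length

print(inner_distance(2500, 100, 100))  # 2300 bp between the mates
print(inner_distance(150, 100, 100))   # -50, i.e. a 50 bp overlap
```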
-mo <mate_orientation> | -mate_orientation <mate_orientation>
When generating paired-end or mate-pair reads (see <insert_dist>), specify the orientation of the reads (F: forward, R: reverse):
FR:
---> <--- e.g. Sanger, Illumina paired-end, IonTorrent mate-pair
FF:
---> ---> e.g. 454
RF:
<--- ---> e.g. Illumina mate-pair
RR:
<--- <---
Default: FR
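The four orientations can be illustrated with a short sketch. It assumes both ends of the insert are first read off the forward strand and that 'R' means the corresponding read is reverse-complemented; the function names are hypothetical and this is not Grinder's code.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def orient_pair(left, right, orientation="FR"):
    """Orient the two ends of an insert according to FR, FF, RF or RR.

    'left' and 'right' are the two insert ends as they appear on the
    forward strand (assumption made for this illustration).
    """
    if orientation not in ("FR", "FF", "RF", "RR"):
        raise ValueError(f"unknown orientation: {orientation}")
    first = left if orientation[0] == "F" else reverse_complement(left)
    second = right if orientation[1] == "F" else reverse_complement(right)
    return first, second

print(orient_pair("AAC", "GGT", "FR"))  # ('AAC', 'ACC')
print(orient_pair("AAC", "GGT", "RF"))  # ('GTT', 'GGT')
```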
-ec <exclude_chars> | -exclude_chars <exclude_chars>
Do not create reads containing any of the specified characters (case-insensitive). For example, use 'NX' to prevent reads with ambiguities (N or X). Grinder will exit with an error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using <delete_chars>, which may be more appropriate for your case. Default: ''
-dc <delete_chars> | -delete_chars <delete_chars>
Remove the specified characters from the reference sequences (case-insensitive), e.g. '-~*' to remove gaps (- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than <exclude_chars>. Default:
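The difference between <delete_chars> and <exclude_chars> can be sketched as follows: the first cleans each reference once, the second has to test every candidate read. Both helpers are hypothetical illustrations, not Grinder's implementation.

```python
def delete_chars(sequence, chars):
    """Strip unwanted characters from a reference, ignoring case.

    Done once per reference sequence, before any read is taken.
    """
    unwanted = set(chars.lower()) | set(chars.upper())
    return "".join(c for c in sequence if c not in unwanted)

def read_is_allowed(read, exclude_chars):
    """Return False if the read contains any excluded character (any case).

    Has to be checked for every candidate read, hence the higher cost.
    """
    unwanted = set(exclude_chars.upper())
    return not any(c in unwanted for c in read.upper())

print(delete_chars("AC-GT~A*", "-~*"))  # ACGTA
print(read_is_allowed("ACnT", "NX"))    # False
```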
-fr <forward_reverse> | -forward_reverse <forward_reverse>
Use DNA amplicon sequencing using a forward and reverse PCR primer sequence provided in a FASTA file. The reference sequences and their reverse complement will be searched for PCR primer matches. The primer sequences should use the IUPAC convention for degenerate residues. Reference sequences that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use <copy_bias> = 1 and <length_bias> = 0 to generate amplicon reads. To sequence from the forward strand, set <unidirectional> to 1 and put the forward primer first and reverse primer second in the FASTA file. To sequence from the reverse strand, invert the primers in the FASTA file and use <unidirectional> = -1. The second primer sequence in the FASTA file is always optional. Example: AAACTYAAAKGAATTGRCGG and ACGGGCGGTGTGTRC for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene.
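Matching a degenerate primer against a reference amounts to expanding each IUPAC code into the set of bases it stands for. The sketch below does this with a regular expression; it only illustrates the idea (the function name is hypothetical, the target sequence is made up, and it does not reproduce how Grinder searches for primers).

```python
import re

# Standard IUPAC nucleotide codes and the bases they represent
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))

forward = primer_to_regex("AAACTYAAAKGAATTGRCGG")  # 926F primer from the text
reference = "TT" + "AAACTCAAATGAATTGACGG" + "CC"   # made-up reference
match = forward.search(reference)
print(match.start() if match else None)  # 2
```

To cover the reverse strand as well, the same search would be repeated on the reverse complement of the reference.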
-un <unidirectional> | -unidirectional <unidirectional>
Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), -1 (reverse). Use <unidirectional> = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: 0
-lb <length_bias> | -length_bias <length_bias>
In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: 1
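The effect of <length_bias> on sampling can be sketched like this: with the bias on, the chance of drawing a read from a reference scales with its relative abundance times its length. The function name is hypothetical and the normalisation is an assumption for illustration.

```python
def sampling_weights(abundances, lengths, length_bias=True):
    """Probability of drawing a read from each reference sequence.

    With length_bias on, weight = relative abundance * sequence length;
    with it off, weight = relative abundance only.
    """
    raw = [a * (l if length_bias else 1) for a, l in zip(abundances, lengths)]
    total = sum(raw)
    return [w / total for w in raw]

# Two genomes at equal abundance, one three times larger than the other
print(sampling_weights([1, 1], [3000, 1000]))         # [0.75, 0.25]
print(sampling_weights([1, 1], [3000, 1000], False))  # [0.5, 0.5]
```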
-cb <copy_bias> | -copy_bias <copy_bias>
In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: 1
-md <mutation_dist>... | -mutation_dist <mutation_dist>...
Introduce sequencing errors in the reads, in the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the error rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynomial 3e-3 + 3.3e-8 * i^4 (Korbel et al 2009), use: poly4 3e-3 3.3e-8. Use the <mutation_ratio> option to alter how many of these mutations are substitutions or indels. Default: uniform 0 0
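The three error models can be written out as a per-position error rate. The sketch below assumes that i is the 1-based position in the read, that all values are percentages, and that the linear model interpolates between the 5' and 3' values; these are assumptions for illustration, and the function name is hypothetical.

```python
def error_rate_percent(model, params, position, read_length):
    """Error rate (in %) at a 1-based position of a read."""
    if model == "uniform":
        # Same rate everywhere, e.g. 'uniform 0.1'
        return params[0]
    if model == "linear":
        # e.g. 'linear 1 2': 1% at the 5' end, 2% at the 3' end
        start, end = params
        if read_length == 1:
            return start
        return start + (end - start) * (position - 1) / (read_length - 1)
    if model == "poly4":
        # e.g. 'poly4 3e-3 3.3e-8': a + b * i^4
        a, b = params
        return a + b * position ** 4
    raise ValueError(f"unknown model: {model}")

print(error_rate_percent("linear", (1, 2), 1, 100))    # 1.0
print(error_rate_percent("linear", (1, 2), 100, 100))  # 2.0
```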
-mr <mutation_ratio>... | -mutation_ratio <mutation_ratio>...
Indicate the percentage of substitutions and the percentage of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the <mutation_dist> option. Default: 80 20
-hd <homopolymer_dist> | -homopolymer_dist <homopolymer_dist>
Introduce sequencing errors in the reads in the form of homopolymeric stretches (e.g. AAA, CCCCC) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is a function of the homopolymer length n:
Margulies: N(n, 0.15 * n), Margulies et al. 2005.
Richter: N(n, 0.15 * sqrt(n)), Richter et al. 2008.
Balzer: N(n, 0.03494 + n * 0.06856), Balzer et al. 2010.
Default: 0
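The three homopolymer models differ only in how the standard deviation grows with the homopolymer length n. The sketch below computes that standard deviation and draws an erroneous length; rounding to whole bases and clamping at zero are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random

def homopolymer_sd(model, n):
    """Standard deviation of the observed length for a homopolymer of length n."""
    if model == "Margulies":
        return 0.15 * n
    if model == "Richter":
        return 0.15 * math.sqrt(n)
    if model == "Balzer":
        return 0.03494 + n * 0.06856
    raise ValueError(f"unknown model: {model}")

def sample_homopolymer_length(model, n, rng=random):
    """Draw an observed homopolymer length from N(n, sd(n)).

    Assumption: the draw is rounded to whole bases and cannot be negative.
    """
    return max(0, round(rng.gauss(n, homopolymer_sd(model, n))))

for name in ("Margulies", "Richter", "Balzer"):
    print(name, homopolymer_sd(name, 4))
```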
-cp <chimera_perc> | -chimera_perc <chimera_perc>
Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the ID of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: 0 %
-cd <chimera_dist>... | -chimera_dist <chimera_dist>...
Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the <chimera_perc> option. Default: 314 38 1
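The values given to <chimera_dist> are relative weights, and the percentages quoted above follow from normalising them. A minimal check (the function name is hypothetical):

```python
def chimera_percentages(weights):
    """Convert relative chimera weights into percentages."""
    total = sum(weights)
    return [100 * w / total for w in weights]

bimeras, trimeras, quadrameras = chimera_percentages([314, 38, 1])
print(round(bimeras), round(trimeras), round(quadrameras, 1))  # 89 11 0.3
```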
-ck <chimera_kmer> | -chimera_kmer <chimera_kmer>
Activate a method to form chimeras by picking breakpoints at places where k-mers are shared between sequences. <chimera_kmer> represents k, the length of the k-mers (in bp). The longer the k-mer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k-mer will be chosen. For example, CHSIM (Edgar et al. 2011) uses this method with a k-mer length of 10 bp. If you do not want to use k-mer information to form chimeras, use 0; the reference sequences and breakpoints are then taken randomly on the "aligned" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the <chimera_perc> option. Also, this option is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: 10 bp
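The idea of k-mer based breakpoints can be sketched as follows: index the k-mers of two sequences, keep those they share, and join the sequences at a shared k-mer. This toy version ignores relative abundance and k-mer frequency, uses hypothetical function names, and is not Grinder's implementation.

```python
def kmer_positions(seq, k):
    """Map each k-mer of a sequence to the list of positions where it starts."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions

def shared_kmers(seq_a, seq_b, k):
    """K-mers present in both sequences, with their start positions in each."""
    pos_a, pos_b = kmer_positions(seq_a, k), kmer_positions(seq_b, k)
    return {kmer: (pos_a[kmer], pos_b[kmer]) for kmer in pos_a if kmer in pos_b}

def make_bimera(seq_a, seq_b, k, start_a, start_b):
    """Join the 5' part of seq_a to the 3' part of seq_b at a shared k-mer."""
    return seq_a[:start_a + k] + seq_b[start_b + k:]

a, b = "AAACGTTT", "CCACGTGG"
print(shared_kmers(a, b, 4))       # {'ACGT': ([2], [2])}
print(make_bimera(a, b, 4, 2, 2))  # AAACGTGG
```

Storing every k-mer of every reference in a lookup table like this also suggests why the option is memory intensive.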
-af <abundance_file> | -abundance_file <abundance_file>
Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries.
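A file in this format can be read with a few lines of code. The sketch assumes whitespace-separated fields, one abundance column per library, and no header or comment lines; the function name is hypothetical.

```python
def parse_abundances(lines):
    """Parse 'name abundance [abundance ...]' lines into a dictionary.

    Returns {sequence name: [abundance for library 1, library 2, ...]}.
    """
    abundances = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        abundances[fields[0]] = [float(value) for value in fields[1:]]
    return abundances

print(parse_abundances(["seqABC 82.1 10.2", "seqXYZ 17.9 89.8"]))
```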
-am <abundance_model>... | -abundance_model <abundance_model>...
Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples:
uniform distribution: uniform
powerlaw distribution with parameter 0.1: powerlaw 0.1
exponential distribution with automatically chosen parameter: exponential
Default: uniform 1
-nl <num_libraries> | -num_libraries <num_libraries>
Number of independent libraries to create. Specify how diverse and similar they should be with <diversity>, <shared_perc> and <permuted_perc>. Assign them different MID tags with <multiplex_ids>. Default: 1
-mi <multiplex_ids> | -multiplex_ids <multiplex_ids>
Specify an optional FASTA file that contains multiplex sequence identifiers (a.k.a. MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the -read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate MID sequences.
-di <diversity>... | -diversity <diversity>...
This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: 0
-sp <shared_perc> | -shared_perc <shared_perc>
This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: 0 %
-pp <permuted_perc> | -permuted_perc <permuted_perc>
This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences whose rank-abundance should be permuted (randomly shuffled). Default: 100 %
-rs <random_seed> | -random_seed <random_seed>
Seed number to use for the pseudo-random number generator.
-dt <desc_track> | -desc_track <desc_track>
Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: 1
-ql <qual_levels>... | -qual_levels <qual_levels>...
Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default:
-fq <fastq_output> | -fastq_output <fastq_output>
Whether to write the generated reads in FASTQ format (with Sanger-encoded quality scores) instead of FASTA and QUAL (1: yes, 0: no). <qual_levels> needs to be specified for this option to be effective. Default: 0
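Sanger-encoded FASTQ stores each quality score as the ASCII character with code score + 33. The sketch below builds a record with a good score of 30 and a bad score of 10, as in the <qual_levels> example; the function name and the way error positions are passed in are hypothetical.

```python
def fastq_record(read_id, sequence, error_positions, good=30, bad=10):
    """Format one read as a Sanger-encoded FASTQ record.

    error_positions: 0-based positions of residues created by an insertion
    or substitution, which receive the bad score.
    """
    scores = [bad if i in error_positions else good for i in range(len(sequence))]
    quality = "".join(chr(score + 33) for score in scores)  # Phred+33 encoding
    return f"@{read_id}\n{sequence}\n+\n{quality}\n"

print(fastq_record("read1", "ACGT", {2}), end="")
```

With these scores, good residues appear as '?' (30 + 33 = 63) and bad residues as '+' (10 + 33 = 43).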
-bn <base_name> | -base_name <base_name>
Prefix of the output files. Default: grinder
-od <output_dir> | -output_dir <output_dir>
Directory where the results should be written. This folder will be created if needed. Default: .
-pf <profile_file> | -profile_file <profile_file>
A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt':
# A simple Grinder profile
-read_dist 105 normal 12
-total_reads 1000
Running: grinder -reference_file viral_genomes.fa -profile_file simple_profile.txt
Translates into: grinder -reference_file viral_genomes.fa -read_dist 105 normal 12 -total_reads 1000
Note that the arguments specified in the profile should not be specified again on the command line.
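The way a profile expands into command-line arguments can be sketched as follows. The helper assumes that comment lines start with '#' and that arguments are split like shell words; it is an illustration with a hypothetical function name, not Grinder's parser.

```python
import shlex

def profile_to_args(text):
    """Turn the contents of a profile file into a flat argument list."""
    args = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        args.extend(shlex.split(line))
    return args

profile = """# A simple Grinder profile
-read_dist 105 normal 12
-total_reads 1000
"""
command = ["grinder", "-reference_file", "viral_genomes.fa"] + profile_to_args(profile)
print(" ".join(command))
```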

SEE ALSO

grinder(7), average_genome_size(1) and change_paired_read_orientation(1).
