artfastqgenerator - outputs artificial FASTQ files derived from a reference
genome
artfastqgenerator -O <outputPath>
-R <referenceGenomePath>
-S <startSequenceIdentifier>
-F1 <fastq1ForQualityScores>
-F2 <fastq2ForQualityScores>
-CMGCS <coverageMeanGCcontentSpread>
-CMP <coverageMeanPeak>
-CMPGC <coverageMeanPeakGCcontent>
-CSD <coverageSD>
-E <endSequenceIdentifier>
-GCC <GCcontentBasedCoverage>
-GCR <GCcontentRegionSize>
-L <logRegionStats>
-N <nucleobaseBufferSize>
-OF <outputFormat>
-RCNF <readsContainingNfilter>
-RL <readLength>
-SE <simulateErrorInRead>
-TLM <templateLengthMean>
-TLSD <templateLengthSD>
-URQS <useRealQualityScores>
-X <xStart>
-Y <yStart>
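
As an illustration of the synopsis above, the following Python sketch assembles an argument list with the three mandatory options (-O, -R, -S) plus a few optional ones. The helper name build_command, the paths, and the sequence identifier are hypothetical, and writing boolean values as lowercase "true"/"false" is an assumption based on how the defaults are spelled in this page.

```python
import shlex


def build_command(output_path, reference_path, start_id, **options):
    """Assemble an artfastqgenerator argument list.

    The three positional parameters map to the mandatory -O, -R and -S
    options. Every keyword argument is passed through as "-KEY value".
    """
    argv = [
        "artfastqgenerator",
        "-O", output_path,
        "-R", reference_path,
        "-S", start_id,
    ]
    for flag, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are written as lowercase true/false.
            value = "true" if value else "false"
        argv += [f"-{flag}", str(value)]
    return argv


# Hypothetical paths and sequence identifier, for illustration only.
argv = build_command("out/sample", "ref/genome.fasta", ">chr1",
                     RL=76, TLM=210, SE=False)
# shlex.join quotes arguments such as ">chr1" so the shell does not
# treat the ">" as a redirection.
print(shlex.join(argv))
```

The list form can be handed to subprocess.run directly, which avoids shell quoting altogether.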
ArtificialFastqGenerator takes the reference genome (in FASTA format) as input
and outputs artificial FASTQ files in the Sanger format. It can accept Phred
base quality scores from existing FASTQ files, and use them to simulate
sequencing errors. Since the artificial FASTQs are derived from the reference
genome, the reference genome provides a gold standard for calling variants:
Single Nucleotide Polymorphisms (SNPs) and insertions and deletions (indels).
This enables evaluation of a Next Generation Sequencing (NGS) analysis
pipeline which aligns reads to the reference genome and then calls the
variants.
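
To make the link between Phred quality scores and simulated sequencing errors concrete, here is a minimal Python sketch. It relies on two standard facts: Sanger FASTQ stores a quality Q as the character with ASCII code Q + 33, and Q corresponds to an error probability of 10^(-Q/10). The substitution rule (replace the base with a uniformly chosen different base) and the function names are assumptions for illustration; this page does not describe the program's actual error model.

```python
import random

PHRED_OFFSET = 33  # Sanger FASTQ encodes quality Q as chr(Q + 33)


def error_probability(quality_char):
    """Convert one Sanger-encoded quality character to an error probability."""
    q = ord(quality_char) - PHRED_OFFSET
    return 10 ** (-q / 10)


def simulate_errors(bases, qualities, rng):
    """Substitute each A/C/G/T base with probability given by its quality.

    Assumption: an error replaces the base with one of the three other
    bases, chosen uniformly. Non-ACGT symbols such as N are left as-is.
    """
    out = []
    for base, qual in zip(bases, qualities):
        if base in "ACGT" and rng.random() < error_probability(qual):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so repeated runs give the same reads
read = "ACGTACGTAC"
# 'I' encodes Q40 (error probability 1e-4), '!' encodes Q0 (probability 1).
print(simulate_errors(read, "I" * len(read), rng))
print(simulate_errors(read, "!" * len(read), rng))
```

With quality '!' every base has error probability 1, so the second read differs from the original at every position.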
-h
    Print usage help.

-O <outputPath>
    Path for the artificial FASTQ and log files, including their base name
    (must be specified).

-R <referenceGenomePath>
    Reference genome sequence file (must be specified).

-S <startSequenceIdentifier>
    Prefix of the sequence identifier in the reference after which read
    generation should begin (must be specified).

-F1 <fastq1ForQualityScores>
    First FASTQ file to use for real quality scores (must be specified if
    useRealQualityScores = true).

-F2 <fastq2ForQualityScores>
    Second FASTQ file to use for real quality scores (must be specified if
    useRealQualityScores = true).

-CMGCS <coverageMeanGCcontentSpread>
    The spread of coverage mean given GC content (default = 0.22).

-CMP <coverageMeanPeak>
    The peak coverage mean for a region (default = 37.7).

-CMPGC <coverageMeanPeakGCcontent>
    The GC content for regions with peak coverage mean (default = 0.45).

-CSD <coverageSD>
    The coverage standard deviation divided by the mean (default = 0.2).

-E <endSequenceIdentifier>
    Prefix of the sequence identifier in the reference where read
    generation should stop (default = end of file).

-GCC <GCcontentBasedCoverage>
    Whether nucleobase coverage is biased by GC content (default = true).

-GCR <GCcontentRegionSize>
    Region size in nucleobases for which to calculate GC content
    (default = 150).

-L <logRegionStats>
    The region size, as a multiple of the nucleobase buffer size (-N), for
    which summary coverage statistics are recorded (default = 2).

-N <nucleobaseBufferSize>
    The number of reference sequence nucleobases to buffer in memory
    (default = 5000).

-OF <outputFormat>
    'default': standard FASTQ output; 'debug_nucleobases(_nuc|read_ids)':
    debugging.

-RCNF <readsContainingNfilter>
    Which reads containing N to filter out: none (0), "all-N" reads (1),
    or "at-least-1-N" reads (2) (default = 0).

-RL <readLength>
    The length of each read (default = 76).

-SE <simulateErrorInRead>
    Whether to simulate error in the read based on the quality scores
    (default = false).

-TLM <templateLengthMean>
    The mean DNA template length (default = 210).

-TLSD <templateLengthSD>
    The standard deviation of the DNA template length (default = 60).

-URQS <useRealQualityScores>
    Whether to use real quality scores from existing FASTQ files or set
    all to the maximum (default = false).

-X <xStart>
    The first read's X coordinate (default = 1000).

-Y <yStart>
    The first read's Y coordinate (default = 1000).
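
The three -RCNF modes described above can be sketched as a small Python predicate. This is one reading of the option text, not the program's own code: the function name is_filtered_out is hypothetical, and treating lowercase "n" the same as "N" is an assumption.

```python
def is_filtered_out(read, mode):
    """Return True if `read` would be discarded under the given -RCNF mode.

    mode 0: keep every read
    mode 1: discard reads that consist only of N
    mode 2: discard reads that contain at least one N
    """
    bases = read.upper()  # assumption: case is ignored
    if mode == 0:
        return False
    if mode == 1:
        return len(bases) > 0 and set(bases) == {"N"}
    if mode == 2:
        return "N" in bases
    raise ValueError(f"unsupported filter mode: {mode}")


reads = ["ACGT", "ANGT", "NNNN"]
for mode in (0, 1, 2):
    kept = [r for r in reads if not is_filtered_out(r, mode)]
    print(mode, kept)
```

Mode 0 keeps all three example reads, mode 1 drops only "NNNN", and mode 2 keeps only "ACGT".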
Any bugs should be reported to
[email protected]
This manpage was written by Andreas Tille for the Debian distribution and can be
used for any other usage of the program.