transterm - Finds rho-independent transcription terminators in bacterial genomes.
transterm -p expterm.dat seq.fasta annotation.ptt > output.tt
Any number of fasta and annotation files can be listed but fasta files should
come before annotation files. The type of the file is determined by the
extension:
.ptt a GenBank ptt annotation file
.coords or .crd a simple annotation file
Each line of a .coords or .crd file has the format:
gene_name start end chrom_id
The chrom_id specifies which sequence the annotation should apply to. For a .ptt
file, the chrom_id is taken to be the filename with the path and extension
removed. A filename with any other extension is assumed to be a fasta file.
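The file-type rules above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the .coords parser assumes whitespace-separated fields, which TransTermHP's own parser may or may not require.

```python
import os

def classify_input(filename):
    """Return 'ptt', 'coords', or 'fasta' based on the file extension."""
    ext = os.path.splitext(filename)[1]
    if ext == ".ptt":
        return "ptt"
    if ext in (".coords", ".crd"):
        return "coords"
    return "fasta"  # any other extension is assumed to be fasta

def ptt_chrom_id(filename):
    """chrom_id for a .ptt file: the filename with path and extension removed."""
    return os.path.splitext(os.path.basename(filename))[0]

def parse_coords_line(line):
    """Split 'gene_name start end chrom_id' into typed fields.

    Assumes whitespace-separated fields (an assumption of this sketch).
    """
    name, start, end, chrom_id = line.split()
    return name, int(start), int(end), chrom_id
```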
When processing an annotation for a chromosome with id = ID, the first word of
each '>' line in the input sequences is searched for ID. Because there is
no good standard for how the '>' line is formatted, several heuristics are
tried to find ID in the '>' line. In the order tried, they are:
>ID
>junk|cmr:ID|junk or >junk|ID|junk
>junk|gi|ID|junk or >junk|gi|ID.junk|junk
>junk:ID
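One possible reading of these heuristics, written as a Python sketch. The exact matching rules TransTermHP applies are not spelled out here, so the function name, the returned labels, and the details of each test (for example, treating '|' as a field separator) are assumptions.

```python
def match_header(header, chrom_id):
    """Return a label for the first heuristic that finds chrom_id, else None.

    Only the first word of the '>' line is examined, as described above.
    """
    words = header.lstrip(">").split()
    if not words:
        return None
    word = words[0]
    fields = word.split("|")

    # 1. >ID
    if word == chrom_id:
        return ">ID"
    # 2. >junk|cmr:ID|junk or >junk|ID|junk
    if "cmr:" + chrom_id in fields or chrom_id in fields:
        return "junk|ID|junk"
    # 3. >junk|gi|ID|junk or >junk|gi|ID.junk|junk
    for prev, cur in zip(fields, fields[1:]):
        if prev == "gi" and (cur == chrom_id or cur.startswith(chrom_id + ".")):
            return "junk|gi|ID|junk"
    # 4. >junk:ID
    if word.endswith(":" + chrom_id):
        return "junk:ID"
    return None
```

Returning the label of the matching heuristic, rather than a plain boolean, makes it easier to see why a given annotation was attached to a given sequence.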
The option '-p expterm.dat' uses the newest confidence scheme, where expterm.dat
is the path to the file of that name supplied with TransTermHP. If '-p
expterm.dat' is omitted, the version 1.0 confidence scheme is used. See section
'COMMAND LINE OPTIONS' for more detail.
The organism's genes are listed sorted by their end coordinate and terminators
are output between them. A terminator entry looks like this:
TERM 19  15310 - 15327  -      F     99     -12.7 -4.0   |bidir
(name)   (start - end)  (sense)(loc) (conf) (hp)  (tail) (notes)
where 'conf' is the overall confidence score, 'hp' is the hairpin score, and
'tail' is the tail score. 'Conf' (which ranges from 0 to 100) is what you
probably want to use to assess the quality of a terminator. Higher is better.
The confidence, hp score, and tail scores are described in the paper cited
above. 'Loc' gives the type of region the terminator is in:
'G' = in the interior of a gene (at least 50bp from an end),
'F' = between two +strand genes,
'R' = between two -strand genes,
'T' = between the ends of a +strand gene and a -strand gene,
'H' = between the starts of a +strand gene and a -strand gene,
'N' = none of the above (for the start and end of the DNA)
Because of how overlapping genes are handled, these designations are not
exclusive. 'G', 'F', or 'R' can also be given in lowercase, indicating that
the terminator is on the opposite strand from the region. Unless the
--all-context option is given, only candidate terminators that appear to be in
an appropriate genome context (e.g. T, F, R) are output.
Following the TERM line is the sequence of the hairpin and the 5' and 3' tails,
always written 5' to 3'.
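For downstream filtering, a TERM line can be split into the fields named above. The following Python sketch assumes the whitespace-separated layout of the example line; the `Terminator` tuple and `parse_term_line` are hypothetical names, not part of TransTermHP.

```python
from collections import namedtuple

Terminator = namedtuple(
    "Terminator", "name start end sense loc conf hp tail notes")

def parse_term_line(line):
    """Parse one TERM line into its named fields.

    Assumes the layout of the example above:
    TERM <n> <start> - <end> <sense> <loc> <conf> <hp> <tail> [notes]
    """
    tokens = line.split()
    if len(tokens) < 10 or tokens[0] != "TERM" or tokens[3] != "-":
        raise ValueError("not a TERM line: %r" % line)
    return Terminator(
        name=" ".join(tokens[0:2]),
        start=int(tokens[2]),
        end=int(tokens[4]),
        sense=tokens[5],
        loc=tokens[6],
        conf=int(tokens[7]),
        hp=float(tokens[8]),
        tail=float(tokens[9]),
        notes=" ".join(tokens[10:]),  # empty string when no notes are present
    )
```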
You can also set how large a hairpin must be to be considered:
--min-stem=n Stem must be n nucleotides long
--min-loop=n Loop portion of the hairpin must be at least n long
You can also set the maximum size of the hairpin that will be found:
--max-len=n Total extent of hairpin <= n NT long
--max-loop=n The loop portion can be no longer than n
The maximum length is the total length for the hairpin portion (2 stems, 1 loop)
and does not include the U-tail. It's measured in nucleotides in the input
sequence, so because of gaps, the actual structure may be longer than max-len.
Max-len must be less than the compiled-in constant REALLY_MAX_UP (which by
default is 1000). To increase the size of structures found, recompile after
increasing this constant.
TransTermHP assigns a score to the hairpin and tail portions of potential
terminators. Lower scores are considered better. Many of the constants used in
scoring hairpins can be set from the command line:
--gc=f Score of a G-C pair
--au=f Score of an A-U pair
--gu=f Score of a G-U pair
--mm=f Score of any other pair
--gap=f Score of a gap in the hairpin
The cost of loops of various lengths can be set using:
--loop-penalty=f1,f2,f3,f4,f5,...fn
where f1 is the cost of a loop of length --min-loop, f2 is the cost of a loop of
length --min-loop+1, and so on. If there are too few terms to cover up to
max-loop, the last term is repeated. Thus --loop-penalty=0,2 would assign cost
0 to any loop of length min-loop, and 2 to any longer loop (up to max-loop,
after which longer loops are given infinite scores). Extra terms are ignored.
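The lookup rule for --loop-penalty can be written out as a short Python sketch. The function name is hypothetical, the penalty list is assumed to be non-empty, and giving loops shorter than min-loop an infinite cost is an assumption of this sketch (such loops are not considered at all).

```python
import math

def loop_cost(length, penalties, min_loop, max_loop):
    """Cost of a loop of the given length under --loop-penalty=f1,f2,...

    penalties[0] applies to length min_loop, penalties[1] to min_loop+1,
    and so on; the last term is repeated up to max_loop. Terms beyond
    max_loop are never reached, so they are effectively ignored.
    """
    if length < min_loop or length > max_loop:
        return math.inf  # out-of-range loops get an infinite score
    index = min(length - min_loop, len(penalties) - 1)
    return penalties[index]
```

For example, with the illustrative values min_loop=3 and max_loop=13, `loop_cost(3, [0, 2], 3, 13)` is 0 and `loop_cost(7, [0, 2], 3, 13)` is 2.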
Note that if you are using the --pval-conf confidence scheme (see below), you
must regenerate the expterm.dat file if you change any of the above constants.
To weed out any potential terminator with tail or hairpin scores that are too
large, you can use the following options:
--max-hp-score=f Maximum allowable hairpin score
--max-tail-score=f Maximum allowable tail score
Terminator hairpins must be adjacent to a "U-rich" region. You can
adjust the constants that define what constitutes a U-rich region. Using the
options:
--uwin-size=s
--uwin-require=r
requires that there are at least r 'U' nucleotides in the s-nucleotide-long
window adjacent to the hairpin. Again, if you change these constants, you
should regenerate expterms.dat.
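The U-rich test amounts to counting U's in a fixed window. A minimal Python sketch, assuming `tail` is the sequence that starts at the nucleotide adjacent to the hairpin and that 'T' in DNA input counts as 'U'; the function name and the way the window is positioned are assumptions.

```python
def is_u_rich(tail, uwin_size, uwin_require):
    """True if the first uwin_size nucleotides of tail contain at least
    uwin_require U's (T's in DNA input are counted as U's)."""
    window = tail[:uwin_size].upper()
    u_count = window.count("U") + window.count("T")
    return u_count >= uwin_require
```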
Before the main output, TransTermHP will output the values of the above options
in a format suitable to be used on the command line.
In addition to the tail and hairpin scores, each possible terminator is assigned
a confidence --- a value between 0 and 100 that indicates how likely it is
that the sequence is a terminator. The scoring scheme needs a background file
(supplied with TransTermHP) that is specified using:
--pval-conf expterms.dat
This will use the distribution in the file expterms.dat as the background. (You
can abbreviate this as "-p expterms.dat".) Though the supplied
expterms.dat file is derived from random sequences, any background
distribution can be used by supplying your own expterms.dat file. See below
for the format of expterms.dat. The values in expterms.dat depend on the
scoring constants, definition of u-rich regions, and the maximum allowed tail
and hp scores. Thus, if you change any of these constants using the options
above, you should regenerate expterms.dat.
The main output of TransTermHP is a list of terminators interleaved between a
listing of the gene annotations that were provided as input. This output can
be customized in a few ways:
-S Don't output the terminator sequences
--min-conf=n Only output terminators with confidence >= n (can
abbreviate this as -c n; default is 76.)
Additional analysis output can be obtained with the following options:
--bag-output file.bag Output the Best terminator After Gene
--t2t-perf file.t2t Output a summary of which tail-to-tail regions
have good terminators
As mentioned above, if you change any of the basic scoring function or search
parameters and are using the version 2.0 confidence scheme (recommended), then
you have to recompute the values in the expterm.dat file. If you have Python
installed, this is easy (though perhaps time-consuming). You can issue the
command:
% calibrate.sh newexpterms.dat [OPTIONS TO TRANSTERM]
where "[OPTIONS TO TRANSTERM]" are TransTermHP options (discussed
above) that set the parameters to what you want them to be. After calibrate.sh
finishes, newexpterms.dat will be in the current directory and can serve as an
argument to -p when using the same parameters you passed to calibrate.sh.
Note that for the newexpterms.dat to be valid, you must supply the same basic
parameters to TransTermHP on subsequent runs. TransTermHP (or newexpterms.dat)
will not remember these parameters for you. The best way to handle this is to
make a shell script wrapper around transterm that always passes in your new
parameters.
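The wrapper idea can also be expressed in Python instead of shell. The sketch below only illustrates the pattern: the pinned option values are placeholders, and `build_command` and `run` are hypothetical names. The options listed must be exactly the ones that were passed to calibrate.sh.

```python
import subprocess

# Placeholder settings: replace these with the options given to calibrate.sh.
PINNED_OPTIONS = ["--min-stem=5", "--max-loop=12"]
EXPTERMS = "newexpterms.dat"

def build_command(inputs, transterm="transterm"):
    """Build a transterm command line that always carries the pinned options.

    inputs: fasta files followed by annotation files, in that order.
    """
    return [transterm, "-p", EXPTERMS] + PINNED_OPTIONS + list(inputs)

def run(inputs):
    """Run transterm with the pinned options and return its exit status."""
    return subprocess.call(build_command(inputs))
```

Keeping the parameters in one place means the background file and the search settings cannot drift apart between runs.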
Output formatting parameters do not require regeneration of expterms.dat --- see
discussion above for which parameters expterm.dat depends on.
calibrate.sh can be found in the /usr/share/doc/transtermhp/examples directory.
The 'pval-conf' confidence scheme, selected with the option "--pval-conf
expterms.dat" (or '-p expterms.dat') computes the confidence of a
terminator with HP energy E and tail energy T as follows. First, the ranges of
HP energies and tail energies are evenly divided into bins, and the
appropriate bins e and t are found for E and T. Then the confidence is
computed as described in [2].
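The binning step can be sketched in Python as below. Half-open, equal-width bins and clamping of out-of-range values to the first or last bin are assumptions of this sketch; the function name is hypothetical, and the confidence formula itself (from [2]) is not reproduced.

```python
def bin_index(value, low, high, num_bins):
    """Map an energy value in [low, high] to a 0-based bin index.

    The range is split into num_bins equal-width bins; values outside
    the range are clamped to the nearest bin (an assumption).
    """
    width = (high - low) / num_bins
    index = int((value - low) // width)
    return max(0, min(num_bins - 1, index))
```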
The first line of expterms.dat contains 6 numbers:
seqlen num_bins
The (low_hp, high_hp) and (low_tail, high_tail) ranges give the bounds on the
hairpin and tail scores. The integer num_bins gives the number of
equally-sized bins into which those ranges are divided. Seqlen gives the
length of the random sequence that was used to generate the data in the rest
of the file.
Following this line are any number of (at, R, M) triples, where 'at' is the AT
content, R is a 4-tuple (low_hp, high_hp, low_tail, high_tail) giving the
range of the HP and tail scores observed in random sequences of this AT
content, and M is the distribution matrix. These (at, R, M) triples are
formatted as follows:
at low_hp high_hp low_tail high_tail
n11 n12 n13 n14 ... n1,num_bins
n21 ...
...
n_num_bins,1 ...
The mu_r(e,t) term is computed by selecting the matrix with the at value closest
to the computed %AT of the region r. If the total length of region r sequence
is L_r, then
mu_r(e,t) = n_t_e * L_r/seqlen
where n_t_e is the entry in the t-th row and e-th column of the selected matrix,
and seqlen is the first number in the first line of the file.
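The calculation of mu_r(e,t) can be illustrated with a small Python sketch. It assumes the matrices have already been read into memory as (at, M) pairs, that e and t are 0-based bin indices, and that `region_at` is expressed in the same units as the 'at' values in the file; the function name is hypothetical.

```python
def expected_count(matrices, region_at, e, t, region_len, seqlen):
    """Compute mu_r(e,t) = n_t_e * L_r / seqlen.

    matrices: list of (at, M) pairs, where M is a list of rows.
    The matrix whose 'at' value is closest to region_at is selected,
    and n_t_e is its entry in row t, column e.
    """
    _, matrix = min(matrices, key=lambda pair: abs(pair[0] - region_at))
    return matrix[t][e] * region_len / seqlen
```

The count n_t_e was observed in a random sequence of length seqlen, so scaling by L_r/seqlen gives the count expected in a region of length L_r.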
2ndscore(1)