pfscale - fit parameters of an extreme-value distribution to a profile score list
- pfscale [ -hl ] [ -L log_base ] [ -M mode_nb ] [ -N db_size ] [ -P upper_limit ] [ -Q lower_limit ] [ score_list | - ] [ profile ] [ parameters ]
pfscale fits the two parameters of an extreme-value distribution to a
sorted score distribution obtained by searching a sequence database with a
profile. The file 'score_list' is a sorted list of profile match scores
generated by pfsearch. If '-' is specified instead of a filename, the score
list is read from standard input. The result is written to standard output.
If the original profile is given as the second argument, the normalization
function with the lowest mode number or the lowest priority number specified
within the profile is updated so that it produces -log10 per-residue
E-values. If the second argument is omitted, the output consists of a header
line containing the normalization parameters, followed by a modified score
list showing score rank, original raw score, log-cumulative frequency and
the corresponding normalized score side by side.
Note that this program implements the significance estimation procedure for
profile match scores described in Hofmann & Bucher (1995). It has been
used to calculate the normalization parameters of all profiles in the
PROSITE database.
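The relationship between rank, log-cumulative frequency and normalized score can be illustrated with a minimal Python sketch. It assumes that the cumulative frequency of a score is rank / db_size and that the two parameters are obtained by an ordinary least-squares line through the points inside the probability window. This page does not state the exact fitting method, so this is an illustration and not the pfscale implementation. The function name fit_tail and its arguments are hypothetical.

```python
import math


def fit_tail(scores, db_size, upper=1e-4, lower=1e-6, log_base=10.0):
    """Fit -log(rank / db_size) as a linear function of the raw score.

    scores must be sorted in descending order, so rank 1 is the best match.
    Returns (intercept, slope); intercept + slope * score then approximates
    the normalized score inside the fitted window.
    Assumption: plain least squares, window boundaries inclusive.
    """
    points = []
    for rank, score in enumerate(scores, start=1):
        freq = rank / db_size  # assumed definition of cumulative frequency
        if lower <= freq <= upper:
            points.append((score, -math.log(freq, log_base)))
    if len(points) < 2:
        raise ValueError("need at least two scores inside the probability window")

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all scores inside the window are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

With the returned pair, a normalized score for a raw score s would be intercept + slope * s, which has the shape of the linear normalization function mentioned for option -L.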
- score_list
- Input score list.
The file must contain a sorted list of scores. The first field of each line
is taken as the score; all other fields on the same line are ignored.
Fields must be delimited by whitespace. If the filename is replaced by
a '-', pfscale will read the score list from stdin.
- profile
- Optional profile file.
If a filename is specified, the profile will be parsed and either the lowest
priority mode or the mode number specified with option -M will be
scaled. All cut-off levels which use the specified mode number will also
be updated.
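The score_list format described above (first whitespace-delimited field is the score, the rest of the line is ignored) can be read with a few lines of Python. This is a sketch for checking an input file before passing it to pfscale, not part of pftools. The helper names read_scores and is_sorted_descending are hypothetical, and two points are assumptions: blank lines are skipped, and "sorted" means descending order, as produced by sort -nr.

```python
import io


def read_scores(stream):
    """Return the first whitespace-delimited field of each line as a float."""
    scores = []
    for line in stream:
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are skipped
        scores.append(float(fields[0]))  # all other fields are ignored
    return scores


def is_sorted_descending(scores):
    """True if every score is at least as large as the one that follows it."""
    return all(a >= b for a, b in zip(scores, scores[1:]))


if __name__ == "__main__":
    demo = io.StringIO("512 seq_a extra\n498\tseq_b\n301\n")
    parsed = read_scores(demo)
    print(parsed, is_sorted_descending(parsed))
```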
- -h
- Display usage help text.
- -l
- Remove output line length limit. Individual lines of the
output profile can exceed a length of 132 characters, removing the need to
wrap them over several lines.
- -L log_base
- Logarithmic base of the parameters of the estimated
extreme-value distribution. The parameters reported by pfscale are
expressed as logarithms and thus can be inserted directly into a linear
normalization function defined in a generalized profile.
Default: 10
- -M mode_nb
- Mode number to scale.
Defines which mode number (and implicitly which cut-off level) of the input
PROSITE profile should be scaled. This overrides the
default behaviour of scaling only the normalization mode with the lowest
priority (or lowest mode number). All cut-off levels defined in the
profile as using this mode number (via the MODE keyword) will be
updated as well.
- -N db_size
- Size of the database from which the input score list was
derived. The searched database is typically a shuffled version of a real
protein or nucleotide sequence database.
Default: 14147368 (size of SWISS-PROT release 30 and shuffled
derivatives of it).
- -P upper_limit
- Upper threshold of the probability range to which the
extreme-value distribution will be fitted. For instance, if
N=10'000'000 and P=0.0001, profile match scores ranked below
position 1000 in the sorted input list (corresponding to occurrence
probabilities > 0.0001) will be ignored.
Default: 0.0001
- -Q lower_limit
- Lower threshold of the probability range to which the
extreme-value distribution will be fitted. For instance, if
N=10'000'000 and Q=0.000001, profile match scores ranked above
position 10 in the sorted input list (corresponding to occurrence
probabilities < 0.000001) will be ignored.
Default: 0.000001
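The arithmetic behind the -P and -Q examples is simply db_size multiplied by the probability limit. The Python sketch below reproduces it; the function name rank_window is hypothetical, and rounding to the nearest integer is an assumption, since this page does not say how pfscale handles fractional ranks.

```python
def rank_window(db_size, upper_limit=0.0001, lower_limit=0.000001):
    """Return (first_rank, last_rank) of the scores used for the fit.

    first_rank comes from the lower probability limit (-Q),
    last_rank from the upper probability limit (-P).
    Assumption: fractional ranks are rounded to the nearest integer.
    """
    first_rank = round(db_size * lower_limit)
    last_rank = round(db_size * upper_limit)
    return first_rank, last_rank


if __name__ == "__main__":
    # Values taken from the -P and -Q examples: ranks 10 to 1000.
    print(rank_window(10_000_000, 0.0001, 0.000001))
```

A window computed this way also gives a rough idea of how many matches the pfsearch cut-off must let through for the score list to cover the whole fitted range.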
- Note:
- for backwards compatibility, release 2.3 of the
pftools package will parse the version 2.2 style parameters, but
these are deprecated and the corresponding option (refer to the
options section) should be used instead.
- L=#
- Logarithmic base.
Use option -L instead.
- M=#
- Mode number.
Use option -M instead.
- N=#
- Database size.
Use option -N instead.
- P=#
- Upper probability threshold.
Use option -P instead.
- Q=#
- Lower probability threshold.
Use option -Q instead.
- (1)
- pfsearch -fr -C 200 sh3.prf shuffle20.seq | sort -nr | pfscale -P 0.0001 -Q 0.000001 -
- derives score-normalization parameters for the SH3 domain
profile in file 'sh3.prf'. The file 'shuffle20.seq' contains
a window-shuffled derivative of SWISS-PROT release 30 in
Pearson/Fasta format (window-size 20). Note that the implicit default of
N corresponds to the size of this database and thus need not be
specified on the command line. The cut-off value 200 for the
pfsearch(1) option -C will produce about 2000 matches
completely covering the range defined by the command line parameters
-P and -Q of pfscale. A suitable cut-off value has to
be guessed in advance by computing a few optimal alignment scores for
random sequences.
On successful completion of its task, pfscale will return an exit code of
0. If an error occurs, a diagnostic message will be output on standard error
and the exit code will be different from 0. When conflicting options were
passed to the program but the task could nevertheless be completed, warnings
will be issued on standard error.
- (1)
- The current version of pfscale does not yet support
the xpsa(5) output format produced by pfscan(1) or
pfsearch(1). The score list should therefore be generated without
the pfscan(1) and pfsearch(1) option -k.
Hofmann K & Bucher P. (1995).
The FHA-domain: a nuclear signalling domain
found in protein kinases and transcription factors. Trends Biochem. Sci.
20:347-349.
pfsearch(1), pfscan(1), xpsa(5)
The pftools package was developed by Philipp Bucher.
Any comments or suggestions should be addressed to <[email protected]>.