bmf - efficient Bayesian mail filter
bmf [-t] [-n] [-s] [-N] [-S] [-f fmt] [-d db] [-i file] [-k n] [-m type] [-p]
[-v] [-V] [-h]
bmf is a Bayesian mail filter. In its normal mode of operation, it takes an
email message or other text on standard input, does a statistical check
against lists of "good" and "spam" words, registers the
new data, and returns a status code indicating whether or not the message is
spam. bmf is written with fast, zero-copy algorithms, coded directly in C, and
tuned for speed. It aims to be faster, smaller, and more versatile than
similar applications.
bmf supports both mbox and maildir mail storage formats. It will automatically
process multiple messages within an mbox file separately.
Without command-line options, bmf processes the input, registers it as either
"good" or "spam", and returns the corresponding exit status.
The wordlist directory and any missing wordlist files are created automatically.
-t Test to see if the input is spam. The word lists are not updated. A
report is written to stdout showing the final score and the tokens with the
highest deviation from a mean of 0.5.
-n Register the input as non-spam.
-s Register the input as spam.
-N Register the input as non-spam and undo a prior registration as spam.
-S Register the input as spam and undo a prior registration as non-spam.
-f fmt Specify database format. Valid formats are text, db, and mysql.
Text is always valid. The others may not be available if the corresponding
option was not enabled at compile time. The default is db if available, else
text.
-d db Specify database or directory for loading and saving word lists.
The default is ~/.bmf in text mode.
-i file Use file for input instead of stdin.
-k n Specify the number of extrema (keepers) to use in the Bayes
calculation. The default is 15.
-m fmt Specify mail storage format. Valid formats are mbox and maildir.
The default is to automatically detect the mail storage format. This option is
deprecated.
-p Copy the input to the output (passthrough) and insert spam headers in
the style of SpamAssassin. An X-Spam-Status header is always inserted with
processing details. The contents of this header always begin with either
"Yes" or "No". If the input is judged to be spam, the
header "X-Spam-Flag: YES" is also inserted.
-v Be more verbose. This option is not well supported yet.
-V Display version information.
-h Display usage information.
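The headers described under -p can be read back by a downstream script. The following is a minimal Python sketch, not part of bmf. It assumes only what is stated above: an X-Spam-Status header whose contents begin with "Yes" or "No". The function name spam_verdict and the "hits=0.99" detail in the sample message are hypothetical.

```python
from email import message_from_string


def spam_verdict(raw_message):
    """Return True/False from the X-Spam-Status header, or None if absent."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        # The message did not pass through `bmf -p` (or the header was removed).
        return None
    # The header contents always begin with either "Yes" or "No".
    return status.strip().lower().startswith("yes")


# Hypothetical passthrough output; the text after "Yes," is made up.
SAMPLE = (
    "X-Spam-Status: Yes, hits=0.99\n"
    "X-Spam-Flag: YES\n"
    "Subject: example\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(spam_verdict(SAMPLE))
```

Checking X-Spam-Status rather than X-Spam-Flag lets the script tell "judged non-spam" apart from "never filtered", since the flag header is only present for spam.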
bmf treats its input as a bag of tokens. Each token is checked against
"good" and "bad" wordlists, which maintain counts of the
numbers of times it has occurred in non-spam and spam mails. These numbers are
used to compute the probability that a mail in which the token occurs is spam.
After probabilities for all input tokens have been computed, a fixed number of
the probabilities that deviate furthest from average are combined using
Bayes's theorem on conditional probabilities.
While this method sounds crude compared to the more usual pattern-matching
approach, it turns out to be extremely effective. Paul Graham's paper "A Plan
For Spam" (http://www.paulgraham.com/spam.html) is recommended reading.
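The calculation described above can be sketched in Python. This is an illustration modeled on the combining formula in Graham's essay, not bmf's actual C code. The function names, the 0.4 value for unseen tokens, and the 0.01/0.99 clamps are assumptions taken from that essay, and Graham's extra weighting of good counts is omitted for brevity.

```python
def token_spam_probability(good_count, spam_count, good_msgs, spam_msgs):
    """Estimate the probability that a mail containing the token is spam."""
    if good_count + spam_count == 0:
        return 0.4  # assumed value for tokens never seen before
    # Frequency of the token in each corpus, capped at 1.0.
    g = min(1.0, good_count / max(good_msgs, 1))
    b = min(1.0, spam_count / max(spam_msgs, 1))
    p = b / (g + b)
    # Clamp so a single token can never be fully decisive (assumed bounds).
    return min(0.99, max(0.01, p))


def combine(probs, keepers=15):
    """Combine the `keepers` probabilities that lie furthest from 0.5."""
    extrema = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:keepers]
    spam = 1.0
    good = 1.0
    for p in extrema:
        spam *= p
        good *= 1.0 - p
    # Bayes combination under the naive independence assumption.
    return spam / (spam + good)


if __name__ == "__main__":
    print(combine([0.99, 0.99, 0.2, 0.5]))
```

The keepers parameter plays the role of the -k option: only the most extreme probabilities take part, so a long message full of neutral words does not dilute a few strong indicators.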
bmf improves on Paul's proposal by doing smarter lexical analysis. In
particular, hostnames and IP addresses are retained, while certain types of
MTA information (such as message IDs and dates) are discarded.
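To make the lexical rule concrete, here is a rough Python sketch of a tokenizer that keeps hostnames and dotted IP addresses whole and drops Message-ID and Date lines. It is a toy illustration only, not bmf's lexer: the regular expression, the list of skipped headers, and the name tokenize are all assumptions.

```python
import re

# Header lines dropped entirely (assumed list, for illustration only).
SKIPPED_HEADERS = ("message-id:", "date:")

# A token is a run of letters/digits that may contain '.', '-' or "'"
# internally, so "mail.example.com" and "192.0.2.1" stay in one piece.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.'-]*[A-Za-z0-9]|[A-Za-z0-9]")


def tokenize(text):
    """Split text into tokens, skipping lines that start with a dropped header."""
    tokens = []
    for line in text.splitlines():
        # Note: this toy version does not distinguish headers from body lines.
        if line.lower().startswith(SKIPPED_HEADERS):
            continue
        tokens.extend(TOKEN_RE.findall(line))
    return tokens


if __name__ == "__main__":
    sample = (
        "Message-ID: <abc@x>\n"
        "Received: from mail.example.com [192.0.2.1]\n"
        "Buy now"
    )
    print(tokenize(sample))
```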
MIME and other attachments are not decoded. Experience from watching the token
streams suggests that spam with enclosures invariably gives itself away
through cues in the headers and non-enclosure parts. Nonetheless, I would like
to add the ability to decode quoted-printable and perhaps base64 encodings for
textual attachments.
Please see /usr/share/doc/bmf/README.gz for samples and suggestions.
In passthrough mode: zero for success, nonzero for failure.
In non-passthrough mode: 0 for spam; 1 for non-spam; 2 for I/O or other errors.
- ~/.bmf/goodlist.txt
- List of good tokens for text mode.
- ~/.bmf/spamlist.txt
- List of bad tokens for text mode.
- ~/.bmf/goodlist.db
- List of good tokens for libdb mode.
- ~/.bmf/spamlist.db
- List of bad tokens for libdb mode.
Only one instance of bmf can access the database at a time (see options -d
and -f). In Procmail recipes, ensure sequential access with a lock file:
:0 fw: bmf.lock
| bmf -p
The lexer does not recognize multiline headers.
The lexer does not recognize MIME attachments.
Content-Transfer-Encoding is not decoded.
Tom Marshall <[email protected]>.
The Bayes algorithm is from bogofilter by Eric S. Raymond
<[email protected]>. bogofilter can be found at the bogofilter project
page: http://bogofilter.sourceforge.net/.