Overview
This reference article provides an overview of Bayesian filtering and explains how MailEssentials employs it to stay ahead of email spammers. The Bayesian Analysis anti-spam filter can be trained to accurately determine whether an email is spam, based on past experience with both inbound and outbound emails. The article also describes the processing that takes place before the Bayesian Analysis filter flags an email as spam.
Introduction
Bayesian Analysis is a powerful feature in MailEssentials that is part of the Anti-Spam Engine (ASE) chain. It is an anti-spam adaptive technique based on artificial intelligence algorithms, hardened to withstand the widest range of spamming techniques available today.
It is disabled by default because administrators are strongly advised to “train” the Bayesian filter before enabling it. GFI recommends running MailEssentials for at least one week first, since the filter reaches its highest detection rate only after it has adapted to the organization’s email patterns.
The illustration shows the position of this filter in the Anti-Spam Engine (ASE) chain. The order in which the various modules scan an email is configurable and can be changed from MailEssentials Configuration > Anti-Spam > Filter Priority.
Description
Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event (Bayes' Theorem). GFI MailEssentials has adapted this mathematical basis to identify and classify spam. If a snippet of text occurs frequently in spam emails but not in legitimate emails, it is reasonable to assume that an email containing that snippet is probably spam.
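For reference, the textbook form of Bayes' Theorem applied to a single snippet of text (a token) can be written as shown here. The symbols S (the email is spam), H (the email is ham, i.e. legitimate) and T (the token appears in the email) are notation introduced only for this illustration; the exact calculation used by MailEssentials is not documented in this article and may differ.

$$
P(S \mid T) = \frac{P(T \mid S)\,P(S)}{P(T \mid S)\,P(S) + P(T \mid H)\,P(H)}
$$

A token that appears often in spam and rarely in legitimate email gives a value close to 1, while a token that appears mostly in legitimate email gives a value close to 0.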
Before the Bayesian Analysis filter can apply this theory, it first needs to learn how to differentiate between spam emails (SPAM) and legitimate emails (known as HAM). It learns what legitimate email looks like automatically from outbound emails.
MailEssentials should be running for a minimum of one week (depending on your mail volume) so that the Bayesian Analysis filter can effectively learn what legitimate mail looks like from your outbound mail traffic. Ideally, this should happen before the filter is enabled. The learning process can, however, be accelerated by running the Bayesian Wizard as described in this linked article on How to train, manually update and create a new database for the Bayesian Filter.
Bayesian Analysis learns through several methods, all of which are described below. All of this information is stored in weights.bsp – the Bayesian database. This database is used by the Bayesian Filter, whose internal name is “MailWare”.
- MailReaper: This is the process that checks outbound emails and will update ham_tmp.tok. MailReaper can be enabled and disabled from the MailEssentials configuration > Anti-Spam > Bayesian Analysis > General tab > ‘Automatically learn from outbound e-mails’.
- Bayesian Wizard: This is the Bayesian Analysis Wizard, which can be started from the MailEssentials program group or from ..GFI\MailEssentials\Antispam\BSW. This wizard can be configured to scan a mailbox and learn from the items in the mailbox.
- Rcommands: This is the Remote Commands module. This module can also save to the tok files if an email has the ADDASSPAM or ADDASHAM keywords in the first line of the email.
- RpFolders and PFolders: These two modules have been introduced in newer versions of MailEssentials. They are the modules that check the Exchange Public folders (if MailEssentials is installed on the same machine as Exchange) or the IMAP folders for any emails that will be used to alter the Anti-Spam configuration. Two folders used by these modules change the database of the Bayesian filter by adding information to either the HAM tok file or SPAM tok file.
- Downloaded from the internet: From the MailEssentials configuration > Anti-Spam > Bayesian filter > Updates tab, the administrator can download the latest spam profile. The original idea was that GFI would maintain a spam profile updated with the latest techniques used by spammers. This feature is currently not maintained, so database updates from the internet are not available.
- *.tok: These are temporary files for the Bayesian filter. New information for the Bayesian filter is not updated immediately to the main Bayesian database (weights.bsp), but is stored in these *.tok files. Ham_tmp.tok will contain information on the good emails, whilst Spam_tmp.tok will contain information on spam emails.
- MailMerge: MailMerge is a process started by the GFI MailEssentials Legacy Attendant service. This process checks the data directory once every 15 minutes and, if any *.tok files are found, merges their contents into the weights.bsp database.
- Weights.bsp: This is the information database used by the Bayesian filter to check emails. This database is proprietary, i.e. it is used only by the MailEssentials Bayesian filter. It contains a list of keywords, which are referred to as tokens.
| Token | HAM | SPAM | Calculation |
| --- | --- | --- | --- |
| HAMTokenEntry | 15 | 2 | 0.01 |
| SPAMTokenEntry | 1 | 17 | 0.98 |
| AnotherToken | 10 | 12 | 0.45 |
- The above table illustrates a simplified format of the Bayesian database. It contains an example of a HAM token, a SPAM token, and another token, which is neutral. When information is added to the Bayesian database, the Bayesian filter module will work out the calculation, using Bayesian theory.
- The Bayesian database and the Bayesian engine have undergone various updates, which also allow them to learn from emails containing images to combat image spam, as well as from emails containing attachments to combat attachment spam.
- In relation to Attachment spam, the Bayesian database supports the following tokens:
  - Subject Length and Body Length
  - Subject Length, Body Length and Number of Attachments
  - Subject Length, Body Length, Number of Attachments, and the attachment file extension when only one attachment is found
  - File Extension
  - File Extension and File Size
- When an email with an attachment is found, the filter retrieves the properties mentioned above in addition to the tokens from the message body, and uses them when processing the email.
- MailWare: This is the Bayesian filter process. When an email is scanned by the Bayesian filter, it extracts the tokens found in the email. It uses only the 15 most interesting tokens to decide whether the email should be blocked. The most interesting tokens are those whose values are closest to 0 or 1, which excludes relatively neutral tokens from the decision.
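The selection step described for MailWare can be sketched in Python as follows. This is a minimal illustration and not GFI's implementation: the function names are hypothetical, and the way the selected values are combined (the naive Bayes product rule commonly used in Bayesian spam filters) is an assumption, since this article does not state the formula MailWare uses.

```python
from math import prod

# The article states that only the 15 most interesting tokens are used.
MAX_INTERESTING = 15


def most_interesting(token_values, limit=MAX_INTERESTING):
    """Return the values furthest from the neutral 0.5, most extreme first."""
    return sorted(token_values, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def combine(values):
    """Combine per-token values into one score (ASSUMED naive Bayes rule).

    Values are expected to lie strictly between 0 and 1.
    An empty list yields the neutral score 0.5.
    """
    spam_side = prod(values)
    ham_side = prod(1.0 - p for p in values)
    return spam_side / (spam_side + ham_side)


def spamicity(token_values):
    """Score an email from the token values found in it (hypothetical helper)."""
    return combine(most_interesting(token_values))


if __name__ == "__main__":
    # Example values taken from the simplified database table in this article.
    print(round(spamicity([0.01, 0.98, 0.45]), 2))
```

With the three example values from the table (0.01, 0.98 and 0.45), this sketch yields roughly 0.29, leaning towards HAM. Sorting by distance from 0.5 is what keeps near-neutral tokens such as 0.45 out of the decision once an email contains more than 15 tokens.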
An administrator can monitor the continuous updates of the Bayesian Database (weights.bsp) by monitoring the number of HAM and SPAM emails in the database which is indicated on the Bayesian Filter configuration page. The number of emails shown here should continue to grow in proportion to the inbound and outbound email volumes to show that the filter is indeed learning.
The logging extracts below are from Libpam.gfi_log.txt (the Bayesian filter log file) and show the processing of two emails. The 15 most interesting tokens are also listed in the log file. A token value can be between 0 and 1. A token value of 0.01 indicates a HAM token, whereas a token value of 0.99 indicates a SPAM token. Token values in between are also considered; however, the nearer a value is to 0.5, the less consideration the token receives.
The tokens are first extracted and converted to UTF-8 format using the libspamatt module, but no processing is performed on the email itself. These tokens are all passed directly to the libspam module.
"LibSpam","[CSpamFilter::CSpamFilter] c'tor"
"LibSpam","[CSpamFilter::_lock] <GFI_ME_FILTERSWAP_LOCK>"
"LibSpam","[MailWare::testMessage] # tokens: 602039"
"LibSpam","[::token_extract]"
"LibSpam","[::token_extract] setting up regex"
"LibSpam","[::token_extract] html tags [0]<<>>"
"LibSpam","[::token_extract] 59 inner text <<bt3$0 Test message from Exchange Test message from Exchange>>"
"LibSpam","[::token_extract] getting matches 59"
"LibSpam","[::token_extract] plus regex done [9]"
"LibSpam","[MailWare::testMessage] tokens extracted"
"LibSpam","[MailWare::testMessage] clearing 9 matches"
"LibSpam","[MailWare::testMessage] sorting tokens"
"LibSpam","[MailWare::testMessage] famous-15"
"LibSpam","[MailWare::testMessage] classify (# tokens 602039)"
"LibSpam","[MailWare::testMessage] spamicity: 0.000009 (5/9)"
"LibSpam","[CSpamFilter::_unlock]"
"LibSpam","[CSpamFilter::~CSpamFilter] d'tor"
"LibSpam","[::~CMailWare]"
The Bayesian filter found that this email is a legitimate email, and did not block it. This Bayesian installation has learnt that most of the tokens shown in the above logging are of HAM nature. The lower the ‘spamicity’ value, the more strongly the filter considers the email to be HAM.
"LibSpam","[CSpamFilter::CSpamFilter] c'tor"
"LibSpam","[CSpamFilter::_lock] <GFI_ME_FILTERSWAP_LOCK>"
"LibSpam","[MailWare::testMessage] # tokens: 602039"
"LibSpam","[::token_extract]"
"LibSpam","[::token_extract] setting up regex"
"LibSpam","[::token_extract] html tags [0]<<>>"
"LibSpam","[::token_extract] 749 inner text <<bt1$5 spam viagra Do You have years of 1ife experience in your field of expertise and are being passed up for promotion? Is Your application for employment is being rejected because you don't hold a University Deegree? 1-718-9895-740 [inside U.S.A.] to inquire about our programs +1-718-9895-740 [International] to inquire about our programs Whether You are seeking a Bachel0rs
"LibSpam","[::token_extract] getting matches 749"
"LibSpam","[::token_extract] plus regex done [111]"
"LibSpam","[MailWare::testMessage] tokens extracted"
"LibSpam","[MailWare::testMessage] clearing 111 matches"
"LibSpam","[MailWare::testMessage] sorting tokens"
"LibSpam","[MailWare::testMessage] famous-15"
"LibSpam","[MailWare::testMessage] classify (# tokens 602039)"
"LibSpam","[MailWare::testMessage] spamicity: 0.999863 (13/15)"
"LibSpam","[CSpamFilter::_unlock]"
"LibSpam","[CSpamFilter::~CSpamFilter] d'tor"
"LibSpam","[::~CMailWare]"
In the example above, the Bayesian filter found that this email is a SPAM email and blocked it. The Bayesian filter has learnt that the tokens listed above are normally found in SPAM emails. The ‘spamicity’ value of 0.999863 is very close to 1, the highest possible spam rating.
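When reviewing many such entries, the ‘spamicity’ values can be pulled out of the log with a short script. The sketch below assumes that log lines have exactly the format shown in the extracts above; the function name is hypothetical, and the two numbers in parentheses are returned as-is because their meaning is not documented in this article.

```python
import re

# Matches e.g. "spamicity: 0.999863 (13/15)" as seen in the log extracts.
SPAMICITY_RE = re.compile(r"spamicity:\s*(\d+(?:\.\d+)?)\s*\((\d+)/(\d+)\)")


def parse_spamicity(line):
    """Return (spamicity, first_number, second_number), or None if no match."""
    match = SPAMICITY_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2)), int(match.group(3))


if __name__ == "__main__":
    sample = [
        '"LibSpam","[MailWare::testMessage] sorting tokens"',
        '"LibSpam","[MailWare::testMessage] spamicity: 0.999863 (13/15)"',
    ]
    for entry in sample:
        result = parse_spamicity(entry)
        if result is not None:
            print(result)
```

Lines that do not contain a ‘spamicity’ entry are simply skipped, so the function can be applied to every line of the log file.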