Adaptive Filtering:
One Year On
John Graham-Cumming
Research Director, Sophos’s Anti-Spam Task Force
Author, POPFile
www.sophos.com
Adaptive Filtering
 Definition: An email filter that can be taught to
recognize different types of mail without writing
rules.
 Most use some machine learning technique:

Naïve Bayesian Classification1

k-nearest neighbors (k-NN)2

Support Vector Machines3
 All provide some measure of “spamminess”
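A minimal sketch of how a naive Bayes filter can turn taught examples into a “spamminess” score. The class name, the whitespace tokenizer and the add-one smoothing are assumptions made for illustration; none of the filters named in this talk is claimed to work exactly this way.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes mail filter (hypothetical, for illustration)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Teach the filter one message; label is 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spamminess(self, text):
        """Return an estimate of P(spam | text) between 0 and 1."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Class prior, smoothed so an untrained class never gives log(0).
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            for word in text.lower().split():
                # Add-one smoothing: unseen words get a small non-zero probability.
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
            log_score[label] = score
        # Turn the log-odds into a probability; clamp to avoid overflow in exp().
        diff = max(-700.0, min(700.0, log_score["ham"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))


mail_filter = NaiveBayesFilter()
mail_filter.train("cheap viagra offer", "spam")
mail_filter.train("dinner with mom last night", "ham")
print(mail_filter.spamminess("viagra offer"))
print(mail_filter.spamminess("dinner with mom"))
```

No rules are written anywhere: the only input is labelled mail, which is the point of the definition above.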
Machine Learning & Antispam
 A little more than one year
 Papers




Mar 1998: SpamCop: A Spam Classification &
Organization Program1
Jul 1998: A Bayesian Approach to Filtering Junk Email2
2000: An evaluation of Naive Bayesian anti-spam
filtering3
Aug 2002: A Plan for Spam4
 Patents


Jun 1998: 6,161,130: Technique which utilizes a
probabilistic classifier to detect "junk" e-mail by
automatically updating a training and re-training the
classifier based on the updated training set
Jun 1999: 6,592,627: System and method for
organizing repositories of semi-structured documents
such as email
Why now?
 The “Grandma Problem”
 Confluence of events:

Spam getting close to 50% of all mail1

Email reaching 1/3 of adults in US2

Fast processors can handle the processing
load

No other good alternatives

Laws?

Migrate from SMTP?3
Two Routes
 Open Source
 Lots of open source anti-spam solutions
 Many are “wannabe” solutions that simply
implemented Paul Graham’s ideas
 Some are interesting tools (bogofilter, POPFile,
SpamBayes)
 Commercial
 Vendors now incorporating Adaptive Filtering into their
anti-spam products
 Classic tradeoff:
 Free, open source, community supported
 Fee, “productized”, vendor supported
Practical Open Source
Filters
 General mail filters1

Aug 1996: ifile

Aug 2002: POPFile

Oct 2002: dbacl
 Spam Filters2

Bogofilter, SpamBayes, Bayesian Spam Filter,
SpamProbe, SpamWizard, BSpam, The Spam
Secretary, Expaminator, SqueakyMail, Bayespam,
spaminator, Quick Spam Filter, Annoyance Filter,
DSPAM, PASP, Spam Blocker, CRM114

SpamAssassin (added Bayesian in 2.5)
Mainstream Adaptive
Filtering
 General


SwiftFile (for Lotus Notes)1
Ella Pro (for Microsoft Outlook)2
 Anti-spam Desktop
 Mozilla 1.3, Eudora 6.0
 Microsoft MSN 8, Microsoft Outlook 2003
 AOL 9.0, Apple Mail.app (Jaguar)
 Anti-spam Gateway
 Sophos PureMessage 4.x

 Prediction: By end of 2004 every major email client
includes adaptive filtering
The Problems
 Man-in-the-street Usability
 False Positives
 Over training
 One man’s spam is another man’s ham
 Internationalization
Usability
 Proxy, plug-in and external filters are too
complex
 General user needs:

To not understand the underlying mechanism

Complete integration with mail client

Obvious operation (e.g. spam is moved into a
folder called Spam)

Automatic whitelisting (if I send to Mom, Mom
is ok)
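A minimal sketch of automatic whitelisting, assuming the mail client can see the To and Cc headers of outgoing mail. The class and method names are hypothetical. Note that a From header can be forged, so a real filter would likely treat a whitelist hit as strong evidence rather than proof.

```python
from email.utils import getaddresses, parseaddr


class AutoWhitelist:
    """Remembers everyone the user has written to (hypothetical helper)."""

    def __init__(self):
        self.trusted = set()

    def note_outgoing(self, to_header, cc_header=""):
        """Record the recipients of a message the user sent."""
        headers = [h for h in (to_header, cc_header) if h]
        for _name, address in getaddresses(headers):
            if address:
                self.trusted.add(address.lower())

    def is_trusted(self, from_header):
        """True if the sender of an incoming message was written to before."""
        _name, address = parseaddr(from_header)
        return address.lower() in self.trusted


whitelist = AutoWhitelist()
whitelist.note_outgoing("Mom <mom@example.com>")
print(whitelist.is_trusted("mom@example.com"))
print(whitelist.is_trusted("stranger@example.net"))
```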
False Positives
 False Positive == Good mail identified as bad
 False Negative == Spam identified as good
 People tolerate false negatives, but hate false positives
 Spam filters must guard against false positives:

Bias towards False Negatives (“A Plan for Spam”)

Cross check results (SpamBayes)

High spam threshold
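A sketch of two of these guards, loosely modelled on the approach described in “A Plan for Spam”: good-mail counts are doubled before a token's probability is computed, and a message is only called spam above a high threshold (0.9 here). Function and parameter names are assumptions for illustration, and both corpus sizes are assumed to be greater than zero.

```python
SPAM_THRESHOLD = 0.9  # high bar: better to let some spam through


def token_spam_probability(spam_count, ham_count, n_spam, n_ham, ham_bias=2):
    """Probability that a message containing this token is spam.

    spam_count / ham_count: occurrences of the token in each corpus.
    n_spam / n_ham: number of messages in each corpus (assumed > 0).
    ham_bias: multiplier on ham counts that biases towards false negatives.
    """
    good = ham_bias * ham_count
    bad = spam_count
    if good + bad < 5:
        return None  # too rare to say anything useful
    spam_freq = min(1.0, bad / n_spam)
    ham_freq = min(1.0, good / n_ham)
    # Clamp so no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, spam_freq / (spam_freq + ham_freq)))


def is_spam(score, threshold=SPAM_THRESHOLD):
    """Only scores above the threshold are filed as spam."""
    return score > threshold


# Same counts with and without the bias towards good mail.
print(token_spam_probability(10, 5, 100, 100))
print(token_spam_probability(10, 5, 100, 100, ham_bias=1))
```

With the bias, a token seen 10 times in spam and 5 times in ham scores as neutral instead of leaning towards spam.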
Over Training
 Occurs when user loads up adaptive filter with
lots more spam than ham
 e.g. feeds entire spam archive into filter
 Some adaptive filters then think everything is
spam
 For Naïve Bayes classifiers the “train on errors”
methodology works well in practice.
 User teaches filter only on mails it incorrectly
classified
 “No, that’s spam” or “No, that’s ham” button
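The “train on errors” loop can be sketched as follows. TinyFilter is a deliberately crude stand-in classifier (plain word counts, not naive Bayes) so that the example is self-contained; all names are hypothetical.

```python
from collections import Counter


class TinyFilter:
    """Stand-in classifier based on raw word counts (illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        spam_hits = sum(self.counts["spam"][w] for w in words)
        ham_hits = sum(self.counts["ham"][w] for w in words)
        # Ties go to ham, which also biases towards false negatives.
        return "spam" if spam_hits > ham_hits else "ham"


def train_on_errors(mail_filter, labelled_mail):
    """Teach the filter only the messages it got wrong.

    labelled_mail yields (text, true_label) pairs, where true_label is what
    the user says the message is. Returns the number of corrections made.
    """
    corrections = 0
    for text, true_label in labelled_mail:
        if mail_filter.classify(text) != true_label:
            mail_filter.train(text, true_label)  # the "No, that's spam" button
            corrections += 1
    return corrections


mail = [
    ("cheap pills now", "spam"),
    ("cheap pills today", "spam"),
    ("lunch tomorrow", "ham"),
]
print(train_on_errors(TinyFilter(), mail))
```

Because correctly classified mail is never fed back in, dumping a large spam archive on the filter does not swamp the ham counts.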
One man’s spam…
 Can be hard to unsubscribe from legitimate bulk
mail
 Users tell spam filter that legitimate mail is spam
 Creates false positives for other users in shared
systems
 e.g. I say CNET News email is spam, you want
it
 Ideal system has two parts
 Gateway spam filter run by IT group
 Individual preferences on each client
Internationalization
 Tokenization non-trivial for some languages

In English words are “space separated”

Thisisnotthecaseinsomeotherlanguages:

Japanese (POPFile の特別な使い方)
 Different punctuation

¿Español? «Français»
 UTF-8, Unicode

‫ أخبار و تقارير‬looks like ÃÎÈÇÑ æ ÊÞÇÑíÑ
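A small sketch of both problems. Whitespace splitting finds a single token in run-together text, so the example falls back to overlapping character pairs; that fallback is an assumption chosen for illustration, not the method of any filter named here. The second half shows one way the garbled Arabic can arise: the bytes are written in one character set (Windows-1256 is assumed) and read back as Latin-1.

```python
def whitespace_tokens(text):
    """Tokenize the way an English-only filter might."""
    return text.split()


def char_bigrams(text):
    """Overlapping character pairs, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(pair) for pair in zip(chars, chars[1:])]


run_together = "Thisisnotthecaseinsomeotherlanguages"
print(whitespace_tokens(run_together))      # one giant "word"
print(char_bigrams("の特別な使い方"))        # usable features without spaces

# The Arabic phrase from the slide, written with escapes to avoid
# bidirectional-text display problems in editors.
arabic = "\u0623\u062e\u0628\u0627\u0631 \u0648 \u062a\u0642\u0627\u0631\u064a\u0631"

# Encode as Windows-1256 (assumed), then misread the bytes as Latin-1.
garbled = arabic.encode("cp1256").decode("latin-1")
print(garbled)

# Reading the bytes with the right character set recovers the text.
print(garbled.encode("latin-1").decode("cp1256") == arabic)
```

A filter that ignores the declared character set would learn the garbled tokens rather than the words themselves.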
Spammer’s Response
 Overwhelm filter with “good words”
 Hide those good words from people
 Use HTML as trickery toolbox
 Three techniques:

And the Kitchen Sink

Invisible Ink

Camouflage
 More in Sophos’s Field Guide to Spam1
And the Kitchen Sink
 Throw in innocent words before or after the
HTML
<html><body>
Viagra
</body></html>
Hi, Johnny! It was really nice to
have dinner with you last night.
See you soon, love Mom
And the Kitchen Sink
 Spammer hopes reader concentrates on the
spam message part
 Ineffective because user gets to see the innocent
words
 Spammers need ways to hide the innocent words
 So they’ve taken inspiration from search engine
trickery…
Invisible Ink
 Use HTML font colors to write white on white
<body bgcolor=white>
Viagra
<font color=white>Hi, Johnny!
It
was really nice to have dinner with
you last night. See you soon, love
Mom</font>
</body>
Invisible Ink
 Easily spotted if filter groks HTML
 Can confuse filters that just drop HTML tags
 Spammers have noticed that Invisible Ink is
being targeted
 They’ve adapted…
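A minimal sketch of a filter that “groks HTML” well enough to spot Invisible Ink, using Python's standard html.parser module. It only tracks the body bgcolor and font color attributes and compares the color strings literally (so white and #FFFFFF would not match); class and function names are hypothetical.

```python
from html.parser import HTMLParser


class InvisibleInkDetector(HTMLParser):
    """Separates visible text from text colored like the background."""

    def __init__(self, default_background="white"):
        super().__init__()
        self.background = default_background  # assumed when no bgcolor is given
        self.color_stack = []                 # one entry per open <font> tag
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "body" and attributes.get("bgcolor"):
            self.background = attributes["bgcolor"].lower()
        elif tag == "font":
            inherited = self.color_stack[-1] if self.color_stack else None
            color = attributes.get("color")
            self.color_stack.append(color.lower() if color else inherited)

    def handle_endtag(self, tag):
        if tag == "font" and self.color_stack:
            self.color_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse line breaks and extra spaces
        if not text:
            return
        color = self.color_stack[-1] if self.color_stack else None
        target = self.hidden if color == self.background else self.visible
        target.append(text)


def find_invisible_ink(html):
    """Return (visible_text, hidden_text) as two lists of strings."""
    detector = InvisibleInkDetector()
    detector.feed(html)
    detector.close()
    return detector.visible, detector.hidden


sample = """<body bgcolor=white>
Viagra
<font color=white>Hi, Johnny! It was really nice to have dinner with
you last night. See you soon, love Mom</font>
</body>"""
visible, hidden = find_invisible_ink(sample)
print("visible:", visible)
print("hidden:", hidden)
```

A filter that simply drops the tags would see all of the words as one message; this one can discard the hidden text, or treat its presence as a sign of spam.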
Camouflage
 Use very similar HTML colors
<body bgcolor=#113333>
<font color=yellow>Viagra</font>
<font color=#123939>some innocent
words</font>
</body>
Camouflage
Hard to see, but
“some innocent words”
do appear
Pythagoras Spots Spam
 Foreground and background colors are coordinates
in 3D
 Imagine a Red axis, a Green axis and a Blue axis
 Similar colors are close
 Dissimilar colors are far apart
 Pythagoras’ Theorem (3D)1 gives the color distance
[Figure: RGB color space with Red and Blue axes labelled; points plotted at (00,00,00), (11,33,33), (12,39,39) and (FF,FF,00); caption: “Sweet, I rule in 2003”]
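The color distance can be sketched in a few lines, using the hex values from the Camouflage slide. The function names and the cut-off of 50 are assumptions for illustration; the talk does not state what threshold a real filter should use.

```python
import math


def parse_hex_color(value):
    """Turn '#123939' into an (R, G, B) tuple of integers 0..255."""
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(first, second):
    """Straight-line distance between two colors in RGB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))


def is_camouflaged(foreground, background, cutoff=50.0):
    """True if the text color is suspiciously close to the background."""
    return color_distance(foreground, background) < cutoff


background = parse_hex_color("#113333")
innocent_words = parse_hex_color("#123939")
yellow = parse_hex_color("#FFFF00")

print(color_distance(background, innocent_words))  # about 8.5
print(color_distance(background, yellow))          # about 317.6
print(is_camouflaged(innocent_words, background))
print(is_camouflaged(yellow, background))
```

Exact white-on-white Invisible Ink is simply the special case where the distance is zero.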
Spammers love HTML
[Chart: Spams using HTML. Y axis: % Messages, 0% to 90%. X axis: Dec-02, Jan-03, Feb-03, Mar-03, Apr-03. Data labels: 84%, 83%, 85%, 84%, 84%.]
Trick Trends - Two Increasing
[Chart: Two Tricks Showing Gains. Series: HTML Comments, Invisible Ink. Y axis: % Messages, 0% to 30%. X axis: Dec-02, Jan-03, Feb-03, Mar-03, Apr-03.]
Tricks Make Spam
Spotting Easier
 Bad news for spammers:
 The harder you try to obscure your messages
the easier they are to filter
 Spam trickery becomes the spam fingerprint
 Bad news for end users:
 Spammers will react by making spam more
innocent
Hi, I saw your profile and wanted
to get in touch, please check out
my site at www.some-viagrasite.com
The Filter Paradox
 Do filters make spam more effective?
 One spammer claimed on /.
“Your filters help cut down on the complaints to ISPs
[…] you no longer complain to [email protected], my
access providers, or anyone else who might cause
me problems”
 Time will tell
The End
 Following slides are for reference purposes
References
 Slide 2
1. http://www.wikipedia.org/wiki/Naive_Bayesian_classification
2. http://www.usenix.org/events/sec02/full_papers/liao/liao_html/node4.html
3. http://citeseer.nj.nec.com/tong00support.html
References
 Slide 3
1. http://citeseer.nj.nec.com/pantel98spamcop.html
2. http://citeseer.nj.nec.com/sahami98bayesian.html
3. http://citeseer.nj.nec.com/androutsopoulos00evaluation.html
4. http://www.paulgraham.com/spam.html
 Slide 4
1. Wired, p50, September 2003 predicts 50% of all mail will be spam by September 2004
2. US Census Bureau, 2000
3. One proposal is AMTP: http://www.ietf.org/internet-drafts/draft-weinman-amtp-00.txt
References
 Slide 5
1. POPFile: http://popfile.sourceforge.net
   ifile: http://www.nongnu.org/ifile/
2. Search SourceForge and Freshmeat
 Slide 6
1. http://www.research.ibm.com/swiftfile/
2. http://www.openfieldsoftware.com/Ella.asp
References
 Slide 17
1. http://www.activestate.com/Products/PureMessage/Field_Guide_to_Spam/
Pythagoras in 3D
 Distance between two points in space
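 Pythagoras: $\delta^2 = \alpha^2 + \beta^2$
 Pythagoras: $\alpha^2 = (x-a)^2 + (z-c)^2$
 $\beta^2 = (y-b)^2$
 $\delta = \sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}$
[Figure: the points (a, b, c) and (x, y, z) joined by a line of length δ, with sides α and β]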
October 2003: Adaptive Filtering: One Year On (conference)