Schrödinger’s Collaborative Spam Filter
June 17th, 2005 by DeWitt Clinton

Schrodinger's Spam

I’d like to throw out an idea that I’ll call Schrödinger’s Collaborative Spam Filter.

As everyone knows, email spam is a huge problem. It costs innocent people money and wastes their time. It preys on the naïve and defenseless. It is often fraudulent and illegal. Now I’m non-violent by desire, but if I happened to be face to face with someone who sends email spam (or blog comment, search engine, or referrer spam) then I would honestly have to restrain myself from getting physical with them. For the life of me I can’t understand the mentality of someone who would willingly choose to try and profit by such a morally bankrupt practice.

Some combination of Bayesian filtering, sender validation, and blacklisting has been reasonably effective at combating the worst email spam. If you run your own mail server then you can make some headway against spam. However, if you use your ISP’s mail server or a webmail application such as Hotmail, Yahoo! Mail, or Gmail, then you are essentially at the mercy of how well they keep the spam out of your inbox.

As far as I can tell, all of the major webmail providers employ some form of collaborative filtering technique that leverages Other People’s Work (i.e., their users) to combat spam. In other words, if 100 users mark similar-looking email as spam then they can block it from being delivered to the next 100,000. Of course, it’s not that simple: spammers have long been varying the individual contents of each spam email to make it difficult to wholesale reject a particular message. But from the user’s perspective, most webmail providers are getting pretty darn good at blocking the bulk of spam email. Anecdotally, I use Gmail and they are blocking on average 70 spam emails to me each day and are letting about 10 through to my inbox.
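To make the "100 users mark it, the next 100,000 never see it" idea concrete, here is a minimal sketch in Python. It is not how any real webmail provider works. The `fingerprint` normalization, the `CollaborativeFilter` class, and the threshold are all hypothetical names and values chosen for illustration, and real spam varies far more than this crude normalization can absorb.

```python
import hashlib
import re

REPORT_THRESHOLD = 100  # illustrative only, borrowed from the "100 users" figure


def fingerprint(body: str) -> str:
    """Hypothetical fingerprint: keep only letters and spaces, then hash.

    This absorbs trivial variation (case, digits, punctuation) and nothing more.
    """
    normalized = re.sub(r"[^a-z ]+", "", body.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CollaborativeFilter:
    """Counts distinct users who reported each fingerprint as spam."""

    def __init__(self, threshold: int = REPORT_THRESHOLD) -> None:
        self.threshold = threshold
        self.reporters: dict[str, set[str]] = {}  # fingerprint -> reporting user ids

    def report_spam(self, user_id: str, body: str) -> None:
        # A set means one user reporting twice still counts once.
        self.reporters.setdefault(fingerprint(body), set()).add(user_id)

    def is_known_spam(self, body: str) -> bool:
        return len(self.reporters.get(fingerprint(body), set())) >= self.threshold
```

Once enough distinct users have called `report_spam` on similar-looking messages, `is_known_spam` returns `True` for the next copy, which is the point at which a provider could refuse to deliver it.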

It’s those last ten spam emails that slip through that I’m starting to wonder about. Most of them are quite obviously spam. Or at least, they look exactly like all the other spam emails that were caught by the default filters that Gmail employs. So as a user I’m frustrated that they can catch all the other ones but let those slip through. But then you stop and think, “Well, if they are using collaborative techniques, then perhaps I was the first person to get that particular piece of spam. Maybe no one had been able to flag the patterns yet, so it got delivered to me. If I mark it spam, then it is less likely to land in someone else’s mailbox next time.”

Maybe that’s what you or I would think. But that’s not what Erwin Schrödinger would think. No, that crafty Austrian would think “before I’ve observed it, it is neither spam nor not spam, it is both spam and not spam.” The act of observation would cause the email to collapse into a spam or non-spam state. (Please don’t anyone think I’m serious about the physics of this, it’s a metaphor.) Going further: until I’ve checked my inbox, there is no message, spam or otherwise. (Actually, to be really accurate, that’s not what he’d think; he’d think that it was either/or the whole time.)

So the conclusion is: why stop filtering spam once it has been delivered to your inbox? If it has been delivered, but you haven’t seen it yet, and 100 people say that email is spam, why not remove it before you see it? The system knows whether or not you’ve read your email, so it knows what it can do without you even knowing that it has been done.
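Here is a rough sketch of what that retroactive step might look like, again in Python and again with made-up names. `Message`, `Mailbox`, and `retroactive_purge` are hypothetical, and the sketch assumes the server already has a set of fingerprints that enough users have flagged as spam since delivery.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    fingerprint: str    # assumed: some stable identifier for "similar-looking" mail
    seen: bool = False  # assumed: the server tracks whether the user has viewed it


@dataclass
class Mailbox:
    inbox: list = field(default_factory=list)
    spam: list = field(default_factory=list)


def retroactive_purge(mailbox: Mailbox, known_spam: set) -> int:
    """Move delivered-but-unseen messages that are now known spam out of the inbox.

    Messages the user has already seen are left alone, even if they match.
    Returns the number of messages moved to the spam folder.
    """
    keep = []
    moved = 0
    for message in mailbox.inbox:
        if not message.seen and message.fingerprint in known_spam:
            mailbox.spam.append(message)
            moved += 1
        else:
            keep.append(message)
    mailbox.inbox = keep
    return moved
```

Running `retroactive_purge` each time the set of known spam grows would mean that a message delivered at 9am and flagged by others at 10am is gone by the time its recipient checks mail at 3pm.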

Empirical evidence suggests that webmail applications are not Schrödinger filtering yet. I have made a habit of only checking my email a handful of times a day, and when I check it at the end of the day I often find spam that has been sitting around for hours. And since some of that spam is of the obvious type, I can only assume it was delivered to my inbox before the collaborative filtering kicked in. The question is — why was it still there when I checked it 6 hours later? At that point it is known spam, and the user shouldn’t have to see it.

Now the “Copenhagen interpretation” of this idea implies that if you can observe the system prior to checking your inbox, then the wavefunction (spam or not-spam) must collapse early and thus defeat the Schrödinger filter. This is important. It means that if you forward your email, use a mail notifier (a “biff”), use the nifty RSS feeds, etc., then you will force an early determination of the spam-ness of each message. If each message needs to collapse into spam or non-spam soon after it is delivered, then you are unable to perform a retroactive purging of spam.

Then again, why not offer the option? Why not ask your users to give you permission to delete email from the inbox if you are really certain it is spam? I for one wouldn’t object to that at all. I’d rather have my inbox clean itself up than force me to do the work.
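The last two paragraphs boil down to three conditions: the user has opted in, the message has not yet been exposed through any channel, and the system is really certain it is spam. The sketch below is one way to write that rule down in Python. The `Channel` names, the `may_purge` function, and the 0.99 confidence cutoff are assumptions made up for illustration, not anything a real provider is known to use.

```python
from dataclasses import dataclass, field
from enum import Enum


class Channel(Enum):
    """Ways a message can be observed before the user opens the inbox."""
    WEB_INBOX = "web_inbox"
    FORWARD = "forward"
    NOTIFIER = "notifier"  # a "biff"
    RSS_FEED = "rss_feed"


@dataclass
class Message:
    subject: str
    observed_via: set = field(default_factory=set)

    def expose(self, channel: Channel) -> None:
        """Record that the message left the server's control through a channel."""
        self.observed_via.add(channel)

    @property
    def collapsed(self) -> bool:
        # Any observation at all forces the spam / not-spam decision to stick.
        return bool(self.observed_via)


def may_purge(message: Message, user_opted_in: bool,
              spam_confidence: float, min_confidence: float = 0.99) -> bool:
    """Allow retroactive removal only with consent, no observation, high certainty."""
    return (user_opted_in
            and not message.collapsed
            and spam_confidence >= min_confidence)
```

A user who forwards everything or subscribes to a feed would have every message exposed on arrival, so `may_purge` would never return `True` for them, which is the early-collapse problem described above.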

Granted, this idea has undoubtedly been proposed and implemented a hundred times before. I just figured I’d toss it out there in case anyone hadn’t thought of it themselves. (Jeremy, feel free to pass this one along… Maybe I’ll switch over to Yahoo Mail.)

And speaking of switching, has anyone built a “home version” of a webmail client that kicks as much ass as Gmail? I’ve been running my own mail server for the past ten years, loyally using mutt as an email client, until the day I tried Gmail for the first time and never looked back. As I wrote before, Gmail’s user experience not only blows away the webmail competition, it blows away the desktop competition.

That said, I’d drop Gmail in a heartbeat if I could find an alternative that I could host myself. But as far as I know, none exist. So in the spirit of the lazyweb (not sure if I like that term), I’d be happy to kick in a bounty of a few hundred dollars to anyone who writes a webmail application that I like enough to use for myself on unto.net. I could write the back-end myself (though not to scale to millions of users, of course), but the front-end is far beyond my meager client-side means. If you know of anything, let me know…

6 Responses to “Schrödinger’s Collaborative Spam Filter”

  1. DeWitt Clinton Says:

    Actually, when I say that Gmail is blocking 70 spam messages per day for me, that’s inaccurate. They are really blocking many, many times more than that. However, those are the ones that made it through to the spam folder. I assume that most are caught before they are even delivered past their first lines of defense.

  2. stephen ogrady Says:

    hi dewitt - if you haven’t checked out Hula (hula-project.org) yet, you might give it a look. it’s certainly not Gmail quality yet, but that’s their stated ambition.

  3. DeWitt Clinton Says:

    Cool. It looks a bit young and their goals are certainly ambitious… I wish them the best.

  4. Chris Says:

    Ok, I really love that graphic…

    Anecdotally, I use Gmail and Hotmail, and Hotmail is abysmally bad at blocking anything, even when it’s obvious that all mails from a domain are evil. I just go in and manually block the domain within Hotmail, and then the spam gets better for a few days, but their automatic systems are pathetic. Gmail, on the other hand, has always worked nearly perfectly for me. Maybe I’ve gotten 5 unfiltered junk messages in the whole time I’ve used them, and 3 were on the first day I had the account, a long time back.

  5. Greg Says:

    Just FYI, this is how my old man is fighting spam, last time I checked: http://www.whitescarver.com/wiki/index.php/SpamBlock

    It’s way simpler, but if it was done collaboratively, it would work pretty darn good. One thing I like about IP group blocking is the incentive it provides for ISPs and hosts to enforce spam rules.

  6. Greg Says:

    Oh… also, I love this idea. It’s an “I wish I thought of that” idea. Similarly, you could apply this method to the “this is not spam” marker users put on messages incorrectly delivered to the junk folder. I have a feeling some of the subscription lists I belong to are frequently marked as spam by people who signed up for them, but are too lazy to unsubscribe; those legitimate messages end up in my junk folder all the time. You could call this the Yin and Yang of Schrödinger’s Collaborative Spam Filter, to abuse another hackneyed metaphor.