This project is a recent experiment in very simple machine learning. I downloaded Bill Yerazunis' CRM114 and began using it as a spam filter. It has worked very well, and is more accurate, with fewer false positives, than SpamAssassin with Razor2. One of the questions this raised was "Can this pick the signal out of high-volume information streams with a high noise ratio?" That's a pretty good description of Usenet, so I went for it.
You must install CRM114 first, or none of this will do anything useful. Everything included in this package is a convenience wrapper that reads news from the network, either classifies or trains via CRM114 into appropriate buckets, and then writes the results out in mbox format. All of the real work is done by CRM114. All work with these scripts has been done on Linux; while there is no fundamental reason why they couldn't run on other platforms, no effort has been expended to make them cross-platform. Any platform on which CRM114 compiles should be able to get this working with a little tweaking.
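To make the classify-then-write step above concrete, here is a minimal sketch in Python. It is not the code shipped in this package. The function name file_article, the classify callable, and the one-mbox-per-bucket file layout are all assumptions made for illustration. In real use, classify would hand the article to CRM114; here it is left as a plug-in so the sketch does not guess at CRM114's command line.

```python
import mailbox
import os
from email import message_from_string


def file_article(raw_article, classify, out_dir):
    """Append one news article to the mbox file of the bucket chosen for it.

    classify is a hypothetical callable that takes the raw article text and
    returns a bucket name such as "jobs". The real package delegates that
    decision to CRM114; this sketch only shows the surrounding plumbing.
    """
    bucket = classify(raw_article)
    # Assumed layout: one mbox file per bucket inside out_dir.
    path = os.path.join(out_dir, bucket + ".mbox")
    box = mailbox.mbox(path)  # the file is created if it does not exist yet
    try:
        box.add(message_from_string(raw_article))
        box.flush()  # write the appended message to disk
    finally:
        box.close()
    return path
```

Any mail reader that understands mbox can then open the resulting files, one per bucket.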
The results I have had with this are very good. With a training set of under 200 messages across 10 newsgroups, it is able to pick out the correct kind of job listings from jobs groups (and does a fine job of selecting by geography, too), find things I might want in for-sale groups, and generally work the way the theory says it should. One of the highest-priority bits of work for the next release will be adding extra forms of notification for really good matches, such as having them automatically e-mailed. This would let you run the scripts from cron jobs and have the robot let you know when things of note are posted, rather than having to check manually.
If you use this, please e-mail me feedback at the address below (please put "CRM" in the subject somewhere). Positive or negative, all of it is welcome. I'd like to make this as general and useful as possible. There may still be some embedded things that are specific to me and the way I do things. With luck, all of that can be eliminated before the next release.
Release 0.2 (March 3, 2003)
Page last modified on Saturday November 15 2003 by Dave Slusher