Last week I had the honor of guest-posting at Peter Vogel’s impressive Internet, Information Technology, & e-Discovery Blog. As some readers may know, Mr. Vogel is a frequent guest blogger here at Disputing.
Check out my guest blog!
By Victoria VanBuren, October 21, 2009.
The U.S. Supreme Court’s grant of certiorari to former Enron CEO Jeffrey Skilling dominated the news headlines last week. Interestingly, during its investigation into Enron’s involvement in the energy crisis of 2000-01, the Federal Energy Regulatory Commission (FERC) made a large database, called the “Enron Corpus,” available to the public. This dataset consists of about half a million e-mail communications from former Enron senior executives and energy traders.
Enron E-mail Dataset Research
Because of its size and public status, the Enron Corpus is a rare and valuable tool for experimenting with text classification methods. Since FERC posted it to the web, this dataset has been the subject of research by the computer science departments of several universities, including the Massachusetts Institute of Technology and Stanford University. In the summer of 2009, the team at TREC Legal Track, an organization co-sponsored by the U.S. Department of Defense, started conducting research on the Enron Corpus with the purpose of improving large-scale search techniques.
Our Research – Bayesian Text Classifier
In the spring of 2009, Texas State University computer science students David Villarreal, Thomas McMillen, Andrew Minnick, and I, under the supervision of computer forensic expert Wilbon Davis, used the Enron Corpus to train a Bayes-based algorithm to classify the Enron e-mails as relevant or irrelevant to a given legal issue. This type of algorithm is commonly used by e-mail spam filters.
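For readers curious about how a Bayes-based classifier of this kind works, here is a minimal sketch in Python. It is not the code from our project: it assumes a simple multinomial naive Bayes model with add-one smoothing, and the class name `NaiveBayesClassifier`, the `tokenize` helper, and the four training e-mails are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Illustrative multinomial naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # smoothing constant (assumed value)
        self.word_counts = {}         # label -> Counter of word frequencies
        self.doc_counts = Counter()   # label -> number of training e-mails
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab_size = len(self.vocabulary)
        # Start from the log prior: the share of training e-mails with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(text):
            if token not in self.vocabulary:
                continue  # ignore words never seen during training
            likelihood = (counts[token] + self.alpha) / (
                total_words + self.alpha * vocab_size
            )
            score += math.log(likelihood)
        return score

    def classify(self, text):
        # Pick the label with the highest log posterior score.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Hypothetical training e-mails, invented for this example.
clf = NaiveBayesClassifier()
clf.train("schedule the california power trades before the price cap", "relevant")
clf.train("shift power out of california to raise the price", "relevant")
clf.train("lunch on friday at the usual place", "irrelevant")
clf.train("please send the holiday party schedule", "irrelevant")

print(clf.classify("california power price"))
print(clf.classify("friday lunch party"))
```

The classifier scores a new e-mail by how often its words appeared in each category of the training e-mails, which is the same idea a spam filter uses to separate spam from legitimate mail.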
The team hoped that this mathematical approach would achieve better accuracy levels than the roughly 20% found using Boolean keyword searching, a method employed by many lawyers. Surprisingly, the Bayesian filter correctly identified e-mails known to be relevant at average rates ranging between 43% and 66%. As expected, accuracy on the irrelevant e-mails was even higher, with averages ranging between 44% and 77%. Texas State University published the Technical Report last week and it can be downloaded for free here.
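To make the two accuracy figures concrete, the short Python sketch below computes a separate accuracy rate for each category. It assumes that "accuracy" for a category means the fraction of e-mails known to belong to that category that the filter labeled correctly. The function name `per_class_accuracy` and the sample labels are hypothetical and are not taken from the Technical Report.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """Return, for each true label, the fraction of its e-mails labeled correctly."""
    totals = {}
    correct = {}
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == predicted:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


# Made-up labels for five e-mails, for illustration only.
truth = ["relevant", "relevant", "relevant", "irrelevant", "irrelevant"]
predicted = ["relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]

rates = per_class_accuracy(truth, predicted)
for label, rate in rates.items():
    print(f"{label}: {rate:.0%}")
```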