Project IV (MATH4072) 2017-18


Bayesian Spam Filtering

Matthias Troffaes

Description

In this project you will learn the basics of naive Bayes classification, and investigate how it is applied to spam filtering. One of the features of spam filtering is that classification results improve as the user trains the filter. However, during the initial phase, when little training data is available, traditional filters typically misclassify messages.
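As a rough illustration of the idea, the following is a minimal sketch of a naive Bayes spam filter in Python. It is not the algorithm of any particular real filter: the function names, the toy messages, and the use of add-one (Laplace) smoothing with parameter alpha are all assumptions made for the example.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. messages: iterable of (label, words),
    where label is 'spam' or 'ham' and words is a list of strings."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for label, words in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def log_posterior_odds(words, model, alpha=1.0):
    """Log of P(spam | words) / P(ham | words) under the naive
    (conditional independence) assumption. alpha is an assumed
    smoothing parameter; words never seen in training are skipped."""
    word_counts, class_counts, vocab = model
    log_odds = math.log((class_counts["spam"] + alpha)
                        / (class_counts["ham"] + alpha))
    totals = {c: sum(word_counts[c].values()) for c in word_counts}
    for w in words:
        if w not in vocab:
            continue
        p_spam = (word_counts["spam"][w] + alpha) / (totals["spam"] + alpha * len(vocab))
        p_ham = (word_counts["ham"][w] + alpha) / (totals["ham"] + alpha * len(vocab))
        log_odds += math.log(p_spam / p_ham)
    return log_odds

# Hypothetical toy training set.
training = [
    ("spam", ["win", "money", "now"]),
    ("spam", ["cheap", "money"]),
    ("ham", ["meeting", "tomorrow"]),
    ("ham", ["lunch", "now"]),
]
model = train(training)
print(log_posterior_odds(["money"], model) > 0)    # leans towards spam
print(log_posterior_odds(["meeting"], model) > 0)  # leans towards ham
```

A positive value favours spam and a negative value favours ham. With only four training messages the estimates rest on very few counts, which is exactly the initial phase in which such a filter is unreliable.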

You will study techniques from Bayesian decision making that use a set of probability distributions rather than a single one, allowing the filter to produce an "unsure" result in situations where there is a lack of training data.
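One way to get a feel for this is the sketch below, which replaces each point estimate by a probability interval and answers "unsure" when the intervals do not settle the question. It assumes the imprecise Dirichlet model with hyperparameter s (the value 2 is an assumed choice) and treats each word as a present/absent feature. Bounding every term separately is a simplification made for illustration, and it is not the dominance test of the naive credal classifier described in the resources below. All names are hypothetical.

```python
import math

def idm_interval(count, total, s=2.0):
    """Lower and upper probability of an event seen `count` times in
    `total` trials, under the imprecise Dirichlet model."""
    return count / (total + s), (count + s) / (total + s)

def classify(words, spam_docs, ham_docs, s=2.0):
    """spam_docs and ham_docs are lists of sets of words.
    Returns 'spam', 'ham' or 'unsure'."""
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    n = n_spam + n_ham
    # Intervals for the class probabilities.
    spam_lo, spam_hi = idm_interval(n_spam, n, s)
    ham_lo, ham_hi = idm_interval(n_ham, n, s)
    # Multiply in intervals for P(word present | class).
    for w in set(words):
        lo, hi = idm_interval(sum(w in d for d in spam_docs), n_spam, s)
        spam_lo, spam_hi = spam_lo * lo, spam_hi * hi
        lo, hi = idm_interval(sum(w in d for d in ham_docs), n_ham, s)
        ham_lo, ham_hi = ham_lo * lo, ham_hi * hi
    # Bounds on the posterior odds of spam against ham.
    lower_odds = spam_lo / ham_hi  # ham_hi > 0 because s > 0
    upper_odds = spam_hi / ham_lo if ham_lo > 0 else math.inf
    if lower_odds > 1:
        return "spam"
    if upper_odds < 1:
        return "ham"
    return "unsure"

# Hypothetical data: no training, very little training, more training.
print(classify(["money"], [], []))
print(classify(["money"], [{"money"}], [{"meeting"}]))
print(classify(["money"], [{"money"}] * 50, [{"meeting"}] * 50))
```

With no training data the intervals are vacuous and the result is "unsure"; the same happens with one message per class. Only with more data do the bounds become tight enough to commit to a class.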

We may also study various statistical methods used in real spam filters. The original algorithm, based on a naive Bayes classifier, has been improved in many ways, and can probably still be improved further.

Other applications of Bayesian classification and decision making could be a subject of study as well.

Prerequisites

Statistical Concepts II

Either Decision Theory III or Bayesian Statistics III/IV is recommended, although not strictly required.

Resources

  • Jonathan Zdziarski. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, 2005.
  • Statistical classification on Wikipedia.
  • Zaffalon, M. (1999). A credal approach to naive classification. In de Cooman, G., Cozman, F. G., Moral, S., Walley, P. (Eds), ISIPTA '99: Proceedings of the First International Symposium on Imprecise Probabilities and Their Applications. The Imprecise Probabilities Project, Universiteit Gent, Belgium, pp. 405–414.
  • Zaffalon, M. (2001). Statistical inference of the naive credal classifier. In de Cooman, G., Fine, T., Seidenfeld, T. (Eds), ISIPTA '01: Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications. Shaker Publishing, The Netherlands, pp. 384–393.
  • R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, 1973.
  • Marco Zaffalon. The naive credal classifier. Journal of Statistical Planning and Inference 105:5–21, 2002.