Project III (MATH3382) 2023-24


Probability Bounding Methods For Spam Filtering

Matthias Troffaes

Description

In this project you will learn the basics of naive Bayes classification and investigate how it is applied to spam filtering. A characteristic feature of spam filtering is that classification results improve as the user trains the filter. During the initial phase, however, traditional filters often misclassify messages. Filters may also struggle with words that they have not seen before.
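To make the idea concrete, here is a minimal sketch of a naive Bayes spam filter. It is an illustration only, not the method of any particular filter: it assumes a multinomial word model with add-one (Laplace) smoothing, and the names `train`, `spam_probability` and `alpha` are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class.

    messages: iterable of (list_of_words, label), label in {"spam", "ham"}.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for words, label in messages:
        class_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, class_counts


def spam_probability(words, word_counts, class_counts, alpha=1.0):
    """Posterior probability of spam under the naive independence assumption."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    v = len(vocab) + 1  # assumption: one extra slot stands in for unseen words
    total_msgs = sum(class_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Smoothed class prior.
        score = math.log((class_counts[label] + alpha) / (total_msgs + 2 * alpha))
        # Smoothed word likelihoods, multiplied together (summed in log space).
        for w in words:
            score += math.log(
                (word_counts[label][w] + alpha) / (total_words + alpha * v)
            )
        log_score[label] = score
    # Normalise in a numerically stable way.
    m = max(log_score.values())
    weights = {k: math.exp(s - m) for k, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])
```

With only one spam and one ham message as training data, a word that has never been seen yields a posterior of exactly one half. The filter reports a precise number even though it has no relevant information, which is the weakness described above.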

You will study the idea of bounding probabilities, leading to robust Bayesian analysis when combined with data. This allows the filter to produce an "unsure" result in situations where there is a lack of training data.
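As a rough illustration of how bounds can produce an "unsure" result, the sketch below bounds the probability that a message containing a given word is spam. It assumes the imprecise Dirichlet model, one common choice in the imprecise probability literature, with a hypothetical hyperparameter `s = 2`. It is a single-word simplification, not the naive credal classifier itself; the function names are illustrative.

```python
def idm_bounds(n_spam, n_ham, s=2.0):
    """Lower and upper probability of spam from counts.

    n_spam, n_ham: number of spam and ham messages containing the word.
    s: prior strength; larger values give wider, more cautious intervals.
    """
    n = n_spam + n_ham
    lower = n_spam / (n + s)
    upper = (n_spam + s) / (n + s)
    return lower, upper


def classify(n_spam, n_ham, s=2.0):
    """Return "spam" or "ham" only when the whole interval agrees."""
    lower, upper = idm_bounds(n_spam, n_ham, s)
    if lower > 0.5:
        return "spam"
    if upper < 0.5:
        return "ham"
    return "unsure"
```

With no data the interval is [0, 1] and the result is "unsure". The interval narrows as counts accumulate: for example, `classify(9, 1)` has lower bound 9/12 and returns "spam", while `classify(1, 0)` has bounds [1/3, 1] and remains "unsure".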

We may also study various probabilistic methods used in real spam filters, including Markov chains, and their bounding variant, imprecise Markov chains.

Other applications of classification and decision making under severe uncertainty could be a subject of study as well.

Prerequisites

Markov Chains II or Probability II

Resources

  • Jonathan Zdziarski. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, 2005.
  • Statistical classification on Wikipedia.
  • Zaffalon, M. (1999). A credal approach to naive classification. In de Cooman, G., Cozman, F. G., Moral, S., Walley, P. (Eds), ISIPTA '99: Proceedings of the First International Symposium on Imprecise Probabilities and Their Applications. The Imprecise Probabilities Project, Universiteit Gent, Belgium, pp. 405–414.
  • Zaffalon, M. (2001). Statistical inference of the naive credal classifier. In de Cooman, G., Fine, T., Seidenfeld, T. (Eds), ISIPTA '01: Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications. Shaker Publishing, The Netherlands, pp. 384–393.
  • R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley-Interscience, 1973.
  • Marco Zaffalon. The naive credal classifier. Journal of Statistical Planning and Inference 105:5–21, 2002.