Description
In this project you will learn the basics of naive Bayes
classification, and investigate how it is applied to spam
filtering. One of the features of spam filtering is that, as the user
trains the filter, classification results improve.
However, during the initial phase, when little training data is
available, traditional filters frequently misclassify messages.
In addition, filters may struggle with words that they have not
seen before.
You will study the idea of bounding probabilities, which,
when combined with data, leads to robust Bayesian analysis.
This allows the filter to produce
an "unsure" result in situations where there is a lack of training data.
We may also study various probabilistic methods used
in real spam filters, including Markov chains,
and their bounding variant, imprecise Markov chains.
Other applications of classification
and decision making under severe uncertainty could be a subject of
study as well.
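
As a rough illustration of how bounded probabilities can yield an "unsure" result, the following Python sketch replaces each naive Bayes probability estimate by an interval and only commits to a class when one class's lower score exceeds the other's upper score. It is a minimal sketch under stated assumptions, not the naive credal classifier of the references below: the class name IntervalNaiveBayes, the caution parameter s, the whitespace tokenizer, and the interval-comparison rule are all illustrative choices, and only the words present in a message are scored.

```python
import math
from collections import Counter


def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else -math.inf


class IntervalNaiveBayes:
    """Toy spam filter with interval-valued (bounded) probabilities."""

    def __init__(self, s=1.0):
        # Hypothetical caution parameter (must be > 0): larger s, wider bounds.
        self.s = s
        self.messages = Counter()  # number of training messages per label
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def train(self, text, label):
        self.messages[label] += 1
        # Count each distinct word once per message (presence model).
        self.word_counts[label].update(set(text.lower().split()))

    def log_bounds(self, text, label):
        """Bounds on log P(label) + sum of log P(word | label)."""
        n_label = self.messages[label]
        n_total = sum(self.messages.values())
        lower = safe_log(n_label / (n_total + self.s))
        upper = safe_log((n_label + self.s) / (n_total + self.s))
        for word in set(text.lower().split()):
            n_word = self.word_counts[label][word]
            lower += safe_log(n_word / (n_label + self.s))
            upper += safe_log((n_word + self.s) / (n_label + self.s))
        return lower, upper

    def classify(self, text):
        spam_lo, spam_hi = self.log_bounds(text, "spam")
        ham_lo, ham_hi = self.log_bounds(text, "ham")
        if spam_lo > ham_hi:
            return "spam"
        if ham_lo > spam_hi:
            return "ham"
        # The intervals overlap: the data cannot separate the classes yet.
        return "unsure"


spam_filter = IntervalNaiveBayes()
for _ in range(10):
    spam_filter.train("cheap viagra offer", "spam")
    spam_filter.train("project meeting tomorrow", "ham")

print(spam_filter.classify("viagra offer"))      # spam
print(spam_filter.classify("meeting tomorrow"))  # ham
print(spam_filter.classify("lottery"))           # unsure (word never seen)
```

With no training data at all, every lower bound is zero and every upper bound is one, so the sketch answers "unsure" for any message. As training messages accumulate, the intervals narrow and definite answers become possible. Comparing intervals in this way is more cautious than the dominance criteria studied in the literature, so it should be read only as a starting point.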
Prerequisites
Markov Chains II or Probability II
Resources
- Jonathan Zdziarski. Ending Spam: Bayesian Content Filtering and the
Art of Statistical Language Classification. No Starch Press, 2005.
- Statistical classification on Wikipedia.
- Zaffalon, M. (1999). A credal approach to naive classification. In de Cooman, G., Cozman, F. G., Moral, S., Walley, P. (Eds), ISIPTA '99: Proceedings of the First International Symposium on Imprecise Probabilities and Their Applications. The Imprecise Probabilities Project, Universiteit Gent, Belgium, pp. 405–414.
- Zaffalon, M. (2001). Statistical inference of the naive credal classifier. In de Cooman, G., Fine, T., Seidenfeld, T. (Eds), ISIPTA '01: Proceedings of the Second International Symposium on Imprecise Probabilities and Their Applications. Shaker Publishing, The Netherlands, pp. 384–393.
- R. O. Duda and P. E. Hart. Pattern Classification and Scene
Analysis. Wiley-Interscience, 1973.
- Marco Zaffalon. The naive credal classifier. Journal of Statistical
Planning and Inference 105:5–21, 2002.