Information-theoretic learning, 2015-2016 - Prospectus

This course is about fundamental questions, such as: can we mathematically formalize the concept of “information”? Or the concept of “simplicity”? And how are these notions related to the probabilistic concept of “entropy”? But at the same time it is about very practical, statistical questions such as: we want to model data by a polynomial of unknown degree. Given the data, what is our best guess for the degree?

The more we can compress a given set of data, the more regularity we have detected in the data. Viewing ‘learning’ as ‘finding pattern/structure/regularity’ in data, this suggests that the best explanation of a given set of data is the one that allows for the shortest description of the data. This idea is formalized in the Minimum Description Length Principle, an information-theoretic approach to learning from data. In its simplest form, it tells us to select the hypothesis H that minimizes the number of bits needed to describe H plus the number of bits needed to describe the data with the help of H.

When two hypotheses fit the data equally well, MDL tells us to prefer the ‘simplest’ one, allowing for the shortest description. Thus, it can be viewed as a mathematically precise form of Occam’s Razor. The MDL Principle can be applied to problems of statistical inference (e.g. model selection, estimation prediction, nonparametrics) but also to modeling with the very complex models that have traditionally been dealt with in Artificial Intelligence, Pattern Recognition and Machine Learning – learning of neural networks, context-free grammars and hidden Markov models of arbitrary complexity. In more advanced forms, MDL leads to a notion of ‘parametric complexity’ which is a measure of a model’s ability to fit random data, related to, but more sophisticated than, the number of degrees of freedom in the model. This is closely related to the notion of ‘Kolmogorov complexity’, which measures the inherent complexity of an object (a string, function or relation) by the length of the shortest computer program that implements or generates the object.

Central to MDL is the close relationship between codes and probability distributions; as such this course reviews basic concepts from both information theory and (mostly Bayesian) statistics.

Prerequisites
Introductory course on probability theory (no a priori knowledge about information theory is needed). Introductory course on statistics is helpful, but not strictly necessary.

Literature
P. Grünwald. /The Minimum Description Length Principle/. MIT Press, 2007. plus handouts of some articles (the articles will be made available free of charge).

Examination
Graded weekly home work exercises. Written final exam.

Remark
This course is useful for the tracks Applied Mathematics, Mathematics and Communication, Statistical Science for Life and Behavioural Sciences.
During the semester 2016/2017 this course will be lectured by Nishant Mehta.