Friday, January 4, 2008

Information Content in a discrete variable

There are many ways to understand the importance of a variable in a predictive modeling situation; two of the measures are revolutionary. Fisher information, which (for an efficient estimator) is the reciprocal of the variance, and Shannon entropy (information entropy) dominate the science. While Fisher information is consistent with the statistical considerations that lead to Type I and Type II errors (sensitivity vs. [1 - specificity], or true positives vs. true negatives), Shannon entropy considers the least number of bits needed to reproduce the original message. Once a message is coded into (0,1) bits, different coding methods give rise to different sequences of 0's and 1's. Shannon entropy provides a way to compare the goodness (information content) of coding schemes, measuring the worthiness of a coding for reproducibility and accuracy.

For example, if there are 8 teams competing in a sport, what is the minimum number of bits needed to encode the winner and send it around the world?

- there are 8 possibilities, and if we use a (0,1) system, then 2^3 = 8, meaning 3 bits are needed, which would give the codes (000, 001, 010, 011, 100, 101, 110, 111), each code representing one team. Interestingly, we are not taking into consideration that there is a probability distribution behind this. To find the number of bits, we just use the logarithm (base 2): log2(8) = 3.
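The fixed-length scheme above can be sketched in a few lines of Python; the variable names are mine, not from the post:

```python
import math

# With 8 equally likely teams, a fixed-length code needs
# ceil(log2(8)) = 3 bits, whichever team wins.
n_teams = 8
bits_needed = math.ceil(math.log2(n_teams))

# Assign each team a distinct 3-bit codeword: '000' .. '111'
codes = [format(i, "03b") for i in range(n_teams)]

print(bits_needed)  # 3
print(codes)        # ['000', '001', '010', '011', '100', '101', '110', '111']
```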
- if we take the probability distribution into consideration, the number of bits we need is 1.984375, on the average
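The post does not state which distribution gives this average, but one dyadic distribution that reproduces the figure exactly is (1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128); treating that as an illustrative assumption, Shannon's entropy formula H = -Σ p_i log2(p_i) works out as follows:

```python
import math

# Hypothetical distribution over the 8 teams (not given in the post):
# a dyadic distribution whose entropy is exactly 1.984375 bits.
p = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]

# Shannon entropy: the minimum average number of bits per message
# achievable by any coding scheme under this distribution.
H = -sum(pi * math.log2(pi) for pi in p)

print(H)  # 1.984375
```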

The key to understanding the whole science of the number of bits required to encode a message is the meaning and importance of the phrase "on the average," and why it is still a very practical definition even though the number of bits in any single codeword must be an integer.
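One way to see what "on the average" means is a variable-length prefix code. Assuming, for illustration, the dyadic distribution (1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128) whose entropy is 1.984375 bits (my assumption, not stated in the post), each team gets a codeword of integer length -log2(p_i), yet the expected length is fractional:

```python
# A prefix-free code: frequent teams get short codewords, rare teams long ones.
# Every codeword length is an integer, but the *expected* length is not.
p = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]
codewords = ["0", "10", "110", "1110", "11110", "111110", "1111110", "1111111"]

# Expected codeword length = sum of p_i * len(codeword_i)
avg_bits = sum(pi * len(cw) for pi, cw in zip(p, codewords))

print(avg_bits)  # 1.984375 -- matches the entropy exactly for this distribution
```

Because the distribution is dyadic, this code achieves the entropy exactly; for general distributions the average length of the best prefix code lies within one bit of the entropy.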

Shannon's great foundational paper is here: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

A great document that explains why and how to calculate the number of bits needed to code a message:
http://www.cs.cmu.edu/~dst/Tutorials/Info-Theory/
