This is an old revision of this page, as edited by Jheald (talk | contribs) at 03:30, 19 December 2005 (→"Information gain"). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
Revision as of 03:30, 19 December 2005 by Jheald (talk | contribs) (→"Information gain")(diff) ← Previous revision | Latest revision (diff) | Newer revision → (diff)In probability theory and information theory, the Kullback-Leibler divergence, or relative entropy, is a natural distance measure of an arbitrary probability distribution q from a "true" probability distribution p.
It can be interpreted as the expected extra message-length per datum that must be communicated if a code is used that is optimal for a given (wrong) distribution q, compared to using a code based on the true distribution p.
For probability distributions of a discrete variable the KL divergence (or informally KL distance) is defined
For distributions of a continuous random variable the summations give way to integrals, so
The logarithms in these formulae are conventionally taken to base 2, so that the quantity can be interpreted in units of bits; the other important properties of the KL divergence hold irrespective of log base.
Terminology
Originally introduced by Solomon Kullback and Richard Leibler in 1951, the term "divergence" is a misnomer; it is not the same as divergence in calculus. One might be tempted to call it a "distance metric", but this would also be a misnomer as the Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality.
It can be seen from the definition that
where H(p,q) is the cross entropy of p and q, and H(p) the entropy of p.
The cross-entropy is always greater than or equal to the entropy, and the Kullback-Leibler divergence always nonnegative, a result known as Gibbs' inequality, with KL(p:q) zero only iff p = q.
"Information gain"
In Bayesian statistics the KL divergence can be used as a measure of the "distance" between the prior distribution and the posterior distribution. The KL divergence is also the gain in Shannon information involved in going from the prior to the posterior. In Bayesian experimental design a design which is optimised to maximise the KL divergence between the prior and the posterior is said to be Bayes d-optimal.
According to Shannon's source coding theorem, the shortest expected length for a message to identify one out of an set A of possibilities is obtained by devising a code such that each possibility a has a code length -log(p(a|I)). This gives an average message length -∑ p(a|I) log p(a|I), which is the Shannon entropy of A.
If some new fact X=x is now discovered, it can be used to update the probability distribution for A from p(a|I) to a new posterior probability distribution p(a|x). This has a new entropy, -∑ p(a|x) log p(a|x), which may be less than or greater than the original entropy. However, with the new knowledge it can be estimated that to have used the original code based on p(a|I) instead of a new code based on p(a|x) would have added an extra -∑ p(a|x) {log p(a|x) - log p(a|I)} to the message length.
This extra message length per datum that would have been inflicted by using the wrong distribution for A is the K-L distance from p(a|x) to p(a|I),
It therefore represents the amount of useful information about A, or "information gain" (Renyi, 1961) about A, that we can estimate has been learned by discovering X=x.
Quantum information theory
In quantum information science it is used as a measure of entanglement in a state.
Symmetrised divergence
It should be noted that Kullback and Leibler themselves actually defined the divergence as:
which is symmetric and nonnegative.
References
- S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics 22(1):79–86, March 1951.