In probability theory and information theory, the Kullback-Leibler divergence (also called relative entropy or information divergence) is a natural distance measure from a "true" probability distribution P to an arbitrary probability distribution Q. Typically P represents data, observations, or a precisely calculated probability distribution, while Q represents a theory, a model, a description, or an approximation of P.
It can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution Q is used, compared to using a code based on the true distribution P.
For probability distributions P and Q of a discrete random variable, the K-L divergence (or, informally, the K-L distance) from P to Q is defined to be
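D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}.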
For distributions P and Q of a continuous random variable, the summation gives way to an integral, so that
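D_{\mathrm{KL}}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx,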
where p and q denote the densities of P and Q.
The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base e if information is measured in nats. Most formulas involving the KL divergence hold regardless of the base of the logarithm.
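Changing the base of the logarithm only rescales the divergence by a constant factor: since \log_2 x = \ln x / \ln 2, a divergence computed in nats is converted to bits by dividing by \ln 2, i.e. (with the superscripts here simply labelling the unit)

D_{\mathrm{KL}}^{(\mathrm{bits})}(P \,\|\, Q) = \frac{1}{\ln 2}\, D_{\mathrm{KL}}^{(\mathrm{nats})}(P \,\|\, Q).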
Terminology
The divergence was originally introduced by Solomon Kullback and Richard Leibler in 1951. The term "divergence" is a misnomer; it is not the same as divergence in calculus. One might also be tempted to call it a "distance metric", but this too would be a misnomer, as the Kullback-Leibler divergence is not symmetric and does not satisfy the triangle inequality.
It can be seen from the definition that
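D_{\mathrm{KL}}(P \,\|\, Q) = H(P, Q) - H(P),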
where H(P,Q) is the cross entropy of P and Q, and H(P) the entropy of P.
The cross-entropy is always greater than or equal to the entropy, so the Kullback-Leibler divergence is always nonnegative, a result known as Gibbs' inequality; the divergence is zero if and only if P = Q.
"Information gain"
In Bayesian statistics the KL divergence can be used as a measure of the "distance" between the prior distribution and the posterior distribution. The KL divergence is also the gain in Shannon information involved in going from the prior to the posterior. In Bayesian experimental design, a design optimised to maximise the expected KL divergence between the prior and the posterior is said to be Bayes d-optimal.
According to Shannon's source coding theorem, the shortest expected length for a message to identify one out of a set A of possibilities is obtained by devising a code such that each possibility a has a code length −log p(a|I). This gives an average message length −∑ p(a|I) log p(a|I), which is the Shannon entropy of A.
If some new fact X=x is now discovered, it can be used to update the probability distribution for A from p(a|I) to a new posterior probability distribution p(a|x). This has a new entropy, -∑ p(a|x) log p(a|x), which may be less than or greater than the original entropy. However, with the new knowledge it can be estimated that to have used the original code based on p(a|I) instead of a new code based on p(a|x) would have added an extra -∑ p(a|x) {log p(a|I) - log p(a|x)} to the message length.
This expected extra message length per datum, incurred by using the wrong distribution for A, is the KL divergence from p(a|x) to p(a|I),
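D_{\mathrm{KL}}\bigl(p(\cdot \mid x) \,\|\, p(\cdot \mid I)\bigr) = \sum_{a} p(a \mid x) \log \frac{p(a \mid x)}{p(a \mid I)}.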
It therefore represents the amount of useful information, or "information gain" (Rényi, 1961), about A that we can estimate has been learned by discovering X = x.
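For example, if the prior over two possibilities is uniform, p(a_1|I) = p(a_2|I) = 1/2, and observing x gives the posterior p(a_1|x) = 3/4, p(a_2|x) = 1/4 (purely illustrative numbers), then the information gain is

\tfrac{3}{4} \log_2 \frac{3/4}{1/2} + \tfrac{1}{4} \log_2 \frac{1/4}{1/2} = \tfrac{3}{4} \log_2 \tfrac{3}{2} - \tfrac{1}{4} \approx 0.19 \text{ bits}.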
Quantum information theory
For density matrices P and Q on a Hilbert space, the K-L divergence (or relative entropy, as it is often called in this case) from P to Q is defined to be
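D_{\mathrm{KL}}(P \,\|\, Q) = \operatorname{Tr}\bigl( P (\log P - \log Q) \bigr).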
In quantum information science it can also be used as a measure of entanglement in a state.
Symmetrised divergence
Kullback and Leibler themselves actually defined the divergence as:
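D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P),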
which is symmetric and nonnegative. This quantity has found almost no applications. Another symmetrised divergence is the Jensen-Shannon divergence, defined by
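\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \qquad M = \tfrac{1}{2}(P + Q),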
which has an interpretation as the capacity of a noisy information channel with two inputs giving the output distributions P and Q. The Jensen-Shannon divergence is the square of a metric that is equivalent to the Hellinger metric.
References
- S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics 22(1):79–86, March 1951.