Entropy, Cross Entropy, KL Divergence, Random Number Generator, Truncated Distribution
Base          Information unit
$\log_2$      bit
$\log_e$      nat
$\log_{10}$   Hartley

Note that changing the base changes the value of the information by a multiplicative constant and its units:
$$\log_a(X) = \frac{\log_b(X)}{\log_b(a)} = \log_a(b) \cdot \log_b(X).$$

Average Information Measure of a Discrete Random Variable: Let $X$ be a discrete random variable which takes values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ respectively, i.e. $P(X = x_i) = p_i \;\; \forall\, i \in \{1, 2, \ldots, n\}$ and $\sum_{i=1}^{n} p_i = 1$. The average information of the random variable $X$ is
$$I(X) = -\sum_{i=1}^{n} P(X = x_i) \log\big(P(X = x_i)\big) = \sum_{i=1}^{n} P(X = x_i) \log \frac{1}{P(X = x_i)} = \sum_{i=1}^{n} p_i \log \frac{1}{p_i},$$
where $\log \frac{1}{P(X = x_i)}$, or equivalently $\log \frac{1}{p_i}$, is the information contained in the event $X = x_i$. The average information of a discrete random variable is called the entropy, and we formally define it as follows:
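As a quick illustration of the formula above, the following Python sketch (our own addition, not part of the original notes; the function name `average_information` and the example distribution are illustrative) computes $I(X) = \sum_i p_i \log \frac{1}{p_i}$ for a probability vector and checks that changing the base only rescales the result by the multiplicative constant from the change-of-base identity.

```python
# Illustrative sketch (not from the notes): average information of a
# discrete distribution, in a chosen logarithm base.
import math

def average_information(probs, base=2):
    """Average information I(X) = sum_i p_i * log(1/p_i).

    probs : probabilities p_1, ..., p_n summing to 1.
    base  : 2 -> bits, math.e -> nats, 10 -> Hartleys.
    Terms with p_i = 0 contribute 0, since p*log(1/p) -> 0 as p -> 0.
    """
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Example: a uniform distribution over 4 outcomes.
p = [0.25, 0.25, 0.25, 0.25]
bits = average_information(p, base=2)       # 2.0 bits
nats = average_information(p, base=math.e)  # 2 * ln(2) ≈ 1.386 nats

# Change of base: log_e(X) = log_e(2) * log_2(X), so the two values
# differ only by the constant factor ln(2).
assert abs(nats - bits * math.log(2)) < 1e-12
print(bits, nats)
```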
Definition 0.0.1: Entropy
Let $X$ be a discrete random variable with probability mass function $P(X)$. Then the entropy of $X$ is given by
$$H(X) = \sum_{x \in R_X} P(X = x) \log \frac{1}{P(X = x)} \qquad [R_X \text{ denotes the range of } X]$$
$$= -\sum_{x \in R_X} P(X = x) \log\big(P(X = x)\big) = -\mathbb{E}\left[\log P(X)\right] = \mathbb{E}\left[\log \frac{1}{P(X)}\right].$$
In the above equations, $\log$ denotes $\log_2$, and the entropy is expressed in bits. For a base other than 2, say $b$, we denote the entropy by $H_b(X)$.

Note: The entropy of $X$ can be interpreted as the expected value of the random variable $\log \frac{1}{P(X)}$, where $X$ is drawn according to the probability mass function $P(X)$. The entropy measures the expected uncertainty in $X$. We can also say that $H(X)$ is approximately equal to how much information we learn on average from one instance of the random variable $X$.

Properties
• $H(X) \ge 0$.
• $H_b(X) = \log_b(a)\, H_a(X)$.

Note: In this document, we use $\log$ to denote log base 2, i.e. $\log_2$.

Example 0.0.2
Let
$$X = \begin{cases} 1, & \text{with probability } p, \\ 0, & \text{with probability } 1 - p. \end{cases}$$
Compute the entropy of $X$.
Solution
$$H(X) = -p \log(p) - (1 - p) \log(1 - p).$$
$H(X)$ is a function of $p$, i.e., of $P(X = 1)$, and we sometimes write it as $H(p)$.

[Figure 1: $H(X)$ vs. $p$ — plot of the binary entropy $H(X)$ against $p = P(X = 1)$ for $p \in [0, 1]$.]

The above figure shows that $H(X) = 1$ bit for $p = 1/2$. The figure also illustrates some basic properties of the entropy:
• Entropy is a concave function of the distribution and is equal to 0 when $p = 0$ or $p = 1$. This makes sense because for $p = 0$ and $p = 1$ the variable is not random, and there is no uncertainty.
• The uncertainty is maximum when $p = 1/2$, which also corresponds to the maximum value of the entropy.
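To reproduce the behavior described above, here is a small Python sketch (our own illustration, not from the notes; the function name `binary_entropy` and the grid search are assumptions) of the binary entropy $H(p) = -p\log_2(p) - (1-p)\log_2(1-p)$. It confirms that $H(1/2) = 1$ bit, that $H(0) = H(1) = 0$, and that the maximum over a fine grid of $p$ values is attained at $p = 1/2$.

```python
# Illustrative sketch (not from the notes): the binary entropy function.
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with the convention 0*log(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(1/2) = 1 bit; H(0) = H(1) = 0 (no uncertainty when X is not random).
assert abs(binary_entropy(0.5) - 1.0) < 1e-12
assert binary_entropy(0.0) == 0.0 and binary_entropy(1.0) == 0.0

# The maximum over a fine grid of p values is attained at p = 1/2.
grid = [i / 1000 for i in range(1001)]
p_max = max(grid, key=binary_entropy)
print(p_max, binary_entropy(p_max))  # 0.5 1.0
```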