Statistics rests heavily on assumptions about probability distributions. We
have all heard of linear regression and have probably built a linear
regression model or two, rather mechanically. But it should be understood that
this statistical model also relies on underlying assumptions about the
distribution of the errors. Technically speaking, there are two types of
distributions: discrete and continuous. The outcome of a coin toss, the
number of customers arriving, and winning a bet after losing 'n' times in
succession are typical examples of discrete outcomes, and hence they can be
modeled with discrete probability distributions. Other events, such as the
time before a bulb fails, the inter-arrival time between two customers, or
choosing one set of distributions over another, are examples of events that
can be modeled using continuous distributions. The figure below shows a few
commonly used discrete and continuous distributions.
In this blog, we restrict ourselves to only a few such distributions that
are encountered frequently in data analytics.
Binomial Distribution
The binomial distribution deals with only two outcomes, i.e., success and
failure. It is up to us to decide which event counts as a success. For
example, when tossing a coin, we may consider "Head" a success. This
choice is arbitrary; we could equally choose "Tail" as a success, but
it is customary to consider "Head" a success. Each distribution is defined
by some parameters. For the binomial distribution, the parameters are
the number of trials (denoted by N) and the probability of success (denoted
by 'p'). If there are x successes out of N trials, the
probability of getting exactly x successes is given by: \[\begin{equation}
P(x|N,p)={N \choose x} p^x(1-p)^{N-x} \end{equation}\]Here ${N \choose x}$
denotes the number of combinations. This term is important because, with N
trials, x successes can happen in ${N \choose x}$
ways, and each such arrangement has probability
$p^x(1-p)^{N-x}$. The probability function of a discrete random variable is
called the probability mass function (PMF). A PMF is defined at a
particular point (exactly x successes in this case) of a discrete
probability distribution. Two important properties of a PMF are
\[\begin{align}\sum_{x_i} P(x_i) = 1 \\ P(x_i) \ge 0
\end{align}\]The binomial distribution is used in the parameter estimation of
the logistic regression model, where the objective is to estimate the
probability of success given some extra information. Logistic regression will
be discussed separately in another blog.
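As a quick check of the formula and the two PMF properties, here is a minimal sketch. The values N = 10 and p = 0.5 (ten fair coin tosses) and the helper name `binomial_pmf` are illustrative assumptions, not part of the discussion above. It computes the PMF directly from the formula and compares it with `scipy.stats.binom.pmf`.

```python
from math import comb

import numpy as np
from scipy.stats import binom

N, p = 10, 0.5  # assumed illustrative values: 10 tosses of a fair coin


def binomial_pmf(x, N, p):
    """P(x | N, p) computed directly from the binomial formula."""
    return comb(N, x) * p**x * (1 - p) ** (N - x)


xs = np.arange(N + 1)
manual = np.array([binomial_pmf(int(x), N, p) for x in xs])
library = binom.pmf(xs, N, p)

print("P(exactly 5 heads) =", manual[5])
print("sum of PMF values  =", manual.sum())
print("max difference from scipy =", np.abs(manual - library).max())
```

The PMF values are all non-negative and sum to one (up to floating-point error), which is exactly what the two properties require.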
Example
Unguided missiles are not accurate enough to reliably hit a target from a
distance. During a war, a battery of unguided missiles is fired at a bunker.
The bunker will be completely destroyed if four missiles hit it. If each
missile hits the target with a probability of 60%, how many missiles must be
fired to destroy the bunker with a probability of more than 90%?
Solution
For complete destruction, a minimum of 4 hits is necessary. Assuming that the
missiles hit independently of one another, the number of hits follows a
binomial distribution, and the required probability is \[1-P(No\ hit
(x=0)|N,0.6)-P(One\ hit (x=1)|N,0.6)-P(Two\ hits (x=2)|N,0.6)-P(Three\ hits
(x=3)|N,0.6)\]Here N is unknown and the required minimum probability is 0.9.
Hence, we need to solve the following equation:
\[0.9=1-P(x=0|N,0.6)-P(x=1|N,0.6)-P(x=2|N,0.6)-P(x=3|N,0.6)\]\[0.1=0.4^N+N(0.6)(0.4)^{N-1}+\frac{N(N-1)}{2}(0.6^2)(0.4)^{N-2}+\frac{N(N-1)(N-2)}{6}(0.6^3)(0.4)^{N-3}\]The
above equation cannot easily be solved analytically, so we need a numerical
method to find the solution. Since N must be a whole number, the answer is the
smallest whole number that is at least as large as the real-valued root. We
can use the scipy package in Python to solve the equation.
Python Code
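import numpy as np
from scipy.optimize import fsolve

p = 0.6  # probability that a single missile hits the bunker

# 0.1 minus P(at most 3 hits in N trials); the root is the real-valued N
# at which the probability of at least 4 hits equals 0.9
func = lambda N: 0.1 - ((1-p)**N + N*p*(1-p)**(N-1) + N*(N-1)/2*((p)**2)*(1-p)**(N-2) + N*(N-1)*(N-2)/6*((p)**3)*(1-p)**(N-3))
initial_guess = 4
root = fsolve(func, initial_guess, maxfev=1000)[0]
# N must be a whole number, so round the root up to the next integer
print(int(np.ceil(root)))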
The solution is "at least 9 missiles must be fired to be 90% sure of destroying the bunker".
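Because N must be a whole number, the answer can also be cross-checked by a direct search over integers instead of a root finder. The sketch below is one way to do this, assuming independent missiles; the variable names `hits_needed` and `target` are illustrative. It uses `scipy.stats.binom.sf`, the survival function, where `binom.sf(3, N, p)` is the probability of more than 3 hits, i.e., at least 4.

```python
from scipy.stats import binom

p = 0.6          # probability that a single missile hits
hits_needed = 4  # hits required to destroy the bunker
target = 0.9     # required probability of destruction

# Start from the fewest missiles that could possibly give 4 hits and
# increase N until P(at least 4 hits) reaches the target.
N = hits_needed
while binom.sf(hits_needed - 1, N, p) < target:
    N += 1

print("missiles needed:", N)
print("P(at least 4 hits):", binom.sf(hits_needed - 1, N, p))
```

With 8 missiles the probability of at least four hits is still below 0.9, while 9 missiles push it just above 0.9, which agrees with the answer above.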
In logistic regression, when we use maximum likelihood estimation (MLE) to estimate the model parameters, we use this distribution to model the likelihood of the observed values of the target variable. The next important discrete distribution is the Poisson distribution.
Poisson Distribution
The Poisson distribution is another discrete distribution; it deals with count data and is used in Poisson regression and some other statistical tests. Unlike the binomial distribution, which counts successes in a fixed number of trials, the Poisson distribution gives the probability that a given number of events occurs within a fixed interval of time, when the events occur at a constant mean rate and independently of the time since the last event. That sentence is long and (probably) confusing. To understand it better, let us imagine a road crossing and a fixed time frame, say 9 AM to 10 AM (i.e., the fixed time interval). Assume that, on average, 10 vehicles (i.e., the mean rate) pass through the crossing during that hour. When a vehicle passes through the crossing, it does not affect when another vehicle passes through the same crossing (i.e., independence from the last event). Now, on a given day, the number can be 5, 15, 25, 30 or even 50. While 5 and 15 have higher probabilities of occurrence, 25, 30 and 50 have much lower probabilities (but not zero!). This can be modeled with a Poisson distribution with mean rate 10. Mathematically, the Poisson distribution is defined as \[P(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}\]Here, $\lambda$ is the mean rate and $x$ is the number of occurrences of the event. Note that $P(x;\lambda)$ is a probability mass function.
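To put numbers on the road-crossing illustration, the sketch below evaluates the Poisson PMF with mean rate 10 at the counts mentioned above. The variable names are illustrative, and the hand-written formula is included only to show that it matches `scipy.stats.poisson.pmf`.

```python
from math import exp, factorial

from scipy.stats import poisson

lam = 10                      # mean number of vehicles per hour
counts = (5, 15, 25, 30, 50)  # vehicle counts from the illustration

probs = {}
for x in counts:
    manual = lam**x * exp(-lam) / factorial(x)  # formula written out
    probs[x] = poisson.pmf(x, lam)              # same value from scipy
    print(f"P({x}; {lam}) = {probs[x]:.3e} (manual: {manual:.3e})")
```

The probabilities for 25, 30 and 50 vehicles come out orders of magnitude smaller than those for 5 and 15, yet none of them is exactly zero.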
Example
An ENT specialist sees, on average, 10 patients in a day. What would be his expected earning on a day when the number of patients is between 15 and 20, if he charges INR 1000 per patient for the visit? Note: the 15-20 patient range is a special case that does not disturb the mean rate.
Solution
This problem can be solved by assuming that the number of patient visits follows a Poisson distribution with mean rate 10 (i.e., $\lambda = 10$). The number of patients varies from 15 to 20 (i.e., $x = 15,16,17,18,19,20$). We need to calculate the revenue associated with each number of patients and multiply it by the associated Poisson probability to find the expected earning. It is important to note that, before evaluating the expected value, the probabilities must be normalized so that they sum to exactly one over this range. The expected earning $E$ is therefore \[E=\sum_{x=15}^{20}\frac{P(x;\lambda)}{\sum_{k=15}^{20}P(k;\lambda)}(1000x)\]To solve this problem, it is better to use either R or Python code. I show the Python code here.
Python Code
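import numpy as np
from scipy.stats import poisson

prob = []
for x in range(15, 21):
    # Poisson probability of exactly x patients at mean rate 10
    prob.append(poisson.pmf(x, 10))
# normalize so that the probabilities over 15-20 sum to one
prob_final = np.array(prob) / sum(prob)
E = sum(1000 * np.array(range(15, 21)) * prob_final)  # expected earning in INR
print(E)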
The final result is approximately INR 16133. In a separate post, I shall elaborate on Poisson regression with mathematical details.
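Normalizing the probabilities over 15 to 20 is the same as conditioning on the day having between 15 and 20 patients. A rough simulation can be used as a sanity check of that interpretation. The sketch below is only an illustration: the seed and the sample size are arbitrary assumptions. It draws Poisson counts with mean 10, keeps the days that fall in the range, and averages the earnings on those days.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
visits = rng.poisson(lam=10, size=2_000_000)  # simulated daily patient counts

# Keep only the days on which 15 to 20 patients showed up
in_range = visits[(visits >= 15) & (visits <= 20)]

estimate = 1000 * in_range.mean()  # average earning (INR) on those days
print("share of days in range:", in_range.size / visits.size)
print("simulated expected earning:", round(estimate))
```

The simulated value should land close to the analytical result, with small differences due to sampling noise.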