Analytics for Everyone: 2022

Statistics rests heavily on assumptions about probability distributions. We have all heard of linear regression and have probably fitted some linear regression model, rather mechanically. But it should be understood that this statistical model also relies on underlying assumptions about the distribution of errors. Technically speaking, there are two types of distributions: discrete and continuous. Outcomes of coin tosses, the number of customers arriving, and winning a bet after losing 'n' times in succession are typical examples of discrete outcomes, and hence they can be modeled with discrete probability distributions. Other events, such as the time before the failure of bulbs, the inter-arrival time between two customers, or choosing one set of distributions over another, are examples of events that can be modeled using continuous distributions. The figure below shows a few commonly used discrete and continuous distributions.

In this blog, we restrict ourselves to the few distributions that are encountered most frequently in data analytics.

Binomial Distribution

The binomial distribution deals with only 2 outcomes, i.e., success and failure. It is up to us to decide when an event is considered a success. For example, when tossing a coin, we may consider "Head" as a success. This choice is rather arbitrary, and we could equally choose "Tail" as a success, but it is customary to consider "Head" as success.

Each distribution is defined by means of some parameters. For the binomial distribution, the parameters are the number of trials (denoted by N) and the probability of success (denoted by 'p'). So, if there are x successes out of N trials, the probability of getting exactly x successes is given by: \[\begin{equation} P(x|N,p)={N \choose x} p^x(1-p)^{N-x} \end{equation}\]Here ${N \choose x}$ denotes the number of combinations. This term is important because, if there are N trials, x successes can happen in ${N \choose x}$ ways, and for each such combination the associated probability is $p^x(1-p)^{N-x}$.

The probability obtained for a discrete event is called the probability mass function (pmf). The PMF is defined at a particular point (exactly x successes in this case) of a discrete probability distribution. Two important properties of a PMF are \[\begin{align}\sum_{x_i} P(x_i) = 1 \\ P(x_i) \ge 0 \end{align}\]The binomial distribution is used in parameter estimation of the logistic regression model, where the objective is to estimate the probability of success given some extra information. Logistic regression will be discussed separately in another blog.
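As a minimal sketch of the formula and the two PMF properties above, the snippet below evaluates the binomial pmf by hand and compares it with `scipy.stats.binom.pmf`. The helper name `binomial_pmf` and the values N = 10, p = 0.5 are illustrative assumptions, not part of the discussion above.

```python
from math import comb

import numpy as np
from scipy.stats import binom


def binomial_pmf(x, N, p):
    """Probability of exactly x successes in N independent trials (hypothetical helper)."""
    return comb(N, x) * p**x * (1 - p) ** (N - x)


N, p = 10, 0.5  # illustrative values: 10 tosses of a fair coin
manual = np.array([binomial_pmf(x, N, p) for x in range(N + 1)])
library = binom.pmf(np.arange(N + 1), N, p)

print(np.allclose(manual, library))  # the hand-written formula agrees with scipy
print(manual.sum())                  # the pmf values sum to 1 (up to rounding)
print((manual >= 0).all())           # every pmf value is non-negative
```

Writing the formula out once makes it easier to see what the library call returns; in practice `binom.pmf` is the more convenient choice.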

Example

Unguided missiles are not accurate enough to reliably hit a target from a distance. During a war, a battery of unguided missiles is fired at a bunker. The bunker will be completely destroyed if at least four missiles hit it. If the probability of a single missile hitting the target is 60%, how many missiles must be fired to destroy the bunker with a probability above 90%?

Solution

For complete destruction, a minimum of 4 hits is necessary. That means the required probability is \[P(x\ge 4|N,0.6)=1-\sum_{x=0}^{3}P(x|N,0.6)\]Here N is unknown and the required minimum probability is 0.9. Since N must be a whole number, we are looking for the smallest N for which this probability reaches 0.9. Treating N as a continuous variable, the boundary is given by the following equation: \[0.9=1-P(x=0|N,0.6)-P(x=1|N,0.6)-P(x=2|N,0.6)-P(x=3|N,0.6)\]\[0.1=0.4^N+N(0.6)(0.4)^{N-1}+\frac{N(N-1)}{2}(0.6^2)(0.4)^{N-2}+\frac{N(N-1)(N-2)}{6}(0.6^3)(0.4)^{N-3}\]This equation cannot be solved easily by hand, so we need a numerical method to find the solution. We can use the scipy package in Python to solve the equation and then take the smallest whole number of missiles that meets the requirement.

Python Code

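import numpy as np
from scipy.optimize import fsolve

p = 0.6  # probability that a single missile hits the target

# 0.1 minus P(at most 3 hits in N shots); it is zero where P(at least 4 hits) = 0.9
func = lambda N: 0.1 - ((1-p)**N + N*p*(1-p)**(N-1) + N*(N-1)/2*(p**2)*(1-p)**(N-2) + N*(N-1)*(N-2)/6*(p**3)*(1-p)**(N-3))

initial_guess = 4
root = fsolve(func, initial_guess, maxfev=1000)[0]  # real-valued root of the equation

# Round up: the number of missiles must be a whole number that still meets the requirement
print(int(np.ceil(root)))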

The solution is that at least 9 missiles must be fired to be 90% sure of destroying the bunker.
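Because N can only take whole-number values, the answer can also be cross-checked by a direct search, without solving the continuous equation. The sketch below is one possible way to do this with `scipy.stats.binom.sf`; the variable names `hits_needed` and `target` are illustrative assumptions.

```python
from scipy.stats import binom

p = 0.6          # probability that a single missile hits the target
hits_needed = 4  # hits required to destroy the bunker
target = 0.9     # required probability of destruction

N = hits_needed  # fewer than 4 missiles can never produce 4 hits
# binom.sf(k, N, p) is P(X > k), so sf(hits_needed - 1, N, p) is P(X >= hits_needed)
while binom.sf(hits_needed - 1, N, p) < target:
    N += 1

print(N, binom.sf(hits_needed - 1, N, p))
```

The search stops at the first N whose probability of at least four hits reaches the target, which should agree with the answer from the root-finding approach.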

In logistic regression, when we use maximum likelihood estimation (MLE) to estimate the model parameters, we use this distribution to model the likelihood of the observed values of the target variable. The next important discrete distribution is the Poisson distribution.

Poisson Distribution

The Poisson distribution is another discrete distribution, and it deals with count data. It is used in Poisson regression and in some other statistical tests. Unlike the binomial distribution, the Poisson distribution gives the probability that a given number of events occurs within a fixed interval of time, when the events occur at a constant mean rate and independently of the time since the last event. The above sentence is long and (probably) confusing. To understand it better, let us imagine a road crossing and a fixed time frame, say 9 AM to 10 AM in the morning (i.e., a fixed time interval). Assume that, on average, 10 vehicles (i.e., the mean rate) cross the crossing during that time. When a vehicle crosses the crossing, it does not affect whether another vehicle crosses the same crossing (i.e., events are independent of the last event). Now, on a given day, the number can be 5, 15, 25, 30 or even 50. While 5 and 15 have higher probabilities of occurrence, 25, 30 and 50 have lower probabilities (but not zero!). This can be modeled with a Poisson distribution with mean rate 10. Mathematically, the Poisson distribution is defined as \[P(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}\]Here, $\lambda$ is the mean rate and $x$ is the number of occurrences of the event. It is to be understood that $P(x;\lambda)$ is the probability mass function.
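To make the road-crossing illustration concrete, the short sketch below evaluates the Poisson pmf with mean rate 10 at the vehicle counts mentioned above, using `scipy.stats.poisson.pmf`. The variable names `lam` and `probs` are illustrative assumptions.

```python
from scipy.stats import poisson

lam = 10  # mean number of vehicles between 9 AM and 10 AM, as in the text

# Probability of observing exactly `count` vehicles in the interval
probs = {count: poisson.pmf(count, lam) for count in (5, 15, 25, 30, 50)}

for count, prob in probs.items():
    print(count, prob)
```

The printed values for 5 and 15 vehicles are noticeably larger than those for 25, 30 and 50, and the latter are very small but still strictly positive.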

Example

An ENT specialist sees, on average, 10 patients in a day. What would be his expected earning on a day when the number of patients is known to lie between 15 and 20, if he charges INR 1000 per patient for the visit? Note: 15-20 patients is a special case which does not disturb the mean rate.

Solution

This problem can be solved by assuming that the number of patient visits follows a Poisson distribution with mean rate 10 (i.e., $\lambda = 10$). The number of patients varies from 15 to 20 (i.e., $x = 15,16,17,18,19,20$). We need to calculate the revenue associated with each number of patients and multiply it by the associated Poisson probability to find the expected earning. It is important to note that, before evaluating the final expected value, the probabilities must be normalized so that they sum to exactly one over the range 15 to 20. The expected earning $E$ is therefore \[E=\sum_{x=15}^{20}\frac{P(x;\lambda)}{\sum_{k=15}^{20}P(k;\lambda)}(1000x)\]To solve this problem, it is better to use either R or Python code. I show here the Python code.

Python Code

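import numpy as np
from scipy.stats import poisson

x = np.arange(15, 21)      # possible numbers of patients: 15, 16, ..., 20
prob = poisson.pmf(x, 10)  # Poisson probabilities with mean rate 10

# Normalize so that the probabilities over 15..20 sum to exactly one
prob_final = prob / prob.sum()

# Expected earning at INR 1000 per patient
E = np.sum(1000 * x * prob_final)
print(E)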

The final result is approximately INR 16133. In a separate post, I shall elaborate on Poisson regression with mathematical details.

Friday, 22 July 2022

Important Probability Distributions
