Distribution in Descriptive Statistics
Hey everyone, hope you are all doing great. In this article, we will cover the distributions you need to know for data analysis and machine learning. Here is the list:
Normal or Gaussian Distribution
Binomial distribution
Log-Normal Distribution
Poisson distribution
Exponential distribution
Gamma distribution
Chi-squared distribution
F-distribution
Uniform distribution
Normal or Gaussian Distribution
Normal distribution, also known as Gaussian distribution, is a continuous probability distribution that is widely used in statistics, data analysis, and machine learning. It is a bell-shaped distribution that is symmetrical around the mean, with most of the data points clustering around the mean and fewer data points further away from the mean.
The normal distribution is characterized by two parameters: the mean (μ) and the standard deviation (σ). The mean represents the centre of the distribution, while the standard deviation represents the spread of the distribution.
The probability density function (PDF) of the normal distribution is given by:
f(x) = (1/√(2πσ^2)) * exp(-(x-μ)^2 / (2σ^2))
where x is the random variable, μ is the mean, σ is the standard deviation, and exp is the exponential function.
The normal distribution is widely used in data analysis because many natural phenomena, such as height, weight, and IQ scores, approximately follow a normal distribution. Additionally, the central limit theorem states that the mean of a sufficiently large sample of independent and identically distributed random variables with finite variance is approximately normally distributed, regardless of the underlying distribution of the original random variables. This makes the normal distribution a useful tool for statistical inference and hypothesis testing.
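To make the formula concrete, here is a small sketch that evaluates the PDF above directly and compares it with Python's built-in `statistics.NormalDist`. The function name `normal_pdf` and the values μ = 100, σ = 15 are just illustrative choices, not anything fixed by the theory.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Illustrative parameters (assumed for this example only)
mu, sigma = 100.0, 15.0
dist = NormalDist(mu, sigma)

# The hand-written formula and the standard library should agree
for x in (70.0, 100.0, 130.0):
    print(x, normal_pdf(x, mu, sigma), dist.pdf(x))

# Probability of landing within one standard deviation of the mean
within_one_sigma = dist.cdf(mu + sigma) - dist.cdf(mu - sigma)
print(within_one_sigma)
```

The last value shows how much of the probability mass sits close to the mean, which is the "clustering around the mean" described above.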
Binomial distribution
The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes: success or failure.
The binomial distribution is characterized by two parameters: the number of trials (n) and the probability of success in each trial (p). The probability mass function (PMF) of the binomial distribution is given by:
P(X=k) = (n choose k) p^k (1-p)^(n-k)
where X is the random variable, k is the number of successes, (n choose k) is the binomial coefficient, p is the probability of success, and (1-p) is the probability of failure.
The binomial distribution is widely used in statistics, data analysis, and machine learning, especially for problems with binary outcomes, such as whether a customer will make a purchase or whether a patient will respond to a certain treatment. It is also used in quality control to model the number of defective items in a batch, and in population genetics to model the frequency of a certain allele in a population.
The mean of the binomial distribution is μ = np, and the variance is σ^2 = np(1-p). When the number of trials n is large and the probability of success p is small, the binomial distribution can be approximated by the Poisson distribution with parameter λ = np.
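Here is a rough sketch of the PMF, the mean and variance formulas, and the Poisson approximation, all in plain Python. The helper names `binomial_pmf` and `poisson_pmf` and the values n = 100, p = 0.03 are assumptions made for the example.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for n independent trials with success probability p."""
    if not 0 <= k <= n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with rate lam."""
    return lam ** k / math.factorial(k) * math.exp(-lam)


# Illustrative values: many trials, small success probability
n, p = 100, 0.03
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the PMF itself
mean = sum(k * q for k, q in enumerate(pmf))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
print(mean, n * p)
print(variance, n * p * (1 - p))

# Largest pointwise gap between Binomial(n, p) and Poisson(n * p)
max_gap = max(abs(pmf[k] - poisson_pmf(k, n * p)) for k in range(n + 1))
print(max_gap)
```

The gap shrinks further as n grows and p shrinks with np held fixed, which is why the approximation is reserved for that regime.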
Log-Normal Distribution
The Log-Normal distribution is a continuous probability distribution that is widely used in finance, economics, and other fields to model variables such as stock prices, asset prices, and income. It models the distribution of a random variable whose logarithm is normally distributed. The probability density function (PDF) of the Log-Normal distribution is given by:
f(x) = (1 / (x σ sqrt(2π))) * exp(-(ln(x) - μ)^2 / (2σ^2))
where x is the random variable, μ is the mean of the logarithm of the variable, and σ is the standard deviation of the logarithm of the variable. The Log-Normal distribution is often used when the data is positively skewed and the values are greater than zero. The mean and variance of the Log-Normal distribution are:
μ' = exp(μ + σ^2/2)
σ'^2 = (exp(σ^2) - 1) * exp(2μ + σ^2)
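As a quick sanity check on these formulas, the sketch below draws samples with the standard library's `random.lognormvariate` and compares the sample mean and variance with the closed-form values. The function names, the seed, and the parameters μ = 0, σ = 0.5 are assumptions for illustration.

```python
import math
import random
import statistics


def lognormal_pdf(x: float, mu: float, sigma: float) -> float:
    """Log-normal density; mu and sigma describe ln(x), not x itself."""
    if x <= 0:
        return 0.0
    exponent = -((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (x * sigma * math.sqrt(2.0 * math.pi))


def lognormal_mean(mu: float, sigma: float) -> float:
    return math.exp(mu + sigma ** 2 / 2.0)


def lognormal_variance(mu: float, sigma: float) -> float:
    return (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)


# Illustrative parameters and a fixed seed so the run is repeatable
mu, sigma = 0.0, 0.5
rng = random.Random(0)
samples = [rng.lognormvariate(mu, sigma) for _ in range(200_000)]

sample_mean = statistics.fmean(samples)
sample_variance = statistics.pvariance(samples)
print(sample_mean, lognormal_mean(mu, sigma))
print(sample_variance, lognormal_variance(mu, sigma))

# The logarithm of the samples should look normal with mean mu
log_mean = statistics.fmean(math.log(s) for s in samples)
print(log_mean)
```

Note that the mean of the samples is larger than exp(μ), because the long right tail pulls the average up.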
Poisson distribution
The Poisson distribution is a discrete probability distribution that is used to model the number of occurrences of a rare event in a fixed interval of time or space. It is widely used in fields such as biology, physics, and engineering. The probability mass function (PMF) of the Poisson distribution is given by:
P(X=k) = (λ^k / k!) * exp(-λ)
where X is the random variable counting the occurrences, k is a specific number of occurrences (0, 1, 2, ...), λ is the average rate at which the events occur in the interval, and exp(-λ) is the probability of no occurrences.
The Poisson distribution has only one parameter, λ, and both the mean and the variance of the distribution are equal to λ. The Poisson distribution is often used in situations where the events occur independently of each other and at a constant rate, such as the number of phone calls received by a call centre in an hour or the number of defects in a production process. It is also used as an approximation for the Binomial distribution when the number of trials is large and the probability of success is small.
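The short sketch below evaluates the PMF and confirms numerically that the mean and variance both come out as λ. The name `poisson_pmf`, the rate λ = 4 (think of 4 calls per hour), and the cut-off at 100 are assumptions for the example; the log-space form is just a numerically safer way of writing the same formula.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with rate lam > 0."""
    if k < 0:
        return 0.0
    # Same as lam**k / k! * exp(-lam), evaluated in log space
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


# Illustrative rate: an average of 4 events per interval
lam = 4.0
# The probability beyond k = 100 is negligible for this rate
probs = [poisson_pmf(k, lam) for k in range(100)]

total = sum(probs)
mean = sum(k * q for k, q in enumerate(probs))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
print(total, mean, variance)

# Probability of seeing at most two events in the interval
p_at_most_two = sum(probs[:3])
print(p_at_most_two)
```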
Exponential distribution
The Exponential distribution is a continuous probability distribution that is used to model the time between two successive events that occur independently of each other and at a constant rate. It is widely used in fields such as reliability engineering, queuing theory, and finance. The probability density function (PDF) of the Exponential distribution is given by:
f(x) = λ * exp(-λx)
where x is the time between two events, and λ is the rate at which the events occur.
The Exponential distribution has only one parameter, λ, which is the rate of the distribution. The mean of the Exponential distribution is 1/λ, and the variance is 1/λ^2.
The Exponential distribution is often used in situations where the time between events follows an exponential decay pattern, such as the time between radioactive decay events or the time between breakdowns of a machine. It is also used in queuing theory to model the inter-arrival times of customers in a queue, and in finance to model the time between changes in stock prices.
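Here is a small simulation sketch for the Exponential distribution using `random.expovariate` from the standard library. The function names, the seed, and the rate λ = 2 are assumptions for illustration; the CDF used here, 1 - exp(-λx), follows from integrating the PDF above.

```python
import math
import random
import statistics


def exponential_pdf(x: float, lam: float) -> float:
    """Exponential density with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


def exponential_cdf(x: float, lam: float) -> float:
    """P(X <= x), obtained by integrating the density from 0 to x."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


# Illustrative rate: on average 2 events per unit of time
lam = 2.0
rng = random.Random(1)
gaps = [rng.expovariate(lam) for _ in range(200_000)]

sample_mean = statistics.fmean(gaps)
sample_variance = statistics.pvariance(gaps)
print(sample_mean, 1.0 / lam)
print(sample_variance, 1.0 / lam ** 2)

# Share of simulated gaps shorter than 0.5, versus the CDF
share_short = sum(g <= 0.5 for g in gaps) / len(gaps)
print(share_short, exponential_cdf(0.5, lam))
```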
Gamma distribution
The gamma distribution is a probability distribution that is commonly used in statistical modelling to represent continuous positive variables. It has two parameters, alpha and beta, where alpha is the shape parameter and beta is the scale parameter. The probability density function (PDF) of the gamma distribution is given by:
f(x) = (x^(alpha-1) exp(-x/beta)) / (beta^alpha Gamma(alpha))
where x > 0, and Gamma(alpha) is the gamma function, which is defined as:
Gamma(alpha) = integral from 0 to infinity of t^(alpha-1) * exp(-t) dt
The gamma distribution is often used to model the waiting time until a certain number of events occur, or the lifetime of a product or system. It is also used in finance to model stock prices and interest rates, and in physics to model the energy levels of atoms and molecules. The distribution can take on a variety of shapes, depending on the values of alpha and beta. When alpha is a positive integer, the gamma distribution is known as the Erlang distribution.
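The sketch below writes out the PDF with `math.gamma` and checks two things: with alpha = 1 it reduces to an exponential density with rate 1/beta, and the waiting time for alpha events (the Erlang case) has an average close to alpha * beta, which is the known mean of the gamma distribution. The names, the seed, and the values alpha = 3, beta = 2 are assumptions for the example.

```python
import math
import random
import statistics


def gamma_pdf(x: float, alpha: float, beta: float) -> float:
    """Gamma density with shape alpha and scale beta."""
    if x <= 0:
        return 0.0
    numerator = x ** (alpha - 1.0) * math.exp(-x / beta)
    return numerator / (beta ** alpha * math.gamma(alpha))


# With alpha = 1 the density is exponential with rate 1 / beta
beta = 2.0
x = 1.5
print(gamma_pdf(x, 1.0, beta), (1.0 / beta) * math.exp(-x / beta))

# Erlang case: waiting time until the alpha-th event, where the gaps
# between events are exponential with mean beta
alpha = 3
rng = random.Random(2)
waits = [
    sum(rng.expovariate(1.0 / beta) for _ in range(alpha))
    for _ in range(100_000)
]
sample_mean = statistics.fmean(waits)
print(sample_mean, alpha * beta)
```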
Chi-squared distribution
The chi-squared distribution is a continuous probability distribution that arises as the sum of the squares of independent standard normal random variables. It is commonly used in statistical inference, hypothesis testing, and goodness-of-fit tests. The distribution has one parameter, the degrees of freedom (df), which is the number of squared normal variables being summed.
The probability density function (PDF) of the chi-squared distribution with df degrees of freedom is given by:
f(x) = (1/(2^(df/2) Gamma(df/2))) x^((df/2)-1) * exp(-x/2)
where x ≥ 0, and Gamma(df/2) is the gamma function evaluated at df/2. The mean of the chi-squared distribution is df, and the variance is 2df.
The chi-squared distribution is widely used in statistical inference, such as in the analysis of variance (ANOVA), and in hypothesis testing, such as the chi-squared test for independence and the chi-squared goodness-of-fit test. It is also used in the construction of confidence intervals for the variance of a normal distribution.
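To see where the distribution comes from, this sketch builds chi-squared samples by summing squared standard normal draws and compares the sample mean and variance with df and 2df. The function name `chi2_pdf`, the seed, and df = 4 are assumptions; for simplicity the density is only evaluated for x > 0.

```python
import math
import random
import statistics


def chi2_pdf(x: float, df: int) -> float:
    """Chi-squared density with df degrees of freedom, for x > 0."""
    if x <= 0:
        return 0.0  # boundary point ignored for simplicity
    half = df / 2.0
    return x ** (half - 1.0) * math.exp(-x / 2.0) / (2.0 ** half * math.gamma(half))


# Illustrative degrees of freedom
df = 4
rng = random.Random(3)

# Each sample is the sum of df squared standard normal draws
samples = [
    sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
    for _ in range(100_000)
]

sample_mean = statistics.fmean(samples)
sample_variance = statistics.pvariance(samples)
print(sample_mean, df)
print(sample_variance, 2 * df)

# With df = 2 the density simplifies to 0.5 * exp(-x / 2)
print(chi2_pdf(2.0, 2), 0.5 * math.exp(-1.0))
```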
F-distribution
The F-distribution is a continuous probability distribution that arises as the ratio of two independent chi-squared random variables, each divided by its own degrees of freedom. It is commonly used in statistical inference, particularly in the analysis of variance (ANOVA), regression analysis, and in hypothesis testing, such as testing for the equality of variances of two populations. The distribution has two parameters, the degrees of freedom for the numerator (df1) and the denominator (df2).
The probability density function (PDF) of the F-distribution with df1 degrees of freedom in the numerator and df2 degrees of freedom in the denominator is given by:
f(x) = (df1/df2)^(df1/2) x^((df1/2)-1) / [Beta(df1/2, df2/2) (1 + (df1 * x / df2))^((df1+df2)/2)]
where x ≥ 0, and Beta(df1/2, df2/2) is the beta function evaluated at df1/2 and df2/2. The mean of the F-distribution is df2/(df2-2) when df2 > 2, and the variance is (2 * df2^2 * (df1+df2-2)) / (df1 * (df2-2)^2 * (df2-4)) when df2 > 4.
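A numerical check helps with a formula this dense. The sketch below uses the standard form of the F density, in which Beta(df1/2, df2/2) is the only normalising constant in the denominator, and integrates it with a simple midpoint rule to confirm that it sums to 1 and reproduces the mean and variance formulas. The helper names, the values df1 = 5 and df2 = 10, and the integration range are assumptions for the example.

```python
import math


def beta_function(a: float, b: float) -> float:
    """Beta(a, b) computed from log-gamma values."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def f_pdf(x: float, df1: float, df2: float) -> float:
    """F density with df1 and df2 degrees of freedom, for x > 0."""
    if x <= 0:
        return 0.0
    numerator = (df1 / df2) ** (df1 / 2.0) * x ** (df1 / 2.0 - 1.0)
    denominator = beta_function(df1 / 2.0, df2 / 2.0) * (
        1.0 + df1 * x / df2
    ) ** ((df1 + df2) / 2.0)
    return numerator / denominator


def midpoint_integral(func, lower: float, upper: float, steps: int) -> float:
    """Plain midpoint-rule approximation of an integral."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))


# Illustrative degrees of freedom; the tail beyond 200 is negligible here
df1, df2 = 5, 10
total = midpoint_integral(lambda x: f_pdf(x, df1, df2), 0.0, 200.0, 200_000)
mean = midpoint_integral(lambda x: x * f_pdf(x, df1, df2), 0.0, 200.0, 200_000)
second = midpoint_integral(lambda x: x * x * f_pdf(x, df1, df2), 0.0, 200.0, 200_000)
variance = second - mean ** 2

expected_mean = df2 / (df2 - 2)
expected_variance = (2 * df2 ** 2 * (df1 + df2 - 2)) / (
    df1 * (df2 - 2) ** 2 * (df2 - 4)
)
print(total)
print(mean, expected_mean)
print(variance, expected_variance)
```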
Uniform distribution
The uniform distribution is a probability distribution that models the scenario where all values between a minimum and maximum value are equally likely. It is a continuous distribution with a rectangular-shaped probability density function (PDF). The distribution has two parameters, a and b, which are the minimum and maximum values of the distribution.
The probability density function (PDF) of the uniform distribution between a and b is given by:
f(x) = 1/(b-a) for a ≤ x ≤ b
The cumulative distribution function (CDF) of the uniform distribution is given by:
F(x) = 0 for x < a
F(x) = (x-a)/(b-a) for a ≤ x ≤ b
F(x) = 1 for x > b
The mean of the uniform distribution is (a+b)/2, and the variance is (b-a)^2/12.
The uniform distribution is often used in simulations, in statistical sampling and in random number generation. It is also used in some applications where the outcome is equally likely to be any value between a and b, such as in some games of chance or in modelling the waiting time until an event occurs uniformly over a specific interval of time.
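Finally, a small sketch of the uniform PDF and CDF, with a simulation that compares the sample mean and variance against (a+b)/2 and (b-a)^2/12. The function names, the seed, and the interval [2, 10] are assumptions for illustration.

```python
import random
import statistics


def uniform_pdf(x: float, a: float, b: float) -> float:
    """Constant density 1 / (b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x: float, a: float, b: float) -> float:
    """P(X <= x) for a uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


# Illustrative interval
a, b = 2.0, 10.0
rng = random.Random(4)
draws = [rng.uniform(a, b) for _ in range(200_000)]

sample_mean = statistics.fmean(draws)
sample_variance = statistics.pvariance(draws)
print(sample_mean, (a + b) / 2)
print(sample_variance, (b - a) ** 2 / 12)

# Share of draws at or below 4, versus the CDF
share_below = sum(d <= 4.0 for d in draws) / len(draws)
print(share_below, uniform_cdf(4.0, a, b))
```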
This was all about the distributions in descriptive statistics. Hope you got some value out of this. Subscribe to my newsletter for more such articles.
Thank you :)