Important concepts in Descriptive Statistics

Hey everyone, hope you are all doing great. In this article, we will cover some important concepts in descriptive statistics. Let's start.

Percentiles and quartiles

A percentile tells you how a particular score compares with all the scores in the group: it is the percentage of scores that fall at or below that score.

For example, let the grades of students out of 100 be:

85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65

Now the first step is to sort values in ascending order. After sorting them:

34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87.

To find the percentile of 86 marks, note that its position in the sorted list is 11th and the total number of grades is 12. So, 11/12*100 ≈ 91.67. Hence, 86 is at roughly the 92nd percentile.

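To find the value at the pth percentile, first compute its position in the sorted list: position = p/100*(n+1), where n is the total number of values. If the position is not a whole number, interpolate between the two neighbouring values. For example, the 50th percentile here sits at position 0.5*13 = 6.5, halfway between 74 and 78, which gives 76.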
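Here is a rough sketch of that position formula in Python. The function name value_at_percentile is hypothetical, and the linear interpolation between neighbours and the clamping at the ends are assumptions made for this example, not the only possible convention.

```python
def value_at_percentile(values, p):
    """Value at the p-th percentile using position = p/100 * (n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    position = p / 100 * (n + 1)         # 1-based, may be fractional
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)                # whole part of the position
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    # Linear interpolation between the two neighbouring values
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]
print(value_at_percentile(grades, 25))  # 54.5
print(value_at_percentile(grades, 50))  # 76.0
print(value_at_percentile(grades, 75))  # 85.0
```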

Quartile

Quartiles are the three percentiles that split the sorted data into four equal parts:

Q1: 25th percentile

Q2: 50th percentile (Median)

Q3: 75th percentile

5 number summary

  1. Minimum: It is the minimum value present in the group

  2. First quartile: Q1 (25th percentile)

  3. Median: Q2 (50th percentile)

  4. Third quartile: Q3 (75th percentile)

  5. Maximum: It is the maximum value present in the group
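The five values can be pulled out with Python's built-in statistics module, as in the small sketch below. The helper name five_number_summary is made up for this example; statistics.quantiles with its default "exclusive" method uses the same p/100*(n+1) position rule, though other tools may use slightly different quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]
print(five_number_summary(grades))  # (34, 54.5, 76.0, 85.0, 87)
```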

Outliers

Outliers are data points that are significantly different from other observations in a dataset. In other words, outliers are data points that are either much larger or much smaller than the majority of the other data points in the dataset.

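A common way to flag outliers is to compute two fences from the interquartile range (IQR) and treat any value outside the range [lower fence, upper fence] as an outlier:

  1. Calculate your upper fence = Q3 + (1.5 * IQR)

  2. Calculate your lower fence = Q1 - (1.5 * IQR)

where IQR = Q3 - Q1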
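As a quick illustration, the fence rule might look like this in Python. The function names iqr_fences and find_outliers and the second dataset are hypothetical, and the quartiles come from statistics.quantiles, which is one of several quartile conventions.

```python
import statistics


def iqr_fences(values):
    """Return (lower fence, upper fence) based on 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values that fall outside the [lower fence, upper fence] range."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]
print(iqr_fences(grades))     # (8.75, 130.75)
print(find_outliers(grades))  # []

# Hypothetical data with one obviously extreme value
readings = [10, 12, 11, 13, 12, 11, 100]
print(find_outliers(readings))  # [100]
```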

Variance and standard deviation

Variance and standard deviation are measures of the spread or dispersion of a dataset.

Variance is a measure of how far each value in the dataset is from the mean of the dataset. It is calculated by taking the sum of the squared differences between each value and the mean, divided by the total number of values in the dataset. A higher variance indicates that the values in the dataset are more spread out, while a lower variance indicates that the values are closer together.

The formula for variance is:

Var(X) = (1/n) * Σ[(Xi - μ)^2]

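where Var(X) is the variance of the variable X, n is the total number of observations, Σ denotes summation over all observations, Xi is an individual observation of X, and μ is the mean of X. This is the population variance; for a sample, the sum is usually divided by n - 1 instead of n.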

In words, the formula calculates the average of the squared differences between each value in the dataset and the mean of the dataset. The resulting value represents the variability or spread of the data around the mean.
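To make the formula concrete, here is a small Python sketch that applies it to the grades used earlier. The function name population_variance is just for illustration; it divides by n, exactly as in the formula above.

```python
def population_variance(values):
    """Average of the squared differences from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]
print(sum(grades) / len(grades))    # 70.0 (the mean)
print(population_variance(grades))  # 311.5
```

Python's statistics.pvariance should give the same result, while statistics.variance divides by n - 1.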

Standard deviation is the square root of the variance. It represents the typical amount by which individual values deviate from the mean. Standard deviation is a commonly used measure of dispersion because it is expressed in the same units as the original data, which makes the spread easy to interpret. A higher standard deviation indicates that the values in the dataset are more spread out, while a lower standard deviation indicates that the values are closer together.

The formula for standard deviation (SD) is the square root of the variance, so:

SD(X) = sqrt(Var(X))
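Continuing with the same grades, a minimal Python sketch of the standard deviation could look like this. The name population_sd is hypothetical, and it uses the population form (dividing by n).

```python
import math


def population_sd(values):
    """Square root of the population variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]
print(round(population_sd(grades), 2))  # 17.65
```

So a typical grade sits roughly 17 to 18 marks away from the mean of 70.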

All about Correlation

Correlation is a statistical measure that describes the relationship between two variables. It indicates whether there is a linear association between the two variables, and if so, how strong the association is.

Pearson correlation is a type of correlation coefficient that measures the linear relationship between two continuous variables. It ranges from -1 to +1, where a value of -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and +1 indicates a perfect positive linear correlation. The formula for Pearson correlation is:

r = (nΣXY - ΣXΣY) / sqrt[(nΣX^2 - (ΣX)^2)(nΣY^2 - (ΣY)^2)]

where r is the correlation coefficient, X and Y are the two variables of interest, n is the number of paired observations, Σ denotes summation over all observations, and sqrt is the square root function.
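The formula translates almost term by term into Python. The sketch below is only illustrative: the function name pearson_r and the small x and y lists are made up for this example, and it assumes neither variable is constant (otherwise the denominator would be zero).

```python
import math


def pearson_r(x, y):
    """Pearson correlation using the sums in the formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson_r(x, y))                  # 0.8
print(pearson_r(x, [2, 4, 6, 8, 10]))   # 1.0 (perfect positive)
print(pearson_r(x, [10, 8, 6, 4, 2]))   # -1.0 (perfect negative)
```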

Spearman correlation is another type of correlation coefficient that measures the relationship between two variables. It is used when the variables are not normally distributed or the relationship between them is not linear. Instead of using the raw data, Spearman correlation ranks the data and then calculates the correlation between the ranked values. The formula for Spearman correlation is:

ρ = 1 - (6Σd^2) / (n(n^2 - 1))

where ρ is the correlation coefficient, d is the difference between the two ranks of each observation, and n is the number of observations. This shortcut formula assumes there are no tied ranks.
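A small Python sketch of the rank-based formula is shown below. The names spearman_rho and ranks and the sample data are hypothetical, and the code assumes there are no repeated values within x or within y, since the shortcut formula does not handle ties.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired observations
x = [1, 2, 3, 4, 5]
print(spearman_rho(x, [2, 1, 4, 3, 5]))
# A curved but always increasing relationship still gets a perfect score
print(spearman_rho(x, [1, 8, 27, 64, 125]))  # 1.0
```

The second call shows why Spearman is handy for non-linear data: the cubes are not a straight-line function of x, yet the ranks agree perfectly.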

Both Pearson and Spearman correlation coefficients range from -1 to +1, where higher absolute values indicate stronger correlations. Pearson correlation measures the linear relationship between two continuous variables, while Spearman correlation measures the monotonic relationship (i.e., whether the two variables tend to increase or decrease together).

These are the key concepts you need to know about descriptive stats. Next, we will start with distributions. Hope you got value out of this article. Subscribe to my newsletter to get informative articles daily.

Thanks :)