Statistics is like a toolbox that helps us understand data. Imagine you have a big box of fruits, but you don’t want to check every fruit one by one. Instead, you pick a few, analyze them, and draw a conclusion about the whole box. That, in essence, is what statistics does with data.
Statistics is the branch of mathematics focused on collecting, analyzing, interpreting, presenting, and organizing data. It is divided into two main areas:
- Descriptive Statistics:
- It focuses on summarizing data you already have. For example, a teacher calculates the average marks of her class. That is descriptive statistics; it just describes what’s already there.
- It uses tools such as the mean, median, percentages, charts, and graphs.
- Inferential Statistics: Makes predictions or inferences about a population based on a sample of data.
- It goes beyond the data at hand: instead of just describing it, we make predictions and test ideas.
- For example, a company surveys 100 customers to predict how 10,000 customers will behave.
- Inferential statistics uses tools such as hypothesis testing, confidence intervals, and regression.
Essential Concepts in Descriptive Statistics
There are two important types of descriptive statistics:
- Measures of Central Tendency
- Measures of Dispersion
a) Measures of Central Tendency
These metrics help us to identify the centre or “average” value of a dataset:
- Mean: This is the average of all values. Calculated by adding all numbers and dividing by the total count.
- For example: For values [2,4,6] the mean is (2+4+6)/3=4.
- Median: This is the middle value in a sorted dataset. If the dataset has an even count, the median is the average of the two middle numbers.
- For example: For [3,5,7] the median is 5; for [3,5,7,9] it’s (5+7)/2=6.
- Mode: This is the most frequently occurring value in a dataset. Some datasets have more than one mode.
- For example: In [1,2,2,3] the mode is 2.
b) Measures of Dispersion
These metrics show how spread out the data points are around the average.
- Range: It refers to the difference between the maximum and minimum values.
- Example: For [10,15,20,25] the range is 25−10=15.
- Variance: Measures how far values spread from the mean, calculated as the average of the squared differences from the mean.
- Standard Deviation: This is the square root of variance, which shows how spread out the values are. A low standard deviation means data points are close to the mean; a high standard deviation indicates a wider spread.
Python Example
import numpy as np
data = [2, 4, 4, 4, 5, 5, 7, 9]
# Central Tendency
mean = np.mean(data)
median = np.median(data)
mode = max(set(data), key=data.count)  # most frequent value (returns one mode if tied)
# Dispersion
std_dev = np.std(data)    # population standard deviation (ddof=0 by default)
variance = np.var(data)   # population variance
data_range = max(data) - min(data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Range:", data_range)
print("Variance:", variance)
print("Standard Deviation:", std_dev)
Data Visualization in Descriptive Statistics
Visualizations help reveal patterns and anomalies in data. The following charts are commonly useful:
- Histograms: Show the distribution of a single variable.
- Box Plots: Show data spread and help detect outliers.
- Scatter Plots: Visualize relationships between two variables.
Example: Creating a Histogram
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.hist(data, bins=5, color="skyblue", edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Sample Data")
plt.show()
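The box plot mentioned above can be sketched in the same way. This is a minimal example using a small hypothetical dataset with one deliberate outlier (45); the quartiles that the box plot draws are also printed so you can see what the box represents:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset with one obvious outlier (45)
data = [10, 12, 12, 13, 14, 15, 16, 18, 45]

# Quartiles summarize the spread that the box plot draws
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("Q1:", q1, "Median:", median, "Q3:", q3)

plt.boxplot(data)
plt.ylabel("Value")
plt.title("Box Plot of Sample Data")
plt.show()
```

Any point beyond 1.5 × IQR from the box edges (here, the value 45) is drawn as a separate dot, which is how box plots flag outliers.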
Inferential Statistics
a) Sampling and Population
- Population: The complete set of items or events under study.
- Sample: A subset of the population. Analyzing the whole population is often impractical, so a sample is taken for analysis.
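The idea of sampling can be sketched with NumPy. The population here is hypothetical (ages of 10,000 customers); the point is that a random sample's mean lands close to the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: ages of 10,000 customers
population = rng.integers(18, 70, size=10_000)

# Draw a random sample of 100 without replacement
sample = rng.choice(population, size=100, replace=False)

print("Population mean:", round(population.mean(), 2))
print("Sample mean:", round(sample.mean(), 2))
```

The two printed means should be close, which is exactly why inferential statistics can get away with studying a sample instead of the whole population.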
b) Hypothesis Testing
Hypothesis testing is a method for testing claims or assumptions about a population using sample data.
Steps of Hypothesis testing:
- Null Hypothesis (H₀): Assume there is no difference or no effect.
- Alternative Hypothesis (H₁): Assume there is a difference or an effect.
- Collect Sample Data: Gather observations from the population.
- Run a Statistical Test: Apply the appropriate test (e.g., t-test, ANOVA).
- Check the p-value:
- If p < 0.05 → Reject the Null Hypothesis (the difference is statistically significant).
- If p ≥ 0.05 → Fail to reject the Null Hypothesis (the data does not show a significant difference).
Common Tests:
- t-test: Compares the means of two groups.
- ANOVA (Analysis of Variance): Tests for differences among three or more groups.
- Chi-square test: Examines the association between categorical variables.
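The chi-square test from the list above can be sketched with SciPy's `chi2_contingency`. The contingency table here is hypothetical (say, gender vs. product preference):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = product preference
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)
print("Chi-square:", round(chi2, 2))
print("P-value:", round(p, 4))
if p < 0.05:
    print("Result: The variables appear to be associated.")
else:
    print("Result: No significant association found.")
```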
Example: Hypothesis Testing with Python
import numpy as np
from scipy import stats
# Group A (Morning study students)
group_a = np.array([72, 75, 78, 70, 74, 73, 76, 77])
# Group B (Night study students)
group_b = np.array([68, 65, 70, 72, 69, 71, 67, 66])
# Perform Independent t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("T-Statistic:", round(t_stat, 2))
print("P-Value:", round(p_value, 4))
if p_value < 0.05:
    print("Result: Significant difference → Study time affects marks.")
else:
    print("Result: No significant difference → Study time doesn’t matter much.")
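Confidence intervals, mentioned earlier, can be sketched with SciPy as well. This reuses Group A's scores from the t-test example to estimate a range that likely contains the true mean:

```python
import numpy as np
from scipy import stats

# 95% confidence interval for the mean of Group A's scores
group_a = np.array([72, 75, 78, 70, 74, 73, 76, 77])
mean = group_a.mean()
sem = stats.sem(group_a)  # standard error of the mean
low, high = stats.t.interval(0.95, df=len(group_a) - 1, loc=mean, scale=sem)

print("Mean:", mean)
print("95% CI:", round(low, 2), "to", round(high, 2))
```

Read it as: if we repeated this sampling many times, about 95% of the intervals built this way would contain the true population mean.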
Probability in Data Science
Probability is the study of uncertainty, and it plays a significant role in inferential statistics and predictive modelling. In data science, we rarely deal with 100% certain answers; instead, we predict how likely something is to happen. For example:
- What’s the chance a customer will buy a product?
- What’s the chance that it will rain?
- What’s the chance a machine will fail in the next 6 months?
These questions are answered using probability.
Basic Probability Concepts
- Probability: A number between 0 and 1 that describes how likely an event is.
- 0 → Impossible (e.g., rolling a 7 on a 6-sided die).
- 1 → Certain (e.g., the sun rising tomorrow).
- For example: If you flip a fair coin, the probability of getting heads is 0.5.
- Independent Events: Events where the outcome of one does not affect the other.
- For example: Rolling a die and flipping a coin. Getting “Heads” doesn’t change whether you roll a “6.”
- Dependent Events: Events where the outcome of one event affects the other.
- Example: Drawing cards from a deck without replacement. If you draw one Ace, the chance of drawing another Ace decreases.
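The two kinds of events above can be checked with a few lines of arithmetic (a fair coin, a fair die, and a standard 52-card deck are assumed):

```python
# Independent events: a coin flip and a die roll don't affect each other,
# so we simply multiply their probabilities.
p_heads = 1 / 2
p_six = 1 / 6
print("P(heads and a 6):", round(p_heads * p_six, 4))

# Dependent events: drawing two Aces without replacement.
# The second probability changes because the deck has changed.
p_first_ace = 4 / 52    # 4 Aces in a full 52-card deck
p_second_ace = 3 / 51   # one Ace (and one card) already gone
print("P(two Aces in a row):", round(p_first_ace * p_second_ace, 4))
```

Notice that for the dependent case, multiplying 4/52 by 4/52 would be wrong: the second factor must reflect the deck after the first draw.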
Probability Distributions
In real-world data science, we don’t just talk about one event; we need to understand the distribution of many outcomes. That’s where probability distributions come in.
- Normal Distribution:
- Symmetrical shape, most values cluster around the mean (average).
- Common in real life: human height, exam scores, and daily temperatures.
- In ML, many algorithms assume data is normally distributed.
- For example, if the average height of men is 170 cm with small variations, most men will be around that height, while very tall or very short heights are rare.
- Binomial Distribution:
- Used when there are only two outcomes (success/failure, yes/no, win/lose).
- Describes the probability of getting a certain number of successes in a fixed number of independent trials.
- For example: Flipping a coin 10 times and asking, “What’s the chance of getting exactly 6 heads?”.
Example: Plotting the Distributions with Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm
# Normal Distribution Example (heights around mean = 170, std dev = 10)
x = np.linspace(140, 200, 200)
y = norm.pdf(x, 170, 10)
plt.plot(x, y, label="Normal Distribution", color="blue")
# Binomial Distribution Example (coin toss, n=10, p=0.5)
n, p = 10, 0.5
x_binom = np.arange(0, n+1)
y_binom = binom.pmf(x_binom, n, p)
plt.bar(x_binom, y_binom, alpha=0.6, color="orange", label="Binomial Distribution")
plt.title("Probability Distributions: Normal vs Binomial")
plt.xlabel("Values")
plt.ylabel("Probability")
plt.legend()
plt.show()