Statistics is a fundamental component of data science: it provides the tools for analyzing and interpreting data, allowing data scientists to derive insights, make predictions, and test hypotheses.
1. What is Statistics?
Statistics is the branch of mathematics focused on collecting, analyzing, interpreting, presenting, and organizing data. It is divided into two main areas:
- Descriptive Statistics: Summarizes and describes the features of a dataset.
- Inferential Statistics: Makes predictions or inferences about a population based on a sample of data.
2. Key Concepts in Descriptive Statistics
Descriptive statistics provide a way to understand and summarize data at a glance.
a) Measures of Central Tendency
These metrics help identify the center or “average” value of a dataset:
- Mean: The average of all values. Calculated by adding all numbers and dividing by the total count.
- Example: For the values [2, 4, 6], the mean is (2 + 4 + 6) / 3 = 4.
- Median: The middle value in a sorted dataset. If the dataset has an even count, the median is the average of the two middle numbers.
- Example: For [3, 5, 7] the median is 5; for [3, 5, 7, 9] it is (5 + 7) / 2 = 6.
- Mode: The most frequently occurring value in a dataset. Some datasets have more than one mode.
- Example: In [1, 2, 2, 3] the mode is 2.
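All three measures can be checked quickly with Python's built-in statistics module; the short sketch below reuses the mode example above.
import statistics

values = [1, 2, 2, 3]
print("Mean:", statistics.mean(values))      # (1 + 2 + 2 + 3) / 4 = 2
print("Median:", statistics.median(values))  # middle values 2 and 2 average to 2
print("Mode:", statistics.mode(values))      # 2 occurs most often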
b) Measures of Dispersion
These metrics show the spread of data values:
- Range: The difference between the maximum and minimum values.
- Example: For [10, 15, 20, 25] the range is 25 − 10 = 15.
- Variance: The average of the squared deviations from the mean; it measures how far values spread around the mean.
- Standard Deviation: The square root of variance, showing how spread out the values are. A low standard deviation means data points are close to the mean; high standard deviation indicates a wider spread.
Python Example
import numpy as np
data = [2, 4, 4, 4, 5, 5, 7, 9]
# Calculate descriptive statistics with NumPy
mean = np.mean(data)
median = np.median(data)
std_dev = np.std(data)  # population standard deviation (ddof=0); pass ddof=1 for the sample estimate
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
3. Data Visualization in Descriptive Statistics
Visualizations help reveal patterns and anomalies in data. Commonly used charts include:
- Histograms: Display data distribution.
- Box Plots: Show data spread and detect outliers.
- Scatter Plots: Visualize relationships between two variables.
Example: Creating a Histogram
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
# Plot histogram
plt.hist(data, bins=5, color="skyblue", edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of Sample Data")
plt.show()
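Box plots follow the same pattern; a minimal sketch using the same sample data:
import matplotlib.pyplot as plt

data = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Box plot: the box spans the interquartile range, the line marks the median,
# and points beyond the whiskers are flagged as potential outliers
plt.boxplot(data)
plt.ylabel("Value")
plt.title("Box Plot of Sample Data")
plt.show()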
4. Inferential Statistics
Inferential statistics help draw conclusions about a larger population based on a sample.
a) Sampling and Population
- Population: The complete set of items or events under study.
- Sample: A subset of the population. Since analyzing the whole population may be impossible, a sample is taken for analysis.
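A minimal sketch of the idea, assuming a simulated population, draws a random sample and compares its mean with the population mean:
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated population of 100,000 values (hypothetical data for illustration)
population = rng.normal(loc=50, scale=10, size=100_000)

# Draw a random sample of 200 observations without replacement
sample = rng.choice(population, size=200, replace=False)

print("Population mean:", population.mean())
print("Sample mean:", sample.mean())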
b) Hypothesis Testing
Hypothesis testing evaluates an assumption or claim about a population parameter using sample data. Common tests include:
- t-test: Compares the means of two groups.
- ANOVA (Analysis of Variance): Tests for differences among the means of three or more groups.
- Chi-square test: Examines the association between categorical variables.
Example: Hypothesis Testing with Python
from scipy import stats
# Two sample groups
group1 = [5, 7, 8, 7, 6, 5, 6, 7]
group2 = [6, 9, 6, 8, 7, 8, 6, 7]
# Perform t-test
t_stat, p_value = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat)
print("p-value:", p_value)
5. Probability in Data Science
Probability is the study of uncertainty and plays a significant role in inferential statistics and predictive modeling.
Basic Probability Concepts
- Probability: The likelihood of an event occurring, ranging from 0 to 1.
- Independent Events: Events where the outcome of one does not affect the outcome of another.
- Dependent Events: Events where the outcome of one influences the outcome of another.
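For example, two fair coin flips are independent, so the probability of two heads is 0.5 × 0.5 = 0.25; the short simulation below (a sketch added for illustration) confirms this empirically:
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate 100,000 pairs of fair coin flips (1 = heads, 0 = tails)
flips = rng.integers(0, 2, size=(100_000, 2))

# Fraction of trials where both flips land heads; should be close to 0.5 * 0.5 = 0.25
both_heads = np.mean((flips[:, 0] == 1) & (flips[:, 1] == 1))
print("Estimated P(two heads):", both_heads)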
Probability Distributions
Probability distributions describe how probabilities are distributed over values of a random variable:
- Normal Distribution: A symmetric, bell-shaped distribution defined by its mean and standard deviation; many real-world measurements approximately follow it.
- Binomial Distribution: Describes the number of successes in a fixed number of independent trials.
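Both distributions are available in scipy.stats; the sketch below samples from a normal distribution and evaluates a binomial probability (the parameters are chosen only for illustration):
import numpy as np
from scipy import stats

# Normal distribution: 1,000 samples with mean 0 and standard deviation 1
normal_samples = stats.norm.rvs(loc=0, scale=1, size=1000, random_state=42)
print("Sample mean:", np.mean(normal_samples))
print("Sample std:", np.std(normal_samples))

# Binomial distribution: probability of exactly 3 successes in 10 trials,
# each trial succeeding with probability 0.5
print("P(3 successes in 10 trials):", stats.binom.pmf(k=3, n=10, p=0.5))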
6. Applying Statistics in Data Science Projects
Statistics is integral to the data science workflow, assisting in:
- Data Cleaning: Identifying outliers and missing values (see the z-score sketch after this list).
- Data Analysis: Uncovering trends, patterns, and insights.
- Predictive Modeling: Using statistical methods to make predictions.
- Validation: Ensuring models are accurate and generalizable.
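For the data-cleaning step, a common rule of thumb flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold; a minimal sketch with hypothetical spending values:
import numpy as np

# Hypothetical spending values with one obvious outlier
spending = np.array([20, 22, 25, 21, 23, 24, 19, 200])

# z-score: how many standard deviations each value lies from the mean
z_scores = (spending - spending.mean()) / spending.std()

# |z| > 3 is a common cutoff; a looser threshold of 2 is used here because the sample is tiny
outliers = spending[np.abs(z_scores) > 2]
print("Flagged outliers:", outliers)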
7. Example of Using Statistics in Real-World Data Science
Suppose we have a dataset of customer purchases. To understand purchasing behavior, we might:
- Calculate the mean and median to find average spending.
- Use standard deviation to see variability in spending.
- Perform hypothesis testing to determine whether different customer groups (e.g., grouped by age or region) have different spending patterns, as sketched below.
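A minimal sketch of this workflow with pandas, assuming a hypothetical purchases table with region and spend columns (the data and column names are illustrative assumptions):
import pandas as pd
from scipy import stats

# Hypothetical purchase data (column names are assumptions for illustration)
purchases = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "spend": [120, 135, 150, 95, 110, 100],
})

# Descriptive statistics per region
print(purchases.groupby("region")["spend"].agg(["mean", "median", "std"]))

# t-test: do the two regions differ in average spending?
north = purchases.loc[purchases["region"] == "north", "spend"]
south = purchases.loc[purchases["region"] == "south", "spend"]
t_stat, p_value = stats.ttest_ind(north, south)
print("t-statistic:", t_stat, "p-value:", p_value)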