Data Science – Statistics Standard Deviation

Standard deviation is a key concept in statistics and data science, used to measure the amount of variation or dispersion within a dataset. It provides insight into how spread out the data points are around the mean (average) and is a fundamental measure for understanding data distribution, detecting outliers and assessing risk in fields like finance, research and quality control.

1. What is Standard Deviation?

The standard deviation (SD) quantifies the variation in a dataset. It is the square root of the average squared difference between each data point and the mean, so it is expressed in the same units as the data.

A lower standard deviation indicates that data points are close to the mean, suggesting consistency. A higher standard deviation means data points are more spread out, indicating variability.

  • Low Standard Deviation: Data points are tightly clustered around the mean.
  • High Standard Deviation: Data points are spread out over a larger range.

2. Formula for Standard Deviation

For a dataset of n values:

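$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

where x_i is the i-th value, μ is the mean of the values and n is the number of values.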

This formula gives us the standard deviation for a population. For a sample, we divide by n−1 instead of n.
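The difference between the two divisors can be sketched in Python. This is a minimal illustration, assuming the five test scores used in the Python section of this article; `ddof` is NumPy's "delta degrees of freedom" argument, and `statistics.pstdev` / `statistics.stdev` are the standard-library equivalents.

```python
import statistics

import numpy as np

# Assumed dataset: the five scores used in the Python section of this article
data = [50, 60, 80, 90, 100]

# Population standard deviation: divide by n (NumPy's default, ddof=0)
population_sd = np.std(data)

# Sample standard deviation: divide by n - 1 (ddof=1)
sample_sd = np.std(data, ddof=1)

# The standard library provides the same pair of calculations
stdlib_population_sd = statistics.pstdev(data)
stdlib_sample_sd = statistics.stdev(data)

print(f"Population SD: {population_sd:.2f}")  # 18.55
print(f"Sample SD:     {sample_sd:.2f}")      # 20.74
```

The sample value is always the larger of the two, and the gap shrinks as n grows.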

3. Calculating Standard Deviation: Example

Consider the following dataset of test scores:

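Scores: 50, 60, 80, 90, 100

Step 1: Find the mean. μ = (50 + 60 + 80 + 90 + 100) / 5 = 380 / 5 = 76
Step 2: Square each deviation from the mean. (50 − 76)² = 676, (60 − 76)² = 256, (80 − 76)² = 16, (90 − 76)² = 196, (100 − 76)² = 576
Step 3: Average the squared deviations (the variance). (676 + 256 + 16 + 196 + 576) / 5 = 1720 / 5 = 344
Step 4: Take the square root. σ = √344 ≈ 18.55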

So, the standard deviation is approximately 18.55, meaning a typical score lies roughly 18.55 points from the mean.
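The same calculation can be traced step by step in plain Python. This sketch assumes the dataset is the five scores used in the Python section of this article (50, 60, 80, 90, 100), which reproduces the 18.55 figure; the variable names are illustrative.

```python
import math

# Assumed dataset: the scores used in the Python section of this article
scores = [50, 60, 80, 90, 100]
n = len(scores)

# Step 1: mean of the scores
mean = sum(scores) / n

# Step 2: squared deviation of each score from the mean
squared_deviations = [(x - mean) ** 2 for x in scores]

# Step 3: population variance (average of the squared deviations)
variance = sum(squared_deviations) / n

# Step 4: standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print("Mean:", mean)                              # 76.0
print("Squared deviations:", squared_deviations)  # [676.0, 256.0, 16.0, 196.0, 576.0]
print("Variance:", variance)                      # 344.0
print(f"Standard deviation: {std_dev:.2f}")       # 18.55
```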

4. Standard Deviation in Python

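Python libraries such as NumPy calculate standard deviation directly. By default, np.std returns the population standard deviation (dividing by n); pass ddof=1 to get the sample standard deviation instead.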

import numpy as np

# Data
data = [50, 60, 80, 90, 100]

# Standard deviation calculation
std_dev = np.std(data)
print("Standard Deviation:", std_dev)  # approximately 18.55

5. Interpreting Standard Deviation in Data Science

Standard deviation helps in assessing the reliability and spread of data, crucial for making data-driven decisions. Here are some key applications:

  • Financial Risk: A high standard deviation in asset returns indicates higher risk.
  • Quality Control: A low standard deviation in product measurements shows consistency.
  • Research Studies: A small standard deviation in experimental results suggests consistent, repeatable measurements.

Practical Example: Exam Scores

If the standard deviation of scores in one class is low, it suggests that most students scored close to the class average. In another class with a high standard deviation, scores vary widely, indicating differing levels of understanding.
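A small sketch makes the contrast concrete. The two sets of scores below are hypothetical, chosen so that both classes share the same average and differ only in spread.

```python
import numpy as np

# Hypothetical scores, invented for illustration: both classes average 75
class_a = [72, 75, 78, 74, 76]    # scores clustered near the average
class_b = [40, 95, 60, 100, 80]   # scores spread widely

mean_a, mean_b = np.mean(class_a), np.mean(class_b)
sd_a, sd_b = np.std(class_a), np.std(class_b)

print(f"Class A: mean = {mean_a:.1f}, SD = {sd_a:.2f}")  # mean = 75.0, SD = 2.00
print(f"Class B: mean = {mean_b:.1f}, SD = {sd_b:.2f}")  # mean = 75.0, SD = 22.36
```

The averages alone cannot tell the classes apart; the standard deviation does.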

6. Standard Deviation and the Normal Distribution

In a normal distribution:

  • About 68% of data falls within one standard deviation of the mean.
  • About 95% falls within two standard deviations.
  • About 99.7% lies within three standard deviations.

This property is called the 68-95-99.7 rule, and it helps in understanding the probability of data falling within certain ranges.
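The rule can be checked empirically on simulated data. The sketch below draws normally distributed values with NumPy's random generator; the seed, mean, scale and sample size are arbitrary choices for illustration, and the printed percentages will be close to, not exactly, 68%, 95% and 99.7%.

```python
import numpy as np

# Arbitrary illustrative parameters: seed, mean of 100, SD of 15
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=100.0, scale=15.0, size=100_000)

mean = samples.mean()
std_dev = samples.std()

# Fraction of samples lying within k standard deviations of the mean
fractions = {}
for k in (1, 2, 3):
    within = np.abs(samples - mean) <= k * std_dev
    fractions[k] = within.mean()
    print(f"Within {k} SD: {fractions[k]:.1%}")
```

The rule only holds for approximately normal data; strongly skewed or heavy-tailed data can give quite different percentages.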

7. Visualizing Standard Deviation

Visualization can help illustrate the spread of data. Below is a Python example for plotting data with its standard deviation.

import matplotlib.pyplot as plt
import numpy as np

# Sample data (test scores)
data = [50, 60, 80, 90, 100]
mean = np.mean(data)
std_dev = np.std(data)

# Plot
plt.hist(data, bins=5, alpha=0.5, color="blue", edgecolor="black")
plt.axvline(mean, color='red', linestyle='dashed', linewidth=1)
plt.axvline(mean + std_dev, color='green', linestyle='dashed', linewidth=1)
plt.axvline(mean - std_dev, color='green', linestyle='dashed', linewidth=1)
plt.xlabel("Scores")
plt.ylabel("Frequency")
plt.title("Data Spread with Mean and Standard Deviation")
plt.show()

In this plot:

  • The red line represents the mean.
  • The green lines mark one standard deviation above and below the mean. For roughly normal data, about 68% of values fall between them.

8. Using Standard Deviation to Detect Outliers

Outliers can skew data and mislead analysis. If a data point is more than two or three standard deviations away from the mean, it might be considered an outlier.

Example: Detecting Outliers in Temperature Data

import numpy as np

temperatures = [70, 72, 68, 71, 90, 73, 75, 72, 69, 80, 95] # Sample temperature data
mean_temp = np.mean(temperatures)
std_temp = np.std(temperatures)

# Identify outliers: values more than two standard deviations from the mean
outliers = [temp for temp in temperatures if abs(temp - mean_temp) > 2 * std_temp]
print("Outliers:", outliers)  # Outliers: [95]
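The same check can be wrapped in a reusable function with an adjustable threshold. `find_outliers` below is a hypothetical helper written for this article, not part of NumPy. It expresses each value as a z-score (its distance from the mean in standard deviations) and keeps those beyond the threshold.

```python
import numpy as np


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean.

    Hypothetical helper for illustration; uses the population standard deviation.
    """
    arr = np.asarray(values, dtype=float)
    std = arr.std()
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return []
    z_scores = (arr - arr.mean()) / std
    return arr[np.abs(z_scores) > threshold].tolist()


temperatures = [70, 72, 68, 71, 90, 73, 75, 72, 69, 80, 95]

outliers_2sd = find_outliers(temperatures, threshold=2.0)
outliers_3sd = find_outliers(temperatures, threshold=3.0)

print("Beyond 2 SD:", outliers_2sd)  # [95.0]
print("Beyond 3 SD:", outliers_3sd)  # []
```

The choice of threshold matters: 95 is flagged at two standard deviations, but nothing is flagged at three. Extreme values also inflate the standard deviation itself, which can hide outliers in small datasets.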
