1. What is a Percentile?
A percentile indicates the relative standing of a value within a dataset. Percentiles divide the sorted data into 100 equal parts: the k-th percentile is the value below which k% of the data points fall. For example, the 25th percentile is the value below which 25% of the data points are found.
Key Percentiles
- 25th Percentile (1st Quartile): Often called Q1, this is the lower quartile and represents the point below which 25% of the data lies.
- 50th Percentile (Median): This is the middle value, meaning half the data points are below this percentile.
- 75th Percentile (3rd Quartile): Also known as Q3, this is the upper quartile and represents the point below which 75% of the data lies.
2. How Percentiles are Calculated
Percentiles are calculated by sorting the data and identifying the position within the sorted dataset where the percentile lies. A commonly used formula for the position Pk of the k-th percentile in a dataset of size n is: Pk = (k / 100) × (n + 1)
where:
- k: The percentile value (e.g., 25, 50 or 75).
- n: The total number of data points.
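Pk is a position in the sorted data, counted from 1, not the percentile value itself. If Pk is not a whole number, it falls between two data points, and the percentile value is usually obtained by linear interpolation between them.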
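The position formula and the interpolation step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `percentile_n_plus_1`, the sample values, and the choice to clamp positions that fall outside the data to the smallest or largest value are all assumptions made for the example.

```python
def percentile_n_plus_1(data, k):
    """Return the k-th percentile using the position rule Pk = (k / 100) * (n + 1)."""
    if not data:
        raise ValueError("data must contain at least one value")
    values = sorted(data)
    n = len(values)
    position = k * (n + 1) / 100          # 1-based position in the sorted data
    position = min(max(position, 1), n)   # assumption: clamp to the valid range
    lower = int(position)                 # whole part: index of the lower neighbor
    fraction = position - lower           # fractional part: how far to interpolate
    if lower >= n:                        # position is at the last value
        return float(values[-1])
    low_value = values[lower - 1]
    high_value = values[lower]
    return low_value + fraction * (high_value - low_value)


sample = [15, 20, 35, 40, 50]             # hypothetical sample values
print(percentile_n_plus_1(sample, 25))    # position 1.5 -> 17.5
print(percentile_n_plus_1(sample, 50))    # position 3.0 -> 35.0
```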
3. Examples of Percentile Calculations
Suppose we have a dataset: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100.
Calculating the 25th Percentile (Q1)
- Sort the data (already sorted in this case).
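- Use the formula to find the position of the 25th percentile: P25 = (25 / 100) × (10 + 1) = 2.75
- Since 2.75 is between the 2nd and 3rd values (20 and 30), we interpolate, using the fractional part 0.75 as the weight: Q1 = 20 + 0.75 × (30 − 20) = 20 + 7.5 = 27.5
Thus, the 25th percentile is 27.5, meaning roughly a quarter of the values fall below 27.5. In a sample this small the share is only approximate: 2 of the 10 values lie below it.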
Calculating the Median (50th Percentile)
Following the same steps, the position for the 50th percentile is: P50 = (50 / 100) × (10 + 1) = 5.5
The median is interpolated between the 5th (50) and 6th (60) values: Q2 = 50 + 0.5 × (60 − 50) = 50 + 5 = 55
So, the median (50th percentile) is 55.
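The hand calculations can be cross-checked with Python's standard library. `statistics.quantiles` with `method="exclusive"` uses the same (n + 1) position rule, so it should reproduce the values worked out above. The variable names are chosen for this example only.

```python
import statistics

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# n=4 asks for the three cut points that split the data into quartiles.
# method="exclusive" places the k-th percentile at position (k / 100) * (n + 1).
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

print(q1, median, q3)  # 27.5 55.0 82.5
```

The third value, 82.5, is the 75th percentile under the same rule: position 8.25 lies between the 8th (80) and 9th (90) values.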
4. Percentiles in Data Science
Percentiles are especially useful in identifying outliers, understanding distribution patterns and assessing relative standing within a dataset.
Example Use Case: Income Distribution
Consider a dataset of annual incomes for a group of individuals:
Incomes = [25000, 30000, 40000, 50000, 55000, 60000, 75000, 85000, 90000, 100000]
- The 90th percentile income (P90) is the income below which 90% of the individuals fall.
- Calculating this can help identify the income bracket of the top-earning 10%.
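As a rough sketch, P90 for this dataset can be computed with the standard library, again assuming the (n + 1) position rule from section 2. Other interpolation conventions give a somewhat different threshold, so the number below depends on that assumption. The names `p90` and `top_earners` are illustrative.

```python
import statistics

incomes = [25000, 30000, 40000, 50000, 55000, 60000, 75000, 85000, 90000, 100000]

# n=10 returns the nine decile cut points; the last one is the 90th percentile.
deciles = statistics.quantiles(incomes, n=10, method="exclusive")
p90 = deciles[-1]

# Incomes above the threshold belong to the top-earning group.
top_earners = [income for income in incomes if income > p90]

print(p90)          # 99000.0
print(top_earners)  # [100000]
```

Here the position is (90 / 100) × (10 + 1) = 9.9, so P90 is interpolated between the 9th (90000) and 10th (100000) values.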
5. Plotting Percentiles with Code
Percentiles can be visualized using box plots, which show the interquartile range (IQR) and provide a summary of the distribution, including Q1, median (Q2), and Q3.
import numpy as np
import matplotlib.pyplot as plt
# Income data
incomes = [25000, 30000, 40000, 50000, 55000, 60000, 75000, 85000, 90000, 100000]
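# Calculating percentiles. np.percentile's default (linear) method interpolates at
# position (k / 100) × (n − 1), counted from 0, so results can differ from the (n + 1) formula.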
Q1 = np.percentile(incomes, 25)
Median = np.percentile(incomes, 50)
Q3 = np.percentile(incomes, 75)
# Display percentiles
print("25th Percentile (Q1):", Q1)
print("50th Percentile (Median):", Median)
print("75th Percentile (Q3):", Q3)
# Plotting
plt.boxplot(incomes, vert=False)
plt.xlabel("Income")
plt.title("Income Distribution with Percentiles")
plt.show()
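The quartiles printed by the code above will not match a hand calculation with the (n + 1) formula, because `np.percentile` defaults to a different interpolation convention based on (n − 1). The sketch below compares the two conventions without NumPy: `statistics.quantiles` with `method="inclusive"` follows the (n − 1) rule and `method="exclusive"` follows the (n + 1) rule. It also derives the IQR and the 1.5 × IQR limits that a box plot's whiskers use by default. The variable names are assumptions for this example.

```python
import statistics

incomes = [25000, 30000, 40000, 50000, 55000, 60000, 75000, 85000, 90000, 100000]

# (n - 1) rule: matches the default behaviour of np.percentile.
inclusive = statistics.quantiles(incomes, n=4, method="inclusive")
# (n + 1) rule: matches the position formula from section 2.
exclusive = statistics.quantiles(incomes, n=4, method="exclusive")

print(inclusive)  # [42500.0, 57500.0, 82500.0]
print(exclusive)  # [37500.0, 57500.0, 86250.0]

# Interquartile range and the default whisker limits of a box plot.
q1, _, q3 = inclusive
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(iqr)                       # 40000.0
print(lower_limit, upper_limit)  # -17500.0 142500.0
```

Every income lies inside these limits, so the box plot of this dataset shows no outlier points.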
6. Practical Applications of Percentiles in Data Science
Percentiles are widely applied in various areas within data science, including:
- Outlier Detection: Values below the 1st percentile or above the 99th percentile are often treated as candidate outliers.
- Customer Segmentation: In marketing, customers can be segmented based on their spending patterns (e.g., the top 10% of spenders).
- Health Metrics: Percentiles help in assessing health statistics, such as a child’s weight or height percentile compared to peers.
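The outlier-detection and segmentation ideas above can be illustrated with a short sketch. The data is synthetic and purely hypothetical (the values 0 to 99 plus one extreme value of 1000), and the cut-offs at the 1st, 90th, and 99th percentiles are conventions, not fixed rules.

```python
import statistics

# Synthetic example data: 101 values, one of them extreme.
values = list(range(100)) + [1000]

# n=100 returns the 99 percentile cut points; index k - 1 holds the k-th percentile.
cuts = statistics.quantiles(values, n=100, method="inclusive")
p1, p90, p99 = cuts[0], cuts[89], cuts[98]

# Outlier detection: flag values outside the 1st to 99th percentile band.
outliers = [v for v in values if v < p1 or v > p99]

# Segmentation: values above the 90th percentile form the top group.
top_group = [v for v in values if v > p90]

print(p1, p90, p99)    # 1.0 90.0 99.0
print(outliers)        # [0, 1000]
print(len(top_group))  # 10
```

Note that a percentile band always flags a fixed share of the data, so flagged values are candidates for review, not confirmed errors.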