Data Science – Python DataFrame

What is a DataFrame?

A DataFrame is a table-like structure with rows and columns, similar to an Excel spreadsheet or SQL table.

Different columns in a DataFrame can hold different data types (e.g., integers, floats, strings), while each row represents an individual observation or record. DataFrames are provided by the pandas library, and this flexibility makes them one of the most commonly used data structures for data science in Python.

Key Features of a DataFrame:

  1. Labeled axes: Each row and column is labeled, allowing for easy data selection and manipulation.
  2. Multiple data types: A single DataFrame can store integers, floats, strings, and even other data types like dates.
  3. Indexing and Selection: DataFrames support various ways to access and modify data, such as by column name, row index, or conditions.
  4. Built-in Data Operations: DataFrames have built-in functions for data cleaning, filtering, grouping, merging, and aggregation.
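
The first two features can be inspected directly on a small DataFrame. The following is a minimal sketch; the `people` data, the `Height_m` column, and the row labels "a", "b", "c" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [24, 27, 22],
        "Height_m": [1.65, 1.80, 1.75],
    },
    index=["a", "b", "c"],  # custom row labels instead of the default 0, 1, 2
)

print(people.index.tolist())    # row labels
print(people.columns.tolist())  # column labels
print(people.dtypes)            # each column has its own data type

# Label-based selection: the Age value in the row labeled "b"
print(people.loc["b", "Age"])
```

Each column keeps a single data type, while the DataFrame as a whole mixes several.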

Creating a DataFrame

DataFrames can be created in multiple ways, such as from dictionaries, from lists, or by reading a file (like a CSV file). Here are a few examples of creating DataFrames:

Example 1: Creating a DataFrame from a Dictionary

import pandas as pd

# Creating a DataFrame using a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'City': ['New York', 'Paris', 'London']
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City
0    Alice   24  New York
1      Bob   27     Paris
2  Charlie   22    London
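
Lists work as well. As a rough sketch, the same table could be built from a list of rows (with the column names passed separately) or from a list of dictionaries; the variable names `from_rows` and `from_records` are chosen here only for illustration.

```python
import pandas as pd

# A list of rows: column names are supplied through the columns argument
rows = [
    ["Alice", 24, "New York"],
    ["Bob", 27, "Paris"],
    ["Charlie", 22, "London"],
]
from_rows = pd.DataFrame(rows, columns=["Name", "Age", "City"])
print(from_rows)

# A list of dictionaries: the keys become the column names
records = [
    {"Name": "Alice", "Age": 24},
    {"Name": "Bob", "Age": 27},
]
from_records = pd.DataFrame(records)
print(from_records)
```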

Example 2: Creating a DataFrame from a CSV File

DataFrames can be easily created from external data sources like CSV files, which are common in data science projects.

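# Load into a separate variable so the df from Example 1 stays available below
# ('sample_data.csv' must exist in the working directory)
csv_df = pd.read_csv('sample_data.csv')
print(csv_df.head()) # Shows the first 5 rows of the DataFrame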
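
To try read_csv without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is only a sketch: the CSV content below is invented, and `sample` is an illustrative variable name.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = """Name,Age,City
Alice,24,New York
Bob,27,Paris
Charlie,22,London
"""

# read_csv accepts file-like objects as well as file paths
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))  # first 2 rows
print(sample.shape)    # (number of rows, number of columns)
```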

Accessing Data in a DataFrame

pandas provides several ways to retrieve data from a DataFrame, by column, by row, or by both at once.

Accessing Columns

Columns can be accessed by specifying the column name within brackets or using dot notation.

# Accessing a single column
print(df['Name'])

# Accessing multiple columns
print(df[['Name', 'City']])
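
Dot notation is convenient but has limits, so bracket notation is the safer default. The sketch below uses a made-up DataFrame whose column names ("Home City" and "count") were picked to show two such limits.

```python
import pandas as pd

# Hypothetical data with one awkward column name and one that clashes with a method
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob"],
        "Home City": ["New York", "Paris"],
        "count": [1, 2],
    }
)

# Dot notation works for simple names and matches bracket notation
print(people.Name.equals(people["Name"]))

# A name containing a space cannot be written with dot notation
print(people["Home City"])

# 'count' is also a DataFrame method, so people.count is the method, not the column
print(callable(people.count))
print(people["count"])
```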

Accessing Rows

Rows can be accessed using the loc and iloc indexers:

  • loc: Used for label-based indexing (using row labels).
  • iloc: Used for integer-based indexing (using row positions).

# Using loc to access a row by label
print(df.loc[0]) # Accesses the row labeled 0 (the first row here, because the default labels are 0, 1, 2)

# Using iloc to access rows by position
print(df.iloc[1:3]) # Accesses the rows at positions 1 and 2 (the stop position 3 is excluded)
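
With the default index, labels and positions coincide, which hides the difference between the two indexers. The sketch below uses a made-up `scores` DataFrame with labels 10, 20, 30, 40 to make it visible.

```python
import pandas as pd

# Hypothetical data whose row labels differ from the row positions
scores = pd.DataFrame(
    {"Score": [85, 90, 78, 92]},
    index=[10, 20, 30, 40],
)

by_label = scores.loc[20]     # the row labeled 20
by_position = scores.iloc[1]  # the row at position 1 (the same row)

# Label slices include both endpoints; position slices exclude the stop
label_slice = scores.loc[10:30]    # labels 10, 20, 30
position_slice = scores.iloc[0:2]  # positions 0 and 1

print(by_label["Score"], by_position["Score"])
print(len(label_slice), len(position_slice))
```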

Data Manipulation with DataFrames

DataFrames provide numerous functions for data manipulation, which are essential for cleaning and preparing data before analysis.

Adding and Removing Columns

Adding new columns or deleting existing ones is simple with DataFrames.

# Adding a new column
df['Country'] = ['USA', 'France', 'UK']
print(df)

# Dropping a column
# drop() returns a new DataFrame; df keeps 'Age', which the later examples use
df_without_age = df.drop(columns=['Age'])
print(df_without_age)

Filtering Data

DataFrames allow filtering data based on specific conditions, making it easy to work with only relevant data.

# Filtering rows where Age is greater than 23
filtered_df = df[df['Age'] > 23]
print(filtered_df)
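
Conditions can also be combined. A minimal sketch, assuming a made-up `people` DataFrame: each condition goes in parentheses, and `&` (and), `|` (or), and `~` (not) are used in place of the Python keywords.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
people = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [24, 27, 22],
        "City": ["New York", "Paris", "London"],
    }
)

# Rows where Age is greater than 23 AND City is Paris or London
older_europeans = people[(people["Age"] > 23) & (people["City"].isin(["Paris", "London"]))]
print(older_europeans)

# Rows where Age is NOT greater than 23
not_older = people[~(people["Age"] > 23)]
print(not_older)
```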

Grouping and Aggregation

Grouping data allows data scientists to analyze patterns, calculate summaries, and perform operations on specific groups.

# Grouping by City and calculating the mean Age
grouped_df = df.groupby('City')['Age'].mean()
print(grouped_df)
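
Several summaries can be computed per group in one call. The following sketch uses an invented `sales` DataFrame, since the three-row example above has only one row per city.

```python
import pandas as pd

# Hypothetical sales data with several rows per city
sales = pd.DataFrame(
    {
        "City": ["Paris", "Paris", "London", "London", "London"],
        "Amount": [100, 300, 50, 150, 100],
    }
)

# One row per city, one column per aggregation
summary = sales.groupby("City")["Amount"].agg(["count", "mean", "sum"])
print(summary)
```

The result is indexed by the group key (City), sorted alphabetically by default.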

Data Analysis with DataFrames

DataFrames provide built-in functions to quickly analyze and describe data, allowing data scientists to gain insights with minimal code.

Descriptive Statistics

The describe() method provides a quick statistical summary of numerical columns: count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum.

# Summary statistics of the DataFrame
print(df.describe())
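
The summary returned by describe() is itself a DataFrame, so individual statistics can be looked up by label. A small sketch with made-up `ages` data:

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
ages = pd.DataFrame(
    {
        "Age": [22, 24, 27, 31],
        "City": ["London", "New York", "Paris", "Paris"],
    }
)

stats = ages.describe()  # only numeric columns are summarized by default

print(stats.index.tolist())    # the names of the statistics
print(stats.columns.tolist())  # the summarized columns
print(stats.loc["50%", "Age"], ages["Age"].median())  # the 50% row is the median
```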

Handling Missing Values

DataFrames offer methods for handling missing values, such as filling them with specific values or removing rows/columns with missing data.

# Filling missing values with a default value (returns a new DataFrame)
df_filled = df.fillna(0)

# Alternatively, dropping rows that contain missing values
df_dropped = df.dropna()
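
Before choosing between filling and dropping, it helps to count the missing values in each column. The sketch below assumes a made-up `readings` DataFrame and shows per-column fill values and dropping based on a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with gaps
readings = pd.DataFrame(
    {
        "Sensor": ["s1", "s2", "s3", "s4"],
        "Temp": [21.5, np.nan, 23.0, np.nan],
        "Note": ["ok", None, "ok", "check"],
    }
)

# Number of missing values in each column
missing_per_column = readings.isna().sum()
print(missing_per_column)

# A different fill value per column, passed as a dictionary
filled = readings.fillna({"Temp": readings["Temp"].mean(), "Note": "unknown"})
print(filled)

# Drop only the rows where Temp is missing
complete_rows = readings.dropna(subset=["Temp"])
print(complete_rows)
```

Both fillna and dropna return new DataFrames here, so `readings` itself is left unchanged.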

Merging and Joining DataFrames

DataFrames can be combined using merging and joining, which is useful when dealing with multiple datasets.

Example of Merging DataFrames

data1 = {'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']}
data2 = {'ID': [1, 2, 4], 'Score': [85, 90, 78]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Inner merge on the 'ID' column: keeps only the IDs present in both DataFrames (1 and 2)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Output:

   ID     Name  Score
0   1    Alice     85
1   2      Bob     90
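
An inner merge drops IDs 3 and 4 because each appears in only one table. As a sketch of the alternatives, a left merge keeps every row of the first DataFrame, and an outer merge keeps all rows from both; the names `students` and `results` are illustrative stand-ins for the two DataFrames above.

```python
import pandas as pd

students = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
results = pd.DataFrame({"ID": [1, 2, 4], "Score": [85, 90, 78]})

# Left merge: all students are kept; Charlie has no score, so Score is NaN
left = pd.merge(students, results, on="ID", how="left")
print(left)

# Outer merge: all IDs from both tables; indicator=True adds a _merge column
# that records where each row came from
outer = pd.merge(students, results, on="ID", how="outer", indicator=True)
print(outer)
```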

Real-World Applications of DataFrames in Data Science

DataFrames are widely used in various real-world data science applications:

  1. Data Cleaning: In data preparation, DataFrames are used for handling missing data, correcting data types, and removing duplicates.
  2. Financial Analysis: Financial analysts use DataFrames to organize stock data, calculate metrics, and analyze trends.
  3. Customer Analytics: DataFrames enable analysis of customer demographics, behavior, and segmentation for better marketing insights.
  4. Machine Learning: DataFrames are used to store features and labels, prepare training and testing data, and perform feature engineering.
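
As an illustration of the data cleaning item, the sketch below removes duplicate rows and corrects a column type. The `raw` data is invented, with amounts stored as text, as often happens after importing a file.

```python
import pandas as pd

# Hypothetical raw data: one repeated row, and Spend stored as strings
raw = pd.DataFrame(
    {
        "Customer": ["Ann", "Ben", "Ann", "Cid"],
        "Spend": ["10.5", "20", "10.5", "7.25"],
    }
)

# Remove exact duplicate rows and renumber the index
clean = raw.drop_duplicates().reset_index(drop=True)

# Convert the text column to floats so arithmetic works
clean["Spend"] = clean["Spend"].astype(float)

print(clean)
print(clean["Spend"].sum())
```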

Example: Simple Data Analysis with a DataFrame

Here’s a complete example of creating, cleaning, and analyzing data in a DataFrame.

import pandas as pd

# Creating a DataFrame with sample data
data = {
'Product': ['A', 'B', 'C', 'D'],
'Price': [100, 200, 300, 400],
'Units_Sold': [50, 40, None, 70]
}
df = pd.DataFrame(data)

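# Filling missing values with the column mean
# (assigning the result back avoids calling fillna with inplace=True on a selected
# column, which newer pandas versions warn about and which may leave df unchanged)
df['Units_Sold'] = df['Units_Sold'].fillna(df['Units_Sold'].mean())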

# Calculating total revenue
df['Revenue'] = df['Price'] * df['Units_Sold']

# Displaying the DataFrame
print("Data Analysis Table:")
print(df)

# Summary of Revenue
print("Total Revenue:", df['Revenue'].sum())

Output:

Data Analysis Table:
  Product  Price  Units_Sold  Revenue
0       A    100   50.000000   5000.0
1       B    200   40.000000   8000.0
2       C    300   53.333333  16000.0
3       D    400   70.000000  28000.0
Total Revenue: 57000.0
