Whenever we hear about data science, the first question that comes to mind is: what is the “data” in data science?
Data is information. It can be anything that gives us facts or details about something. For example:
- The temperature outside, such as 30°C.
- A person’s age, such as 25 years.
- A tweet you posted.
- A photo in an Instagram feed.
- The audio of a song.
In data science, data is the raw material. Just as a chef needs raw ingredients to cook, a data scientist needs data to create meaningful insights.
Why Is Data Important in Data Science?
Without data, there is no data science. We could not analyze trends such as sales going up or down, make predictions about the weather, or train AI models to recognize your face in photos.
Data is the starting point of everything in data science. But not all data is ready to use. So, a big part of data science is cleaning, organizing, and understanding data before doing analysis.
Think of data as raw gold: it is valuable, but you need to process it before it becomes jewelry. Similarly, data needs to be processed before it yields useful insights.
Types of Data in Data Science
In the world of data science, not all data looks the same. There are three main types:
1) Structured Data:
- Structured data is organized in a highly formatted way. It is arranged in rows and columns, like a table in a database. Each piece of information is stored in a specific “cell”, so we can easily sort, filter, or analyze the data.
- Example: A customer information table that holds names, ages, and purchase amounts. This is structured because each column represents a variable, and each row represents a unique observation.
Real-Life Example:
A bank’s customer list:
| Name | Age | Balance |
|---|---|---|
| Rahul | 28 | ₹50,000 |
| Priya | 32 | ₹80,000 |
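Because every row has the same columns, sorting and filtering are straightforward. Here is a minimal sketch in plain Python, assuming the bank table above is stored as a list of dictionaries. The variable names and the ₹60,000 threshold are illustrative assumptions, not part of any real system.

```python
# The bank's customer list, one dictionary per row (balances in rupees).
customers = [
    {"name": "Rahul", "age": 28, "balance": 50_000},
    {"name": "Priya", "age": 32, "balance": 80_000},
]

# Sort rows by balance, highest first.
by_balance = sorted(customers, key=lambda row: row["balance"], reverse=True)

# Filter rows: keep names of customers above an illustrative threshold.
high_balance = [row["name"] for row in customers if row["balance"] > 60_000]

print([row["name"] for row in by_balance])
print(high_balance)
```

In practice a library such as pandas or a SQL database would usually do this work, but the idea is the same: fixed columns make each operation a one-liner.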
2) Unstructured Data:
- Unstructured data is more complex and doesn’t follow a fixed format. It includes text, images, videos, emails, social media posts, and audio files, all of which require specialized techniques to process and analyze. It is harder to organize because it does not fit into rows and columns.
- Example: A Facebook post contains text, emojis, and images. This data is unstructured.
3) Semi-Structured Data
- Semi-structured data is a hybrid format, like a well-labeled box in a messy drawer. It’s not fully organized like a table, but it has enough structure to make sense of things.
- Example: A JSON or XML file containing product information. It has tags like “Product Name” and “Price” but doesn’t follow a strict row-and-column format.
{
  "name": "Anjali",
  "age": 29,
  "city": "Mumbai"
}
- This matters because many APIs and modern systems use semi-structured formats like JSON, which are flexible and easy for machines to read.
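To see how a program reads such a record, here is a small sketch using Python’s built-in json module. The record is the one shown above. The "email" field is a hypothetical addition, used only to show that semi-structured records may lack fields.

```python
import json

# The same record as a JSON string, as it might arrive from an API.
raw = '{"name": "Anjali", "age": 29, "city": "Mumbai"}'

record = json.loads(raw)  # parse the text into a Python dictionary
print(record["name"], record["age"])

# Fields can be missing in semi-structured data, so read them defensively.
email = record.get("email", "unknown")  # "email" is a hypothetical field
print(email)
```

Using .get() with a default avoids a KeyError when a record does not contain the field you expected.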
Characteristics of Data
Data is more than numbers and text. It has several characteristics that determine how useful and how challenging it is. These characteristics are described as the 5 V’s:
- Volume
- Variety
- Velocity
- Veracity
- Value
1) Volume: Volume means the size or amount of data. We generate terabytes and petabytes of data every single day. For example, WhatsApp carries an enormous number of messages, and Instagram receives a constant flow of uploaded posts. All of this is stored at massive volume.
To handle data at this scale, you need powerful tools like Hadoop, Spark, or cloud storage. Without them, processing it becomes impractical.
2) Variety: Data comes in different forms; not everything is stored in a neat table. Variety means the data may be structured, unstructured, or semi-structured. Data scientists need to be able to analyze all forms of data, not only tables.
For example, a hospital stores structured data (patient names, age), unstructured data (X-ray images, doctor notes), and semi-structured data (electronic health records in JSON format).
3) Velocity: Velocity means the speed at which data is generated and collected. Some data comes slowly (monthly reports), while some comes in real-time (like live stock market prices).
For example, a live cricket match score is updated every second, and sensor data from self-driving cars streams continuously. When data arrives this fast, you need tools built for real-time processing, like Kafka, Flink, or Spark Streaming.
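The core idea behind stream processing is to handle each reading as it arrives instead of waiting for the full dataset. The sketch below shows that idea with a rolling average in plain Python. It is not how Kafka, Flink, or Spark Streaming work internally, and the readings are made-up values standing in for a live feed.

```python
from collections import deque


def rolling_average(stream, window=3):
    """Yield the average of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings drop out automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings standing in for a live feed.
readings = [10, 20, 30, 40]
averages = list(rolling_average(readings))
print(averages)
```

Because the function is a generator, it never needs the whole stream in memory; it keeps only the last few readings.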
4) Veracity: Veracity refers to how accurate and reliable the data is, because not all data can be trusted. It asks whether the data is true, consistent, and free from errors.
For example, if someone fills out a form and enters “12345” as their name, that is bad data. Veracity is important because wrong data leads to wrong analysis, which leads to wrong decisions.
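A simple validation check can catch entries like this before they reach the analysis. The sketch below uses one illustrative rule (letters, spaces, hyphens, and apostrophes only). Real name validation is more nuanced, and the function name and sample submissions are assumptions made for the example.

```python
def is_valid_name(value):
    """Return True if value looks like a plausible name (an illustrative rule)."""
    cleaned = value.strip()
    # Require at least one letter, and allow only letters, spaces, hyphens, apostrophes.
    has_letter = any(ch.isalpha() for ch in cleaned)
    allowed = all(ch.isalpha() or ch in " -'" for ch in cleaned)
    return has_letter and allowed


# Hypothetical form submissions, including the "12345" case from the text.
submissions = ["Priya", "12345", "  ", "Rahul Sharma"]
valid = [name for name in submissions if is_valid_name(name)]
print(valid)
```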
5) Value: Value refers to the benefit we get from the data, which is the ultimate goal. For example, Netflix uses viewing data to recommend shows that match each user’s interests, and a company can analyze customer purchase patterns to increase sales.
Sources of Data
In data science, the first step is always collecting data. But data doesn’t come from one fixed place; it can be gathered in many ways depending on the problem. Common sources include:
- Primary Sources: This data is collected by us for our own project. It is original, raw, and specific to our needs. It is usually more accurate for the project, but it can be time-consuming and expensive to gather.
- Secondary Sources: This data is collected by someone else but is available for us to use. It may be freely accessible, or it may require permission. Examples include government databases such as census data and weather reports. Because we did not collect it ourselves, secondary data may need to be cleaned or adjusted.
- Real-Time Sources: This data is generated continuously and instantly by machines, sensors, or online activity. Examples include live stock market data and online transactions. It is crucial for instant decision-making, such as fraud detection for credit cards.
How is Data Represented?
In data science, data are represented in multiple formats:
1) Numerical Data: This type of data is expressed in numbers and can be measured or counted.
Numerical data is easy to work with because you can apply mathematical operations like addition, averaging, or statistical modelling.
Sub-types of numerical data:
- Discrete data → Whole numbers, like the number of students in a class.
- Continuous data → Values that can include decimals, like a temperature of 26.5°C.
2) Categorical Data: This data represents groups, labels, or categories instead of direct measurements.
It helps in grouping and comparing things, but you can’t apply arithmetic to it directly. For example, red + blue does not make sense.
Sub-types of categorical data:
- Nominal data → Categories with no order, like the colors Red, Blue, and Green.
- Ordinal data → Categories with an order, such as Poor, Average, and Good.
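The difference between nominal and ordinal data shows up in what you can do with each in code. In this sketch, the nominal colors are only counted, while the ordinal ratings are mapped to ranks so they can be sorted. The rank numbers (0, 1, 2) and the sample lists are illustrative assumptions.

```python
from collections import Counter

# Nominal data: counting is meaningful, ordering is not.
colors = ["Red", "Blue", "Red", "Green"]
color_counts = Counter(colors)

# Ordinal data: map each label to its rank so it can be sorted and compared.
rating_order = {"Poor": 0, "Average": 1, "Good": 2}
ratings = ["Good", "Poor", "Average", "Good"]
sorted_ratings = sorted(ratings, key=rating_order.get)

print(color_counts["Red"])
print(sorted_ratings)
```

Note that the ranks only express order. The gap between Poor and Average is not necessarily the same as the gap between Average and Good, so averaging the ranks can be misleading.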
3) Text Data: Information stored in natural language, like sentences, words, or documents. It matters because text carries context, opinions, and meaning, but it requires NLP (Natural Language Processing) techniques to analyze.
- Example: A customer review: “The delivery was late, but the product is excellent!” A social media comment: “Loved this movie”.
Practical Example: Data Representation in Code
In Python, we can store data for analysis in lists, arrays, or other structures. Each type of data looks different in code.
# Numerical Data (numbers you can calculate with)
prices = [19.99, 29.99, 9.99, 49.99]

# Categorical Data (labels or categories that describe items)
categories = ["Electronics", "Clothing", "Food"]

# Text Data (sentences, opinions, or free text)
reviews = [
    "Great product!",
    "Will buy again.",
    "Not satisfied with the quality.",
]
In this example:
- prices is a list of numerical data.
- categories is a list of categorical labels.
- reviews is a list of text data, which might require further processing.
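Each type also supports different operations. As a rough sketch, the snippet below repeats the three lists and applies one typical operation to each: an average for numbers, a count for categories, and a simple word count for text. Splitting on whitespace is a deliberately naive stand-in for real NLP processing.

```python
from collections import Counter
from statistics import mean

prices = [19.99, 29.99, 9.99, 49.99]
categories = ["Electronics", "Clothing", "Food"]
reviews = ["Great product!", "Will buy again.", "Not satisfied with the quality."]

# Numerical data: arithmetic works directly.
average_price = round(mean(prices), 2)

# Categorical data: count how often each label appears.
category_counts = Counter(categories)

# Text data: needs processing first; here, a naive split into words.
word_counts = [len(review.split()) for review in reviews]

print(average_price)
print(category_counts)
print(word_counts)
```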
Data Collection Techniques
Data scientists gather data using various methods, depending on the needs of the project:
1) Web Scraping: Collecting data from websites using Python libraries like BeautifulSoup, Scrapy, and Selenium. Imagine you want to collect all the product prices from an online store. Instead of manually copying them, you write a script that fetches all the product names, prices, and ratings.
2) APIs: APIs are like “data delivery services” offered by platforms. Instead of scraping, you can ask the platform directly for data in a structured way. For example, Twitter (X) provides an API where you can request tweets about a topic.
3) Surveys and Forms: Collecting information directly from people. This is one of the oldest ways. If you want to study customer satisfaction, you can ask them through a Google Form or any other survey tool.
4) Sensors: Machines and devices that continuously generate data in real-time. This is common in IoT (Internet of Things). For example, a smartwatch tracks your heart rate every second, or a smart car records fuel usage.
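To make the web scraping idea concrete, here is a minimal sketch that extracts product names and prices from an HTML snippet. It uses Python’s built-in html.parser rather than BeautifulSoup, Scrapy, or Selenium so that it runs without extra installs. The HTML, the class names, and the products are all hypothetical, and a real page’s markup will differ.

```python
from html.parser import HTMLParser

# A hypothetical product page fragment; a real page's markup will differ.
PAGE = """
<ul>
  <li class="product"><span class="name">Mouse</span> <span class="price">19.99</span></li>
  <li class="product"><span class="name">Keyboard</span> <span class="price">49.99</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect names and prices from <span class="name"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # which span we are inside, if any
        self.names = []
        self.prices = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

    def handle_data(self, data):
        if self.current_field == "name":
            self.names.append(data.strip())
        elif self.current_field == "price":
            self.prices.append(float(data.strip()))


parser = ProductParser()
parser.feed(PAGE)
products = list(zip(parser.names, parser.prices))
print(products)
```

With a live site, the HTML would first be downloaded over HTTP and then fed to the parser in the same way.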