🌟 Data Science Methodology: A Beginner’s Guide
Understanding the 10 Steps to Solve Data Science Problems Effectively.
Data science is powerful. It combines statistics, technology, and domain expertise to extract valuable insights from data. But here’s the problem: with so much data and computing power at our fingertips, we often jump to solutions before we understand the question we are trying to answer.
That’s where methodology comes in. Methodology provides a structured approach to solving problems in data science. In this article, we’ll explore what methodology means, its role in data science, John Rollins’s contributions, and the 10 stages of the standard Data Science Methodology.
🔹 What is Methodology?
At its simplest:
👉 A methodology is a system of methods used in a particular area of study.
In research and data science, methodology is a guideline for decisions that scientists must make during the problem-solving process. It tells us:
How to define a problem.
How to collect, prepare, and analyze data.
How to validate results before deployment.
Without methodology, we risk:
❌ Jumping too quickly to solutions.
❌ Misinterpreting data.
❌ Solving the wrong problem.
🔹 Data Science Methodology
In data science, methodology is the structured approach that helps data scientists solve complex problems and make data-driven decisions.
It includes:
Data collection forms (what data to collect, how to collect it).
Measurement strategies (what metrics matter).
Comparison of methods (choosing the right technique for the right problem).
One of the best-known methodologies comes from John Rollins, an IBM Senior Data Scientist, who emphasized that asking the right questions is the foundation of a successful project.
🔹 The 10 Stages of Data Science Methodology
According to John Rollins’s framework, data science methodology consists of 10 stages. Let’s go through each, along with the key guiding question.

1. Business Understanding
Question: What is the problem you are trying to solve?
Example: A bank wants to reduce customer churn.
2. Analytic Approach
Question: How can data answer this problem?
Example: Predict churn using machine learning classification models.
3. Data Requirements
Question: What data do you need to solve the problem?
Example: Customer demographics, transaction history, support tickets.
4. Data Collection
Question: Where will the data come from and how will you get it?
Example: Pulling customer records from CRM systems, surveys, and databases.
5. Data Understanding
Question: Does the data represent the problem correctly?
Example: Are there enough churned vs. retained customers in the dataset?
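A quick way to check this is to count how often each outcome appears. The sketch below is only an illustration: the labels are made up, and class_balance is a hypothetical helper name, not something from the course.

```python
from collections import Counter

def class_balance(labels):
    """Return the share of each label in the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up labels: 18 retained customers and 2 churned customers.
labels = ["retained"] * 18 + ["churned"] * 2
balance = class_balance(labels)

for label, share in balance.items():
    print(f"{label}: {share:.0%}")
```

If one class is very rare, as churned customers are here, that is worth knowing before any modeling starts, because it affects both the data you may need to collect and how you judge the model later.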
6. Data Preparation
Question: What work is required to clean and prepare the data?
Example: Handling missing values, removing duplicates, feature engineering.
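The three tasks in the example can be sketched in a few lines of plain Python. Everything here is assumed for illustration: the column names, the sample rows, the choice to fill missing ages with the median, and the tickets_per_100_spend feature are my own, not part of the course material.

```python
from statistics import median

# Made-up customer records; customer 2 appears twice and has no age.
raw_rows = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.0, "tickets": 2},
    {"customer_id": 2, "age": None, "monthly_spend": 80.0, "tickets": 0},
    {"customer_id": 2, "age": None, "monthly_spend": 80.0, "tickets": 0},
    {"customer_id": 3, "age": 52, "monthly_spend": 40.0, "tickets": 4},
]

def prepare(rows):
    # 1. Remove duplicates: keep the first row seen for each customer_id.
    seen = set()
    unique = []
    for row in rows:
        if row["customer_id"] not in seen:
            seen.add(row["customer_id"])
            unique.append(dict(row))  # copy so the raw data stays untouched

    # 2. Handle missing values: fill missing ages with the median known age.
    known_ages = [r["age"] for r in unique if r["age"] is not None]
    fill_age = median(known_ages)
    for r in unique:
        if r["age"] is None:
            r["age"] = fill_age

    # 3. Feature engineering: support tickets per 100 units of monthly spend.
    for r in unique:
        r["tickets_per_100_spend"] = 100 * r["tickets"] / r["monthly_spend"]
    return unique

clean = prepare(raw_rows)
for row in clean:
    print(row)
```

Median imputation is just one option. Depending on the problem, dropping incomplete rows or flagging them with an extra column may be a better fit.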
7. Modeling
Question: Which algorithms and models should be applied?
Example: Logistic regression, decision trees, or neural networks.
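To make the first option a little more concrete, here is a minimal sketch of logistic regression trained with plain gradient descent on a single made-up feature (number of support tickets). The data, the function names, the learning rate, and the number of epochs are all assumptions for illustration. In a real project you would more likely use an established library and compare several candidate models.

```python
import math

# Made-up toy data: support tickets filed, and whether the customer
# churned (1) or stayed (0).
tickets = [0, 1, 1, 2, 5, 6, 7, 8]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit a weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # predicted probability minus label
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b

w, b = train_logistic(tickets, churned)

def predict_churn_probability(ticket_count):
    return sigmoid(w * ticket_count + b)

print(round(predict_churn_probability(0), 3))
print(round(predict_churn_probability(8), 3))
```

On this toy data the model learns that more tickets go with a higher churn probability. Whether a simple model like this is enough, or a decision tree or neural network is worth the extra complexity, is exactly what this stage is meant to decide.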
8. Evaluation
Question: Does the model actually answer the business problem?
Example: If the model predicts churn with 80% accuracy, is it actionable?
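One reason accuracy alone may not be actionable: if 20 out of 100 customers churn, a model that never predicts churn is also 80% accurate, yet it finds none of the customers the bank cares about. The sketch below works through that arithmetic. The numbers and the evaluate helper are assumptions for illustration.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up data: 20 churned customers (1) and 80 retained customers (0).
y_true = [1] * 20 + [0] * 80

# A "model" that always predicts that the customer stays.
always_stay = [0] * 100

result = evaluate(y_true, always_stay)
print(result)
```

Here accuracy is 0.8 while recall is 0.0, so looking at recall and precision alongside accuracy gives a better picture of whether the model answers the business problem.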
9. Deployment
Question: Can the model be put into real-world practice?
Example: Deploying a churn prediction model into a customer service platform.
10. Feedback
Question: Can you get constructive feedback to improve the model?
Example: Monitor predictions and update the model as customer behavior changes.
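A simple form of monitoring is to compare recent accuracy with the accuracy measured at evaluation time and flag the model when the gap gets too large. This is a minimal sketch under assumptions of my own: the needs_retraining name, the 0.05 tolerance, and the sample outcomes are all hypothetical.

```python
def needs_retraining(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Flag the model if recent accuracy falls below baseline minus tolerance.

    recent_outcomes is a list of (predicted, actual) pairs collected
    after deployment, once the real outcome is known.
    """
    correct = sum(1 for predicted, actual in recent_outcomes if predicted == actual)
    recent_accuracy = correct / len(recent_outcomes)
    flag = recent_accuracy < baseline_accuracy - tolerance
    return flag, recent_accuracy

# Made-up outcomes: 6 of the 10 recent predictions were correct.
recent = [(1, 1), (0, 0), (0, 1), (1, 0), (0, 0),
          (0, 0), (1, 1), (0, 1), (0, 0), (1, 0)]

flag, recent_accuracy = needs_retraining(0.80, recent)
print(flag, recent_accuracy)
```

With a baseline of 0.80 and recent accuracy of 0.6, the model is flagged, which would send the team back to earlier stages such as data collection or modeling.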
🔹 Why Methodology Matters
A clear methodology:
Keeps the team focused on the right problem.
Prevents wasted time chasing irrelevant data.
Provides a repeatable framework for different projects.
Encourages continuous learning through feedback loops.
Skipping methodology may feel faster, but it often leads to poor results. A step-by-step approach makes sound insights and reliable outcomes more likely.
Source: IBM Data Science Professional Certificate (Coursera)
Instructors: A special thanks to the incredible IBM instructors and the entire IBM Skills Network Team.
Disclaimer: This blog post is part of my personal learning journal where I document my progress through the IBM Data Science Professional Certificate. These articles represent my personal understanding and interpretation of the course material. They are not official course notes and are not endorsed by IBM or Coursera.