Introduction to Data Science Process

Bibek Shah Shankhar
2 min read · May 24, 2020

The CRISP-DM Process (Cross-Industry Standard Process for Data Mining)

The CRISP-DM process gives you an organized way to plan a data mining project. When you get into the weeds of coding, try to take a step back and identify which part of the process you are in, and make sure you remember the question you are trying to answer and what a solution to that question looks like. This process comes in handy, and it is worth working through before implementing any data science project.

There are several steps to follow for this process:

  1. Business Understanding
  2. Data Understanding
  3. Prepare data
  4. Data Modeling
  5. Evaluate the results
  6. Deploy

Fig: CRISP-DM Process

The steps of CRISP-DM are described below.

1. Business Understanding

This means understanding the problems and questions you want to tackle in the context of the domain you are working in. For instance:

How do we acquire new customers?

Does a new treatment perform better than an existing treatment?

How can we improve communication?

How can we improve travel?

How can we better retain information?

2. Data Understanding

At this stage, you translate the questions from your business understanding into questions about data. You may already have data that can answer them, or you may need to gather data to get at the questions you are interested in.
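As a rough illustration of a first look at the data, the sketch below profiles a small set of records using only the Python standard library. The records, the column names, and the `profile` helper are all made up for this example; they are not part of CRISP-DM itself.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.0},
    {"customer_id": 2, "age": None, "monthly_spend": 80.5},
    {"customer_id": 3, "age": 45, "monthly_spend": None},
    {"customer_id": 4, "age": 29, "monthly_spend": 60.0},
]


def profile(rows, column):
    """Summarize one column: row count, missing values, mean of the rest."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present) if present else None,
    }


age_profile = profile(records, "age")
spend_profile = profile(records, "monthly_spend")
print(age_profile)
print(spend_profile)
```

A summary like this tells you quickly whether the data you have can answer the business question, or whether you need to collect more.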

3. Prepare data

This is generally the most time-consuming phase. It covers data selection, data cleaning, data construction, data integration, and data formatting. In this step, you wrangle the data into a form that lets you answer your questions. Wrangling and cleaning are often said to take about 80% of the time in a data analysis project.
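To make the preparation tasks concrete, here is a minimal sketch that selects the needed fields, drops rows with a missing value, removes duplicates, and formats the types. The raw rows, the field names, and the `prepare` function are assumptions made for this example; real projects usually need many more rules than this.

```python
# Hypothetical raw rows, as they might arrive from a CSV export (all strings).
raw_rows = [
    {"id": "1", "city": " oslo ", "spend": "120.50", "note": "vip"},
    {"id": "2", "city": "lima", "spend": "", "note": ""},
    {"id": "1", "city": " oslo ", "spend": "120.50", "note": "vip"},
    {"id": "3", "city": "Cairo", "spend": "75", "note": "new"},
]


def prepare(rows):
    """Select, clean, deduplicate, and format the rows."""
    prepared_rows = []
    seen_ids = set()
    for row in rows:
        # Cleaning: skip rows where the spend value is missing.
        if not row["spend"].strip():
            continue
        # Cleaning: skip rows whose id has already been kept.
        record_id = int(row["id"])
        if record_id in seen_ids:
            continue
        seen_ids.add(record_id)
        # Selection and formatting: keep only the needed fields,
        # normalize the city name, and convert spend to a number.
        prepared_rows.append({
            "id": record_id,
            "city": row["city"].strip().title(),
            "spend": float(row["spend"]),
        })
    return prepared_rows


prepared = prepare(raw_rows)
for row in prepared:
    print(row)
```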

4. Data Modeling

The first step in modeling is to pick the actual modeling technique. While you may already have chosen a method during the Business Understanding phase, this task refers to the concrete modeling technique, e.g. linear regression, decision trees, clustering, or a neural network trained with backpropagation. When applying several techniques, perform this task separately for each one.
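As one example of a modeling technique, the sketch below fits a simple linear regression with ordinary least squares in plain Python. The data (advertising spend versus new customers) and the `fit_line` helper are hypothetical and only meant to show the idea; in practice you would likely use a library instead.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: advertising spend and new customers gained.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
new_customers = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, new_customers)
print(f"new_customers = {intercept:.2f} + {slope:.2f} * ad_spend")
```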

5. Evaluate the results

Results are the findings from your wrangling and modeling. They are the answers you found to each of your questions.
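One common way to judge whether a model's answers are any good is to compare its predictions on held-out data against a simple baseline. The sketch below uses mean absolute error for that comparison; the numbers and the choice of metric are assumptions made for illustration, not something CRISP-DM prescribes.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out observations and two sets of predictions.
actual = [22.0, 25.0, 27.0]
model_predictions = [22.0, 24.0, 26.0]
# Baseline: always predict one constant value (e.g. the training mean).
baseline_predictions = [16.0, 16.0, 16.0]

model_mae = mean_absolute_error(actual, model_predictions)
baseline_mae = mean_absolute_error(actual, baseline_predictions)

print(f"model MAE:    {model_mae:.2f}")
print(f"baseline MAE: {baseline_mae:.2f}")
print("model beats baseline:", model_mae < baseline_mae)
```

If the model cannot beat a naive baseline, that is usually a sign to go back to an earlier step rather than move on to deployment.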

6. Deploy

Deployment can happen by moving your approach into production, or by using the findings to convince people inside an organization to act on the outcomes. Either way, the data scientist plays an important role in communication.

Written by Bibek Shah Shankhar

I post articles on Data Science | Machine Learning | Deep Learning. Connect with me on LinkedIn: https://www.linkedin.com/in/bibek-shah-shankhar/
