Introduction to the Data Science Process
The CRISP-DM Process (Cross-Industry Standard Process for Data Mining)
CRISP-DM gives you an organized way to plan a data mining project. When you get into the weeds of coding, try to take a step back and identify which part of the process you are in, and make sure you remember the question you are trying to answer and what a solution to that question looks like. The process comes in handy and is worth reviewing before you start any data science project.
There are several steps to follow for this process:
- Business Understanding
- Data Understanding
- Prepare data
- Data Modeling
- Evaluate the results
- Deploy
Each step is described in turn below.
1. Business Understanding
This means understanding the problems and questions you are interested in tackling, in the context of whatever domain you are working in. For instance:
- How do we acquire new customers?
- Does a new treatment perform better than an existing treatment?
- How can we improve communication?
- How can we improve travel?
- How can we better retain information?
2. Data Understanding
At this stage you translate the questions from business terms into data terms. You may already have data that can answer the questions, or you may need to gather data that addresses the questions you are interested in.
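A first look at a dataset often starts with a few basic counts: how many rows there are, which columns have missing values, and how many distinct values each column holds. The snippet below is a minimal sketch of that idea using only the Python standard library. The customer records and the `summarize` helper are hypothetical and exist only for illustration.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "plan": "basic", "churned": False},
    {"customer_id": 2, "age": None, "plan": "premium", "churned": True},
    {"customer_id": 3, "age": 51, "plan": None, "churned": False},
    {"customer_id": 4, "age": 29, "plan": "basic", "churned": True},
]


def summarize(rows):
    """Return the row count plus per-column missing and distinct-value counts."""
    columns = sorted({key for row in rows for key in row})
    missing = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    distinct = {
        col: len({row.get(col) for row in rows if row.get(col) is not None})
        for col in columns
    }
    return {"n_rows": len(rows), "missing": missing, "distinct": distinct}


summary = summarize(records)
print(summary)
```

A summary like this helps you judge whether the data you have can answer the business question or whether more data needs to be gathered.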
3. Prepare data
This is generally the most time-consuming phase. It covers data selection, data cleaning, data construction, data integration, and data formatting. In this step, you wrangle the data into a form that lets you answer your questions. The wrangling and cleaning process is often said to take about 80% of the time in a data analysis project.
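The cleaning tasks listed above can be sketched on a small example. This is one possible approach under stated assumptions, not a recommended recipe: the rows, the column names, the `prepare` helper, and the choices made (keep the first of each duplicate, fill missing ages with the median, label missing plans as "unknown") are all hypothetical.

```python
from statistics import median

# Hypothetical raw rows with a duplicate, inconsistent text, and missing values.
raw_rows = [
    {"customer_id": 1, "age": 34, "plan": " Basic "},
    {"customer_id": 2, "age": None, "plan": "premium"},
    {"customer_id": 2, "age": None, "plan": "premium"},  # duplicate record
    {"customer_id": 3, "age": 51, "plan": "BASIC"},
    {"customer_id": 4, "age": 29, "plan": None},
]


def prepare(rows):
    """Deduplicate, normalize text, and fill missing values."""
    # Cleaning: keep only the first row seen for each customer_id.
    seen = set()
    unique = []
    for row in rows:
        if row["customer_id"] not in seen:
            seen.add(row["customer_id"])
            unique.append(dict(row))  # copy so the raw data stays untouched

    # Formatting: strip whitespace and lowercase the plan name.
    for row in unique:
        plan = row["plan"]
        row["plan"] = plan.strip().lower() if plan is not None else "unknown"

    # Cleaning: fill missing ages with the median of the known ages.
    known_ages = [row["age"] for row in unique if row["age"] is not None]
    fill_age = median(known_ages)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill_age

    return unique


clean_rows = prepare(raw_rows)
for row in clean_rows:
    print(row)
```

Whether to fill, drop, or flag missing values depends on the question being asked, so each of these choices should be revisited for a real dataset.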
4. Data Modeling
The first step in modeling is to pick the actual modeling technique. While you may already have chosen a general approach during Business Understanding, this task refers to the specific technique, e.g. linear regression, decision trees, clustering, or a neural network trained with backpropagation. When applying several techniques, carry out this task separately for each one.
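To illustrate applying several techniques separately, here is a minimal sketch that fits two simple models to the same toy data: a one-variable linear regression (closed-form least squares) and a depth-one decision tree, also called a stump. Both are written by hand with the standard library so the example stays self-contained. The data and the function names `fit_linear` and `fit_stump` are hypothetical, and a real project would more likely use an established library.

```python
from statistics import mean

# Hypothetical toy data: x could be ad spend, y new customers.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]


def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def fit_stump(xs, ys):
    """Depth-one regression tree: one threshold, one mean per side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        # Candidate threshold halfway between neighboring x values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        if not left or not right:
            continue  # skip splits that leave one side empty
        left_mean, right_mean = mean(left), mean(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


# Run the modeling task once per technique.
slope, intercept = fit_linear(xs, ys)
threshold, left_mean, right_mean = fit_stump(xs, ys)


def predict_linear(x):
    return slope * x + intercept


def predict_stump(x):
    return left_mean if x <= threshold else right_mean


print("linear:", predict_linear(2.0), predict_linear(5.0))
print("stump: ", predict_stump(2.0), predict_stump(5.0))
```

Keeping each technique in its own function makes it easy to compare them on equal terms in the evaluation step.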
5. Evaluate the results
Results are the findings from your wrangling and modeling. They are the answers you found to each of your questions.
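When the findings come from a predictive model, one common way to state them is with an error metric computed on data the model has not seen. The sketch below compares two sets of predictions using mean absolute error and root mean squared error. The numbers and the names `model_a` and `model_b` are made up for illustration and do not come from any real experiment.

```python
from math import sqrt


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical held-out values and the predictions of two candidate models.
actual = [3.0, 5.0, 7.0]
model_a = [2.5, 5.0, 7.5]
model_b = [4.0, 4.0, 9.0]

scores = {
    "model_a": {
        "mae": mean_absolute_error(actual, model_a),
        "rmse": root_mean_squared_error(actual, model_a),
    },
    "model_b": {
        "mae": mean_absolute_error(actual, model_b),
        "rmse": root_mean_squared_error(actual, model_b),
    },
}

# Lower error is better for both metrics.
best_model = min(scores, key=lambda name: scores[name]["rmse"])
print(scores)
print("lower RMSE:", best_model)
```

A metric is only part of the answer. The result still has to be tied back to the original business question.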
6. Deploy
Deployment can mean putting your approach into production, or using the findings to convince people inside an organization to act on the outcomes. Either way, the data scientist plays an important role in communicating the results.