Bibek Shah Shankhar
8 min read · Apr 3, 2020

Customer Churn Using PySpark

Prediction for Music App.

Project Overview

Sparkify is a music app. This data set contains two months of Sparkify user activity logs. Each log entry includes some basic user information and information about the action performed, and a single user can have many entries. Some of the users in the data churned by cancelling their accounts. I uploaded the details of my analysis to a GitHub repository.

Find out the GitHub code here:

What you’ll learn

This article’s primary aim is to demonstrate the SQL interface that PySpark offers and its Machine Learning capabilities. Here, we seek to show that if you already know SQL and the scikit-learn library, you can easily expand your Machine Learning analysis to larger data sets (i.e. data sets that do not fit into a single machine’s memory).

Problem Statement

The project’s task is to identify the characteristics of churned users from users’ behavioral data, so that action can be taken as early as possible to retain users who are likely to leave. The difficulty is that the data handed to the training model must contain one row per user.

Since our data is behavioral data, we need to derive per-user characteristics from it and then hand those to the training model. We generate these features through exploratory data analysis and feature engineering. If the results are poor, the process may need to be repeated several times until the model performs well.

Let’s launch a Spark session and load the data in a Jupyter notebook workspace.

Below is the code I used to build a local Spark session.

I used the following code to load the data from a JSON file using Spark.

Data Pre-processing

Behavioral data differs from the final training data, so we need to clean out the values we do not want. Rows where userId or sessionId is null make analysis difficult, so we delete them. The userId column also contains empty strings, which are probably actions of unregistered visitors to the site. These users are not relevant to this churn study, so we omit them from the data set. These are the lines I used to drop null values and empty strings in userId.

Exploratory Data Analysis

As we explore the data, we see that every event type is recorded in the ‘page’ attribute, which makes it one of the most important features for our analysis. We can say that a user has churned when they visit the “Cancellation Confirmation” page. So we define our target variable using the “Cancellation Confirmation” event: 1 if the user churned, 0 otherwise.

Let’s look at the page column, which contains the “Cancellation Confirmation” value from which we can derive our target variable, i.e. churn or not.

We’ve looked at the distribution of each feature, and most are skewed. Our target variable (churn) is imbalanced, as is the case with most real-world data sets.

Feature Engineering

Now that we’re familiar with the data, we will create the features we’ve found promising and train our model on them. There are several other variables in the data that you could also try in order to improve performance.

As we have seen, most variables are skewed and unbalanced with respect to the target class. We came up with the following features for our model.

  • Add to Playlist
  • Listening time (using length and NextSong)
  • Added more friends
  • Users who used the Thumbs Up or Thumbs Down option
  • Requested help
  • Paid or free user
  • Downgraded, and gender
  • Average number of sessions logged

Below is a diagram of features we took into account.

We derived the features we believe are the strongest predictors of the target, i.e. churn, and will use them to train and test our model.

Now let’s build a new table with all the features we found above.

I joined all of the tables into a single new table to use for the final model preparation.

Let’s check the correlation between all the selected columns before training the model:

Correlation between all the selected columns

Modelling

Over my many training runs, I found that training was substantially faster after restarting the kernel and re-reading the saved results. It was also important to assemble all the features into vectors before training the model.

When placing the data into vectors, I noticed that many of my features were string types, so I had to convert them to float. Standardization does not affect the tree models, but it is necessary for linear models, so I decided to standardize the data.

I used both linear and tree-based methods: logistic regression, Random Forest and the GBT classifier. To pick the best-performing model, I used 3-fold cross-validation and grid search on the training set to choose the model parameters. Finally, I used the held-out set to evaluate the model’s performance.

We split the feature and target data set into training and test sets, built a pipeline, and fit the different machine learning models. Since the churned users are a relatively small subset, we used the F1 score as the metric for optimization, and we found that the GBT classifier was the best model compared to the others.

Splitting and training the models using pipeline

Result

We split the feature and target data set into training and test sets, then built a pipeline, and finally fit 3 machine learning models. I used the functions below to find the F1 score and the accuracy of each model.

Attached in the image are the accuracy and F1 score. This is the code used on the test data set to find the accuracy.

Accuracy of each model on Test data-set.

Logistic Regression Classifier Accuracy:0.8430953322074095
Random Forest Classifier Accuracy:0.9638506593840958
GBT-Classifier Classifier Accuracy:0.9915196377879191

The code used for finding the f-1 score on test data-set.

F1 score of each model on Test data-set:

Logistic Regression Classifier F1-Score:0.7765470592259909
Random Forest Classifier F1-Score:0.9619697735890758
GBTClassifier Classifier F1-Score:0.9914322037770119

Since the churned users are a relatively small subset, we used the F1 score as the optimizing metric, and by that metric GBTClassifier was a better fit than the other models.
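To see why accuracy alone is misleading when churners are rare, consider a small worked example in plain Python. The 10/90 split is made up for illustration: a model that never predicts churn looks accurate but has an F1 of zero for the churn class.

```python
def binary_f1(labels, predictions, positive=1):
    """F1 score of the positive class, computed from raw counts."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up population: 10 churned users out of 100.
labels = [1] * 10 + [0] * 90
# A useless model that predicts "no churn" for everyone.
never_churn = [0] * 100

accuracy = sum(1 for y, p in zip(labels, never_churn) if y == p) / len(labels)
f1 = binary_f1(labels, never_churn)

print("Accuracy:", accuracy)  # 0.9
print("F1 (churn class):", f1)  # 0.0
```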

Conclusion

Summary

Last Steps

Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a GitHub repository, and a web app or blog post.

Description

In this notebook I built a model for predicting customer churn. In the data cleaning step, I removed rows with no userId and converted gender to a binary numeric column. We designed 10 features for our model. We selected 3 models to predict churn: logistic regression, Random Forest and GBTClassifier, and we chose GBTClassifier as the final model. VectorAssembler, a transformer that combines a list of columns into a single vector column, was used to prepare the data and pass it to the model. Since the churned users are a fairly small subset, we optimized for the F1 score.

Reflections on this project

This project gave me exposure to the Spark environment for analyzing a volume of data too large for a personal laptop. By identifying customers at high risk of churning before they actually leave, businesses can retain them at low cost through targeted messages and offers. Brainstorming the features we can derive from the data on hand was one of the most interesting and challenging parts of the project. Developing useful features is crucial to building a successful model and takes a great deal of energy and effort. Explanatory and exploratory data analysis play an important role in this process. The best part is that I got to use my SQL skills.

To prevent churn, the following steps are important.

Once the prediction model has identified who is likely to churn, we need to act to stop them from leaving, for example by running promotional offers aimed at those users. We should also find the root causes that make users dislike the music app and address them, e.g. with improvements to the UI, subscription rates, etc.

Improvements

By considering more variables and bringing in more domain knowledge and expertise, we can improve the features a lot. Although a larger data volume requires tools like Spark for analysis, more data should give better results as the user base grows.

We have around 223106 unique user accounts, and only 80 percent of them are used for training. The model therefore has considerable potential to improve as the sample size increases, and its expected performance should increase as well.

Since the churned users are a relatively small subset, we use the F1 score as the optimization metric.

I have uploaded my analysis details to GitHub repository.

Check out the code on GitHub: here

Written by Bibek Shah Shankhar

I post articles on Data Science | Machine Learning | Deep Learning. Connect with me on LinkedIn: https://www.linkedin.com/in/bibek-shah-shankhar/
