Customer Churn Prediction for a Music App Using PySpark

Project Overview

Find the code on GitHub here:

What you’ll learn

Problem Statement

Since our data is behavioral, we need to derive user characteristics from it and then pass them to a model for training. These features come out of exploratory data analysis and feature engineering. If the results are poor, the process may iterate many times until the model performs well.

Let’s launch a Spark session and load the data in a Jupyter notebook.

Below is the code I used to build a local Spark session.

I used the following code to load the data from a JSON file with Spark.
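The original code screenshots are not part of this text, so here is a minimal sketch of what the two steps might look like. The app name, the `local[*]` master setting, and the file path in the usage note are assumptions for illustration, not the values used in the project.

```python
def create_local_session(app_name="churn-prediction"):
    """Build (or reuse) a Spark session that runs on the local machine."""
    # Imported inside the function so this module loads even where
    # PySpark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # assumption: use all local cores
        .appName(app_name)   # assumption: hypothetical app name
        .getOrCreate()
    )


def load_events(spark, path):
    """Read a JSON event log (one JSON object per line) into a DataFrame."""
    return spark.read.json(path)
```

In a notebook this could be used as `spark = create_local_session()` followed by `events = load_events(spark, "path/to/events.json")`, where the path is a placeholder. Calling `events.printSchema()` afterwards is a quick way to confirm which columns were inferred.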

Data Pre-processing

Exploratory Data Analysis

Let’s look at the page column, which contains the value “Cancellation Confirmation”. From it we can derive our target variable: did the user churn or not?

We’ve looked at the distribution of each feature, and most of them are skewed. Our target variable (churn) is imbalanced, as is the case with most real-world data sets.
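As a rough illustration of the labeling rule, the sketch below marks a user as churned if any of their events lands on the cancellation page. It uses plain Python dictionaries so it runs without a cluster; the field names `userId` and `page`, the exact spelling of the page value, and the sample events are assumptions.

```python
CHURN_PAGE = "Cancellation Confirmation"  # assumed spelling of the page value


def churned_users(events):
    """Return the set of user ids that ever reached the cancellation page."""
    return {e["userId"] for e in events if e["page"] == CHURN_PAGE}


def label_users(events):
    """Map every user id to 1 (churned) or 0 (stayed)."""
    churned = churned_users(events)
    return {e["userId"]: int(e["userId"] in churned) for e in events}


# Made-up events for illustration only.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]
labels = label_users(events)
print(labels)  # {'u1': 1, 'u2': 0}
```

On a Spark DataFrame the same idea is usually expressed as a 0/1 flag on the page column followed by a per-user maximum, so that every row of a churned user carries the label.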

Feature Engineering

As we have seen, most variables are skewed and imbalanced with respect to the target class. We came up with the following features for our model.

  • Add to Playlist
  • Usage time (using length and NextSong)
  • Added more friends
  • Users who used the Thumbs Up or Thumbs Down option
  • Requested help
  • Paid or free user
  • Downgrade and gender
  • Average number of sessions logged
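To make the list above concrete, here is a small sketch that aggregates raw events into one feature row per user. It covers only some of the features and simplifies the session feature to a plain count of distinct sessions. The page values, field names (`userId`, `page`, `length`, `level`, `sessionId`), and output column names are assumptions.

```python
from collections import defaultdict

# Assumed page values mapped to hypothetical feature names.
COUNTED_PAGES = {
    "Add to Playlist": "playlist_adds",
    "Add Friend": "friends_added",
    "Thumbs Up": "thumbs_up",
    "Thumbs Down": "thumbs_down",
    "Help": "help_requests",
}


def _empty_row():
    """Start every user with zeroed counters."""
    row = {name: 0 for name in COUNTED_PAGES.values()}
    row.update({"listen_seconds": 0.0, "paid": 0, "session_count": 0})
    return row


def user_features(events):
    """Aggregate raw events into one feature dictionary per user."""
    rows = defaultdict(_empty_row)
    sessions = defaultdict(set)
    for e in events:
        row = rows[e["userId"]]
        page = e["page"]
        if page in COUNTED_PAGES:
            row[COUNTED_PAGES[page]] += 1
        if page == "NextSong":
            # Song length in seconds; missing values count as zero.
            row["listen_seconds"] += e.get("length") or 0.0
        if e.get("level") == "paid":
            row["paid"] = 1  # the user was a paid user at least once
        sessions[e["userId"]].add(e.get("sessionId"))
    for user, ids in sessions.items():
        rows[user]["session_count"] = len(ids)
    return dict(rows)


# Made-up events for illustration only.
events = [
    {"userId": "u1", "page": "NextSong", "length": 200.0, "level": "free", "sessionId": 1},
    {"userId": "u1", "page": "Add to Playlist", "level": "free", "sessionId": 1},
    {"userId": "u1", "page": "Thumbs Up", "level": "paid", "sessionId": 2},
    {"userId": "u2", "page": "Help", "level": "free", "sessionId": 7},
]
features = user_features(events)
```

Keeping the page-to-feature mapping in one dictionary makes it easy to add or drop a counted action without touching the aggregation loop.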

Below is a diagram of features we took into account.

We derived the features we believe are the strongest predictors of the target, i.e. churn, and will use them to train and test our model.

Now, with all the features we found above, let’s build a new table.

I joined all of the tables into a single new table to use for the final model preparation.

Let’s check the correlation between all the selected columns before training the model:

Correlation between all the selected columns
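For readers without the plot, the pairwise check can be sketched in plain Python with the Pearson coefficient. The column names and values below are made up, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up feature columns for illustration only.
columns = {
    "thumbs_up": [1, 2, 3, 4],
    "thumbs_down": [2, 4, 6, 8],
    "help_requests": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
```

Pairs with a coefficient close to 1 or -1 carry nearly the same information, so one of the two columns is a candidate for removal before training.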


When assembling the data into vectors, I noticed that many of my columns were string types, so I had to convert them to float. Standardization does not affect tree models, but it is necessary for linear models, so I decided to standardize the data.

I used both linear and tree-based methods: logistic regression, random forest, and the GBT classifier. I pick the best-performing model, tune its parameters on the training set with a grid search and 3-fold cross-validation, and finally use the held-out set to test the model's performance.

We split the feature and target data set into training and test sets, built a pipeline, and implemented several machine learning models. Since the churned users are a relatively small subset, we used the F1 score as the optimization metric, and we found the GBT classifier to be the best model compared with the others.
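The two preprocessing steps in this paragraph, casting strings to floats and standardizing, can be sketched in plain Python as follows. This is only an illustration of the arithmetic; in Spark the equivalent would typically be a column cast plus the `StandardScaler` transformer from `pyspark.ml.feature`. The sample values are made up.

```python
from statistics import mean, stdev


def to_float(values):
    """Convert string-typed values such as "3" into floats."""
    return [float(v) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, needs at least 2 values
    return [(v - mu) / sigma for v in values]


raw = ["1", "2", "3"]  # made-up string column
scaled = standardize(to_float(raw))
```

Tree models split on thresholds, so rescaling a feature does not change which splits are available. Linear models such as logistic regression are sensitive to feature scale, which is why the scaling step matters there.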

Splitting and training the models using pipeline
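A pipeline with 3-fold cross-validation and a grid search might be wired up roughly as below for the GBT classifier. This is a sketch, not the project's code: the feature column names, the label column name, the grid values, and the seed are all assumptions.

```python
# Hypothetical column names; the article does not list the final schema.
FEATURE_COLUMNS = ["playlist_adds", "thumbs_up", "thumbs_down", "help_requests", "paid"]
LABEL_COLUMN = "churn"
# Small illustrative grid, not the values tuned in the project.
GBT_GRID = {"maxDepth": [3, 5], "maxIter": [10, 20]}


def build_cross_validator(feature_columns=FEATURE_COLUMNS, label_column=LABEL_COLUMN):
    """Assemble, scale, and tune a GBT classifier with 3-fold cross-validation.

    Requires an active Spark session when called.
    """
    # PySpark is imported lazily so this module can be loaded without it.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=list(feature_columns), outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                            withMean=True, withStd=True)
    gbt = GBTClassifier(labelCol=label_column, featuresCol="features", seed=42)
    pipeline = Pipeline(stages=[assembler, scaler, gbt])

    grid = (ParamGridBuilder()
            .addGrid(gbt.maxDepth, GBT_GRID["maxDepth"])
            .addGrid(gbt.maxIter, GBT_GRID["maxIter"])
            .build())
    # F1 is the selection metric because churned users are the minority class.
    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_column, predictionCol="prediction", metricName="f1")
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=3, seed=42)


# Possible usage, assuming a DataFrame named features_df with the columns above:
#   train, test = features_df.randomSplit([0.8, 0.2], seed=42)
#   model = build_cross_validator().fit(train)
#   predictions = model.transform(test)
```

Putting the assembler and scaler inside the pipeline means they are refit on each training fold, so the validation fold never leaks into the scaling statistics. Swapping `GBTClassifier` for `LogisticRegression` or `RandomForestClassifier` would follow the same pattern with a different grid.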

Result

We split the feature and target data set into training and test sets, then built a pipeline, and finally implemented three machine learning models. I used the functions below to find the F1 score and model accuracy.

The image shows the accuracy and F1 score. This is the code used on the test data set to find the accuracy.
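The original helper functions are not reproduced in this text. As a stand-in, here is a plain-Python sketch of both metrics, together with a made-up example that shows why accuracy alone is misleading when churned users are rare. Note that it computes F1 for the positive (churn) class only, whereas Spark's `MulticlassClassificationEvaluator` with `metricName="f1"` reports a class-weighted F1, so the two numbers are not directly comparable.

```python
def accuracy(labels, predictions):
    """Share of predictions that match the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def f1_score(labels, predictions, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up data: 10 churned users out of 100, and a model that never predicts churn.
labels = [1] * 10 + [0] * 90
always_stay = [0] * 100
print(accuracy(labels, always_stay))  # 0.9
print(f1_score(labels, always_stay))  # 0.0
```

The useless model scores 90% accuracy but an F1 of zero, which is the reason F1 is the better optimization target for an imbalanced churn label.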

Accuracy of each model on the test data set:

Logistic Regression Classifier Accuracy: 0.8430953322074095
Random Forest Classifier Accuracy: 0.9638506593840958
GBT Classifier Accuracy: 0.9915196377879191

The code used to find the F1 score on the test data set.

F1 score of each model on the test data set:

Logistic Regression Classifier F1 Score: 0.7765470592259909
Random Forest Classifier F1 Score: 0.9619697735890758
GBT Classifier F1 Score: 0.9914322037770119

Since the churned users are a relatively small subset, we used the F1 score as the optimization metric, and on that metric the GBT classifier was a better fit than the other models.



Last Steps


Reflections on this project.

To prevent churn, it is important to follow these guidelines.


We have around 223106 unique user accounts, and only 80 percent of them are used for training. That said, the model has considerable room to improve if the sample size increases, and the expected performance should increase with it.

Since the churned users are a relatively small subset, we use the F1 score as the optimization metric.

Last measures

I have uploaded the details of my analysis to a GitHub repository.

Check out the code on GitHub: here

I post articles on Data Science | Machine Learning | Deep Learning. Connect with me on LinkedIn:
