Sparkify Big data — Udacity Capstone Project

Apr 24, 2020

Introduction

Sparkify is a fictitious music streaming service, created by Udacity to resemble the real-world datasets generated by companies such as Spotify or Pandora. Millions of users play their favorite songs through music streaming services on a daily basis, either through a free tier plan that plays advertisements or through a premium subscription, which offers additional functionality and is typically ad-free. Users can upgrade or downgrade their subscription plan at any time, or cancel it altogether, so it is very important to make sure they like the service.

Every time a user interacts with a music streaming app, whether by playing songs, adding them to playlists, rating them with a thumbs up or down, adding a friend, logging in or out, or changing settings, data is generated. User activity logs contain key insights that help businesses understand whether their users are happy with the service.

Project Overview

This project works with big data. The full dataset is 12GB; a mini subset of it can be analyzed in the provided workspace. Optionally, a Spark cluster in the cloud (AWS or IBM Cloud) can be used to analyze a larger amount of data.

The goal of the project is to predict churn. Predicting churn is a challenging and common problem that data scientists and analysts regularly encounter in any customer-facing business. Additionally, the ability to efficiently manipulate large datasets with Spark is one of the most in-demand skills in the field of data.

Learnings

Research and investigate a real-world problem of interest.
Accurately apply specific data science algorithms and techniques.
Properly analyze and visualize data and results for validity.
Document and write a report about work.

Problem Statement

The problem addressed in this project is churn prediction: given a user's activity log, predict whether that user will cancel the service. Predicting churn is one of the most common and challenging tasks in any customer-oriented business. When the problem is explored and analysed carefully, a variety of data science algorithms can be applied to solve it.

Performance Metrics

The performance of each trained model was measured on its test set using two standard metrics for binary classifiers: AUC and F1 score. The justification for why they are appropriate for this problem is provided in the sections below.
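To make the two metrics concrete, here is a minimal plain-Python sketch of how F1 and AUC can be computed for a binary churn label. The labels and scores are made-up toy values, not results from this project, and library evaluators may follow slightly different conventions (for example, averaging F1 over both classes).

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (1 = churned): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC as the probability that a random churner is scored above a
    random non-churner; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true churn labels and model scores for six users.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

toy_f1 = f1_score(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

F1 depends on the chosen decision threshold, while AUC is computed from the ranking of the scores alone, which is one reason the two are often reported together.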

Data Understanding

Two sets of Sparkify user activity data were provided to support this project: the full 12GB dataset and its smaller (128MB) subset, mini_sparkify_event_data.json. We used the latter for most of the data understanding steps; however, several queries were also run on the full dataset to confirm the observed properties.

Data Exploration and Visualization

The 128MB subset consists of 286,500 records in JSON format, each with 18 features.

Further investigation revealed the following:

1. There were 8,346 records with a blank user id, which were removed from the dataset. The remaining dataset had 278,154 rows.
2. The “page” feature takes 19 distinct values. Of these, “Cancellation Confirmation” is the page event that defines churn: a user who reaches it is labeled as churned.

3. After defining churn, we compared the behavior of users who stayed with that of users who churned across different variables. In the gender-wise analysis, male users outnumber female users in both groups.

4. From the variable “ts”, time components such as hour, day and month were derived and analyzed. The hourly plot suggests that churn is highest around 4 pm.

5. Variable auth can take values of ‘Logged In’ and ‘Cancelled’.

6. The variable level can be ‘free’ or ‘paid’ and is not missing for any user.
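The cleaning and labeling steps in this list can be sketched in plain Python on a handful of made-up records. The field name `userId`, the page values other than “Cancellation Confirmation”, and the assumption that `ts` is a Unix timestamp in milliseconds are illustrative; on the real data the same logic would be written as Spark DataFrame operations.

```python
from datetime import datetime, timezone

# Hypothetical toy events; only "page", "ts" and "Cancellation Confirmation"
# are taken from the text, everything else is illustrative.
events = [
    {"userId": "10", "page": "NextSong", "ts": 1538352000000},
    {"userId": "10", "page": "Thumbs Up", "ts": 1538355600000},
    {"userId": "", "page": "Home", "ts": 1538359200000},
    {"userId": "20", "page": "NextSong", "ts": 1538409600000},
    {"userId": "20", "page": "Cancellation Confirmation", "ts": 1538409660000},
]

# Step 1: drop records with a blank user id.
clean = [e for e in events if e["userId"] != ""]

# Step 2: a user is churned if they ever reached "Cancellation Confirmation".
churned_users = {e["userId"] for e in clean if e["page"] == "Cancellation Confirmation"}

# Step 3: attach the churn label and the hour of day (UTC) to every event,
# assuming "ts" is in milliseconds since the Unix epoch.
labeled = [
    {
        **e,
        "churn": 1 if e["userId"] in churned_users else 0,
        "hour": datetime.fromtimestamp(e["ts"] / 1000, tz=timezone.utc).hour,
    }
    for e in clean
]
```

Note that the churn flag is attached to every event of a churned user, not only to the cancellation event itself, so that earlier behavior can be compared between the two groups.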

Feature Engineering

To train the model, we selected or created the following user-level features:

Gender: the user's gender, selected as the first feature.
Songs: the number of songs played by the user.
Thumbs up: the number of thumbs up given by the user.
Thumbs down: the number of thumbs down given by the user.
Singers: the number of singers the user has listened to.

After merging all the features into a single dataset, a few of the resulting records look as follows:
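As an illustration of how these five features can be derived from the event log, here is a small plain-Python sketch on made-up records. The field names, the page values, the 0/1 encoding of gender and the choice to count distinct singers are assumptions; in the project itself this aggregation would be done with Spark.

```python
from collections import defaultdict

# Hypothetical toy events; field names and page values are illustrative.
events = [
    {"userId": "10", "gender": "F", "page": "NextSong", "artist": "Artist A"},
    {"userId": "10", "gender": "F", "page": "NextSong", "artist": "Artist B"},
    {"userId": "10", "gender": "F", "page": "Thumbs Up", "artist": None},
    {"userId": "20", "gender": "M", "page": "NextSong", "artist": "Artist A"},
    {"userId": "20", "gender": "M", "page": "Thumbs Down", "artist": None},
    {"userId": "20", "gender": "M", "page": "NextSong", "artist": "Artist A"},
]


def build_user_features(events):
    """Aggregate event-level records into one feature row per user."""
    acc = defaultdict(
        lambda: {"gender": 0, "songs": 0, "thumbs_up": 0, "thumbs_down": 0, "artists": set()}
    )
    for e in events:
        user = acc[e["userId"]]
        user["gender"] = 1 if e["gender"] == "M" else 0  # assumed encoding
        if e["page"] == "NextSong":
            user["songs"] += 1
            if e["artist"] is not None:
                user["artists"].add(e["artist"])
        elif e["page"] == "Thumbs Up":
            user["thumbs_up"] += 1
        elif e["page"] == "Thumbs Down":
            user["thumbs_down"] += 1
    # Replace the set of artists with its size: the number of distinct singers.
    return {
        uid: {
            "gender": u["gender"],
            "songs": u["songs"],
            "thumbs_up": u["thumbs_up"],
            "thumbs_down": u["thumbs_down"],
            "singers": len(u["artists"]),
        }
        for uid, u in acc.items()
    }


user_features = build_user_features(events)
```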

Modeling and Evaluation

With all the user-level features defined, calculated, and appropriately labeled, the next step is to build and train a binary classifier that can identify the patterns between the inputs and their labels.

By using Spark’s machine learning capabilities we first defined pipelines that combine: a standardization step, a feature assembler object, and one of the Spark-supported binary classifiers. For each of the pipelines, we performed a grid search with cross-validation to test the performance of several parameter combinations, all on the user-level data obtained from the smaller Sparkify user activity data-set. Based on the performance results obtained in cross-validation (measured by AUC), we identified the best-performing model instances and retrained them on the entire training set. Ultimately, we evaluated their performance on the test set, using standard binary classification metrics, such as AUC and F1 score.

Based on the results obtained on the smaller Sparkify dataset, we selected a subset of well-performing models to train and test on the full 12GB Sparkify dataset as well, using an Amazon EMR cluster. Training on the smaller dataset suffers from a small-sample problem (the dataset reduces to only 225 users, 46 of them in the test set) and can return rather unstable performance results. The full Sparkify data, which combines the logs of more than 20k users, allowed us to truly test the efficacy of the developed models.
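The effect of test-set size on stability can be illustrated with a rough back-of-the-envelope calculation. The sketch below uses the standard error of a simple proportion (such as accuracy) as a stand-in; F1 and AUC behave differently in detail, and both the metric value of 0.8 and the test size of 4,000 users (about 20% of 20k) are hypothetical.

```python
import math


def proportion_std_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)


p = 0.8          # hypothetical true value of the metric
small_n = 46     # test users in the small dataset (from the text)
large_n = 4000   # assumed test size for the full dataset

se_small = proportion_std_error(p, small_n)  # roughly 0.059
se_large = proportion_std_error(p, large_n)  # roughly 0.006
```

Under these assumptions a measured score on 46 users can easily move by several percentage points from one random split to the next, while on a few thousand users the noise is about nine times smaller.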

Refinement

We tried to optimize the logistic regression model by running a grid search over its elastic net regularization parameters: elasticNetParam, which corresponds to α, and regParam, which corresponds to λ.
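To show how the two parameters interact, here is a small sketch of the elastic net penalty in the form used by Spark's linear models, λ(α‖w‖₁ + (1 − α)/2 · ‖w‖₂²). The weight vector and parameter values are made up for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic net penalty: reg_param (lambda) sets the overall strength,
    elastic_net_param (alpha) mixes L1 (alpha = 1) and L2 (alpha = 0)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1 - elastic_net_param) / 2 * l2_squared)


weights = [0.5, -1.0, 2.0]  # hypothetical model coefficients

pure_l1 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)
pure_l2 = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)
```

Setting regParam to 0 switches regularization off entirely, in which case the value of elasticNetParam no longer matters.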

Logistic Regression Hyper Parameter Tuning

We also tuned random forest parameters such as the maximum depth, the impurity measure and the number of trees.

Results

Tuning the hyperparameters of the logistic regression model with different grid searches did not improve the metrics over the default (vanilla) model, which has an F1 score of 0.6. The tuned random forest model reached an F1 score equivalent to that of vanilla logistic regression. We therefore took the logistic regression model with default parameters as our final model.

Conclusion

To summarise, we started with a small subset of the dataset. We performed steps like data cleaning, data exploration, feature engineering, data pre-processing and then modeling to predict churn.
On reflection, this capstone project has been a great platform to learn and apply a variety of data science skills, such as exploratory data analysis, data wrangling, feature engineering, machine learning pipeline creation, model evaluation and fine-tuning.
This project comes as a good exercise to understand and solve a problem that is very common in the real-world for customer-oriented businesses.
We were able to achieve an F1 score of 0.6, which is a little low. However, this could be improved further by engineering more features from variables such as thumbs up, thumbs down and downgrades, which this exercise left largely unexplored.