Customer retention analysis through predictive modeling

Identifying the characteristics which are most responsible for the retention of customers


A comprehensive study was performed on data provided by a company that organizes school trips. The primary objective of the study was to identify the characteristics which are most responsible for the retention of customers. The secondary objective was to identify the model that best predicts the customers’ response.

Different modeling techniques, namely linear regression, logistic regression, ridge regression, LASSO regression, CART, random forest, neural network and boosting, were applied to the school trip data. The relationship between the response variable and the different parameters was established from the findings of these models.

The findings of the study suggest that satisfaction and loyalty play an important role in extending a relationship with a customer. The logistic regression model performed best in predicting the response of the customers. In addition, linear models outperformed nonlinear models on this data. Our study showed that complex models do not always provide the best predictions.


The study examines the impact of different parameters on retention using various statistical modeling and machine learning methodologies. A comparison of the statistical modeling methods (linear, logistic, ridge and LASSO regression) shows that logistic regression provides the best performance. Analyses performed with the machine learning methods (CART, random forest, boosting and neural network) show that boosting gives the best predictions among them.

In addition, comparing all methods together revealed that logistic regression, one of the simplest and oldest classification techniques, provides the greatest predictive power. This highlights that sophisticated algorithms do not always work better. Different predictive modeling techniques suit different kinds of data, and sound conclusions cannot be drawn without scrutinizing the data carefully. Logistic, ridge and LASSO regression are linear models, whereas random forest, CART, boosting and neural network are nonlinear models. The overall results show that the linear models perform about as well as or better than the nonlinear models. However, the relative performance of the models will vary with the objective of the company, i.e. whether it wants to focus on the top 20% or the top 50% of the customers.
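The point about the top 20% versus the top 50% of customers can be illustrated with a small sketch. It ranks customers by predicted retention score and reports the share of all retained customers that falls in the targeted group. The function name `capture_rate`, the two score lists and the labels are hypothetical and invented for illustration; they are not the study's data or results.

```python
from typing import Sequence


def capture_rate(scores: Sequence[float], retained: Sequence[int], top_fraction: float) -> float:
    """Share of all retained customers found in the top-scoring fraction."""
    if len(scores) != len(retained):
        raise ValueError("scores and retained must have the same length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_retained = sum(retained)
    if total_retained == 0:
        return 0.0
    # Rank customers from highest to lowest predicted retention score.
    ranked = sorted(zip(scores, retained), key=lambda pair: pair[0], reverse=True)
    # Number of customers in the targeted group (at least one).
    cutoff = max(1, round(len(ranked) * top_fraction))
    captured = sum(label for _, label in ranked[:cutoff])
    return captured / total_retained


# Hypothetical data: 1 = customer retained, 0 = customer lost.
retained = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
# Hypothetical scores from two models for the same ten customers.
model_a = [0.95, 0.90, 0.80, 0.70, 0.75, 0.40, 0.30, 0.20, 0.35, 0.10]
model_b = [0.85, 0.80, 0.90, 0.75, 0.30, 0.70, 0.25, 0.15, 0.40, 0.10]

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    top20 = capture_rate(scores, retained, 0.2)
    top50 = capture_rate(scores, retained, 0.5)
    print(f"{name}: top 20% captures {top20:.0%}, top 50% captures {top50:.0%}")
```

In this made-up example the first model captures more retained customers in the top 20%, while the second captures more in the top 50%, so the preferred model depends on how deep into the ranked list the company intends to go.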


Ensembling all the models is a useful way to obtain robust predictions for a dataset. The accuracy of the ensembled model was 82.8%, higher than that of most individual models and lower only than that of logistic regression. When the behaviour of the data is unknown and it is unclear which method will predict best, ensembling the models can yield highly accurate predictions.
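One common way to ensemble classifiers is to average their predicted probabilities and threshold the result. The sketch below assumes that scheme; the text does not state how the models were actually combined. The model names, probabilities, labels and the 0.5 threshold are hypothetical and do not reproduce the reported accuracies.

```python
from statistics import mean
from typing import Dict, List, Sequence


def ensemble_predict(model_probs: Dict[str, Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Average each customer's predicted probabilities across models, then threshold."""
    columns = list(model_probs.values())
    if not columns:
        raise ValueError("at least one model is required")
    if len({len(col) for col in columns}) != 1:
        raise ValueError("all models must score the same customers")
    # zip(*columns) groups the per-model probabilities customer by customer.
    return [1 if mean(per_customer) >= threshold else 0 for per_customer in zip(*columns)]


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of customers whose predicted label matches the observed one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predicted retention probabilities for six customers.
model_probs = {
    "logistic": [0.9, 0.2, 0.7, 0.55, 0.8, 0.3],
    "random_forest": [0.6, 0.4, 0.3, 0.6, 0.9, 0.2],
    "boosting": [0.8, 0.6, 0.6, 0.3, 0.4, 0.1],
}
# Hypothetical observed outcomes: 1 = retained, 0 = lost.
actual = [1, 0, 1, 0, 1, 0]

ensemble = ensemble_predict(model_probs)
print("ensemble accuracy:", accuracy(ensemble, actual))
for name, probs in model_probs.items():
    labels = [1 if p >= 0.5 else 0 for p in probs]
    print(name, "accuracy:", accuracy(labels, actual))
```

Averaging lets the errors of individual models partly cancel, which is why an ensemble is a reasonable default when no single method is known in advance to suit the data.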

The analyses determined the most important parameters that affect the response of the customers. The findings confirm that customer satisfaction plays a prominent role in retention. The "NPS score" and "Travel with us next year" parameters, which describe the satisfaction level of the teachers, emerged as the most important factors. Whether the customer is Existing or New also influences retention. This parameter reflects the loyalty of a customer: a loyal customer is more likely to extend their relationship with the firm.

This work compared several statistical and machine learning techniques. Further predictive modeling techniques, such as k-nearest neighbours, elastic net, nonlinear decision trees and support vector machines, can be implemented in the future to study the behaviour of the data and to obtain more robust predictions.