A Comparative Study of Statistical and Machine Learning Techniques
for Predicting Customers' Shopping Behavior
Alaa A Elnazer1*, Fawzia Abdu Alsalam Al Tboli2, Gehad Elgebaly3 and
Mahjoub A Elamin4
1Department of Marketing, College of Business, Imam Mohammad Ibn Saud Islamic
University (IMSIU), Riyadh 11432, Saudi Arabia
2Department of Statistics, Faculty of Science, University of Benghazi, Benghazi,
Libya
3Department of Economics, Faculty of Business Administration, Delta University for
Science and Technology, Gamasa, Egypt
4Department of Mathematics, University College of Umluj, University of Tabuk,
Saudi Arabia
*Corresponding Author: Alaa A Elnazer, Department of Marketing, College of
Business, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432,
Saudi Arabia.
Received: January 27, 2026; Published: May 08, 2026
Abstract
This study develops a comprehensive predictive framework by systematically comparing five classification models—Logistic
Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Artificial Neural Networks (ANN), and Extreme Gradient Boosting
(XGBoost)—using the Online Shoppers’ Purchasing Intention dataset. A diverse set of performance metrics, including Accuracy, Root
Mean Squared Error (RMSE), Mean Absolute Error (MAE), R², Correlation Coefficient (CC), Coefficient of Variation (COV), and Error
Coefficient (EC), was employed to evaluate and benchmark the models. Descriptive statistics and correlation analysis provided a
foundational understanding of the behavioral attributes shaping purchasing outcomes, while inferential analyses, including ANOVA
and the Wilcoxon Signed-Rank Test, confirmed statistically significant differences among models and validated the robustness of
the comparative framework. The findings indicate that Random Forest performed best on most evaluation measures, achieving the
lowest RMSE, the highest correlation with the actual outcomes, and the most stable performance. Although the Artificial Neural
Network reached comparable accuracy, Random Forest was more consistent and produced fewer predictive errors, underscoring its
suitability for modeling customer behavior in complex, nonlinear settings. The results demonstrate the value of ensemble techniques
for e-commerce prediction and suggest that hybrid methods could further improve accuracy and generalization. The study makes both
methodological and practical contributions: it provides a rigorous benchmark of the classification algorithms and offers actionable
insights for online retailers seeking to optimize decision making, customer satisfaction, and long-term customer loyalty.
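For readers who want to reproduce the comparison, the sketch below shows one way to implement the benchmarking workflow described above in Python with scikit-learn, XGBoost, and SciPy: the five classifiers are trained under 10-fold stratified cross-validation on the Online Shoppers Purchasing Intention dataset (available from the UCI Machine Learning Repository), the reported metrics are computed per fold, and fold-wise accuracies feed the ANOVA and Wilcoxon Signed-Rank tests. It is a minimal illustration rather than the authors' code; the file name `online_shoppers_intention.csv`, all hyperparameters, and the COV computation are assumptions (the paper's EC metric is omitted here because its definition is not stated in the abstract).

```python
# Minimal sketch (not the authors' code) of the benchmarking workflow described in
# the abstract; file name, hyperparameters, and the COV definition are assumptions.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, pearsonr, wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Online Shoppers Purchasing Intention data; "Revenue" marks a completed purchase.
df = pd.read_csv("online_shoppers_intention.csv")          # assumed local file name
y = df.pop("Revenue").astype(int)
X = pd.get_dummies(df, drop_first=True).astype(float).to_numpy()

models = {
    "LR":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "RF":  RandomForestClassifier(n_estimators=300, random_state=0),
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                                       random_state=0)),
    "XGB": XGBClassifier(n_estimators=300, learning_rate=0.1,
                         eval_metric="logloss", random_state=0),
}

def evaluate(y_true, y_pred):
    """Accuracy plus the error/agreement metrics named in the abstract.
    COV is taken as RMSE over the mean observed value (an assumed definition)."""
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return {
        "Accuracy": float(np.mean(y_true == y_pred)),
        "RMSE": rmse,
        "MAE": float(np.mean(np.abs(err))),
        "R2": 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        "CC": pearsonr(y_true, y_pred)[0],
        "COV": rmse / y_true.mean(),
    }

# 10-fold stratified cross-validation; metrics are collected per fold so the ANOVA
# and Wilcoxon Signed-Rank tests can compare models on paired fold scores.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_metrics = {name: [] for name in models}
for train_idx, test_idx in cv.split(X, y):
    for name, model in models.items():
        model.fit(X[train_idx], y.iloc[train_idx])
        pred = model.predict(X[test_idx])
        fold_metrics[name].append(evaluate(y.iloc[test_idx].to_numpy(), pred))

acc = {name: [m["Accuracy"] for m in folds] for name, folds in fold_metrics.items()}
for name, folds in fold_metrics.items():
    print(f"{name}: accuracy {np.mean(acc[name]):.4f}, "
          f"RMSE {np.mean([m['RMSE'] for m in folds]):.4f}")

print("One-way ANOVA over fold accuracies:", f_oneway(*acc.values()))
print("Wilcoxon Signed-Rank, RF vs ANN:", wilcoxon(acc["RF"], acc["ANN"]))
```

In this setup the pairwise Wilcoxon test is shown only for the two strongest models in the paper's comparison (Random Forest and the neural network); extending it to every model pair, with a multiple-comparison correction, would follow the same pattern.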