International Journal of Business & Management Studies

ISSN 2694-1430 (Print), ISSN 2694-1449 (Online)
DOI: 10.56734/ijbms
Employee Attrition Prediction Under Data Constraints: A Comparative Analysis Of Real And Synthetic Datasets

Abstract


Employee attrition remains a persistent challenge for organisations, with significant implications for human capital sustainability and overall strategic organisational performance. However, existing empirical research in this field often relies on limited real-world datasets to predict employee turnover and explicitly assumes that insights from real employee data are inherently superior, despite increasing constraints on data access, privacy, and ethical governance. Consequently, there is still limited understanding of whether synthetic data can be a reliable alternative for predictive modelling in human resources.

To address this gap, this study examines whether employee attrition predictions differ when models are trained on real versus synthetically generated data. Using the International Business Machines (IBM) employee dataset, the original sample of 1,470 observations is expanded through bootstrapping and synthetic generation to create comparable datasets of 5,000 observations each. Binary logistic regression, random forest, and gradient boosting tree models are then applied to the real, bootstrapped, and synthetic datasets, and their performance is systematically compared to assess predictive reliability.

The results indicate that synthetic data preserves key attrition-related relationships and yields predictive performance comparable to real data, although minor reductions are observed in identifying rare attrition cases. Factors affecting attrition, such as job satisfaction and age, remain consistent across both real and synthetic datasets. Furthermore, statistical analyses reveal no significant differences in predictive accuracy among the models. The study contributes to employee attrition and HR analytics research by demonstrating that attrition knowledge is not solely dependent on access to real employee data. It further offers practical insights for organisations seeking to leverage privacy-preserving analytics to support workforce planning and retention strategies under data constraints.
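
The bootstrapping step described in the abstract, which expands 1,470 observations to 5,000, can be illustrated with a minimal sketch. It assumes simple row-level resampling with replacement, which may differ from the exact procedure used in the study. The function name `bootstrap_expand`, the record fields, and the placeholder values are hypothetical and are not taken from the IBM dataset.

```python
import random


def bootstrap_expand(rows, target_size, seed=0):
    """Draw target_size rows uniformly at random, with replacement."""
    if not rows:
        raise ValueError("rows must be non-empty")
    rng = random.Random(seed)  # fixed seed so the resample is reproducible
    return rng.choices(rows, k=target_size)


# Placeholder records standing in for the 1,470 original observations.
# Field names and values are hypothetical, not real employee data.
original = [{"employee_id": i, "attrition": i % 6 == 0} for i in range(1470)]

expanded = bootstrap_expand(original, target_size=5000, seed=42)

# Compare the attrition share before and after resampling.
rate_original = sum(r["attrition"] for r in original) / len(original)
rate_expanded = sum(r["attrition"] for r in expanded) / len(expanded)
print(len(expanded), round(rate_original, 3), round(rate_expanded, 3))
```

Because every bootstrapped row is a copy of an original row, the expanded sample adds no new information. It only approximates the original distribution at a larger size, which is why comparing it against independently generated synthetic data is informative.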