Employee attrition remains a persistent challenge for organisations, with
significant implications for human capital sustainability and overall strategic
organisational
performance. However, existing empirical research in this field often relies on
limited real-world datasets to predict employee turnover and explicitly assumes
that insights from real employee datasets are inherently superior, despite
increasing constraints on data access, privacy, and ethical governance.
Consequently, there is still limited understanding of
whether synthetic data can be a reliable alternative for predictive modeling in
human resources.
To address this gap, this study examines whether employee attrition
predictions differ when models are trained on real versus synthetically
generated data. Using the International Business Machines (IBM) employee
dataset, the research applies binary logistic regression, random forest, and
gradient boosting models to real, bootstrapped, and synthetic datasets to
assess predictive reliability.
The original sample of 1,470 observations is expanded through bootstrapping and
synthetic generation to create comparable datasets of 5,000 observations each.
Model performance is then compared systematically across the three data
types.
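The bootstrapping step can be sketched as follows. This is a minimal illustration that assumes simple row-level resampling with replacement; the function name `bootstrap_sample`, the seed, and the placeholder records (including their attrition rate) are hypothetical and are not taken from the study.

```python
import random

def bootstrap_sample(rows, n_out, seed=0):
    """Draw n_out rows from `rows` uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rows[rng.randrange(len(rows))] for _ in range(n_out)]

# Hypothetical stand-in for the 1,470 original records: (employee_id, left_company).
# The attrition flag pattern is an illustrative assumption, not a reported figure.
original = [(i, 1 if i % 6 == 0 else 0) for i in range(1470)]

# Expand the original sample to 5,000 observations by resampling.
boot = bootstrap_sample(original, n_out=5000, seed=42)

rate_original = sum(y for _, y in original) / len(original)
rate_boot = sum(y for _, y in boot) / len(boot)
print(f"original:  n={len(original)}, attrition rate={rate_original:.3f}")
print(f"bootstrap: n={len(boot)}, attrition rate={rate_boot:.3f}")
```

Because every bootstrapped row is a copy of an original row, the expanded dataset contains no new information; it only approximates the sampling variability of the original 1,470 observations, which is what distinguishes it from the synthetically generated dataset.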
The results indicate that synthetic data preserves key
attrition-related relationships and yields predictive performance comparable to
real data, although minor reductions are observed in identifying rare attrition
cases. Factors affecting attrition, such as job satisfaction and age, remain
consistent across both real and synthetic datasets. Furthermore, statistical
analyses reveal no significant differences in predictive accuracy among the
models. The study contributes to employee attrition and HR analytics research
by demonstrating that attrition knowledge is not solely dependent on access to
real employee data. It further offers practical insights for organisations
seeking to leverage privacy-preserving analytics to support workforce planning
and retention strategies under data constraints.