Capstone project: Data-driven suggestions for HR¶

Description and deliverables¶

This is an example project in which I analyze a dataset and build predictive models to provide insights to the Human Resources (HR) department of a large consulting firm.

Business scenario and problem¶

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, the task is to provide data-driven suggestions based on the understanding of the data. The client has the following question: what’s likely to make the employee leave the company?

Your goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

My goal is to predict employees likely to quit, and possibly identify factors that contribute to their leaving.

Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company.

Familiarize yourself with the HR dataset¶

The dataset that you'll be using in this lab contains 15,000 rows and 10 columns for the variables listed below.

Note: The source of the data is Kaggle.

Variable Description
satisfaction_level Employee-reported job satisfaction level [0–1]
last_evaluation Score of employee's last performance review [0–1]
number_project Number of projects employee contributes to
average_monthly_hours Average number of hours employee worked per month
time_spend_company How long the employee has been with the company (years)
Work_accident Whether or not the employee experienced an accident while at work
left Whether or not the employee left the company
promotion_last_5years Whether or not the employee was promoted in the last 5 years
Department The employee's department
salary The employee's salary (U.S. dollars)

Import packages¶

In [48]:
# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pickle

Load dataset¶

In [49]:
# Load dataset into a dataframe
df = pd.read_csv("HR_capstone_dataset.csv")

# Display first few rows of the dataframe
df.head()
Out[49]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

Step 2. Data Exploration (Initial EDA and data cleaning)¶

  • Understand your variables
  • Clean your dataset (missing data, redundant data, outliers)

Gather basic information about the data¶

In [50]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

Gather descriptive statistics about the data¶

In [51]:
# Gather descriptive statistics about the data
df.describe()
Out[51]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

Rename columns¶

In [52]:
# Display all column names
df.columns = [item.lower().replace(' ','_') for item in list( df.columns ) ]
df.head()
Out[52]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company work_accident left promotion_last_5years department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
In [53]:
# Check for missing values
df.isna().sum()
Out[53]:
satisfaction_level       0
last_evaluation          0
number_project           0
average_montly_hours     0
time_spend_company       0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64
In [54]:
# Check for missing values
df_orig = df.copy() # original data to refer to later
df = pd.get_dummies(df, drop_first=True) # Create dummy variables, drop the first category

Check duplicates¶

In [55]:
# Check for duplicates
dups = df[ (df.duplicated(keep='first')) ] # Get duplicate rows. Keep first occurrence
dups.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3008 entries, 396 to 14998
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   satisfaction_level      3008 non-null   float64
 1   last_evaluation         3008 non-null   float64
 2   number_project          3008 non-null   int64  
 3   average_montly_hours    3008 non-null   int64  
 4   time_spend_company      3008 non-null   int64  
 5   work_accident           3008 non-null   int64  
 6   left                    3008 non-null   int64  
 7   promotion_last_5years   3008 non-null   int64  
 8   department_RandD        3008 non-null   uint8  
 9   department_accounting   3008 non-null   uint8  
 10  department_hr           3008 non-null   uint8  
 11  department_management   3008 non-null   uint8  
 12  department_marketing    3008 non-null   uint8  
 13  department_product_mng  3008 non-null   uint8  
 14  department_sales        3008 non-null   uint8  
 15  department_support      3008 non-null   uint8  
 16  department_technical    3008 non-null   uint8  
 17  salary_low              3008 non-null   uint8  
 18  salary_medium           3008 non-null   uint8  
dtypes: float64(2), int64(6), uint8(11)
memory usage: 243.8 KB
In [56]:
# Inspect some rows containing duplicates
dups
Out[56]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company work_accident left promotion_last_5years department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium
396 0.46 0.57 2 139 3 0 1 0 0 0 0 0 0 0 1 0 0 1 0
866 0.41 0.46 2 128 3 0 1 0 0 1 0 0 0 0 0 0 0 1 0
1317 0.37 0.51 2 127 3 0 1 0 0 0 0 0 0 0 1 0 0 0 1
1368 0.41 0.52 2 132 3 0 1 0 1 0 0 0 0 0 0 0 0 1 0
1461 0.42 0.53 2 142 3 0 1 0 0 0 0 0 0 0 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14994 0.40 0.57 2 151 3 0 1 0 0 0 0 0 0 0 0 1 0 1 0
14995 0.37 0.48 2 160 3 0 1 0 0 0 0 0 0 0 0 1 0 1 0
14996 0.37 0.53 2 143 3 0 1 0 0 0 0 0 0 0 0 1 0 1 0
14997 0.11 0.96 6 280 4 0 1 0 0 0 0 0 0 0 0 1 0 1 0
14998 0.37 0.52 2 158 3 0 1 0 0 0 0 0 0 0 0 1 0 1 0

3008 rows × 19 columns

In [57]:
# Drop duplicates and save resulting dataframe in a new variable as needed
df = df[ ~(df.duplicated(keep='first')) ] 
df.shape

# Display first few rows of new dataframe as needed
df.describe()
Out[57]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company work_accident left promotion_last_5years department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical salary_low salary_medium
count 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.000000 11991.00000 11991.000000 11991.000000
mean 0.629658 0.716683 3.802852 200.473522 3.364857 0.154282 0.166041 0.016929 0.057877 0.051789 0.050121 0.036361 0.056125 0.057210 0.270119 0.151864 0.18714 0.478692 0.438746
std 0.241070 0.168343 1.163238 48.727813 1.330240 0.361234 0.372133 0.129012 0.233520 0.221610 0.218204 0.187194 0.230173 0.232252 0.444040 0.358904 0.39004 0.499567 0.496254
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 0.480000 0.570000 3.000000 157.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
50% 0.660000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
75% 0.820000 0.860000 5.000000 243.000000 4.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.00000 1.000000 1.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000

Check outliers¶

Check for outliers in the data.

In [58]:
# Create a boxplot to visualize distribution of employee `tenure` and detect any outliers
sns.boxplot( x=df.time_spend_company )
#sns.histplot( x=df.time_spend_company)
Out[58]:
<AxesSubplot:xlabel='time_spend_company'>
In [59]:
# Determine the number of rows containing outliers
o = df[df.time_spend_company>5]
o.shape

# 19 employees have been at the company for more than 5 years.
Out[59]:
(824, 19)

Exploratory data analysis¶

Begin by understanding how many employees left and what percentage of all employees this figure represents.

In [61]:
# Get numbers of people who left vs. stayed
print( df.left.value_counts() )
print("Out of 10,876 employees, 1115 people left.")

# Get percentages of people who left vs. stayed
print( df.left.value_counts(normalize=True) )
print("i.e. That's about 9% turnover rate!")


# Now focus on good workers who left.
df['left_good_worker'] = ( df.last_evaluation >= .6 ) * df.left 

print('----')
print("Focus on good workers leaving.")
# Get numbers of people who left vs. stayed
print( df.left_good_worker.value_counts() )
print("Wow -- all of the workers who left had an evaluation of 6 or above.")

# Get percentages of people who left vs. stayed
print( df.left_good_worker.value_counts(normalize=True) )

# Write over left field so that I can see the results with this more targetted DV
df.left = df.left_good_worker
df = df.drop(columns='left_good_worker')
0    10876
1     1115
Name: left, dtype: int64
Out of 10,876 employees, 1115 people left.
0    0.907014
1    0.092986
Name: left, dtype: float64
i.e. That's about 9% turnover rate!
----
Focus on good workers leaving.
0    10876
1     1115
Name: left_good_worker, dtype: int64
Wow -- all of the workers who left had an evaluation of 6 or above.
0    0.907014
1    0.092986
Name: left_good_worker, dtype: float64

Data visualizations¶

Now, examine variables of interest, and create plots to visualize relationships between variables in the data.

In [62]:
# Create a plot as needed
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11991 entries, 0 to 11999
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   satisfaction_level      11991 non-null  float64
 1   last_evaluation         11991 non-null  float64
 2   number_project          11991 non-null  int64  
 3   average_montly_hours    11991 non-null  int64  
 4   time_spend_company      11991 non-null  int64  
 5   work_accident           11991 non-null  int64  
 6   left                    11991 non-null  int64  
 7   promotion_last_5years   11991 non-null  int64  
 8   department_RandD        11991 non-null  uint8  
 9   department_accounting   11991 non-null  uint8  
 10  department_hr           11991 non-null  uint8  
 11  department_management   11991 non-null  uint8  
 12  department_marketing    11991 non-null  uint8  
 13  department_product_mng  11991 non-null  uint8  
 14  department_sales        11991 non-null  uint8  
 15  department_support      11991 non-null  uint8  
 16  department_technical    11991 non-null  uint8  
 17  salary_low              11991 non-null  uint8  
 18  salary_medium           11991 non-null  uint8  
dtypes: float64(2), int64(6), uint8(11)
memory usage: 971.9 KB
In [33]:
# Create a histogram based on those who left vs stayed
df.target = df.left==1
sns.histplot(x=df.satisfaction_level, hue=df.left)
plt.title("Histogram of satisfaction level -- workers who left vs stayed")
Out[33]:
Text(0.5, 1.0, 'Histogram of satisfaction level -- workers who left vs stayed')
In [34]:
# Create a plot as needed
sns.histplot(x=df.last_evaluation, hue=df.left)
plt.title("Histogram of last evaluation score -- workers who left vs stayed")
Out[34]:
Text(0.5, 1.0, 'Histogram of last evaluation score -- workers who left vs stayed')
In [35]:
# Create a plot as needed
df.info()
compare = [ 'left', 'satisfaction_level', 'last_evaluation', 'average_montly_hours']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 11991 entries, 0 to 11999
Data columns (total 19 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   satisfaction_level      11991 non-null  float64
 1   last_evaluation         11991 non-null  float64
 2   number_project          11991 non-null  int64  
 3   average_montly_hours    11991 non-null  int64  
 4   time_spend_company      11991 non-null  int64  
 5   work_accident           11991 non-null  int64  
 6   left                    11991 non-null  int64  
 7   promotion_last_5years   11991 non-null  int64  
 8   department_RandD        11991 non-null  uint8  
 9   department_accounting   11991 non-null  uint8  
 10  department_hr           11991 non-null  uint8  
 11  department_management   11991 non-null  uint8  
 12  department_marketing    11991 non-null  uint8  
 13  department_product_mng  11991 non-null  uint8  
 14  department_sales        11991 non-null  uint8  
 15  department_support      11991 non-null  uint8  
 16  department_technical    11991 non-null  uint8  
 17  salary_low              11991 non-null  uint8  
 18  salary_medium           11991 non-null  uint8  
dtypes: float64(2), int64(6), uint8(11)
memory usage: 971.9 KB
In [70]:
# Look at relationships between our variables of interest
sns.pairplot(df[compare], hue="left")
#plt.title("Pairwise plot relating hours, job satisfaction and evaluation")
Out[70]:
<seaborn.axisgrid.PairGrid at 0x12483185220>
In [71]:
# Create a plot as needed
sns.histplot(x=df.average_montly_hours, hue=df.left)
plt.title("Average monthly hours worked of those who left vs those who remained")
Out[71]:
Text(0.5, 1.0, 'Average monthly hours worked of those who left vs those who remained')
In [72]:
# Appears there is a mis-match in working hours. Some worked too much while others have too few hours.
sns.scatterplot(x=df.average_montly_hours,y=df.satisfaction_level,hue=df.left,alpha=0.05)
plt.title("Employee Satisfaction vs hours worked")
Out[72]:
Text(0.5, 1.0, 'Employee Satisfaction vs hours worked')
In [73]:
# Create a plot as needed
sns.scatterplot(x=df.last_evaluation,y=df.satisfaction_level,hue=df.left,alpha=0.05)
plt.title("Employee satisfaction vs their last evaluation")
Out[73]:
Text(0.5, 1.0, 'Employee satisfaction vs their last evaluation')
In [74]:
# color pallette
# Get Unique depts
color_labels = df_orig['department'].unique()

# List of colors in the color palettes
rgb_values = sns.color_palette("Set2", len(color_labels))

# Map continents to the colors
color_map = dict(zip(color_labels, rgb_values))
print( color_map)
{'sales': (0.4, 0.7607843137254902, 0.6470588235294118), 'accounting': (0.9882352941176471, 0.5529411764705883, 0.3843137254901961), 'hr': (0.5529411764705883, 0.6274509803921569, 0.796078431372549), 'technical': (0.9058823529411765, 0.5411764705882353, 0.7647058823529411), 'support': (0.6509803921568628, 0.8470588235294118, 0.32941176470588235), 'management': (1.0, 0.8509803921568627, 0.1843137254901961), 'IT': (0.8980392156862745, 0.7686274509803922, 0.5803921568627451), 'product_mng': (0.7019607843137254, 0.7019607843137254, 0.7019607843137254), 'marketing': (0.4, 0.7607843137254902, 0.6470588235294118), 'RandD': (0.9882352941176471, 0.5529411764705883, 0.3843137254901961)}
In [75]:
%matplotlib widget

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import ListedColormap
fig = plt.figure(figsize=(8,8))
ax = Axes3D(fig)
#ax.scatter(xs=df_orig.last_evaluation,ys=df_orig.satisfaction_level,zs=df_orig.average_montly_hours,   c=df_orig.department.map(color_map), alpha=0.15)
ax.scatter(xs=df.last_evaluation,ys=df.satisfaction_level,zs=df.average_montly_hours,   c=df.left, alpha=0.15)

ax.set_xlabel('Last evaluation (quality)')
ax.set_ylabel('Satisfaction level')
ax.set_zlabel('Monthly hours')

plt.show()
C:\Users\jon\AppData\Local\Temp\ipykernel_18508\844919536.py:6: MatplotlibDeprecationWarning: Axes3D(fig) adding itself to the figure is deprecated since 3.4. Pass the keyword argument auto_add_to_figure=False and use fig.add_axes(ax) to suppress this warning. The default value of auto_add_to_figure will change to False in mpl3.5 and True values will no longer work in 3.6.  This is consistent with other Axes classes.
  ax = Axes3D(fig)
Figure
In [69]:
# add a ave_monthly_hours2 to capture non-linear effects
df['long_hours'] = df.average_montly_hours >= 220 # 173.33 is average

Insights¶

What insights can you gather from the plots you created to visualize the data?

It seems there are 3 main groups according to evaluation (quality), satisfaction, monthly hours

Employees who leave belong to three groups: 1) get low monthly hours, have low satisfaction, low evaluations. 2) have high monthly hours (>200hrs) and low satisfaction, high evaluations 3) have high monthly hours (> 200hrs) and high satisfaction, high evaluations

It seems the company has low performers that are unhappy and get few hours It also has high performers and are burned out and leave OR are performing well and say they are satisfied but then leave.

In essence, there seems to be a culture of working long hours in the company. This likely leads people to work unsustainable hours. They are recognized for their good work, but they eventually leave. Satisfaction surveys are likely insufficient because people likely leave before they indicate dissatisfaction.

Construct a model¶

In this section, I'm build a model to predict employees at higher risk of leaving.

I evaluate the model using a train-test split of the data.

The task for the model is to predict the category (stay or leave) a worker will belong to.

In practice, I would build a series of models and then compare their performance on the test data. In this case, I chose to build a random forest model for its ease of use and interpretation.

One of the nice features is that the RF model yields measures of variable importance. As seen below, the number of projects was found to be a useful predictor. In my EDA, this was not something I initially focused on because I assumed it to be highly correlated with hours worked.

One of the other nice features is that I am also able to display a decision tree from the estimation process. RF models build a number of trees, but examining some of these trees can give useful insights into which variables and thresholds are optimal in categorizing our predictions.

🔎

Recall model assumptions¶

Logistic Regression model assumptions

  • Outcome variable is categorical
  • Observations are independent of each other
  • No severe multicollinearity among X variables
  • No extreme outliers
  • Linear relationship between each X variable and the logit of the outcome variable
  • Sufficiently large sample size

Modeling¶

In [70]:
%matplotlib inline

import sklearn.model_selection as ms
import sklearn.preprocessing as pp
import sklearn.ensemble as en

rfc = en.RandomForestClassifier(random_state=42)
y = df.left
X = df.drop(columns=['left','average_montly_hours'])
X['long_hours'] = X.long_hours.apply(lambda x: 0 + 1*(x))

X_train, X_test, y_train, y_test = ms.train_test_split(X,y,train_size=0.8, random_state=42)
X_train.shape
X_trn, X_val, y_trn, y_val = ms.train_test_split(X_train,y_train,train_size=0.75, random_state=42)
print(X_trn.shape)
print(X_test.shape)
print(X_val.shape)

f0 = rfc.fit(X_trn,y_trn)
(7194, 18)
(2399, 18)
(2398, 18)
In [ ]:
 
In [71]:
# CV
cv_params = {'max_depth': [None], 
             'min_samples_leaf': [10,20,50],
             'min_samples_split': [.02,.05,.10,.20],
             'max_features': [2,4,6,8],
             'n_estimators': [75, 100, 125, 150]
             }  
scoring = {'accuracy', 'precision', 'recall', 'f1'}

#custom_split = PredefinedSplit(split_index)
rf_val = ms.GridSearchCV(rfc, cv_params, scoring=scoring, cv=5, refit='f1')
In [50]:
%%time
rf_val.fit(X_train, y_train)
CPU times: total: 6min 14s
Wall time: 6min 14s
Out[50]:
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [None], 'max_features': [2, 4, 6, 8],
                         'min_samples_leaf': [10, 20, 50],
                         'min_samples_split': [0.02, 0.05, 0.1, 0.2],
                         'n_estimators': [75, 100, 125, 150]},
             refit='f1', scoring={'f1', 'recall', 'accuracy', 'precision'})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [None], 'max_features': [2, 4, 6, 8],
                         'min_samples_leaf': [10, 20, 50],
                         'min_samples_split': [0.02, 0.05, 0.1, 0.2],
                         'n_estimators': [75, 100, 125, 150]},
             refit='f1', scoring={'f1', 'recall', 'accuracy', 'precision'})
RandomForestClassifier(random_state=42)
RandomForestClassifier(random_state=42)
In [51]:
# Obtain optimal parameters.
rf_val.best_params_
Out[51]:
{'max_depth': None,
 'max_features': 8,
 'min_samples_leaf': 10,
 'min_samples_split': 0.02,
 'n_estimators': 75}
In [52]:
pickle.dump(rf_val, open("rf_model.p",'wb') )
In [53]:
import sklearn.metrics as m
import sklearn.tree as tree
import sklearn.model_selection as ms
import sklearn.ensemble as e
In [54]:
# Fit the optimal model.

my_rfc = e.RandomForestClassifier(max_features=8,min_samples_leaf=10, min_samples_split=0.02, n_estimators=75, random_state=42)
f0 = my_rfc.fit(X_train,y_train)
In [55]:
estimator = my_rfc.estimators_[3]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = X_trn.columns,
                #class_names = df.target_names,
                rounded = True, proportion = False, 
                precision = 2, filled = True, max_depth=4)

# Convert to png using system command (requires Graphviz)
#from subprocess import call
#call(['dot', '-Tpng', '/home/jovyan/work/tree.dot', '-o', '/home/jovyan/work/tree.png', '-Gdpi=600'])

# Display in jupyter notebook
#from IPython.display import Image
#Image(filename = 'tree.png')
In [56]:
import matplotlib.pyplot as plt
plt.figure(figsize=(18,18))

tree.plot_tree(my_rfc.estimators_[4],max_depth=3, fontsize=10,filled=True, feature_names=X.columns, class_names={0:'Stayed', 1:'Left'})
plt.show()
In [57]:
%matplotlib inline
o = my_rfc.feature_importances_
o = [round(i,3) for i in o]
c = X_trn.columns
pair = list( zip( *sorted( zip(o, c), key=lambda pair: pair[0] ) ) )
print(pair[1])
sns.barplot( x=list(pair[0]), y=list(pair[1]))
plt.title(f'Variable importance reported by Random Forest model.')
#print(pair[1])
('promotion_last_5years', 'department_RandD', 'department_accounting', 'department_hr', 'department_management', 'department_marketing', 'department_product_mng', 'department_sales', 'department_support', 'department_technical', 'salary_medium', 'work_accident', 'salary_low', 'long_hours', 'number_project', 'last_evaluation', 'time_spend_company', 'satisfaction_level')
Out[57]:
Text(0.5, 1.0, 'Variable importance reported by Random Forest model.')
In [76]:
#smaller = ['long_hours','satisfaction_level','last_evaluation','number_project','average_montly_hours','time_spend_company','work_accident']
#sdf = df[smaller]
sns.histplot(x=df.average_montly_hours, alpha=.53)
plt.title("Average monthly hours - Note the bimodal distribution")
C:\Users\jon\anaconda3\envs\BERTopic\lib\site-packages\mpl_toolkits\mplot3d\art3d.py:900: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if zdir == 'x':
C:\Users\jon\anaconda3\envs\BERTopic\lib\site-packages\mpl_toolkits\mplot3d\art3d.py:902: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  elif zdir == 'y':
Out[76]:
Text(0.5, 0.92, 'Average monthly hours - Note the bimodal distribution')
In [59]:
y_pred = f0.predict(X_val)

acc = m.accuracy_score(y_val,y_pred)
rec = m.recall_score(y_val,y_pred)
prec = m.precision_score(y_val,y_pred)
f1 = m.f1_score(y_val,y_pred)
print(f'Acc: {acc}  Recall:{rec}  Precision: {prec}   F1:{f1}')
Acc: 0.98790658882402  Recall:0.8772727272727273  Precision: 0.9897435897435898   F1:0.9301204819277108
In [60]:
cm = m.confusion_matrix(y_val,y_pred)
print(cm)
p = m.ConfusionMatrixDisplay(cm,display_labels=[])
p.plot(values_format='')
[[2176    2]
 [  27  193]]
Out[60]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1f0426b40d0>
In [61]:
# Nothing excessively correlated
co = df.corr()
sns.heatmap(co,cmap='Blues')
Out[61]:
<AxesSubplot:>
In [62]:
import sklearn.linear_model as lm
lmc = lm.LogisticRegressionCV( cv=5, random_state=42,solver='liblinear')
f1 = lmc.fit( X_trn, y_trn )
In [63]:
import statsmodels.api as sm
myX = X_trn.copy()
print(y_trn)
myX = pd.get_dummies(myX,drop_first=True,columns=['long_hours'])

log_reg = sm.Logit( y_trn, myX ).fit()
#log_reg.get_margeff(at='mean').summary()
log_reg.summary()
#df.info()
9569     0
9097     0
978      0
11670    0
2077     0
        ..
4753     0
6290     0
5264     0
11860    0
7907     0
Name: left, Length: 7194, dtype: int64
Optimization terminated successfully.
         Current function value: 0.228906
         Iterations 8
Out[63]:
Logit Regression Results
Dep. Variable: left No. Observations: 7194
Model: Logit Df Residuals: 7176
Method: MLE Df Model: 17
Date: Fri, 22 Sep 2023 Pseudo R-squ.: 0.2691
Time: 18:10:22 Log-Likelihood: -1646.8
converged: True LL-Null: -2253.1
Covariance Type: nonrobust LLR p-value: 2.439e-247
coef std err z P>|z| [0.025 0.975]
satisfaction_level -3.9861 0.169 -23.651 0.000 -4.316 -3.656
last_evaluation 1.1055 0.270 4.091 0.000 0.576 1.635
number_project 0.0573 0.039 1.483 0.138 -0.018 0.133
time_spend_company 0.1874 0.031 6.022 0.000 0.126 0.248
work_accident -1.4099 0.188 -7.480 0.000 -1.779 -1.040
promotion_last_5years -1.8695 0.750 -2.494 0.013 -3.339 -0.400
department_RandD -2.1607 0.245 -8.819 0.000 -2.641 -1.680
department_accounting -1.8912 0.233 -8.125 0.000 -2.347 -1.435
department_hr -1.6956 0.238 -7.137 0.000 -2.161 -1.230
department_management -2.2922 0.290 -7.894 0.000 -2.861 -1.723
department_marketing -1.9541 0.247 -7.911 0.000 -2.438 -1.470
department_product_mng -1.7958 0.230 -7.795 0.000 -2.247 -1.344
department_sales -1.7930 0.144 -12.460 0.000 -2.075 -1.511
department_support -1.6228 0.161 -10.095 0.000 -1.938 -1.308
department_technical -1.6603 0.153 -10.857 0.000 -1.960 -1.361
salary_low -1.0619 0.137 -7.776 0.000 -1.330 -0.794
salary_medium -1.4823 0.143 -10.365 0.000 -1.763 -1.202
long_hours_1 2.0089 0.109 18.439 0.000 1.795 2.222
In [ ]:
 
In [64]:
print( np.exp( log_reg.params) )
np.exp( log_reg.conf_int() )
satisfaction_level        0.018572
last_evaluation           3.020675
number_project            1.059011
time_spend_company        1.206150
work_accident             0.244168
promotion_last_5years     0.154198
department_RandD          0.115243
department_accounting     0.150885
department_hr             0.183482
department_management     0.101046
department_marketing      0.141697
department_product_mng    0.165992
department_sales          0.166464
department_support        0.197339
department_technical      0.190080
salary_low                0.345785
salary_medium             0.227109
long_hours_1              7.455418
dtype: float64
Out[64]:
0 1
satisfaction_level 0.013347 0.025841
last_evaluation 1.778721 5.129798
number_project 0.981748 1.142354
time_spend_company 1.134774 1.282015
work_accident 0.168754 0.353284
promotion_last_5years 0.035488 0.669997
department_RandD 0.071294 0.186282
department_accounting 0.095612 0.238111
department_hr 0.115175 0.292299
department_management 0.057193 0.178521
department_marketing 0.087320 0.229937
department_product_mng 0.105677 0.260733
department_sales 0.125554 0.220705
department_support 0.144003 0.270430
department_technical 0.140854 0.256510
salary_low 0.264586 0.451903
salary_medium 0.171595 0.300583
long_hours_1 6.021901 9.230185

Logit testing¶

In [65]:
y_pred = f0.predict(X_test)

acc = m.accuracy_score(y_test,y_pred)
rec = m.recall_score(y_test,y_pred)
prec = m.precision_score(y_test,y_pred)
f1score = m.f1_score(y_test,y_pred)
print(f'Acc: {acc}  Recall:{rec}  Precision: {prec}   F1:{f1score}')
Acc: 0.9820758649437266  Recall:0.8317757009345794  Precision: 0.9621621621621622   F1:0.8922305764411027
In [66]:
y_pred = f0.predict(X_test)
cm = m.confusion_matrix(y_test,y_pred)
print(cm)
p = m.ConfusionMatrixDisplay(cm,display_labels=["Stayed","Left"])
p.plot(values_format='')
plt.title(f"Random forest confusion matrix - Test data")
[[2178    7]
 [  36  178]]
Out[66]:
Text(0.5, 1.0, 'Random forest confusion matrix - Test data')

Logit model¶

In [67]:
y_pred = f1.predict(X_val)
cm = m.confusion_matrix(y_val,y_pred)
print(cm)
p = m.ConfusionMatrixDisplay(cm,display_labels=["Stayed","Left"])
p.plot(values_format='')
plt.title(f"Logit model confusion matrix - Test data")
[[2145   33]
 [  63  157]]
Out[67]:
Text(0.5, 1.0, 'Logit model confusion matrix - Test data')
In [68]:
y_pred = f1.predict(X_test)

acc = m.accuracy_score(y_test,y_pred)
rec = m.recall_score(y_test,y_pred)
prec = m.precision_score(y_test,y_pred)
f1score = m.f1_score(y_test,y_pred)
print(f'Acc: {acc}  Recall:{rec}  Precision: {prec}   F1:{f1score}')
Acc: 0.9578991246352647  Recall:0.7009345794392523  Precision: 0.8021390374331551   F1:0.7481296758104737

✏

Recall evaluation metrics¶

  • AUC is the area under the ROC curve; it's also considered the probability that the model ranks a random positive example more highly than a random negative example.
  • Precision measures the proportion of data points predicted as True that are actually True, in other words, the proportion of positive predictions that are true positives.
  • Recall measures the proportion of data points that are predicted as True, out of all the data points that are actually True. In other words, it measures the proportion of positives that are correctly classified.
  • Accuracy measures the proportion of data points that are correctly classified.
  • F1-score is an aggregation of precision and recall.

Brief conclusion¶

In practice, I prefer the random forest model. The F1 score is higher with the random forest model, which indicates that the model has better predictive abilities compare to the logistic regression.

The random forest model also confirms that the most important variables to consider when predicting employee turnover are indicators of overwork: long hours, number of projects. The other indications such as reported satisfaction, and time spent in company and last evaluation are useful but may be lagging indicators. In other words, an employee that reports being dissatisfied may already be job hunting. Also, many employees report being satisfied before they left.

Managers can more easily track the number of projects and the number of hours employees work. One potential solution to encourage retention is to cap the number of projects an employee can be involved in and the weekly hours they can work to the legal limit. These policy changes can help stop burn-out and increase long-run retention of employees.