There have been many news stories about Ontario colleges. Concerns have grown that Ontario colleges have forgotten their core mandate of training skilled workers in favor of attracting international students who pay much higher tuition. Questions have been raised about the quality of education students now receive and the incentives of college administrators. In this analysis, I focus narrowly on the salaries paid by Ontario colleges. In particular:
How much do highly paid workers at Ontario colleges make?
What determines their salaries?
Is it possible to predict college workers’ salaries?
I use the Ontario government’s Sunshine List of employees who make over $100,000 annually in this analysis. (This analysis focuses on Ontario colleges, which are distinct from Ontario universities.)
The first section describes the data from the Sunshine List, focusing on the years between 2012 and 2023. The following section builds OLS, fixed effects, and between-effects models for inference about the potential causes of salary increases. These models will help explain the factors related to changes in college workers’ salaries.
Given the limited public information available, advanced predictive models build on these insights to predict workers’ salaries. To test the quality of these models, the data is split into a training component to train predictive models and test data to verify the results. The quality of the predictions and areas for improvement are discussed.
I account for inflation by using the CPI to express each year’s salaries in the prices of a single base year. This makes salaries validly comparable across time.
It is also important to remember that the Sunshine List only contains workers who make more than $100,000. This salary level is a nominal cut-off, so as inflation increases and the purchasing power of $100,000 decreases, we see an increasing number of people on the list over time. I have restricted the analysis from 2013 onwards to partially account for this. I also use 2013 as the base year to account for inflation. After adjusting for inflation, workers who make less than $100,000 are removed from the panel.
# -----------------------
# Read All-items CPI from Statistics Canada (Series 1810000501)
# -----------------------
# https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000501&pickMembers%5B0%5D=1.2&cubeTimeFrame.startYear=2002&cubeTimeFrame.endYear=2023&referencePeriods=20020101%2C20230101
# This will be used to chain the salary amounts across years
cpi <- read.csv("data/Canada_CPI_1996-2023-eng.csv")

# Keep the All-items CPI series and switch from wide to long format
# so that yearly CPI values are easier to look up
cpi <- cpi %>%
  filter(Group == "All-items") %>%
  pivot_longer(
    cols = starts_with("X"),
    values_to = "CPI",
    names_to = "year"
  ) %>%
  mutate(year = gsub("X", "", year)) %>%
  select(-Group)

# Configure the base year
BASE_YEAR <- 2013

# Express a nominal value from current_year in base-year (2013) dollars
get_value <- function(val, current_year) {
  curr_cpi <- cpi$CPI[cpi$year == current_year]
  base_cpi <- cpi$CPI[cpi$year == BASE_YEAR]
  return(val * (base_cpi / curr_cpi))
}
I’ve created a panel of observations about workers at Ontario colleges. To accomplish this, I combined a series of annual Sunshine List data files into a panel of employees, which I followed over time. Observations of individuals are linked over time based on a common employer and the person’s first and last name.
The shortcomings of this approach are that I cannot easily capture people who changed jobs between colleges and whose salary in those transition years fell below $100,000. It also potentially misses women who changed their last name after marriage. However, an examination of the data shows that the number of cases meeting these criteria is small.
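As a rough sketch of the linking step (assuming the stacked annual files sit in a data frame called sunshine_all with columns fn, ln, Employer, and year; these names are illustrative, not necessarily the ones used in the actual code):

library(dplyr)

# Sketch: build a person key from first name, last name, and employer,
# then follow each key across years (column names are assumptions)
worker_panel <- sunshine_all %>%
  mutate(person_key = tolower(paste(fn, ln, Employer, sep = "|"))) %>%
  group_by(person_key) %>%
  mutate(
    ID  = cur_group_id(),     # one ID per (name, employer) combination
    exp = year - min(year)    # years since first appearance on the list
  ) %>%
  ungroup()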
The following box plot shows the distribution of salaries paid at each Ontario college. The colored bands represent the mass of the distribution, and the dots are ‘outlying’ points; in this case, the dots represent employees who make significantly more than others at the same institution.
Note that I use 2013 as the base year for inflation adjustment. Many workers who nominally make more than $100,000 in later years make less than $100,000 in 2013 inflation-adjusted dollars. These workers are kept in the panel because they still provide helpful salary information.
gp <- worker_panel %>%
  filter(salary >= 100000) %>%
  arrange(Employer) %>%
  ggplot(aes(x = salary, fill = Employer)) +
  geom_boxplot() +
  scale_x_log10(labels = scales::dollar_format()) +
  ggtitle("Plots of Salaries By College")

gp
Salaries in the data have a long right tail, with a small number of college administrators making very large amounts. The mass of the distribution, which includes the lower-paid professionals, comprises college teaching faculty and administrators. A common approach to normalizing such distributions is to log-transform salary. This transformation tends to improve the fit of the models we use, and we can easily exponentiate the results to see them again in dollars.
I have also merged data about each college’s student demographics to enrich the data. In particular, I focus on the two largest groups – Canadian students and students from India enrolled in Ontario colleges.
The final data transformation has to do with job titles. The Sunshine List contains job titles as reported by the colleges, but there is no consistency in those titles. To simplify the variation, I have searched for common positions. For example, if the position title contains a word such as ‘professor,’ ‘lecturer,’ or ‘teacher,’ the person is given the simplified title ‘Professor.’ If the title contains ‘vice-president of …,’ they are categorized as a ‘VP,’ and so on.
# A rowwise flag for schools with > 1/3 Indian students was tried here,
# but it was not very useful in practice, so it is omitted.

# Log transform of salary
worker_panel <- worker_panel %>%
  mutate(
    ln_salary = log(salary)
  )

# Make simpler job categories from the raw titles.
# Titles that do not match any pattern (e.g., consultant, coordinator,
# manager, specialist, registrar, counsellor) fall into "Other".
worker_panel <- worker_panel %>%
  mutate(
    cln = tolower(gsub("[,-].*", "", title, perl = TRUE)),
    cln = gsub("[[:punct:]]+", "", cln),
    title2 = case_when(
      grepl("vice pre", cln)   ~ "VP",
      grepl("officer", cln)    ~ "Officer",
      grepl("vp", cln)         ~ "VP",
      grepl("chair", cln)      ~ "Chair",
      grepl("dean", cln)       ~ "Dean",
      grepl("director", cln)   ~ "Director",
      grepl("professor", cln)  ~ "Professor",
      grepl("faculty", cln)    ~ "Professor",
      grepl("instructor", cln) ~ "Professor",
      grepl("president", cln)  ~ "President",
      grepl("exec dir", cln)   ~ "Director",
      .default = "Other"
    )
  )
To enrich the data further, I have also merged the number of enrollments at each college over time. My prior analysis found that Canadian and Indian students comprise the two largest groups at these institutions.
This was an experiment in which I took all of the job titles and attempted to cluster them into groups. My hope was that this automated approach (using TF-IDF and sentence transformers) would produce sensible clusters. Unfortunately, it did not, and it proved better to create groups from heuristics about job titles. The downside is the ‘Other’ category, which has a high variance in salaries. My suspicion is that these are administrative positions with varying responsibilities that are not captured by the traditional titles of higher education.
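For reference, a minimal sketch of the kind of TF-IDF clustering that was tried (the tm-based pipeline and the number of clusters are my assumptions for illustration; the sentence-transformer step is not shown):

library(tm)

# Sketch: TF-IDF weight the raw titles, then k-means them into groups
titles <- unique(worker_panel$title)
corpus <- VCorpus(VectorSource(titles))
dtm <- DocumentTermMatrix(
  corpus,
  control = list(weighting = weightTfIdf, tolower = TRUE, removePunctuation = TRUE)
)

set.seed(1)
km <- kmeans(as.matrix(dtm), centers = 10)

# Inspect a few titles per cluster to judge whether the groups are sensible
head(split(titles, km$cluster), 3)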
This section should NOT be run
    Chair      Dean  Director   Officer     Other President Professor        VP      <NA>
     1748      2990      4952       506     11410       270     45985      1211         0
Below is a summary of the variables that are available for examination.
Salaries by college in 2022, expressed in 2013 inflation-adjusted dollars

Employer | Median | SD | Max | N |
---|---|---|---|---|
Humber | $94,620 | $22,366 | $449,086 | 796 |
Sheridan | $95,170 | $18,921 | $373,444 | 687 |
George Brown | $94,702 | $21,678 | $368,889 | 609 |
Seneca | $95,192 | $18,808 | $342,789 | 764 |
Conestoga | $95,082 | $21,201 | $332,908 | 577 |
Algonquin | $95,082 | $16,618 | $274,107 | 597 |
La Cité | $95,081 | $19,947 | $265,873 | 173 |
Sir Sandford | $94,880 | $16,679 | $249,764 | 230 |
Centennial | $95,155 | $17,786 | $247,264 | 545 |
Durham | $94,862 | $16,867 | $245,880 | 340 |
Fanshawe | $94,841 | $15,207 | $244,277 | 503 |
St Clair | $95,096 | $17,314 | $243,781 | 253 |
Sault | $95,171 | $19,413 | $233,358 | 109 |
Niagara | $95,082 | $16,409 | $231,887 | 352 |
Mohawk | $95,115 | $17,080 | $223,765 | 426 |
St Lawrence | $93,623 | $18,411 | $222,775 | 215 |
Lambton | $95,118 | $21,386 | $214,770 | 132 |
Northern | $95,269 | $19,737 | $212,716 | 80 |
Georgian | $94,821 | $17,566 | $212,587 | 340 |
Boréal | $95,095 | $19,861 | $212,430 | 103 |
Cambrian | $94,717 | $16,905 | $207,437 | 173 |
Canadore | $95,115 | $15,662 | $199,430 | 108 |
Loyalist | $97,085 | $18,443 | $187,978 | 124 |
Confederation | $94,517 | $14,644 | $184,300 | 132 |
The correlation plot shows the degree to which variables move together. Highly correlated variables should not appear in the same inferential regression model because they distort the size and significance of the coefficients, making interpretation unreliable.
There are expected relationships between the previously listed variables because of how they are constructed or, in some cases, because they measure similar phenomena. For example, the number of Canadian students enrolled is correlated with the total count of all students. Similarly, the proportion of Indian students positively correlates with the time trend; we know from other analyses that Indian international students only began to arrive at Ontario colleges towards the end of our panel.
It’s also unsurprising that the number of faculty positively correlates with the number of students enrolled. Of course, as a college gains more students, it requires more teachers.
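The correlation matrix behind this plot can be computed along these lines (a sketch; the numeric columns follow the variable list shown later, and corrplot is only one of several packages that will draw it):

library(dplyr)
library(corrplot)

# Sketch: pairwise correlations of the numeric variables in the panel
num_vars <- worker_panel %>%
  select(salary, exp, faculty_num, enroll_indian, enroll_canadian,
         prop, student_count)

corr_mat <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "color", type = "upper", tl.col = "black")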
The full dataset is divided into training and test data. The training data is used to build the models; the held-out test data is used to evaluate the accuracy of the predictive models built on the training data.
As previously noted, individual IDs are assigned based on a person’s first and last name and their employer. This means we cannot track people who switch between employers or who change their names (e.g., married women).
To create valid training and test data, I sample from all available person IDs and assign a portion to the training set and the remainder to the test set for final verification.
 [1] "fn"              "ln"              "year"            "Employer"
 [5] "title"           "salary"          "ln_salary"       "title2"
 [9] "exp"             "is_prof"         "faculty_num"     "enroll_indian"
[13] "enroll_canadian" "prop"            "student_count"   "ID"

[1] 69072    16
Some code was written to capture cases of people who left a job at one college and switched to another. However, there can be cases where switchers do not appear in the data because their salary in the transition year falls below the $100,000 threshold for inclusion on the Sunshine List. In practice, only a handful of people (around five) fit these criteria.
 [1] "fn"              "ln"              "year"            "Employer"
 [5] "title"           "salary"          "ln_salary"       "title2"
 [9] "exp"             "is_prof"         "faculty_num"     "enroll_indian"
[13] "enroll_canadian" "prop"            "student_count"   "ID"
[17] "ID.x"            "exp.x"           "exp.y"

 [1] "fn"              "ln"              "year"            "Employer"
 [5] "title"           "salary"          "ln_salary"       "title2"
 [9] "exp"             "is_prof"         "faculty_num"     "enroll_indian"
[13] "enroll_canadian" "prop"            "student_count"   "ID"
# Sample split into train and test
set.seed(66664)

rfdf <- rfdf %>%
  mutate(
    Employer = as.factor(Employer),
    title2   = as.factor(title2)
    # year is left numeric so it acts as a trend; ID is left as-is
  )

# Remove observations with salary below 100K
rfdf <- rfdf %>% filter(salary >= 100000)  # removes 27K obs!

# Sample person IDs so that all of an individual's observations end up in
# either train or test, never both
sample_id_list <- unique(rfdf$ID)
count_ids      <- length(sample_id_list)
train_size     <- floor(count_ids * 0.75)  # 75% of IDs to training, 25% to test
sample_ids     <- sample(x = sample_id_list, size = train_size)

train <- rfdf %>% filter(ID %in% sample_ids)
test  <- rfdf %>% filter(!(ID %in% sample_ids))

# Check that every simplified title appears in both splits
# (needed for the panel data regressions)
if (!setequal(unique(test$title2), unique(train$title2))) {
  stop("Test does not have all factors. Change random seed.")
}

dim(train)
[1] 31987 16
# Make dataframe numeric -- TODO when to implement?
#rfdf <- set_dummies(rfdf)
#rfdf <- rfdf %>% select(-year, -Employer, -title2)
#rfdf <- rfdf %>% select(where(is.numeric))
 [1] "Employeralgonquin"     "Employerbor"           "Employercambrian"
 [4] "Employercanadore"      "Employercentennial"    "Employerconestoga"
 [7] "Employerconfederation" "Employerdurham"        "Employergeorge_brown"
[10] "Employergeorgian"      "Employerhumber"        "Employerla_cit"
[13] "Employerlambton"       "Employerloyalist"      "Employermohawk"
[16] "Employerniagara"       "Employernorthern"      "Employersault"
[19] "Employerseneca"        "Employersheridan"      "Employersir_sandford"
[22] "Employerst_clair"      "Employerst_lawrence"   "title2Chair"
[25] "title2Dean"            "title2Director"        "title2Officer"
[28] "title2Other"           "title2President"       "title2Professor"
[31] "title2VP"              "year2014"              "year2015"
[34] "year2016"              "year2017"              "year2018"
[37] "year2019"              "year2020"              "year2021"
[40] "year2022"              "year2023"
Manual and automated testing with OLS models showed that a relatively small subset of variables provided the most explanatory power. For simplicity, this work will focus on variations of these explanatory variables.
We begin with OLS models because of their flexibility and ease of interpretation. The specification starts with dummies for each simplified title commonly used at colleges. Year is also entered as a set of dummy variables because we have no reason to believe that salaries increase linearly once inflation is taken into account.
Indeed, because the panel includes the worst of the pandemic years, significant year-to-year salary changes seem likely. We also include ‘exp,’ which represents a worker’s years of experience at the college.
Two other additional explanatory variables are tested. The first is the proportion (prop) of Indian international students enrolled. The other is a count of the number of students enrolled in the college.
Sometimes, the most straightforward specification is the best, and adding more explanatory variables does not add to the overall explanatory power of a model. To determine which of the possible specifications is best, I use likelihood-ratio tests to compare the different nested models (not shown here).
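A sketch of that comparison (the exact formulas are illustrative, not the final specifications; lmtest::lrtest() compares nested fits):

library(lmtest)

# Sketch: three nested specifications compared with likelihood-ratio tests
m1 <- lm(ln_salary ~ title2 + factor(year) + exp, data = train)
m2 <- update(m1, . ~ . + prop)
m3 <- update(m2, . ~ . + student_count)

lrtest(m1, m2, m3)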
The following visualization shows how the model coefficients change with the addition of new variables. We can see that the models are relatively consistent – which is a good thing.
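One way to assemble such a coefficient plot, using the nested fits from the sketch above (broom and ggplot2; this is illustrative, not the exact plotting code):

library(broom)
library(dplyr)
library(ggplot2)

# Sketch: gather tidy coefficients from the nested models and plot them
# side by side so changes across specifications are visible
coef_df <- bind_rows(
  tidy(m1, conf.int = TRUE) %>% mutate(model = "Model 1"),
  tidy(m2, conf.int = TRUE) %>% mutate(model = "Model 2"),
  tidy(m3, conf.int = TRUE) %>% mutate(model = "Model 3")
) %>%
  filter(term != "(Intercept)")

ggplot(coef_df, aes(x = estimate, y = term, colour = model, shape = model)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  position = position_dodge(width = 0.6)) +
  labs(title = "Coefficient estimates across model specifications")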
Ordinary Least Squares (OLS) models, which we have just looked at, can be specified to control for within variation by adding time-varying dummy variables, but they still contain some between variation (more on this below). OLS models are generally used on cross-sectional data; when they are used with panel data, they are referred to as pooled OLS. One needs to be aware of the risk of bias from mis-specified OLS models and of the fact that the standard errors of pooled OLS models fitted to panel data tend to be biased downward (i.e., coefficients are reported as significant when they really are not). This is why we often employ specialized models and standard-error calculations when developing inferential models with panel data.
Call: lm(formula = ln_salary ~ ., data = ols_data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.97816 -0.03898 -0.00714  0.02628  0.93676

Coefficients: (3 not defined because of singularities)
                          Estimate   Std. Error   t value  Pr(>|t|)
(Intercept)            11.97695601   0.00654029  1831.257  < 0.0000000000000002 ***
exp                     0.00891074   0.00018254    48.815  < 0.0000000000000002 ***
faculty_num             0.00003026   0.00001409     2.147  0.031784 *
prop                    0.00095131   0.00591617     0.161  0.872253
title2Chair            -0.34748015   0.00425471   -81.670  < 0.0000000000000002 ***
title2Dean             -0.25060472   0.00384791   -65.128  < 0.0000000000000002 ***
title2Director         -0.31738000   0.00365964   -86.724  < 0.0000000000000002 ***
title2Officer          -0.08285893   0.00663715   -12.484  < 0.0000000000000002 ***
title2Other            -0.39258564   0.00363614  -107.968  < 0.0000000000000002 ***
title2President         0.33445062   0.00726043    46.065  < 0.0000000000000002 ***
title2Professor        -0.48568005   0.00334882  -145.030  < 0.0000000000000002 ***
title2VP                        NA           NA        NA  NA
Employeralgonquin      -0.00675531   0.00506118    -1.335  0.181974
Employerbor             0.02785337   0.00572193     4.868  0.000001133689494218 ***
Employercambrian       -0.01084737   0.00518889    -2.090  0.036581 *
Employercanadore       -0.04193647   0.00577848    -7.257  0.000000000000403714 ***
Employercentennial     -0.02307922   0.00539250    -4.280  0.000018754230967445 ***
Employerconestoga      -0.01293048   0.00537236    -2.407  0.016096 *
Employerconfederation  -0.02018470   0.00561828    -3.593  0.000328 ***
Employerdurham         -0.01232889   0.00465736    -2.647  0.008121 **
Employerfanshawe       -0.01665469   0.00502575    -3.314  0.000921 ***
Employergeorge_brown   -0.01735081   0.00527045    -3.292  0.000996 ***
Employergeorgian       -0.00589012   0.00463284    -1.271  0.203601
Employerhumber         -0.00230076   0.00563600    -0.408  0.683110
Employerla_cit         -0.01393487   0.00515489    -2.703  0.006871 **
Employerlambton         0.00259233   0.00580607     0.446  0.655249
Employerloyalist       -0.00247317   0.00556199    -0.445  0.656572
Employermohawk         -0.01026809   0.00474299    -2.165  0.030403 *
Employerniagara        -0.02442646   0.00472810    -5.166  0.000000240289716232 ***
Employernorthern        0.00429772   0.00674020     0.638  0.523722
Employersault          -0.02090194   0.00558284    -3.744  0.000181 ***
Employerseneca         -0.00849113   0.00601890    -1.411  0.158330
Employersheridan       -0.01222221   0.00512407    -2.385  0.017073 *
Employersir_sandford   -0.02350343   0.00495487    -4.743  0.000002109676836765 ***
Employerst_clair       -0.02068232   0.00480314    -4.306  0.000016672995049545 ***
Employerst_lawrence             NA           NA        NA  NA
year2013                0.04198980   0.00437610     9.595  < 0.0000000000000002 ***
year2014                0.03402977   0.00416576     8.169  0.000000000000000323 ***
year2015                0.02001124   0.00407451     4.911  0.000000909095568558 ***
year2016                0.01970152   0.00381814     5.160  0.000000248451230866 ***
year2017                0.03128737   0.00614858     5.089  0.000000362855219694 ***
year2018                0.02206859   0.00318428     6.930  0.000000000004273788 ***
year2019                0.00741979   0.00297834     2.491  0.012735 *
year2020                0.01841651   0.00272609     6.756  0.000000000014463404 ***
year2021               -0.00168118   0.00269481    -0.624  0.532727
year2022               -0.01303475   0.00331774    -3.929  0.000085547415653697 ***
year2023                        NA           NA        NA  NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09452 on 31943 degrees of freedom
Multiple R-squared: 0.6565, Adjusted R-squared: 0.656
F-statistic: 1420 on 43 and 31943 DF, p-value: < 0.00000000000000022
The differently shaped icons show how the model coefficients differ. Model 1 is the baseline model.
The visualization shows that the simplified titles are the most prominent single salary indicator. All the salaries are compared to those of a professor (teaching staff). The default category ‘other’ includes admin staff and those job categories that were not obviously classifiable. Chairs and Directors make much more, followed by Deans and Officers of the college. The highest-paid positions belong to VPs and Presidents. The dummy variables used for years show some annual non-linear variation. Around the pandemic, we see a drop in average salary levels.
The expected annual salary increases with each additional year of worker experience (exp). This is a small effect, but it is multiplied by the number of years of service.
The size of the school, as measured by student count in thousands, does not represent a large change in any of the models despite being statistically significant. In other words, colleges with larger enrollments do not appear to pay much better than their smaller peers. Remember that regression models are interpreted as ‘holding everything else constant.’ So, people may still choose to work at larger colleges for other professional reasons – such as promotion opportunities – that are not available at smaller institutions.
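Because the outcome is log salary, the coefficients translate into approximate percentage differences once exponentiated. Using the Professor coefficient from the output above as an example:

# exp(coefficient) - 1 gives the proportional difference relative to the
# reference title, holding everything else constant
exp(-0.486) - 1   # roughly -0.385, i.e. about 38% lower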
Model 1 | |
---|---|
(Intercept) | 11.98 *** |
(0.01) | |
exp | 0.01 *** |
(0.00) | |
faculty_num | 0.00 * |
(0.00) | |
prop | 0.00 |
(0.01) | |
title2Chair | -0.35 *** |
(0.00) | |
title2Dean | -0.25 *** |
(0.00) | |
title2Director | -0.32 *** |
(0.00) | |
title2Officer | -0.08 *** |
(0.01) | |
title2Other | -0.39 *** |
(0.00) | |
title2President | 0.33 *** |
(0.01) | |
title2Professor | -0.49 *** |
(0.00) | |
Employerbor | 0.03 |
(0.01) | |
Employercambrian | -0.01 |
(0.01) | |
Employercanadore | -0.04 *** |
(0.01) | |
Employercentennial | -0.02 * |
(0.01) | |
Employerconestoga | -0.01 *** |
(0.01) | |
Employerconfederation | -0.02 *** |
(0.01) | |
Employerdurham | -0.01 * |
(0.01) | |
Employerfanshawe | -0.02 *** |
(0.00) | |
Employergeorge_brown | -0.02 ** |
(0.01) | |
Employergeorgian | -0.01 *** |
(0.01) | |
Employerhumber | -0.00 *** |
(0.00) | |
Employerla_cit | -0.01 |
(0.01) | |
Employerlambton | 0.00 |
(0.01) | |
Employerloyalist | -0.00 ** |
(0.01) | |
Employermohawk | -0.01 |
(0.01) | |
Employerniagara | -0.02 |
(0.00) | |
Employernorthern | 0.00 * |
(0.00) | |
Employersault | -0.02 *** |
(0.01) | |
Employerseneca | -0.01 |
(0.01) | |
Employersheridan | -0.01 *** |
(0.01) | |
Employersir_sandford | -0.02 |
(0.01) | |
Employerst_clair | -0.02 * |
(0.00) | |
year2013 | 0.04 *** |
(0.00) | |
year2015 | 0.02 *** |
(0.00) | |
year2016 | 0.02 |
(0.00) | |
year2017 | 0.03 *** |
(0.00) | |
year2018 | 0.02 *** |
(0.00) | |
year2019 | 0.01 *** |
(0.01) | |
year2020 | 0.02 *** |
(0.00) | |
year2021 | -0.00 *** |
(0.00) | |
year2022 | -0.01 *** |
(0.00) | |
N | 31987 |
R2 | 0.66 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Fixed-effects models are commonly used in panel data studies because they are consistent and do not suffer from the bias that misspecified pooled OLS models can have. The trade-off is that fixed-effects models generally have less explanatory power. Still, they remain the gold standard in inferential panel data analysis.
All models attempt to explain variation. Fixed-effects models focus on the variation within groups of observations, while between-effects regressions examine how the variation differs on average between groups. (Note: a Hausman test indicated that a random-effects model would not be justified.)
The FE model interpretation is quite similar to that of the OLS models: the simplified titles explain much of the variation in salaries, and experience also has a small but significant impact on salary.
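For reference, a sketch of how these panel specifications can be estimated with plm (the formula, variables, and options here are my assumptions, not the exact code behind the output shown below):

library(plm)
library(lmtest)

# Sketch: within (fixed-effects), between, and random-effects estimates of
# the same specification, plus a Hausman test
pdata <- pdata.frame(train, index = c("ID", "year"))
form  <- ln_salary ~ exp + title2 + faculty_num + prop

fe_mod <- plm(form, data = pdata, model = "within")   # fixed effects
be_mod <- plm(form, data = pdata, model = "between")  # between effects
re_mod <- plm(form, data = pdata, model = "random")   # random effects

# Hausman test: a significant result favours fixed effects over random effects
phtest(fe_mod, re_mod)

# Cluster-robust standard errors at the individual level
coeftest(fe_mod, vcov = vcovHC(fe_mod, cluster = "group"))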
GLM estimation, family = gaussian, Dep. Var.: ln_salary
Observations: 31,987
Standard-errors: Clustered (id)
                          Estimate  Std. Error     t value          Pr(>|t|)
(Intercept)            19.55049249    0.842405   23.207961         < 2.2e-16 ***
exp                     0.00879873    0.000515   17.078510         < 2.2e-16 ***
time                   -0.00391909    0.000419   -9.346523         < 2.2e-16 ***
Employerbor             0.03258740    0.010114    3.221930  0.00127836702481 **
Employercambrian       -0.00521839    0.006836   -0.763400  0.44524743786485
Employercanadore       -0.03690722    0.008047   -4.586457  0.00000457644643 ***
Employercentennial     -0.01605076    0.004698   -3.416313  0.00063789002589 ***
Employerconestoga      -0.00600457    0.005046   -1.189983  0.23408818075370
Employerconfederation  -0.01441197    0.009652   -1.493097  0.13545094775001
Employerdurham         -0.00611177    0.006623   -0.922806  0.35613572332678
Employerfanshawe       -0.00992555    0.005825   -1.704086  0.08840360522418 .
Employergeorge_brown   -0.00992237    0.005468   -1.814461  0.06964396947580 .
Employergeorgian        0.00000192    0.007210    0.000266  0.99978761121857
Employerhumber          0.00529262    0.006999    0.756232  0.44953218235334
Employerla_cit         -0.00878685    0.010601   -0.828871  0.40720223752840
Employerlambton         0.00807869    0.008512    0.949120  0.34258821081809
Employerloyalist        0.00250416    0.008760    0.285860  0.77499238978733
Employermohawk         -0.00372155    0.005238   -0.710528  0.47739713225573
Employerniagara        -0.01879562    0.006640   -2.830521  0.00465874316195 **
Employernorthern        0.00921564    0.018860    0.488633  0.62511485759151
Employersault          -0.01611276    0.007334   -2.196846  0.02805986120063 *
Employerseneca         -0.00104556    0.004998   -0.209193  0.83430284901327
Employersheridan       -0.00514874    0.004991   -1.031579  0.30230047845838
Employersir_sandford   -0.01760834    0.005756   -3.058983  0.00222819573772 **
Employerst_clair       -0.01496385    0.005248   -2.851533  0.00436197091870 **
Employerst_lawrence     0.00510284    0.008547    0.597021  0.55051022869671
title2Dean              0.09700256    0.006582   14.737211         < 2.2e-16 ***
title2Director          0.03013252    0.005607    5.373660  0.00000007931303 ***
title2Officer           0.26437349    0.028242    9.361041         < 2.2e-16 ***
title2Other            -0.04490423    0.006447   -6.965501  0.00000000000353 ***
title2President         0.68235261    0.060872   11.209694         < 2.2e-16 ***
title2Professor        -0.13798414    0.003996  -34.528713         < 2.2e-16 ***
title2VP                0.34768231    0.011890   29.241931         < 2.2e-16 ***
faculty_num             0.00002313    0.000013    1.762391  0.07804123725563 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log-Likelihood: 30,003.4   Adj. Pseudo R2: -1.30562
BIC: -59,654.1             Squared Cor.: 0.654657
Model 1 | |
---|---|
(Intercept) | 19.55 *** |
(0.84) | |
exp | 0.01 *** |
(0.00) | |
time | -0.00 *** |
(0.00) | |
Employerbor | 0.03 ** |
(0.01) | |
Employercambrian | -0.01 |
(0.01) | |
Employercanadore | -0.04 *** |
(0.01) | |
Employercentennial | -0.02 *** |
(0.00) | |
Employerconestoga | -0.01 |
(0.01) | |
Employerconfederation | -0.01 |
(0.01) | |
Employerdurham | -0.01 |
(0.01) | |
Employerfanshawe | -0.01 |
(0.01) | |
Employergeorge_brown | -0.01 |
(0.01) | |
Employergeorgian | 0.00 |
(0.01) | |
Employerhumber | 0.01 |
(0.01) | |
Employerla_cit | -0.01 |
(0.01) | |
Employerlambton | 0.01 |
(0.01) | |
Employerloyalist | 0.00 |
(0.01) | |
Employermohawk | -0.00 |
(0.01) | |
Employerniagara | -0.02 ** |
(0.01) | |
Employernorthern | 0.01 |
(0.02) | |
Employersault | -0.02 * |
(0.01) | |
Employerseneca | -0.00 |
(0.00) | |
Employersheridan | -0.01 |
(0.00) | |
Employersir_sandford | -0.02 ** |
(0.01) | |
Employerst_clair | -0.01 ** |
(0.01) | |
Employerst_lawrence | 0.01 |
(0.01) | |
title2Dean | 0.10 *** |
(0.01) | |
title2Director | 0.03 *** |
(0.01) | |
title2Officer | 0.26 *** |
(0.03) | |
title2Other | -0.04 *** |
(0.01) | |
title2President | 0.68 *** |
(0.06) | |
title2Professor | -0.14 *** |
(0.00) | |
title2VP | 0.35 *** |
(0.01) | |
faculty_num | 0.00 |
(0.00) | |
nobs | 31987 |
r.squared | |
adj.r.squared | |
within.r.squared | |
pseudo.r.squared | -1.31 |
sigma | |
nobs.1 | 31987.00 |
AIC | -59938.80 |
BIC | -59654.12 |
logLik | 30003.40 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Note that the adjusted pseudo R^2 reported above is negative; this can happen for within/fixed-effects style estimates (see https://stats.stackexchange.com/questions/444041/negative-adjusted-r2-in-twoway-effects-within-model).
The models show the importance of formal titles, events over time, and worker experience. It also seems that the demographic makeup of students (international vs. domestic) is not a significant factor in worker salaries across colleges.
The random forest is a tree-based model that uses bootstrap aggregation to combine the results of many decision trees, each built on a different bootstrap sample and considering different subsets of the explanatory variables. As a result, random forest models can indicate which variables provide the most explanatory power (summarized in their node purity scores).
The most important variables in the model to determine salary are the simplified position title, worker experience, the year, employer, the number of students/faculty, and, at the end, the proportion of international students.
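A sketch of the random forest fit and its importance ranking (using the randomForest package; the column selection and ntree value are placeholder assumptions):

library(dplyr)
library(randomForest)

# Sketch: fit a random forest on the modelling columns and rank variables
# by their contribution to node purity
rf_data <- train %>%
  select(ln_salary, exp, faculty_num, prop, student_count,
         title2, Employer, year) %>%
  na.omit()

set.seed(66664)
rf_mod <- randomForest(ln_salary ~ ., data = rf_data,
                       ntree = 500, importance = TRUE)

importance(rf_mod)   # %IncMSE and IncNodePurity per variable
varImpPlot(rf_mod)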
 [1] "ln_salary"             "exp"                   "faculty_num"
 [4] "prop"                  "title2Chair"           "title2Dean"
 [7] "title2Director"        "title2Officer"         "title2Other"
[10] "title2President"       "title2Professor"       "title2VP"
[13] "Employeralgonquin"     "Employerbor"           "Employercambrian"
[16] "Employercanadore"      "Employercentennial"    "Employerconestoga"
[19] "Employerconfederation" "Employerdurham"        "Employerfanshawe"
[22] "Employergeorge_brown"  "Employergeorgian"      "Employerhumber"
[25] "Employerla_cit"        "Employerlambton"       "Employerloyalist"
[28] "Employermohawk"        "Employerniagara"       "Employernorthern"
[31] "Employersault"         "Employerseneca"        "Employersheridan"
[34] "Employersir_sandford"  "Employerst_clair"      "Employerst_lawrence"
[37] "year2013"              "year2014"              "year2015"
[40] "year2016"              "year2017"              "year2018"
[43] "year2019"              "year2020"              "year2021"
[46] "year2022"              "year2023"
The difference between the predicted values and the actual salary in the test set shows the model’s errors. In this section, we examine the characteristics of the larger errors to get some sense of how we can improve the model. In particular, we examine cases where the predicted and actual differences are more than three standard deviations away from the average.
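The flagging step works roughly as follows (a sketch; rf_mod is the fitted predictive model from the random-forest sketch above, and test is the held-out split):

library(dplyr)

# Sketch: score the test set, then flag predictions that miss by more than
# three standard deviations of the prediction error
test_scored <- test %>%
  mutate(
    pred  = predict(rf_mod, newdata = test),
    error = ln_salary - pred
  )

outliers <- test_scored %>%
  filter(abs(error - mean(error)) > 3 * sd(error))

# Tabulations that feed the tables below
outliers %>% count(title2, sort = TRUE)
outliers %>% count(Employer, sort = TRUE)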
The histogram of the number of observations shows that the model tends to predict poorly for just a single period of a person’s work career. It could be that these observations are from people who have just appeared in the data once. There are comparatively few cases where a person’s salary was poorly predicted over two periods. (In a few cases, it seems the model did poorly for ten periods.)
The table relating the number of outliers by job title clearly shows that the ‘Other’ category provides less value to the model. A noticeable improvement would be adjusting the code that assigns simplified titles to provide somewhat more granularity. There may also be cause to revisit other simplified job titles that performed poorly. Of interest is the President’s job title. It seems likely that there is a fair amount of variation between colleges in terms of what Presidents and other high-level administrators are paid.
The table relating the number of outliers by college shows that the largest colleges in the province (Humber, Sheridan, Seneca, and George Brown) have many highly paid people on staff whom the model fails to predict well. More investigation is warranted to see whether there is a systematic relationship between college size and these prediction errors.
Title | Count |
---|---|
Other | 65 |
Director | 31 |
Officer | 30 |
Professor | 9 |
VP | 8 |
Dean | 7 |
Chair | 1 |
President | 1 |
Employer | Count |
---|---|
humber | 26 |
sheridan | 19 |
st_lawrence | 14 |
seneca | 13 |
georgian | 11 |
bor | 9 |
canadore | 8 |
george_brown | 8 |
la_cit | 8 |
algonquin | 7 |
loyalist | 6 |
centennial | 5 |
durham | 4 |
mohawk | 3 |
st_clair | 3 |
cambrian | 2 |
conestoga | 2 |
confederation | 2 |
fanshawe | 1 |
northern | 1 |
Gradient boosting is similar to random forests in that it also combines many decision trees into a single model. Random forests use bagging (bootstrap aggregation): many decision trees are grown independently and their predictions are averaged.
Gradient boosting machines also build an ensemble of shallow trees, but they do so in sequence, with each tree learning from and improving on the previous one. A shallow tree on its own is a weak predictive model because it underfits the training data, but many such trees can be “boosted” and treated as a committee to produce a robust model. When properly tuned, such models are often hard to beat with other algorithms.
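A sketch of such a boosted model with xgboost, reusing the rf_data frame from the random-forest sketch; the dummy coding mirrors the column listing shown earlier, and the hyperparameters are placeholders rather than tuned values:

library(xgboost)

# Sketch: dummy-code the predictors and fit a boosted-tree regressor
X_train <- model.matrix(ln_salary ~ . - 1, data = rf_data)
y_train <- rf_data$ln_salary

xgb_mod <- xgboost(data = X_train, label = y_train,
                   nrounds = 300, max_depth = 3, eta = 0.1,
                   objective = "reg:squarederror", verbose = 0)

# Gain-based variable importance for the boosted model
head(xgb.importance(model = xgb_mod), 10)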
The GBM also reports the most important variables. The simplified title (title2), work experience, and the year are again the top explanatory variables, followed by the measure of the student population, the college, and the proportion of Indian international students.
[17:35:28] WARNING: src/learner.cc:767: Parameters: { "rmse", "trees" } are not used.

RMSE for: OLS: 0.0966  FE: 0.0971  RF: 0.0948
The ensemble (equal weighted) model RMSE is: 0.0924
XGB RMSE is: 0.093
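For completeness, the equal-weighted ensemble reported above simply averages the individual predictions before computing RMSE. A sketch, reusing models from the earlier sketches (m3 stands in for the pooled OLS fit; the names and the choice of models in the average are assumptions):

# Sketch: equal-weighted ensemble of OLS, random forest, and XGBoost
# predictions on the held-out test set
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

X_test <- model.matrix(ln_salary ~ . - 1, data = test[, names(rf_data)])

pred_ols <- predict(m3,      newdata = test)
pred_rf  <- predict(rf_mod,  newdata = test)
pred_xgb <- predict(xgb_mod, newdata = X_test)

pred_ensemble <- (pred_ols + pred_rf + pred_xgb) / 3
rmse(test$ln_salary, pred_ensemble)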
The Ontario Sunshine List is a unique record of Ontario public-sector salaries that exceed $100,000. I used this data to construct both inferential and predictive models. By merging it with college enrollment data, I found that a simplified job title, the year, and an employee’s work experience go a long way toward explaining salaries.
The predictive tree-based models, including random forests and gradient boosting machines, predict salaries better than the linear models. Additional work on refining the simplified titles would likely improve the models further. Overall, an ensemble method narrowly outperformed any single model.