There have been many news stories about Ontario colleges. Concerns have grown that Ontario colleges have forgotten their core mandate of training skilled workers in favor of attracting international students who pay much higher tuition. Questions have been raised about the quality of education students now receive and the incentives of college administrators. In this analysis, I focus narrowly on the salaries paid by Ontario colleges. In particular:
- How much do highly paid workers in Ontario colleges make?
- What determines their salaries?
- Is it possible to predict college workers’ salaries?
I use the Ontario government’s Sunshine List of employees who make over $100,000 annually in this analysis. (This analysis focuses on Ontario colleges, which are distinct from Ontario universities.)
The first section describes the data from the Sunshine List, focusing on the years 2013 through 2023. The following section builds OLS, fixed-effects, and between-effects models for inference about the potential drivers of salary increases. These models help explain the factors associated with changes in college workers’ salaries.
Given the limited public information available, I then build more advanced predictive models on these insights to predict workers’ salaries. To test their quality, the data is split into a training set used to fit the models and a held-out test set used to verify the results. The quality of the predictions and areas for improvement are discussed.
I account for inflation using the CPI to calculate the yearly price relative to a single base year. This enables the analysis to compare people’s salaries across time validly.
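The deflation step can be sketched as follows. This is a minimal illustration with hypothetical CPI index values; the actual analysis uses Statistics Canada’s CPI series.

```python
# Hypothetical CPI index values for illustration (2002 = 100);
# the real analysis uses Statistics Canada's published CPI series.
CPI = {2013: 122.8, 2023: 156.4}

def to_2013_dollars(nominal: float, year: int, base: int = 2013) -> float:
    """Deflate a nominal salary to base-year (2013) dollars."""
    return nominal * CPI[base] / CPI[year]

# A $130,000 salary in 2023 is worth roughly $102,000 in 2013 dollars
# under these illustrative index values.
real_2023 = to_2013_dollars(130_000, 2023)
```

Dividing by the price level relative to the base year puts all salaries on the same purchasing-power footing, which is what makes cross-year comparisons valid.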
It is also important to remember that the Sunshine List only contains workers who make more than $100,000. This salary level is a nominal cut-off, so as inflation increases and the purchasing power of $100,000 decreases, we see an increasing number of people on the list over time. I have restricted the analysis from 2013 onwards to partially account for this, and I use 2013 as the base year when adjusting for inflation. After the adjustment, many workers’ salaries fall below $100,000 in 2013 dollars; these observations are retained in the panel because they still carry useful salary information.
I’ve created a panel of observations about workers at Ontario colleges. To accomplish this, I combined a series of annual Sunshine List data files into a panel of employees, which I followed over time. Observations of individuals are linked over time based on a common employer and the person’s first and last name.
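The linking rule can be sketched as a key built from name and employer. The column names below are assumptions for illustration, not the Sunshine List’s actual field names.

```python
# Sketch of the record-linkage rule: a person ID is derived from
# (first name, last name, employer). Field names are illustrative.
def person_id(first: str, last: str, employer: str) -> str:
    return "|".join(s.strip().lower() for s in (first, last, employer))

rows = [
    {"first": "Ada", "last": "Lovelace", "employer": "Seneca", "year": 2014},
    {"first": "ADA", "last": "Lovelace", "employer": "Seneca", "year": 2015},
]

# Both rows collapse to a single ID, so the person can be followed over time.
ids = {person_id(r["first"], r["last"], r["employer"]) for r in rows}
```

Normalizing case and whitespace before joining reduces spurious duplicate IDs from inconsistent data entry across years.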
A shortcoming of this approach is that I cannot easily capture people who changed jobs between colleges and whose salary in the transition years fell below $100,000. It also potentially misses women who changed their last name after marriage.
Below is a box plot of the distribution of salaries paid at Ontario colleges. The colored bands represent the mass of each distribution, and the dots are ‘outlying’ points; in this case, the dots represent employees who make significantly more than others at the same institution.
Note that I use 2013 as the inflation-adjusted year. Many workers who make nominally more than $100,000 in later years are making less than $100,000 in 2013 inflation-adjusted dollars. These workers are kept in our panel because they still provide helpful salary information.
Salaries in the data have a long right tail: a small number of college administrators earn very large amounts, while the mass of the distribution consists of teaching faculty and lower-paid professionals and administrators. A common approach to normalizing such a distribution is to log-transform salary. The transformation tends to improve the fit of the models we use, and the results can easily be exponentiated to express them in dollars again.
I have also merged data about each college’s student demographics to enrich the data. In particular, I focus on the two largest groups – Canadian students and students from India enrolled in Ontario colleges.
The final data transformation has to do with job titles. The Sunshine List contains job titles given by colleges, but there is no consistency in those titles. To simplify the variation, I have searched for common positions. For example, if the position title contains the word ‘professor,’ ‘lecturer,’ or ‘teacher,’ the person is given the simplified title ‘Professor’. If the title contains ‘vice-president of …’, the person is categorized as a ‘VP,’ and so on.
I have also merged the number of enrollments at each college over time. A prior analysis found that Canadian and Indian students make up these institutions’ two largest student groups.
Below is a summary of the variables that are available for examination.
Salaries by college in 2022 (in 2013 inflation-adjusted dollars)

| Employer | Median | SD | Max | N |
|---|---|---|---|---|
Humber | $94,620 | $22,366 | $449,086 | 796 |
Sheridan | $95,170 | $18,921 | $373,444 | 687 |
George Brown | $94,702 | $21,678 | $368,889 | 609 |
Seneca | $95,192 | $18,808 | $342,789 | 764 |
Conestoga | $95,082 | $21,201 | $332,908 | 577 |
Algonquin | $95,082 | $16,618 | $274,107 | 597 |
La Cité | $95,081 | $19,947 | $265,873 | 173 |
Sir Sandford | $94,880 | $16,679 | $249,764 | 230 |
Centennial | $95,155 | $17,786 | $247,264 | 545 |
Durham | $94,862 | $16,867 | $245,880 | 340 |
Fanshawe | $94,841 | $15,207 | $244,277 | 503 |
St Clair | $95,096 | $17,314 | $243,781 | 253 |
Sault | $95,171 | $19,413 | $233,358 | 109 |
Niagara | $95,082 | $16,409 | $231,887 | 352 |
Mohawk | $95,115 | $17,080 | $223,765 | 426 |
St Lawrence | $93,623 | $18,411 | $222,775 | 215 |
Lambton | $95,118 | $21,386 | $214,770 | 132 |
Northern | $95,269 | $19,737 | $212,716 | 80 |
Georgian | $94,821 | $17,566 | $212,587 | 340 |
Boréal | $95,095 | $19,861 | $212,430 | 103 |
Cambrian | $94,717 | $16,905 | $207,437 | 173 |
Canadore | $95,115 | $15,662 | $199,430 | 108 |
Loyalist | $97,085 | $18,443 | $187,978 | 124 |
Confederation | $94,517 | $14,644 | $184,300 | 132 |
The correlation plot shows the degree to which variables move together. Highly correlated variables should not appear in the same inferential regression model because collinearity distorts the size and significance of the coefficients, making interpretation unreliable.
There are expected relationships among the variables listed above because of how they are constructed or, in some cases, because they measure similar phenomena. For example, the number of Canadian students enrolled is correlated with the total count of all students. Similarly, the proportion of Indian students positively correlates with the time trend. We know from other analyses that Indian international students only began arriving at Ontario colleges towards the end of the panel.
It’s also unsurprising that the number of faculty positively correlates with the number of students enrolled. Of course, as a college gains more students, it requires more teachers.
The full dataset is divided into training and test sets. The training data is used to build the models; the held-out test data is used to assess the accuracy of predictive models fit on the training data.
As previously noted, individual IDs are assigned based on the individual’s first and last name and employer. This assumption means we cannot track people who switch between employers or change their names (for example, women who change their surname after marriage).
To create valid test and train data, I sample from all available person IDs and assign a portion to the train data set and a portion to test for final verification.
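Sampling at the person level, rather than the row level, keeps all of an individual’s yearly observations in the same partition and avoids leakage across years. A minimal sketch, with an assumed 80/20 split:

```python
import random

def split_ids(ids, test_frac=0.2, seed=42):
    """Split person IDs (not rows) into train and test sets so that
    every observation of a person lands in the same partition."""
    rng = random.Random(seed)
    ids = sorted(ids)          # deterministic base order
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    return set(ids[n_test:]), set(ids[:n_test])  # (train, test)

train_ids, test_ids = split_ids(range(100))
```

Rows are then assigned to train or test by looking up their person ID in these sets.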
Some code was written to capture cases of people who left a job at one college and switched to another. However, some switchers are missing from the data because their partial-year salary in the transition year falls below the $100,000 threshold for inclusion in the Sunshine List.
Manual and automated testing with OLS models showed that a relatively small subset of variables provided the most explanatory power. For simplicity, this work will focus on variations of these explanatory variables.
We start with OLS models because of their flexibility and ease of interpretation. The specification begins with dummies for each simplified title commonly used at colleges. The year is also entered as a dummy variable because we have no reason to believe that salaries increase linearly; with the worst pandemic years included in the panel, significant year-to-year salary changes seem quite likely. We also include ‘exp,’ which represents the worker’s years of experience at the college.
Two additional explanatory variables are tested. The first is the proportion (prop) of Indian international students enrolled. The other is a count of the number of students enrolled in the college.
Sometimes, the most straightforward specification is the best, and adding more explanatory variables does not add to the overall explanatory power of a model. To determine which of the possible specifications is the best, I use the log-likelihood test to compare different nested models.
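The log-likelihood (likelihood-ratio) test compares nested models by doubling the gap in their log-likelihoods. A minimal sketch, using illustrative log-likelihood values rather than results from the models in this analysis:

```python
def lr_statistic(ll_restricted: float, ll_full: float) -> float:
    """Likelihood-ratio statistic for nested models:
    LR = 2 * (logLik(full) - logLik(restricted)).
    Under the null it is chi-squared distributed, with degrees of
    freedom equal to the number of added parameters."""
    return 2.0 * (ll_full - ll_restricted)

# Illustrative values: the full model fits noticeably better.
stat = lr_statistic(ll_restricted=-1050.0, ll_full=-1032.5)
# Compare stat to the chi-squared critical value for the number of
# added parameters (e.g., 3.84 for 1 df at the 5% level).
```

A statistic well above the critical value means the extra variables significantly improve the fit, justifying the larger model.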
A series of log-likelihood tests indicate that the full model provides the best explanation of the variation in the data.
The following visualization shows how the model coefficients change with the addition of new variables. We can see that the models are relatively consistent – which is a good thing.
Ordinary least squares (OLS) models, which we have just examined, can be specified to control for some time variation by adding year dummies, but they still mix in between-variation (more on this below). OLS models are generally used on cross-sectional data; when applied to panel data, they are referred to as pooled OLS. Two risks deserve attention: misspecified OLS models can produce biased coefficients, and the standard errors of pooled OLS models on panel data tend to be biased downward (that is, coefficients appear significant when they really are not). This is why we often employ specialized models when developing inferential models with panel data.
The differently shaped icons show how the model coefficients differ. Model 1 is the baseline model. Model 4 is the optimal model that contains the proportion of Indian students in the specification in addition to the same explanatory variables in Model 1.
The visualization shows that the simplified titles are the strongest single indicators of salary. All coefficients are relative to a professor (teaching staff). The default category ‘Other’ includes admin staff and job categories that were not obviously classifiable. Chairs and Directors make considerably more, followed by Deans and Officers of the college. The highest-paid positions belong to VPs and Presidents. The year dummies show some non-linear annual variation; around the pandemic, average salary levels drop.
The expected annual salary increases with each additional year of worker experience (exp). This is a small effect, but it is multiplied by the number of years of service.
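Because the outcome is log salary, a coefficient of 0.01 on experience implies roughly a 1% raise per additional year, and the effect compounds over a career. A quick check of the arithmetic:

```python
import math

# With log(salary) as the outcome, a coefficient b implies a
# (exp(b) - 1) * 100 percent change in salary per unit of the regressor.
b_exp = 0.01  # coefficient on years of experience from the OLS table

pct_per_year = (math.exp(b_exp) - 1) * 100   # ~1% per year of experience

# Over 20 years of service the effect compounds multiplicatively:
salary_ratio_20y = math.exp(b_exp * 20)      # ~1.22x the starting level
```

So a seemingly small coefficient translates into a roughly 22% cumulative salary difference over two decades of service, holding everything else constant.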
The size of the school, as measured by student count in thousands, does not represent a large change in any of the models despite being statistically significant. In other words, colleges with larger enrollments do not appear to pay much better than their smaller peers. Remember that regression models are interpreted as ‘holding everything else constant.’ So, people may still choose to work at larger colleges for other professional reasons – such as promotion opportunities – that are not available at smaller institutions.
The proportion of Indian international students is negative, indicating that institutions with larger cohorts of Indian international students compared to Canadian students tend to pay people less. This could be interpreted as a desire by these institutions to maximize profits regardless of learning outcomes. But of course, we can only speculate why colleges with higher international enrollments tend to pay employees less, all other things being equal.
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
(Intercept) | 11.56 *** | 11.53 *** | 11.52 *** | 11.53 *** | 11.52 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Chair | 0.13 *** | 0.14 *** | 0.14 *** | 0.14 *** | 0.14 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Dean | 0.24 *** | 0.23 *** | 0.23 *** | 0.23 *** | 0.23 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Director | 0.17 *** | 0.17 *** | 0.17 *** | 0.17 *** | 0.17 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Officer | 0.46 *** | 0.44 *** | 0.45 *** | 0.44 *** | 0.45 *** |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | |
title2Other | 0.09 *** | 0.09 *** | 0.09 *** | 0.09 *** | 0.09 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2President | 0.88 *** | 0.84 *** | 0.84 *** | 0.84 *** | 0.84 *** |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | |
title2VP | 0.49 *** | 0.48 *** | 0.48 *** | 0.48 *** | 0.48 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2014 | -0.00 | -0.01 * | -0.01 ** | -0.01 * | -0.01 ** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2015 | -0.01 *** | -0.02 *** | -0.02 *** | -0.02 *** | -0.02 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2016 | -0.01 *** | -0.02 *** | -0.02 *** | -0.02 *** | -0.02 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2017 | 0.00 | -0.01 ** | -0.01 *** | -0.01 ** | -0.01 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2018 | -0.00 | -0.02 *** | -0.02 *** | -0.01 *** | -0.02 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2019 | -0.01 *** | -0.03 *** | -0.03 *** | -0.03 *** | -0.03 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2020 | 0.00 | -0.02 *** | -0.02 *** | -0.02 *** | -0.02 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2021 | -0.01 *** | -0.04 *** | -0.04 *** | -0.04 *** | -0.04 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2022 | -0.02 *** | -0.05 *** | -0.06 *** | -0.05 *** | -0.05 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
year2023 | -0.01 * | -0.03 *** | -0.04 *** | -0.03 *** | -0.04 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
exp | 0.01 *** | 0.01 *** | 0.01 *** | 0.01 *** | |
(0.00) | (0.00) | (0.00) | (0.00) | ||
student_count | 0.00 *** | 0.00 *** | |||
(0.00) | (0.00) | ||||
prop | -0.02 *** | -0.02 *** | |||
(0.00) | (0.00) | ||||
N | 31801 | 31801 | 31801 | 31801 | 31801 |
R2 | 0.62 | 0.65 | 0.65 | 0.65 | 0.65 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Fixed-effects models are commonly used in panel data studies because they are consistent and do not suffer the same bias that misspecified pooled OLS models can. The trade-off is that fixed-effects models generally have less explanatory power. Still, they remain the gold standard in inferential panel data analysis.
All models attempt to explain changes in variation. Fixed-effects models focus on the variation within groups of observations. In contrast, regression models called between-effects regressions examine how the variation differs on average between groups.
Below are four fixed effects (within) models and one between effects model. The similarities and differences are discussed below. (A Hausman test indicated that a random-effects model would not be justified.)
The model interpretation is quite similar to the OLS models. The simplified titles explain much of the variation in salaries. Experience also has a small but significant impact on salary. Interestingly, the proportion of Indian students is also negatively related to employee salary in the within (fixed-effects) models. The number of college students has a significant but small relationship with college employee salary.
The between model (Model 5) focuses on the average differences between groups. In this model, the Presidents and VPs of colleges have a much larger average relationship with salary. Similarly, the between model also emphasizes the difference in salaries on average between employees at colleges with larger international Indian student groups and those without.
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 |
|---|---|---|---|---|---|
title2Chair | 0.02 ** | 0.02 ** | 0.02 ** | 0.02 ** | 0.13 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.01) | |
title2Dean | 0.07 *** | 0.07 *** | 0.07 *** | 0.07 *** | 0.22 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Director | 0.05 *** | 0.06 *** | 0.05 *** | 0.05 *** | 0.15 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2Officer | 0.21 *** | 0.21 *** | 0.21 *** | 0.21 *** | 0.39 *** |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | |
title2Other | 0.04 *** | 0.04 *** | 0.04 *** | 0.04 *** | 0.07 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
title2President | 0.18 *** | 0.18 *** | 0.18 *** | 0.18 *** | 0.74 *** |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | |
title2VP | 0.16 *** | 0.16 *** | 0.16 *** | 0.16 *** | 0.48 *** |
(0.01) | (0.01) | (0.01) | (0.01) | (0.01) | |
exp | 0.00 *** | 0.00 *** | 0.00 *** | 0.00 *** | 0.01 *** |
(0.00) | (0.00) | (0.00) | (0.00) | (0.00) | |
prop | -0.01 ** | -0.02 ** | -0.05 *** | ||
(0.00) | (0.00) | (0.01) | |||
student_count | -0.00 | 0.00 | 0.00 ** | ||
(0.00) | (0.00) | (0.00) | |||
(Intercept) | 11.51 *** | ||||
(0.00) | |||||
nobs | 31801 | 31801 | 31801 | 31801 | 8052 |
r.squared | 0.09 | 0.09 | 0.09 | 0.09 | 0.65 |
adj.r.squared | -0.22 | -0.22 | -0.22 | -0.22 | 0.65 |
statistic | 291.39 | 260.15 | 259.06 | 234.23 | 1523.97 |
p.value | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
deviance | 75.04 | 75.01 | 75.04 | 75.01 | 51.27 |
df.residual | 23741.00 | 23740.00 | 23740.00 | 23739.00 | 8041.00 |
nobs.1 | 31801.00 | 31801.00 | 31801.00 | 31801.00 | 8052.00 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
Hausman Test

data: ln_salary ~ title2 + exp + prop
chisq = 3643.9, df = 9, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
The models show the importance of formal titles, events over time, and worker experience. The demographic makeup of students (international vs. domestic) also appears to be a significant factor in salary differences between colleges.
The random forest model is a tree-based model that uses bootstrap aggregation: it combines the results of many decision trees, each fit to a bootstrap sample of the data using a random subset of explanatory variables. A useful by-product is that random forest models can rank which variables provide the most explanatory power (summarized in their purity scores).
The most important variables in the model to determine salary are the simplified position title, worker experience, the year, employer, the number of students/faculty, and, at the end, the proportion of international students.
The difference between the predicted values and the actual salary in the test set shows the model’s errors. In this section, we examine the characteristics of the larger errors to get some sense of how we can improve the model. In particular, we examine cases where the predicted and actual differences are more than three standard deviations away from the average.
The histogram of the number of observations shows that the model tends to predict poorly for just a single period of a person’s work career. It could be that these observations are from people who have just appeared in the data once. There are comparatively few cases where a person’s salary was poorly predicted over two periods. (In a few cases, it seems the model did poorly for ten periods.)
The table relating the number of outliers by job title clearly shows that the ‘Other’ category provides less value to the model. A noticeable improvement would be adjusting the code that assigns simplified titles to provide somewhat more granularity. There may also be cause to revisit other simplified job titles that performed poorly. Of interest is the President’s job title. It seems likely that there is a fair amount of variation between colleges in terms of what Presidents and other high-level administrators are paid.
The table relating the number of outliers by college shows that the largest colleges in the province (Humber, Sheridan, Seneca, and George Brown) have many highly paid people on staff, which the model fails to predict well. More investigation is warranted to see if there is a close relationship here.
| Title | Count |
|---|---|
Other | 65 |
President | 29 |
Director | 25 |
Dean | 15 |
Officer | 11 |
Professor | 10 |
VP | 4 |
Gradient boosting is similar to random forests in that it also combines many decision trees into one model. Random forests use bagging, or bootstrap aggregation: many trees are grown independently and their predictions are aggregated to find the best result.
Gradient boosting machines also build an ensemble of trees, but they do so in sequence, with each shallow tree learning from and improving on the errors of the previous ones. A shallow tree is a weak predictive model because it underfits the training data. However, many such trees can be “boosted” into a robust model in which they act as a committee. When properly tuned, such models are often hard to beat with other algorithms.
The GBM also shows the most important variables. The simplified title (title2), work experience, and the year are again the top explanatory variables. The measure of the student population, the college, and the proportion of international Indian students follow.
Test-set RMSE by model:

| Model | RMSE |
|---|---|
| OLS | 0.0918 |
| FE | 0.1221 |
| BE | 0.0923 |
| RF | 0.0907 |
| XGB | 0.0910 |
| Ensemble (equal-weighted) | 0.0878 |
The Ontario Sunshine List is a unique resource on Ontario public servant salaries that exceed $100,000. I’ve used this data to construct both inferential and predictive models. By merging the data with college enrollment data, I found that a simplified job title, the year, and an employee’s work experience go a long way toward explaining salaries. Interestingly, colleges with a large proportion of international students from India tend to pay their staff less, all other things being equal.
The predictive tree-based models, including random forests and gradient boosting machines, predict salaries somewhat better than the inferential models. Additional work on more granular simplified titles would likely improve the models further. Overall, an equal-weighted ensemble outperformed any single model.