Introduction

There have been many news stories about Ontario colleges. Concerns have grown that Ontario colleges have forgotten their core mandate of training skilled workers in favor of attracting international students who pay much higher tuition. Questions have been raised about the quality of education students now receive and about the incentives of college administrators. In this analysis, I focus narrowly on the salaries paid by Ontario colleges.

This analysis uses the Ontario government’s Sunshine List of public employees who make over $100,000 annually. (It covers Ontario colleges, which are distinct from Ontario universities.)

The first section describes the data from the Sunshine List, focusing on the years 2013 through 2023. The following section builds OLS, fixed-effects, and between-effects models to draw inferences about the factors related to changes in college workers’ salaries.

Given the limited public information available, I then build predictive models on these insights to predict workers’ salaries. To test the quality of these models, the data is split into a training set used to fit the models and a held-out test set used to verify the results. The quality of the predictions and areas for improvement are discussed.

Data

Accounting for Inflation in Salaries

I account for inflation by using the CPI to express each year’s prices relative to a single base year. This makes people’s salaries validly comparable across time.

It is also important to remember that the Sunshine List only contains workers who make more than $100,000. This salary level is a nominal cut-off, so as inflation erodes the purchasing power of $100,000, an increasing number of people appear on the list over time. To partially account for this, I have restricted the analysis to 2013 onwards, and I use 2013 as the base year for the inflation adjustment. After adjusting for inflation, many workers’ salaries fall below $100,000 in 2013 dollars; these observations are nonetheless retained in the panel.
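As a sketch of the deflation step (the CPI values below are illustrative placeholders, not official Statistics Canada figures), a nominal salary is scaled by the ratio of the base-year CPI to the observation year’s CPI:

```python
# Deflate nominal salaries to 2013 dollars using CPI ratios. The CPI values
# below are illustrative placeholders, not official Statistics Canada figures.
cpi = {2013: 100.0, 2018: 107.0, 2023: 122.0}  # hypothetical index, 2013 = 100

def to_2013_dollars(nominal_salary, year, base_year=2013):
    """Express a nominal salary in base-year dollars."""
    return nominal_salary * cpi[base_year] / cpi[year]

# A $110,000 salary in 2023 buys less than $110,000 did in 2013.
real = to_2013_dollars(110_000, 2023)
```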

Data collection

I’ve created a panel of observations about workers at Ontario colleges. To accomplish this, I combined a series of annual Sunshine List data files into a panel of employees, which I followed over time. Observations of individuals are linked over time based on a common employer and the person’s first and last name.

The shortcomings of this approach are that I cannot easily capture people who have changed jobs between different colleges and whose salary in those transition years is below $100,000. It also potentially misses women who have changed their last name after marriage.
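The linkage rule described above can be sketched in a few lines of pandas; the column names here are assumptions for illustration, not the Sunshine List’s actual headers:

```python
import pandas as pd

# Link annual Sunshine List rows into a panel: a person ID is the combination
# of first name, last name, and employer. Column names are illustrative.
rows = pd.DataFrame({
    "first_name": ["Ada", "Ada", "Bob"],
    "last_name":  ["Smith", "Smith", "Jones"],
    "employer":   ["Humber", "Humber", "Seneca"],
    "year":       [2013, 2014, 2013],
    "salary":     [120_000, 123_000, 101_000],
})

key_cols = ["first_name", "last_name", "employer"]
# ngroup() assigns one integer ID per unique (name, employer) combination,
# so the same person observed in different years shares an ID.
rows["person_id"] = rows.groupby(key_cols, sort=False).ngroup()
```

As noted above, this rule cannot follow someone who changes employer or surname; such a person simply receives a new ID.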

Salaries at Ontario Colleges

Below is a box plot of the salaries paid at Ontario colleges. The colored bands represent the mass of the distribution, and the dots are ‘outlying’ points: employees who make significantly more than others in their organization.

Note that I use 2013 as the base year for inflation adjustment. Many workers who nominally make more than $100,000 in later years make less than $100,000 in 2013 inflation-adjusted dollars. These workers are kept in the panel because they still provide helpful salary information.

Box Plot of Salaries

Changes to the Data

Salaries in the data have a long right tail: a small number of college administrators make very large amounts of money, while the mass of the distribution comprises teaching faculty and lower-paid professional staff. A common approach to normalizing such a distribution is to log-transform salary. This transformation tends to improve the fit of the models we use, and we can easily exponentiate the results to read them in dollars again.
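As a small illustration of why the log scale is convenient (the numbers are made up): a coefficient b on a dummy variable in a log-salary model corresponds to roughly an exp(b) − 1 proportional salary difference.

```python
import numpy as np

# Toy salaries with a long right tail.
salary = np.array([95_000.0, 120_000.0, 450_000.0])
ln_salary = np.log(salary)       # transform for modelling
recovered = np.exp(ln_salary)    # exponentiate to read results in dollars again

# A hypothetical coefficient of 0.14 on a title dummy implies roughly a
# 15% salary premium over the baseline category: exp(0.14) - 1 ~ 0.15.
premium = np.exp(0.14) - 1
```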

I have also merged data about each college’s student demographics to enrich the data. In particular, I focus on the two largest groups – Canadian students and students from India enrolled in Ontario colleges.

The final data transformation has to do with job titles. The Sunshine List contains job titles as given by colleges, but there is no consistency in those titles. To simplify the variation, I have searched for common positions. For example, if a position title contains the word ‘professor,’ ‘lecturer,’ or ‘teacher,’ the person is given the simplified title of ‘Professor’; if the title contains ‘vice-president of …’, they are categorized as a ‘VP,’ etc.
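A minimal sketch of this keyword-matching step (the exact patterns are assumptions for illustration; the real rules may differ):

```python
import re

# Map messy college job titles to a handful of simplified categories.
# Order matters: 'vice-president' must be checked before 'president'.
# These patterns are illustrative assumptions, not the analysis's actual rules.
RULES = [
    (r"professor|lecturer|teacher", "Professor"),
    (r"vice[- ]president", "VP"),
    (r"\bpresident\b", "President"),
    (r"\bdean\b", "Dean"),
    (r"\bdirector\b", "Director"),
    (r"\bchair\b", "Chair"),
]

def simplify_title(raw_title):
    """Return the first matching simplified title, or 'Other'."""
    t = raw_title.lower()
    for pattern, simple in RULES:
        if re.search(pattern, t):
            return simple
    return "Other"
```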

College Enrollments

I have also merged data about the number of enrollments in each college over time. My prior analysis found that Canadians and Indians comprise these institutions’ two largest groups of students.

Summary Statistics

Below is a summary of the variables that are available for examination.

  • title2 is the simplified title given to a college worker, denoting their approximate position in the organization.
  • employer is the name of the college that a given employee works for.
  • year is the financial year in which a given Ontario college worker’s name, salary, and employer were recorded on the Sunshine List. Years are used mainly as factors in this analysis, but also as a trend measure in some earlier models.
  • student_count is the number of students enrolled in the college for a given year. For simplicity, this is the sum of all Canadian and Indian students; together, these two groups make up the vast majority of students during 2013-2023.
  • enroll_canadian and enroll_indian are the numbers of students listed as Canadian or Indian in the college enrollment data.
  • prop is the proportion of Indian international students enrolled: the number of enrolled Indian international students divided by the sum of Indian and Canadian students enrolled.
  • is_prof is an indicator set for each observation where the person is listed as a lecturer, professor, or teacher.
  • faculty_num is the total number of faculty listed on the Sunshine List for a given college (i.e., the number of workers listed as professors at that college). This can be used as a rough guide to the school’s teaching faculty size.
  • exp captures a person’s experience: the number of years the person has been employed at their college.
  • ln_salary is the natural log-transformed salary of a college employee.
  • salary is the annual salary paid to a college employee.
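The derived enrollment variables can be constructed as below; the column names follow the list above, while the numbers are invented for illustration:

```python
import pandas as pd

# One row per college-year; enrollment counts are made-up illustrative values.
college_year = pd.DataFrame({
    "employer":        ["Humber", "Humber"],
    "year":            [2022, 2023],
    "enroll_canadian": [20_000, 19_000],
    "enroll_indian":   [5_000, 9_000],
})

# student_count: sum of the two largest student groups.
college_year["student_count"] = (
    college_year["enroll_canadian"] + college_year["enroll_indian"]
)
# prop: share of Indian international students among the two groups.
college_year["prop"] = (
    college_year["enroll_indian"] / college_year["student_count"]
)
```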

Table of Salaries in Ontario Colleges, 2022

Salaries by College in 2022, in 2013 inflation-adjusted dollars
Employer Median SD Max N
Humber $94,620 $22,366 $449,086 796
Sheridan $95,170 $18,921 $373,444 687
George Brown $94,702 $21,678 $368,889 609
Seneca $95,192 $18,808 $342,789 764
Conestoga $95,082 $21,201 $332,908 577
Algonquin $95,082 $16,618 $274,107 597
La Cité $95,081 $19,947 $265,873 173
Sir Sandford $94,880 $16,679 $249,764 230
Centennial $95,155 $17,786 $247,264 545
Durham $94,862 $16,867 $245,880 340
Fanshawe $94,841 $15,207 $244,277 503
St Clair $95,096 $17,314 $243,781 253
Sault $95,171 $19,413 $233,358 109
Niagara $95,082 $16,409 $231,887 352
Mohawk $95,115 $17,080 $223,765 426
St Lawrence $93,623 $18,411 $222,775 215
Lambton $95,118 $21,386 $214,770 132
Northern $95,269 $19,737 $212,716 80
Georgian $94,821 $17,566 $212,587 340
Boréal $95,095 $19,861 $212,430 103
Cambrian $94,717 $16,905 $207,437 173
Canadore $95,115 $15,662 $199,430 108
Loyalist $97,085 $18,443 $187,978 124
Confederation $94,517 $14,644 $184,300 132

Correlations

The correlation plot shows the degree to which variables move together. Highly correlated variables should not appear in the same inferential regression model: multicollinearity distorts the size and significance of the coefficients, making interpretation unreliable.

There are expected relationships between the previously listed variables because of how they are constructed or, in some cases, because they measure similar phenomena. For example, the number of Canadian students enrolled is correlated with the total count of all students. Similarly, the proportion of Indian students positively correlates with the time trend: we know from other analyses that Indian international students only began to arrive in Ontario colleges towards the end of our panel.

It’s also unsurprising that the number of faculty positively correlates with the number of students enrolled. Of course, as a college gains more students, it requires more teachers.

Creating Training and Testing Data

The full dataset is divided into training and test data. The training data is used to build the models; the held-out test data is used to assess the accuracy of the resulting predictions.

As previously noted, individual IDs are assigned based on the individual’s first and last name and employer. This assumption means we cannot track people who switch between employers or change their names (e.g., women who change their surname after marriage).

To create valid test and train data, I sample from all available person IDs and assign a portion to the train data set and a portion to test for final verification.
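A sketch of the ID-level split (hypothetical data; the point is that the split is over person IDs, not rows, so a person’s whole history lands in one partition):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy panel: several people, observed over multiple years.
panel = pd.DataFrame({
    "person_id": [0, 0, 1, 1, 2, 3, 3, 3],
    "year":      [2013, 2014, 2013, 2014, 2015, 2013, 2014, 2015],
})

# Sample person IDs (not rows) into the test set, so that all observations
# for a given person land in the same partition and no history leaks across.
ids = panel["person_id"].unique()
test_ids = set(rng.choice(ids, size=max(1, len(ids) // 5), replace=False))

train = panel[~panel["person_id"].isin(test_ids)]
test = panel[panel["person_id"].isin(test_ids)]
```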

Job Movers - Looking For People Who Moved Between Colleges

Some code was written to capture cases of people who left a job at one college and switched to another. However, there can be cases where people switch and are missing from the data because their salary in the transition year falls below the $100,000 threshold for inclusion on the Sunshine List.

Inferential Models

Manual and automated testing with OLS models showed that a relatively small subset of variables provided the most explanatory power. For simplicity, this work will focus on variations of these explanatory variables.

OLS Model Specifications

We start with OLS models because of their flexibility and ease of interpretation. The base specification uses dummies for each simplified title commonly used at colleges. Year is also entered as a set of dummy variables because we have no reason to believe that salaries increase linearly; with the worst of the pandemic years in our panel, significant year-to-year salary changes seem quite likely. We also include exp, which represents the years of experience of workers at the college.

Two other additional explanatory variables are tested. The first is the proportion (prop) of Indian international students enrolled. The other is a count of the number of students enrolled in the college.

Sometimes, the most straightforward specification is the best, and adding more explanatory variables does not add to the overall explanatory power of a model. To determine which specification is best, I use likelihood-ratio tests to compare the nested models.
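On simulated data, the likelihood-ratio comparison of two nested OLS models looks like this (for Gaussian errors, the statistic reduces to n·log(RSS_restricted / RSS_full), compared against a chi-squared critical value):

```python
import numpy as np

# Simulate data where a second regressor genuinely matters.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

ones = np.ones(n)
rss_restricted = rss(np.column_stack([ones, x1]), y)       # drops x2
rss_full = rss(np.column_stack([ones, x1, x2]), y)          # includes x2

# LR statistic ~ chi-squared(1) under the null that x2 adds nothing;
# 3.84 is the 5% critical value for one extra parameter.
lr_stat = n * np.log(rss_restricted / rss_full)
reject_null = lr_stat > 3.84
```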

A series of likelihood-ratio tests indicates that the full model provides the best explanation of the variation in the data.

The following visualization shows how the model coefficients change with the addition of new variables. We can see that the models are relatively consistent – which is a good thing.

Ordinary least squares (OLS) models, which we have just looked at, can control for common shocks over time by adding year dummy variables, but they still mix within- and between-group variation (more on this below). OLS models are generally used on cross-sectional data; when applied to panel data, they are referred to as pooled OLS. One needs to be aware of the risk of bias from misspecified OLS models, and of the fact that the standard errors of pooled OLS models on panel data tend to be biased downward (i.e., coefficients appear significant when they really are not). This is why we often employ specialized models when developing inferential models with panel data.

OLS Model Visualizations

The differently shaped icons show how the model coefficients differ. Model 1 is the baseline model. Model 4 is the optimal model that contains the proportion of Indian students in the specification in addition to the same explanatory variables in Model 1.

The visualization shows that the simplified titles are the most prominent single salary indicator. All the salaries are compared to those of a professor (teaching staff). The default category ‘other’ includes admin staff and those job categories that were not obviously classifiable. Chairs and Directors make much more, followed by Deans and Officers of the college. The highest-paid positions belong to VPs and Presidents. The dummy variables used for years show some annual non-linear variation. Around the pandemic, we see a drop in average salary levels.

The expected annual salary increases with each additional year of worker experience (exp). This is a small effect, but it is multiplied by the number of years of service.

The size of the school, as measured by student count in thousands, does not represent a large change in any of the models despite being statistically significant. In other words, colleges with larger enrollments do not appear to pay much better than their smaller peers. Remember that regression models are interpreted as ‘holding everything else constant.’ So, people may still choose to work at larger colleges for other professional reasons – such as promotion opportunities – that are not available at smaller institutions.

The coefficient on the proportion of Indian international students is negative, indicating that institutions with larger cohorts of Indian international students relative to Canadian students tend to pay people less. This could be interpreted as a desire by these institutions to maximize profits regardless of learning outcomes. But of course, we can only speculate why colleges with higher international enrollments tend to pay employees less, all other things being equal.

OLS Model Full Results

                  Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)       11.56***   11.53***   11.52***   11.53***   11.52***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Chair       0.13***    0.14***    0.14***    0.14***    0.14***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Dean        0.24***    0.23***    0.23***    0.23***    0.23***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Director    0.17***    0.17***    0.17***    0.17***    0.17***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Officer     0.46***    0.44***    0.45***    0.44***    0.45***
                  (0.01)     (0.01)     (0.01)     (0.01)     (0.01)
title2Other       0.09***    0.09***    0.09***    0.09***    0.09***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2President   0.88***    0.84***    0.84***    0.84***    0.84***
                  (0.01)     (0.01)     (0.01)     (0.01)     (0.01)
title2VP          0.49***    0.48***    0.48***    0.48***    0.48***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2014          -0.00      -0.01*     -0.01**    -0.01*     -0.01**
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2015          -0.01***   -0.02***   -0.02***   -0.02***   -0.02***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2016          -0.01***   -0.02***   -0.02***   -0.02***   -0.02***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2017          0.00       -0.01**    -0.01***   -0.01**    -0.01***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2018          -0.00      -0.02***   -0.02***   -0.01***   -0.02***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2019          -0.01***   -0.03***   -0.03***   -0.03***   -0.03***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2020          0.00       -0.02***   -0.02***   -0.02***   -0.02***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2021          -0.01***   -0.04***   -0.04***   -0.04***   -0.04***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2022          -0.02***   -0.05***   -0.06***   -0.05***   -0.05***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
year2023          -0.01*     -0.03***   -0.04***   -0.03***   -0.04***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
exp                          0.01***    0.01***    0.01***    0.01***
                             (0.00)     (0.00)     (0.00)     (0.00)
student_count                           0.00***               0.00***
                                        (0.00)                (0.00)
prop                                               -0.02***   -0.02***
                                                   (0.00)     (0.00)
N                 31801      31801      31801      31801      31801
R2                0.62       0.65       0.65       0.65       0.65
*** p < 0.001; ** p < 0.01; * p < 0.05.

Fixed-Effects Model Specifications

Fixed-effects models are commonly used in panel data studies because they are consistent and don’t have the same bias that misspecified pooled OLS models can have. The trade-off is that fixed-effect models generally have less explanatory power. Still, they remain the gold standard in inferential panel data analysis.

All models attempt to explain changes in variation. Fixed-effects models focus on the variation within groups of observations. In contrast, regression models called between-effects regressions examine how the variation differs on average between groups.

Below are four fixed effects (within) models and one between effects model. The similarities and differences are discussed below. (A Hausman test indicated that a random-effects model would not be justified.)
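To make the within/between distinction concrete, here is a hand-rolled sketch on a toy two-person panel (a real analysis would use a dedicated panel-data package): the within estimator demeans each variable by person, so only variation over time within a person identifies the slope, while the between estimator regresses person-level means on person-level means.

```python
import pandas as pd

# Toy panel: two people, three years each; numbers are made up.
panel = pd.DataFrame({
    "person_id": [0, 0, 0, 1, 1, 1],
    "exp":       [1, 2, 3, 5, 6, 7],
    "ln_salary": [11.50, 11.51, 11.52, 11.60, 11.61, 11.62],
})
g = panel.groupby("person_id")

# Within (fixed effects): subtract each person's mean, which removes all
# time-invariant differences between people.
within = panel[["exp", "ln_salary"]] - g[["exp", "ln_salary"]].transform("mean")
beta_within = float(
    (within["exp"] * within["ln_salary"]).sum() / (within["exp"] ** 2).sum()
)

# Between: collapse to one observation per person (the person means).
means = g[["exp", "ln_salary"]].mean()
bx = means["exp"] - means["exp"].mean()
by = means["ln_salary"] - means["ln_salary"].mean()
beta_between = float((bx * by).sum() / (bx ** 2).sum())
```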

The model interpretation is quite similar to the OLS models. The simplified titles explain much of the variation in salaries. Experience also has a small but significant impact on salary. Interestingly, the proportion of Indian students is also negatively related to employee salary in the within-model. The number of college students has a significant but small relationship with college employee salary.

The between model (Model 5) focuses on the average differences between groups. In this model, the Presidents and VPs of colleges have a much larger average relationship with salary. Similarly, the between model also emphasizes the difference in salaries on average between employees at colleges with larger international Indian student groups and those without.

Fixed-Effects Visualizations

Fixed Effects Full Results

                  Model 1    Model 2    Model 3    Model 4    Model 5
title2Chair       0.02**     0.02**     0.02**     0.02**     0.13***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.01)
title2Dean        0.07***    0.07***    0.07***    0.07***    0.22***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Director    0.05***    0.06***    0.05***    0.05***    0.15***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2Officer     0.21***    0.21***    0.21***    0.21***    0.39***
                  (0.01)     (0.01)     (0.01)     (0.01)     (0.01)
title2Other       0.04***    0.04***    0.04***    0.04***    0.07***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
title2President   0.18***    0.18***    0.18***    0.18***    0.74***
                  (0.01)     (0.01)     (0.01)     (0.01)     (0.01)
title2VP          0.16***    0.16***    0.16***    0.16***    0.48***
                  (0.01)     (0.01)     (0.01)     (0.01)     (0.01)
exp               0.00***    0.00***    0.00***    0.00***    0.01***
                  (0.00)     (0.00)     (0.00)     (0.00)     (0.00)
prop                         -0.01**               -0.02**    -0.05***
                             (0.00)                (0.00)     (0.01)
student_count                           -0.00      0.00       0.00**
                                        (0.00)     (0.00)     (0.00)
(Intercept)                                                   11.51***
                                                              (0.00)
nobs              31801      31801      31801      31801      8052
r.squared         0.09       0.09       0.09       0.09       0.65
adj.r.squared     -0.22      -0.22      -0.22      -0.22      0.65
statistic         291.39     260.15     259.06     234.23     1523.97
p.value           0.00       0.00       0.00       0.00       0.00
deviance          75.04      75.01      75.04      75.01      51.27
df.residual       23741      23740      23740      23739      8041
*** p < 0.001; ** p < 0.01; * p < 0.05.
Hausman Test

data:  ln_salary ~ title2 + exp + prop
chisq = 3643.9, df = 9, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Summary - Inferential Models

The models show the importance of formal titles, events over time, and worker experience. The demographic makeup of students (international vs. domestic) also appears to be a significant factor in salary differences between colleges.

Models for Prediction

Random Forest Model

The random forest model is a tree-based model that uses bootstrap aggregation (bagging) to combine the results of many decision trees, each built on a different bootstrap sample and using random subsets of the explanatory variables. A useful byproduct is that random forest models can rank which variables provide the most explanatory power (summarized in their node-purity scores).
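As a toy illustration of the importance ranking (simulated data, not the actual model run): a random forest fit with scikit-learn, where a ‘title rank’ feature drives most of the signal, should place that feature first.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated log-salary data: a title-like ordinal feature carries most of
# the signal, experience a little, and the third feature is pure noise.
rng = np.random.default_rng(1)
n = 1000
title_rank = rng.integers(0, 5, size=n).astype(float)  # stand-in for title2
exp = rng.uniform(0, 30, size=n)
noise_feature = rng.normal(size=n)
y = 11.5 + 0.15 * title_rank + 0.01 * exp + rng.normal(scale=0.02, size=n)

X = np.column_stack([title_rank, exp, noise_feature])
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; title_rank should dominate.
importances = rf.feature_importances_
```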

Variable Importance

The most important variables in the model to determine salary are the simplified position title, worker experience, the year, employer, the number of students/faculty, and, at the end, the proportion of international students.

Inspect Failed Predictions

The difference between the predicted values and the actual salary in the test set shows the model’s errors. In this section, we examine the characteristics of the larger errors to get some sense of how we can improve the model. In particular, we examine cases where the predicted and actual differences are more than three standard deviations away from the average.
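The three-standard-deviation rule can be sketched as follows, with simulated residuals standing in for the model’s actual test-set errors:

```python
import numpy as np

# Simulated prediction errors (predicted - actual, on the log scale):
# mostly small, with one grossly mispredicted observation planted at index 0.
rng = np.random.default_rng(0)
errors = rng.normal(scale=0.05, size=200)
errors[0] = 1.5

# Standardize the errors and flag anything beyond three standard deviations
# from the mean error.
z = (errors - errors.mean()) / errors.std()
outlier_idx = np.flatnonzero(np.abs(z) > 3.0)
```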

The histogram of the number of observations shows that the model tends to predict poorly for just a single period of a person’s work career. It could be that these observations are from people who have just appeared in the data once. There are comparatively few cases where a person’s salary was poorly predicted over two periods. (In a few cases, it seems the model did poorly for ten periods.)

The table relating the number of outliers by job title clearly shows that the ‘Other’ category provides less value to the model. A noticeable improvement would be adjusting the code that assigns simplified titles to provide somewhat more granularity. There may also be cause to revisit other simplified job titles that performed poorly. Of interest is the President’s job title. It seems likely that there is a fair amount of variation between colleges in terms of what Presidents and other high-level administrators are paid.

The table relating the number of outliers by college shows that the largest colleges in the province (Humber, Sheridan, Seneca, and George Brown) have many highly paid people on staff, which the model fails to predict well. More investigation is warranted to see if there is a close relationship here.

Number Of Outliers By Job Title
Title          Count
Other          65
President      29
Director       25
Dean           15
Officer        11
Professor      10
VP             4

Gradient Boosting Machine (GBM)

Gradient boosting is similar to random forests in that it also combines many decision trees into one model. Random forests use bagging (bootstrap aggregation): many independently grown trees are averaged to produce the final prediction.

Gradient boosting machines also build an ensemble of shallow trees, but they do so in sequence, with each tree learning from and improving on the previous ones. A shallow tree on its own is a weak predictive model because it captures only limited structure in the data (it underfits). However, many such trees can be “boosted” into a robust model in which they act as a committee. When properly tuned, such models are often hard to beat with other algorithms.
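A small scikit-learn illustration of why boosting shallow trees works (a stand-in for the XGBoost model used here): a single depth-2 tree underfits, while a boosted sequence of depth-2 trees fits the same data far better.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Simulated data with non-linear structure that one shallow tree cannot capture.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# One shallow tree (a 'weak learner') vs. 300 boosted shallow trees.
shallow = DecisionTreeRegressor(max_depth=2).fit(X, y)
gbm = GradientBoostingRegressor(
    n_estimators=300, max_depth=2, learning_rate=0.1, random_state=0
).fit(X, y)

mse_shallow = float(np.mean((shallow.predict(X) - y) ** 2))
mse_gbm = float(np.mean((gbm.predict(X) - y) ** 2))
```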

Variable Importance

The GBM also shows the most important variables. The simplified title (title2), work experience, and the year are again the top explanatory variables. The measure of the student population, the college, and the proportion of international Indian students follows.


Testing Prediction Results

RMSE on the test data:

  • OLS: 0.0918
  • FE: 0.1221
  • RF: 0.0907
  • BE: 0.0923
  • XGB: 0.09099
  • Ensemble (equal-weighted): 0.0878
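The equal-weighted ensemble is just the average of the individual models’ predictions. A sketch with simulated predictions (not the real model outputs) shows why averaging models with roughly independent errors can beat each model individually:

```python
import numpy as np

# Simulated truth and two models whose errors are independent of each other.
rng = np.random.default_rng(3)
actual = rng.normal(11.5, 0.2, size=500)
pred_a = actual + rng.normal(scale=0.10, size=500)  # stand-in for OLS
pred_b = actual + rng.normal(scale=0.10, size=500)  # stand-in for RF

def rmse(pred, actual):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Equal-weighted ensemble: averaging roughly independent errors shrinks them
# by about a factor of sqrt(2).
ensemble = (pred_a + pred_b) / 2
rmse_ens = rmse(ensemble, actual)
```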

Conclusions

The Ontario Sunshine List is a unique record of Ontario public servants’ salaries that exceed $100,000. I’ve used this data to construct both inferential and predictive models. By merging it with college enrollment data, I found that a simplified job title, the year, and an employee’s work experience go a long way toward explaining salaries. Interestingly, colleges with a large proportion of international students from India tend to pay their staff less, all other things being equal.

The predictive tree-based models, including the random forest and gradient boosting machines, outperform the inferential models at predicting salaries. Refining the simplified job titles would likely improve the models further. Overall, an ensemble method outperformed any single model.