Introduction

There have been many news stories about Ontario colleges. Concerns have grown that Ontario colleges have forgotten their core mandate of training skilled workers in favor of attracting international students who pay much higher tuition. Questions have been raised about the quality of education students now receive and the incentives of college administrators. In this analysis, I focus narrowly on the salaries paid by Ontario colleges. In particular:

I use the Ontario government’s Sunshine List of employees who make over $100,000 annually in this analysis. (This analysis focuses on Ontario colleges, which are distinct from Ontario universities.)

The first section describes the data from the Sunshine List, focusing on the years between 2012 and 2023. The following section builds OLS, fixed effects, and between-effects models for inference about the potential causes of salary increases. These models will help explain the factors related to changes in college workers’ salaries.

Given the limited public information available, advanced predictive models build on these insights to predict workers’ salaries. To test the quality of these models, the data is split into a training component to train predictive models and test data to verify the results. The quality of the predictions and areas for improvement are discussed.

Data

Accounting for Inflation in Salaries

I account for inflation using the CPI to calculate the yearly price relative to a single base year. This enables the analysis to compare people’s salaries across time validly.

It is also important to remember that the Sunshine List only contains workers who make more than $100,000. This salary level is a nominal cut-off, so as inflation increases and the purchasing power of $100,000 decreases, we see an increasing number of people on the list over time. I have restricted the analysis from 2013 onwards to partially account for this. I also use 2013 as the base year to account for inflation. After adjusting for inflation, workers who make less than $100,000 are removed from the panel.

# -----------------------
# Read All-items CPI from Statistics Canada (Series 1810000501)
# -----------------------
# https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810000501&pickMembers%5B0%5D=1.2&cubeTimeFrame.startYear=2002&cubeTimeFrame.endYear=2023&referencePeriods=20020101%2C20230101

# This will be used to chain the salary amounts across years
#cpi <- read.csv("data/Canada_CPI_2022-2023-eng.csv")
cpi <- read.csv("data/Canada_CPI_1996-2023-eng.csv")


# Use All-items GDP
# Switch from wide to long format so we can lookup CPI easier
cpi <- cpi %>% 
  filter(Group=="All-items") %>% 
  pivot_longer(
    cols=starts_with("X"), 
    values_to = "CPI", 
    names_to = "year") 

cpi<-cpi%>% 
  mutate(year=gsub("X","",year)) %>% 
  select(-Group)

# Configure the base year 
BASE_YEAR = 2013
# Calculate the current value of a price in terms of the base year
get_value <- function(val, current_year) {
  #return (val)
  curr_cpi = cpi$CPI[ cpi$year == current_year ]
  base_cpi = cpi$CPI[ cpi$year == BASE_YEAR ]
  return (val * (base_cpi / curr_cpi))
}

Data collection

I’ve created a panel of observations about workers at Ontario colleges. To accomplish this, I combined a series of annual Sunshine List data files into a panel of employees, which I followed over time. Observations of individuals are linked over time based on a common employer and the person’s first and last name.

The shortcomings of this approach are that I cannot easily capture people who have changed jobs between different colleges and whose salary in those transition years is below $100,000. It also potentially misses women who have changed their last name after marriage. However, an examination of the data shows that number of cases meeting these criteria is not large.

Salaries at Ontario Colleges

The following plot shows the distribution of salaries across Ontario colleges. Below is a box plot of salaries paid at Ontario colleges. The colored bands represent the mass of the distribution, and the dots are ‘outlying’ points. In this case, the dots represent employees who make significantly more than others in the organization.

Note that I use 2013 as the inflation-adjusted year. Many workers who make nominally more than $100,000 in later years are making less than $100,000 in 2013 inflation-adjusted dollars. These workers are kept in our panel because they still provide helpful salary information.

Box Plot of Salaries

gp <- worker_panel %>% #filter(Employer %in% c("conestoga","humber","lambton") ) %>% 
  filter(salary >= 100000) %>% 
  arrange(Employer) %>% 
  ggplot(aes(x=salary, fill = Employer)) +
  geom_boxplot() + scale_x_log10(labels = scales::dollar_format()) + ggtitle("Plots of Salaries By College")
  #geom_histogram() + #, position = "stacked") # + 
#pl <- ggplotly(gp, tooltip = c("text"))
#htmlwidgets::saveWidget(pl, glue("box_plot_salary.html"))
gp

Changes to the Data

Salaries in the data tend to have a long right tail, with very few college administrators making large amounts of money. The mass of the distribution, which includes lower paid professionals, comprises college teaching faculty and administrators. A common approach to normalize such distributions is to log transform salary. This transformation tends to improve the fit of the models we use, and we can easily exponentiate the results to see them again in dollars.

I have also merged data about each college’s student demographics to enrich the data. In particular, I focus on the two largest groups – Canadian students and students from India enrolled in Ontario colleges.

The final data transformation has to do with job titles. The Sunshine List contains job titles given by colleges, but there is no consistency in those titles. To simplify the variation, I have searched for common positions. For example, if the position title contains the words ‘professor,’ ‘lecturer,’ and ‘teacher,’ then the person is given the simplified title of ‘Professor’. If the title contains ‘vice-president of …’, they are categorized as a ‘VP,’ etc.

# This function is based on schools with > 1/3 indian students. 
# Not very useful in practice so we omit it.
#worker_panel <- worker_panel %>% 
#  rowwise() %>% 
#  mutate(
#    is_indian = is_indian(Employer)
#  )

# Log transform of salary
worker_panel <- worker_panel %>%
  mutate(
    ln_salary = log(salary)
  )

# Make simpler job categories
worker_panel <- worker_panel %>% 
  mutate(
    cln = tolower( gsub("[,-].*",'',title,perl = TRUE) ),
    cln = gsub("[[:punct:]]+", "", cln),
    title2 = case_when(
      grepl("vice pre", cln) ~ "VP",
      grepl("officer", cln) ~ "Officer",
      #grepl("consultant", cln) ~ "Consultant",
      ##grepl("senior.*analyst", cln,perl = TRUE) ~ "Senior Analyst",
      grepl("vp", cln) ~ "VP",
      #grepl("coordinator", cln) ~ "Coordinator",
      #grepl("manager", cln) ~ "Manager",
      grepl("chair", cln) ~ "Chair",
      grepl("dean", cln) ~ "Dean",
      grepl("director", cln) ~ "Director",
      grepl("professor", cln) ~ "Professor",
      grepl("faculty", cln) ~ "Professor",
      grepl("instructor", cln) ~ "Professor",
      ##grepl("chef", cln) ~ "Chef",
      grepl("president", cln) ~ "President",
      ##grepl("specialist", cln) ~ "Specialist",
      ##grepl("nurse", cln) ~ "Nurse",
      grepl("exec dir", cln) ~ "Director",
      ##grepl("coach", cln) ~ "Coach",
      #grepl("registrar", cln) ~ "Registrar",
      #grepl("counsellor", cln) ~ "Counsellor",
      .default = "Other"
    )
  ) 

College Enrollments

I have also merged data about the number of enrollments in each college over time. My prior analysis found that Canadians and Indians comprise these institutions’ two largest groups of students.

Update titles using the clusters determined using the embedded vectors for the title and K-Means clustering

This was an experiment where I took all of the job titles and attempted to cluster them into groups. My hope was that this automated approach (using TF-IDF and sentence transformers) to creating clusters would show groups that were sensible. Unfortunately, this is not the case. It seems it is better to create groups based on heuristics of job groupings. The downside is the ‘other’ category, where there is a high variance in salaries. My suspicion, is that these are administrative positions with varying responsibilities not captured by the traditional titles of higher education.

This section should NOT be run

Visual sanity check of job title clusters

Chair      Dean  Director   Officer     Other President Professor        VP 
 1748      2990      4952       506     11410       270     45985      1211 
 <NA> 
    0 

Create Dummy Variables

Summary Statistics

Below is a summary of the variables that are available for examination.

  • Title2 is the simplified title given to a college worker that denotes their approximate position in the organization.
  • Employer is the name of the college that a given employee works for.
  • Year is the financial year a given Ontario college worker’s name, salary, and employer have been recorded on the Sunshine List. Years are used mainly as factors in this analysis but also as a trend measure for some earlier models.
  • Student_count is the number of students enrolled in the college for a given year. For simplicity, this is the sum of all Canadian and Indian students. Both groups make up the vast majority of students during 2012-2023.
  • enroll_canadian and enroll_indian are the number of students listed as Canadian or Indian as recorded by the college enrollment data. These are, by a wide margin, the two largest nationalities that attend Ontario colleges.
  • “prop” is the percentage of Indian international students enrolled. This variable is the number of enrolled Indian international students divided by the sum of Indian and Canadian students enrolled.
  • is_prof is an indicator set for each observation where the person is listed as a lecturer, professor, or teacher.
  • faculty_num is the total number of faculty listed on the Sunshine List for a given college. (i.e., this is the sum of workers listed as professors for a given college.) This can be used as a rough guide to the school’s teaching faculty size.
  • The experience of the person is captured with the exp variable. This is the number of years a person has been employed by the college(s) for which they are employed.
  • ln_salary is the natural log-transformed salary of a college employee.
  • salary is the annual salary paid to a college employee

Table of Salaries in Ontario Colleges, 2022

Salaries by College in 2013 adjusted dollars
2022
Employer Median SD Max N
Humber $94,620 $22,366 $449,086 796
Sheridan $95,170 $18,921 $373,444 687
George Brown $94,702 $21,678 $368,889 609
Seneca $95,192 $18,808 $342,789 764
Conestoga $95,082 $21,201 $332,908 577
Algonquin $95,082 $16,618 $274,107 597
La Cit $95,081 $19,947 $265,873 173
Sir Sandford $94,880 $16,679 $249,764 230
Centennial $95,155 $17,786 $247,264 545
Durham $94,862 $16,867 $245,880 340
Fanshawe $94,841 $15,207 $244,277 503
St Clair $95,096 $17,314 $243,781 253
Sault $95,171 $19,413 $233,358 109
Niagara $95,082 $16,409 $231,887 352
Mohawk $95,115 $17,080 $223,765 426
St Lawrence $93,623 $18,411 $222,775 215
Lambton $95,118 $21,386 $214,770 132
Northern $95,269 $19,737 $212,716 80
Georgian $94,821 $17,566 $212,587 340
Bor $95,095 $19,861 $212,430 103
Cambrian $94,717 $16,905 $207,437 173
Canadore $95,115 $15,662 $199,430 108
Loyalist $97,085 $18,443 $187,978 124
Confederation $94,517 $14,644 $184,300 132

Correlations

The correlation plot shows the degree to which variables move together. Highly correlated variables should not appear in the same inferential regression model because they will affect the size and significance of the coefficients, making interpretation impossible.

There are expected relationships between the previously listed variables because of how they are constructed or, in some cases because they measure similar phenomena. For example, the number of Canadian students enrolled is correlated with the total count of all students. Similarly, the proportion of Indian students positively correlates with the time trend. We know from other analyses that Indian international students only began to arrive in Ontario colleges towards the end of our panel.

It’s also unsurprising that the number of faculty positively correlates with the number of students enrolled. Of course, as a college gains more students, it requires more teachers.

Creating Training and Testing Data

The full dataset is divided into train and test data. The train data is the part of the data I use to build models. The remaining data is the ‘test’ data. The test data is used to test the accuracy of predictive models based on the training data.

As previously noted, individual IDs are assigned based on the individual’s first and last name and employer. This assumption means we cannot track people who switch between employers or change their names (ex., married women.)

To create valid test and train data, I sample from all available person IDs and assign a portion to the train data set and a portion to test for final verification.

[1] “fn” “ln” “year” “Employer”
[5] “title” “salary” “ln_salary” “title2”
[9] “exp” “is_prof” “faculty_num” “enroll_indian”
[13] “enroll_canadian” “prop” “student_count” “ID”
[1] 69072 16

Job Movers - Looking For People Who Moved Between Colleges

Some code was written to capture cases of people who left a job at one college and switched to another. However, there can be cases where people switch and are not in the data because their mid-year salary does not meet the threshold of $100,000 to be included in the Sunshine list. In practice, it turns out that there are few people who fit this criteria (>5).

[1] “fn” “ln” “year” “Employer”
[5] “title” “salary” “ln_salary” “title2”
[9] “exp” “is_prof” “faculty_num” “enroll_indian”
[13] “enroll_canadian” “prop” “student_count” “ID”
[17] “ID.x” “exp.x” “exp.y”
[1] “fn” “ln” “year” “Employer”
[5] “title” “salary” “ln_salary” “title2”
[9] “exp” “is_prof” “faculty_num” “enroll_indian”
[13] “enroll_canadian” “prop” “student_count” “ID”

Split the data into training and testing sets.

# Sample split into train and test
set.seed(66664)


rfdf <- rfdf %>% 
  mutate( # transformations
    Employer = as.factor( Employer ),
    title2 = as.factor( title2 ),
    #year = as.factor(year), # if commented out, year will be trend
    #ID = as.factor( ID )
  ) 

# Test - remove observations with less than 100K salary
rfdf <- rfdf %>% filter( salary >= 100000 ) # removes 27K obs!

# Get sample to train models
sample_id_list <- rfdf$ID %>% unique()
count_ids <- length(sample_id_list)
train_size = count_ids * 0.75 # 75% of obs to training. 25% to test

sample_ids = sample(x = sample_id_list, size = train_size)

train <- rfdf %>% filter(ID %in% sample_ids)
test <- rfdf %>% filter(!(ID %in% sample_ids))
#dim(train)
#dim(test)

train_uid <- sort( unique( train$title2 ) )
test_uid  <- sort( unique( test$title2 ) )
if ( !setequal(unique( test$title2 ), unique( train$title2 )) ) {
  stop("Test does not have all factors. Change random seed.") # Check for panel data regressions
}

dim(train)

[1] 31987 16

# Make dataframe numeric -- TODO when to implement?
#rfdf <- set_dummies(rfdf)
#rfdf <- rfdf %>% select(-year, -Employer, -title2)
#rfdf <- rfdf %>% select(where(is.numeric))

One-hot encodings and make numeric matrices

[1] “Employeralgonquin” “Employerbor” “Employercambrian”
[4] “Employercanadore” “Employercentennial” “Employerconestoga”
[7] “Employerconfederation” “Employerdurham” “Employergeorge_brown” [10] “Employergeorgian” “Employerhumber” “Employerla_cit”
[13] “Employerlambton” “Employerloyalist” “Employermohawk”
[16] “Employerniagara” “Employernorthern” “Employersault”
[19] “Employerseneca” “Employersheridan” “Employersir_sandford” [22] “Employerst_clair” “Employerst_lawrence” “title2Chair”
[25] “title2Dean” “title2Director” “title2Officer”
[28] “title2Other” “title2President” “title2Professor”
[31] “title2VP” “year2014” “year2015”
[34] “year2016” “year2017” “year2018”
[37] “year2019” “year2020” “year2021”
[40] “year2022” “year2023”

Inferential Models

Manual and automated testing with OLS models showed that a relatively small subset of variables provided the most explanatory power. For simplicity, this work will focus on variations of these explanatory variables.

OLS Model Specifications

We start with OLS models because of the flexibility and ease of interpretation. We start the specification using dummies for each simplified title commonly used at colleges. The current year is also set as a dummy variable because we have no reason to believe that salaries increase linearly once we take into account inflation.

Instead, with the inclusion of the worst of pandemic years in our panel it seems likely that there will be significant salary changes. We also include ‘exp,’ which represents the years of experience of workers at the college.

Two other additional explanatory variables are tested. The first is the proportion (prop) of Indian international students enrolled. The other is a count of the number of students enrolled in the college.

Sometimes, the most straightforward specification is the best, and adding more explanatory variables does not add to the overall explanatory power of a model. To determine which of the possible specifications is the best, I use the log-likelihood test to compare different nested models (not shown here).

The following visualization shows how the model coefficients change with the addition of new variables. We can see that the models are relatively consistent – which is a good thing.

Ordinary Least Squares, OLS, models, which we have just looked at, can be specified to control for within variation by adding time-varying dummy variables. However, they still contain some between-variation (more on this below). OLS models are generally used in cross-sectional data. When they are used with panel data, they are referred to as pooled OLS. One needs to be aware of the risk of bias with mis-specified OLS models and the risk that the errors of pooled OLS models used on panel data tend to be biased downward. (i.e., coefficients are listed as significant when they are really not.) This is why we often employ specialized models and standard error calculations when developing inferential models with panel data.

Call: lm(formula = ln_salary ~ ., data = ols_data)

Residuals: Min 1Q Median 3Q Max -0.97816 -0.03898 -0.00714 0.02628 0.93676

Coefficients: (3 not defined because of singularities) Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.97695601 0.00654029 1831.257 < 0.0000000000000002 exp 0.00891074 0.00018254 48.815 < 0.0000000000000002 faculty_num 0.00003026 0.00001409 2.147 0.031784 *
prop 0.00095131 0.00591617 0.161 0.872253
title2Chair -0.34748015 0.00425471 -81.670 < 0.0000000000000002 title2Dean -0.25060472 0.00384791 -65.128 < 0.0000000000000002 title2Director -0.31738000 0.00365964 -86.724 < 0.0000000000000002 title2Officer -0.08285893 0.00663715 -12.484 < 0.0000000000000002 title2Other -0.39258564 0.00363614 -107.968 < 0.0000000000000002 title2President 0.33445062 0.00726043 46.065 < 0.0000000000000002 title2Professor -0.48568005 0.00334882 -145.030 < 0.0000000000000002 title2VP NA NA NA NA
Employeralgonquin -0.00675531 0.00506118 -1.335 0.181974
Employerbor 0.02785337 0.00572193 4.868 0.000001133689494218
Employercambrian -0.01084737 0.00518889 -2.090 0.036581 *
Employercanadore -0.04193647 0.00577848 -7.257 0.000000000000403714 Employercentennial -0.02307922 0.00539250 -4.280 0.000018754230967445 Employerconestoga -0.01293048 0.00537236 -2.407 0.016096 *
Employerconfederation -0.02018470 0.00561828 -3.593 0.000328 Employerdurham -0.01232889 0.00465736 -2.647 0.008121 Employerfanshawe -0.01665469 0.00502575 -3.314 0.000921 Employergeorge_brown -0.01735081 0.00527045 -3.292 0.000996 Employergeorgian -0.00589012 0.00463284 -1.271 0.203601
Employerhumber -0.00230076 0.00563600 -0.408 0.683110
Employerla_cit -0.01393487 0.00515489 -2.703 0.006871 Employerlambton 0.00259233 0.00580607 0.446 0.655249
Employerloyalist -0.00247317 0.00556199 -0.445 0.656572
Employermohawk -0.01026809 0.00474299 -2.165 0.030403 *
Employerniagara -0.02442646 0.00472810 -5.166 0.000000240289716232
Employernorthern 0.00429772 0.00674020 0.638 0.523722
Employersault -0.02090194 0.00558284 -3.744 0.000181 Employerseneca -0.00849113 0.00601890 -1.411 0.158330
Employersheridan -0.01222221 0.00512407 -2.385 0.017073

Employersir_sandford -0.02350343 0.00495487 -4.743 0.000002109676836765
Employerst_clair -0.02068232 0.00480314 -4.306 0.000016672995049545 Employerst_lawrence NA NA NA NA
year2013 0.04198980 0.00437610 9.595 < 0.0000000000000002
year2014 0.03402977 0.00416576 8.169 0.000000000000000323 year2015 0.02001124 0.00407451 4.911 0.000000909095568558 year2016 0.01970152 0.00381814 5.160 0.000000248451230866 year2017 0.03128737 0.00614858 5.089 0.000000362855219694 year2018 0.02206859 0.00318428 6.930 0.000000000004273788 year2019 0.00741979 0.00297834 2.491 0.012735 *
year2020 0.01841651 0.00272609 6.756 0.000000000014463404
year2021 -0.00168118 0.00269481 -0.624 0.532727
year2022 -0.01303475 0.00331774 -3.929 0.000085547415653697
** year2023 NA NA NA NA
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 0.09452 on 31943 degrees of freedom Multiple R-squared: 0.6565, Adjusted R-squared: 0.656 F-statistic: 1420 on 43 and 31943 DF, p-value: < 0.00000000000000022

OLS Model Visualizations

The differently shaped icons show how the model coefficients differ. Model 1 is the baseline model.

The visualization shows that the simplified titles are the most prominent single salary indicator. All the salaries are compared to those of a professor (teaching staff). The default category ‘other’ includes admin staff and those job categories that were not obviously classifiable. Chairs and Directors make much more, followed by Deans and Officers of the college. The highest-paid positions belong to VPs and Presidents. The dummy variables used for years show some annual non-linear variation. Around the pandemic, we see a drop in average salary levels.

The expected annual salary increases with each additional year of worker experience (exp). This is a small effect, but it is multiplied by the number of years of service.

The size of the school, as measured by student count in thousands, does not represent a large change in any of the models despite being statistically significant. In other words, colleges with larger enrollments do not appear to pay much better than their smaller peers. Remember that regression models are interpreted as ‘holding everything else constant.’ So, people may still choose to work at larger colleges for other professional reasons – such as promotion opportunities – that are not available at smaller institutions.

OLS Model Full Results

Model 1
(Intercept)11.98 ***
(0.01)   
exp0.01 ***
(0.00)   
faculty_num0.00 *  
(0.00)   
prop0.00    
(0.01)   
title2Chair-0.35 ***
(0.00)   
title2Dean-0.25 ***
(0.00)   
title2Director-0.32 ***
(0.00)   
title2Officer-0.08 ***
(0.01)   
title2Other-0.39 ***
(0.00)   
title2President0.33 ***
(0.01)   
title2Professor-0.49 ***
(0.00)   
Employerbor0.03    
(0.01)   
Employercambrian-0.01    
(0.01)   
Employercanadore-0.04 ***
(0.01)   
Employercentennial-0.02 *  
(0.01)   
Employerconestoga-0.01 ***
(0.01)   
Employerconfederation-0.02 ***
(0.01)   
Employerdurham-0.01 *  
(0.01)   
Employerfanshawe-0.02 ***
(0.00)   
Employergeorge_brown-0.02 ** 
(0.01)   
Employergeorgian-0.01 ***
(0.01)   
Employerhumber-0.00 ***
(0.00)   
Employerla_cit-0.01    
(0.01)   
Employerlambton0.00    
(0.01)   
Employerloyalist-0.00 ** 
(0.01)   
Employermohawk-0.01    
(0.01)   
Employerniagara-0.02    
(0.00)   
Employernorthern0.00 *  
(0.00)   
Employersault-0.02 ***
(0.01)   
Employerseneca-0.01    
(0.01)   
Employersheridan-0.01 ***
(0.01)   
Employersir_sandford-0.02    
(0.01)   
Employerst_clair-0.02 *  
(0.00)   
year20130.04 ***
(0.00)   
year20150.02 ***
(0.00)   
year20160.02    
(0.00)   
year20170.03 ***
(0.00)   
year20180.02 ***
(0.00)   
year20190.01 ***
(0.01)   
year20200.02 ***
(0.00)   
year2021-0.00 ***
(0.00)   
year2022-0.01 ***
(0.00)   
N31987       
R20.66    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Fixed-Effects Model Specifications

Fixed-effects models are commonly used in panel data studies because they are consistent and don’t have the same bias that misspecified pooled OLS models can have. The trade-off is that fixed-effect models generally have less explanatory power. Still, they remain the gold standard in inferential panel data analysis because they are unbiased.

All models attempt to explain changes in variation. Fixed-effects models focus on the variation within groups of observations. In contrast, regression models called between-effects regressions examine how the variation differs on average between groups.(Note: A Hausman test indicated that a random-effects model would not be justified.)

The FE model interpretation is quite similar to the OLS models. The simplified titles explain much of the variation in salaries. Experience also has a small but significant impact on salary.

GLM estimation, family = gaussian, Dep. Var.: ln_salary Observations: 31,987 Standard-errors: Clustered (id) Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.55049249 0.842405 23.207961 < 2.2e-16 exp 0.00879873 0.000515 17.078510 < 2.2e-16 time -0.00391909 0.000419 -9.346523 < 2.2e-16 Employerbor 0.03258740 0.010114 3.221930 0.00127836702481 Employercambrian -0.00521839 0.006836 -0.763400 0.44524743786485
Employercanadore -0.03690722 0.008047 -4.586457 0.00000457644643
Employercentennial -0.01605076 0.004698 -3.416313 0.00063789002589 Employerconestoga -0.00600457 0.005046 -1.189983 0.23408818075370
Employerconfederation -0.01441197 0.009652 -1.493097 0.13545094775001
Employerdurham -0.00611177 0.006623 -0.922806 0.35613572332678
Employerfanshawe -0.00992555 0.005825 -1.704086 0.08840360522418 .
Employergeorge_brown -0.00992237 0.005468 -1.814461 0.06964396947580 .
Employergeorgian 0.00000192 0.007210 0.000266 0.99978761121857
Employerhumber 0.00529262 0.006999 0.756232 0.44953218235334
Employerla_cit -0.00878685 0.010601 -0.828871 0.40720223752840
Employerlambton 0.00807869 0.008512 0.949120 0.34258821081809
Employerloyalist 0.00250416 0.008760 0.285860 0.77499238978733
Employermohawk -0.00372155 0.005238 -0.710528 0.47739713225573
Employerniagara -0.01879562 0.006640 -2.830521 0.00465874316195 Employernorthern 0.00921564 0.018860 0.488633 0.62511485759151
Employersault -0.01611276 0.007334 -2.196846 0.02805986120063 *
Employerseneca -0.00104556 0.004998 -0.209193 0.83430284901327
Employersheridan -0.00514874 0.004991 -1.031579 0.30230047845838
Employersir_sandford -0.01760834 0.005756 -3.058983 0.00222819573772
Employerst_clair -0.01496385 0.005248 -2.851533 0.00436197091870 Employerst_lawrence 0.00510284 0.008547 0.597021 0.55051022869671
title2Dean 0.09700256 0.006582 14.737211 < 2.2e-16
title2Director 0.03013252 0.005607 5.373660 0.00000007931303 title2Officer 0.26437349 0.028242 9.361041 < 2.2e-16 title2Other -0.04490423 0.006447 -6.965501 0.00000000000353 title2President 0.68235261 0.060872 11.209694 < 2.2e-16 title2Professor -0.13798414 0.003996 -34.528713 < 2.2e-16 title2VP 0.34768231 0.011890 29.241931 < 2.2e-16 faculty_num 0.00002313 0.000013 1.762391 0.07804123725563 .
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Log-Likelihood: 30,003.4 Adj. Pseudo R2: -1.30562 BIC: -59,654.1 Squared Cor.: 0.654657

Fixed-Effects Visualizations

Fixed Effects Full Results

Model 1
(Intercept)19.55 ***
(0.84)   
exp0.01 ***
(0.00)   
time-0.00 ***
(0.00)   
Employerbor0.03 ** 
(0.01)   
Employercambrian-0.01    
(0.01)   
Employercanadore-0.04 ***
(0.01)   
Employercentennial-0.02 ***
(0.00)   
Employerconestoga-0.01    
(0.01)   
Employerconfederation-0.01    
(0.01)   
Employerdurham-0.01    
(0.01)   
Employerfanshawe-0.01    
(0.01)   
Employergeorge_brown-0.01    
(0.01)   
Employergeorgian0.00    
(0.01)   
Employerhumber0.01    
(0.01)   
Employerla_cit-0.01    
(0.01)   
Employerlambton0.01    
(0.01)   
Employerloyalist0.00    
(0.01)   
Employermohawk-0.00    
(0.01)   
Employerniagara-0.02 ** 
(0.01)   
Employernorthern0.01    
(0.02)   
Employersault-0.02 *  
(0.01)   
Employerseneca-0.00    
(0.00)   
Employersheridan-0.01    
(0.00)   
Employersir_sandford-0.02 ** 
(0.01)   
Employerst_clair-0.01 ** 
(0.01)   
Employerst_lawrence0.01    
(0.01)   
title2Dean0.10 ***
(0.01)   
title2Director0.03 ***
(0.01)   
title2Officer0.26 ***
(0.03)   
title2Other-0.04 ***
(0.01)   
title2President0.68 ***
(0.06)   
title2Professor-0.14 ***
(0.00)   
title2VP0.35 ***
(0.01)   
faculty_num0.00    
(0.00)   
nobs31987       
r.squared       
adj.r.squared       
within.r.squared       
pseudo.r.squared-1.31    
sigma       
nobs.131987.00    
AIC-59938.80    
BIC-59654.12    
logLik30003.40    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Note that there is a negative adjusted R^2 https://stats.stackexchange.com/questions/444041/negative-adjusted-r2-in-twoway-effects-within-model

Model 1
(Intercept)19.55 ***
(0.84)   
exp0.01 ***
(0.00)   
time-0.00 ***
(0.00)   
Employerbor0.03 ** 
(0.01)   
Employercambrian-0.01    
(0.01)   
Employercanadore-0.04 ***
(0.01)   
Employercentennial-0.02 ***
(0.00)   
Employerconestoga-0.01    
(0.01)   
Employerconfederation-0.01    
(0.01)   
Employerdurham-0.01    
(0.01)   
Employerfanshawe-0.01    
(0.01)   
Employergeorge_brown-0.01    
(0.01)   
Employergeorgian0.00    
(0.01)   
Employerhumber0.01    
(0.01)   
Employerla_cit-0.01    
(0.01)   
Employerlambton0.01    
(0.01)   
Employerloyalist0.00    
(0.01)   
Employermohawk-0.00    
(0.01)   
Employerniagara-0.02 ** 
(0.01)   
Employernorthern0.01    
(0.02)   
Employersault-0.02 *  
(0.01)   
Employerseneca-0.00    
(0.00)   
Employersheridan-0.01    
(0.00)   
Employersir_sandford-0.02 ** 
(0.01)   
Employerst_clair-0.01 ** 
(0.01)   
Employerst_lawrence0.01    
(0.01)   
title2Dean0.10 ***
(0.01)   
title2Director0.03 ***
(0.01)   
title2Officer0.26 ***
(0.03)   
title2Other-0.04 ***
(0.01)   
title2President0.68 ***
(0.06)   
title2Professor-0.14 ***
(0.00)   
title2VP0.35 ***
(0.01)   
faculty_num0.00    
(0.00)   
nobs31987       
r.squared       
adj.r.squared       
within.r.squared       
pseudo.r.squared-1.31    
sigma       
nobs.131987.00    
AIC-59938.80    
BIC-59654.12    
logLik30003.40    
*** p < 0.001; ** p < 0.01; * p < 0.05.

GLM estimation, family = gaussian, Dep. Var.: ln_salary Observations: 31,987 Standard-errors: Clustered (id) Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.55049249 0.842405 23.207961 < 2.2e-16 exp 0.00879873 0.000515 17.078510 < 2.2e-16 time -0.00391909 0.000419 -9.346523 < 2.2e-16 Employerbor 0.03258740 0.010114 3.221930 0.00127836702481 Employercambrian -0.00521839 0.006836 -0.763400 0.44524743786485
Employercanadore -0.03690722 0.008047 -4.586457 0.00000457644643
Employercentennial -0.01605076 0.004698 -3.416313 0.00063789002589 Employerconestoga -0.00600457 0.005046 -1.189983 0.23408818075370
Employerconfederation -0.01441197 0.009652 -1.493097 0.13545094775001
Employerdurham -0.00611177 0.006623 -0.922806 0.35613572332678
Employerfanshawe -0.00992555 0.005825 -1.704086 0.08840360522418 .
Employergeorge_brown -0.00992237 0.005468 -1.814461 0.06964396947580 .
Employergeorgian 0.00000192 0.007210 0.000266 0.99978761121857
Employerhumber 0.00529262 0.006999 0.756232 0.44953218235334
Employerla_cit -0.00878685 0.010601 -0.828871 0.40720223752840
Employerlambton 0.00807869 0.008512 0.949120 0.34258821081809
Employerloyalist 0.00250416 0.008760 0.285860 0.77499238978733
Employermohawk -0.00372155 0.005238 -0.710528 0.47739713225573
Employerniagara -0.01879562 0.006640 -2.830521 0.00465874316195 Employernorthern 0.00921564 0.018860 0.488633 0.62511485759151
Employersault -0.01611276 0.007334 -2.196846 0.02805986120063 *
Employerseneca -0.00104556 0.004998 -0.209193 0.83430284901327
Employersheridan -0.00514874 0.004991 -1.031579 0.30230047845838
Employersir_sandford -0.01760834 0.005756 -3.058983 0.00222819573772
Employerst_clair -0.01496385 0.005248 -2.851533 0.00436197091870 Employerst_lawrence 0.00510284 0.008547 0.597021 0.55051022869671
title2Dean 0.09700256 0.006582 14.737211 < 2.2e-16
title2Director 0.03013252 0.005607 5.373660 0.00000007931303 title2Officer 0.26437349 0.028242 9.361041 < 2.2e-16 title2Other -0.04490423 0.006447 -6.965501 0.00000000000353 title2President 0.68235261 0.060872 11.209694 < 2.2e-16 title2Professor -0.13798414 0.003996 -34.528713 < 2.2e-16 title2VP 0.34768231 0.011890 29.241931 < 2.2e-16 faculty_num 0.00002313 0.000013 1.762391 0.07804123725563 .
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1 Log-Likelihood: 30,003.4 Adj. Pseudo R2: -1.30562 BIC: -59,654.1 Squared Cor.: 0.654657

Summary - Inferential Models

The models show the importance of formal titles, events over time, and worker experience. It also seems that the demographic makeup of students (international vs domestic) is not a significant factor in working salary between colleges.

Models for Prediction

Random Forest Model

The random forest model is a tree-based model that uses bootstrap aggregation to utilize the results of multiple trees that use different subsets of explanatory variables. The result is that random forest models can tell which variables provide the most explanatory power (summarized in their purity scores.)

Variable Importance

The most important variables in the model to determine salary are the simplified position title, worker experience, the year, employer, the number of students/faculty, and, at the end, the proportion of international students.

[1] “ln_salary” “exp” “faculty_num”
[4] “prop” “title2Chair” “title2Dean”
[7] “title2Director” “title2Officer” “title2Other”
[10] “title2President” “title2Professor” “title2VP”
[13] “Employeralgonquin” “Employerbor” “Employercambrian”
[16] “Employercanadore” “Employercentennial” “Employerconestoga”
[19] “Employerconfederation” “Employerdurham” “Employerfanshawe”
[22] “Employergeorge_brown” “Employergeorgian” “Employerhumber”
[25] “Employerla_cit” “Employerlambton” “Employerloyalist”
[28] “Employermohawk” “Employerniagara” “Employernorthern”
[31] “Employersault” “Employerseneca” “Employersheridan”
[34] “Employersir_sandford” “Employerst_clair” “Employerst_lawrence”
[37] “year2013” “year2014” “year2015”
[40] “year2016” “year2017” “year2018”
[43] “year2019” “year2020” “year2021”
[46] “year2022” “year2023”

Inspect Failed Predictions

The difference between the predicted values and the actual salary in the test set shows the model’s errors. In this section, we examine the characteristics of the larger errors to get some sense of how we can improve the model. In particular, we examine cases where the predicted and actual differences are more than three standard deviations away from the average.

The histogram of the number of observations shows that the model tends to predict poorly for just a single period of a person’s work career. It could be that these observations are from people who have just appeared in the data once. There are comparatively few cases where a person’s salary was poorly predicted over two periods. (In a few cases, it seems the model did poorly for ten periods.)

The table relating the number of outliers by job title clearly shows that the ‘Other’ category provides less value to the model. A noticeable improvement would be adjusting the code that assigns simplified titles to provide somewhat more granularity. There may also be cause to revisit other simplified job titles that performed poorly. Of interest is the President’s job title. It seems likely that there is a fair amount of variation between colleges in terms of what Presidents and other high-level administrators are paid.

The table relating the number of outliers by college shows that the largest colleges in the province (Humber, Sheridan, Seneca, and George Brown) have many highly paid people on staff, which the model fails to predict well. More investigation is warranted to see if there is a close relationship here.

Number Of Outliers By Job Title
TitleCount
Other65
Director31
Officer30
Professor9
VP8
Dean7
Chair1
President1
Number Of Outliers By College
EmployerCount
humber26
sheridan19
st_lawrence14
seneca13
georgian11
bor9
canadore8
george_brown8
la_cit8
algonquin7
loyalist6
centennial5
durham4
mohawk3
st_clair3
cambrian2
conestoga2
confederation2
fanshawe1
northern1

Gradient Boosting Machine (GBM)

Gradient boosting is similar to the Random Forest models in that it also utilizes many decision trees to develop a model. Random Forests use a bagging approach or bootstrap aggregation of many decision trees. Many, many decision trees are aggregated to find the best result.

Gradient boosting machines also build an ensemble of shallow trees, but it does so in sequence, with each tree learning and improving on the previous one. Shallow trees are weak predictive models because they tend to overfit on the training data. However, many trees can be “boosted” to produce a robust model in which they are treated as a committee. When properly tuned, such models are often hard to beat with other algorithms.

Variable Importance

The GBM also shows the most important variables. The simplified title (title2), work experience, and the year are again the top explanatory variables. The measure of the student population, the college, and the proportion of international Indian students follows.

[17:35:28] WARNING: src/learner.cc:767: Parameters: { “rmse”, “trees” } are not used.

Testing Prediction Results

[1] 0.09706412 [1] “
"
RMSE for: 
OLS:0.0966 
FE:0.0971 
RF:0.0948
RMSE for: 
OLS:0.0966  
RF:0.0948 
The ensemble (equal weighted) model RMSE is: 0.0924
XGB RMSE is: 0.093 
[1] "

Conclusions

The Ontario Sunshine List is a unique resource of Ontario public servant salaries that exceed $100,000. I’ve used this data to construct both inferential and predictive models. By merging the data with college enrollment data, I found that a simplified job title, year, and employee work experience go a long way to explaining job salaries.

The predictive tree-based models, including random forest and gradient boosting machines, are better at predicting salaries based on the training data. Additional work should be done to choose improved, simplified titles that would likely improve the models. Overall, an ensemble method narrowly outperformed any single model.