It’s a new academic year. Before delving into the grim topic of school closures, here’s something well outside my usual focus on San Francisco.
Many college ranking schemes group colleges by type. For example, US News differentiates between National Universities and Regional Universities and between National Liberal Arts Colleges and Regional Colleges. If the terms have any meaning, a national university must draw students from across the nation whereas a regional university must primarily attract students from its own region. But I’ve never seen an attempt to quantify exactly which university is the most national, i.e. which one draws students from across the country in closest proportion to the population of each state.
My interest in this question was triggered earlier this year when Child-1 and I were touring a university to which he’d been admitted. The university’s admissions director boasted in a speech that his university had the most nationally representative student body of any school. Most other schools, he claimed, admitted disproportionate numbers of students from their home regions whereas his institution did not display such home state bias. The claim sounded plausible but, rather than taking his word for it, I decided to run the numbers and figure out which college’s student body is most representative of the country’s population distribution. Note that we’re only concerned about which states students hail from. Their gender, race, and ethnicity are, for this particular analysis, irrelevant.
The Data
Data about college enrollment is readily available. IPEDS reports, for each college, the “state of residence when first admitted” of all “first-time degree-seeking undergraduate students”. The national population distribution is a bit trickier. It doesn’t seem right to use the total population by state because some states like Florida have lots of old people and fewer young people. Students heading off to college for the first time are generally seventeen or eighteen years old. The Census Bureau does not publish a count of eighteen-year-olds by state, so I decided to use the closest match, which is the number of people aged 15-17 in each state [1]. This count is subtly different from the number who go on to college or even the number who graduate from high school [2].
Comparing a school’s actual enrollment figures by state with the national population distribution by state allows us to see which states are under- and over-represented. The map below shows which states were over- and under-represented at Stanford in 2022. 41.8% of Stanford’s incoming freshmen were from California but only 11.9% of 15-17 year olds live in California. California is thus heavily over-represented at Stanford whereas Alabama and other states in the south are under-represented.
Meanwhile, at Harvard, Massachusetts, New York, Connecticut, and New Jersey are over-represented.
Measuring Divergence
Is the Stanford distribution (which heavily overweights California) or the Harvard distribution (which heavily overweights Massachusetts and New York) closer to the national distribution? To decide, we need a way to express the difference between a college’s distribution and the national distribution as a single number. One common way to do that is to calculate the Kullback-Leibler divergence [3] for each college’s student mix. All you need to understand is that the smaller the divergence value, the closer the college’s enrollment is to the national population distribution, with a divergence of zero being the best possible. By this metric, Stanford beats Harvard: Stanford’s divergence score was 0.38 whereas Harvard’s was 0.45 [4]. Stanford is far from the best, however.
Results
Before I ran the numbers, I had a couple of assumptions about what the results would show.
Private universities would have lower divergence scores than public universities because public universities typically have an explicit mission to serve the students of their state. A school that draws most of its students from one state, particularly if that state is a small one, will have a high divergence score.
The more well known a college is, the lower its divergence score would be. Students are more likely to travel across the country to attend a famous school with a big reputation than a school that only locals have heard of.
These assumptions were directionally correct but I had ignored two categories of schools that score very well: service academies and online schools.
The school whose freshman student body is most representative of the country is the U.S. Air Force Academy in Colorado (divergence score: 0.07). It’s followed by the U.S. Military Academy at West Point (0.09) and the U.S. Naval Academy at Annapolis (0.13). These schools are the most representative because their admissions process is set up to ensure geographic representation from every Congressional district.
After the service academies comes Southern New Hampshire University (0.15), an online university with over 13,000 freshman students. Other online universities also have low divergence scores, including University of Massachusetts Global (0.19), University of the People (0.19), Unity College (0.26), Johnson & Wales (0.32), and Western Governors University (0.32).
Ignoring the online universities, the colleges with the lowest divergence scores are shown in the table below.
My assumption that the schools with the lowest divergence scores would all be famous institutions turns out to be wrong. Six of the top 30 accept more than half of all applicants, including Wheaton (88%), Savannah College of Art and Design (82%), Howard (53%), and SMU (52%).
Every school shows a home state preference, but the key to a good score is either not to show an extreme preference for the home state or to be located in a small state, so that even an extreme preference doesn’t account for that many students.
Vanderbilt has 5x as many Tennessee students as Tennessee’s share of the national population but Tennesseans still constitute only 11% of Vanderbilt’s freshmen.
Notre Dame has lots of students from across the midwest but only 3x as many students as expected from Indiana.
Highly Selective Schools
Somewhat contrary to my expectations, some of the most prestigious schools show greater home state biases than other highly selective schools in the same state:
New Yorkers are less than 6% of the population but constitute more than 37% of Cornell students. That’s a higher proportion than at NYU (34%), Barnard (32%), Skidmore (32%), Hamilton (27%), Vassar (26%), Columbia (24%), and Colgate (21%). Cornell’s divergence score (0.73) is by far the worst of any Ivy+ school and barely cracks the top 90 in the country. By divergence score, it’s less of a national institution than the University of Miami (0.58), Boston University (0.66), or Texas Christian University (0.57). Even Florida’s Ave Maria University (0.59) scores better. Maybe this is a legacy of Cornell’s history as a land grant university. Maybe Cornell should be considered a “regional university”.
Penn takes 19% of its students from Pennsylvania, more than Carnegie-Mellon (14%), Haverford (14%), Swarthmore (12%), and Bryn Mawr (11%).
Harvard takes 18% of its students from Massachusetts, more than Amherst (15%), Williams (15%), Wellesley (14%), Smith (13%), and MIT (8%).
Stanford takes 42% of its students from California, which is less than USC (51%) but more than Caltech (33%), Scripps (31%), and Pomona (30%).
Northwestern takes 24% of its students from Illinois whereas Chicago takes only 15%.
Public Universities
Service academies aside, the public university with the lowest divergence score is the University of Alabama (0.98). Only 35% of its students are in-state and it draws more students than expected from its neighbors and from as far away as Illinois. The University of Vermont has an even lower in-state percentage (16%), but its divergence score is significantly higher (1.54) because all the New England states are significantly overrepresented.
Among public universities in California, San Diego State has the lowest divergence index (1.33) because “only” 78% of its US first-year students are in-state. UCLA, where 84% of US students are from California, comes next at 1.41.
The universities with the highest divergence indexes have virtually no out-of-state students despite being in small states. At both UC Merced and University of Hawaii-West Oahu, 99% of the students are in-state but, because California is a much bigger state than Hawaii, it’s statistically much less surprising to find that 99% of students at a school are from California than to find that 99% are from Hawaii. Thus, the divergence score for UC Merced is only 2.12, much less than the 5.49 for Hawaii-West Oahu.
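To make the size effect concrete, here is a rough sketch of the home state's own term in the divergence sum when 99% of a school's students are in-state. The 11.9% share for California is the figure quoted earlier in the post. The 0.4% share for a small state is an assumed placeholder for illustration only, not a figure from the dataset, and `home_state_term` is a hypothetical helper name.

```python
import math

def home_state_term(in_state_share, population_share):
    """Home state's term in the divergence sum: P * ln(P / Q)."""
    return in_state_share * math.log(in_state_share / population_share)

# 11.9% of 15-17 year olds live in California (figure from the post).
large_state = home_state_term(0.99, 0.119)

# ASSUMPTION: a small state holding 0.4% of 15-17 year olds (illustrative only).
small_state = home_state_term(0.99, 0.004)

print(f"Home state term, large state: {large_state:.2f}")
print(f"Home state term, small state: {small_state:.2f}")
```

The in-state share is identical in both cases. The whole gap comes from the reference share Q: the smaller the state, the more surprising it is to see 99% of a class come from it.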
Schools’ Geographic Preferences
It is not so surprising that schools show a preference for students from the home state or region. It is more surprising when schools show a preference for students from a state that is outside their region.
New York and Massachusetts students are not just overrepresented at practically every private school from New England down to DC but also at very selective schools outside those areas such as Duke, Vanderbilt, WashU, Chicago, Northwestern, Notre Dame, Emory, Tulane, and Pomona.
California students are at least proportionally represented at every Ivy bar Cornell and at other very selective schools such as Wesleyan, Chicago, Tulane, Johns Hopkins, MIT, Wellesley, Williams, and Carnegie-Mellon.
Florida students are at least proportionally represented at Georgetown, Emory, MIT, Duke, Wake Forest, and Vanderbilt.
Texas students are the real losers. Selective schools really do not like students from Texas. The most selective school outside of Texas that has more students from Texas than would be expected by population is Louisiana’s Grambling State, which accepts 42% of applicants. Many selective schools have fewer than half as many Texans as would be expected, including Northwestern, Princeton, Dartmouth, Cornell, Brown, Penn, Carnegie-Mellon, Swarthmore, Claremont McKenna, USC, Wesleyan, Georgetown, Amherst, and Williams.
I wondered whether my choice of using the population aged 15-17 as my reference might have affected the results. Using the estimate of high school graduates as the reference would have made things worse: Texas has 10.13% of all 15-to-17 year olds but 10.26% of all high school graduates. Using the population of all ages would have helped a bit, because Texas has only 9.01% of the total population, but Texans would still be heavily under-represented at most selective schools.
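As a quick check on how little the choice of reference matters, the sketch below divides one school's Texas share by each of the three reference shares quoted above. It uses Stanford's 5.6% Texas share, which is quoted in the notes at the end of the post. The variable names are made up for this example.

```python
# Texas's share of the country under each candidate reference population
# (all three figures are quoted in the post).
texas_share_by_reference = {
    "aged 15-17": 0.1013,
    "high school graduates": 0.1026,
    "all ages": 0.0901,
}

# Share of Stanford's incoming freshmen who are from Texas (from the post's notes).
stanford_texas_share = 0.056

# Ratio of actual to expected enrollment under each reference.
ratios = {
    name: stanford_texas_share / share
    for name, share in texas_share_by_reference.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the expected number of Texans")
```

Whichever reference is used, the ratio stays well below 1, so in this example the under-representation of Texans does not hinge on the benchmark.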
Both the over-representation of the students from the North-East and California and the under-representation of the students from Texas could be explained if there were far more high quality students on the coast than in Texas. You might think that’s plausible (for example, Asian students tend to be high-achieving and there are more of them on the coasts than in Texas) but it’s hard to find evidence on the relative quality of students by state and what evidence there is doesn’t support the idea that Texas students are weaker. For the high school class of 2022, the cutoff score to be a National Merit Semifinalist was 220 in Texas, the same as New York and just one behind California and Massachusetts. The Texas cutoff was far higher than that in West Virginia (207) or Maine (211) or Rhode Island (213) or New Hampshire (214).
One factor is surely the preference that many of these selective schools show to children of alumni. If the alumni of selective schools are concentrated in the North-East and California, then those areas will be over-represented even if the students are not any better. In this regard, it is notable that MIT and Caltech, which do not grant legacy preferences in admissions, are both in the top 20 whereas Harvard is not.
Finally, it should be remembered that enrollment figures reflect both the schools’ decisions on who to admit and the students’ decisions on where to enroll. Theoretically, it could be that students from Texas don’t like to enroll out-of-state. I wouldn’t bet on that one, however.
Exploring The Data Yourself
Substack supports embedding Datawrapper charts, which is why you could mouse over all those maps earlier to see detailed data on each state (did you notice that?). But Datawrapper does not support user-driven exploration of data. There’s no way for you to explore the enrollment data on the school of your choice or to see how its divergence score was calculated. I’ve therefore prepared a Tableau viz that has all the data. Unfortunately, Substack doesn’t seem to support embedding vizzes. All I can do is paste a screenshot below. You’ll have to follow the link to interact with it.
There are four tabs:
Scatter (shown above) shows a scatter plot of the divergence indexes of all schools against the percentage of US students who are in-state. There are over 1400 schools in the dataset so, by default, the plot excludes those that are very small, have large numbers of distance learners, or have divergence indexes over 2.0. The filters on the right enable you to see more or fewer schools or to find a particular school.
List is simply a bar chart showing the schools with the lowest divergence indexes. It can also be filtered.
Pie contains a pie chart of the home states of first-year students at a particular school.
Map compares the actual number of enrollees with the expected number for each state. You have three choices for what the color can represent. “Ratio of Actual to Expected” is simply the ratio of the actual number of enrollees to the number that would be expected if it were proportional to population. It’s what is mapped in the Datawrapper charts above. This is easy to understand but does not adjust for the state population. “Log(Actual to Expected)” is the natural log of this and is included because it’s used in the divergence index formula. Finally, “Divergence” shows each state’s contribution to the school’s overall Divergence Index value i.e. it’s the Log (PctOfExpected) multiplied by the actual percentage of enrollees from the state.
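The three color options on the Map tab can be sketched in a few lines. This is only an illustration of the definitions given above, not the Tableau calculation itself. `map_measures` is a hypothetical helper name, and the Stanford shares for California and Texas are the ones quoted elsewhere in the post.

```python
import math

def map_measures(actual_share, expected_share):
    """Return (ratio, log ratio, divergence contribution) for one state."""
    ratio = actual_share / expected_share      # "Ratio of Actual to Expected"
    log_ratio = math.log(ratio)                # "Log(Actual to Expected)", natural log
    contribution = actual_share * log_ratio    # "Divergence": the state's term in the sum
    return ratio, log_ratio, contribution

# Stanford, 2022: (share of freshmen, share of 15-17 year olds), as quoted in the post.
stanford = {
    "California": (0.418, 0.119),
    "Texas": (0.056, 0.101),
}

for state, (actual, expected) in stanford.items():
    ratio, log_ratio, contribution = map_measures(actual, expected)
    print(f"{state}: ratio={ratio:.2f}, log={log_ratio:.2f}, divergence={contribution:.2f}")
```

Over-represented states have a ratio above 1 and a positive contribution. Under-represented states have a ratio below 1 and a negative contribution, which is small because it is weighted by the state's small share of enrollees.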
Up Next
This venture outside my usual bailiwick will probably be a one-off, unless I can think of something else that isn’t covered in Jon Boeckenstedt’s richly detailed Higher Ed Data Stories.
Next up for me is the more serious issue of school closures in San Francisco.
[1] More precisely, I used this table from the American Community Survey (ACS) for 2022.
[2] The census bureau publishes population figures for various age groups. I chose to use the Aged 15-17 group because it was the best fit. I rejected the idea of using the population aged 18-24 (the next age range for which the census bureau publishes estimates) because so many people move after graduating high school and college that this would not be representative of those applying to college.
I didn’t want to use the total population (i.e. the population of all ages) because states like Florida have large numbers of retired people. Using the total population would imply that colleges should admit more students from states with more old people, which seems wrong.
I considered the idea of using the Department of Education’s estimate of high school graduates by state but eventually decided against it because it differed from the aged 15-17 population mix by far more than could be explained by varying high-school graduation rates. If I’ve got to choose between census bureau estimates and Department of Education estimates, I’d prefer those from the census bureau because population estimates are core to what the bureau does. Moreover, using the distribution of high school graduates as the benchmark would reward colleges that enrolled fewer students from states with higher high school dropout rates or higher rates of homeschooling. That didn’t seem ideal.
I also considered the idea of using the number of freshmen students at all postsecondary institutions (i.e. including junior colleges, for-profit institutions, online schools etc.). That would have the advantage that I could get all the data from IPEDS but I decided it would be better if the comparison set was everyone of the right age, not everyone of the right age who ended up going to college.
[3] Long-term readers will remember that I used the same process when figuring out which schools in San Francisco most closely mirrored the city’s population.
Here, from the Wikipedia page, is the formula to calculate the divergence of an observed distribution (P) from a reference distribution (Q):
$$D_{KL}(P \parallel Q) = \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}$$
where P(x) is the proportion of freshmen at the college from state x while Q(x) is the nationwide proportion of 15-17 year olds who are from state x. Conceptually, the second term (i.e. log (P(x)/Q(x))) measures how much the observed term differs from the expected term while the first term (i.e. P(x)) ensures that we assign greater weight to states that provide more students.
Recall that only 11.9% of 15-to-17-year-olds were from California. Since 41.8% of Stanford’s incoming freshmen were from California, the contribution of California to Stanford’s divergence is therefore 0.418 * log (0.418 / 0.119) = 0.53 (using natural logarithms). 5.6% of Stanford’s students and 10.1% of 15-to-17-year-olds nationwide hail from Texas so the contribution of Texas to Stanford’s divergence is 0.056 * log (0.056 /0.101) = -0.03. Sum the contributions for all 50 states and the District of Columbia and you arrive at Stanford’s overall divergence figure which was 0.38.
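For readers who want to reproduce the summation, here is a minimal sketch. It treats a state that sends no students as contributing zero, which is the usual convention for this divergence. The function name and the three-state "country" are hypothetical and exist only to show the mechanics. They are not the real enrollment data.

```python
import math

def kl_divergence(observed, reference):
    """Sum P(x) * ln(P(x) / Q(x)) over all states x.

    States with P(x) == 0 are skipped, i.e. they contribute nothing.
    """
    total = 0.0
    for state, p in observed.items():
        if p > 0:
            total += p * math.log(p / reference[state])
    return total

# HYPOTHETICAL three-state country: shares of 15-17 year olds (Q).
reference = {"A": 0.5, "B": 0.3, "C": 0.2}

# A school that mirrors the country exactly, and one that draws mostly from state C.
mirror_school = {"A": 0.5, "B": 0.3, "C": 0.2}
local_school = {"A": 0.1, "B": 0.1, "C": 0.8}

mirror_score = kl_divergence(mirror_school, reference)
local_score = kl_divergence(local_school, reference)

print(f"Mirror school: {mirror_score:.2f}")
print(f"Local school:  {local_score:.2f}")
```

The mirror school scores zero, the best possible value. The local school's score is driven almost entirely by the large positive term for its over-represented home state, just as California dominates Stanford's figure.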
[4] By comparison, Harvard’s overall divergence figure was 0.45, largely because it’s statistically very surprising that 18.4% of freshmen would come from Massachusetts, a state that has only 1.9% of 15-to-17-year-olds.
SFED up
I grew up in Texas and went to University of Texas at Austin. What your analysis doesn’t take into account is the culture of Texas, which is that for many communities, it doesn’t occur to them to leave the state! My dad’s side of the family went back nine generations in Texas. Even with family over in “crazy” California, I had no thought of leaving the state for college. At my suburban high school, where most kids went on to college (this was the late 70s), virtually everyone I knew applied to and attended Texas universities. I knew only one of my friends who went out of state - she went to Georgetown. Higher achieving kids went to UT Austin or Rice. Next tier (at that time) was Texas A&M then the state schools (what is now Texas State.)
Additionally, at UT, hardly anyone I know was from out of state. I was very active in college in student government and other organizations and almost everyone I knew was from Texas. It WAS however, the first time I met a handful of people from other states (Yankees!)
It is MUCH more difficult to get into UT now (I’m certain I would not have qualified if I were applying today!). Maybe things are way different, but even for my nephews from Austin, all their friends stayed in the state. My SIL teaches HS at a high performing public school - she says it’s still pretty much the same.
I think this is why fewer Texans are going to some of the elite universities - they aren’t applying as much or to the extent represented by the populations of other states. Texas truly is a “state of mind”
Note: I got out of Texas as soon as I could and have happily been in SF and raised my kids here and in SF public schools.
Great stuff thanks