How To Identify the Best and Worst Schools
The promise of Growth Scores
When people say that such-and-such a school is a good school or a bad school, they are usually referring to the students’ achievement levels. People judge a school by how well its students do on standardized tests. SBAC scores are indeed a very good measure of how much the students at a school know at any point in time. But they are a very bad measure of how good a job the school or district is doing at teaching those kids.
Consider middle schools. They have no control over the quality of the kids they welcome on the first day of sixth grade. One school might get lucky and draw a bunch of students most of whom met or exceeded standards in 5th grade. Another school might get unlucky and draw a bunch of students very few of whom met or exceeded standards. The lucky school will show a higher proficiency rate on SBAC tests than the unlucky school but we have no way of knowing whether it is doing a better job of teaching its students because we have no visibility into the quality of the incoming students.
Similarly, elementary schools have no control over the quality of kids who show up on the first day of kindergarten. Some turn up knowing how to read and do arithmetic. When they sit their first SBAC tests in 3rd grade, they’ll score highly. But how much credit does their school deserve for that?
One of the points I’ve emphasized repeatedly is that much of the difference in achievement levels can be explained by demographic factors outside the control of the school. Kids who are fluent in English score higher than kids who are English learners. Kids whose parents went to graduate school score higher than kids whose parents just obtained a bachelor’s degree and they in turn score higher than kids whose parents didn’t attend college at all. Asian kids score higher than Latino kids. A poorly-run school filled with Asian kids who are fluent in English and have college-educated parents will score higher than a well-run school filled with Latino kids who are English learners and whose parents didn’t attend college. It won’t even be close.
The standard way to control for these demographic variables is to run complex multiple regression analyses with many independent variables representing various demographic factors that might plausibly affect educational achievement. The idea is that anything that can’t be explained by all these demographic variables must be the contribution of the school or district. Such analyses are fine in theory but unsatisfying in practice, not least because you need advanced math to understand the results. They also tend to reduce children to the sum of their demographics. If a kid with an inauspicious background thrives in school, is it because the school did a great job or because the kid won the genetic lottery despite great odds? If a kid who appears to have every demographic advantage does poorly in school, is it because the school did a poor job or because the kid lost out in the genetic lottery?
The ideal way to measure school performance would be to measure each student’s current abilities and subtract their abilities on the first day of kindergarten. This would give us a measure of how much the children have learned since starting school, while controlling for all of the genetic, and some of the environmental, contribution of parents [1].
We don’t have a way to measure ability on the first day of kindergarten but, starting in 3rd grade, we do have SBAC scores. The basic idea behind growth scores is that the change in students’ SBAC scores from one year to the next is a good measure of district and school performance.
How should this growth score be computed? The CDE hired Educational Testing Service (ETS) to help them evaluate the many possible alternative methods. They ended up choosing a method they call Residual Gain. Residual Gain measures how much better or worse than expected a student did this year given the student’s prior test scores. The process is as follows:
The scores of every student in the state who took the SBAC both last year and this year in both Math and ELA are fed into two linear regression models. One regression model predicts this year’s ELA scores; the other predicts this year’s Math scores. The only independent variables in the models are last year’s ELA and Math scores. In particular, the student’s grade, gender, ethnicity, English language fluency, and socioeconomic status are not variables in the model. The only factors that determine the prediction are the prior year’s scores.
Given any individual student’s prior scores, the models will predict what the student should have scored this year. A student who scored 2400 in ELA and 2450 in Math last year might be predicted to score 2440 in ELA and 2470 in Math this year.
A student’s growth score for a subject is the student’s actual score minus the score the student was predicted to get based on last year’s results. If the student actually scored 2460 in both ELA and Math, the student’s growth scores would be +20 (2460 - 2440) for ELA and -10 (2460 - 2470) for Math. Notice that the Math growth score is negative, even though the student’s Math scale score went up by 10 points from 2450 to 2460, because the Math scale score was predicted to go up by 20.
Growth scores for districts, schools, and various demographic groups are calculated by aggregating the individual student growth scores. If the district is sufficiently large, this aggregation can be a simple average. For smaller districts, schools, and groups, the aggregation is based on a weighted average of this year’s growth scores and last year’s growth scores. The larger the group, the more weight is given to this year’s results. Since last year’s growth score is based on the SBAC scores of last year and the year before that, the computation of growth scores requires three consecutive years of SBAC scores.
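To make the process above concrete, here is a minimal sketch in Python. It is not the CDE’s or ETS’s actual code. The student data is synthetic, the school assignments are made up, and the names (fit_and_predict, growth_score, school_id) are my own. It fits one regression per subject using only last year’s ELA and Math scores, takes actual minus predicted as the growth score, and then takes a simple average per school, which is the aggregation described for large groups. I haven’t sketched the weighted average used for small groups because the weights aren’t given here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 5_000

# Synthetic prior-year scale scores (illustrative only, not real SBAC data).
ability = rng.normal(0.0, 1.0, n_students)
prior_ela = 2450 + 80 * ability + rng.normal(0, 30, n_students)
prior_math = 2460 + 85 * ability + rng.normal(0, 30, n_students)

# Synthetic current-year scores for the same students.
curr_ela = 2490 + 82 * ability + rng.normal(0, 30, n_students)
curr_math = 2485 + 90 * ability + rng.normal(0, 30, n_students)

# Design matrix: an intercept plus last year's ELA and Math scores, nothing else.
X = np.column_stack([np.ones(n_students), prior_ela, prior_math])


def fit_and_predict(X, y):
    """Fit ordinary least squares and return the predicted scores."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


def growth_score(actual, predicted):
    """Growth score = actual score minus predicted score."""
    return actual - predicted


growth_ela = growth_score(curr_ela, fit_and_predict(X, curr_ela))
growth_math = growth_score(curr_math, fit_and_predict(X, curr_math))

# Simple average per school (hypothetical assignment to 10 schools).
school_id = rng.integers(0, 10, n_students)
school_growth_ela = np.array(
    [growth_ela[school_id == s].mean() for s in range(10)]
)

# The worked example from the text: actual 2460, predicted 2440 (ELA) and 2470 (Math).
print(growth_score(2460, 2440), growth_score(2460, 2470))  # prints 20 -10

# With an intercept in the model, the residuals average out to (numerically) zero.
print(abs(growth_ela.mean()) < 1e-4)
```

Because the regression includes an intercept, the student-level growth scores average to zero across the whole population, whatever the data looks like.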
By construction, the average growth score for students is zero. Roughly half of all students, schools, and districts will have growth scores less than zero. This doesn’t mean that the students in those schools learned nothing. It just means they learned less than their prior year’s scores predicted.
Since the growth score model uses 3rd grade SBAC scores as the starting point, and requires students to have taken the SBAC in successive years, growth scores can only be calculated for grades 4-8. They can thus provide a full assessment of middle schools and a partial assessment of elementary schools (partial, because they don’t assess what happens in grades K-3). There remains no easy way to assess high schools.
The theory is that schools with high growth scores are doing an excellent job of teaching their students even if their overall proficiency rate is low. Similarly, schools with low growth scores are doing a poor job even if their students have high achievement levels due to their demographics.
Before jumping into the growth score numbers, it is important to understand four properties of the SBAC scores which underlie the growth score calculations. If we understand the implications of these properties, we’ll be able to interpret growth scores with appropriate caution.
Property #1: SBAC Scores Have a Measurement Error
SBAC scores are the best measures we have of what each student knows but each score is still just an estimate of the student’s true knowledge level. As with all estimates, there’s a measurement error. That measurement error is typically between 15 and 45 points, depending on the particular questions a student is asked. Interestingly, California and most other states have chosen to hide this from parents and do not include the measurement error on score reports.

Hawaii is one of the few states that do show parents the measurement error.

The existence of the measurement error means that a student’s SBAC score might underestimate his true score one year and overestimate it the next or vice versa.
Imagine a student whose true knowledge level was at the 50th percentile last year and is still at the 50th percentile this year, and suppose the 50th percentile score was 2500 last year and is 2530 this year, an increase of 30. Now suppose that the student’s actual SBAC score last year was 2475 ± 32 (underestimating the student’s true score of 2500) and this year was 2560 ± 32 (overestimating the student’s true score of 2530). The student’s measured gain is 85 points (i.e. 2560 - 2475) even though the true gain was only 30 points. Conversely, suppose that last year the student scored 2520 ± 25 (overestimating the true score of 2500) and this year the student scored 2505 ± 34 (underestimating the true score of 2530). Now the student’s measured gain is -15 (i.e. 2505 - 2520). So, an average student whose true score increased by 30 might show a change in SBAC score anywhere between -15 and +85, a range of 100 points.
The range of possible values for this student’s growth score won’t be as wide as 100 points. Recall that the growth model uses both the Math and ELA scores from the prior year to predict the current year scores. It might seem odd to use last year’s Math score to predict this year’s ELA score [2], but doing so reduces the effect of the measurement error because it is unlikely that both the ELA and Math scale scores will diverge from the true scores by the same amount in the same direction [3].
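Here is a rough simulation of why the second prior score helps. Everything in it is an assumption made for illustration: the “true” knowledge levels, the correlation between ELA and Math, the 30-point measurement error (picked from within the 15 to 45 range mentioned above), and the flat 30-point true gain. The helper fit_rmse is my own name, not part of any official model. The sketch compares how far off the ELA prediction is when the model sees only last year’s ELA score versus both of last year’s scores.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
noise_sd = 30.0  # hypothetical measurement error in scale-score points

# Hypothetical "true" knowledge levels; ELA and Math are assumed to be correlated.
true_ela = rng.normal(2500, 80, n)
true_math = 2500 + 0.8 * (true_ela - 2500) + rng.normal(0, 48, n)

# Observed prior-year scores = true level plus independent measurement error.
prior_ela = true_ela + rng.normal(0, noise_sd, n)
prior_math = true_math + rng.normal(0, noise_sd, n)

# Observed current-year ELA score: assumed 30-point true gain plus fresh error.
curr_ela = true_ela + 30 + rng.normal(0, noise_sd, n)


def fit_rmse(predictors, y):
    """Fit OLS with an intercept; return (root-mean-square residual, coefficients)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = float(np.sqrt(np.mean((y - X @ coef) ** 2)))
    return rmse, coef


rmse_ela_only, _ = fit_rmse([prior_ela], curr_ela)
rmse_both, coef_both = fit_rmse([prior_ela, prior_math], curr_ela)

print(f"ELA only: {rmse_ela_only:.1f}  ELA + Math: {rmse_both:.1f}")
print(f"weight on prior Math when predicting ELA: {coef_both[2]:.2f}")
```

Under these assumptions the two-score model gives the prior Math score a positive weight and ends up with a smaller typical error, though the size of the improvement depends entirely on the made-up numbers.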
Nevertheless, the error term for an individual student’s growth score is so wide that it would be impossible for a parent or a teacher to know if the individual growth score reflected a real change in learning trajectory or was just statistical noise. For this reason, there are no plans to publish individual growth scores.
To eliminate all this statistical noise, we need to average the results of a lot of students so that those whose scores are underestimates offset those whose scores are overestimates. It is illegal in California to use standardized test results for teacher evaluation. Even if it weren’t illegal, it would not be wise to evaluate a teacher based on the growth scores of her students in one year. The average growth of a class of 20 could be significantly affected by just one student with an outlier score.
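A quick simulation shows how much averaging helps. It is a sketch under stated assumptions: measurement errors are independent and normally distributed with a 30-point standard deviation (a hypothetical value within the range quoted earlier), every student has the same 30-point true gain, and it looks at raw year-over-year gains rather than the regression-based growth scores. The function name spread_of_average_gain is mine.

```python
import numpy as np

rng = np.random.default_rng(2)
noise_sd = 30.0   # hypothetical per-test measurement error, in scale-score points
true_gain = 30.0  # same true gain as in the worked example above
n_trials = 2_000  # number of simulated groups per group size


def spread_of_average_gain(group_size):
    """Standard deviation of the average measured gain across simulated groups."""
    # Each measured gain carries error from both last year's and this year's test.
    last_year_error = rng.normal(0, noise_sd, (n_trials, group_size))
    this_year_error = rng.normal(0, noise_sd, (n_trials, group_size))
    measured_gain = true_gain + this_year_error - last_year_error
    return float(measured_gain.mean(axis=1).std())


# One student, one class of 20, one school of 500.
for size in (1, 20, 500):
    print(size, round(spread_of_average_gain(size), 1))
```

The spread shrinks roughly with the square root of the group size, so a class of 20 is still noisy while a school of 500 is much less so.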
Property #2: SBAC Score Growth Is Not Constant Across Grades
Students don’t advance at the same rate from one grade to another. The average 4th grade SBAC score is 42 points higher than the average 3rd grade SBAC score in ELA. The rate of progress slows in every subsequent year, going from 42 to 39 to 25 to 21 to 13. In Math, the precipitous decline starts after 4th grade. The average improvement goes from 39 points in 4th grade to 22 in 5th, 17 in 6th, 13 in 7th, and 11 in 8th. In the three traditional middle school grades of 6-8, average Math scores grow by a total of 41 points, barely more than the 39 they grew in 4th grade alone.
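The arithmetic is easy to check. The gains below are the ones quoted in the paragraph above; the helper total_gain is just my own shorthand.

```python
# Average scale-score gain on entering each grade, as quoted in the text.
ela_gain = {4: 42, 5: 39, 6: 25, 7: 21, 8: 13}
math_gain = {4: 39, 5: 22, 6: 17, 7: 13, 8: 11}


def total_gain(gains, grades):
    """Sum the average gains over the given grades."""
    return sum(gains[g] for g in grades)


middle_school = (6, 7, 8)
print(total_gain(math_gain, middle_school))  # prints 41
print(total_gain(ela_gain, middle_school))   # prints 59
print(total_gain(math_gain, range(4, 9)))    # prints 102
print(total_gain(ela_gain, range(4, 9)))     # prints 140
```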
Property #3: Annual Growth Is Lower Than Within-Grade Variation
Below is a chart showing how SBAC scale scores vary by percentile and grade. Notice that the horizontal range is much greater than the vertical range. At any percentile, the difference between the 3rd grade score and the 8th grade score is between 115 and 150 points. Within any grade, the difference between the highest and lowest possible score is about 500 points.
The within-grade differences are apparently far greater than the effect of education. A 3rd grader in the 30th percentile scores 2352. If that student progresses in line with his peers, no better and no worse, he’ll still be in the 30th percentile in 8th grade and his score will have advanced to 2487. But a 3rd grader in the 80th percentile already has a score of 2501 [4].
When we look at the same chart for Math, we see that the vertical range is particularly small at the lower percentiles. It’s only 52 points at the 10th percentile and 63 at the 20th percentile. The horizontal range is much wider. A 3rd grade student in the 30th percentile (i.e. a student well below average) scores higher than an 8th grade student in the 10th percentile.
At most percentile levels, the expected year-to-year growth is smaller than the measurement error in any one student’s SBAC score.
Property #4: SBAC Score Growth Is Greater for Stronger Students
Students at one end of the achievement spectrum don’t advance at the same rate as students from the other end of the achievement spectrum.
The difference is particularly egregious in Math. In grade 4, the 20th percentile improves by 35 points while the 99th percentile improves by 49. In subsequent grades, the students in the bottom half of the distribution almost stop improving while those in the top half, particularly at the very top, grow consistently from grade to grade. In the four year span between grades 5 and 8, the 10th percentile score grows by only 13 points while the 90th percentile score grows by 123 points and the 99th percentile score by 167.
In English, the scores for the higher percentiles also increase by more than the scores for the lower percentiles but the pattern is nowhere near as pronounced.
It’s interesting, but out of scope for today, to speculate on why these patterns appear. Is the declining annual growth just a weird artifact of the model used to produce SBAC scores or does the amount students learn actually decline each year? Do strong students grow proportionately more in Math than in ELA because students who didn’t master the previous year’s material find it harder to learn new material in Math than in ELA? Does the existence of accelerated Math classes in many middle schools explain, or contribute to, the greater growth shown by strong students in middle school Math?
Implications for Growth Scores
The growth model is based on a linear regression but we have seen that there are multiple ways in which growth observed in the real world is not linear:
Students in lower grades experience greater score gains than students in higher grades, even though classroom grade is not a variable in the model. If there are two students with identical prior year SBAC scores in both ELA and Math, the growth model will give them both the same forecast scores for the current year. But, in practice, if one of the students is in 4th grade and the other is in 8th grade, the 4th grader will tend to exceed the forecast score, thereby showing positive growth, while the 8th grader will tend to end up below the forecast score, thereby showing negative growth.
Students in very high percentiles experience greater score gains than a linear model would predict, while students in very low percentiles experience lower score gains than a linear model would predict. The growth model, being based on a linear regression, will thus tend to under-predict the scores of strong students, which inflates their growth scores, and over-predict the scores of weak students, which deflates theirs.
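The grade effect can be illustrated with a toy simulation. This is a deliberately simplified sketch: it uses a single prior score instead of both ELA and Math, only two grades, and made-up score distributions. The 39- and 11-point average gains echo the 4th and 8th grade Math figures quoted earlier; everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_grade = 5_000

# Hypothetical prior-score means and average gains for two grades.
grades = {
    4: {"prior_mean": 2480, "gain": 39},
    8: {"prior_mean": 2580, "gain": 11},
}

prior_parts, current_parts, grade_parts = [], [], []
for g, cfg in grades.items():
    p = rng.normal(cfg["prior_mean"], 90, n_per_grade)
    c = p + cfg["gain"] + rng.normal(0, 30, n_per_grade)
    prior_parts.append(p)
    current_parts.append(c)
    grade_parts.append(np.full(n_per_grade, g))

prior = np.concatenate(prior_parts)
current = np.concatenate(current_parts)
grade = np.concatenate(grade_parts)

# Pooled linear regression of current score on prior score.
# Grade is deliberately NOT a predictor, mirroring the growth model.
X = np.column_stack([np.ones_like(prior), prior])
coef, *_ = np.linalg.lstsq(X, current, rcond=None)
growth = current - X @ coef

# Average growth score (residual) by grade.
mean_growth = {g: float(growth[grade == g].mean()) for g in grades}
print(mean_growth)
```

In this toy setup the 4th graders come out with a positive average growth score and the 8th graders with a negative one, even though each group gained exactly what is typical for its grade.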
When we start comparing schools and districts on the basis of their growth scores, we can therefore expect to observe the following:
Schools (and districts) that have lots of high-scoring students will show higher growth than schools (and districts) that have lots of low-scoring students. A school where 60% of students are proficient will tend to have a higher growth score than a school where 30% of students are proficient.
Similarly, demographic groups with lots of high scorers can be expected to show higher growth than demographic groups with lots of low scorers. Asian and White students, in other words, will tend to show higher growth than Latino and Black students.
Since average gains decline from 4th grade to 8th grade, a school’s grade mix will affect its growth score. K-5 schools will show higher growth than K-6 or K-8 schools. Middle schools that start in grade 6 will show lower growth than any elementary school but higher growth than middle schools that start in grade 7.
If we rank schools and districts by growth score, the highest and lowest ranked districts will tend to be small because they don’t have enough students to eliminate the statistical noise. If we find a large district with a high growth score, we can expect to find that it’s a district with a lot of high-achieving students.
I’m not claiming to have discovered anything new here. The statisticians who developed the growth model were fully aware of these properties and their effect on growth scores [5]. The takeaway is just that growth scores need to be interpreted with care.
If I’m comparing two schools, I’m going to try to compare like with like and I’ll try to keep the underlying proficiency rate in mind. Suppose School A and School B serve the same grades, are the same size, and both have a growth score of 0 (meaning the students’ growth matched expectations, not that they didn’t learn anything). If School A has a 25% overall proficiency rate and School B has a 75% overall proficiency rate, my conclusion is going to be that School A is doing a much better job than School B because most other schools with a 25% proficiency rate will have negative growth scores and most other schools with a 75% proficiency rate will have positive growth scores.
Even though growth scores need to be interpreted carefully, I believe they add a lot to our understanding of how well schools are doing. Consider Buena Vista / Horace Mann K-8, a largely Latino school in San Francisco where over 60% of students are English learners. Here’s what the California School Dashboard says about their academic performance in 2024.

Not good, is it? But when we look at the growth scores, which are based on the same underlying results, we get a very different picture. In ELA, their growth is above what was expected.
In Math, the growth was exactly in line with what the growth model predicted, given where the students started the year.
In short, the school is doing fine. I can’t say any more than that because it’s hard to compare it with other schools. Growth scores are only available for 2024 and they are hidden away on a secondary page of the California School Dashboard. They’re not available for download which makes comparing schools a very laborious task because you have to look up each school separately.
The 2025 growth scores are scheduled to be published in December and CDE has promised to make them available for download at that time. Expect me to produce lots of charts then, comparing San Francisco with other districts.
Next week, we’ll look at how San Francisco’s other schools scored.
[1] This still wouldn’t be a perfect assessment of the school, because parents can have an ongoing effect on their children’s education by providing a stable home environment or additional tutoring, but it would be better than what we have today.
[2] It intuitively feels less odd for last year’s ELA score to affect this year’s Math score because students who can’t read a Math question will have trouble answering it correctly.
[3] Suppose the prior year scores are 2400 for ELA and 2500 for Math. The model will predict a higher current year ELA score than it would based on the ELA score alone because the 2500 Math score suggests that 2400 is an underestimate of the student’s true ELA score. Conversely, the lower prior year ELA score suggests that 2500 might be an overestimate of the student’s true Math score and so the predicted Math score will be lower than it would be if the prediction were purely based on the 2500 Math score.
You could imagine a model that gets smarter about students over time and bases the grade 8 growth projection on the results of grades 3-7 instead of just grade 7 but that’s not the model we have.
[4] I’m very careful to say that the 3rd grader scores higher than the 8th grader, not that he knows more. The 3rd grader and the 8th grader are not taking the same test. The SBAC is an adaptive test with certain limits. In each grade, the test starts with questions designed to measure the student’s knowledge of grade level material. As the student progresses through the questions, the mix of correct and incorrect answers affects which questions come next. About 60% of the way through the test, students who are doing particularly well or badly will start to see questions from “no more than two adjacent grades”. A strong third grader will see some fifth-grade questions and a weak 8th grader will see some 6th grade questions but there will be no overlap between the questions they are asked. It would thus be a mistake to say that a 3rd grader with a score of 2501 “knows more” than an 8th grader with a score of 2487.
[5] Mathematically, it might have been feasible to produce a more accurate growth model by including additional independent variables in the model such as student grade. Maybe race or socioeconomic disadvantage or English language fluency would also improve the model’s accuracy. Today, Asian students have higher growth scores than Latino students. With this enhanced model, maybe the average growth scores would be similar. I suspect that the reason they didn’t go down this path is that the resulting model would have some controversial implications. Among all the students who score 2500 in 3rd grade, there are disproportionately more Asians than Latinos. Among all the students who score 2500 in 7th grade, there are disproportionately more Latinos than Asians. If the model were to start taking grade into account, Latino kids who score 2500 would be projected to improve by less than Asian kids who score 2500. We would be expecting less growth from the Latino child than the Asian child. Such a model would legitimately be attacked for exhibiting what George W. Bush once called “the soft bigotry of low expectations.”
That districts and schools with lots of Latino students will tend to show negative growth is intended to keep districts and schools focused on improving those students’ learning. The goal is for the Asian and Latino growth scores to be similar because Asian and Latino students are learning at the same rate not because we have lower expectations of the Latino students.

