Celebrating Traditional K-12 Public Education
Goldilocks and the Three Bears and Criterion-Referenced Testing?
In the classic tale of Goldilocks and the Three Bears, Goldilocks enters an empty house and samples three bowls of porridge, three chairs, and three beds, in each case seeking the one that is “just right.” Oddly enough, this fairy tale offers a helpful metaphor for understanding a key concept in standardized testing: the criterion-referenced test.
This week’s newsletter builds on last week’s discussion of standardized testing by focusing specifically on criterion-referenced assessments. Understanding both types of tests is essential, because misunderstandings about testing often fuel inaccurate narratives about public school performance. The better we understand how these tools work, the better equipped we are to push back on disinformation and highlight the strengths of traditional K-12 public education in school quality and student achievement.
Criterion-referenced tests compare a person’s knowledge or skills against a predetermined standard, learning goal, performance level, or other criterion. With criterion-referenced tests, each person’s performance is compared directly to the standard, without considering how other students perform on the test. Criterion-referenced tests often use “cut scores” to place students into categories such as “basic,” “proficient,” and “advanced.”¹ Both Arizona’s state assessments and the National Assessment of Educational Progress (NAEP) exemplify this approach.
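To make the mechanism concrete, here is a minimal sketch of how a cut-score classification works. The thresholds below are invented for illustration only; they are not actual NAEP or Arizona cut scores, which vary by subject and grade level.

```python
# Hypothetical cut scores for a criterion-referenced test.
# These thresholds are invented for illustration; real NAEP and
# Arizona cut scores differ by subject and grade level.
CUT_SCORES = [
    (268, "advanced"),
    (243, "proficient"),
    (208, "basic"),
]

def performance_level(score: int) -> str:
    """Place a scale score into a category by comparing it to fixed
    cut scores -- never to how other students performed."""
    for cut, label in CUT_SCORES:
        if score >= cut:
            return label
    return "below basic"

# Every student is judged against the same fixed standard:
print(performance_level(250))  # -> proficient
print(performance_level(200))  # -> below basic
```

The key point the sketch illustrates: the function never looks at other students’ scores, only at the predetermined standard, which is what distinguishes a criterion-referenced test from a norm-referenced one.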
The essential element in these types of tests is the predetermined standard. Determining the “right” standard, like Goldilocks searching for porridge that is not too hot or too cold, requires thoughtful calibration. It's not just about academic rigor; it's also about being realistic and fair. The process is both a science and an art, and it retains a distinctly human element in its creation.
As a former track coach, I like to use the high jump to illustrate this concept. If the standard is to set the bar at five inches, nearly everyone clears it. Set the standard at seven feet two inches, and almost no one will. The challenge is to place the bar “just right” for each grade level so that the results reflect meaningful learning progress.
It’s important to remember that these standards are not set in stone. In fact, they evolve. In early 2000s Arizona, eighth-grade math proficiency hovered between 18% and 25% for four years. Then, in 2004, after the standards and test were modified, proficiency jumped to around 60%.² That wasn’t a case of dumbing things down, it was about aligning expectations with realistic, developmentally appropriate outcomes. That adjustment helped Arizona find a set of math standards that were a more reasonable balance of rigor and realism or, like Goldilocks, “just right.”
Today’s Arizona state assessments and NAEP are both built on clearly defined standards. While Arizona’s state tests have changed frequently over the last two decades, NAEP has remained consistent, making it a more reliable source for long-term analysis. That’s why critics of public education often point to NAEP scores when questioning the effectiveness of our schools. This critique has gained steam recently, so this newsletter uses NAEP results for a perspective on and analysis of criterion-referenced tests.
Leading up to President Trump’s executive order to eliminate the Department of Education, his administration raised alarms about student performance. Specifically, it argued that “The Department of Education has spent over $3 trillion since its creation in 1979 but there has been virtually no measurable improvement in student achievement.”³
Project 2025, a policy blueprint guiding the current administration, echoes this point using NAEP data. It claims NAEP scores for 9- and 13-year-olds in reading and math have remained stagnant over time. For example, it notes that average eighth-grade reading scores in 1992 and 2022 were both 260, while fourth-grade reading remained at 217. The report also used long-term trends for 9-year-olds and 13-year-olds in reading and math to try to show a lack of academic growth.⁴
But here's the thing: Project 2025 only tells part of the story.
To understand the full picture, I went directly to the National Center for Education Statistics (NCES) database. NAEP data is publicly available through the “Nation’s Report Card,” and it offers a much richer context than critics suggest. I specifically reviewed average scores in reading and math for 9-year-olds, 13-year-olds, fourth graders, and eighth graders, the same data points cited by Project 2025.
Before we dive into scores, it’s helpful to understand NAEP’s sample sizes. For 9- and 13-year-olds, the samples include about 7,400 and 8,800 students, respectively, with fewer than 8% attending private schools. The fourth- and eighth-grade NAEP assessments involve about 110,000 students, with private school students representing less than 2% of the total.⁵ These tests primarily involve public school students.
The NAEP test isn’t administered every year, and the year of administration varies by the age and grade level assessed. I chose three data points for comparison: the first year of the test, 2019 (pre-pandemic), and the latest year available (either 2022 or 2023). This approach helps account for pandemic-related disruptions and gives us a long-term view.
Headlines accompanying the recent release of NAEP scores centered on the post-pandemic decline. However, decades of NAEP data reveal meaningful progress in reading and math achievement, particularly for historically underserved student groups. The data demonstrate that, despite recent challenges, traditional K-12 public education has made substantial progress in raising achievement levels. My analysis shows the following:⁵
Reading (9-year-olds): Scores improved by 7 points since 1971. Black students gained 29 points, significantly narrowing racial achievement gaps.
Reading (13-year-olds): Overall growth was modest, but Black and Hispanic students improved by 15 points each, clear signs of long-term gains.
Reading (4th grade): Modest gains since 1992, especially among Black (+7) and Hispanic (+6) students, despite pandemic setbacks.
Reading (8th grade): Scores dipped slightly post-pandemic, but every major group still made progress since 1992, and gaps continued to close.
Math (9-year-olds): Scores rose 15 points since 1971. All racial/ethnic groups improved by at least 20 points pre-pandemic.
Math (13-year-olds): Gains were seen across the board. Males improved by 10 points and females by 2, even with some recent decline.
Math (4th grade): The strongest gains were seen here: Black students improved by 32 points and Hispanic students by 27 since 1990.
Math (8th grade): Eighth-grade math scores rose 11 points since 1990 across all groups, even with slight setbacks since 2019, reflecting resilience in core academic skills.
Achievement gaps have narrowed dramatically over time, especially between Black and White students, showing the positive impact of decades of focused efforts.
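Because a “narrowing gap” is ultimately simple arithmetic on group averages, a small sketch may help. The numbers below are illustrative placeholders loosely patterned on long-term reading trends, not exact NAEP values.

```python
# Illustrative average scale scores for two points in time.
# These are placeholder numbers, not exact NAEP results.
first_year = {"white": 214, "black": 170}  # hypothetical earliest-year averages
latest     = {"white": 219, "black": 199}  # hypothetical recent averages

def gap(scores: dict) -> int:
    """The achievement gap is the difference between group averages."""
    return scores["white"] - scores["black"]

print(gap(first_year))  # 44-point gap in the first year
print(gap(latest))      # 20-point gap today: narrowed by 24 points
```

Note that in this sketch the gap narrows even though both groups improve: the lower-scoring group simply gains faster, which is the pattern the NAEP trends above describe.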
Because criterion-referenced tests are calibrated to measure against a standard, large-scale improvement isn’t necessarily expected, or even designed for. The goal is to see more students meet the benchmark. Improvement takes time, as strategies and interventions gradually raise student performance. That’s why even incremental gains are meaningful, especially across decades.
Returning to the high jump scenario: if the bar is set at 5'7", the objective isn’t to raise the bar every year but to help more students clear it through coaching, training, and encouragement. Whether a student improves can depend on many factors, one of the most important being personal interest in the high jump itself. Without that spark of motivation, progress becomes less likely, even with the best support in place.
This same concept holds true for criterion-referenced tests. In their book 50 Myths & Lies That Threaten America’s Public Schools,⁶ David Berliner and Gene Glass address factors that may affect success on a test. They discuss how poverty and inequality shape education outcomes; in fact, poverty is one of the strongest predictors of academic outcomes. Access to resources, early childhood learning opportunities, and experiences growing up all affect success on tests, as does a student’s interest in the subject matter.
Expecting every student to reach the exact same standard at the same time ignores this nuance. Like Goldilocks, we need to find what’s just right, a balance between holding high expectations and recognizing the individual paths students take to get there. Standardized tests can provide valuable insights, but only when we interpret their results with care, context, and compassion.
The next time you encounter media coverage of the “Nation’s Report Card,” approach it critically. While post-pandemic scores have declined, pre-pandemic results showed meaningful gains. More importantly, NAEP data demonstrate progress in closing the achievement gap and positive growth on many metrics compared with the initial testing years. Dig into the data yourself; you’ll find a more nuanced story than critics admit. Understanding how criterion-referenced tests work will also help you interpret the results.
We should celebrate the positive trends for students since the test’s inception. Yes, there has been a drop since the pandemic, but the long-term direction is up. Our traditional K-12 public school classrooms are working, and given time, they will bounce back from the disruption of the past few years. Criterion-referenced tests such as NAEP demonstrate these positive trends.
Understanding the types of tests required of students in traditional K-12 education is critical. Knowing what to expect, and what not to expect, from a norm-referenced or criterion-referenced test makes it possible to counter messaging that oversimplifies or misrepresents the story of traditional K-12 public education. The true story is that, while testing has a role to play in determining how well schools or students are doing, using academic achievement as a rationale to support universal choice is misguided.
As we continue to strive for educational excellence, let’s take time to appreciate the inclusive, responsive, and resilient nature of traditional K–12 public schools. Let’s celebrate the educators, students, and communities who show every day what public education can achieve. The work is not done, but the inclusive mission of our educators is producing positive outcomes for our students. The commitment to educating every child, regardless of background or ability, deserves to be recognized and applauded.
Notes:
1 Renaissance. What’s the difference? Criterion-referenced tests vs. norm-referenced tests. July 11, 2018. https://www.renaissance.com/2018/07/11/blog-criterion-referenced-tests-norm-referenced-tests/
2 Snowflake Unified School District Raises Student Achievement Utilizing the Principles of 21 Keys for High Performance Teaching and Learning, Seattle, WA. (The Pacific Institute) September 2006.
3 Micah Ward and Matt Zalaznick, “Trump orders elimination of Department of Education; shifts core functions,” New York, NY (DA District Administration), March 20, 2025.
4 Mandate for Leadership: The Conservative Promise, Project 2025 Presidential Transition Project. (The Heritage Foundation), Washington, DC. 2023. Pg 319–362.
5 NAEP Report Card. (National Center for Education Statistics), Washington, DC. Accessed January 24, 2025. https://nces.ed.gov/nationsreportcard/
6 David Berliner and Gene Glass, 50 Myths & Lies That Threaten America’s Public Schools, New York, NY (Teachers College Press), 2014.