This post was originally published on Tom VanderArk’s “VanderArk on Innovation” blog on Edweek. It was also published on GettingSmart. The following is an edited version.
Specific programs and content, not just teachers and ‘teacher quality’, must be held accountable for student outcomes. A recent study published by WestEd shows how, given certain program conditions, cost-effective and rigorous student test-score evaluations of a digitally facilitated program can now be pursued annually, at any time, in any state.
Historically, the glare of the student-results spotlight has been focused so intensely on teachers alone that the programs and content ‘provided’ to teachers have often not even been recorded. Making the case for the vital importance of paying attention is a scathing white paper, Choosing Blindly: Instructional Materials, Teacher Effectiveness, and the Common Core, from the Brown Center on Education Policy’s Matthew Chingos and Grover Whitehurst. The good news: digital programs operated by education publishers for schools organically generate a record of where and when they were used.
Today’s diversity of choices in digital content (choices about scope, vehicle, approach, and instructional design) is far greater than in the past, when a teacher selection committee picked among the “Big 3” publishers’ textbook series. This wide variety means content can no longer be viewed as a commodity, as if it were merely a choice among brands of gasoline. Some of this new content may run smoothly in your educational system, some may sputter and stall, and some may deliver substantially more than normal mileage or power.
It is important to take advantage of this diversity and to search for more powerful content. The status quo has not delivered improved results at scale in a timely manner. Spearheaded by the goals embodied in the Common Core, we are now targeting much deeper student understanding while retaining last decade’s goal of demonstrably reaching all students. In this pursuit, year after year, the teachers and students stay the same. What can change are the content and programs they use, ‘programs’ including the formal training programs we provide to our teachers.
But how do you tell what works? This has been extremely challenging in education, due in equal measure to a likely scarcity of programs that actually work significantly better, to the immense and hard-to-replicate variation in program use and school cultures, and to the high cost, complexity, and delay inherent in conventional rigorous experimental evaluations.
But there is a cost-effective, universally applicable way to rigorously evaluate a large swath of content and programs: do they add value versus business as usual? The method is straightforward, requires no pre-planning, can be applied retrospectively, and is replicable across years, states, and program types. It can cover every school in a state, thereby taking real-world variability into account, and it works seamlessly across districts, aggregating up to hundreds of schools.
To be evaluated via this method, the program must be:
- able to generate digital records of where, when, and how much it was used at a grade level
- offered in a grade level and subject (e.g., grades 3–8 math) that posts public grade-average test scores
- a full-curriculum program (so that summative assessments are valid measures of it)
- in use in 100% of the classrooms/teachers in each grade (so that grade-average assessment numbers are valid)
- new to the grade (i.e., evaluating the first one or two years of use)
- adopted at a sufficient “n” within a state (e.g., a cohort of roughly 25 or more school sites)
Every program, in every state, in every year, that meets the above criteria can be studied, whether for the first time or to validate continuing effectiveness. The data are waiting in state and NCES research files, to be used in conjunction with publisher records of school- and grade-level program usage. The WestEd study cited above illustrates a quasi-experimental study of this kind conducted to high standards of rigor.
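To make the comparison concrete, the core calculation behind “do they add value versus business as usual” can be sketched as a simple difference-in-differences: compare the year-over-year change in public grade-average scores at schools that adopted the program with the change at business-as-usual schools in the same state. The sketch below is a minimal illustration, not the WestEd study’s actual method or data; the school scores and cohort sizes are invented, and a real analysis would add matching, covariates, and significance testing.

```python
# Hypothetical sketch of a difference-in-differences value-added estimate.
# All numbers are invented for illustration.
from statistics import mean

# (prior_year_score, current_year_score) grade-average scores per school
program_schools = [(310.0, 318.5), (295.2, 301.0), (322.1, 329.4)]
comparison_schools = [(308.4, 310.1), (301.7, 303.2), (315.0, 316.8)]

def mean_gain(schools):
    """Average year-over-year change in grade-average score."""
    return mean(post - pre for pre, post in schools)

# The program's estimated added value is its cohort's average gain
# minus the business-as-usual gain in the comparison schools.
effect = mean_gain(program_schools) - mean_gain(comparison_schools)
print(round(effect, 2))  # positive = gains beyond business as usual
```

In practice the comparison cohort would be every non-adopting school in the state (or a matched subset), which is what lets the method absorb real-world variability rather than a handpicked sample.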
It may be too early for this level of accountability to be palatable for many programs. Showing robust, positive results requires that the program itself actually be capable of producing differential efficacy, and of course some program outcomes are not measured by standardized test scores. There will be many findings of small effect sizes, many implementations that fail, and many failures to reach statistical significance. External factors may confound the findings, and program publishers would need to report failures as well as successes. But the alternative is to continue in ignorance, relying only on peer word-of-mouth recommendations or making do with a handful of small ‘gold-standard’ studies in limited contexts.
The potential exists to start applying this method now to many programs. Annual content evaluations can become a market norm, giving content an annual seat at the accountability table alongside teachers and stimulating competition to improve both content and its implementation.