Category Archives: Evaluation

Headed for Proof: a Baseball Analogy

I appreciate Robert Slavin’s baseball analogy in his HuffPost article ‘Pilot Studies: On the Path to Solid Evidence‘. Dr. Slavin is the Director of the Center for Research and Reform in Education at Johns Hopkins. He puts forward the idea that a new baseball strategy would be “headed towards proof” if 10 teams adopted it and saw comparative improvements at season’s end. It’s easy to take for granted that there’s universal data in sports: rankings are reported in the sports section. So if we knew which teams adopted any new strategy in a given year, we could do an evaluation.

I would label this baseball analogy an unmatched-controls quasi-experiment: it pits teams using the new strategy against the other, ‘business-as-usual’ teams. That is not a gold-standard RCT, yet it is sensible to say that 10 teams improving their relative rankings are headed towards proof.

Back to edtech: you can get very close to the baseball analogy via published school proficiency rankings, and you can improve on it by matching controls. There also exist quantitative digital records of how much, and how well, schools used digital programs. End-of-school-year summative tests are like a whole ‘season’. And for schoolwide programs, individual teachers and students aren’t opting in or out depending on motivation or capability. This quasi-experimental route, “headed for proof”, enables the game-changing advantages of volume and speed: use it for any new cohort of schools (>10), for any schoolwide program, every year. #edtechguidelines
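To make the matched-controls idea concrete, here is a minimal sketch in Python. All school names, proficiency numbers, and the greedy nearest-prior-score matching rule are invented for illustration; a real study would use a principled matching method and significance testing.

```python
# Hypothetical matched-controls comparison on public grade-average
# proficiency data. Tuples are (school, prior-year %, post-year %);
# all values are illustrative, not real data.
treatment = [("T1", 42.0, 49.0), ("T2", 55.0, 58.0), ("T3", 38.0, 46.0)]
pool      = [("C1", 41.0, 43.0), ("C2", 56.0, 57.0), ("C3", 37.0, 39.0),
             ("C4", 63.0, 64.0), ("C5", 45.0, 46.0)]

def match_controls(treated, candidates):
    """Greedily pair each treatment school with the unused control
    school whose prior-year proficiency is closest."""
    unused = list(candidates)
    pairs = []
    for name, prior, post in treated:
        best = min(unused, key=lambda c: abs(c[1] - prior))
        unused.remove(best)
        pairs.append(((name, prior, post), best))
    return pairs

pairs = match_controls(treatment, pool)
# Differential gain: treatment school's gain minus its control's gain
gains = [(t[2] - t[1]) - (c[2] - c[1]) for t, c in pairs]
effect = sum(gains) / len(gains)
print(f"mean differential gain: {effect:+.1f} points")  # → +4.3 points
```

The same loop scales from 10 baseball teams to hundreds of schools once the treatment roster and the public rankings are in hand.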


Future Vision 2025 Assessments: It’s in the Practice

What future are we aiming at? This series of 6 posts, Future Vision 2025, describes some of my personal education mission milestones. These are not predictions; they are aspirational. They are framed as significant differences one could see or make by 2025. What’s noticeably different in 2025 when one examines students, parents, teachers, learning, assessment, media & society? How and when these milestones are reached is not addressed. Some milestones are indicated by the emergence of something ‘new’ (at least at robust scale), others by the fading away of something familiar and comfortable.

Assessment 2025

In the 1970s, I remember taking the Iowa Test of Basic Skills in math & English, in a few grades, for a few hours.

By 2015, a Council of the Great City Schools evaluation showed students undergo standardized testing for 20-25 hours per year, not to mention test prep time. By the time they graduate, students have been administered about 112 exams. This is great fodder for the program evaluation work I do now, understanding what is working, how much, and for whom; that work would be impossible at scale without plenty of universal standardized test data. But in the future, given digital content, the 20-25 hours per year of standardized testing can be eliminated while retaining the benefits of the information they used to provide. The reduction of non-learning-added time goes beyond the test hours themselves; it includes eliminating the prep hours for the style of test. And most importantly, it implodes the paradigm that test scores are the purpose, and test day the culmination, of the school year’s efforts.

By 2025, “sitting tests” in March and April has been replaced by continual assessment of knowledge and ability throughout the school year, via organic student interaction with the digital learning activities themselves. These activities each week still include practicing solving many problems, aka “doing problem sets.” The information generated from the digital “Practice” IS the new “Assessment.” Indeed, summative standardized tests were essentially a review problem set, given in a huge dose at the end of the year. In 2025, each week every student’s use of digital content indicates mastery of that week’s content…or not. Gaps are identified as they occur, and are filled before moving on. You may ask, thinking back to cramming for a final, what about the retention that summative tests checked? In 2025, the digital content and practice adaptively checks retention of key prior knowledge for each individual student, intelligently spiraling problems back and forth to build fluency.
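One way to picture that spiraling: a toy Leitner-style scheduler that re-surfaces a skill at growing intervals after each correct retention check, and pulls it back after a miss. The skill name, interval ladder, and data model here are all hypothetical, just a sketch of the scheduling idea, not any real product’s algorithm.

```python
# Toy Leitner-style spiraling: each correct retention check pushes
# the skill further out; a miss restarts the ladder.
from dataclasses import dataclass

INTERVALS = [1, 3, 7, 21]  # days until the next retention check

@dataclass
class Skill:
    name: str
    box: int = 0     # position on the interval ladder
    due_in: int = 1  # days until the next check

    def record(self, correct: bool) -> None:
        # Correct: climb the ladder (longer gap); miss: back to the start.
        self.box = min(self.box + 1, len(INTERVALS) - 1) if correct else 0
        self.due_in = INTERVALS[self.box]

s = Skill("fraction equivalence")
s.record(True)
s.record(True)
print(s.due_in)  # 7  (two correct checks push the next one a week out)
s.record(False)
print(s.due_in)  # 1  (a miss brings the skill right back)
```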

Moreover, beyond the conventional goal of “producing the right answer,” 2025’s digital device interfaces and pattern recognition assess student strategy. Tablets collect, and the backend cloud parses and interprets, student handwriting and diagrams. “Show your work” is digitized and thus comprehensively purposeful. The information gleaned evaluates methods and strategies, and yes, even productivity and speed. Insightful and actionable feedback on all of these is provided in real time to the teacher and especially to the student. Why a student “isn’t getting it” becomes detailed and transparent. In 2025 we haven’t just replaced “right answer” with “right strategy,” though; it’s a different paradigm. Mastery is not tied to one “right” strategy, but is about learning and applying strategies and methods that are productive: efficiency in thought, effort, and time.

In 2025, comprehensive content breadth and mastery of all techniques, which used to be the summative test’s job, has been measured in this digital, formative way throughout the school year. Indeed, because of the continual feedback and intelligent spiraling, it has been not just measured, but refined and improved throughout the year towards fluency. There is still, however, a “final.” The benefits of a deadline to display one’s complete picture of a complex and broad topic are maintained. But because all that ongoing broad content mastery is already well known, the “final” can focus on a specific “narrow” area of interest. The final can be a performance: authentic, creative, rigorous, and very human, showing off the learner’s ability to communicate and to creatively transfer learning to different domains.

Yes, I mean that a middle schooler’s integrated math “final” in 2025 can be a performance, hard to make, challenging to deliver, but fun and maybe even beautiful to watch.

Authentic Performance


Why Not: 3 Ingredients Enable Universal, Annual Digital Program Evaluations

This post originally appeared in an EdSurge guide, Measuring Efficacy in Ed Tech. Similar content, from a perspective about sharing accountability that teachers alone have shouldered, is in this prior post.

Curriculum-wide programs purchased by districts need to show that they work. Even products aimed mainly at efficiency or access should at minimum show that they can maintain status-quo results. Rigorous evaluations at the student level have been complex, expensive, and time-consuming. However, given a digital math or reading program that has reached a scale of 30 or more sites statewide, there is a straightforward yet rigorous evaluation method using public, grade-average proficiencies, which can be applied post-adoption. The method enables not only districts but also publishers to hold their programs accountable for results, in any year and for any state.

Three ingredients come together to enable this cost-effective evaluation method: annual school grade-average proficiencies in math and reading for each grade, posted by each state; a program adopted across all classrooms in each using grade at each school; and digital records of grade-average program usage. In my experience, school cohorts of 30 or more sites using a program across a state can be statistically evaluated. Once methods and state-posted data are in place, the marginal cost and time per state-level evaluation can be as little as a few person-weeks.
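Joining the three ingredients is mostly a data-plumbing exercise, sketched below. The table shapes, field names, and minimum-cohort threshold are assumptions for illustration; real state files and publisher usage logs would need cleaning and school-ID reconciliation first.

```python
# Illustrative join of the three ingredients; all records are invented.

# Ingredient 1: state-posted grade-average proficiency, keyed by
# (school, grade, year)
state_proficiency = {
    ("Adams ES", 4, 2014): 48.2,
    ("Adams ES", 4, 2015): 53.1,
    ("Baker ES", 4, 2014): 61.0,
    ("Baker ES", 4, 2015): 60.4,
}

# Ingredients 2 and 3: adoption across all classrooms in the grade,
# plus the publisher's digital record of grade-average usage
usage = {
    ("Adams ES", 4): {"minutes_per_week": 62, "whole_grade": True},
    ("Baker ES", 4): {"minutes_per_week": 15, "whole_grade": False},
}

cohort = []
for (school, grade), u in usage.items():
    if not u["whole_grade"]:
        continue  # grade-average scores are only valid at 100% adoption
    prior = state_proficiency.get((school, grade, 2014))
    post = state_proficiency.get((school, grade, 2015))
    if prior is not None and post is not None:
        cohort.append((school, grade, post - prior, u["minutes_per_week"]))

MIN_SITES = 30  # the statistical threshold suggested in the post
print(f"evaluable sites: {len(cohort)} (need >= {MIN_SITES})")
```

Once the cohort list clears the site threshold, the year-over-year proficiency changes can be related to usage and compared against business-as-usual schools.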

A recently published WestEd study of MIND Research Institute’s ST Math, a supplemental digital math curriculum using visualization (disclosure: I am Chief Strategist for MIND Research), validates and exemplifies this method of evaluating grade-average changes longitudinally, aggregating program usage across 58 districts and 212 schools. In alignment with this methodological validation, in 2014 MIND began evaluating all new implementations of its elementary-grade ST Math program in any state with 20 or more implementing grades (from grades 3, 4, and 5).

Clearly, evaluations of every program, every year have not been the prior market norm: it wasn’t possible before annual assessment and school proficiency posting requirements, and wasn’t possible before digital program usage measurements. Moreover, the education market has greatly discounted the possibility that curriculum makes all that much difference to outcomes, to the extent of not even trying to uniformly record what programs are being used by what schools. (Choosing Blindly: Instructional Materials, Teacher Effectiveness, and the Common Core by Matthew Chingos and Russ Whitehurst crisply and logically highlights this “scandalous lack of information” on usage and evaluation of instructional materials, as well as pointing out the high value of improving knowledge in this area.)

But publishers themselves are now in a position, in many cases, to aggregate their own digital program usage records from schools across districts and generate timely, rigorous, standardized evaluations of their own products, using any state’s posted grade-level assessment data. It may be too early or too risky for many publishers. Currently, even just one rigorous, student-level study can serve as sufficient proof for a product, so seeking more universal, annual product accountability is an unnecessary risk. It would be as surprising as a fitness company, given anonymized data, evaluating and publishing its overall average annual fitness impact on club-member cohorts, by usage. Judging by the health-club market, this level of accountability is neither a market requirement nor even dreamed of; those providers have no reason to take on extra accountability.

But while we may accept that member-paid health clubs are not accountable for average health improvements, we need not accept that digital content’s contribution to learning outcomes in public schools goes unaccounted for. And universal content evaluation, enabled for digital programs, can launch a continuous improvement cycle, both for content publishers and for supporting teachers.

Once rigorous program evaluations start becoming commonplace, there will be many findings which lack statistical significance, and even some outright failures. Good to know. We will find that some local district implementation choices, as evidenced by digital usage patterns, turn out to be make-or-break for any given program’s success. Where and when robust teacher and student success is found, and as confidence is built, programs and implementation expertise can also start to be baked into sustained district pedagogical strategies and professional development.


Hold Content Accountable Too: a Scalable Method

This post originally was published on Tom VanderArk’s “VanderArk on Innovation” blog on Edweek. It was also published on GettingSmart. The following is an edited version.

Specific programs and content, not just teachers and ‘teacher quality’, must be held accountable for student outcomes. A recent study published by WestEd shows how, given certain program conditions, cost-effective and rigorous student test score evaluations of a digitally-facilitated program can now be pursued, annually, at any time in any state.

Historically, the glare of the student-results spotlight has been so intensely focused on teachers alone that the programs and content ‘provided’ to teachers have often not even been recorded. Making the case for the vital importance of paying attention is the scathing white paper Choosing Blindly: Instructional Materials, Teacher Effectiveness, and the Common Core, from the Brown Center on Education Policy’s Matthew Chingos and Grover Whitehurst. The good news is: digital programs operated by education publishers for schools organically generate a record of where and when they were used.

Today’s diversity of choices in digital content – choices about scope, vehicles, approaches & instructional design – is far greater than the past’s teacher-selection-committee picking among “Big 3” publishers’ textbook series. This wide variety means content can no longer be viewed as a commodity, as if it were merely a choice among brands of gasoline. Some of this new content may run smoothly in your educational system, some may sputter and stall, while other content may achieve substantially more than normal mileage or power.

It is important to take advantage of this diversity, important to search for more powerful content. The status quo has not been able to deliver results improvements in a timely manner at scale. And spearheaded by goals embodied in the Common Core, we are targeting much deeper student understanding, while retaining last decade’s goals of demonstrably reaching all students. In this pursuit, year after year, the teachers and students stay the same. What can change are the content and programs they use; ‘programs’ including the formal training programs we provide to our teachers.

But how do you tell what works? This has been extremely challenging in the education field, due in equal measures to a likely lack of programs that do work significantly better, to the immense and hard-to-replicate variations in program use and school cultures, and to the high cost, complexity, and delay inherent in conventional rigorous, experimental evaluations.

But. There is a cost-effective, universally applicable way for a large swath of content or programs to be rigorously evaluated: do they add value versus business-as-usual? The method is straightforward, requires no pre-planning, can be applied in arrears, and is replicable across years, states, and program types. It can cover every school in a state, thus taking into account all real-world variability, and it’s seamless across districts, aggregating up to hundreds of schools.

To be evaluated via this method, the program must be:

  1. able to generate digital records of where/when/how-much it was used at a grade
  2. in a grade-level and subject (e.g. 3-8 math) that posts public grade-average test scores
  3. a full curriculum program (so that summative assessments are valid)
  4. in use at 100% of the classrooms/teachers in each grade (so that grade-average assessment numbers are valid)
  5. new to the grade (i.e. evaluating the first one or two years of use)
  6. adopted at sufficient “n” within a state (e.g. a cohort of ~25 or more school sites)

Every program, in every state, every year, that meets the above criteria can be studied, whether for the first time or to validate continuing effectiveness. The data is waiting in state and NCES research files to be used, in conjunction with publisher records of school/grade program usage. This example illustrates a quasi-experimental study to high standards of rigor.
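For a cohort that meets the criteria, the core comparison is simple: the cohort schools’ year-over-year proficiency change versus the change among the state’s business-as-usual schools. The sketch below shows that arithmetic with invented gain figures; a real evaluation would add matching, covariates, and significance tests.

```python
# Cohort vs. business-as-usual comparison; all gain figures invented.
import statistics

cohort_gains = [4.9, 3.1, 6.2, 2.4, 5.0, 1.8, 4.4]        # program schools
state_gains  = [0.5, -1.2, 1.0, 0.3, 2.1, -0.4, 0.8, 1.5]  # all other schools

diff = statistics.mean(cohort_gains) - statistics.mean(state_gains)
# Rough Cohen's-d-style scaling by the spread of the combined sample
pooled = statistics.stdev(cohort_gains + state_gains)
print(f"differential gain: {diff:+.2f} points (~{diff / pooled:.2f} SD)")
```

With criterion 6’s cohort size of ~25-30 sites or more, a differential gain of this kind can be tested for statistical significance rather than just eyeballed.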

It may be too early for this level of accountability to be palatable for many programs just yet. Showing robust, positive results requires that the program itself actually be capable of generating differential efficacy. And of course some program outcomes are not measured via standardized test scores. There will be many findings of small effect sizes, many implementations which fail, and much failure to show statistical significance. External factors may confound the findings. Program publishers would need to report out failures as well as successes. But the alternative is to continue in ignorance, rely only on peer word-of-mouth recommendations, or make do with a handful of small ‘gold-standard’ studies on limited contexts.

The potential to start applying this method now for many programs exists. Annual content evaluations can become a market norm, giving content an annual seat at the accountability table alongside teachers, and stimulating competition to improve content and its implementation.

