Kirkpatrick may only be a starting point


Some days, evaluation models feel less like tools and more like competing voices in your head. As an instructional designer, you know you should be intentional about evaluation. The more useful question is simpler and more practical: which model fits this specific project right now?

From an instructional designer's perspective, I will walk through the decision-making process in three real-world scenarios, each paired with one of three evaluation frameworks, to help outline when each model is the right fit.

A quick primer on the three models

Before getting into the scenarios, it helps to be clear on what each model is trying to do at a high level. Kirkpatrick's Four Levels of Evaluation looks at training from four angles: reaction, learning, behavior, and results, so you can see whether people liked the experience, what they learned, what they applied, and what changed for the business. Phillips' Return on Investment (ROI) keeps those four levels but adds a fifth layer by converting results into financial terms, so leaders can compare learning against other investments in money language. The Context, Input, Process, and Product (CIPP) model steps back even further and treats evaluation as a means to support decisions throughout the life of a program, from understanding the problem all the way through to identifying intended and unintended outcomes in the real system.

When Kirkpatrick is the least bad option

Let's say the stakeholder is a regional sales director who wants a new onboarding program for account executives. Her opening line is blunt: she needs new hires to ramp faster and stop losing deals that should be won. She does not care about theory. She wants to know that the training is not a time sink.

In this scenario, reach for Kirkpatrick first.

  • Level one gathers structured reactions from new hires and their managers after each module.

  • Level two captures what they can actually do in simulations and knowledge checks.

  • Level three asks what managers see in the field thirty to ninety days later.

  • Level four connects to ramp-to-quota time, win rates, and maybe average deal size over the first six months.

A concrete example:
At a previous company, the designer rebuilt sales onboarding around customer problem discovery instead of the stakeholders' pitch-heavy decks. Simple post-session surveys and short scenario-based quizzes covered levels one and two. For level three, frontline managers completed a brief checklist after joint customer calls in the first month to assess whether reps asked diagnostic questions and handled objections the way they had practiced. For level four, time to first closed deal was tracked, and first-quarter win rates were compared with those of the last cohort that went through the old onboarding.

Why Kirkpatrick works here:

  • The director already thinks in pipeline and conversion terms, so level four feels natural.

  • The behaviors we care about are observable on calls, so level three is realistic.

  • The model gives me a straightforward narrative from learner experience to sales results without bogging her down in evaluation jargon.

Why it doesn't:

  • Level three requires that the designer stay contracted, be rehired as a contractor, or be an in-house designer with the luxury of assessing managers in the field thirty to ninety days later. The client has the training in hand but may not want to pay for the contractor's time to return or stay through those ninety days, even if the designer is available. In some cases, a whole new contractor is hired to conduct the follow-up analysis, requiring them to familiarize themselves with the previous contractor's work. The in-house designer may not be afforded the bandwidth to perform anything beyond superficial level one and level two evaluations.

  • Level four is more costly, as it requires the designer to connect the training to ramp-to-quota time, win rates, and average deal size over the first six months. The logistics mirror level three, but over 180 days rather than 90, and far more data must be culled and coded to establish the efficacy of the training initiative, which drives the cost up further.

The catch is that the designer has to guard against stopping at the easy levels. If all the designer walks away with is high satisfaction scores, they have not really evaluated anything that matters to the stakeholders. In my career, I cannot remember a time when leadership allowed the budget, in dollars or in time, to conduct any level three or four assessments.

When Phillips ROI earns its keep

Now imagine a different conversation. The chief financial officer is reviewing budgets and sees a seven-figure line for leadership development. The designer gets the classic questions: “How do we know this is worth it?” and “If we cut it in half, what would we lose?”

In this situation, using Phillips ROI in addition to Kirkpatrick is worth the extra effort. The Learning and Development (L&D) specialist still cares about learning, behavior, and results, but the conversation will not land unless the evaluation translates at least part of those results into money.

A concrete example:
Let's say that in a distribution network, there is a problem of high turnover in warehouse roles, which drives overtime, recruiting expenses, and lost productivity. A designer rolls out a leadership program focused on coaching and retention for frontline supervisors.

  • First, Kirkpatrick levels one through three were covered with surveys, skill practice assessments, and manager observation of coaching conversations.

  • For a level four outcome, voluntary turnover in the targeted warehouses was compared with similar sites that did not yet have the program.

  • To add Phillips-style ROI, avoided turnover was translated into cost savings using the company's standard replacement cost per role, which includes recruiting time, onboarding, and lost productivity.

The result was not a single magic ROI number carved in stone. It was a range. If the turnover improvement holds at the current level, the program likely pays for itself twice over in the first year. If it drops by half, the program still breaks even. If the effect disappears, the program is pure cost. That range gave the CFO a way to view the program as an investment with a clear risk profile rather than just another expense.
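
To make that arithmetic concrete, here is a minimal sketch of the range calculation in Python. Every number in it is an illustrative assumption, not a figure from the program above; in practice you would plug in the company's own replacement cost per role and the fully loaded program cost.

```python
# Minimal sketch of a Phillips-style ROI range. All numbers are illustrative
# assumptions, not real program data.

REPLACEMENT_COST_PER_ROLE = 15_000   # assumed fully loaded cost to replace one warehouse worker
PROGRAM_COST = 120_000               # assumed total cost of the supervisor program

def roi_percent(avoided_departures: int) -> float:
    """Phillips ROI: net program benefits divided by program cost, as a percentage."""
    benefits = avoided_departures * REPLACEMENT_COST_PER_ROLE
    return (benefits - PROGRAM_COST) / PROGRAM_COST * 100

# Three scenarios: the turnover effect holds, drops by half, or disappears.
scenarios = {"effect holds": 16, "effect halved": 8, "effect disappears": 0}
for label, avoided in scenarios.items():
    print(f"{label}: ROI = {roi_percent(avoided):.0f}%")

# effect holds:      16 * 15,000 = 240,000 in benefits -> ROI = 100% (pays for itself twice over)
# effect halved:      8 * 15,000 = 120,000 in benefits -> ROI = 0% (break even)
# effect disappears:  0 in benefits                    -> ROI = -100% (pure cost)
```

Presented this way, the CFO sees the same three sentences as the paragraph above, just with the assumptions exposed and easy to challenge.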

Why Phillips works here:

  • The question on the table is explicitly financial, not just whether the training worked.

  • The business already has credible cost metrics, such as cost per hire or cost of overtime, that can be reused instead of inventing numbers.

  • Senior leaders are comparing this program to other investments, so putting it into the same currency keeps the training program in the conversation.

The tradeoff is transparency. L&D has to be willing to show the assumptions behind the ROI estimate and to say that this is an informed estimate, not a laboratory experiment. Honesty about those assumptions usually increases trust rather than decreasing it.

When CIPP stops you from building the wrong thing

The third scenario looks very different. A customer support director asks for a training program because average handle time is high and customer satisfaction scores have slipped. The initial request is familiar: soft skills training for agents. Experience tells the designer that there may have been no needs assessment at all, or that the request rests on a wrong assumption or an incomplete interpretation of a needs analysis.

Here, CIPP helps more than either Kirkpatrick or Phillips at the start. CIPP advocates evaluating the situation as a living system rather than just a course.

Context:
What is really going on in the support environment? Are call volumes up? Is the product more complex? Are agents juggling too many systems? Are performance targets clear? Do they have conflicting metrics, such as speed versus first contact resolution?

Input:
What solutions are on the table? Is training actually the right lever? Do we need better knowledge base tools, updated scripts, revised policies, or even changes to staffing models?

Process:
How are we delivering whatever solution we pick? Are supervisors reinforcing new behaviors? Are systems changes rolling out cleanly? Is there space in the schedule for practice and coaching?

Product:
What happens as a result? What happens to handle time, first contact resolution, escalations, and customer satisfaction scores? What unintended effects show up, such as burned-out agents or frustrated customers in one segment?

A concrete example:
In this case, a support leader wanted empathy training because customer satisfaction scores dropped for a new product. Upon review, it was discovered that the product had launched with incomplete documentation, and agents had to jump between three systems to find accurate answers. Many calls ran long because they were searching while apologizing to the customer. If a designer had jumped straight into designing a Kirkpatrick-based training evaluation, they would have implicitly accepted training as the solution.

By using CIPP, we treated the evaluation as an ongoing, iterative decision tool:

  • Context data from interviews, call reviews, and system walk-throughs showed that the biggest blockers were process and tool issues, not empathy.

  • Input-focused discussions led to a combined plan to simplify knowledge access, adjust staffing at peak times, and then add a short, focused training piece on how to handle uncertainty with customers.

  • Process checks during rollout helped us spot that some supervisors were not giving agents time to practice the new flows.

  • Product measures showed improvements in handle time and customer satisfaction once the tools and scripts improved. The training component helped, but it was not the hero of the story.

Why CIPP works here:

  • The root problem may not be a skill gap.

  • Multiple interventions will likely run in parallel, and that needs a way to evaluate the whole package.

  • Stakeholders benefit more from better decisions over time than from a tidy after-action report on a single course.

CIPP does not replace Kirkpatrick or Phillips. It makes sure that the right problem is solved before someone invests time in a detailed training evaluation.

A quick way to choose in the moment

Under time pressure, a designer can mentally run three questions that map to these models (a small sketch of the same branching follows the list):

  • Do we already know training is the right lever, and do stakeholders mostly just care whether the program worked? If yes, start with Kirkpatrick.

  • Is the primary challenge making a financial case to leaders who see this as a cost item? If yes, extend Kirkpatrick with Phillips and plan for at least a rough ROI estimate.

  • Are we still fuzzy about the real problem? Is training just one of several possible levers? If yes, use CIPP to structure the discovery and keep the evaluation focused on decisions, not just reports.
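
For those who like to see the branching written out, here is a minimal sketch of that three-question check as a small Python function. The function name and the wording of the recommendations are placeholders of mine, not part of any of the three frameworks.

```python
def recommend_model(training_is_right_lever: bool, needs_financial_case: bool) -> str:
    """Map the three gut-check questions above to a starting framework.

    Mirrors the checklist, not an official decision tree from any model:
    unclear problem -> CIPP; financial case needed -> Kirkpatrick + Phillips;
    otherwise -> Kirkpatrick alone.
    """
    if not training_is_right_lever:
        # Still fuzzy on the real problem, or training is only one of several levers.
        return "Use CIPP to structure discovery and keep the evaluation tied to decisions."
    if needs_financial_case:
        return "Extend Kirkpatrick with Phillips and plan at least a rough ROI estimate."
    return "Start with Kirkpatrick's four levels."

# Example: training is clearly the right lever, but the CFO wants a financial case.
print(recommend_model(training_is_right_lever=True, needs_financial_case=True))
```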

From an instructional design seat, the real move is not picking a single perfect model. It is choosing the minimum structure that gives stakeholders better decisions and gives you, as a designer, enough evidence to either defend your work or adjust it. On good days, each of these three models provides a different way to ask the most helpful question for any project: what are we learning about the system we are trying to change?
