What Good AI Actually Evaluates

The real question is not whether a tool uses AI. It is whether the tool measures signals that predict performance instead of proxies that only feel convenient.

March 4, 2026 · 6 min read

Part 1 drew a line the hiring technology market has worked hard to blur: automating a broken process is not the same as improving hiring quality.

This article is about the second path. If a system is going to claim intelligence, it should be able to explain what it actually evaluates and why that signal matters.

The signal problem

Every AI hiring tool evaluates something. The real question is whether that something predicts the outcome you care about: whether a person will perform well in a specific role, in a specific environment.

Most systems still evaluate proxies. Titles, years of experience, credentials, and keyword density all correlate with outcomes often enough to feel useful, but correlation is not the same as prediction.
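
To make that gap concrete, here is a deliberately naive Python sketch. The keyword list, the resume strings, and the scoring rule are all hypothetical illustrations of proxy logic, not any specific vendor's implementation.

```python
# Hypothetical proxy scorer: keyword density, a common resume-screening proxy.
# All names and data below are invented for illustration.

REQUIRED_KEYWORDS = {"python", "kubernetes", "microservices", "agile"}

def proxy_score(resume_text: str) -> float:
    """Fraction of words that hit the keyword list."""
    words = resume_text.lower().split()
    hits = sum(1 for w in words if w in REQUIRED_KEYWORDS)
    return hits / max(len(words), 1)

buzzword_resume = "agile python kubernetes microservices agile python"
substantive_resume = (
    "rebuilt the billing pipeline cutting failed charges forty percent "
    "then migrated it to kubernetes"
)

# The keyword-stuffed resume wins on the proxy, even though only the
# second one describes demonstrated work product.
print(proxy_score(buzzword_resume))      # 1.0
print(proxy_score(substantive_resume))   # ~0.07
```

The proxy correlates with relevance across large populations, which is exactly why it feels useful, and exactly why it fails on the individual case.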

That gap is where many hiring failures live. Harvard Business School and Accenture documented that qualified candidates are regularly deprioritized because their backgrounds do not produce the right markers, even when the underlying capability is present.

Genuine capability evaluation starts from a different premise: what someone can do, how they solve problems, and how they have performed in comparable contexts matters more than the visual shape of their resume.

The strongest signals often live outside the visible resume shorthand most tools were built to rank.

What genuine capability evaluation looks at

Demonstrated work product. The first question is what a candidate has actually produced, changed, or solved, not just what they were nominally responsible for.

Problem-solving approach. How someone reasons through a challenge is often more predictive than whether they have seen an identical scenario before.

Context-specific fit. A strong candidate in one organization can fail in another if the leadership style, team shape, or operating conditions are materially different.

Trajectory, not snapshot. A resume captures a moment. Better evaluation looks at how someone has developed over time and whether their capability is compounding.
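
One way to picture those four signal families is as structured data rather than resume text. A minimal sketch, with the field names, tags, and trajectory scale all assumed for illustration:

```python
# Hypothetical capability profile: the four signal families as data.
# Field names and scales are assumptions, not a production schema.
from dataclasses import dataclass, field

@dataclass
class CapabilityProfile:
    work_products: list[str]       # what was actually produced, changed, or solved
    approach_notes: list[str]      # how the candidate reasons through problems
    context_tags: set[str] = field(default_factory=set)      # environments performed in
    skill_levels: list[float] = field(default_factory=list)  # assessed level over time

    def is_compounding(self) -> bool:
        """Trajectory, not snapshot: is capability still growing?"""
        return len(self.skill_levels) >= 2 and self.skill_levels[-1] > self.skill_levels[0]

candidate = CapabilityProfile(
    work_products=["rebuilt billing pipeline", "cut failed charges 40%"],
    approach_notes=["instrumented first, optimized second"],
    context_tags={"regulated", "small-team"},
    skill_levels=[2.0, 2.8, 3.5],
)
print(candidate.is_compounding())  # True
```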

Capability evaluation looks at demonstrated work and trajectory, not just the most polished snapshot in front of you.

Why generalist models can't do this well

Large generalist models are good at pattern recognition across broad data. They can surface candidates who resemble populations that have historically performed well elsewhere.

What they cannot do automatically is understand your organization specifically. They do not know what your top performers share beyond what is visible in generalized training data.

That limitation matters. The signals that predict success for a senior engineer in a fast-growth software company are not the same signals that predict success for a senior engineer in a regulated financial institution.

Context-specific models, trained on your environment, your workflows, and your actual top-performer patterns, can calibrate decisions to the outcomes you care about rather than inferring them broadly from everyone else's data.
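
One way to read "calibrated to outcomes you care about" is literal: fit even a simple model on your own hiring outcome history. A toy sketch using scikit-learn, where the three features and six data points are invented purely to show the mechanic:

```python
# Hypothetical calibration sketch: fit on *this* organization's outcomes,
# not on generic resume patterns. Features and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per past hire: [shipped_comparable_work, regulated_env_experience,
# trajectory_slope]. Label: 1 = performed well here, 0 = did not.
X = np.array([
    [1, 1, 0.8],
    [1, 0, 0.5],
    [0, 1, 0.2],
    [0, 0, 0.1],
    [1, 1, 0.6],
    [0, 0, 0.3],
])
y = np.array([1, 1, 0, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# A candidate with an unconventional resume but signals that matched
# success for past hires in this specific environment:
candidate = np.array([[1, 1, 0.7]])
print(model.predict_proba(candidate)[0, 1])  # probability of success, per our data
```

A real system would need far more data and far better features than this; the point is only where the calibration comes from.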

Context matters. A generic model gives a rough map; a calibrated model gives a route through your actual operating environment.

What this looks like in practice

The practical difference shows up at the interview stage. Proxy evaluation sends forward the candidates whose backgrounds most closely resemble past hires.

Capability evaluation sends forward candidates who have demonstrated the relevant skills and behaviors, even if the path they took to develop them looks unconventional on paper.

That changes more than ranking. It changes the pool itself, because candidates who would have been filtered out by proxy logic now become visible.
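
The pool change can itself be shown as a set difference. The candidate labels and filter rules here are invented; the point is that the two filters admit different people, not the same people in a different order.

```python
# Hypothetical pool comparison: proxy filter vs. capability filter.
candidates = {
    "A": {"title_match": True,  "demonstrated_skill": False},
    "B": {"title_match": False, "demonstrated_skill": True},   # unconventional path
    "C": {"title_match": True,  "demonstrated_skill": True},
}

proxy_pool = {name for name, c in candidates.items() if c["title_match"]}
capability_pool = {name for name, c in candidates.items() if c["demonstrated_skill"]}

print(sorted(proxy_pool))                     # ['A', 'C']
print(sorted(capability_pool))                # ['B', 'C']
print(sorted(capability_pool - proxy_pool))   # ['B']: visible only under capability logic
```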

The final article in this series covers the buyer side of the equation: what HR leaders should demand when a vendor claims their system is making better hiring decisions.

Conclusion

  • Good AI evaluates capability, context, and trajectory rather than relying on resume proxies alone.
  • The practical result is not just a different ranking of familiar candidates, but access to a different and often stronger candidate pool.

References

  1. Harvard Business School and Accenture. Hidden Workers: Untapped Talent. https://www.hbs.edu/managing-the-future-of-work/Documents/research/hiddenworkers09032021.pdf
  2. Harvard Business School Working Knowledge. How to Tap the Talent Automated HR Platforms Miss. https://www.library.hbs.edu/working-knowledge/how-to-tap-the-talent-automated-hr-platforms-miss

Next in this series

What HR Leaders Should Actually Demand

HR leaders should demand signal clarity, organizational specificity, explainability, integration with existing systems, and evidence tied to performance outcomes instead of demo theatrics.

What AI Can and Can't Do · Part 2 of 3
