AI·Engineering·Q1 2026

Head of Inference.

An AI infrastructure platform serving three model families had reached the point where production cost ceilings, not model quality, set the shape of the company. The CEO retained Spectrum to find a Head of Inference who could own the serving stack end to end. The seat was newly created and reported directly to the CEO.

The brief

What this seat existed to do.

The platform was scaling fast through a long, unglamorous middle: tail-latency regressions, fleet utilisation gaps, GPU class mismatches and cost-per-token slippage that compounded across customers. The CEO and CTO had been splitting the work of inference between them and had concluded that it warranted its own owner, one accountable for latency, cost and reliability across the three model families on the platform. The brief was for someone who had personally shipped a production inference stack at frontier scale, not supervised one.

The non-negotiables were narrow: hands-on systems experience at scale, demonstrable judgement on the cost/latency/reliability triangle, and an operating temperament suited to a small senior leadership group. The comp band was structured around equity at the inflection rather than cash. The geography was open across Europe and the US west coast, with a clear preference for time-zone overlap with the existing engineering leadership. The CEO was explicit that they did not want a research-leaning candidate, nor a manager who had drifted from the stack.

Market read

How we read the available pool.

Inference leadership sits in an awkward seam in the market. The work spans systems engineering, ML serving, accelerator economics and customer-facing reliability — a combination that few senior engineers have lived through in full. The frontier-lab bench has compressed and reference networks are tight; the small number of practitioners who have shipped serving at scale are well-known to one another, and most are not on the open market. The pull from foundation labs and the infrastructure companies adjacent to them has thinned the pool further, particularly for candidates with judgement on multi-model serving rather than single-family deployment.

Our read was that a credible shortlist would not be drawn from inbound interest or from public profiles alone. It would be built from inside the labs, from the infrastructure teams underneath them, and from the small group of practitioners who had moved between them. Search firms without practitioner depth tend to cycle the same shortlist for this seat — the same handful of names on the same handful of decks. We assessed candidates on what they had actually owned: which production incidents they had carried, which cost ceilings they had moved, which serving systems they had personally designed.

Shortlist

How we composed it.

The shortlist composition reflected the brief's emphasis on hands-on serving experience over generalist platform leadership. We weighted candidates with multi-model production exposure ahead of those whose track record was deepest on a single family, and we held the line on operating temperament alongside technical depth.

Inference lead from a frontier-lab production team.
Systems engineer who had built and operated multi-tenant serving.
Practitioner with public work on accelerator economics.
Senior engineer who had moved between labs and applied AI.

Outcome

What was placed.

The hire came from inside the broader frontier ecosystem — a practitioner who had personally led a multi-model serving redesign and had carried the on-call rota through it. What made them right was less the pedigree than the operating shape: comfortable owning the cost ceiling, willing to sit in the customer-impact meetings, and explicit about which problems they would not delegate. The CEO and CTO read the same signal in the second interview and the search compressed quickly from there.

The close ran in parallel with a competing offer from an adjacent infrastructure company. We held the candidate close on the equity structure rather than on cash and worked through the comp framing with the CEO over the final week. Brief to offer ran twelve weeks, with the offer accepted inside the second of two reference cycles.

The firm's reflection

“The lesson reinforced one we already held: senior inference is a hands-on seat dressed up as a leadership one, and the assessment has to follow the work rather than the title. Candidates who had managed serving without owning it read fluently in interview and fell apart in the technical reference. The right hire was the one whose former colleagues described the production incidents in the same language the candidate had used unprompted. That alignment between a candidate's own narrative and the back-channel record is the signal we lean on hardest at this seniority.”

— Craig Oliver

Get in touch

Briefs welcome.

Get in touch if a senior or executive role is on your roadmap. A specialist will reply within two working days.
