New · The 6-week AI Pilot · Fixed scope. Fixed price. £18k. 4 pilots per quarter. See how it works →

Eval-driven AI.
Not vibes-driven.

Bespoke AI solutions built on the latest frontier models - document understanding, agents, copilots, retrieval. We engineer to a measurable quality bar, not a demo - and ship into your environment: any cloud, private VPC, or fully on-prem.

WHAT AN AI BUILD LOOKS LIKE
30 days
to first live workflow
Bespoke
scoped per business
Latest models
day-one ready
Any cloud
on-prem, VPC or hosted
Two pilot slots open · this quarter · See engagement shapes →
How we measure 'good'

A quality bar,
before a single prompt is shipped.

Real numbers from a recent legal contract-AI build. Eight categories, 312 test cases, weighted by client priority. Same suite runs on every commit, and on every frontier model release. No green bar, no deploy.

EVAL SUITE · contract-ai · 312 cases · auto-runs on every commit · CI · main

V1 57% · V3 85% · V5 96%

CATEGORY                      V1 BASELINE   V3 ITERATION   V5 SHIPPED
Extract: party names          71%           92%            99%
Extract: dates + amounts      64%           88%            97%
Classify: contract type       82%           94%            98%
Flag: indemnity clauses       46%           79%            95%
Flag: data-residency terms    38%           74%            93%
Cite source paragraph         55%           83%            96%
Refuse out-of-scope queries   69%           91%            99%
Stay within playbook          42%           81%            94%

Cases: 312 · Models tested: 4 · Cost-per-eval: £0.18 · v5 shipped to production →
Every AI build is backed by its own eval suite. We run it on every commit during the build, and re-run it on every new frontier model release - so an upgrade ships only if it earns its way.
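
As a rough illustration of how a weighted quality bar can gate a deploy, here is a minimal Python sketch. The pass rates are the V1 and V5 figures listed above. The weights, the 95% bar and the function names are hypothetical: the client's real weighting is not published here, so the scores this prints will not exactly match the headline figures.

```python
from typing import Dict

# Per-category pass rates copied from the contract-ai figures above.
V1: Dict[str, float] = {
    "extract_party_names": 0.71, "extract_dates_amounts": 0.64,
    "classify_contract_type": 0.82, "flag_indemnity": 0.46,
    "flag_data_residency": 0.38, "cite_source_paragraph": 0.55,
    "refuse_out_of_scope": 0.69, "stay_within_playbook": 0.42,
}
V5: Dict[str, float] = {
    "extract_party_names": 0.99, "extract_dates_amounts": 0.97,
    "classify_contract_type": 0.98, "flag_indemnity": 0.95,
    "flag_data_residency": 0.93, "cite_source_paragraph": 0.96,
    "refuse_out_of_scope": 0.99, "stay_within_playbook": 0.94,
}

# HYPOTHETICAL weights (not from the source): risk-heavy categories count more.
WEIGHTS: Dict[str, float] = {
    "extract_party_names": 1, "extract_dates_amounts": 1,
    "classify_contract_type": 1, "flag_indemnity": 3,
    "flag_data_residency": 3, "cite_source_paragraph": 1,
    "refuse_out_of_scope": 1, "stay_within_playbook": 2,
}


def weighted_score(pass_rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-category pass rates."""
    total_weight = sum(weights[name] for name in pass_rates)
    return sum(rate * weights[name] for name, rate in pass_rates.items()) / total_weight


def may_deploy(pass_rates: Dict[str, float], weights: Dict[str, float],
               bar: float = 0.95) -> bool:
    """'No green bar, no deploy': block any version scoring below the bar."""
    return weighted_score(pass_rates, weights) >= bar


if __name__ == "__main__":
    for label, rates in (("V1", V1), ("V5", V5)):
        print(label, round(weighted_score(rates, WEIGHTS), 3), may_deploy(rates, WEIGHTS))
```

In a CI job, a `False` result would typically be turned into a non-zero exit code so the pipeline stops before the deploy step.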
What we mean

We don't sell AI dreams.

We sell narrowly-scoped AI workflows that are measurably better than the spreadsheet they replace. No "AI strategy", no AI ethics workshop, no glossy demo that breaks on your real data. Just the thing, evaluated, in your stack.

WHAT WE BUILD
  • Document AI - extraction, classification, redaction
  • RAG copilots over your data, with citations
  • Agents that take real actions in your CRM / DMS
  • Eval suites that survive next year's model swap
  • Custom AI shipped into your stack - any cloud, on-prem or VPC
WHAT WE WON'T
  • Run an 'AI readiness' workshop with sticky notes
  • Ship a chatbot trained on your PDF wiki and call it done
  • Build something that works in demo but not in eval
  • Bolt a copilot onto a process we haven't redesigned
  • Lock you into our hosted models or vector store
Shipped AI

Three AI builds,
live in production.

Legal · 90 fee-earners

Contract triage agent

BEFORE

Senior associates spending 6+ hours per MSA on first-pass review. Compliance bottleneck.

BUILT

On-prem AI reviewer fine-tuned to the firm's playbook, integrated with iManage. Eval suite: 312 cases across 8 contract categories.

Claude Sonnet 4.6 · iManage · pgvector · VPC
12×
faster review
94%
eval pass rate
7 wks
to live
Accountancy · 240 staff

AML doc-AI pipeline

BEFORE

Four FTEs reviewing KYC files by hand, 78% first-pass error rate, audit anxiety.

BUILT

Custom document-AI pipeline pulling identity, source-of-funds, sanctions evidence. Human reviews exceptions only.

GPT-5.5 · Claude Opus 4.7 · Custom .NET
98%
straight-through
£120k
annual payback
6 wks
to live
Recruitment · 60 consultants

CV-to-JD ranking + outreach

BEFORE

Consultants reading CVs that didn't fit briefs. Half a day per consultant, wasted.

BUILT

Retrieval-grounded ranking over CV history with auto-drafted outreach. Two-way Bullhorn writeback with full audit log.

GPT-5.5 · Bullhorn · Next.js · pgvector
8 hrs/wk
per consultant
interviews booked
5 wks
to live
What we actually build

Four shapes of practical AI.

Almost every AI engagement we ship lands in one of these four buckets. We say which in the scoping call - and which model, vector store and host before we charge a penny.

Document understanding
Extraction · classification · redaction
OCR + LLM pipelines that read messy real-world docs - contracts, KYC, suitability files, invoices - and write structured data straight into your systems. Always with citations back to source.
Retrieval (RAG)
Copilots over your data
Question-answering grounded in your documents, knowledge base or CRM. Cited paragraphs only. We design the chunking, the retriever and the re-ranker - not just paste pgvector in.
Agents
AI that takes real actions
Multi-step workflows where the AI plans, calls your tools (CRM, DMS, email, calendar), and reports back. Human checkpoints at every action that changes data.
Copilots
In-app, in-context
AI surfaced inside the screens your team already uses. Drafting, summarising, deciding. Knows the user's role and the firm's playbook. No 'open a new tab' tax.
Latest frontier models, day one. We pick the strongest model for the job - and re-pick when a better one ships. Deployed where it fits: any cloud, a private VPC, or fully on-prem for data-residency-bound work.
Claude Opus 4.7 · GPT-5.5 · Gemini 3.1 Pro · Llama 4 · Custom fine-tunes
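
To make "human checkpoints at every action that changes data" concrete, here is a minimal Python sketch of one way an agent's tool calls could be gated. Everything in it is an assumption for illustration: the `Tool` class, the `mutates` flag, the `approve` callback and the in-memory `crm` dict are hypothetical names, not part of any real CRM or DMS API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    mutates: bool  # True if calling the tool changes data


def execute(tool: Tool, args: Dict[str, Any],
            approve: Callable[[str, Dict[str, Any]], bool],
            log: List[str]) -> Any:
    """Run a tool call; data-changing tools need human approval first."""
    if tool.mutates and not approve(tool.name, args):
        log.append(f"BLOCKED {tool.name}")
        return None
    log.append(f"RAN {tool.name}")
    return tool.run(**args)


# Hypothetical in-memory stand-in for a CRM record store.
crm: Dict[str, str] = {"acme": "prospect"}

read_status = Tool("read_status", lambda account: crm[account], mutates=False)


def _set_status(account: str, status: str) -> str:
    crm[account] = status
    return status


set_status = Tool("set_status", _set_status, mutates=True)

if __name__ == "__main__":
    log: List[str] = []
    deny = lambda name, args: False  # reviewer rejects every change
    print(execute(read_status, {"account": "acme"}, deny, log))
    print(execute(set_status, {"account": "acme", "status": "client"}, deny, log))
    print(log)
```

Read-only tools run straight through, while a rejected write leaves the record untouched and still appears in the log.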
The deal

Fixed scope. Fixed fee.
Per workflow.

Four engagement shapes. Bespoke scope, bespoke pricing per business - we quote a fixed fee after a free 30-minute scoping call.

What every shape includes
  • Engineered to a quality bar
    50–500 weighted test cases that define 'good'. Re-run on every commit during the build.
  • Deployed where it fits
    Any cloud, a private VPC, or fully on-prem for residency-bound work. Your auth, your audit log.
  • Model-agnostic design
    Swap Claude Opus 4.7 for GPT-5.5, Gemini 3.1 Pro or an on-prem Llama later, without rewriting the app.
  • 30-day post-launch tail
    Tuning, regression-guarding and bug fixes included for the first month after go-live.
All application code in your repos. No platform fees. No per-call tax. Bring-your-own-model API keys, on whichever frontier provider fits the job.
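
One common way to get the model-agnostic property described above is to have application code depend on a small interface rather than on a provider SDK. The sketch below is illustrative only: `ChatModel`, `StubModel` and `summarise` are hypothetical names, and the stub stands in for a real adapter, which would call the provider's SDK inside `complete`.

```python
from typing import Dict, Protocol


class ChatModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Placeholder adapter; a real one would wrap a provider client here."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Swapping models is a configuration change: pick a different registry key.
MODELS: Dict[str, ChatModel] = {
    "primary": StubModel("model-a"),
    "fallback": StubModel("model-b"),
}


def summarise(model: ChatModel, text: str) -> str:
    """Application logic: knows nothing about which provider is behind `model`."""
    return model.complete(f"Summarise: {text}")


if __name__ == "__main__":
    for key in ("primary", "fallback"):
        print(summarise(MODELS[key], "Clause 14 limits liability to fees paid."))
```

Because the eval suite exercises `summarise` through the same interface, a model swap can be scored before any application code changes.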
How we work

Eval-first. Then build.

The eval is week one's deliverable. It's the contract between "looks impressive in demo" and "actually works on your real data".

01 · Week 1
Scope
Pick one workflow where AI clearly wins. Map current state, baseline error rate, agree the success metric.
02 · Week 1–2
Build evals
50–500 real cases that define 'good'. Drawn from your data, scored by your experts. The eval ships before the prompt.
03 · Week 2–5
Build to eval
Iterate prompts, retrieval, structure. CI re-runs evals on every commit. We don't deploy a version that regresses.
04 · Week 5–6
Land
Deploy to production. Onboarding, audit log, on-call rota. Confidence intervals on the metrics that matter.
05 · Ongoing
Iterate
Optional. Quality-bar growth, frontier model upgrades, drift-watch dashboards under our Run service.
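
For the "confidence intervals on the metrics that matter" in step 04, one standard choice for a pass rate is the Wilson score interval. The sketch below is a generic implementation, not necessarily the method used on any specific build, and the 300-of-312 figure is a made-up example.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # HYPOTHETICAL result: 300 of 312 eval cases passing.
    low, high = wilson_interval(300, 312)
    print(f"pass rate {300 / 312:.3f}, 95% interval {low:.3f} to {high:.3f}")
```

The interval narrows as the suite grows, which is one practical argument for sitting nearer 500 cases than 50 when a category is high-risk.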
At go-live

What's in the box.

Every AI build comes with the same engineering kit. The quality bar is the bit other AI vendors don't bother to write - and the bit that pays for itself the first time a frontier model release would have broken something.

Quality bar (eval suite)
50–500 weighted cases, CI-tested on every commit during the build. Re-run on every new frontier model release.
AI workflow in production
Application code in your repo. Deployed to your cloud, VPC, or on-prem. Production-grade, not demo-grade.
Audit-ready logging
Every prompt, retrieval, tool call, output and confidence score logged. FCA / ICO / internal-audit ready.
Data-residency design
Where the data lives, where it goes, where it doesn't. UK / EU / on-prem options. Signed DPA.
Live MI dashboard
Pass rate, latency, cost-per-call, hallucination flags. Daily, not retrospective.
Operator training
Two sessions: end-users who use it, leaders who report on it. Recorded for the next hire.
Runbook + failure modes
What breaks, why, how to recover. The bit other AI vendors skip and you find out at 2am.
Optional Run handover
We operate it: frontier model upgrades, drift-watch, regression guarding. Or we hand over, with a 30-day support tail.
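
As a sketch of what the audit-ready logging above might capture per request, here is a minimal Python record serialised as one JSON line. The field names and the `AuditRecord` class are assumptions chosen to mirror the list in the text (prompt, retrieval, tool call, output, confidence score); a real schema would be agreed with the client's audit team.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class AuditRecord:
    request_id: str
    prompt: str
    retrieved_ids: List[str]          # which source chunks were retrieved
    tool_calls: List[Dict[str, Any]]  # name and arguments of each tool call
    output: str
    confidence: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl(record: AuditRecord) -> str:
    """Serialise one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    # HYPOTHETICAL values, for illustration only.
    record = AuditRecord(
        request_id="req-001",
        prompt="Does this file include source-of-funds evidence?",
        retrieved_ids=["doc-12#p3"],
        tool_calls=[{"name": "sanctions_lookup", "args": {"name": "J. Doe"}}],
        output="Yes, see paragraph 3.",
        confidence=0.91,
    )
    print(to_jsonl(record))
```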
Receipts

Less talking. More shipping.

Accountancy · 240 staff · UK · Manchester

From four FTEs reviewing AML files by hand - to one reviewer, 12 minutes per file.

A 240-person UK accountancy was burning four FTEs on AML onboarding, with a 78% first-pass error rate. They'd already had two prior AI initiatives die in slide decks. We scoped it in week one, shipped to staging by week three, and went live in week six.

"AssurePath had something running in our staging tenant by week three. We'd had two prior 'AI initiatives' that died in slide decks."
- Operations Director, 240-person UK accountancy
IMPACT · 6 MONTHS POST-LAUNCH
22 hrs/wk
saved across the AML team
£120k
annual cost reduction, year 1
98%
first-pass straight-through
6 wks
scope to in-production
100%
SAR-ready audit trail
+2.3
new partners hired off the savings

"They scoped what we were trying to do in 40 minutes. The previous shop took six weeks and got it wrong."

Managing Partner
90-fee-earner law firm, London

"We were quoted £450k by a Big Four practice. AssurePath delivered the same outcome in eight weeks for £62k."

Finance Director
Mid-cap recruitment group, Birmingham

"Honest, technical, on time. They told us not to do half the things we asked. That's why we trust them."

Operations Director
FCA-regulated wealth manager, Edinburgh
FAQ

Questions we answer in week one.

Eight of the questions clients ask before commissioning an AI build.

Before we write a prompt, we write the test suite. Fifty to five hundred real cases drawn from your data, scored by your experts, weighted by what actually matters. It runs on every commit during the build, and on every new frontier model release after go-live. We don't deploy a version that regresses against it - the model that's good enough in week three has to still be good enough next year when a frontier model upgrade lands.
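
The "no version that regresses" rule can be sketched as a per-category comparison between the shipped baseline and a candidate, for example the same workflow re-run on a newly released model. In the Python sketch below, the baseline values are three of the V5 contract-ai pass rates quoted earlier on this page; the candidate numbers, the function name and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """Categories where the candidate scores below baseline minus tolerance.

    A category missing from the candidate counts as a regression.
    """
    return sorted(
        name for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    )


if __name__ == "__main__":
    shipped = {"flag_indemnity": 0.95, "flag_data_residency": 0.93,
               "cite_source_paragraph": 0.96}
    # HYPOTHETICAL scores for a candidate running on a newer model.
    new_model = {"flag_indemnity": 0.97, "flag_data_residency": 0.91,
                 "cite_source_paragraph": 0.96}
    failed = regressions(shipped, new_model)
    print("blocked:" if failed else "ok to ship", failed)
```

Checking per category rather than only the overall score stops a gain in one area from hiding a loss in another.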

Let's scope a build

Eval-driven AI.
Not vibes-driven.

Book a 30-minute scoping call with an engineer who's shipped AI into production. We talk through the workflow, what 'good' looks like, and whether the maths works. If it doesn't, we'll say so.

01
Quality bar first
Week one writes the test suite, not the demo.
02
Any cloud, any model
Latest frontier models, deployed where it fits - cloud, VPC or on-prem.
03
Live in 4–9 weeks
Or we tell you in week one it isn't worth doing.