We are regularly asked whether a local LLM (an AI model running on our own servers, sending nothing to a third party) can replace a hosted model for business tasks. Rather than answer on instinct, we measured. As of 20 May 2026, we have put 6 local models (via Ollama) up against several real RDEM tasks, from content generation to code audit to advisory second opinions. Here are our conclusions, and the reasoning behind each one.
The result fits in one sentence: the format holds, factual reliability collapses. The ability to write and structure is there; but as soon as you leave the best model for each task, the flaw is not the form — it is fact hallucination. The answer is not binary, though: it all depends on where the error lands. Here is how we verified it.
Why test local AI
Local is not a fad: it answers two constraints a hosted model does not always address.
Data sovereignty
Some data must never leave your infrastructure — trade secrets, health data, GDPR or contractual constraints. A local model running on servers you control removes the transfer to a third party.
Near-zero marginal cost
Once the hardware is amortized, each generation costs almost nothing beyond electricity and operations. At massive, repetitive volume the economic argument becomes serious, provided quality follows.
The real question is therefore not "is it possible?" but "is it reliable without human review?". That is what the protocol set out to settle.
The protocol: two real business tasks, fact-checked
No academic benchmark: two tasks we actually perform, with the same prompt for each model, then a systematic comparison against a verified ground truth. It is this fact-check that separates a flattering demo from a usable result.
Code audit of a business application (Laravel)
Prompt: "full audit: quality, security, duplication, tests, performance". Output checked against the actual repository, line by line.
E-commerce product page generation (SEO)
6 mandated sections (HTML description, meta, SEO title, H1, keywords, tags), on 5 products chosen for their factual traps (publisher, year, license not to be invented).
Results — code audit
Only one model proved usable, and only for review: Gemma 26B (128k context). It identified the project's real architectural issue — a monolithic service of more than 5,000 lines — and correctly qualified the security posture. Typos in names (an effect of quantization) remained minor.
| Model | Real file/line references | False claims | Usable |
|---|---|---|---|
| Gemma 26B (128k) | Yes | Almost none | Yes, for review |
| Qwen (v2) | Partial | Few, but generic | Debatable |
| Qwen (v1) | No | Numerous | No |
| Qwen 27B | No | Almost all | No |
The most telling example
One model invented the absence of CSRF protection, automated tests and rate limiting — all present — as well as a Selenium test folder and a configuration file that do not exist. A hallucinated "gap" looks like a serious finding: without review, you fix a problem that was never there.
The common blind spot: nobody read the CI
Every model recommended adding static analysis, security scanners and tests… already in place in the continuous integration pipeline. None had read the CI file. That is the first fact-check reflex: did the model actually look at what is running, or did it hallucinate a generic project?
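Part of this fact-check can be mechanized. As a minimal sketch, assuming the model cites code as `path:line` (a citation format picked here for illustration, along with the hypothetical helper name `check_references`), the following Python verifies that each cited file and line actually exists in the repository. It says nothing about whether the claim attached to that line is true; that still takes a reviewer.

```python
import re
from pathlib import Path

# Assumed citation format: "app/Services/Foo.php:120" (path, colon, line number).
REF_PATTERN = re.compile(r"([\w./-]+\.\w+):(\d+)")

def check_references(audit_text: str, repo_root: str) -> list[tuple[str, int, str]]:
    """Return (path, line, status) for every path:line reference found in the audit."""
    results = []
    root = Path(repo_root)
    for path_str, line_str in REF_PATTERN.findall(audit_text):
        line_no = int(line_str)
        target = root / path_str
        if not target.is_file():
            status = "missing file"
        else:
            # Count the lines to see whether the cited line number can exist.
            with target.open(encoding="utf-8", errors="replace") as fh:
                line_count = sum(1 for _ in fh)
            status = "ok" if 1 <= line_no <= line_count else "line out of range"
        results.append((path_str, line_no, status))
    return results
```

Any reference flagged "missing file" or "line out of range" is a strong hint that the model described a generic project rather than the one it was given.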
Results — SEO product page generation
Here, no local model is usable as-is. The most striking case: Qwen3-coder produces a perfect format (6 sections, valid slugs, structure respected) but disastrous facts. For one game it invented the franchise, the development studio and the release year.
| Model | Format | Factual fidelity | Verdict |
|---|---|---|---|
| Qwen3-coder | Excellent | Poor | Great format, wrong facts |
| Gemma 27B (CPU offload) | Average | Least bad | Semi-supervised only, slow |
| Mistral | Absent | Variable | 2/5 products, unusable |
| deepseek-r1 | Forced markdown | Good reasoning | Drops the description |
| Gemma e4b | Template leaks | The worst | No |
The trap: a hallucination with a valid slug
The worst case is not a broken format, which is visible at a glance. It is the hallucination whose form is valid: a well-formed but wrong license slug passes the server filter without raising any alarm. This is exactly the kind of error that no syntactic validation catches, and that ends up live on the store.
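The gap between syntactic and factual validation fits in a few lines. This is a sketch under stated assumptions: the slug grammar (lowercase alphanumerics separated by hyphens), the function names and the example slugs are all hypothetical, and the verified slug is assumed to come from a trusted source such as a reference database.

```python
import re

# Assumed slug grammar: lowercase letters and digits, separated by single hyphens.
SLUG_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def is_well_formed(slug: str) -> bool:
    """Syntactic check only: says nothing about whether the slug is the right one."""
    return SLUG_RE.fullmatch(slug) is not None

def check_license_slug(slug: str, verified_slug: str) -> str:
    """Factual check: the slug must also match the verified license for this product."""
    if not is_well_formed(slug):
        return "rejected: malformed"
    if slug != verified_slug:
        return "rejected: well-formed but wrong license"
    return "accepted"
```

With a hypothetical product whose verified license is `franchise-a`, the slug `franchise-b` passes `is_well_formed` yet is rejected by `check_license_slug`. A filter that stops at the first function lets it through.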
Why a wrong fact costs more than a missing fact
On an indexable catalog, factual hallucination is not a mere quality defect: it is the costliest dimension of the whole experiment, because those facts end up indexed and structured. This is our area of expertise — here is why.
1. Structured data = machine-readable hallucination
A product page carries Product / Offer schema. An invented publisher, year or license is not just a text error: it is serialized into JSON-LD and read as-is by Google. Risk: loss of rich-result eligibility ("Misleading structured data"). Yet local fails precisely on the fields that feed the markup.
2. Amplified GEO / E-E-A-T risk
A growing share of traffic comes from AI search. A page asserting a false fact becomes a source that LLMs cite and then propagate: the error is "laundered" into an AI answer, and the E-E-A-T damage becomes durable, hard to correct once ingested.
3. Taxonomic pollution — the worst case
The "valid-slug" hallucination passes the server filter and ties the product to a wrong license / wrong tag. Internal linking then builds false thematic clusters: product filed under the wrong franchise, diluted category pages, cannibalization. Invisible when reviewing the page itself.
4. Over-optimization and thin content at scale
"Spammy" keywords and off-template titles are negative signals and hurt CTR. And an unanchored model tends to produce near-duplicate descriptions across similar products — thin content + cannibalization. Yet a page's SEO value comes precisely from the exact differentiating fact (edition, region, year): what local invents.
Implication: the safeguard is not to "reread the text" but to anchor the facts before generation and to validate the markup and the taxonomy server-side, not just slug validity. A page with no year does not lie to Google; a page with the wrong year does.
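A minimal Python sketch of "omit rather than guess" for the markup. The input field names (`name`, `publisher`, `year`) and their mapping onto schema.org `Product` properties (`brand`, `releaseDate`) are assumptions chosen for illustration, not our production schema.

```python
import json

def build_product_jsonld(verified: dict) -> str:
    """Serialize verified facts only; a missing fact is omitted, never filled in."""
    doc = {"@context": "https://schema.org", "@type": "Product"}
    if verified.get("name"):
        doc["name"] = verified["name"]
    if verified.get("publisher"):
        # Assumption: the publisher is exposed as the product's brand.
        doc["brand"] = {"@type": "Brand", "name": verified["publisher"]}
    if verified.get("year"):
        doc["releaseDate"] = str(verified["year"])
    return json.dumps(doc, ensure_ascii=False, indent=2)
```

The markup is built from the verified record, never from the model's text, so a hallucinated year in the description cannot leak into the JSON-LD.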
The common finding: the format holds, the facts collapse
Across both tasks, the failure has the same signature. Local models know how to structure; they cannot guarantee a fact. And the more the model is compressed to fit in memory, the more fidelity drops.
Two mechanisms make it worse: quantization (model compression), which degrades precision, and a poorly controlled context. An oversized data block saturates the window and contaminates the output, because the model picks from whatever it is shown. Pre-filtering the context on the application side fixes part of the problem.
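Application-side pre-filtering can be as simple as a whitelist. The sketch below is illustrative only: the function name, the field names and the 500-character cap are assumptions, not the values we used.

```python
def prefilter_context(record: dict, allowed_fields: tuple[str, ...], max_chars: int = 500) -> dict[str, str]:
    """Keep only whitelisted, non-empty fields and truncate oversized values."""
    filtered = {}
    for field in allowed_fields:
        value = record.get(field)
        if value is None or value == "":
            continue  # an absent fact stays absent; the model is not asked to fill it
        filtered[field] = str(value)[:max_chars]
    return filtered
```

The model then only sees the fields it is supposed to format, instead of a raw data block from which it can pick the wrong value.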
Conclusion: on a task where a wrong fact has a cost (an audit that drives a decision, a published product page), local does not reach the unsupervised threshold. Human review remains mandatory — which cancels much of the gain.
The real dividing line: where does the error land?
Our harshest results concern one specific use: producing output that ships as-is — a report that drives a decision, a page that ends up indexed. There, a hallucination is fatal. But we also exercised local on the opposite use: the second opinion, sparring on an analysis, where the output is reviewed by a human and nothing is published.
The result flips. There, the hallucination risk is bounded: the fatal flaw disappears. Better still, the models did not just confirm the analysis — they surfaced angles we had not anticipated. In advisory mode, the cost of a false positive is a few seconds of triage; not a polluted, indexed page. An imperfect local model that produces one correct insight already pays off.
A real example: two local models in tandem
On a commercial-positioning review, a first local model (Gemma) proposed an angle we had not put at the center: rather than trying to replace the tool the technical prospect already runs himself, sell the offering as the complement to that tool.
A second model, DeepSeek, run as a stress test, surfaced an objection we had not considered: framed that way, the offering becomes a commodity an advanced user can reproduce cheaply. Hence a refined conclusion: lead with precisely what cannot be self-built, namely real immutability guarantees, verified restores and sovereignty.
This is no one-off: across several of our SEO audits, DeepSeek's explicit reasoning has revealed angles we had not considered — exactly the value of a second opinion, where any error is filtered by a human before any decision.
None of these outputs were published: they fed our thinking, and a positioning misstep was avoided before it could cost anything. The pattern is reusable — one model to generate angles, another to attack them, in sequence.
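The generate-then-attack sequence can be sketched as follows. This is a minimal illustration, not our tooling: each model is represented by a plain callable (prompt in, text out) that would in practice wrap a call to a locally served model, and the function name and prompt wording are assumptions.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical wrapper: prompt in, completion out

def second_opinion(analysis: str, generator: Model, critic: Model) -> dict[str, str]:
    """Run one model to propose angles, then a second one to attack them."""
    angles = generator(f"Propose alternative angles on this analysis:\n{analysis}")
    objections = critic(f"Attack these angles and list their weaknesses:\n{angles}")
    # Nothing is published: both outputs go to a human for triage.
    return {"angles": angles, "objections": objections}
```

Keeping the models behind callables makes it easy to swap the generator and the critic, or to test the sequence with stubs.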
So the dividing line is not "local vs hosted", but: "does the output ship as-is?"
- Shipped as-is (indexed content, client deliverable) → hosted model.
- Intercepted by a human (second opinion, brainstorming, triage) → local adds value, at zero marginal cost.
What we have not (yet) tested
Methodological honesty: our protocol tested the model "bare", in a single pass, without tools. That is the least favorable setting for local. Two architectures we have not yet explored could reshuffle the deck — and both share the property that makes local viable: the error is intercepted or corrected before it counts.
Agents (tool-use)
A local model equipped with tools — search, API access, code execution — can go fetch the fact instead of guessing it. Agent-oriented models (such as Hermes) could compensate, through tooling, for exactly the factual weakness that sinks single-pass generation.
Pre-processing in a cascade
A small local model, even of lower quality, can handle pre-processing at zero cost (extraction, classification, disambiguation, rough formatting), before escalating to a more capable model — a larger local one, or a hosted one — for the critical part only. You pay the expert only where it truly matters.
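As an untested sketch of that cascade: the three callables and the escalation rule are hypothetical placeholders, since we have not yet built or measured this architecture.

```python
from typing import Callable

def cascade(
    item: str,
    small_model: Callable[[str], str],
    expert_model: Callable[[str], str],
    is_critical: Callable[[str], bool],
) -> tuple[str, str]:
    """Pre-process with the small model; escalate only when the result is critical."""
    draft = small_model(item)
    if is_critical(draft):
        return expert_model(draft), "expert"
    return draft, "small"
```

The second element of the returned tuple records which tier produced the result, which makes it possible to measure how often the expert is actually needed.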
Our conclusions therefore hold for the most demanding case — a single model, in one pass, with no safety net. Architected as an agent or a cascade, local could regain real value. That is our next test step.
When local AI makes sense anyway
Two cases, and one imperative condition: the constrained pipeline.
- Absolute sovereignty: the data legally cannot leave your infrastructure.
- Massive volume at zero marginal cost, where even a residual error rate stays economically manageable after targeted review.
The pattern that works: anchor the facts, constrain the model
You never ask the local LLM to know the facts. You ask it to format facts already verified:
- Facts come from a deterministic source (business API, internal database, reference data). The model only disambiguates the entity, it does not invent its attributes.
- The LLM is used as a constrained rendering engine: "here are the verified facts, format them, add nothing".
- The context is pre-filtered on the application side, and the output validated server-side (syntax and consistency).
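The three points above can be sketched in Python. The prompt wording, the function names and the choice of years as the consistency check are assumptions made for illustration; a real pipeline would check every fact-bearing field, not just years.

```python
import re

# Four-digit years from 1900 to 2099, used here as one example of a checkable fact.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def build_prompt(facts: dict) -> str:
    """Constrained rendering prompt: the model formats verified facts and adds nothing."""
    lines = [f"- {key}: {value}" for key, value in facts.items()]
    return (
        "Here are the verified facts. Format them into a product description. "
        "Add nothing that is not listed.\n" + "\n".join(lines)
    )

def unanchored_years(output: str, facts: dict) -> set[str]:
    """Return the years found in the output that do not appear in the verified facts."""
    allowed = set(YEAR_RE.findall(" ".join(str(v) for v in facts.values())))
    return set(YEAR_RE.findall(output)) - allowed
```

A non-empty result from `unanchored_years` means the model introduced a date it was never given, and the output should be rejected or sent to review rather than published.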
Cost: a hosted model remains unbeatable
When sovereignty is not at stake, the economic trade-off clearly leans toward hosted. On our SEO generation profile (~3,500 input tokens of which ~3,100 static, ~1,200 output), with prompt caching:
| Model | Per item (with cache) | 1,000 items |
|---|---|---|
| Claude Haiku 4.5 | ~€0.006 | ~€6.5 |
| Claude Sonnet 4.6 | ~€0.018 to 0.02 | ~€18 |
| Claude Opus 4.7 | ~€0.09 | ~€92 |
Sonnet 4.6 is the reliability / cost sweet spot; Haiku 4.5 holds the floor. Cost is dominated by the output, hence the value of caching only the stable system prompt. Above all, these cents must be weighed against the local hardware investment: a GPU server capable of running these models costs several thousand euros to amortize, plus electricity and operations. Against calls billed at €0.02, that investment only "pays back" at very high volume, if at all. At these rates, local is therefore not justified by economics alone: it is justified by sovereignty.
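The break-even arithmetic is simple enough to write down. In this sketch the €5,000 server price is a hypothetical figure standing in for "several thousand euros", and the local per-item cost (electricity, operations) defaults to zero, which flatters local.

```python
def break_even_items(hardware_eur: float, hosted_per_item_eur: float, local_per_item_eur: float = 0.0) -> float:
    """Number of generated items needed before local hardware pays for itself."""
    saving_per_item = hosted_per_item_eur - local_per_item_eur
    if saving_per_item <= 0:
        return float("inf")  # local never pays back if each item costs as much as hosted
    return hardware_eur / saving_per_item

# Hypothetical €5,000 server versus hosted calls at about €0.02 per item.
items_needed = break_even_items(5000, 0.02)
```

Under those assumptions the server pays for itself only after roughly 250,000 items, before counting the human review that local output still requires.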
FAQ
Can a local LLM replace a hosted model in production?
Not autonomously on a fact-bound task. In our May 2026 tests (6 models via Ollama), none reached the unsupervised threshold. Writing and formatting are there, but factual reliability collapses as soon as you leave the best model for each task. The flaw is not the form — it is fact hallucination.
When does local AI make sense anyway?
Under two constraints: data sovereignty (data must not be sent to a third party) or massive volume at near-zero marginal cost. And even then, only in a constrained pipeline: facts are anchored by a deterministic source and the LLM is reduced to a rendering engine that invents nothing.
Is local AI useless for fact-bound tasks?
No. Our negative verdict only applies to producing content that ships as-is. As a second opinion or sparring partner — when a human reviews before publication — local AI is usable and adds value at zero marginal cost: in our tests it even surfaced angles we had not anticipated. The dividing line is: does the output ship as-is, or is it intercepted by a human?
Which local model was the least bad?
For code audit, Gemma 26B (128k context) is the only one we judged usable, with mandatory human review. For SEO generation, no local model is usable as-is: Qwen3-coder produces the best format but with wrong facts.
Why do local models hallucinate so much on facts?
Two compounding factors. First, quantization (compressing the model to fit in VRAM) degrades precision and introduces typos. Second, a poorly controlled context: an oversized data block saturates the window and contaminates the output. Pre-filtering the context on the application side fixes part of the problem.
Key takeaways
- The format holds, factual reliability collapses — the flaw is fact hallucination, not the form.
- The real question is not "local vs hosted" but "does the output ship as-is?".
- For published content, no local model reaches the unsupervised threshold (Gemma 26B remains the exception for audit review) → hosted model.
- As a human-reviewed second opinion, local becomes usable at zero marginal cost — it even surfaced angles we had not anticipated.
- Agents (tool-use) and pre-processing cascades are not yet tested; they could restore local's appeal.
- With no sovereignty at stake, hosted (Sonnet 4.6 / Haiku 4.5) stays the best reliability / cost / risk ratio in production.
A sovereign AI or large-scale generation project?
We help you arbitrate between local and hosted, design a reliable constrained pipeline, and anchor the facts to avoid hallucinations. Describe your case and we will reply concretely.