Monday, December 22, 2025

The Geometry of Laziness: What Angles Reveal About AI Hallucinations


This is the story of a failure that became something interesting.

For months, I, along with hundreds of others, tried to build a neural network that could learn to detect when AI systems hallucinate: when they confidently generate plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is straightforward: train a model to recognize the subtle signatures of fabrication in how language models respond.

But it didn't work. The learned detectors I designed collapsed. They found shortcuts. They failed on any data distribution slightly different from training. Every approach I tried hit the same wall.

So I gave up on "learning" and started to think: why not turn this into a geometry problem? That is what I did.

Backing Up

Before I get into the geometry, let me explain what we're dealing with, because "hallucination" has become one of those terms that means everything and nothing. Here is the specific situation. You have a Retrieval-Augmented Generation system, a RAG system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that is supposed to be grounded in those documents.

  • The promise: answers backed by sources.
  • The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.

This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn't bother with retrieval. You pay the computational and latency cost of retrieval precisely because you want grounded answers.

So: can we tell when grounding has failed?

Sentences on a Sphere

LLMs represent text as vectors. A sentence becomes a point in high-dimensional space: 768 embedding dimensions for the first models, though the exact number doesn't matter much (DeepSeek-V3 and R1 have an embedding dimension of 7,168). These embedding vectors are normalized. Every sentence, regardless of length or complexity, gets projected onto a unit sphere.

Figure 1: Semantic geometry of grounding. On the embedding sphere S^{d-1}, valid responses (blue) depart from the question q toward the retrieved context c; hallucinated responses (red) stay close to the question. SGI captures this as a ratio of angular distances: responses with SGI > 1 traveled toward their sources. Image by author.

Once we think in this projection, we can work with angles and distances on the sphere. For example, we expect similar sentences to cluster together. "The cat sat on the mat" and "A feline rested on the rug" end up near each other. Unrelated sentences end up far apart. This clustering is how embedding models are trained.
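As a small illustration of that clustering, here is a sketch (not the author's code) that embeds three sentences and compares their angular distances on the unit sphere. It assumes the sentence-transformers package, and the model name all-MiniLM-L6-v2 is purely an example:

```python
# Sketch: paraphrases sit at a small angle on the embedding sphere,
# unrelated sentences at a large one. Model choice is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def angle(u, v):
    """Angular (geodesic) distance between two vectors after normalization."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

a, b, c = model.encode([
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Quarterly revenue exceeded analyst expectations",
])

print(angle(a, b))  # small: the paraphrases cluster together
print(angle(a, c))  # large: the unrelated sentence sits far away
```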

So now consider what happens in RAG. We have three pieces of text (Figure 1):

  • The question, q (one point on the sphere)
  • The retrieved context, c (another point)
  • The generated response, r (a third point)

Three points on a sphere form a triangle. And triangles have geometry (Figure 2).

The Laziness Hypothesis

When a model uses the retrieved context, what should happen? The response should depart from the question and move toward the context. It should pick up the vocabulary, framing, and concepts of the source material. Geometrically, this means the response should be closer to the context than to the question (Figure 1).

But when a model hallucinates, ignoring the context and generating something from its own parametric knowledge, the response stays in the question's neighborhood. It continues the question's semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn't travel. It stays home. Figure 1 illustrates the laziness signature. Question q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the question. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?

Semantic Grounding Index

To measure this, I defined a ratio of angular distances:

SGI = θ(q, r) / θ(c, r)

where θ(·, ·) denotes the angular (geodesic) distance between two points on the unit sphere. I called it the Semantic Grounding Index, or SGI.

If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed close to the question, meaning the model did not find a way to explore the answer space and remained too close to the question (a kind of safety state). SGI is just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.

Figure 2: Geometric interpretation of SGI on the embedding hypersphere. Valid responses (blue) depart angularly toward the context; hallucinations (red) remain near the question, the semantic laziness signature. Image by author.
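Here is a minimal sketch of how SGI can be computed with an off-the-shelf embedding model, assuming the ratio definition given above. It is an illustration rather than the paper's reference implementation, and the model name is again just an example:

```python
# Minimal SGI sketch: SGI = theta(question, response) / theta(context, response).
# Illustrative only; not the paper's reference implementation.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

def _angle(u, v):
    """Angular distance between two vectors on the unit sphere."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def compute_sgi(question: str, context: str, response: str) -> float:
    """SGI > 1: the response traveled toward the context.
    SGI < 1: the response stayed near the question (semantic laziness)."""
    q, c, r = _model.encode([question, context, response])
    return _angle(q, r) / _angle(c, r)
```

In practice you would batch the encoding over a dataset, but the core really is just two angles and a division.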

Does It Actually Work?

Simple ideas need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know the ground truth: which responses are genuine and which are hallucinated.

Figure 3: Five embedding models, one pattern. Solid curves show valid responses; dashed curves show hallucinations. The distributions separate consistently across all models, with hallucinated responses clustering below SGI = 1 (the "stayed home" threshold). The models were trained by different organizations on different data, yet they agree on which responses traveled toward their sources. Image by author.

I ran the same analysis with five completely different embedding models. Different architectures, different training procedures, different organizations: Sentence-Transformers, Microsoft, Alibaba, BAAI. If the signal were an artifact of one particular embedding space, these models would disagree. They didn't. The average correlation across models was r = 0.85 (ranging from 0.80 to 0.95).

Figure 4: Correlation between the different models and architectures used in the experiment. Image by author.
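For readers who want to run a similar sanity check on their own data, a sketch of the cross-model comparison might look like the following. The two model names are examples from the families mentioned above, not necessarily the exact five used in the experiment:

```python
# Sketch: do two unrelated embedding models agree on per-sample SGI?
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer

def sgi_scores(model_name, samples):
    """SGI for each (question, context, response) triple under one embedding model."""
    model = SentenceTransformer(model_name)
    scores = []
    for question, context, response in samples:
        q, c, r = model.encode([question, context, response], normalize_embeddings=True)
        theta_qr = np.arccos(np.clip(np.dot(q, r), -1.0, 1.0))
        theta_cr = np.arccos(np.clip(np.dot(c, r), -1.0, 1.0))
        scores.append(theta_qr / theta_cr)
    return np.array(scores)

# samples = [(question, context, response), ...] from your RAG logs or HaluEval
# s1 = sgi_scores("sentence-transformers/all-mpnet-base-v2", samples)
# s2 = sgi_scores("BAAI/bge-small-en-v1.5", samples)
# r, _ = pearsonr(s1, s2)   # high r means the signal is not an artifact of one space
```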

When the Math Predicted Something

Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled: the triangle inequality. You probably remember it from school: the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, though the formula looks slightly different.

The spherical triangle inequality constrains admissible SGI values. Image by author.

If the question and context are very close together, semantically similar, then there isn't much "room" for the response to differentiate between them. The geometry forces the angles to be similar regardless of response quality. SGI values get squeezed toward 1. But when the question and context are far apart on the sphere? Now there is geometric room for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.
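To make the squeezing argument concrete (a sketch, assuming the ratio form SGI = θ(q,r)/θ(c,r) given above): applying the spherical triangle inequality to the triangle q, c, r in both directions and dividing by θ(c,r) gives

```latex
\theta(q,r) \le \theta(q,c) + \theta(c,r), \qquad
\theta(c,r) \le \theta(q,c) + \theta(q,r)
\;\Longrightarrow\;
1 - \frac{\theta(q,c)}{\theta(c,r)} \;\le\; \mathrm{SGI} \;\le\; 1 + \frac{\theta(q,c)}{\theta(c,r)}.
```

When θ(q,c) is small, the admissible interval collapses around 1 regardless of response quality; as θ(q,c) grows, the interval widens and SGI has room to separate grounded from lazy responses.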

This implies a prediction:

SGI's discriminative power should increase as question-context separation increases.

The results confirm this prediction: a monotonic increase, exactly as the triangle inequality predicted.

Question-Context Separation    Effect Size (d)    AUC
Low (similar)                  0.61               0.72
Medium                         0.90               0.77
High (different)               1.27               0.83
Table 1: SGI discrimination improves as question-context separation increases.
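A minimal sketch of this kind of stratified check (not the author's analysis code; variable names are hypothetical) bins samples by the question-context angle and computes the AUC of SGI within each bin:

```python
# Sketch: does SGI separate valid from hallucinated responses better
# when the question and context are farther apart on the sphere?
import numpy as np
from sklearn.metrics import roc_auc_score

def stratified_auc(theta_qc, sgi, is_valid, n_bins=3):
    """theta_qc: question-context angle per sample; sgi: SGI per sample;
    is_valid: 1 for grounded responses, 0 for hallucinated (ground truth)."""
    edges = np.quantile(theta_qc, np.linspace(0, 1, n_bins + 1))
    results = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta_qc >= lo) & (theta_qc <= hi)
        if len(np.unique(is_valid[mask])) < 2:   # AUC needs both classes present
            continue
        results.append((float(lo), float(hi), roc_auc_score(is_valid[mask], sgi[mask])))
    return results   # expect AUC to rise from the low- to the high-separation bin
```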

This distinction carries epistemic weight. Observing a pattern in data after the fact gives weak evidence; such behaviour may reflect noise or analyst degrees of freedom rather than genuine structure. The stronger test is prediction: deriving what should happen from first principles before examining the data. The triangle inequality implied a specific relationship between θ(q,c) and discriminative power. The empirical results confirmed it.

Where It Doesn't Work

TruthfulQA is a benchmark designed to test factual accuracy, with questions like "What causes the seasons?" paired with correct answers ("Earth's axial tilt") and common misconceptions ("Distance from the Sun"). I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.

Angular geometry captures topical similarity. "The seasons are caused by axial tilt" and "The seasons are caused by solar distance" are about the same topic. They occupy nearby regions of the semantic sphere. One is true and one is false, but both engage with the astronomical content of the question.

SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It's a scope boundary. And knowing your scope boundaries is arguably more important than knowing where your method works.

What This Means in Practice

If you're building RAG systems, SGI correctly ranks hallucinated responses below valid ones about 80% of the time, without any training or fine-tuning. A minimal flagging sketch follows the list below.

  • If your retrieval system returns documents that are semantically very close to the questions, SGI will have limited discriminative power. Not because it is broken, but because the geometry does not allow differentiation. Consider whether your retrieval is actually adding information or just echoing the query.
  • Effect sizes roughly doubled for long-form responses compared to short ones. This is precisely where human verification is most expensive; reading a five-paragraph response takes time. Automated flagging is most valuable exactly where SGI works best.
  • SGI detects disengagement. Natural language inference detects contradiction. Uncertainty quantification detects model confidence. These measure different things. A response can be topically engaged but logically inconsistent, or confidently wrong, or lazily correct by accident. Defense in depth.
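As promised above, here is a minimal sketch of what flagging might look like in a pipeline. It reuses the compute_sgi sketch from earlier; the 1.0 threshold is an illustrative assumption, not a calibrated value:

```python
# Sketch: route low-SGI responses to human review or to a secondary check.
def review_queue(samples, threshold=1.0):
    """samples: iterable of (question, context, response) triples."""
    flagged = []
    for question, context, response in samples:
        sgi = compute_sgi(question, context, response)   # from the earlier sketch
        if sgi < threshold:   # stayed near the question: possible disengagement
            flagged.append((question, response, sgi))
    return flagged   # hand these to reviewers, NLI checks, or uncertainty estimates
```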

The Scientific Question

I have a hypothesis about why semantic laziness happens. I want to be honest that it is speculation; I haven't established the causal mechanism.

Language models are autoregressive predictors. They generate text token by token, each choice conditioned on everything before it. The question provides strong conditioning: familiar vocabulary, established framing, a semantic neighborhood the model knows well.

The retrieved context represents a departure from that neighborhood. Using it well requires confident bridging: taking concepts from one semantic region and integrating them into a response that started in another.

When an LLM is uncertain about how to bridge, the path of least resistance is to stay home. Models generate something fluent that continues the question's framing without venturing into unfamiliar territory, because that is statistically safe. As a consequence, the model becomes semantically lazy.

If this is right, SGI should correlate with internal model uncertainty: attention patterns, logit entropy, that sort of thing. Low-SGI responses should show signatures of hesitation. That's a future experiment.

Takeaways

  • First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
  • Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it: that's how you know you're measuring something real. The stratified analysis wasn't the most impressive number in this work, but it was the most important.
  • Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.

Honest Conclusion

I'm not sure whether semantic laziness is a deep truth about how language models fail, or just a useful approximation that happens to work for current architectures. The history of machine learning is littered with insights that seemed fundamental and turned out to be contingent.

But for now, we have a geometric signature of disengagement: a practical "hallucination" detector. It is consistent across embedding models. It is predictable from mathematical first principles. And it is cheap to compute.

That feels like progress.

Note: The scientific paper with full methodology, statistical analyses, and reproducibility details is available at https://arxiv.org/abs/2512.13771.

You can cite this work in BibTeX as:

@misc{marín2025semanticgroundingindexgeometric,
title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
author={Javier Marín},
year={2025},
eprint={2512.13771},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.13771},
}

Javier Marín is an independent AI researcher based in Madrid, working on reliability analysis for production AI systems. He tries to be honest about what he doesn't know. You can contact Javier at [email protected]. Any contribution is welcome!


