Every so often, somebody claims they've invented a revolutionary AI architecture. But when you see the same mathematical pattern, selective amplification plus normalization, emerge independently from gradient descent, evolution, and chemical reactions, you realize we didn't invent the attention mechanism with the Transformer architecture. We rediscovered fundamental optimization principles that govern how any system processes information under energy constraints. Understanding attention as amplification rather than selection suggests specific architectural improvements and explains why current approaches work. Eight minutes here gives you a mental model that could guide better system design for the next decade.
When Vaswani and colleagues published "Attention Is All You Need" in 2017, they thought they were proposing something revolutionary [1]. Their transformer architecture abandoned recurrent networks entirely, relying instead on attention mechanisms to process entire text sequences simultaneously. The mathematical core was simple: compute compatibility scores between positions, convert them to weights, and use those weights to selectively combine information.
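In the notation of [1], that core is the scaled dot-product attention:

\[
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension.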
But this pattern appears to emerge independently wherever information-processing systems face resource constraints and complexity. Not because there is some universal law of attention, but because certain mathematical structures seem to represent convergent solutions to fundamental optimization problems.
We may be looking at one of those rare cases where biology, chemistry, and AI have converged on similar computational strategies, not through shared mechanisms, but through shared mathematical constraints.
The 500-Million-Year Experiment
The biological evidence for attention-like mechanisms runs remarkably deep. The optic tectum/superior colliculus system, which implements spatial attention through competitive inhibition, shows extraordinary evolutionary conservation across vertebrates [2]. From fish to humans, this neural architecture maintains structural and functional consistency across 500+ million years of evolution.
But perhaps more intriguing is the convergent evolution.
Independent lineages developed attention-like selective processing multiple times: compound eye systems in insects [3], camera eyes in cephalopods [4], hierarchical visual processing in birds [5], and cortical attention networks in mammals [2]. Despite vastly different neural architectures and evolutionary histories, these systems converged on similar solutions for selective information processing.
This raises a compelling question: are we seeing evidence of fundamental computational constraints that govern how complex systems must process information under resource limitations?
Even simple organisms suggest this pattern scales remarkably well. C. elegans, with only 302 neurons, demonstrates sophisticated attention-like behaviors in food seeking and predator avoidance [6]. Plants exhibit attention-like selective resource allocation, directing growth responses toward relevant environmental stimuli while ignoring others [7].
The evolutionary conservation is striking, but we should be cautious about direct equivalences. Biological attention involves specific neural circuits shaped by evolutionary pressures quite different from the optimization landscapes that produce AI architectures.
Attention as Amplification: Reframing the Mechanism
Recent theoretical work has fundamentally challenged how we understand attention mechanisms. Philosophers Peter Fazekas and Bence Nanay argue that the traditional "filter" and "spotlight" metaphors fundamentally mischaracterize what attention actually does [8].
They claim that attention does not select inputs; it amplifies presynaptic signals in a non-stimulus-driven way, and that amplification interacts with built-in normalization mechanisms to create the appearance of selection. The mathematical structure they identify is the following:
- Amplification: increase the strength of certain input signals
- Normalization: built-in mechanisms (such as divisive normalization) process these amplified signals
- Apparent selection: the combination creates what looks like selective filtering
This framework explains seemingly contradictory findings in neuroscience. Effects such as increased firing rates, receptive field shrinkage, and surround suppression all emerge from the same underlying mechanism: amplification interacting with normalization computations that operate independently of attention.
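A minimal numerical sketch of that structure, with made-up gain values and a single global normalization pool (both are illustrative assumptions, not fitted to any neural data):

```python
import numpy as np

def divisive_normalization(responses, sigma=0.5):
    # Each unit's response is divided by the summed activity of the whole pool
    return responses / (sigma + responses.sum())

drive = np.ones(5)                           # equal stimulus-driven input to five units
gain = np.array([1.0, 1.0, 2.0, 1.0, 1.0])   # attention as non-stimulus-driven amplification of unit 2

baseline = divisive_normalization(drive)
attended = divisive_normalization(gain * drive)

print(baseline.round(3))  # uniform responses
print(attended.round(3))  # unit 2 is boosted, its neighbors are suppressed
```

Nothing here selects anything: amplifying one unit and renormalizing the pool is enough to produce both the enhancement and the suppression that look like selection.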
Fazekas and Nanay focused specifically on biological neural systems. Whether this amplification framework extends to other domains remains an open question, but the mathematical parallels are suggestive.
Chemical Computers and Molecular Amplification
Perhaps the most surprising evidence comes from chemical systems. Baltussen and colleagues demonstrated that the formose reaction, a network of autocatalytic reactions involving formaldehyde, dihydroxyacetone, and metal catalysts, can perform sophisticated computation [9].

The system exhibits selective amplification across up to 10⁶ different molecular species, achieving > 95% accuracy on nonlinear classification tasks. Different molecular species respond differentially to input patterns, creating what looks like chemical attention through selective amplification. Remarkably, the system operates on timescales (500 ms to 60 minutes) that overlap with biological and artificial attention mechanisms.
But the chemical system lacks the hierarchical control mechanisms and learning dynamics that characterize biological attention. Yet the mathematical structure, selective amplification creating apparent selectivity, looks strikingly similar. Programmable autocatalytic networks provide further evidence. Metal ions such as Nd³⁺ create biphasic control mechanisms, both accelerating and inhibiting reactions depending on concentration [10]. This yields controllable selective amplification that implements Boolean logic functions and polynomial mappings through purely chemical processes.
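The general recipe at work is reservoir computing: a fixed nonlinear dynamical system expands the input, and only a simple linear readout is trained on top. A toy sketch of that recipe (a random tanh expansion stands in for the chemistry here; this is not a model of the formose network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "reservoir": a fixed random nonlinear expansion of the 2-D input,
# standing in for the many interacting molecular species.
W_in = rng.normal(size=(2, 200))

def reservoir(x):
    return np.tanh(x @ W_in)  # fixed, untrained nonlinear responses

# XOR: not linearly separable in the raw inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Train only a linear readout on the reservoir states (least squares)
states = reservoir(X)
readout, *_ = np.linalg.lstsq(states, y, rcond=None)

preds = (states @ readout > 0.5).astype(int)
print(preds)  # expected: [0 1 1 0], a nonlinear task solved by a fixed nonlinear substrate
```

The substrate does no learning at all; all of the adaptation sits in the readout, which is roughly how the chemical reservoir experiments are set up.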
Information-Theoretic Constraints and Universal Optimization
The convergence across these different domains may reflect deeper mathematical necessities. Information bottleneck theory provides a formal framework: any system with limited processing capacity must solve the optimization problem of minimizing information retention while preserving task-relevant details [11].
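Formally, the information bottleneck objective for a compressed representation T of an input X with respect to a task variable Y is

\[
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
\]

where I(·;·) denotes mutual information and β sets the trade-off between compression and preserved task-relevant information [11].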
Jan Karbowski's work on information thermodynamics shows universal energy constraints on information processing [12]. The fundamental thermodynamic bound on computation creates selection pressure for efficient selective processing mechanisms across all substrates capable of computation:

In schematic form, the bound ties the two quantities together as

σ ≥ k_B ln 2 · ΔI

where σ is the entropy production rate and ΔI is the information processing capacity. Information processing costs energy, so efficient attention mechanisms carry a survival and performance advantage.
Whenever any system, whether a brain, a computer, or even a chemical reaction network, processes information, it must dissipate energy as waste heat. The more information you process, the more energy you must waste. Since attention mechanisms process information (deciding what to focus on), they are subject to this energy tax.
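For scale, Landauer's bound gives the minimum heat dissipated when a single bit of information is erased at room temperature (T ≈ 300 K):

\[
E_{\min} = k_B T \ln 2 \approx 1.38 \times 10^{-23}\,\mathrm{J/K} \times 300\,\mathrm{K} \times 0.693 \approx 2.9 \times 10^{-21}\,\mathrm{J\ per\ bit}
\]

Real hardware and real neurons dissipate many orders of magnitude more than this floor, which is exactly why selective processing pays.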
This creates universal pressure for efficient architectures, whether you are evolution designing a brain, chemistry organizing reactions, or gradient descent training transformers.
Neural networks operating at criticality, the edge between order and chaos, maximize information processing capacity while maintaining stability [13]. Empirical measurements show that conscious attention in humans occurs precisely at these critical transitions [14]. Transformer networks exhibit similar phase transitions during training, organizing attention weights near critical points where information processing is optimized [15].
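One crude way to look for such signatures in a trained transformer is to check whether a head's attention weights are heavy-tailed. A rough diagnostic sketch (the log-log fit below is illustrative, not a rigorous criticality test, and `attn` stands for whatever attention matrix you extract from your own model):

```python
import numpy as np

def power_law_exponent(attn, eps=1e-12):
    """Rough power-law exponent from the rank-size plot of attention weights.

    Returns the slope of log(weight) vs. log(rank); a good linear fit with a
    steep negative slope hints at heavy-tailed, scale-free structure.
    """
    w = np.sort(attn.ravel())[::-1] + eps
    ranks = np.arange(1, w.size + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(w), 1)
    return slope

# Illustrative use with random sparse weights; a real test would use trained attention maps
attn = np.random.dirichlet(np.ones(64) * 0.1, size=64)
print(power_law_exponent(attn))
```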
Taken together, this suggests that attention-like mechanisms may emerge wherever systems face the fundamental trade-off between processing capacity and energy efficiency under resource constraints.
Convergent Mathematics, Not Universal Mechanisms
The evidence points toward a preliminary conclusion. Rather than discovering universal mechanisms, we may be witnessing convergent mathematical solutions to similar optimization problems.

The mathematical structure, selective amplification combined with normalization, appears across all of these domains, but the underlying mechanisms and constraints differ substantially.
For transformer architectures, this reframing suggests specific insights (a NumPy sketch of all three steps follows the list):
- Q·K computation implements amplification.

The dot product Q·K^T computes semantic compatibility between query and key representations, acting as a learned amplification function in which high compatibility scores amplify signal pathways. The scaling factor √d_k prevents saturation in high-dimensional spaces, maintaining gradient flow.
- Softmax normalization creates winner-take-all dynamics.

Softmax implements competitive normalization through divisive renormalization. The exponential term amplifies differences (winner-take-all dynamics) while the sum in the denominator ensures Σw_ij = 1. Mathematically, this function is equivalent to a divisive normalization with an exponential gain.
- Weighted V combination produces apparent selectivity.

In this combination there is no explicit selection operator; it is simply a linear combination of value vectors. The apparent selectivity emerges from the sparsity pattern induced by the softmax normalization. High attention weights create effective gating without explicit gating mechanisms.
Composing amplification with softmax induces winner-take-all dynamics on the value space.
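A minimal sketch of that decomposition in plain NumPy (single head, no masking or batching, and random placeholder matrices rather than trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    # 1. Amplification: learned compatibility scores, scaled to avoid saturation
    scores = Q @ K.T / np.sqrt(d_k)
    # 2. Normalization: softmax = exponential amplification + divisive normalization
    weights = softmax(scores, axis=-1)        # each row sums to 1
    # 3. Apparent selection: a plain linear combination of value vectors
    return weights @ V, weights

# Toy example: 4 positions, 8-dimensional representations (random placeholders)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = attention(Q, K, V)
print(weights.round(2))  # each row sums to 1; higher-scoring positions get amplified weight
```

Note that step 3 never selects anything; whatever looks like selection comes entirely from the sparsity of the softmax weights.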


Implications for AI Development
Understanding attention as amplification plus normalization rather than selection offers several practical insights for AI architecture design:
- Separating amplification and normalization: current transformers conflate these mechanisms. We might explore architectures that decouple them, allowing more flexible normalization strategies beyond softmax [16] (see the sketch after this list).
- Non-content-based amplification: biological attention includes "not-stimulus-driven" amplification. Current transformer attention is purely content-based (Q·K compatibility). We could investigate learned positional biases, task-specific amplification patterns, or meta-learned amplification strategies.
- Local normalization pools: biology uses "pools of surrounding neurons" for normalization rather than global normalization. This suggests exploring local attention neighborhoods, hierarchical normalization across layers, or dynamic selection of normalization pools.
- Critical dynamics: the evidence for attention operating near critical points suggests that effective attention mechanisms should exhibit specific statistical signatures: power-law distributions, avalanche dynamics, and critical fluctuations [17].
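As a purely illustrative sketch of the first and third ideas (speculative, not an architecture from the literature; the gain term and the local pool size are assumptions), scores could be amplified by a learned, non-content gain and then normalized over a local neighborhood instead of the full sequence:

```python
import numpy as np

def local_divisive_attention(Q, K, V, gain, pool_radius=2, sigma=1e-6):
    """Speculative variant in which amplification and normalization are decoupled.

    gain: an assumed, non-content-based amplification per key position.
    pool_radius: each query is normalized only over a local pool of keys,
    rather than the global softmax denominator.
    """
    d_k = K.shape[-1]
    scores = np.exp(Q @ K.T / np.sqrt(d_k)) * gain   # amplification step
    n = scores.shape[0]
    weights = np.zeros_like(scores)
    for i in range(n):
        lo, hi = max(0, i - pool_radius), min(n, i + pool_radius + 1)
        pool = scores[i, lo:hi].sum() + sigma        # local normalization pool
        weights[i, lo:hi] = scores[i, lo:hi] / pool  # divisive normalization
    return weights @ V

# Toy usage with random placeholders
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
gain = np.ones(6)          # uniform here; a trained model would learn this
out = local_divisive_attention(Q, K, V, gain)
print(out.shape)           # (6, 8)
```

Whether anything like this helps in practice is an empirical question; the point is only that the amplification and normalization stages become separately tunable.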
Open Questions and Future Directions
Several fundamental questions remain:
- How deep do the mathematical parallels extend? Are we seeing true computational equivalence or superficial similarity?
- What can chemical reservoir computing teach us about minimal attention architectures? If simple chemical networks can achieve attention-like computation, what does this suggest about the complexity requirements for AI attention?
- Do information-theoretic constraints predict the evolution of attention in scaling AI systems? As models grow larger and face more complex environments, will attention mechanisms naturally evolve toward these universal optimization principles?
- How can we integrate biological insights about hierarchical control and adaptation into AI architectures? The gap between static transformer attention and dynamic biological attention remains substantial.
Conclusion
The story of attention turns out to be less about invention and more about rediscovery. Whether in the formose reaction's chemical networks, the superior colliculus's neural circuits, or transformer architectures' learned weights, we see variations on a mathematical theme: selective amplification combined with normalization to create apparent selectivity.
This doesn't diminish the achievement of transformer architectures; if anything, it suggests they represent a fundamental computational insight that transcends their specific implementation. The mathematical constraints that govern efficient information processing under resource limitations appear to push different systems toward similar solutions.
As we continue scaling AI systems, understanding these deeper mathematical principles may prove more valuable than mimicking biological mechanisms directly. The convergent evolution of attention-like processing suggests we are working with fundamental computational constraints, not arbitrary engineering choices.
Nature spent 500 million years exploring these optimization landscapes through evolution. We rediscovered similar solutions through gradient descent in just a few years. The question now is whether understanding these mathematical principles can guide us toward even better solutions that transcend both biological and current artificial approaches.
Final note
The real test: if somebody reads this and designs a better attention mechanism as a result, we've created value.
Thanks for reading and sharing!
Javier Marin
Applied AI Advisor | Production AI Systems + Regulatory Compliance
[email protected]
References
[1] Vaswani, A., et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.
[2] Knudsen, E. I. (2007). Fundamental components of attention. Annual Review of Neuroscience, 30, 57–78.
[3] Nityananda, V., et al. (2016). Attention-like processes in insects. Proceedings of the Royal Society B, 283(1842), 20161986.
[4] Cartron, L., et al. (2013). Visual object recognition in cuttlefish. Animal Cognition, 16(3), 391–401.
[5] Wylie, D. R., & Crowder, N. A. (2014). Avian models for 3D scene analysis. Proceedings of the IEEE, 102(5), 704–717.
[6] Jang, H., et al. (2012). Neuromodulatory state and sex specify alternative behaviors through antagonistic synaptic pathways in C. elegans. Neuron, 75(4), 585–592.
[7] Trewavas, A. (2009). Plant behaviour and intelligence. Plant, Cell & Environment, 32(6), 606–616.
[8] Fazekas, P., & Nanay, B. (2021). Attention is amplification, not selection. British Journal for the Philosophy of Science, 72(1), 299–324.
[9] Baltussen, M. G., et al. (2024). Chemical reservoir computation in a self-organizing reaction network. Nature, 631(8021), 549–555.
[10] Kriukov, D. V., et al. (2024). Exploring the programmability of autocatalytic chemical reaction networks. Nature Communications, 15(1), 8649.
[11] Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406.
[12] Karbowski, J. (2024). Information thermodynamics: From physics to neuroscience. Entropy, 26(9), 779.
[13] Beggs, J. M., & Plenz, D. (2003). Neuronal avalanches in neocortical circuits. Journal of Neuroscience, 23(35), 11167–11177.
[14] Freeman, W. J. (2008). Neurodynamics: An exploration in mesoscopic brain dynamics. Springer-Verlag.
[15] Gao, J., et al. (2016). Universal resilience patterns in complex networks. Nature, 530(7590), 307–312.
[16] Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of attention. Neuron, 61(2), 168–185.
[17] Shew, W. L., et al. (2009). Neuronal avalanches imply maximum dynamic range in cortical networks at criticality. Journal of Neuroscience, 29(49), 15595–15600.
