Google's Nested Learning: "Attention Is All You Need V2"

Google just dropped what could be the sequel to the foundational “Attention is All You Need” paper. This research could solve AI’s most stubborn challenge: catastrophic forgetting.

When AI models learn something new, they tend to forget much of what they learned before. Humans don’t work this way, and now Google Research has a proposed solution.

What Is Nested Learning?

Nested Learning is a new machine learning paradigm that treats models as a system of interconnected optimization problems running at different speeds—just like how our brain processes information.

The Core Problem

Current LLMs don’t learn from experiences; they remain limited to what they learned during training. They can’t learn or improve over time without losing previous knowledge.

As research shows, when neural networks are trained sequentially on multiple tasks, weights important for Task A are changed to meet objectives of Task B—causing abrupt knowledge loss.
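The effect is easy to reproduce in miniature. The sketch below is a toy illustration (not from the paper): a one-parameter linear model is trained on Task A, then on Task B, and Task B’s training wipes out Task A performance.

```python
import numpy as np

def train(w, xs, ys, lr=0.1, steps=200):
    """Gradient descent on mean squared error for the 1-D linear model y = w * x."""
    for _ in range(steps):
        grad = np.mean(2 * (w * xs - ys) * xs)
        w -= lr * grad
    return w

def loss(w, xs, ys):
    return np.mean((w * xs - ys) ** 2)

xs = np.linspace(-1.0, 1.0, 32)
ys_a = 2.0 * xs     # Task A: y = 2x
ys_b = -1.0 * xs    # Task B: y = -x (conflicting objective)

w = 0.0
w = train(w, xs, ys_a)              # learn Task A; w converges near 2
loss_a_before = loss(w, xs, ys_a)   # Task A loss: essentially zero

w = train(w, xs, ys_b)              # learn Task B; w is dragged toward -1
loss_a_after = loss(w, xs, ys_a)    # Task A loss: now large -- forgetting
```

Because both tasks compete for the same weight, optimizing Task B destroys the solution to Task A; with separate or multi-tempo parameters this conflict can be softened.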

How It Works: The Brain-Inspired Approach

Nested Learning changes this by viewing the model’s architecture and training algorithm as the same thing—just different “levels” of optimization.

| Traditional AI | Nested Learning |
| --- | --- |
| Binary memory (short/long-term) | Spectrum of memory modules |
| Static after training | Continuously self-modifying |
| New learning overwrites old | Multi-tempo updates preserve both |
| Single optimization process | Nested, multi-level optimization |

Like the human brain, which runs fast circuits for immediate processing and slower ones for consolidating patterns, Nested Learning creates different update frequencies for different knowledge types.
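One way to picture “different update frequencies” is parameter groups stepped at different rates. The sketch below is our own illustration with invented names (`SLOW_PERIOD`, `fast`, `slow`), not the paper’s actual algorithm: a fast group updates every step, while a slow group only periodically consolidates what the fast group has learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter groups: a "fast" group updated every step and a
# "slow" group updated only every SLOW_PERIOD steps -- a crude stand-in
# for Nested Learning's multi-tempo levels.
SLOW_PERIOD = 8
fast = np.zeros(4)
slow = np.zeros(4)
fast_updates = slow_updates = 0

for step in range(1, 65):
    grad = rng.standard_normal(4)        # placeholder gradient
    fast -= 0.1 * grad                   # high-frequency level: every step
    fast_updates += 1
    if step % SLOW_PERIOD == 0:          # low-frequency level: consolidate
        slow = 0.9 * slow + 0.1 * fast   # slowly absorb the fast weights
        slow_updates += 1

print(fast_updates, slow_updates)  # 64 fast updates, 8 slow updates
```

Because the slow level changes rarely and only blends in a summary of the fast level, knowledge stored there is insulated from rapid, task-specific updates.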

Meet Hope: The Proof of Concept

The paper introduces Hope, a proof-of-concept architecture that demonstrates this approach:

  • Outperforms modern recurrent models on language modeling tasks
  • Handles long-context memory better than state-of-the-art models
  • Uses “continuum memory systems” that update at different frequencies
  • Features a self-modifying architecture that learns its own update rules

This is similar to how our brain manages short-term and long-term memory simultaneously—what Google calls a “Continuum Memory System” (CMS).
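A minimal caricature of the continuum idea, assuming only the multi-frequency behavior described above (the actual CMS uses learned memory modules; the names and blending rule here are hypothetical): a chain of memory levels where each slower level periodically consolidates from the faster one below it.

```python
# A toy "continuum memory system": a chain of buffers whose update
# frequencies span short-term to long-term (illustrative only).
periods = [1, 4, 16]              # update every 1, 4, and 16 steps
memories = [0.0 for _ in periods]
updates = [0 for _ in periods]

for step in range(1, 33):
    signal = float(step)          # stand-in for incoming information
    for level, period in enumerate(periods):
        if step % period == 0:
            # Each level consolidates from the level below it (or from
            # the raw signal at level 0), blending old and new content.
            source = signal if level == 0 else memories[level - 1]
            memories[level] = 0.5 * memories[level] + 0.5 * source
            updates[level] += 1

print(updates)  # [32, 8, 2]: fast, medium, and slow memory levels
```

The update counts make the “spectrum” concrete: level 0 acts like short-term memory, level 2 like slowly consolidated long-term memory.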

Real Performance Benchmarks

According to benchmark results, Hope demonstrates:

| Task Type | Metric | Result |
| --- | --- | --- |
| Language Modeling | Perplexity | Lower than Transformers |
| Common-sense Reasoning | Accuracy | Higher than recurrent models |
| Needle-in-Haystack | Memory retrieval | Superior long-context handling |
| Continual Learning | Knowledge retention | Minimal catastrophic forgetting |

Why This Matters Now

As AI researcher Andrej Karpathy has noted, AGI may still be a decade away in part because no one has built an AI system that keeps learning on the job, continually correcting its own limitations in a feedback loop.

Nested Learning directly addresses this gap. According to early analysis, Hope’s ability to mitigate catastrophic forgetting brings us closer to closing the gap between the static nature of current LLMs and the dynamic, adaptive intelligence of the human brain.

The Technical Breakthrough

What makes Nested Learning revolutionary is its reframing of fundamental concepts:

  • Backpropagation → Associative memory mapping data to errors
  • Attention mechanisms → Memory systems mapping tokens to context
  • Optimizers (Adam, SGD) → Memory modules compressing gradients
  • Model architecture → Nested optimization at multiple scales
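The attention-as-memory reading above is concrete: standard scaled dot-product attention already behaves like a soft key-value memory, with queries as probes, keys as addresses, and values as stored contents. A small self-contained sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention read as an associative memory:
    queries probe the memory, keys index stored items, values are contents."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # soft addressing over memory slots
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))   # 2 queries
K = rng.standard_normal((5, 8))   # 5 stored keys
V = rng.standard_normal((5, 8))   # 5 stored values

out, weights = attention(Q, K, V)
print(out.shape)  # (2, 8): each query retrieves a blend of stored values
```

Each row of `weights` is a probability distribution over the five memory slots, which is what makes the “memory system mapping tokens to context” framing more than a metaphor.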

This unified view allows for “deep optimizers” whose internal objective moves beyond simple dot-product similarity to more robust formulations, such as ℓ2 regression losses.
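As a concrete instance of the “optimizers are memory modules” view, classical momentum can be read as a fixed linear memory that compresses the gradient stream. The sketch below shows only that reading; in the paper’s framing, a deep optimizer would replace this hand-coded rule with a learned memory module trained under a richer objective (e.g., an ℓ2 regression loss), which is our interpretation, not the paper’s exact construction.

```python
import numpy as np

def momentum_step(w, m, grad, lr=0.1, beta=0.9):
    """One step of SGD with momentum, read as a memory write:
    the buffer m is a lossy compression of the gradient history."""
    m = beta * m + grad   # memory update: blend old summary with new data
    w = w - lr * m        # memory read: use the summary to move parameters
    return w, m

w, m = np.zeros(3), np.zeros(3)
for grad in [np.array([1.0, 0.0, -1.0])] * 3:   # constant gradient stream
    w, m = momentum_step(w, m, grad)

print(m)  # [2.71, 0., -2.71]: the buffer has "memorized" the gradient trend
```

Seen this way, the exponential moving average is just one (very shallow) choice of memory; swapping in a deeper, trainable module is what makes the optimizer itself a learnable level of the nested system.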

What’s Next

We might finally be closing the gap between AI and the human brain’s ability to continually learn. The implications extend to healthcare, robotics, education, and conversational AI—any domain requiring systems that adapt without forgetting.

Read the full paper: Nested Learning: The Illusion of Deep Learning Architectures