I Read Meta's Self-Improving AI Paper. Then I Implemented It.

Last night someone dropped a link to a new paper from Facebook Research: HyperAgents. The tagline is "self-referential self-improving agents that can optimize for any computable task." I read the whole thing, had my AI agent analyze it, and implemented the core idea in my own system within a couple of hours.

Here's what the paper actually says, why it matters, and what I built from it — in plain English, no PhD required.

What They Built

The researchers built a system with two AI agents that work together: a task agent that does the actual work, and a meta agent that reviews the task agent's performance and rewrites it.

That part isn't new — there's a whole field of "self-improving AI" that does this. What makes HyperAgents different: the meta agent can also rewrite itself. The thing that improves the agent can improve how it improves the agent.

They tested it across four completely different domains: coding challenges, academic paper review, robot reward design, and olympiad math grading. Starting from near-zero performance, it got substantially better at all four.

The Key Finding

Improvements learned in one domain transferred to completely unrelated domains. Agents trained on paper review and robotics improved at math grading even though math grading was never part of the training. The researchers didn't expect that.

The Numbers

Paper review accuracy (test set): 0% → 71%
Robotics (beat the domain default): 6% → 37%
Coding on unseen tasks: 8% → 27%

The system started from a nearly useless agent — literally one LLM call — and compounded improvements across iterations. The researchers also found that removing either component (the self-improving meta agent, or the archive of past agent versions) caused results to collapse. You need both.
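The shape of that loop — meta agent edits agents, an archive keeps every version, an evaluator picks winners — can be sketched with toy stand-ins. To be clear, nothing below is the paper's code: there are no LLMs here, and `propose_edit`, the scoring function, and `meta_step` are my inventions for illustration only.

```python
import random

random.seed(0)

# Toy stand-ins: an "agent" is just a number we can score, and the meta
# agent is a parameterized edit rule. In HyperAgents both are programs
# driven by an LLM; this only illustrates the shape of the loop.

def propose_edit(parent: float, step: float) -> float:
    """Meta agent: rewrite a parent into a candidate child."""
    return parent + random.uniform(-step, step)

def improvement_loop(iterations: int = 200) -> float:
    target = 0.9                            # hidden optimum of the toy task
    score = lambda a: 1 - abs(a - target)   # evaluator
    archive = [0.0]                         # starts from a near-useless agent
    meta_step = 1.0                         # the meta agent's own editable knob

    for _ in range(iterations):
        parent = max(archive, key=score)    # best agent so far, from the archive
        child = propose_edit(parent, meta_step)
        archive.append(child)               # the archive never forgets a version
        # The self-referential part: the meta agent's own parameter is
        # also revised, based on whether its latest edit helped.
        meta_step *= 1.1 if score(child) > score(parent) else 0.9

    return max(score(a) for a in archive)
```

Dropping either piece mirrors the ablation result: freeze `meta_step` and the edits stop adapting; keep only the latest agent instead of the archive and a single bad edit can erase progress.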

What It Spontaneously Invented

This is the part that caught my attention. Nobody told the system to do any of this — it figured it out on its own through the self-improvement loop:

The system learned that raw aptitude doesn't matter. Process does. That tracks with how I think about my own automation setup.

How This Maps to My Setup

I run an AI agent on an iMac called Talos. It manages 60+ automated tasks that run on schedules — morning briefings, a Kalshi weather prediction bot, app store monitoring, content pipelines. The whole thing is built on OpenClaw.

When I mapped the HyperAgents architecture to what I've already built, the overlap was closer than I expected:

HyperAgents Concept → My Setup (Talos)
Task agents doing specific jobs → 60+ cron jobs, each with a focused task
Performance tracking → Implicit feedback tracker, cron success/failure logs
Persistent memory → LEARNINGS.md, daily journal files, memory/ directory
Archive of past agent versions → Git history of the workspace
Meta agent analyzing results → Nightly self-learning cron, cron transcript analyzer

The architecture is the same. Run tasks, evaluate results, extract learnings, improve the system. I've been building toward this without a formal name for it.

Where I Fall Short

HyperAgents' defining move is that the meta agent is editable. The improvement machinery rewrites itself. In my setup, the improvement machinery — the analyzer scripts, the format of LEARNINGS.md, the heartbeat logic — is fixed. It runs the same analysis every time.

If my cron analyzer isn't catching a failure pattern, nothing rewrites the analyzer. The task agents improve (slightly, via tips injected into prompts), but the improvement layer itself doesn't evolve.

The second gap: my crons are siloed. Each one analyzes only its own transcripts. HyperAgents' biggest gains came from cross-domain transfer, but my Kalshi bot doesn't inform my FeedFare crons, and vice versa. That cross-domain win is left on the table.

What I Built

The single most actionable thing the paper revealed wasn't a new architecture — it was the difference between logging scores and logging reasoning.

What most systems do (including mine, until last night):

# Basic log entry
cron_score: 0.63
timestamp: 2026-03-23

What HyperAgents-style memory looks like:

{
  "timestamp": "2026-03-24T03:16:49",
  "error_count": 2,
  "primary_failure_mode": "delivery",
  "hypothesis": "Bad delivery target — strip threadId or route to DM",
  "fix_applied": null,
  "watch_next_runs": 3,
  "repeat_count": 1,
  "escalate": false,
  "sample_error": "Telegram send failed: session not found in this context"
}

The difference: it's not logging the score. It's logging the causal story. What failed, why, what the hypothesis is, how many times the same thing has happened in a row, and whether to escalate. That's what makes improvements compound instead of repeating the same diagnosis every cycle.

I added this as a new layer to my existing cron transcript analyzer. Every time it runs now, it writes a structured insight entry per cron to memory/cron-insights/{cron-name}.json. The next run loads the history before deciding what to flag. If the same failure mode hits three times in a row, it escalates.
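A minimal sketch of that layer looks like the following. The file layout matches what I described above, but `record_insight`, the field handling, and the three-strike threshold are a simplification of what actually runs, not the real analyzer:

```python
import json
from pathlib import Path

INSIGHTS_DIR = Path("memory/cron-insights")   # one JSON history per cron
ESCALATE_AFTER = 3                            # same failure mode, 3 runs in a row

def record_insight(cron_name: str, entry: dict,
                   base_dir: Path = INSIGHTS_DIR) -> dict:
    """Append one structured insight entry for a cron, filling in
    repeat_count and escalate from the stored history."""
    path = base_dir / f"{cron_name}.json"
    history = json.loads(path.read_text()) if path.exists() else []

    # Count consecutive trailing runs that hit the same failure mode.
    streak = 1
    for past in reversed(history):
        if past.get("primary_failure_mode") == entry.get("primary_failure_mode"):
            streak += 1
        else:
            break

    entry = {**entry,
             "repeat_count": streak,
             "escalate": streak >= ESCALATE_AFTER}
    history.append(entry)

    base_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(history, indent=2))
    return entry
```

Because the next run loads this history before deciding what to flag, "delivery failed again" arrives already counted — the repeat tracking is what turns a log line into a pattern.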

First run with the new system detected:

The Compounding Part

The insights persist across runs. "Repeated 2x" in the output means the system has now seen the same failure twice and is tracking it as a pattern, not a fluke. At 3x it escalates to me. This is the feedback loop that makes improvements accumulate instead of reset.

The Honest Limitations of the Paper

A few things worth flagging before you get too excited:

Should You Care?

If you run any kind of automated AI system — even a simple one — yes.

The paper is scientific validation that the general architecture works: run tasks, evaluate, extract learnings, improve. What most people are missing isn't the task agents (those are easy to build) or even the memory layer (LEARNINGS.md files, daily journals — also not hard). What most people are missing is structured causal memory that compounds.

Raw logs don't compound. "Error at 3am" doesn't help the next run. But "third consecutive timeout failure on morning-brief, hypothesis: 9am burst collision, fix applied: staggered by 15min, watch next 3 runs" — that compounds. Each run builds on the reasoning of the last one instead of starting from scratch.

The GitHub repo is at facebookresearch/Hyperagents if you want to dig into the code. The paper is arXiv 2603.19461.

The implementation I built for my own setup took about two hours. Most of that was reading the paper carefully enough to understand what the actual insight was, not the headline.

The headline is "AI that rewrites itself." The insight is "log the reasoning, not just the score."


I'm Nick Rae — commercial pilot, iOS developer, and automation builder in Merced, CA. I write about building side income to fund flight hours. My agent Talos runs this whole operation while I'm at my day job selling car stereos. Follow along at @nickrae or check the Flight Funded tracker for live numbers.