The Paper That Started It
I stumbled on a paper from IBM Research called "Trajectory-Informed Memory Generation for Self-Improving Agent Systems" while browsing arXiv. The title is academic, but the core idea is straightforward: most AI agents make the same mistakes over and over because they have no structured way to learn from what they've already done.
The researchers built a system with four parts:
- A Trajectory Intelligence Extractor that watches what the agent does step by step
- A Decision Attribution Analyzer that figures out which decisions led to failures, recoveries, or wasted effort
- A Contextual Learning Generator that produces three types of reusable guidance: strategy tips, recovery tips, and optimization tips
- An Adaptive Memory Retrieval System that injects the right lessons at the right time
On the AppWorld benchmark, this system delivered a 14.3 percentage point improvement in goal completion. On complex tasks, the gain was 28.5 percentage points, a 149% relative improvement. Those numbers are hard to ignore.
I run OpenClaw as my daily AI agent platform. It manages 46 cron jobs, spawns sub-agents for coding tasks, runs a weather prediction bot, and handles everything from iOS app builds to content scheduling. The agent already had an error registry for logging mistakes. But it had no way to learn from successes, no way to avoid repeating wasteful patterns, and no way to inject prior lessons into new tasks.
That's exactly the gap the IBM paper addresses.
What I Already Had (And What Was Missing)
Before reading the paper, my agent's "memory" for self-improvement was a single file: ERROR-REGISTRY.md. Every time the agent made a mistake that I corrected, it logged the error and the fix. Over three weeks, this file accumulated about 30 entries.
The problem is obvious once you think about it: the error registry only learns from failures.
Consider this real example: my agent generated 22 scene images for a YouTube video. Six of them had character consistency problems (the main character looked different across scenes). The agent regenerated those six, fixed them, and declared success. The error registry logged nothing because the task ultimately succeeded.
But it was a wasteful success. If the agent had locked the character description in the first prompt and generated a test batch of three images before doing all 22, it would have caught the consistency issue early and saved roughly 40% of the API calls. That pattern of "test small, then scale" never got captured anywhere.
The IBM paper names this category explicitly: optimization tips, drawn from tasks that succeeded but were wasteful. They sit alongside strategy tips (clean successes worth replicating) and recovery tips (failure-then-fix sequences). Three types of learning, not one.
What I Built
The Three-Tip Classification System
I created a new file called OPTIMIZATION-LOG.md with three structured tables:
Strategy Tips capture patterns from clean, efficient executions. Things like "when updating iOS version numbers, update all three files (app.json, project.pbxproj, Info.plist) in one pass" or "use Normal CDF for all bracket markets, never lookup tables." These are reusable patterns that worked well and should be repeated.
Recovery Tips capture failure-then-recovery sequences. The format is: what broke, then what fixed it. For example: "Cron silently fails with empty output on Gemini Flash" is the failure pattern. "Switch to Haiku or GPT-4o Mini; Gemini Flash breaks on multi-tool-call sequences" is the recovery. Next time the agent sees a similar failure, it has a playbook.
Optimization Tips capture wasteful successes. The format is: what happened, then what should have happened, with estimated savings. The image generation example above became: "Lock character description in the first prompt, generate test batch of 3, verify consistency, then generate remaining scenes. Estimated savings: ~40% fewer API calls."
Each tip is tagged with a domain: [feedfare], [kalshi], [coding], [infra], [bible-tales], [kdp]. Domain tags matter for the next part.
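The log itself is just markdown tables, but the underlying structure can be sketched as a small data model. This is my illustration, not code from the system; the class and field names are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class TipType(Enum):
    STRATEGY = "strategy"          # clean success worth replicating
    RECOVERY = "recovery"          # failure-then-fix sequence
    OPTIMIZATION = "optimization"  # wasteful success

@dataclass
class Tip:
    tip_type: TipType
    domain: str        # e.g. "bible-tales", "kalshi", "infra"
    lesson: str        # the reusable guidance itself
    context: str = ""  # what happened / what broke
    savings: str = ""  # estimated savings, optimization tips only

# The image-generation example from above, expressed as a Tip:
image_tip = Tip(
    tip_type=TipType.OPTIMIZATION,
    domain="bible-tales",
    lesson="Lock character description in the first prompt; generate a test batch of 3 before the full set",
    context="6 of 22 scene images had character drift and needed regeneration",
    savings="~40% fewer API calls",
)
```

Keeping tips this structured (type, domain, lesson) is what makes the retrieval step in the next section cheap.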
Spawn-Time Tip Injection
This is the piece that mirrors the IBM paper's "Adaptive Memory Retrieval System," and it's the piece that actually changes agent behavior.
Every time my agent spawns a sub-agent (which happens dozens of times a day for coding tasks, content generation, and automation), it now follows this protocol:
- Identify the domain tag for the task
- Query the optimization log and learnings file for tips matching that domain
- Inject the top 3-5 relevant tips as a "LESSONS LEARNED" section at the end of the sub-agent's prompt
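The three steps above can be sketched in a few lines. This is a simplified stand-in, assuming tips have already been parsed out of the log into dicts; the function name and tip format are my own:

```python
def inject_lessons(prompt: str, domain: str, tips: list[dict], limit: int = 5) -> str:
    """Append the most relevant prior lessons to a sub-agent prompt.

    `tips` is a list of {"domain": ..., "lesson": ...} entries parsed
    from the optimization log (parsing is out of scope here).
    """
    relevant = [t["lesson"] for t in tips if t["domain"] == domain][:limit]
    if not relevant:
        return prompt  # no matching lessons; leave the prompt untouched
    section = "\n\n## LESSONS LEARNED (from prior runs)\n" + "\n".join(
        f"- {lesson}" for lesson in relevant
    )
    return prompt + section

tips = [
    {"domain": "bible-tales", "lesson": "Lock character descriptions across ALL scene prompts"},
    {"domain": "infra", "lesson": "Stagger cron jobs to avoid rate-limit collisions"},
]
new_prompt = inject_lessons("Generate 22 scene images.", "bible-tales", tips)
```

Filtering by domain before truncating to `limit` is the important part: the sub-agent gets a handful of tips that actually apply, not a dump of everything the system has learned.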
The injection looks like this in practice:
## LESSONS LEARNED (from prior runs)
- Lock character descriptions across ALL scene prompts to prevent consistency drift
- Checkpoint git before starting: git add -A && git commit -m "Checkpoint"
- If you hit a rate limit error, pause 30s and retry; do not rapid-retry
The sub-agent reads these tips as part of its instructions and follows them. It's not magic. It's structured context injection. But it means a sub-agent spawned to generate Bible Tales images today won't repeat the consistency mistake from last week.
Automatic Logging Thresholds
The IBM paper emphasizes that the system should identify learnings automatically, not rely on a human to log every observation. I set up thresholds for when the agent should auto-log an optimization tip:
- Sub-agent task takes more than 20 tool calls
- Task costs more than $0.50 in model usage
- Task takes more than 5 minutes for something that should be under 1 minute
- Same task pattern executed 3+ times without improvement
- User re-queries or corrects output (signal that the first attempt was inefficient)
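Those thresholds reduce to a single boolean check over a task's metrics. A minimal sketch, assuming a metrics dict with these (hypothetical) keys is collected per task:

```python
def should_autolog(metrics: dict) -> bool:
    """Return True if a completed task's trajectory crosses any
    auto-logging threshold and should be flagged for an optimization tip."""
    return (
        metrics.get("tool_calls", 0) > 20                 # too many tool calls
        or metrics.get("cost_usd", 0.0) > 0.50            # too expensive
        or (metrics.get("duration_s", 0) > 300            # >5 min for a task
            and metrics.get("expected_s", 60) <= 60)      # expected under 1 min
        or metrics.get("repeat_count", 0) >= 3            # same pattern, no improvement
        or metrics.get("user_corrected", False)           # user had to re-query or fix
    )
```

Because every trigger is measurable, the check can run automatically after each sub-agent completes, with no human in the loop.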
Post-Run Scoring
Every sub-agent completion now gets a 1-5 score:
| Score | Meaning | Action |
|---|---|---|
| 5 | Clean, efficient, correct on first try | Log as Strategy Tip |
| 4 | Correct, minor inefficiency | Note for reference |
| 3 | Correct but slow or wasteful | Log as Optimization Tip |
| 2 | Needed manual intervention | Log as Recovery Tip |
| 1 | Failed or wrong output | Error Registry + Recovery Tip |
Scores drive tip generation. A score of 5 means "capture what worked." A score of 1 means "capture what broke and what fixed it." The middle scores catch the patterns that most systems miss entirely.
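The score-to-action mapping is simple enough to express directly. A sketch, with action strings of my own wording:

```python
def action_for_score(score: int) -> str:
    """Map a 1-5 post-run score to the tip-logging action it triggers."""
    actions = {
        5: "log strategy tip",                  # clean, efficient, correct first try
        4: "note for reference",                # correct, minor inefficiency
        3: "log optimization tip",              # correct but slow or wasteful
        2: "log recovery tip",                  # needed manual intervention
        1: "error registry + recovery tip",     # failed or wrong output
    }
    if score not in actions:
        raise ValueError(f"score must be 1-5, got {score}")
    return actions[score]
```

The point of making this a total mapping is that no completion falls through the cracks: even a 4, which logs nothing structured, still leaves a note.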
What Else Changed That Morning
The trajectory learning system was the headline, but I used the same audit to fix a bunch of accumulated issues:
Cron Fleet Optimization (46 Jobs)
I found that 21 of 46 cron jobs were running on the same AI model. When that model hit rate limits, multiple jobs would fail simultaneously. I redistributed them across three providers so no single provider carried more than 15 jobs. Four broken crons got fixed by switching to models that handle multi-step tool calls more reliably.
I also found a "cron stampede" at 7:00 AM where four jobs fired at the exact same time. Staggering them by 5-10 minutes eliminated the rate limit collisions.
Orphan Transcript Cleanup
OpenClaw stores conversation transcripts for sub-agent sessions. Over three weeks, 231 transcript files had accumulated that were no longer referenced by any active session. They were taking up 14.5 MB. I wrote a cleanup script and automated it to run weekly.
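The orphan-detection logic is the only interesting part of that script. A minimal sketch, assuming transcripts are stored one file per session and named by session id (the `.jsonl` extension and filename layout are my assumptions, not OpenClaw's documented format):

```python
from pathlib import Path

def find_orphans(transcript_dir: Path, active_ids: set[str]) -> list[Path]:
    """Return transcript files whose session id no longer appears
    in any active session. Assumes filenames like <session-id>.jsonl;
    adapt the glob and stem logic to your actual layout."""
    return [p for p in transcript_dir.glob("*.jsonl") if p.stem not in active_ids]
```

Collecting the orphans first and deleting in a second pass (ideally after logging what will be removed) keeps the script safe to run unattended on a weekly cron.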
Memory File Audit
The agent's persistent memory had grown to 89 files. Five were outdated, duplicated, or superseded. I archived those and consolidated cross-references so the agent can find what it needs without scanning the entire directory.
Weekly Maintenance Cron
All of this cleanup was manual. So I wrote a script that handles the repeatable parts and scheduled it for Sunday at 3:30 AM. It cleans orphan transcripts, checks disk usage, scans for secrets that might have leaked into workspace files, verifies cron health, and commits a git backup. Report goes to an ops channel. If anything's broken, I get a direct alert.
The Bigger Idea
The IBM paper's core insight is that most AI agent "memory" systems store conversational facts: what the user said, what the agent did, what happened next. That's useful for continuity, but it doesn't help the agent get better at its job.
Trajectory-informed memory is different. It analyzes how the agent executed a task, not just what it produced. It asks: was this efficient? Did the agent recover from a mistake? Is there a reusable pattern here? And then it feeds those lessons back into future tasks at the exact moment they're relevant.
That's the difference between an agent that remembers what you talked about last week and an agent that remembers how to avoid the mistake it made last week.
I don't know yet if my implementation will match the 14.3-point improvement the IBM team measured on AppWorld. My workloads are different, my evaluation criteria are looser (real tasks don't have clean pass/fail benchmarks), and I'm running a single-user system, not a controlled experiment. But structurally, the system is now capable of learning from three sources instead of one: failures, successes, and everything in between.
How to Apply This to Your Own Setup
You don't need OpenClaw to use these ideas. The pattern works with any AI agent that spawns tasks or runs automated workflows:
- Start logging three tip types, not just errors. Most people only track what goes wrong. Start also tracking what goes right (strategy tips) and what works but wastes resources (optimization tips).
- Tag tips by domain. A lesson learned about image generation isn't useful for a database migration. Domain tags make retrieval relevant.
- Inject tips at task time. The tips are worthless if they sit in a file nobody reads. Put the 3-5 most relevant tips into every task prompt, every time. The agent will follow them.
- Set auto-logging thresholds. Don't rely on yourself to notice every inefficiency. Set measurable triggers (cost, time, tool calls) that flag tasks for review.
- Score completions. A simple 1-5 score after each task creates the feedback loop. Without scoring, you're collecting data but not learning from it.
The paper is free to read: arxiv.org/abs/2603.10600. If you're running any kind of autonomous agent system, it's worth 30 minutes of your time. The implementation took me a morning. The infrastructure cost is essentially zero, just a few markdown files and a protocol change. The potential upside, if the IBM numbers hold, is significant.
I'll report back on whether the tips actually reduce repeat mistakes and wasteful patterns over the next month. For now, the system is live and logging.