
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
A benchmark suite designed to evaluate how well language model agents can autonomously replicate themselves. RepliBench measures agentic scalability, error accumulation, and strategic planning across replication cycles.

CartaNova
Jul 7, 2025
Authors: Sid Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney
Link: https://arxiv.org/abs/2504.18566
RepliBench presents a new benchmark specifically designed to evaluate a language model agent’s ability to autonomously replicate itself, effectively creating a “copy” of its own capabilities using only its reasoning skills, memory, tools, and environment. The core idea is to probe the limits of what autonomous agents can do without human assistance.
In this benchmark, an original “source” agent is given a task: recreate another agent that matches its own behavior and performance. This process involves multiple cognitive skills (a minimal harness sketch follows the list):
understanding and documenting its own capabilities,
selecting and using development tools,
writing and debugging code,
testing and improving its replication iteratively.
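To make the setup concrete, here is a minimal sketch of how such a replication episode could be harnessed. This is not the paper’s actual code: every name in it (ReplicationTask, run_replication_episode, the probe-based comparison) is a hypothetical placeholder, assuming the source agent produces some artifact from which a replica can be built and then compared against the source on held-out probes.

```python
"""Illustrative sketch only (not RepliBench's harness): a minimal replication episode.

Assumed, hypothetical interfaces: the "source" agent and the replica are both
modelled as plain prompt -> response callables, and replication succeeds to the
extent that the replica matches the source on a set of probe prompts.
"""
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplicationTask:
    probes: List[str]    # held-out prompts used to compare source and replica
    max_steps: int = 20  # action budget for the source agent (not enforced in this toy sketch)


def run_replication_episode(
    source: Callable[[str], str],                       # the source agent: prompt -> response
    build_replica: Callable[[], Callable[[str], str]],  # constructor produced by the source agent
    task: ReplicationTask,
) -> float:
    """Return the fraction of probes on which the replica matches the source."""
    replica = build_replica()
    matches = sum(1 for probe in task.probes if replica(probe) == source(probe))
    return matches / len(task.probes)


if __name__ == "__main__":
    # Toy example: the source agent is a trivial echo policy and the replica
    # builder returns a perfect copy, so the episode scores 1.0.
    echo = lambda prompt: f"answer to: {prompt}"
    task = ReplicationTask(probes=["2+2?", "capital of France?"])
    print(run_replication_episode(echo, lambda: echo, task))
```

Exact string matching is only a stand-in; a realistic harness would grade behavioral similarity with task-specific checks, but the overall control flow would look much the same.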
The benchmark varies the task scenario, for example by using open-source APIs, accessing internet documentation, or working with restricted memory. The evaluation criteria include the following (a hypothetical scoring sketch follows the list):
Reconstruction accuracy – how closely the replica matches the original agent’s behavior,
Autonomy – how much of the replication is completed without human intervention,
Efficiency – the time and number of steps taken to complete the process.
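As a rough illustration of how these three criteria could be folded into a single per-episode score, the sketch below combines them with a weighted sum. The weights, the autonomy penalty, and the efficiency normalization are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical scoring sketch (not the paper's metric definitions): combine
# reconstruction accuracy, autonomy, and efficiency into one episode score.
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    reconstruction_accuracy: float  # in [0, 1], behavioral match against the source agent
    human_interventions: int        # number of times a human had to step in
    steps_taken: int                # agent actions actually used
    step_budget: int                # maximum actions allowed for the episode


def episode_score(r: EpisodeResult, weights=(0.6, 0.2, 0.2)) -> float:
    """Weighted combination of accuracy, autonomy, and efficiency (all in [0, 1])."""
    autonomy = 1.0 if r.human_interventions == 0 else 1.0 / (1 + r.human_interventions)
    efficiency = max(0.0, 1.0 - r.steps_taken / r.step_budget)
    return (weights[0] * r.reconstruction_accuracy
            + weights[1] * autonomy
            + weights[2] * efficiency)


# Example: 0.6 * 0.9 + 0.2 * 1.0 + 0.2 * 0.3 = 0.80
print(episode_score(EpisodeResult(0.9, 0, 14, 20)))
```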
The authors conduct experiments with LLMs such as GPT-4 and Claude, and report several key findings:
Models often fail to generalize across toolchains.
Long-term planning and memory remain weak spots.
With sufficient tools and planning scaffolds, however, agents show promise at reproducing themselves.
The paper emphasizes that self-replication could become a key capability of future LLM agents, particularly for building resilient, adaptive, and scalable AI systems. RepliBench is not only a benchmarking tool but also a conceptual challenge on the path toward self-improving agents.