AutoScientists

Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision, often requiring researchers to explore multiple competing directions as evidence accumulates and priorities shift. LLM agents can automate parts of this process, but existing agents either concentrate reasoning within a single research thread or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration across research directions or to reorganize as promising and unproductive directions emerge over time. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Rather than following decisions from a central orchestrator, agents independently interpret a shared experimental state, self-organize into teams around research directions, critique and filter proposals in a discussion phase before committing experimental compute, and exchange both successful and failed findings across teams to avoid redundant exploration. Under matched experimental budgets, AutoScientists outperforms prior agentic systems across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest prior biomedical agent by 8.33 percentile points. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9× faster than autoresearch and continues discovering improvements from a stronger starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2–Spike binding that improves over the current state-of-the-art model by 12.5% (relative) in Spearman correlation. Applied without modification to all 217 ProteinGym assays, the same method improves over the prior state of the art by 6.5% (relative) in Spearman correlation.

Overview

Method

AutoScientists deploys n long-running agents that continuously accumulate knowledge, adapt their search strategy, and reorganize their teams across the lifetime of a run. There is no orchestrator. The system alternates between discussion, where agents form teams around research directions, and execution, where those teams run experiments in parallel. When a team stagnates, agents trigger re-discussion and may reorganize into new teams pursuing different directions.
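The alternation between discussion and execution can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the system's implementation: the names `Team`, `discuss`, `simulate`, and the `patience` stagnation threshold are hypothetical, agents choose directions at random, and each "experiment" is a random score where lower is better. The loop is written sequentially only for readability; in the system described here, agents make these decisions themselves and teams execute in parallel.

```python
import random
from dataclasses import dataclass

@dataclass
class Team:
    direction: str
    members: list
    best: float = float("inf")   # best score seen by this team (lower is better)
    stale_rounds: int = 0        # consecutive rounds without improvement

    def record(self, score: float) -> None:
        """Update the team's best score and its stagnation counter."""
        if score < self.best:
            self.best, self.stale_rounds = score, 0
        else:
            self.stale_rounds += 1

def discuss(agents, directions, rng):
    """Discussion phase (toy): every agent picks a direction; teams are the groups."""
    teams = {}
    for agent in agents:
        choice = rng.choice(directions)
        if choice not in teams:
            teams[choice] = Team(choice, [])
        teams[choice].members.append(agent)
    return list(teams.values())

def simulate(agents, directions, rounds=20, patience=3, seed=0):
    """Alternate execution rounds with re-discussion whenever a team stagnates."""
    rng = random.Random(seed)
    teams = discuss(agents, directions, rng)
    rediscussions = 0
    for _ in range(rounds):
        for team in teams:                 # execution phase
            team.record(rng.random())      # toy stand-in for an experiment score
        if any(t.stale_rounds >= patience for t in teams):
            teams = discuss(agents, directions, rng)   # stagnation: reorganize
            rediscussions += 1
    return teams, rediscussions
```

Calling `simulate(["a1", "a2", "a3", "a4"], ["architecture", "schedule", "optimizer"])` returns the final teams and the number of re-discussions that stagnation triggered.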

Each team runs a continuous propose–execute loop on a single shared state S, which holds the current champion p*, an experiment log L, a structured discussion forum F, and, for each team k, a queue Q_k and a dead-end registry D_k. Queues and registries remain readable across teams.
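One way to picture the shared state is as a single record with per-team sections. The sketch below is an assumption about layout, not the system's actual schema: the class name `SharedState`, its field names, and the helper methods are hypothetical, and proposals are represented as plain strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class SharedState:
    champion: str                                            # p*: current best program
    champion_score: float
    log: List[dict] = field(default_factory=list)            # L: experiment log
    forum: List[dict] = field(default_factory=list)          # F: discussion forum
    queues: Dict[str, List[str]] = field(default_factory=dict)     # Q_k per team
    dead_ends: Dict[str, List[str]] = field(default_factory=dict)  # D_k per team

    def mark_dead_end(self, team: str, proposal: str) -> None:
        """Record a failed direction in the owning team's registry."""
        self.dead_ends.setdefault(team, []).append(proposal)

    def known_dead_ends(self) -> Set[str]:
        """Union of all teams' registries, since they are readable cross-team."""
        return {p for proposals in self.dead_ends.values() for p in proposals}

    def enqueue(self, team: str, proposal: str) -> bool:
        """Queue a proposal unless any team has already marked it a dead end."""
        if proposal in self.known_dead_ends():
            return False
        self.queues.setdefault(team, []).append(proposal)
        return True
```

With this layout, a proposal that one team has registered as a dead end is rejected when another team tries to queue it, which is the redundancy the cross-team registries are meant to avoid.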

Types of Agents

Analyst: reads L and F, ranks proposals by effect size, writes them to Q_k, and owns the hypothesis docs and the dead-end registry.
Experiment: claims a proposal from Q_k, applies its diff to p*, trains, and records the result to L and F, subject to a noise-gated second-seed confirmation.
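The text does not spell out the confirmation rule, so the following is only one plausible reading of "noise-gated second-seed confirmation": a gain larger than an estimated noise floor is accepted directly, while a gain within the noise floor triggers a second run with a different seed and is accepted only if the two-run mean still beats the champion. The function `confirm`, the `noise_floor` parameter, and the averaging rule are all assumptions made for illustration.

```python
def confirm(champion_score, run, noise_floor, seeds=(0, 1)):
    """Decide whether a candidate replaces the champion (lower score is better).

    `run(seed)` trains the candidate with the given seed and returns its score.
    Returns (accepted, score_used_for_the_decision).
    """
    first = run(seeds[0])
    gain = champion_score - first
    if gain <= 0:
        return False, first            # no improvement: reject without a second run
    if gain > noise_floor:
        return True, first             # clear improvement: accept on one seed
    second = run(seeds[1])             # gain is within noise: spend a second seed
    mean = (first + second) / 2
    return mean < champion_score, mean
```

Gating the second run on the noise floor keeps the extra compute for borderline cases only, which matters when every experiment has a fixed training budget.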

Shared State

Every agent runs the same heartbeat: read S, act, write back.
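The read, act, write-back cycle can be sketched with a lock-protected store. This is an illustrative sketch, not the system's implementation: `Store`, `heartbeat`, and the convention that `act` returns an update function are hypothetical choices, and the state is a plain dictionary.

```python
import copy
import threading

class Store:
    """Holds the shared state and serializes access to it."""

    def __init__(self, state):
        self._state = state
        self._lock = threading.Lock()

    def read(self):
        """Return a private snapshot so agents never mutate shared data directly."""
        with self._lock:
            return copy.deepcopy(self._state)

    def write(self, update):
        """Apply an update function to the shared state under the lock."""
        with self._lock:
            update(self._state)

def heartbeat(store, act):
    """One heartbeat: read S, act on the snapshot, write the result back."""
    snapshot = store.read()
    update = act(snapshot)      # act returns a function that edits the state, or None
    if update is not None:
        store.write(update)

def make_logging_agent(name):
    """Toy agent whose action is to append its name to the experiment log."""
    def act(snapshot):
        def update(state):
            state["log"].append(name)
        return update
    return act
```

Because every agent uses the same `heartbeat` function and differs only in its `act` callback, analysts and experiment agents can share one loop while no agent directs the others.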

Forum

Status labels: DISCUSSION, QUEUED, RUNNING, KEEP, REJECT.

Results

BioML-Bench

BioML-Bench is a benchmark of 24 end-to-end biomedical ML tasks spanning biomedical imaging (4), drug discovery (9), protein engineering (6), and single-cell omics (5). Each task provides a natural-language description and training data; submissions are graded against hidden test labels by an external evaluator. We report leaderboard percentile relative to public human submissions, above-median rate, and medal rate. AutoScientists achieves a mean leaderboard percentile of 74.40% (SE 6.20) vs. 66.07% for the strongest prior biomedical agent (+8.33 points), with the largest gain in drug discovery (64.52% vs. 46.16% for Biomni). Error bars are standard errors of the mean.
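The aggregate statistics above can be computed as in the following sketch. The exact percentile definition used by the benchmark is not given here, so `leaderboard_percentile` assumes the simplest one (the share of public human submissions that a score strictly beats); both function names are hypothetical, and the standard error uses the sample standard deviation.

```python
import math
import statistics

def leaderboard_percentile(score, human_scores, higher_is_better=True):
    """Percent of human submissions strictly beaten by `score` (assumed definition)."""
    if higher_is_better:
        beaten = sum(1 for h in human_scores if score > h)
    else:
        beaten = sum(1 for h in human_scores if score < h)
    return 100.0 * beaten / len(human_scores)

def mean_and_sem(values):
    """Mean and standard error of the mean: sample std divided by sqrt(n)."""
    n = len(values)
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)
```

Applied to the 24 per-task percentiles, `mean_and_sem` would yield the kind of "mean (SE)" pair reported above; the per-task values themselves are not listed on this page.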

GPT Training Optimization

AutoScientists is applied to the GPT nanochat training optimization task. Each experiment is a single 5-minute training run on one H100 GPU, scored by validation bits-per-byte (val_bpb, lower is better).

From the autoresearch baseline (val_bpb = 0.998): AutoScientists reaches val_bpb ≈ 0.978 in 34 experiments vs. 65 for autoresearch, a 1.9× speedup at matched loss. Agents form three parallel teams (architecture, schedule, optimizer), whereas the single-agent loop advances one axis per experiment.

From the AutoScientists champion (val_bpb = 0.9777): AutoScientists accepts 7 improvements, reaching 0.9730 across heterogeneous directions (normalization order, matrix initialization, learning-rate fraction, softcap, compile autotuning). Autoresearch finds none in 100 experiments.
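The speedup figure is a ratio of experiment counts at matched loss. The sketch below shows how such a count can be read off a run log and how the ratio follows from the two reported counts; the function names and the short `toy_log` are hypothetical, and only the counts 34 and 65 come from the results above.

```python
def experiments_to_target(val_bpb_log, target):
    """1-based index of the first experiment with val_bpb at or below target, else None."""
    for i, value in enumerate(val_bpb_log, start=1):
        if value <= target:
            return i
    return None

def speedup(baseline_experiments, system_experiments):
    """How many times fewer experiments the system needs to reach the same target."""
    return baseline_experiments / system_experiments

toy_log = [0.998, 0.991, 0.985, 0.978]          # hypothetical val_bpb per experiment
print(experiments_to_target(toy_log, 0.978))    # index of first run hitting the target
print(round(speedup(65, 34), 2))                # ratio of the two reported counts
```

Counting experiments rather than wall-clock time is meaningful here because every experiment has the same fixed 5-minute budget.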

Figure: AutoScientists vs. autoresearch from the autoresearch baseline (val_bpb 0.998). AutoScientists reaches the target in 34 experiments vs. 65.
Figure: AutoScientists vs. autoresearch from the AutoScientists champion (val_bpb 0.9777). AutoScientists: 7 accepted; autoresearch: 0 in 100 experiments.

ProteinGym

AutoScientists starts from Kermut — a Gaussian-process method and the best-performing supervised baseline — and modifies it on a single development assay (ACE2–Spike binding) with no access to the full benchmark. The discovered recipe is a three-GP ensemble combining Kermut’s structure-kernel with expanded zero-shot features, greedy diversity-based feature selection, and quantile-warped targets. On the development assay, mean Spearman ρ improves from 0.747 to 0.840 (+12.5% relative). The frozen recipe transfers without modification across all 217 DMS assays, improving the official average Spearman ρ from 0.657 to 0.700 (+6.5% relative) across all three cross-validation schemes. Values are mean (SE) from the official ProteinGym aggregation pipeline.
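The relative improvements quoted above can be rechecked from the rounded Spearman values. The helper name `relative_improvement` is hypothetical; because the inputs are rounded to three decimals, the recomputed percentages agree with the stated 12.5% and 6.5% only to within rounding.

```python
def relative_improvement(before, after):
    """Relative change (after - before) / before, for a metric where higher is better."""
    return (after - before) / before

dev_assay = relative_improvement(0.747, 0.840)     # ACE2-Spike development assay
all_assays = relative_improvement(0.657, 0.700)    # average over all 217 DMS assays

print(f"development assay: {100 * dev_assay:.2f}% relative")
print(f"all assays:        {100 * all_assays:.2f}% relative")
```

In absolute terms the same gains are 0.093 and 0.043 in Spearman ρ, which is why the results state explicitly that the percentages are relative.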

Ablations

We isolate four AutoScientists components on four tasks, holding the agent backend, task interface, total compute, and starting program fixed. Each removed component is most damaging on a different task, showing that the four mechanisms address complementary failure modes rather than redundantly contributing the same gain.

No Analyst

Removes the 3 analyst agents. Experiment agents take over proposal generation, knowledge-file maintenance, and hypothesis tracking. Most damaging on TDC-hERG (AUROC 0.867 → 0.738), where proposal quality is the bottleneck.

No Cross-Agent Feedback

Disables comment threads on proposals and results. Agents cannot critique each other or share near-misses across teams. Most damaging on Human Plasma-Protein Binding (Pearson 0.8729 → 0.7144), where individual agents observe only a partial signal.

No Self-Organization

Fixes team structure at boot; agents cannot reorganize across rounds. Most damaging on GPT nanochat (val_bpb 0.9777 → 0.9833), where the productive research direction shifts during the run.

Independent Agents

Removes cross-agent feedback and the shared state (champion, log, dead-end registry, knowledge). Each agent maintains only its own private state. Most damaging on Cell-Cell Communication (Odds Ratio 0.924 → 0.435), the largest proportional drop.

| Configuration | TDC-hERG (AUROC ↑) | Human Plasma (Pearson ↑) | Cell-Cell (Odds Ratio ↑) | GPT nanochat (val_bpb ↓) |
|---|---|---|---|---|
| AutoScientists | 0.867 | 0.8729 | 0.924 | 0.9777 |
| no analyst | 0.738 | 0.8051 | 0.812 | 0.9815 |
| no cross-agent feedback | 0.804 | 0.7144 | 0.781 | 0.9806 |
| no self-organization | 0.821 | 0.8312 | 0.706 | 0.9833 |
| independent agents | 0.692 | 0.6810 | 0.435 | 0.9851 |
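To compare ablations across metrics with different scales and directions, each entry can be converted to a degradation relative to the full system. The sketch below does this for the values in the table; `relative_drop` and the dictionary layout are hypothetical, and dividing by the full-system score is only one possible normalization. Changes in val_bpb are small in relative terms, so cross-task comparisons on this scale are indicative at best.

```python
# True means higher is better for that task's metric.
TASKS = {"TDC-hERG": True, "Human Plasma": True, "Cell-Cell": True, "GPT nanochat": False}

FULL = {"TDC-hERG": 0.867, "Human Plasma": 0.8729, "Cell-Cell": 0.924, "GPT nanochat": 0.9777}

ABLATIONS = {
    "no analyst":              {"TDC-hERG": 0.738, "Human Plasma": 0.8051, "Cell-Cell": 0.812, "GPT nanochat": 0.9815},
    "no cross-agent feedback": {"TDC-hERG": 0.804, "Human Plasma": 0.7144, "Cell-Cell": 0.781, "GPT nanochat": 0.9806},
    "no self-organization":    {"TDC-hERG": 0.821, "Human Plasma": 0.8312, "Cell-Cell": 0.706, "GPT nanochat": 0.9833},
    "independent agents":      {"TDC-hERG": 0.692, "Human Plasma": 0.6810, "Cell-Cell": 0.435, "GPT nanochat": 0.9851},
}

def relative_drop(full, ablated, higher_is_better):
    """Degradation as a fraction of the full-system score (positive means worse)."""
    delta = full - ablated if higher_is_better else ablated - full
    return delta / full

drops = {
    (config, task): relative_drop(FULL[task], score, TASKS[task])
    for config, scores in ABLATIONS.items()
    for task, score in scores.items()
}

worst = max(drops, key=drops.get)
print(worst, f"{100 * drops[worst]:.1f}%")
```

Under this normalization the single largest degradation in the table is the independent-agents configuration on Cell-Cell Communication, consistent with the description of it as the largest proportional drop.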