Let AI coaches score every action to train agents end-to-end
MAPPA addresses two fundamental challenges in training multi-agent systems end-to-end:
| Challenge | Problem | MAPPA Solution |
|---|---|---|
| Credit Assignment | When a task fails, which agent is responsible? | AI coach examines each agent's outputs and tool feedback to assign accurate blame |
| Sample Efficiency | Multi-agent rollouts are expensive, but traditional RL provides only one signal at the end | Per-action process rewards provide dense feedback for every step |
An LLM coach evaluates every action as it happens—not just the final outcome. The coach receives:
- The agent's role and what it was asked to do
- What the agent saw before acting
- What the agent generated
- Tool output: stdout, stderr, error messages
This enables accurate credit assignment without counterfactual reasoning—just checking what each agent actually produced.
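As an illustration, the coach's input can be assembled into a single evaluation prompt. This is a minimal sketch with hypothetical names (`ActionRecord`, `build_coach_prompt`), not MAPPA's actual API:

```python
from dataclasses import dataclass

@dataclass
class ActionRecord:
    role: str           # the agent's role and what it was asked to do
    observation: str    # what the agent saw before acting
    output: str         # what the agent generated
    tool_feedback: str  # stdout, stderr, error messages from tool calls

def build_coach_prompt(rec: ActionRecord) -> str:
    """Assemble one evaluation prompt from a single agent action."""
    return (
        f"Role: {rec.role}\n"
        f"Observation: {rec.observation}\n"
        f"Agent output: {rec.output}\n"
        f"Tool feedback: {rec.tool_feedback}\n"
        "Rate this action. Respond with PROCESS_SCORE: <0-10> and REASONING."
    )
```

Because the tool feedback is included verbatim, the coach can check claims against actual execution results rather than reasoning counterfactually.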
| Feature | Description |
|---|---|
| Per-Action Coaching | AI coach (Gemini) evaluates each agent action with process rewards (0-10) |
| Multi-Agent Orchestration | Sequential agent workflows where each agent builds on previous outputs |
| Code Execution | Agents write and execute Python via SandboxFusion (secure, isolated) |
| Distributed RL Training | REINFORCE++ with DeepSpeed + Ray for multi-GPU training |
- Python 3.11+
- CUDA-compatible GPUs (minimum 2x 80GB A100/H100)
- UV package manager (recommended)
- Clone and install:

```bash
git clone https://github.com/ltjed/multiagent-coaching.git
cd multiagent-coaching
uv venv --python 3.11
source .venv/bin/activate
uv pip install -r requirements_uv.txt
```

- Set up SandboxFusion (code execution):
```bash
git clone https://github.com/bytedance/SandboxFusion.git ~/SandboxFusion
cd ~/SandboxFusion

# Main environment
conda create -n sandbox python=3.12 -y
conda activate sandbox
pip install poetry && poetry install

# Runtime environment
conda create -n sandbox-runtime python=3.11 -y
conda activate sandbox-runtime
pip install -r runtime/python/requirements.txt
```

- Configure LLM coach credentials:
```bash
# For Vertex AI (Gemini)
export VERTEX_PROJECT=your-gcp-project-id
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# Or standard Gemini API
export GOOGLE_API_KEY=your-api-key
```

Train a 3-agent system for math problem solving:

```bash
bash scripts/run_train_mathchat.sh Qwen/Qwen3-4B-Thinking-2507
```

Workflow: Problem Solver → Code Executor → Verifier
- Trains on AIME competition problems
- External coach (Gemini) evaluates each agent action
- Saves checkpoints to `./checkpoints/mathchat_coach/`
Train a 3-agent system for data science tasks:
```bash
bash scripts/run_train_dsbench.sh Qwen/Qwen3-4B-Thinking-2507
```

Workflow: Data Engineer → Modeler → Analyst
- Kaggle-style modeling tasks
- Agents write and execute code via SandboxFusion
- Evaluates on held-out tasks with ground truth metrics
```bash
python -m marti.cli.commands.train \
    --config-name mathchat_with_coach \
    default_agent.pretrain=/path/to/model \
    use_wandb=your_api_key
```

```
multiagent-coaching/
├── marti/                          # Core package
│   ├── agents/                     # Agent implementations
│   │   ├── base_agent.py           # Abstract Agent class
│   │   ├── multi_agent.py          # Multi-agent orchestration
│   │   └── math_agent.py           # Math-specific agents
│   │
│   ├── cli/                        # Command-line interface
│   │   ├── commands/train.py       # Main training entry point (Hydra)
│   │   └── configs/                # Hydra configuration files
│   │       ├── mathchat_with_coach.yaml
│   │       ├── dsbench_ds_pipeline.yaml
│   │       └── default.yaml
│   │
│   ├── controllers/                # Training orchestration
│   │   ├── base_controller.py      # Single-agent controller
│   │   └── multi_agent_controller.py
│   │
│   ├── models/                     # Model infrastructure
│   │   ├── actor.py                # Actor model wrapper
│   │   ├── vllm/                   # vLLM inference engines
│   │   └── ray_launcher.py         # Distributed training
│   │
│   ├── trainers/ppo/               # RL training
│   │   ├── trainer.py              # REINFORCE++/PPO trainer
│   │   ├── actor.py                # Policy training
│   │   └── critic.py               # Value function training
│   │
│   ├── verifiers/                  # Reward computation
│   │   ├── coach/external_coach.py # LLM-based process evaluator
│   │   ├── dsbench/                # Data science metrics
│   │   └── qwen/                   # Math answer verification
│   │
│   └── worlds/                     # Execution environments
│       ├── multi_agent_world_async.py
│       ├── workflows/              # Task-specific pipelines
│       │   ├── mathchat_workflow_with_coach.py
│       │   └── dsbench_workflow.py
│       └── tools/                  # Code execution, search
│
├── scripts/                        # Training scripts
│   ├── run_train_mathchat.sh
│   ├── run_train_dsbench.sh
│   └── setup_sandbox.sh
│
├── data/Bench/                     # Evaluation datasets
│   ├── AIME_1983_2024.json         # 933 AIME problems
│   ├── amc.json                    # AMC problems
│   └── dsbench_*.json              # Data science benchmarks
│
├── requirements_uv.txt             # Dependencies
└── setup_env.sh                    # Environment setup automation
```
```
1. MultiAgentController initializes:
   ├─ Dataset loading (MATH, AIME, DSBench)
   ├─ Agent models (actor/critic/reference)
   ├─ vLLM engines (2 per agent)
   └─ MultiAgentWorldAsync environment

2. For each episode:
   ├─ Experience generation:
   │   ├─ Agent 1 generates output
   │   ├─ Coach evaluates Agent 1 → PROCESS_SCORE: X/10
   │   ├─ Agent 2 sees Agent 1's output + coach feedback
   │   ├─ Coach evaluates Agent 2
   │   └─ Agent 3 sees all outputs, produces final answer
   │
   └─ REINFORCE++ training:
       ├─ Compute advantages (global batch normalization)
       ├─ DeepSpeed backpropagation
       └─ Checkpoint saving + metric logging
```
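The advantage step above can be sketched in a few lines. This assumes REINFORCE++-style global batch normalization of scalar rewards; the actual trainer operates on token-level tensors:

```python
import statistics

def normalized_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center and scale rewards across the whole batch (global batch
    normalization). Sketch only; names are illustrative."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Normalizing over the global batch, rather than per prompt, keeps advantages comparable across agents and prompts within a training step.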
The external LLM coach provides process rewards (0-10 scale) for each agent action:
```
PROCESS_SCORE: 8
REASONING: The code correctly implements the solution approach...
```
This enables:
- Dense feedback: Every action receives a reward, not just final outcomes
- Accurate credit assignment: Coach examines tool outputs to trace blame correctly
- Cross-model learning: Train smaller models with feedback from larger coaches
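Given the `PROCESS_SCORE:` format above, turning a coach reply into a scalar reward is a small parsing step. A hedged sketch (the actual verifier may differ):

```python
import re

def parse_process_score(coach_reply: str, max_score: int = 10) -> float:
    """Extract the numeric process reward from a coach reply and scale
    it to [0, 1]. Returns 0.0 if no score line is found (hypothetical
    fallback, not necessarily MAPPA's behavior)."""
    m = re.search(r"PROCESS_SCORE:\s*(\d+)", coach_reply)
    if not m:
        return 0.0
    return min(int(m.group(1)), max_score) / max_score

reply = "PROCESS_SCORE: 8\nREASONING: The code correctly implements..."
parse_process_score(reply)  # → 0.8
```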
- Sequential execution: Each agent sees all previous agents' outputs
- File-based coordination: Agents pass artifacts through shared workspace (creates audit trail for coach)
- Thinking models: Support for `<think>` tags with `is_reasoning_model=true`
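The `<think>` handling can be sketched as stripping the reasoning span before an agent's output is passed downstream (illustrative only; the framework's actual behavior is governed by `is_reasoning_model=true`):

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning spans so downstream agents
    and the coach see only the final answer text. Sketch, not the
    framework's actual implementation."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```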
Training configs use Hydra and live in `marti/cli/configs/`. Each pipeline has its own YAML config and shell script.
Config: `marti/cli/configs/mathchat_with_coach.yaml`
Script: `scripts/run_train_mathchat.sh`
Problem Solver → Code Executor → Verifier
| Agent | Role | Max Turns |
|---|---|---|
| Problem Solver | Reasons through the problem step-by-step | 1 |
| Code Executor | Writes and executes Python code to verify/compute | 2 |
| Verifier | Synthesizes outputs and provides final answer | 1 |
```yaml
# Workflow
workflow_func_path: "marti/worlds/workflows/mathchat_workflow_with_coach.py"

# Coach
workflow_args:
  coach_model: "gemini-2.5-flash"
  use_vertex_ai: true
  coder_max_turns: 2

# Agents
agents:
  - agent_problem_solver
  - agent_code_executor
  - agent_verifier
```

| Parameter | Value | Description |
|---|---|---|
| `advantage_estimator` | `reinforce_plus_plus` | REINFORCE++ algorithm |
| `n_samples_per_prompt` | `2` | Samples per prompt |
| `rollout_batch_size` | `32` | Prompts per batch |
| `train_batch_size` | `16` | Samples per training step |
| `num_episodes` | `8` | Training episodes |
| `vllm_num_engines` | `2` | vLLM engines per agent |
| `prompt_max_len` | `24576` | 24K input context |
| `generate_max_len` | `4096` | 4K generation length |
- Training: 512 problems randomly sampled from `AIME_1983_2024.json` (933 total)
- Evaluation: `aime_eval_32.json` (32 problems) + `amc_eval_32.json` (32 problems)
Config: `marti/cli/configs/dsbench_ds_pipeline.yaml`
Script: `scripts/run_train_dsbench.sh`
Data Engineer → Modeler → Analyst
| Agent | Role | Max Turns | Required Outputs |
|---|---|---|---|
| Data Engineer | EDA, preprocessing, feature engineering | 4 | X_train.pkl, y_train.pkl, X_test.pkl |
| Modeler | Algorithm selection, training, tuning | 4 | model.pkl |
| Analyst | Prediction generation, format verification | 4 | submission.csv |
When something fails, the coach examines the file trail:
```
DATAENGINEER evaluation:
- Tool output: "Saved X_train.pkl, y_train.pkl"
- No mention of X_test.pkl
- VERDICT: Failed to save required artifact
- SCORE: 3/10

ANALYST evaluation:
- Required file X_test.pkl was never created upstream
- Correctly attempted to load it
- VERDICT: Not at fault for the failure
- SCORE: 6/10
```
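The artifact-trail check can be sketched as a lookup against each agent's required outputs. The manifest below mirrors the table above; the function and dict names are hypothetical, not MAPPA's API:

```python
from pathlib import Path

# Hypothetical required-output manifest, mirroring the agent table above.
REQUIRED_OUTPUTS = {
    "data_engineer": ["X_train.pkl", "y_train.pkl", "X_test.pkl"],
    "modeler": ["model.pkl"],
    "analyst": ["submission.csv"],
}

def missing_artifacts(agent: str, workspace: Path) -> list[str]:
    """Return the required files an agent failed to produce in the
    shared workspace, so blame can be traced to the right stage."""
    return [f for f in REQUIRED_OUTPUTS[agent] if not (workspace / f).exists()]
```

Because agents coordinate through files in a shared workspace, a missing artifact pins the failure to the agent that was supposed to produce it.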
```yaml
# Workflow
workflow_func_path: "marti/worlds/workflows/dsbench_workflow.py"

# Coach
workflow_args:
  coach_model: "gemini-2.5-pro"
  use_vertex_ai: true
  data_engineer_max_turns: 4
  modeler_max_turns: 4
  analyst_max_turns: 4

# Agents
agents:
  - agent_data_engineer
  - agent_modeler
  - agent_analyst

# Stratified sampling (maintains classification/regression balance)
default_agent:
  stratified_sampling: true
  stratify_key: "data_type"
```

| Parameter | Value | Description |
|---|---|---|
| `advantage_estimator` | `reinforce_plus_plus` | REINFORCE++ algorithm |
| `n_samples_per_prompt` | `2` | Samples per prompt |
| `rollout_batch_size` | `16` | Prompts per batch |
| `train_batch_size` | `16` | Samples per training step |
| `num_episodes` | `30` | Training episodes |
| `vllm_num_engines` | `2` | vLLM engines per agent |
| `prompt_max_len` | `24576` | 24K input context |
| `generate_max_len` | `16384` | 16K generation (for long code) |
| `coach_model` | `gemini-2.5-pro` | Gemini 2.5 Pro (1M context) |
- Training: `dsbench_modeling_train.json` (64 Kaggle-style modeling tasks)
- Evaluation: `dsbench_modeling_eval.json` (8 held-out tasks)
- Split: stratified, ~47% classification / ~53% regression
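The stratified split can be sketched as proportional sampling per `data_type` stratum. This is an illustrative sketch, not the framework's actual sampler:

```python
import random
from collections import defaultdict

def stratified_sample(tasks: list[dict], k: int,
                      key: str = "data_type", seed: int = 0) -> list[dict]:
    """Sample k tasks while preserving each stratum's share of the
    dataset (e.g. classification vs. regression), approximating the
    effect of stratified_sampling=true."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for task in tasks:
        groups[task[key]].append(task)
    sample: list[dict] = []
    for group in groups.values():
        n = round(k * len(group) / len(tasks))  # proportional allocation
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample
```

Without stratification, a small rollout batch could land entirely on one task type and skew the gradient signal toward it.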
Any config parameter can be overridden via CLI:
```bash
# MathChat
python -m marti.cli.commands.train \
    --config-name mathchat_with_coach \
    default_agent.pretrain=/path/to/model \
    workflow_args.coach_model="gemini-2.5-flash"

# DSBench
python -m marti.cli.commands.train \
    --config-name dsbench_ds_pipeline \
    default_agent.pretrain=/path/to/model \
    workflow_args.coach_model="gemini-2.5-pro"
```

| Configuration | GPUs | Use Case |
|---|---|---|
| Minimum | 2x 80GB | Single-agent training |
| Recommended | 4-8x 80GB | Multi-agent training |
The framework includes optimizations for limited GPU memory:
- `colocate_all_models=true`: share GPUs between models
- `vllm_gpu_memory_utilization=0.6-0.7`: leave 30-40% of GPU memory for training
- `vllm_enable_sleep=true`: vLLM releases memory during backprop
- `gradient_checkpointing=true`: trade compute for memory
- `zero_stage=3`: maximum DeepSpeed memory compression
| Platform | Configuration |
|---|---|
| Weights & Biases | `use_wandb=<API_KEY>` |
| TensorBoard | Logs saved to `logs/` |
| Weave (LLM tracing) | `use_weave=true` |
- LLM Inference: vLLM 0.8.5, flash-attn 2.7.3, transformers 4.52.1
- Training: PyTorch 2.6.0, DeepSpeed 0.16.8, Ray 2.43.0
- Configuration: Hydra 1.3.2
- LLM APIs: google-genai, openai 2.6.1
- Tools: MCP 1.20.0 (Model Context Protocol)
See `requirements_uv.txt` for exact tested versions.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on the MARTI multi-agent reinforcement learning framework
- Code execution powered by SandboxFusion
- Distributed training with DeepSpeed and Ray
- LLM inference via vLLM
For questions and support, please open an issue on the GitHub repository.
To cite this work, please use the following BibTeX entry:
```bibtex
@misc{li2026mappa,
  title={Scaling Multiagent Systems with Process Rewards},
  author={Ed Li and Junyu Ren and Cat Yan},
  year={2026},
  eprint={2601.23228},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.23228},
}
```