Research & Engineering

Our Problems

June 8, 2026

We believe that training data is solved through creative engineering and research ideas. Here are some of the concrete technical problems we are working on right now:

Index All Code on the Internet

Useful coding data is spread across many public sources - GitHub alone has hundreds of millions of public repositories and more than 100M pull requests merged every year. Beyond this, GitLab, Bitbucket, public developer tool documentation, Stack Overflow, and other sources are extremely useful as seed data for data pipelines.

Ideally, we would like to be able to run queries like this over this data:

“return all 300+ LOC refactor PRs introduced in the past 3 months, in codebases with 20+ L6–L9 engineers as core contributors, written in Rust”

To do so, we need to crawl, process, and index all of the code on the internet. Doing so is already a hard systems problem, as sources have different APIs, rate limits, data formats, update patterns, and access rules. Once we have stored this data, we still need to enrich and classify it to make sense of it in a way that is cost-efficient at our scale.

Measuring the Usefulness of Training Data

When talking about data quality internally, we refer to two things:

First, we refer to data being bug-free. In the case of RL, this means that tasks are not reward-hackable, prompts are fair and not ambiguous, verifiers cover everything we ask for in the instruction, and environment difficulty is calibrated for whatever model we want to train with it. While these things are not trivial to get right, they are table-stakes for training models.

Secondly, we think about what and how much of a learning signal an individual task provides: Even among tasks that pass the criteria mentioned above, there are a decent amount of tasks on which models do not improve during RL. We are working on methods that let us assign a quality score to a task that predicts how much the individual task will contribute to improving a model within a training dataset.

So far, we have found that using agentic judges to check the qualitative differences between rollouts leading to positive rewards and rollouts leading to negative rewards is useful: GRPO maximizes the likelihood of positive rollouts while minimizing the likelihood of negative rollouts, so it is intuitive that tasks in which positive rewards are caused by fundamentally different reasoning strategies rather than small tweaks provide a strong signal. We are making progress towards more interpretable signals that have stronger predictive power, and are looking to tackle the problem of predicting relative sample importance within a dataset. One paper we found interesting while studying this problem is The Unlearnability Phenomenon in RLVR for Language Models.

Multi-Node Snapshotting

When agents attempt long-horizon tasks such as those in FrontierSWE, they often run for multiple hours or in extreme cases up to days on our infrastructure. We need to monitor their runs for reward hacks, infra issues, and other signs of early failure. For example, if an agent performs a reward hack, there is no point in continuing that trajectory. Ideally, we’d like to revert to a known good state, patch the hack, and continue again from there.

To do this, we need a system that can periodically snapshot and restore disk, process memory, and GPU state. Snapshotting must be extremely efficient, with minimal interruptions or ideally none at all. We want to be able to do this not only for individual containers, but also for distributed systems spanning many nodes. This will be very powerful once done: we can pause, resume, inspect, and branch environment runs instead of treating every rollout as one-shot execution.

RL Training Infrastructure

The only way we can dogfood our product and understand how to create useful training data is to train models ourselves. For this purpose, we build RL infrastructure that lets us work with trillion-parameter open source LLMs.

Our current trainer defines universal abstractions for rollouts and lets us plug in different training backends - for this, we work with both serverless training solutions such as Tinker as well as an internal fork of slime for full-parameter fine-tuning on our own GPUs. While we are seeing good results, there is obviously still a lot to do. One thing we are starting to work on right now is infrastructure for midtraining / continued pretraining, to better understand how synthetic data affects downstream RL performance

Rollout Infrastructure

For every task we build, we run many agent rollouts to measure task difficulty and perform QA. These rollouts need to support a diverse range of execution environments, from simple individual containers to Kubernetes clusters for SRE tasks and multi-GPU environments for AI research tasks.

This creates a hard infrastructure problem: we need to launch, monitor and tear down heterogeneous environments at high scale while keeping rollouts isolated, reproducible, and cheap enough to run continuously. As models get better and tasks need to become more difficult, we expect that execution environments will become significantly higher in complexity: right now, we are thinking about how we can build tasks that teach agents to use cloud services from hyperscalers like AWS; as these services are too complex to mock, it likely requires us to quickly spin up and tear down hundreds of fully isolated AWS execution environments that do not interfere with each others' rollouts.

Measuring Code Quality

Coding agents are trained in environments that largely rely on deterministic tests. These tests can measure whether AI-generated code is functionally correct, but they cannot measure whether the code is good. As a result, coding agents tend to produce sloppy code when human engineers are not in the loop to do targeted prompting.

We are investigating how we can automatically grade code quality so that it can be used as a reward signal during RL. Beyond ways of generating better rubrics, we are specifically interested in agent-maintainability as a proxy for quality: Inspired by SlopCodeBench, we have started to measure how many tokens agents use when extending features written by other agents. So far, we have observed strong correlation between the effort it takes for agents to extend code and our perception of code quality.

Training on Long Horizon Tasks

Training on ultra-long horizon tasks such as the ones in FrontierSWE poses a big challenge: The duration of an RL training step is bound by the time it takes to run rollouts - as a result, a single training step on tasks that take agents 10 hours to solve is 60 times slower than a training step on tasks that take agents 10 minutes to solve. As models become more capable, and the tasks they have trained on become longer to run in nature, RL training runs will become dramatically slower.

We are thinking about ways to increase training efficiency for long horizon tasks. Particularly, we are excited about training algorithms such as SDPO that provide richer feedback than GRPO with scalar rewards. This problem can be tackled both on the algorithmic side as well from a data perspective: when purely using SDPO, what a model learns depends strongly on how the environment feedback is provided. We believe that a promising research direction is figuring out how to provide high-quality environment feedback for very complex tasks.

Synthetic Codebases

To create difficult coding tasks, we need large codebases as seed data. Github is a finite resource and contains data that models have already seen during pretraining, and purchasing private codebases is a tedious process. For this reason, we are interested in creating entirely synthetic, yet realistic codebases spanning millions of lines of code.

We have already seen reports of massive, somewhat functional and entirely AI-generated codebases such as Cursor's browser or Claude's C Compiler. What is still unknown is whether training on tasks created from AI-generated codebases leads to the same improvements we get from training on real-world codebases. We are both investigating whether this is the case, and what we can do to make synthetic codebases as realistic as possible.

Get involved

If any of these problems sound interesting to you, we’d love to talk. Reach out to hiring@proximal.ai.

← Back to Blog