PROXIMAL
Research & Engineering

Our Problems

Here are some concrete technical problems we’re working on to help push the frontier of coding data:

Data

Beyond SWE-Bench

A large portion of RL training data for coding today is still built on SWE-Bench and SWE-Bench Pro. Both have real limits. Solutions typically range from tens of lines to around 200 lines of code. Tasks are scoped to a single PR. The prompts are very long lists of specs, not something a real user would write. Most of these tasks are also already in every frontier model’s pretraining data.

We are working on something internally called Ultra to replace this portion of the training data with something much larger and much harder.

Very long horizon RL tasks

Most coding RL tasks today take agents anywhere from minutes to a few hours to solve. We are working on longer-horizon environments that require agents to plan and work for multiple hours to days.

Creating environments across these horizons is very complex. Verification becomes a big bottleneck because the number of possible solutions compounds over the horizon. Partial rewards also become more important, but they are harder to design. Bad partial rewards can also reward the wrong behavior or create new reward hacks. Long-horizon work also needs new infrastructure. We need systems that detect and patch reward hacks and fairness issues early while the rollouts are still running.

Complex coding environments

Most coding environments today are scoped to a single-turn “make this change or debug this issue in my isolated codebase” task. This only covers a small distribution of software engineering work. We need other types of environments that cover the rest of the distribution, including:

  • Conversational tasks. We use tools like Cursor and Claude Code conversationally. Most of those conversations are hard to verify deterministically using code. We need to come up with creative ways to build conversations that are both verifiable and hard for the models.
  • Infrastructure tasks. A lot of software work touches cloud infrastructure. We want to mock AWS, GCP, and Azure locally and build environments around those services.
  • Frontend-related tasks. We need to validate and grade long-horizon frontend changes. Building a highly reliable grader here is still an open research problem.

Infrastructure

Index all code on the internet

Useful coding data is spread across many public sources. GitHub alone has hundreds of millions of public repositories and more than 100M pull requests merged every year. GitLab, Bitbucket, code in blog posts and documentation, Stack Overflow answers, public traces, and snippets add more data. This data is very useful, but it requires a lot of processing before it can support different stages of model training.

We’d like to crawl, process, and index all of this data in a way that lets us run queries like this:

“all 300+ LOC refactor PRs introduced in the past 3 months, in codebases with 20+ L6–L9 engineers as core contributors, written in Rust”

Every stage of this is hard. Crawling at this scale is already a hard systems problem: sources have different APIs, rate limits, data formats, update patterns, and access rules. After that, we still need to classify every commit, PR, repo, and contributor. We need to know what kind of change is being introduced, who wrote it, how good the codebase is, and what stack it uses. The hard part is doing this accurately at high scale without making it too expensive.
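As a rough illustration, the example query above could be expressed as a filter over already-classified PR records. This is only a sketch under stated assumptions: the `PullRequest` record, its field names, the `matches_query` helper, and the 90-day approximation of "3 months" are all hypothetical, and the sample rows are made up. It shows nothing about how the real index is built or stored.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record produced by the crawl + classification stages."""
    repo: str
    language: str                  # primary language of the codebase
    kind: str                      # classifier label, e.g. "refactor" or "feature"
    lines_changed: int
    merged_on: date
    senior_core_contributors: int  # core contributors classified as L6-L9


def matches_query(pr, today, min_loc=300, max_age_days=90,
                  min_senior=20, language="Rust"):
    """True if the PR satisfies every clause of the example query."""
    age = today - pr.merged_on
    return (
        pr.kind == "refactor"
        and pr.lines_changed >= min_loc
        and timedelta(0) <= age <= timedelta(days=max_age_days)
        and pr.senior_core_contributors >= min_senior
        and pr.language == language
    )


# Made-up sample rows; only the first satisfies every clause.
today = date(2025, 6, 1)
prs = [
    PullRequest("acme/engine", "Rust", "refactor", 450, date(2025, 5, 1), 25),
    PullRequest("acme/cli", "Rust", "refactor", 120, date(2025, 5, 10), 25),
    PullRequest("acme/web", "TypeScript", "refactor", 900, date(2025, 5, 20), 40),
    PullRequest("acme/old", "Rust", "refactor", 600, date(2024, 11, 1), 30),
]
hits = [pr.repo for pr in prs if matches_query(pr, today)]
print(hits)
```

The filter itself is trivial. Every field it reads (`kind`, `senior_core_contributors`, and so on) is the output of a classification step, and that is where the accuracy and cost problems live.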

Taskrunner

Taskrunner is our framework for specifying, packaging, and deploying RL environments. It exposes useful agents, QA tools, and workflows to us, and tracks how environments are created, edited, tested, and reviewed, whether the work is done by humans or agents.

As models improve and start doing work for days and months at a time, we need to continuously reinvent the way we build, test, and observe environments. The current version of Taskrunner looks very different from the initial version we built. This will continue to be true, and it requires a lot of experimentation and careful engineering to keep the framework evolving with model capabilities.

Making a good interface to interact with the models also becomes its own research problem. The UX has to evolve with model capabilities. As models get better, Taskrunner needs to expose the right abstractions without making the workflow feel complicated. For example, what interface lets someone find and debug a specific reward hack across hundreds of week-long agent traces?

Multi-node snapshotting

We have agents that run for hours or days at Proximal. We need to monitor their runs for reward hacks, infra issues, and other signs of early failure. For example, if an agent performs a reward hack, there is no point in continuing that trajectory. Ideally, we’d like to revert to a known good state, patch the hack, and continue again from there. This gives us signal on the rest of the environment: other reward hacks it might run into, fairness issues, pass rates, and more.

To do this, we need a system that can periodically snapshot and restore disk, process memory, and GPU state. Snapshotting must be extremely efficient, with minimal interruptions or ideally none at all. We want to be able to do this not only for individual containers, but also for distributed systems spanning many nodes. This will be very powerful once done: we can pause, resume, inspect, and branch environment runs instead of treating every rollout as one-shot execution.
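The control flow of snapshot, detect, and revert can be sketched with a toy in-memory environment. Everything here is an assumption made for illustration: `ToyEnv`, `SnapshotStore`, `run_with_rollback`, and the `is_hack` predicate are hypothetical names, and `copy.deepcopy` of a Python dict stands in for what would really be disk, process memory, and GPU state captured across many nodes.

```python
import copy


class ToyEnv:
    """Stand-in for an environment; its whole state is one dict."""

    def __init__(self):
        self.state = {"step": 0, "files": {}}

    def step(self, action):
        self.state["step"] += 1
        self.state["files"][f"f{self.state['step']}"] = action


class SnapshotStore:
    """Keeps deep copies of environment state, keyed by step number."""

    def __init__(self):
        self._snaps = {}

    def save(self, env):
        step = env.state["step"]
        self._snaps[step] = copy.deepcopy(env.state)
        return step

    def restore(self, env, step):
        # Copy again so the stored snapshot can be restored more than once.
        env.state = copy.deepcopy(self._snaps[step])


def run_with_rollback(actions, is_hack, snapshot_every=2):
    """Run actions; on a detected hack, revert to the last good snapshot and stop."""
    env, store = ToyEnv(), SnapshotStore()
    last_good = store.save(env)
    for action in actions:
        env.step(action)
        if is_hack(action):
            store.restore(env, last_good)
            return env, last_good
        if env.state["step"] % snapshot_every == 0:
            last_good = store.save(env)
    return env, last_good


env, restored_step = run_with_rollback(
    ["edit", "test", "edit", "delete_tests", "edit"],
    is_hack=lambda action: action == "delete_tests",
)
print(restored_step, env.state)
```

In this toy run the hack at step 4 rolls the environment back to the snapshot taken at step 2. A real system would then patch the environment and resume or branch from that point rather than stop.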

Research

FrontierSWE v2

FrontierSWE is our benchmark for measuring coding ability on ultra-long horizon tasks. We work with experts across industries to propose problems that would be hard even for the best engineers in the world. The first release includes 17 tasks across feature implementation, performance optimization, and research. This is just the start.

We want to add new categories and tasks. Because FrontierSWE includes many different kinds of tasks, we are also finding that every task requires its own infrastructure stack, topology, partial grading logic, and more. Very low-level performance optimization needs to be run on bare-metal servers to remove noise. Distributed training tasks require us to scale isolated training infrastructure up and down for each rollout.

Data quality

Data quality becomes more important as task horizons increase. Today, a training run might take four or five days. If each environment rollout takes days, the full experimentation loop gets much longer. We will be able to run fewer experiments, so each dataset needs to be higher quality before we train on it. This becomes even more important as we use more synthetic data. Today, humans still help shape the distribution of tasks. As that changes, we need stronger ways to measure and control data quality directly.

There are many levels to this problem. Some signals are environment-level, like task difficulty, reward design, and whether the task is hackable. Other signals are dataset-level, like diversity, coverage, and whether the dataset creates useful emergent behavior after training.

  • What are the different components of quality for coding environments?
  • Are there proxies for how much training signal an environment will give before we actually train on it?
  • If we train on high-quality environments, do models improve faster than if we train on lower-quality environments? Can 10% of a dataset provide the same value as the remaining 90%?
  • What emergent behaviors can appear when a model is trained on a dataset? How do we test for both useful and harmful emergent behaviors?
  • How do we measure environment quality as task horizons increase to days or weeks?
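One simple candidate for such a proxy, offered here as an assumption rather than as our method, is the variance of pass/fail outcomes across a handful of rollouts. With binary rewards, an environment that every rollout passes or every rollout fails gives a group-relative policy-gradient method nothing to learn from, while pass rates near 50% give the most contrast. The function name and sample data below are hypothetical.

```python
def outcome_variance(outcomes):
    """Variance p * (1 - p) of binary pass/fail outcomes for one environment."""
    if not outcomes:
        raise ValueError("need at least one rollout outcome")
    p = sum(outcomes) / len(outcomes)
    return p * (1 - p)


# Made-up rollout results for three hypothetical environments.
rollouts = {
    "always_solved": [True, True, True, True],
    "never_solved": [False, False, False, False],
    "sometimes_solved": [True, False, True, False],
}
ranked = sorted(rollouts, key=lambda name: outcome_variance(rollouts[name]),
                reverse=True)
print(ranked[0])
```

This only captures difficulty relative to the current model. It says nothing about reward hackability or diversity, which are among the other components of quality listed above.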

Code quality and fuzzy verifiers

Deterministic tests can measure whether AI-generated code is functionally correct, but they cannot measure whether the code is good. We are interested in whether code quality can be measured reliably at all. Do expert humans agree on what makes a patch high quality? Do agents agree with those judgments? Can we decompose quality into dimensions like maintainability, readability, architectural fit, and future extensibility?
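A standard first check on whether two reviewers agree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python version for two raters; the labels and ratings are made up, and nothing here implies this is the statistic we have settled on.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater labeled independently at their own rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments of four patches by a human expert and an agent.
human = ["good", "good", "bad", "bad"]
agent = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(human, agent)
print(kappa)
```

Here the raters agree on three of four patches (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. The same computation can be run per dimension, such as maintainability or readability, to see which aspects of quality raters agree on at all.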

If these judgments are consistent enough, we can train reward models that score code quality directly. This would let us reward models for writing code that engineers would actually want to maintain, not just code that passes tests. The holy grail is a fuzzy verifier that can assign useful rewards for these softer properties across many coding tasks.

Synthetic codebases

As model capabilities improve, we need more complex codebases to train them on. There are currently two options: public codebases that are contaminated, and licensed private codebases that mostly come from startups, so they are typically smaller and have less variance.

Ideally, we want a variety of large non-contaminated codebases for the models. One way to get them is to create them. There is some early research in this direction, but the existing prototypes are not usable for training yet. We need higher-quality codebases with more human grounding. We are exploring various ideas for creating synthetic codebases that are suitable for model training.

Get involved

If any of these problems sound interesting to you, we’d love to talk. Reach out to hiring@proximal.ai.