PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

(vmax.ai)

29 points | by AMavorParker 2 hours ago

3 comments

tpoacher 1 hour ago
Minor criticism. I haven't had a chance to read it properly yet, but for a method that purports to be an evolutionary algorithm, it's missing all the formal language of the field. There's zero mention of a fitness function (let alone internal/external co-evolution ones), or a selection operator.
So my first impression is that either this is a non-evolutionary algorithm masquerading as one and diluting concepts like mutation and crossover that have well-defined meanings, or it is one but you're using terminology from other fields (like RL and "rewards") instead. Either way it's a confusing first impression, and one gets the subtle vibe that the word choices are there more to create a "buzz" than to create clarity.
(not trying to be dismissive, I genuinely hope this is useful feedback)
Paper does look interesting, I'll try to read properly when I have time.
NitpickLawyer 40 minutes ago
Unless I'm wrong about the premise, the downstream evals seem to show that 1T-1S is better than 4T-4S or 8T-8S on a bunch of tasks. Doesn't that invalidate the whole population mix thing? (Also, the part about LoRAs being "evolved" by changing stuff in a few seconds was a bit confusing to me; perhaps I misunderstood something.)
- AMavorParker 7 minutes ago
  Thanks for your interest!
  Not necessarily. While the held-out downstream evals showed that 1T-1S setups outperformed larger populations like 4T-4S or 8T-8S on some specific benchmarks, that does not invalidate the motivation for population-based training.
  The main motivation for larger populations is more diversity in both problems and solutions, which can encourage specialization and broader task coverage. Even if that diversity does not improve on some of the particular benchmarks we used, it is still arguably a desirable property.
  Figure 9 in the paper, for example, shows that students trained with larger populations are exposed to a much wider range of tasks than the baseline.
  Also, averaged across all the benchmarks we measure, 4v4 performs best.
  The “creating new population members in seconds” comment refers to operating in LoRA space. The mutation and crossover operators are applied to lightweight LoRA adapters rather than full model weights, making the process very fast and memory efficient.
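To make "operating in LoRA space" concrete, here is a minimal sketch of what a mutation and a same-rank crossover on adapters could look like. It is not the paper's operator family: the `(A, B)` representation, the Gaussian mutation, the rank-slice crossover, and every function name below are assumptions chosen for illustration. The point is only that both operators touch the small low-rank factors and never the frozen base weights.

```python
import random

# Assumed representation: an adapter is a pair (A, B) of nested lists,
# A with shape (rank, d_in) and B with shape (d_out, rank).
# The weight update applied on top of the frozen base is B @ A.

def matmul(X, Y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def random_adapter(rank, d_in, d_out, rng):
    """Build a toy adapter with standard-normal entries."""
    A = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(d_out)]
    return A, B

def mutate(adapter, sigma, rng):
    """Hypothetical mutation: add N(0, sigma^2) noise to every entry of A and B."""
    def perturb(M):
        return [[v + rng.gauss(0.0, sigma) for v in row] for row in M]
    A, B = adapter
    return perturb(A), perturb(B)

def crossover(parent1, parent2, k):
    """Hypothetical crossover: rank components 0..k-1 from parent1, the rest from parent2.

    The child keeps the parents' rank, and its update equals
    B1[:, :k] @ A1[:k] + B2[:, k:] @ A2[k:].
    """
    (A1, B1), (A2, B2) = parent1, parent2
    A = A1[:k] + A2[k:]
    B = [row1[:k] + row2[k:] for row1, row2 in zip(B1, B2)]
    return A, B

rng = random.Random(0)
p1 = random_adapter(rank=4, d_in=3, d_out=5, rng=rng)
p2 = random_adapter(rank=4, d_in=3, d_out=5, rng=rng)
child = mutate(crossover(p1, p2, k=2), sigma=0.01, rng=rng)
delta_w = matmul(child[1], child[0])  # (d_out, d_in) update for the frozen base
```

At rank 4 each toy adapter here has only 4 * (3 + 5) = 32 parameters, which is why operators of this kind are cheap compared with anything that edits full model weights.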
AMavorParker 2 hours ago
We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
- nazgul17 37 minutes ago
  Is my understanding right that the system is limited by the capability of teachers to solve their own problems? The TrueSkill of teachers doesn't seem to increase all that much, but I suppose TrueSkill works like that if the whole population gets better.
  - AMavorParker 4 minutes ago
    The teachers never attempt to solve their own problems; only the students solve problems.
    Regarding the teachers' TrueSkill: the self-play settings we use in this paper are zero-sum and competitive, which means the skills of the two populations cannot both increase together. The objective of one population is adversarial to the other: teachers try to generate difficult tasks, while students learn to solve them and so make them easy.
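The zero-sum point can be illustrated with a simpler rating system. The sketch below uses Elo as a stand-in, not TrueSkill: TrueSkill is Bayesian and does not conserve ratings exactly, whereas an Elo update with a shared K-factor moves exactly as many points to one side as it takes from the other. The function names, the K value, and the convention that a student "wins" by solving the problem are all assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(teacher, student, student_solved, k=32.0):
    """One teacher-vs-student match; the student wins if it solves the problem.

    Whatever the student gains, the teacher loses, so the total rating is unchanged.
    """
    student_score = 1.0 if student_solved else 0.0
    delta = k * (student_score - expected_score(student, teacher))
    return teacher - delta, student + delta

teacher, student = 1500.0, 1500.0
for solved in [True, True, False, True]:
    teacher, student = elo_update(teacher, student, solved)

# The two ratings always sum to the starting total of 3000.
total = teacher + student
```

Under a rating like this, a flat teacher curve is compatible with both sides improving in absolute terms, since the rating only tracks relative standing.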