Palm Springs – Nyles and others loop after entering a portal
Groundhog Day – Phil relives the same day until he becomes a better person
Happy Death Day – Tree relives the same day until she survives
Edge of Tomorrow – Cage repeats the same day until an alien invasion succeeds or fails
Rita Vrataski: What do we do now?
William Cage: I don't know. We've never gotten this far.
Algorithm 1
Goal: repeatedly supervised fine-tune a model on a prior model's correct outputs
Initial policy π1
Repeat for N loops, starting from policy πi:
  Batch of rollouts: (Prompt1, Completion, Reward) ... (PromptB, Completion, Reward)
  Dataset Di = rollouts with reward > 0
  Supervised Fine Tuning (SFT) on Di
  Updated policy πi+1
Final policy πN
aviary paper: EI enables an 8B model to surpass frontier models
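The loop in Algorithm 1 could be sketched roughly as below. This is a toy illustration, not the implementation from the aviary paper: the policy is a plain dictionary of candidate completions, and `toy_sft`, `collect_rollouts`, `filter_positive`, and `expert_iteration` are hypothetical names introduced only for this example. A real setup would replace `sample` with model generation and `toy_sft` with an actual fine-tuning step.

```python
import random
from typing import Callable, Dict, List, Tuple

Rollout = Tuple[str, str, float]   # (prompt, completion, reward)
Policy = Dict[str, List[str]]      # toy stand-in: candidate completions per prompt


def collect_rollouts(policy: Policy, prompts: List[str],
                     sample: Callable[[Policy, str], str],
                     reward_fn: Callable[[str, str], float]) -> List[Rollout]:
    """Generate one completion per prompt and score it."""
    rollouts = []
    for prompt in prompts:
        completion = sample(policy, prompt)
        rollouts.append((prompt, completion, reward_fn(prompt, completion)))
    return rollouts


def filter_positive(rollouts: List[Rollout]) -> List[Rollout]:
    """Build D_i: keep only rollouts with reward > 0."""
    return [r for r in rollouts if r[2] > 0]


def toy_sft(policy: Policy, dataset: List[Rollout]) -> Policy:
    """Toy stand-in for SFT (assumption): keep only rewarded completions
    for every prompt that appears in the dataset. Returns a new policy."""
    updated = {p: list(c) for p, c in policy.items()}
    for prompt in {p for p, _, _ in dataset}:
        updated[prompt] = [c for p, c, _ in dataset if p == prompt]
    return updated


def expert_iteration(policy: Policy, prompts: List[str],
                     sample: Callable[[Policy, str], str],
                     reward_fn: Callable[[str, str], float],
                     sft: Callable[[Policy, List[Rollout]], Policy],
                     n_loops: int) -> Policy:
    """Repeat: roll out, filter to reward > 0, fine-tune on the survivors."""
    for _ in range(n_loops):
        rollouts = collect_rollouts(policy, prompts, sample, reward_fn)
        dataset = filter_positive(rollouts)
        policy = sft(policy, dataset)
    return policy


if __name__ == "__main__":
    answers = {"2+2": "4", "3+3": "6"}
    rng = random.Random(0)
    final = expert_iteration(
        policy={"2+2": ["4", "5"], "3+3": ["6", "7"]},
        prompts=list(answers),
        sample=lambda pol, p: rng.choice(pol[p]),
        reward_fn=lambda p, c: 1.0 if answers[p] == c else 0.0,
        sft=toy_sft,
        n_loops=10,
    )
    print(final)
```

Prompts whose rollouts never earn a positive reward contribute nothing to the dataset, so the toy policy is left unchanged for them.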
Algorithm 2
Goal: use RL to progressively improve the starting model
Initial policy π1, initial dataset D0
Repeat for N loops, starting at i=1 from policy πi:
  SFT on Di-1
  Reinforcement Learning w/Verifiable Rewards (RLVR), batch by batch:
    Batch of rollouts: (Prompt1, Completion, Reward) ... (PromptB, Completion, Reward)
    Batch of rollouts: (PromptB+1, Completion, Reward) ... (Prompt2B, Completion, Reward)
  Dataset Di = rollouts with reward > 0
  Updated policy πi+1
Final policy πN
ether0 paper: used N=2 with multitask learning
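One way to read the control flow of Algorithm 2 is sketched below. The function `alternate_sft_rlvr` and the signatures of the `sft` and `rlvr` callables are assumptions made for illustration. The poster does not specify them, and the order shown here (SFT on the previous dataset, then RLVR, then filtering the RLVR rollouts into the next dataset) is an interpretation of the diagram rather than a confirmed description of the ether0 training recipe.

```python
from typing import Any, Callable, List, Tuple

Rollout = Tuple[str, str, float]  # (prompt, completion, reward)


def alternate_sft_rlvr(
    policy: Any,
    d0: List[Rollout],
    sft: Callable[[Any, List[Rollout]], Any],
    rlvr: Callable[[Any], Tuple[Any, List[Rollout]]],
    n_loops: int,
) -> Tuple[Any, List[Rollout]]:
    """Alternate SFT and RLVR for n_loops rounds.

    sft(policy, dataset) returns a fine-tuned policy (hypothetical signature).
    rlvr(policy) returns (updated policy, all rollouts seen during RL)
    (hypothetical signature).
    """
    dataset = d0  # D_0
    for _ in range(n_loops):
        policy = sft(policy, dataset)                 # SFT on D_{i-1}
        policy, rollouts = rlvr(policy)               # RLVR over batches of rollouts
        dataset = [r for r in rollouts if r[2] > 0]   # D_i = data with reward > 0
    return policy, dataset
```

Passing the two training stages in as callables keeps the sketch independent of any particular training library.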
Algorithm 3
Goal: RL with learnable problems
Buffer problem difficulty by reusing GRPO groups
(Prior) RLVRi-1: Build Buffer
  RLVRi-1 rollout: for Promptj, policy πi samples Groupj = Completionj,1, Completionj,2, ..., Completionj,G
  Mixed advantage: non-trivial (learnable) prompts
  All-0 advantage: trivial (too easy or hard) prompts
(Current) RLVRi: Spend Buffer
  RLVRi rollout: Promptj is drawn from two sources with probabilities 1-ε and ε
  Non-trivial (learnable) prompts from the buffer are reused
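The buffer idea in Algorithm 3 can be illustrated with a short sketch. It assumes the common GRPO-style advantage, in which each reward is normalised by its group's mean and standard deviation, so a group with identical rewards gets all-zero advantages and carries no learning signal. The names `group_advantages`, `is_learnable`, `build_buffer`, and `next_prompt` are hypothetical, and so is the `p_buffer` parameter: the poster labels the two prompt sources 1-ε and ε but does not say which probability belongs to the buffer.

```python
import random
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards: Sequence[float]) -> bool:
    """A prompt is non-trivial when its group has mixed rewards,
    i.e. the advantages are not all zero."""
    return max(rewards) != min(rewards)


def build_buffer(group_rewards: Dict[str, Sequence[float]]) -> List[str]:
    """Prior run: keep prompts whose groups had mixed rewards."""
    return [p for p, r in group_rewards.items() if is_learnable(r)]


def next_prompt(buffer: List[str], fresh_prompts: Sequence[str],
                p_buffer: float, rng: random.Random) -> str:
    """Current run: with probability p_buffer spend a buffered prompt,
    otherwise draw a fresh one. Popping is an assumption about what
    'spending' the buffer means."""
    if buffer and rng.random() < p_buffer:
        return buffer.pop()
    return rng.choice(fresh_prompts)
```

Because the rewards of each group are already computed during the prior run, classifying prompts this way needs no extra rollouts.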
Contact FutureHouse
hello@futurehouse.org
All Algorithms
Shoutout to Siddharth Narayanan and Andrew White for their feedback and support