Freeform Preference Learning
for Robotic Manipulation

Marcel Torne*, Anubha Mahajan*, Abhijnya Bhat*, Chelsea Finn

Stanford University

*equal contribution

July 1st, 2026

🎬 motivation video
media/fpl_main_video.mp4
🎬
steer/cube_orange.mp4
Put cube in target bowl
🎬
fold_pants_multi.mp4
Fold shorts
🎬
plate_toast_fpl_2x.mp4
Plate toast
🎬
setuptable_multi.mp4
Set up the table

All videos shown at 2× speed

We want to autonomously improve the performance of our robots. To do so, we commonly use reinforcement learning, which gives the robot rewards according to its performance. Ideally this reward is learned, so we can remove the need for a human to constantly supervise the robot. It should also be dense, unambiguous, and capture all aspects of desirable behavior. For example, in the task of setting a table, the reward should take into account the configuration of the cutlery, the care taken not to break fragile plates, the comfort of nearby people (e.g. avoiding motions that point a knife toward someone), and the speed of execution, among other aspects. Accurately learning a reward that captures all of these components is a major challenge.

The simplest option is to learn from binary success labels. In principle this makes it easy for people to judge whether all the criteria are met. But the resulting signal is very sparse and places a heavy burden on the reinforcement learning algorithm, making it hard to scale to harder tasks or to capture real-world constraints beyond basic task completion. Going back to the table: with binary success labels, the robot only gets a positive signal once every plate and piece of cutlery is placed correctly, and a negative signal otherwise. Since setting a table involves many steps, the robot sees very few positive examples and the reward becomes far too sparse to learn from.

Learning from human preferences is one way to densify the reward. Here a human supervisor provides a preference label between two trajectories: setting both the plate and the cutlery correctly is better than setting only the plate, which gives the robot a denser signal. But this asks annotators to collapse multiple axes of judgment into a single “overall” preference label. On long-horizon tasks, that makes preferences hard to provide and the resulting supervision ambiguous.

🎬 trajectory A
media/pref/trajectory_a.mp4
Trajectory A
🎬 trajectory B
media/pref/trajectory_b.mp4
Trajectory B
Overall, which trajectory do you prefer?

As you saw, it is hard to decide which one is better. Trajectory A is faster than B and places the plate to the left of the large plate, but it knocks the cup over, while B does not. It is therefore ambiguous which single label to provide.

To preserve these axes of judgement, our key insight is to let annotators provide preference labels on any task-relevant axes of their choosing. This multidimensional supervision can then be used to learn a policy optimized for the combination of these dimensions. We instantiate this idea by asking annotators to specify relevant judgement dimensions in natural language and to provide a binary preference for each axis. The axes can be defined up front or during the annotation process. Try it yourself below:

🎬 trajectory A
media/pref/trajectory_a.mp4
Trajectory A
🎬 trajectory B
media/pref/trajectory_b.mp4
Trajectory B

Hopefully that was much easier and less ambiguous: with freeform preferences the annotator can specify the multiple axes they care about and label each one independently.
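To make the annotation format concrete, here is a minimal sketch of how one freeform comparison could be stored. The `FreeformPreference` class, its field names, and the axis strings are hypothetical illustrations, not the schema used by FPL; the two labels encode the comparison described above (A is faster, B keeps the cup upright).

```python
from dataclasses import dataclass, field


@dataclass
class FreeformPreference:
    """One annotated comparison between two trajectories (hypothetical schema)."""

    trajectory_a: str
    trajectory_b: str
    # Maps a natural-language axis to the preferred trajectory, "A" or "B".
    axis_labels: dict[str, str] = field(default_factory=dict)

    def label(self, axis: str, preferred: str) -> None:
        """Record a binary preference for one axis; axes can be added at any time."""
        if preferred not in ("A", "B"):
            raise ValueError("preferred must be 'A' or 'B'")
        self.axis_labels[axis] = preferred


pref = FreeformPreference("media/pref/trajectory_a.mp4", "media/pref/trajectory_b.mp4")
pref.label("speed of execution", "A")     # A finishes faster
pref.label("keeps the cup upright", "B")  # A knocks the cup over
```

Because each axis gets its own binary label, the annotator never has to trade speed against damage in a single answer.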

Given these preferences, FPL learns a multi-axis reward function that produces a scalar reward score when conditioned on a natural-language description of the axis, capturing a variety of task-relevant attributes including quality of result, speed, smoothness, damage, and hygiene.

FPL multi-axis reward model conditioned on a natural-language axis
🖼 reward model
media/reward_figure.svg
We learn a multi-dimensional reward function to score the trajectory according to the natural-language preference axis.
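One common way to fit a reward model to binary preferences is the Bradley-Terry likelihood. The sketch below applies it once per axis and averages the result. This is an assumption for illustration: the page does not state the training objective FPL uses, and `reward_fn(trajectory, axis)` is a hypothetical stand-in for the axis-conditioned reward model.

```python
import math


def bradley_terry_loss(reward_a: float, reward_b: float, prefers_a: bool) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    margin = reward_a - reward_b if prefers_a else reward_b - reward_a
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def multi_axis_loss(reward_fn, traj_a, traj_b, axis_labels):
    """Average the per-axis losses; axis_labels maps axis text to 'A' or 'B'."""
    losses = [
        bradley_terry_loss(reward_fn(traj_a, axis), reward_fn(traj_b, axis), preferred == "A")
        for axis, preferred in axis_labels.items()
    ]
    return sum(losses) / len(losses)


# Toy scores standing in for a learned, axis-conditioned reward model.
toy_scores = {
    ("A", "speed"): 1.0, ("B", "speed"): 0.0,
    ("A", "cup stays upright"): -1.0, ("B", "cup stays upright"): 1.0,
}
loss = multi_axis_loss(
    lambda traj, axis: toy_scores[(traj, axis)],
    "A", "B",
    {"speed": "A", "cup stays upright": "B"},
)
```

The toy scores agree with both labels, so the loss falls below `log 2`, the value obtained when the model cannot tell the trajectories apart.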

We then train a promptable policy to optimize the combination of axes described during reward training. Notably, this framework simultaneously improves both the ease of providing unambiguous supervision and the density of supervision for downstream policy optimization.

Full FPL system: reward model and reward-conditioned VLA policy
🖼 full system
media/method_figure.svg
FPL learns a multi-dimensional reward function that scores the complete trajectory τ conditioned on the natural-language preference axis. To leverage the multi-dimensionality of this reward function, FPL then learns to reproduce behavior conditioned on the per-axis rewards, expressed in text form.
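A rough sketch of how reward-conditioned training examples could be assembled: every trajectory in the buffer is scored on each axis, and the scores are written into the text prompt that the policy is trained on. The function name, the `reward_fn` signature, the task string, and the one-decimal formatting are all assumptions made for illustration.

```python
def relabel_buffer(replay_buffer, reward_fn, task, axes):
    """Pair every trajectory with a text prompt that states its per-axis rewards.

    No trajectory is filtered out, so the policy sees both high- and
    low-scoring behavior along each axis.
    """
    examples = []
    for trajectory in replay_buffer:
        scored = [f"{axis}: {reward_fn(trajectory, axis):.1f}" for axis in axes]
        prompt = ", ".join([task] + scored)
        examples.append((prompt, trajectory))
    return examples


# Toy scores standing in for the learned multi-axis reward model.
toy_rewards = {
    ("rollout_0", "right peg"): 1.0, ("rollout_0", "speed"): -0.5,
    ("rollout_1", "right peg"): -1.0, ("rollout_1", "speed"): 2.0,
}
examples = relabel_buffer(
    ["rollout_0", "rollout_1"],
    lambda traj, axis: toy_rewards[(traj, axis)],
    "place the nut on the peg",
    ["right peg", "speed"],
)
```

In this toy buffer, `rollout_0` scores high on one axis and low on the other, and both rollouts are kept as training examples with their actual scores in the prompt.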

We evaluate FPL on four real-world tasks (placing a block in a target bowl, folding shorts, plating toast, and setting a table) as well as two simulated tasks. Across settings, policies trained with FPL outperform those trained with sparse rewards and with binary preference learning, improving task progress by 38 percentage points over the second-best baseline.

Average real-world task progress (%)

Behavior Cloning: 31
Filtered BC: 37
Single Preferences w/ matching pairs: 33
Single Preferences w/ matching comparisons: 34
FPL (Ours): 75

FPL improves task progress by 38 percentage points over the second-best baseline.
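The 38-point margin can be checked directly from the reported averages. The snippet below copies the numbers from the chart; the variable names are illustrative.

```python
# Average real-world task progress as reported in the chart above.
progress = {
    "Behavior Cloning": 31,
    "Filtered BC": 37,
    "Single Preferences w/ matching pairs": 33,
    "Single Preferences w/ matching comparisons": 34,
    "FPL (Ours)": 75,
}

# Rank methods from best to worst and compare the top two.
ranked = sorted(progress.items(), key=lambda item: item[1], reverse=True)
(best, best_score), (runner_up, runner_up_score) = ranked[:2]
margin = best_score - runner_up_score

print(f"{best} leads {runner_up} by {margin} percentage points")
# prints "FPL (Ours) leads Filtered BC by 38 percentage points"
```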

FPL exhibits compositionality of behaviors

Because FPL learns a policy conditioned on multiple axes, it can optimize all axes at once even when that combination of axis values never appears in the training data, thereby exhibiting compositionality of behaviors. Our simulation setup, shown below, illustrates this: the original demonstration set contains both fast and slow trajectories for the left peg, but only slow trajectories for the right peg. Nevertheless, policies learned with FPL can produce fast trajectories on the right peg despite never having seen such data. We attribute this compositionality to multi-dimensional reward learning together with reward-conditioned policy extraction.

target peg
speed
🎬
slow_left.mp4
left peg · slow
🎬
slow_right.mp4
right peg · slow
🎬
fast_left.mp4
left peg · fast
🎬
fast_right.mp4
right peg · fast (composed)
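To spell out what "composed" means in this grid, the sketch below enumerates every peg and speed combination and lists the ones missing from the demonstrations. The set-of-tuples encoding is an illustrative assumption; only the three demonstrated combinations come from the setup described above.

```python
from itertools import product

# (target peg, speed) combinations present in the original demonstrations.
seen = {("left", "slow"), ("left", "fast"), ("right", "slow")}

# Each axis value appears somewhere in the data on its own...
pegs = sorted({peg for peg, _ in seen})
speeds = sorted({speed for _, speed in seen})

# ...but not every combination of values does.
unseen = [combo for combo in product(pegs, speeds) if combo not in seen]
print(unseen)  # prints [('right', 'fast')]
```

The only missing cell is the right peg at fast speed, which is the behavior the policy has to compose from axis values it saw separately.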

FPL exhibits test-time steerability

We evaluate steerability using the inverted version of bimodal square, where the same trained policy is conditioned to place the nut on the left peg instead of the right peg. FPL is the only method that achieves high performance on both the original and the inverted tasks with a single policy. This is enabled by reward-conditioned policy extraction on multi-dimensional rewards: because FPL trains on trajectories across the replay buffer without filtering, the policy observes both high- and low-scoring behaviors along each axis and can be steered at test time by changing the target reward conditioning.

The same idea steers behavior in the real world. In the cube task, the same policy places the cube in whichever bowl you ask for, simply by raising that bowl's reward in the prompt.

prompt: put the cube in the bowl, Blue bowl: -0.5, Orange bowl: 2.0, Yellow bowl: -0.3
A single policy trained with FPL can be prompted at test time to change which reward it maximizes, in this case the bowl in which the cube should be placed.
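A minimal sketch of test-time steering under the prompt format shown above: start from a set of per-axis target rewards, raise the one for the desired bowl, and render the prompt. The helper names and the baseline value for the orange bowl are hypothetical; only the final prompt string is taken from the example.

```python
def steer(axis_rewards, target_axis, target_value):
    """Return a copy of the target rewards with one axis overridden."""
    if target_axis not in axis_rewards:
        raise KeyError(f"unknown axis: {target_axis}")
    steered = dict(axis_rewards)  # copy so the baseline is left untouched
    steered[target_axis] = target_value
    return steered


def build_prompt(task, axis_rewards):
    """Render the task and per-axis target rewards as one text prompt."""
    return ", ".join([task] + [f"{axis}: {value}" for axis, value in axis_rewards.items()])


# Hypothetical baseline in which no bowl is favored.
base = {"Blue bowl": -0.5, "Orange bowl": -0.4, "Yellow bowl": -0.3}
prompt = build_prompt("put the cube in the bowl", steer(base, "Orange bowl", 2.0))
print(prompt)
# prints "put the cube in the bowl, Blue bowl: -0.5, Orange bowl: 2.0, Yellow bowl: -0.3"
```

Steering toward a different bowl only changes the arguments to `steer`; the policy itself is not retrained.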

Dense reward signal learned by FPL

FPL produces denser reward signals on long-horizon tasks without explicit task segmentation. Below we qualitatively compare reward models learned with FPL and with binary preference feedback on an example rollout from the table-setting task. Although neither model is trained with explicit subtask boundaries, the reward learned with FPL temporally localizes events such as placing the big plate, small plate, cup, and cutlery. This makes the learned reward more interpretable and suggests that freeform, axis-specific preferences can provide a denser signal for long-horizon tasks. In contrast, the binary-preference reward produces a large spike near the end of the episode, even though no major subtask is completed at that point.

🎬 video_20
media/video_20.mp4

Freeform Preference Learning
[Plot: reward per step vs. time (s), 0 to 26 s; reward axis from -1.8 to +1.3; legend: "Quality of placement of:" (one curve per axis)]

Binary Preference Learning
[Plot: reward per step vs. time (s), 0 to 26 s; reward axis from -5.2 to +7.2; legend: "Overall Quality"]

FPL learns a densified reward model whose per-axis signals rise and fall around each subtask (here we show a reduced number of axes), even though no segmentation of the steps was ever provided. The binary “overall” reward instead collapses everything into one ambiguous score that spikes late in the episode, when no subtask has been completed.

Citation

@misc{torne2026freeform,
      title={Freeform Preference Learning for Robotic Manipulation},
      author={Marcel Torne and Anubha Mahajan and Abhijnya Bhat and Chelsea Finn},
      year={2026},
      eprint={2606.32027},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.32027},
}