Freeform Preference Learning
for Robotic Manipulation
July 1st, 2026
Videos: media/fpl_main_video.mp4, steer/cube_orange.mp4, fold_pants_multi.mp4, plate_toast_fpl_2x.mp4, setuptable_multi.mp4
All videos shown at 2× speed.
We want to autonomously improve the performance of our robots. To do so, we commonly use reinforcement learning, which gives the robot rewards according to its performance. Ideally this reward is learned, so we can remove the need for a human to constantly supervise the robot. It should also be dense, unambiguous, and capture all aspects of desirable behavior. For example, in the task of setting a table, the reward should take into account the configuration of the cutlery, the care taken not to break fragile plates, the comfort of nearby people (e.g. avoiding motions that point a knife toward someone), and the speed of execution, among other aspects. Accurately learning a reward that captures all of these components is a major challenge.
The simplest option is to learn from binary success labels. In principle this makes it easy for people to judge whether all the criteria are met. But the resulting signal is very sparse and places a heavy burden on the reinforcement learning algorithm, making it hard to scale to harder tasks or to capture real-world constraints beyond basic task completion. Going back to the table: with binary success labels, the robot only gets a positive signal once every plate and piece of cutlery is placed correctly, and a negative signal otherwise. Since setting a table involves many steps, the robot sees very few positive examples and the reward becomes far too sparse to learn from.
Learning from human preferences is one way to densify the reward. Here a human supervisor provides a preference label between two trajectories: setting both the plate and the cutlery correctly is better than setting only the plate, which gives the robot a denser signal. But this asks annotators to collapse multiple axes of judgment into a single "overall" preference label. On long-horizon tasks, that makes preferences hard to provide and the resulting supervision ambiguous.
As you saw, it is hard to decide which trajectory is better. Trajectory A is faster than B and places the plate to the left of the large plate, but it makes the cup fall, while B does not. With only one overall label available, it is unclear which trajectory should be marked as preferred.
To preserve these axes of judgment, our key insight is to let annotators provide preference labels on any task-relevant axes of their choosing. This multidimensional supervision can then be used to learn a policy optimized for the combination of these dimensions. We instantiate this idea by asking annotators to specify relevant judgment dimensions in natural language and to provide a binary preference for each axis. The axes can be defined up front or during the annotation process. Try it yourself below:
Hopefully that was much easier and less ambiguous: with freeform preferences the annotator can specify the multiple axes they care about and label each one independently.
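To make the labeling format concrete, here is a minimal sketch of how one freeform annotation could be stored. The `FreeformPreference` class, its field names, and the axis wording are hypothetical and chosen for illustration; they are not the data schema used by FPL. The example encodes the comparison described above, where trajectory A is faster but makes the cup fall.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FreeformPreference:
    """Hypothetical record for one A-versus-B comparison (not the authors' schema)."""
    trajectory_a: str
    trajectory_b: str
    # Natural-language axis -> True if A is preferred on that axis, False if B is.
    a_preferred: Dict[str, bool] = field(default_factory=dict)

    def label(self, axis: str, a_is_better: bool) -> None:
        """Record a binary preference for one annotator-chosen axis."""
        self.a_preferred[axis] = a_is_better

    def axes_disagree(self) -> bool:
        """True when some axes favor A and others favor B."""
        return len(set(self.a_preferred.values())) > 1


annotation = FreeformPreference("A", "B")
annotation.label("speed of execution", True)       # A is faster than B
annotation.label("keeps the cup upright", False)   # A makes the cup fall
```

When `axes_disagree()` is true, a single overall label would have forced the annotator to discard one of these judgments; storing them per axis keeps both.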
Given these preferences, FPL learns a multi-axis reward function that produces a scalar reward score when conditioned on a natural-language description of the axis, capturing a variety of task-relevant attributes including quality of result, speed, smoothness, damage, and hygiene.
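One common way to train a reward model from pairwise preferences is the Bradley-Terry loss, applied here once per axis. The sketch below assumes that formulation; the text above does not state which loss FPL uses, so treat it as an illustration rather than the authors' objective. `toy_reward` is a hypothetical stand-in for a learned, language-conditioned model: it simply selects one hand-written feature per axis.

```python
import math
from typing import Callable, Sequence

Step = Sequence[float]                    # per-step features (placeholder for observations)
RewardFn = Callable[[Step, str], float]   # (step, axis description) -> scalar reward


def trajectory_return(reward_fn: RewardFn, traj: Sequence[Step], axis: str) -> float:
    """Sum the axis-conditioned reward over all steps of a trajectory."""
    return sum(reward_fn(step, axis) for step in traj)


def bradley_terry_loss(reward_fn: RewardFn, traj_a: Sequence[Step],
                       traj_b: Sequence[Step], axis: str, a_preferred: bool) -> float:
    """Negative log-likelihood of the preferred trajectory on a single axis."""
    diff = (trajectory_return(reward_fn, traj_a, axis)
            - trajectory_return(reward_fn, traj_b, axis))
    if not a_preferred:
        diff = -diff
    # -log(sigmoid(diff)), written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def toy_reward(step: Step, axis: str) -> float:
    """Hypothetical reward: picks one feature per axis instead of encoding text."""
    return step[{"speed": 0, "no damage": 1}[axis]]


traj_a = [[1.0, 0.0], [1.0, -1.0]]   # fast, but causes damage on the second step
traj_b = [[0.5, 0.0], [0.5, 0.0]]    # slower, no damage
loss_speed = bradley_terry_loss(toy_reward, traj_a, traj_b, "speed", a_preferred=True)
loss_damage = bradley_terry_loss(toy_reward, traj_a, traj_b, "no damage", a_preferred=False)
```

Because each axis has its own label, the same trajectory pair contributes two consistent training signals instead of one ambiguous one.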
Figure: media/reward_figure.svg
We then train a promptable policy to optimize the combination of axes described during reward training. Notably, this framework simultaneously improves both the ease of providing unambiguous supervision and the density of supervision for downstream policy optimization.
Figure: media/method_figure.svg
We evaluate FPL on four real-world tasks (placing a block in a target bowl, folding shorts, plating toast, and setting a table) as well as two simulated tasks. Across settings, policies trained with FPL outperform those trained with sparse rewards and binary preference learning methods, improving on the second-best baseline by 38 percentage points.
Average real-world task progress
FPL exhibits compositionality of behaviors
Since FPL learns a policy conditioned on multiple axes, it can optimize all axes at the same time even when that combination of behaviors is not present in the training data, thereby exhibiting compositionality of behaviors. Below is a representation of our simulation setup, where the original demonstration set contains both fast and slow trajectories for the left peg but only slow trajectories for the right peg. Nevertheless, policies learned with FPL can also produce fast trajectories on the right peg despite never having seen such data. We attribute this compositionality to multi-dimensional reward learning together with reward-conditioned policy extraction.
FPL exhibits test-time steerability
We evaluate steerability using the inverted version of bimodal square, where the same trained policy is conditioned to place the nut on the left peg instead of the right peg. FPL is the only method that achieves high performance on both the original and the inverted tasks with a single policy. This is enabled by reward-conditioned policy extraction on multi-dimensional rewards: because FPL trains on trajectories across the replay buffer without filtering, the policy observes both high- and low-scoring behaviors along each axis and can be steered at test time by changing the target reward conditioning.
The same idea steers behavior in the real world. In the cube task, the same policy places the cube in whichever bowl you ask for, simply by raising that bowl's reward in the prompt.
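The steering mechanism can be illustrated with a deliberately simplified sketch. Instead of a trained reward-conditioned policy, the code below uses a nearest-neighbor lookup over an unfiltered buffer: it returns the stored behavior whose per-axis reward scores are closest to the requested targets. The buffer contents, axis names, and `conditioned_action` function are hypothetical, and the lookup is a stand-in for a learned policy, not the FPL implementation.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]   # axis description -> reward score


def conditioned_action(buffer: List[Tuple[str, Scores]], target: Scores) -> str:
    """Return the stored behavior whose scores best match the target conditioning."""
    def distance(scores: Scores) -> float:
        # Squared distance over the axes named in the target; missing axes count as 0.
        return sum((scores.get(axis, 0.0) - value) ** 2
                   for axis, value in target.items())
    return min(buffer, key=lambda item: distance(item[1]))[0]


# Unfiltered buffer: each behavior scores high on one axis and low on the other.
buffer = [
    ("place cube in left bowl", {"cube in left bowl": 1.0, "cube in right bowl": 0.0}),
    ("place cube in right bowl", {"cube in left bowl": 0.0, "cube in right bowl": 1.0}),
]

steer_left = conditioned_action(buffer, {"cube in left bowl": 1.0, "cube in right bowl": 0.0})
steer_right = conditioned_action(buffer, {"cube in left bowl": 0.0, "cube in right bowl": 1.0})
```

The point of the sketch is that nothing is retrained between the two calls: only the target scores change, which mirrors raising one bowl's reward in the prompt.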
Dense reward signal learned by FPL
FPL produces denser reward signals on long-horizon tasks without explicit task segmentation. Below we qualitatively compare reward models learned with FPL and with binary preference feedback on an example rollout from the table-setting task. Although neither model is trained with explicit subtask boundaries, the reward learned with FPL temporally localizes the corresponding events, such as placing the big plate, small plate, cup, and cutlery. This makes the learned reward more interpretable and suggests that freeform, axis-specific preferences can provide a denser signal for long-horizon tasks. In contrast, the binary-preference reward produces a large reward spike near the end of the episode, even though no major subtask is completed at that point.
Citation
@misc{torne2026freeform,
title={Freeform Preference Learning for Robotic Manipulation},
author={Marcel Torne and Anubha Mahajan and Abhijnya Bhat and Chelsea Finn},
year={2026},
eprint={2606.32027},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2606.32027},
}