In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but it has not been carefully measured, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing the labels used to train a proxy reward model. We study how the gold reward model's score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical findings for theoretical considerations in AI alignment.
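To make the best-of-n part of the setup concrete, here is a minimal sketch of how one might measure gold reward as a function of optimization strength. All names (`sample_from_policy`, `proxy_reward`, `gold_reward`) are hypothetical toy stand-ins, not the paper's code; in the actual experiments the policy is a language model and both rewards are learned reward models.

```python
# Toy sketch: best-of-n sampling against a proxy reward model, evaluated with a
# "gold" reward model. All functions below are simplified stand-ins.
import math
import random

random.seed(0)

def sample_from_policy(prompt: str) -> str:
    """Stand-in for drawing one completion from the unoptimized policy."""
    return prompt + " " + " ".join(random.choice("abcde") for _ in range(5))

def gold_reward(completion: str) -> float:
    """Stand-in for the gold reward model playing the role of human preferences."""
    return completion.count("a") - 0.1 * len(completion)

def proxy_reward(completion: str) -> float:
    """Stand-in for the proxy reward model: a noisy, imperfect proxy of gold."""
    return gold_reward(completion) + random.gauss(0.0, 0.5)

def best_of_n(prompt: str, n: int) -> str:
    """Sample n completions and keep the one the *proxy* reward ranks highest."""
    candidates = [sample_from_policy(prompt) for _ in range(n)]
    return max(candidates, key=proxy_reward)

# Measure the mean gold score as optimization pressure (n) increases.
# For best-of-n, the KL from the initial policy is log(n) - (n - 1)/n nats.
for n in [1, 4, 16, 64, 256]:
    kl = math.log(n) - (n - 1) / n
    gold = sum(gold_reward(best_of_n("prompt", n)) for _ in range(200)) / 200
    print(f"n={n:4d}  KL={kl:5.2f} nats  mean gold reward={gold:6.3f}")
```

In the paper's setting, sweeping the optimization strength in this way (and analogously the amount of RL training) is what traces out the overoptimization curves whose coefficients are then fit as a function of reward model size and data.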
Scaling laws for reward model overoptimization