In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but it has not been carefully measured, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing the labels used to train a proxy reward model. We study how the gold reward model's score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the optimization method, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical findings for theoretical considerations in AI alignment.
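To make the best-of-n part of the setup concrete, here is a minimal sketch of how one might measure gold reward as a function of optimization strength. All names (`sample_from_policy`, `proxy_reward`, `gold_reward`) are hypothetical toy stand-ins, not the paper's code; in the actual experiments the policy is a language model and both rewards are learned reward models.

```python
# Toy sketch: best-of-n sampling against a proxy reward model, evaluated with a
# "gold" reward model. All functions below are simplified stand-ins.
import math
import random

random.seed(0)

def sample_from_policy(prompt: str) -> str:
    """Stand-in for drawing one completion from the unoptimized policy."""
    return prompt + " " + " ".join(random.choice("abcde") for _ in range(5))

def gold_reward(completion: str) -> float:
    """Stand-in for the gold reward model playing the role of human preferences."""
    return completion.count("a") - 0.1 * len(completion)

def proxy_reward(completion: str) -> float:
    """Stand-in for the proxy reward model: a noisy, imperfect proxy of gold."""
    return gold_reward(completion) + random.gauss(0.0, 0.5)

def best_of_n(prompt: str, n: int) -> str:
    """Sample n completions and keep the one the *proxy* reward ranks highest."""
    candidates = [sample_from_policy(prompt) for _ in range(n)]
    return max(candidates, key=proxy_reward)

# Measure the mean gold score as optimization pressure (n) increases.
# For best-of-n, the KL from the initial policy is log(n) - (n - 1)/n nats.
for n in [1, 4, 16, 64, 256]:
    kl = math.log(n) - (n - 1) / n
    gold = sum(gold_reward(best_of_n("prompt", n)) for _ in range(200)) / 200
    print(f"n={n:4d}  KL={kl:5.2f} nats  mean gold reward={gold:6.3f}")
```

In the paper's setting, sweeping the optimization strength in this way (and analogously the amount of RL training) is what traces out the overoptimization curves whose coefficients are then fit as a function of reward model size and data.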
Scaling laws for reward model overoptimization