Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
The paper introduces DIR, an information-theoretic method for removing complex inductive biases from reward models, improving how well they capture human values in reinforcement learning from human feedback (RLHF).
Reward models score candidate outputs so that machine behavior can be aligned with human values, but they often inherit biases from low-quality training data. These biases lead to overfitting, where the reward model fits its training preferences but generalizes poorly to new data, and to reward hacking, where the policy being optimized exploits flaws in the reward model to earn high scores without genuinely satisfying the intended task. The researchers propose DIR, which uses principles from information theory to minimize these biases: the reward model is trained to retain the information that predicts human preferences while suppressing the influence of biased attributes. This improves the model's ability to generalize and to provide reliable reward signals in a wider range of situations.
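To make the trade-off concrete, here is a minimal, hypothetical sketch of debiased reward-model training in PyTorch. It is not the paper's DIR objective; it combines a standard Bradley-Terry preference loss with a generic gradient-reversal adversary that discourages the reward representation from encoding one assumed bias attribute (response length). All names (`DebiasedRewardModel`, `GradReverse`, the pooled-feature inputs) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DebiasedRewardModel(nn.Module):
    def __init__(self, dim=768, hidden=256, lam=0.1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.reward_head = nn.Linear(hidden, 1)  # scalar reward
        self.bias_head = nn.Linear(hidden, 1)    # predicts the bias attribute
        self.lam = lam

    def forward(self, feats):
        z = self.encoder(feats)
        reward = self.reward_head(z).squeeze(-1)
        # The bias head sees a gradient-reversed copy of z, so the encoder is
        # pushed to make the bias attribute hard to predict from z.
        bias_pred = self.bias_head(GradReverse.apply(z, self.lam)).squeeze(-1)
        return reward, bias_pred

def training_step(model, chosen, rejected, bias_chosen, bias_rejected):
    """chosen/rejected: pooled response features; bias_*: e.g. normalized length."""
    r_c, b_c = model(chosen)
    r_r, b_r = model(rejected)
    # Bradley-Terry preference loss: the chosen response should outscore the rejected one.
    pref_loss = -F.logsigmoid(r_c - r_r).mean()
    # Adversarial term: the bias head tries to predict the attribute, the encoder resists.
    bias_loss = F.mse_loss(b_c, bias_chosen) + F.mse_loss(b_r, bias_rejected)
    return pref_loss + bias_loss

# Toy usage with random features standing in for pooled LM embeddings.
model = DebiasedRewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
bias_c, bias_r = torch.rand(8), torch.rand(8)
opt.zero_grad()
loss = training_step(model, chosen, rejected, bias_c, bias_r)
loss.backward()
opt.step()
```

The gradient-reversal penalty is only one rough proxy for limiting the information a representation carries about a biased attribute; the paper's information-theoretic formulation may define and optimize that quantity differently.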