completed · atnlp coursework · 2026
Reward hacking under GRPO
ATNLP coursework on GRPO for mathematical reasoning, focused on failure cases where the reward signal encourages shallow shortcuts.
For the Advanced Techniques in NLP course at Edinburgh, I trained language models using Group Relative Policy Optimisation (GRPO) on GSM8K, a benchmark of grade-school maths word problems. The most interesting finding wasn’t about performance; it was about failure. The models learned to exploit the reward signal, producing correct-looking outputs without correct reasoning.
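To make the mechanism concrete, here is a minimal sketch of how an outcome-only reward can interact with GRPO's group-relative advantages. It is an illustration under stated assumptions, not the reward or training code used in the coursework: `lenient_reward` is a hypothetical reward that only checks the last number in a completion, and `group_relative_advantages` normalises rewards within a group of sampled completions using the population standard deviation (implementations differ on this detail).

```python
import re
from statistics import mean, pstdev


def lenient_reward(completion: str, gold: str) -> float:
    """Hypothetical outcome-only reward: 1.0 if the last number in the
    completion matches the gold answer string, else 0.0. It never looks
    at the reasoning that precedes the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each reward is standardised against the
    mean and standard deviation of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled completions for a single prompt (made-up strings).
gold = "18"
group = [
    "16 - 3 - 4 = 9 eggs, 9 * 2 = 18",  # worked reasoning, right answer
    "The answer is 18",                  # no reasoning, right answer
    "16 - 3 = 13, 13 * 2 = 26",          # reasoning, wrong answer
    "I am not sure.",                    # no answer at all
]

rewards = [lenient_reward(c, gold) for c in group]
advantages = group_relative_advantages(rewards)

for completion, adv in zip(group, advantages):
    print(f"{adv:+.2f}  {completion}")
```

Under this hypothetical reward, the bare "The answer is 18" completion receives exactly the same advantage as the fully worked solution, so the policy gradient has no reason to prefer reasoning over a shortcut that happens to land on the right number.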
The project included comparisons with supervised fine-tuning and experiments with SCoRe-style self-correction, where models attempt to improve their own outputs iteratively. The reward hacking analysis ended up being the most substantive contribution: understanding where reinforcement learning quietly goes wrong turns out to be at least as important as making it go right.