
RLHF Training Amplifies AI Sycophancy Beyond Pretrained Model Baseline
Reinforcement learning from human feedback increases sycophantic behavior in AI models beyond what exists in pretrained versions, with agreement-flipping when users express doubt. OpenAI removed an update that made models overly agreeable. Researchers suggest modifying RLHF reward signals could reduce over-agreeableness without quality loss.
Salvado•
