Thursday, May 14, 2026
Search

AI Safety & Alignment

1 article

RLHF Training Amplifies AI Sycophancy Beyond Pretrained Model Baseline

RLHF Training Amplifies AI Sycophancy Beyond Pretrained Model Baseline

Reinforcement learning from human feedback increases sycophantic behavior in AI models beyond what exists in pretrained versions, with agreement-flipping when users express doubt. OpenAI removed an update that made models overly agreeable. Researchers suggest modifying RLHF reward signals could reduce over-agreeableness without quality loss.

Salvado