I work on deceptive alignment and reward hacking at Anthropic.