William_S comments on Daniel Kokotajlo’s Shortform

William_S 12 Mar 2025 18:27 UTC
LW: 2 AF: 1
0
AF
Maybe there’s an MVP of having some independent organization ask new AIs about their preferences + probe those preferences for credibility (e.g. are they stable under different prompts, do AIs show general signs of having coherent preferences), and do this through existing apis