Tomáš Gavenčiak

Karma: 181

A researcher in CS theory, AI safety and other stuff.

How can Interpretability help Alignment?

23 May 2020 16:16 UTC

37 points

17 Mar 2020 20:23 UTC

39 points

Tomáš Gavenčiak 12 Jul 2019 10:38 UTC
5 points
in reply to: jessicata’s comment on: The AI Timelines Scam
I think that sufficiently universally trusted arbiters may be very hard to find, but Alice may also refrain from that option to keep the issue from gaining more public attention, believing that more attention, or the attention of particular groups, would be harmful. I can imagine cases where more credible people (Carols) saying they are convinced that e.g. “it is really easily doable” would disproportionately create more incentive for misuse than for defense (depending on which groups the information reaches, which reliability signals those groups accept, etc.).