A potential issue with Relaxed Adversarial Training, as factorised in the post. deploy is presumably dependent on the outcome of the training process itself (i.e. the training process has side-effects, most notably the production of a deployed ML artefact which might have considerable impact on the world!). Since the training process is downstream of the adversary, the quality of the adversary’s choice of pseudo-inputs to propose depends on that choice itself. This could raise concerns about different fixed points (or even about whether any fixed point exists?) in that system.
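To make the circularity explicit (rough notation of my own, not the post’s): write α for the adversary’s proposed pseudo-input, T(α) for the model that results from training against that proposal, and deploy(T(α)) for the deployment distribution that model induces. A ‘good’ proposal is one that is both plausible under deployment and likely to elicit unacceptable behaviour, so it has to satisfy something like the fixed-point condition

$$\alpha^{*} \in \arg\max_{\alpha}\; P\big(\alpha \mid \mathrm{deploy}(T(\alpha^{*}))\big)\cdot P\big(\mathrm{unacceptable}(T(\alpha^{*}),\,\alpha)\big),$$

where the objective being maximised already contains α* on the right-hand side.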
(My faint worry is that, by being proposed, a problematic pseudo-input will predictably have some gradient ‘training it away’, making it less plausible to arise in the deploy distribution, making it less likely to be proposed… but then it predictably has less gradient ‘training it away’, making it more plausible in the deploy distribution, making it more likely to be proposed, and so on.)
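A minimal toy sketch of that feedback loop (the dynamics, the propose rule, and the parameters train_strength and drift are all invented for illustration, not anything from the post): one problematic pseudo-input gets ‘trained away’ in proportion to how hard it is proposed, while some background pressure restores it when it isn’t being proposed. With the numbers below the loop happens to settle at a low but non-zero level; the worry is that nothing guarantees the real dynamics settle at all.

```python
# Toy model of the proposed-then-trained-away feedback loop.
# p: probability that the problematic pseudo-input arises at deployment.

def propose(p: float) -> float:
    """Adversary's proposal pressure: higher when the input looks more plausible."""
    return min(1.0, 5.0 * p)  # hypothetical proposal rule, purely illustrative

def simulate(p0: float = 0.2, rounds: int = 20,
             train_strength: float = 0.5, drift: float = 0.05) -> list[float]:
    """Iterate: propose -> gradient 'trains it away' in proportion -> drift restores it."""
    p, history = p0, []
    for _ in range(rounds):
        q = propose(p)                             # how hard it gets proposed this round
        p = max(0.0, p - train_strength * q * p)   # trained away in proportion to proposal pressure
        p = min(1.0, p + drift)                    # background pressure re-introducing the behaviour
        history.append(p)
    return history

if __name__ == "__main__":
    for i, p in enumerate(simulate(), start=1):
        print(f"round {i:2d}: p = {p:.3f}")
```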
Some ways to dissolve this:
1. In conversation, Evan mentioned that he already prefers a reframing of RAT which bypasses pseudo-inputs and instead directly inspects some property of the model (e.g. myopia).
2. I wonder about ‘detecting weird fixpoints’ by also inspecting the proposed pseudo-inputs for ‘is this a weird and concerning pseudo-input?’ (if so, the supervisor is predicting weird and concerning post-deployment worlds!).
3. If we instead use causal reasoning and the counterfactual deploy of ‘what if we did no more training and deployed now?’, the dependence dissolves (I wonder if this is actually the intended idea of the OP; a rough sketch of what this could look like follows this list). This leaves open the question of how much harder/easier counterfactual vs predictive reasoning is here.
4. If we instead consider ‘deployment’ to be ‘any moment after now’ (including the remainder of the training process), it might cash out similarly to 3? This chimes with one of my intuitions about embedded agency which I don’t know an official name for, but which I think of as ‘you only get one action’ (because any action affects the world, which affects you, so there’s now a different ‘you’).
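A very rough sketch of what option 3 could look like (hypothetical interfaces throughout: adversary.propose, overseer.plausibility_under_deploy, and overseer.unacceptability are names I’m making up, not anything from the post). The overseer scores proposals against the deployment distribution induced by a frozen snapshot of the current model, so the deploy distribution no longer depends on the remainder of training:

```python
import copy

def counterfactual_deploy_check(model, adversary, overseer, n_proposals: int = 10):
    """Score pseudo-inputs against the counterfactual 'stop training and deploy now'.

    Because the snapshot is frozen, the deploy distribution used for scoring
    does not depend on how the rest of training goes, which removes the
    circular dependence described above.
    """
    snapshot = copy.deepcopy(model)  # counterfactual: this exact model is deployed
    scored = []
    for _ in range(n_proposals):
        pseudo_input = adversary.propose(snapshot)
        # How plausible is this pseudo-input if the snapshot were deployed today?
        plausibility = overseer.plausibility_under_deploy(snapshot, pseudo_input)
        # How unacceptably would the snapshot behave on it?
        badness = overseer.unacceptability(snapshot, pseudo_input)
        scored.append((pseudo_input, plausibility * badness))
    return scored
```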
Interesting? Or basically moot? Or something in between?