Gianluca Calcagni comments on Introducing SARA: a new activation steering technique

Gianluca Calcagni Jun 25, 2024, 4:54 PM
1 point
0
I am glad I read your post, very relevant for safety! It seems to me that the Steering Directions are a variant of the Control Vectors that I described here https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-vectors-as-dispositional-traits
Please can you confirm the two concepts are following essentially the same approach?

I agree with you over the benefits of the technique, it is very promising. If there was a proper analysis of its scalability properties and if there was some way to estimate mathematically its likelihood of success, it would be a dramatic breakthrough
What links here?
- Control Vectors as Dispositional Traits by Gianluca Calcagni (Jun 23, 2024, 9:34 PM; 9 points)
- Alejandro Tlaie Jul 1, 2024, 12:34 PM
  2 points
  0
  Parent
  Hi Gianluca, it’s great that you liked the post and the idea! I think that your approach and mine share things in common and that we have similar views on how activation steering might be useful!
  
  I would definitely like to chat to see whether potential synergies come up :)
  - Gianluca Calcagni Jul 10, 2024, 1:01 PM
    1 point
    0
    Parent
    Very happy to support you :)
    It took some time to understand your paper, please find below a few comments:
    (1) You are using SVD to find the control vectors (similarly to other authors) but your process is more sophisticated in the following ways: the generation of the matrices, how to reduce them, and how to choose the magnitude of each steering vector. You are also using the non-steered response as an active part of your calculations—something that is marginally done by other authors. The final result works, but the process looks arbitrary to me (tbh all the steering techniques are a bit arbitrary at the moment). What’s the added value of your operations? Maybe you have some intuition about why your calculation is finding the “correct” amount of steering, and I am curious to know more.
    (2) Ethics plays a fundamental role in finding a collective solution to AI safety, but I tend to think that we should solve alignment first. It would be interesting to see your future research going in that direction. I can help brainstorming some topics that have not been exhaustively studied yet. Let me know!

Keyboard shortcuts

Keys shown in yellow (e.g., ]) are accesskeys, and require a browser-specific modifier key (or keys).

Keys shown in grey (e.g., ?) do not require any modifier keys.

General
? Show keyboard shortcuts
Esc Hide keyboard shortcuts

Site navigation
h Go to Home (a.k.a. “Frontpage”) view
f Go to Featured (a.k.a. “Curated”) view
a Go to All (a.k.a. “Community”) view
m Go to Meta view
v Go to Tags view
c Go to Recent Comments view
r Go to Archive view
q Go to Sequences view
t Go to About page
u Go to User or Login page
o Go to Inbox page

Page navigation
, Jump up to top of page
. Jump down to bottom of page
/ Jump to top of comments section
s Search

Page actions
n New post or comment
e Edit current post

Post/comment list views
. Focus next entry in list
, Focus previous entry in list
; Cycle between links in focused entry
Enter Go to currently focused entry
Esc Unfocus currently focused entry
] Go to next page
[ Go to previous page
\ Go to first page
e Edit currently focused post

Editor
k Bold text
i Italic text
l Insert hyperlink
q Blockquote text

Appearance
= Increase text size
- Decrease text size
0 Reset to default text size
′ Cycle through content width settings
1 Switch to default theme [A]
2 Switch to dark theme [B]
3 Switch to grey theme [C]
4 Switch to ultramodern theme [D]
5 Switch to simple theme [E]
6 Switch to brutalist theme [F]
7 Switch to ReadTheSequences theme [G]
8 Switch to classic Less Wrong theme [H]
9 Switch to modern Less Wrong theme [I]
; Open theme tweaker
Enter Save changes and close theme tweaker
Esc Close theme tweaker (without saving)

Slide shows
l Start/resume slideshow
Esc Exit slideshow
→↓ Next slide
←↑ Previous slide
Space Reset slide zoom

Miscellaneous
x Switch to next view on user page
z Switch to previous view on user page
` Toggle compact comment list view
g Toggle anti-kibitzer