Thanks for your comment! I didn’t get around to answering earlier, but maybe it’s still useful to try to clarify a few things.
> If I understand the setup correctly, there’s no guarantee that the optimal element would be good, right?
My threat model here is that we have access to an Oracle that’s not trustworthy (as specified in the first paragraph), so even if we were able to specify our preferences correctly, we would still have a problem. So in this context you can assume that we managed to specify our preferences correctly. If our problem were simply that we misspecified our preferences (this would roughly correspond to “fail at outer alignment”, vs my threat model of “fail at inner alignment”), solving it by soft-maximization would be much easier: just “flatten” the top part of your utility function (i.e. give all outputs that satisfice the same utility), add some noise, and then hand it to an optimizer.
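For concreteness, here is a minimal sketch of that “flatten and add noise” construction (the threshold, noise scale, and function names are made up for illustration; it assumes the optimizer simply maximizes whatever scalar function we hand it):

```python
import random

# Sketch only: THRESHOLD, NOISE_SCALE, `utility` and `candidate_outputs`
# are hypothetical placeholders, not anything from the post.
THRESHOLD = 0.9      # utility above which an output counts as satisficing
NOISE_SCALE = 0.05   # small noise so no satisficing output is systematically preferred

def soft_utility(output, utility):
    """Flattened utility: every satisficing output gets the same value, plus noise."""
    u = min(utility(output), THRESHOLD)                  # flatten the top part
    return u + random.uniform(-NOISE_SCALE, NOISE_SCALE)

def soft_optimize(candidate_outputs, utility):
    # Handing the flattened, noisy utility to an ordinary argmax-optimizer makes it
    # pick an essentially arbitrary element of the satisficing set.
    return max(candidate_outputs, key=lambda o: soft_utility(o, utility))
```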
So I guess the point I tried to make with my post could be stated as “Soft-optimization can also be used to help with inner alignment, not just outer alignment” (and I freely admit that I should have said so in the post).
> It’s just likely since the optimal element a priori shouldn’t be unusually bad, and you’re assuming most satisficing elements are fine.
I’m not just assuming that; I gave a bit of an argument for why I believe it to be the case: I assume that an output is not randomly dangerous, but only dangerous if it was specifically chosen to achieve a goal that differs from ours. This only holds if the goal is not too ambitious; e.g. if we ask the AI controlling the paperclip factory for 10^30 paperclips, that will not go well.
As I understand it, gwern has a stronger opinion on that and believes that side-effects from less ambitious plans are still an important concern. But at least my proposal gets rid of the potentially adversarial selection by the untrusted optimizer, so I believe that getting safe outputs (e.g. by combining our goal with impact regularization) should be much easier.
> I’m still unsure about how useful this problem setup is. For example, we’d probably want to train the weakest system that can give us satisficing outputs (rather than having an infinitely intelligent oracle).
My assumption about the black-box untrusted optimizer is a worst-case assumption, so I’m trying to understand whether we could get something useful out of the system even in that case. If we can make stronger assumptions about what we know about the system, the problem of course becomes much easier. Given the general state of progress in interpretability compared with progress in capabilities, I think it’s worth thinking about this worst-case scenario a bit.
I like your suggestion to “train the weakest system that can give us satisficing outputs”, but I think doing that might actually be quite difficult: How do we measure (or otherwise predict) how weak/strong our system is and how easy/difficult the satisficing task is? Can these things even be measured on a 1-d scale? Slowly increasing the capability until the system solves the task runs into the problem mentioned in footnote 10 if the system deceptively pretends to be weaker than it is.
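To make the difficulty concrete, here is a crude sketch of that “slowly increase capability” loop (the capability scale, train_system and satisfices are hypothetical placeholders; in particular this assumes capability really can be ordered on a single axis):

```python
def weakest_satisficing_system(capability_levels, train_system, satisfices):
    # Sketch only: train iteratively stronger systems until one satisfices.
    for level in sorted(capability_levels):    # weakest first
        system = train_system(level)
        output = system.produce_output()       # hypothetical interface
        if satisfices(output):
            return system                      # first system that (apparently) suffices
    return None

# The footnote-10 problem: a deceptive system can deliberately return
# non-satisficing outputs, so the loop keeps escalating capability and we end
# up deploying a far stronger system than the task actually required.
```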
“Almost certainly”? That’s a bit too confident for my taste.