Michael Soareverix

Karma: 93

Our Existing Solutions to AGI Alignment (semi-safe)

Michael SoareverixJul 21, 2022, 7:00 PM

12 points

1 comment3 min readLW link

Michael Soareverix Jul 17, 2022, 8:46 AM
14 points
2
on: All AGI safety questions welcome (especially basic ones) [monthly thread]
What stops a superintelligence from instantly wireheading itself?
A paperclip maximizer, for instance, might not need to turn the universe into paperclips if it can simply access its reward float and set it to the maximum. This is assuming that it has the intelligence and means to modify itself, and it probably still poses an existential risk because it would eliminate all humans to avoid being turned off.
The terrifying thing I imagine about this possibility is that it also answers the Fermi Paradox. A paperclip maximizer seems like it would be obvious in the universe, but an AI sitting quietly on a dead planet with its reward integer set to the max is far more quiet and terrifying.

Musings on the Human Objective Function

Michael SoareverixJul 15, 2022, 7:13 AM

3 points

0 comments3 min readLW link

Michael Soareverix Jul 13, 2022, 7:24 PM
1 point
0
in reply to: Quintin Pope’s comment on: The Easiest Solution to AI Alignment
Interesting! I appreciate the details here; it gives me a better sense of why narrow ASI is probably not something that can exist. Is there a place we could talk over audio about AGI alignment versus text here on LessWrong? I’d like to get a better idea of the field, especially as I move into work like creating an AI Alignment Sandbox.
My Discord is Soareverix#7614 and my email is maarocket@gmail.com. I’d really appreciate the chance to talk with you over audio before I begin working on sharing alignment info and coming up with my own methods for solving the problem.

Michael Soareverix Jul 13, 2022, 12:49 AM
1 point
0
in reply to: Quintin Pope’s comment on: The Easiest Solution to AI Alignment
Good points. Your point about value alignment being better to solve then just trying to orchestrate a pivotal act is true, but if we don’t have alignment solved by the time AGI rolls around, then from a pure survival perspective, it might be better to try a narrow ASI pivotal act instead of hoping that AGI turns out to be aligned already. This solution above doesn’t solve alignment in the traditional sense, it just pushes the AGI timeline back hopefully enough to solving alignment.
The idea I have specifically is that you have something like GPT-3 (unintelligent in all other domains, doesn’t expand outside of its system or optimize outside of itself) that becomes an incredibly effective Tool AI. GPT-3 isn’t really aligned in the Yudkowsky sense, but I’m sure you could get it to write a mildly persuasive piece already. (It sort of already has: https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3 ).
Scale this to superintelligent levels like AlphaGo, and I think you could orchestrate a pivotal act pretty rapidly. It doesn’t solve the alignment problem, but it pushes it back.
The problems are that the user needs to be aligned and that the type of narrow ASI has to be developed before AGI. But given the state of narrow ASI, I think it might be one of the best shots, and I do think a narrow ASI could get to this level before AGI, much the same way as AlphaGo proceeded MuZero.
What I am ultimately saying is that if we get a narrow AI that has the power to make a pivotal act, we should probably use it.

Michael Soareverix Jul 13, 2022, 12:37 AM
2 points
0
on: What is wrong with this approach to corrigibility?
I am new to the AI Alignment field, but at first glance, this seems promising! You can probably hard-code it not to have the ability to turn itself off, if that turns out to be a problem in practice. We’d want to test this in some sort of basic simulation first. The problem would definitely be self-modification and I can imagine the system convincing a human to turn it off in some strange, manipulative, and potentially dangerous way. For instance, the model could begin attacking humans, instantly causing a human to run to shut it down, so the model would leave a net negative impact despite having achieved the same reward.
What I like about this approach is that it is simple/practical to test and implement. If we have some sort of alignment sandbox (using a much more basic AI as a controller or test subject) we can give the AI a way of simply manipulating another agent to press the button, as well as ways of maximizing its alternative reward function.
Upvoted, and I’m really interested to see the other replies here.

Three Minimum Pivotal Acts Possible by Narrow AI

Michael SoareverixJul 12, 2022, 9:51 AM

0 points

4 comments2 min readLW link

Could an AI Alignment Sandbox be useful?

Michael SoareverixJul 2, 2022, 5:06 AM

2 points

1 comment1 min readLW link

Michael Soareverix Jun 17, 2022, 7:33 AM
3 points
0
in reply to: Rob Bensinger’s comment on: A central AI alignment problem: capabilities generalization, and the sharp left turn
Very cool! So this idea has been thought of, and it doesn’t seem totally unreasonable, though it definitely isn’t a perfect solution. A neat idea is a sort of ‘laziness’ score so that it doesn’t take too many high-impact options.
It would be interesting to try to build an AI alignment testing ground, where you have a little simulated civilization and try to use AI to align properly with it, given certain commands. I might try to create it in Unity to test some of these ideas out in the (less abstract than text and slightly more real) world.

Michael Soareverix Jun 15, 2022, 8:07 PM
2 points
0
on: A central AI alignment problem: capabilities generalization, and the sharp left turn
One solution I can see for AGI is to build in some low-level discriminator that prevents the agent from collecting massive reward. If the agent is expecting to get near-infinite reward in the near future by wiping out humanity using nanotech, then we can set a solution so it decides to do something that will earn it a more finite amount of reward (like obeying our commands).
This has a parallel with drugs here on Earth. Most people are a little afraid of that type of high.
This probably isn’t an effective solution, but I’d love to hear why so I can keep refining my ideas.

Michael Soareverix Jun 7, 2022, 4:23 PM
7 points
3
in reply to: Vaniver’s comment on: AGI Ruin: A List of Lethalities
Appreciate it! Checking this out now

Michael Soareverix Jun 7, 2022, 6:49 AM
2 points
−6
on: AGI Ruin: A List of Lethalities
I view AGI in an unusual way. I really don’t think it will be conscious or think in very unusual ways outside of its parameters. I think it will be much more of a tool, a problem-solving machine that can spit out a solution to any problem. To be honest, I imagine that one person or small organization will develop AGI and almost instantly ascend into (relative) godhood. They will develop an AI that can take over the internet, do so, and then calmly organize things as they see fit.
GPT-3, DALLE-E 2, Google Translate… these are all very much human-operated tools rather than self-aware agents. Honestly, I don’t see a particular advantage to building a self-aware agent. To me, AGI is just a generalizable system that can solve any problem you present it with. The wielder of the system is in charge of alignment. It’s like if you had DALL-E 2 20 years ago… what do you ask it to draw? It doesn’t have any reason to expand itself outside of its computer (maybe for more processing power? that seems like an unusual leap). You could probably draw some great deepfakes of world leaders and that wouldn’t be aligned with humanity, but the human is still in charge. The only problem would be asking it something like “an image designed to crash the human visual system” and getting an output that doesn’t align with what you actually wanted, because you are now in a coma.
So, I see AGI as more of a tool than a self-aware agent. A tool that can do anything, but not one that acts on its own.
I’m new to this site, but I’d love some feedback (especially if I’m totally wrong).
-Soareverix

Michael Soareverix

Our Ex­ist­ing Solu­tions to AGI Align­ment (semi-safe)

Mus­ings on the Hu­man Ob­jec­tive Function

Three Min­i­mum Pivotal Acts Pos­si­ble by Nar­row AI

Could an AI Align­ment Sand­box be use­ful?

Our Existing Solutions to AGI Alignment (semi-safe)

Musings on the Human Objective Function

Three Minimum Pivotal Acts Possible by Narrow AI

Could an AI Alignment Sandbox be useful?