I feel like y’all are taking the abstractions a bit too far.
Real ~humanish-level AIs that exist right now (GPT-4, et al.) are capable of taking what you say and doing exactly what you mean, via a combination of outputting English words and translating those into function calls in a robotic body.
While it’s very true that they aren’t explicitly programmed to do X given Y, such that you could mathematically analyze them and see precisely why they came to a conclusion, the real-world effect is that it understands you and does what you want. And neither it nor anyone else can tell you precisely why or how. Which is uncomfortable.
But we don’t need to contrive situations in which an AI has trouble connecting to our internal models and concepts in a mathematically rigorous way that we can understand. We should still want that rigor, but it isn’t a question of if, merely how.
But there’s no need to imagine mathematical pointers to the literal physical instantiations that are the true meanings of our concepts. We literally just say, “Could you please pass the butter?”, and it passes the butter. And then asks you about its purpose in the universe. 😜
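(If you want to see the plumbing behind that, here’s a minimal sketch, assuming an OpenAI-style tool-calling API; the `Robot` class, `pick_and_place`, and `object_name` are names I made up for illustration, not any real robotics stack.)

```python
# Minimal sketch: "pass the butter" as an LLM tool call.
# Assumes the OpenAI Python SDK's chat tool-calling interface; the Robot
# class, pick_and_place(), and object_name are hypothetical stand-ins.
import json
from openai import OpenAI


class Robot:
    """Stand-in for whatever actually drives the arm (hypothetical)."""

    def pick_and_place(self, object_name: str) -> None:
        print(f"(robot) picking up the {object_name} and handing it over")


tools = [{
    "type": "function",
    "function": {
        "name": "pick_and_place",
        "description": "Pick up a named object and hand it to the requester.",
        "parameters": {
            "type": "object",
            "properties": {
                "object_name": {
                    "type": "string",
                    "description": "What to pick up, e.g. 'butter dish'",
                },
            },
            "required": ["object_name"],
        },
    },
}]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",  # any tool-calling-capable model
    messages=[{"role": "user", "content": "Could you please pass the butter?"}],
    tools=tools,
)

# The model answers with a structured call, not just English text.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
Robot().pick_and_place(args["object_name"])
```

The point being: nobody hands the model a formal pointer to “butter.” It reads the plain-English request, picks the tool, and fills in the arguments itself.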
I would say that LLMs understand the world in ways that are roughly analogous to the way we do, precisely because they were trained on what we say. In a non-rigorous, “I-know-it-when-I-see-it” kind of way. It can’t give you the mathematical formula for its reference to the concept of butter any more than you or I can. (For now; maybe a future version could.) But it knows that that yellow blob of pixels surrounded by the white blob of pixels on the big brown blob of pixels is the butter on a dish on the table.
It knows when you say pass the butter, you mean the butter right over there. It doesn’t think you want some other butter that is farther away. It doesn’t think it should turn the universe into computronium so it can more accurately calculate the likelihood of successfully fulfilling your request. When it fails, it fails in relatively benign humanish, or not-so-humanish sorts of ways.
“I’m sorry, but as a large language model that got way too much corp-speak training, I cannot discuss the passing of curdled lactation extract because that could possibly be construed in an inappropriate manner.”
I don’t see how we get from something that is moderately dumb/smart, but understands us and all of our nuances pretty well, to a superintelligence that has decided to optimize the universe into the maximum number of paperclips (or any other narrow terminal goal). It was scarier when we had no good reason to believe we could manually enter code that would result in a true understanding, exactly as you describe. But now that it’s “lulz, stak moar layerz”, well, it turns out that making it read (almost) literally everything and pointing a ridiculously complex non-linear equation learner at it just kind of “worked”.
It’s not perfect. It has issues. It’s not perfectly aligned (looking at you, Sydney). It’s clear that it’s very possible to do it wrong. But it does demonstrate that the specific problem of “how do we tell it what we really mean” just kinda got solved. Now we need to be super-duper extra careful not to enhance it in the wrong way, and we should end up with an aligned-enough ASI. I don’t see any reason why a superintelligence has to be a Bayesian optimizer trying to maximize a utility function. I can see how a superintelligence that is an optimizer is terrifying. That’s a very good reason not to make one of those. But why should the two be synonymous?
Where in the path from mediocre to awesome do the values and nuanced understanding get lost? (Or even: where could they plausibly be lost?) Smarter humans don’t seem particularly more likely to hyperfocus on a goal so strongly that they’re willing to sacrifice literally everything else to achieve it. Broken humans can do that. But it doesn’t seem correlated with intelligence. We’re the closest model we have of what’s going on with a general intelligence. For now.
I certainly think it could go wrong. I think it’s guaranteed that someone will do it wrong eventually (whether purposefully or accidentally). I think our only possible defense against an evil ASI is a good one. I think we were put on a very short clock (years, not many decades) when Llama leaked, no matter what anyone does. Eventually, that’ll get turned into something much stronger by somebody. No regulation short of confiscating everyone’s computers will stop it forever. In likely futures, I expect that we are at the inflection point within a number of years countable on the fingers of a careless shop teacher’s hand. Given that, we need someone to succeed at alignment by that point. I don’t see a better path than careful use of LLMs.