I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips”, or even a utility function to that effect. My mental image of goals is that they are put “on top” of the model/mind/agent/person: whatever it seems to pursue, independently of its explicit reasoning.
I’m sure that I don’t understand you. GPT most likely doesn’t have “I want to predict the next token” written anywhere, because it doesn’t want to predict the next token. There’s nothing in there that will actively try to predict the next token no matter what. It’s just the thing it does when it runs.
Is it possible to have a system that, when it runs, just “actively tries to make paperclips no matter what”, but doesn’t reflect that goal in its reasoning and planning? I have a feeling that it requires God-level sophistication and knowledge of the universe to create a device that acts like that: one that just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can’t precisely formulate why I feel that way. Maybe I’ll be able to express it in a few weeks (or I’ll find that the feeling is misguided).
No, I said that GPT does predict the next token, while probably not containing anything that can be interpreted as “I want to predict the next token”. Just as a bacterium does divide (with possible adaptive mutations) without having “be fruitful and multiply” written somewhere inside it.
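To make that concrete, here is a toy sketch of what I mean (a tiny bigram character sampler in Python; it is entirely my own illustration and nothing like GPT’s internals). It predicts next characters simply because that is what happens when the code runs; nothing in its data structures can be read as a representation of wanting to do so:

```python
import random
from collections import Counter, defaultdict

# Toy illustration (not GPT): a bigram "model" that emits the next character
# by a lookup-and-sample step. Nowhere below is there anything resembling a
# goal like "I want to predict the next token" -- prediction is just what
# happens when the code runs.

corpus = "to be or not to be that is the question"

# Count which character follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_char(prev: str) -> str:
    """Sample the next character given the previous one."""
    counts = following[prev]
    chars, weights = zip(*counts.items())
    return random.choices(chars, weights=weights)[0]

# "Run" the model: it produces tokens, but contains no representation of
# wanting to produce them.
text = "t"
for _ in range(20):
    text += next_char(text[-1])
print(text)
```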
No, I certainly didn’t mean that. If the extended Church–Turing thesis holds for the macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with at most a polynomial overhead in efficiency).
What I feel, but can’t precisely convey, is that there’s a huge gulf (maybe in computational complexity) between agentic systems (those that do have an explicit internal representation of at least some of their goals) and “zombie-agentic” systems (those that act like agents with goals, but have no explicit internal representation of those goals).
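A toy contrast of what I have in mind (my own framing, in Python, over a deliberately tiny domain): two systems that emit identical actions, one by consulting an explicitly stored goal, the other by a baked-in policy table that encodes no goal anywhere.

```python
# System A: "agentic" -- it holds an explicit goal and plans against it.
GOAL_STATE = 3  # explicit internal representation of the goal

def agentic_step(state: int) -> str:
    # Pick the action that moves toward the explicitly stored goal.
    return "right" if state < GOAL_STATE else "stay"

# System B: "zombie-agentic" -- a fixed policy table with the same behavior,
# but nothing inside it that can be read off as "the goal is state 3".
ZOMBIE_POLICY = {0: "right", 1: "right", 2: "right", 3: "stay"}

def zombie_step(state: int) -> str:
    return ZOMBIE_POLICY[state]

# Both "pursue" state 3 equally well on this tiny domain...
assert all(agentic_step(s) == zombie_step(s) for s in range(4))
# ...but only System A re-targets cheaply (change GOAL_STATE), while System B
# needs its whole table rewritten. My hunch is that making a zombie policy
# robust across a huge, open-ended world is where the (computational) gulf
# would show up.
```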
How do you define the goal (or utility function) of an agent? Is it something that actually happens when the universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and the agent’s shortcomings)?