This isn’t directly evidence, but I think it’s worth flagging: by the nature of the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.
(This isn’t hypothetical. I don’t have some One Weird Trick To Blow Up The World, but there’s a bunch of stuff that falls under the policy “probably don’t mention this without good reason out of an abundance of caution.”)
I’m not sure if I fall into the bucket of people you’d consider this to be an answer to. I do think there’s something important in the region of LLMs that, judging by vibes if not by explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with those of some of the people you are trying to answer.
In case it’s informative, here’s how I’d respond to this:
Mostly agreed, with the capability-related asterisk.
Agreed in the spirit in which I think this was meant, but I’d rephrase it: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn’t.
That’s subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.
I think this frame is reasonable, and I use it.
Agreed.
Agreed.
Agreed for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high-dimensional, complex context seems to effectively guarantee this outcome.
Agreed, don’t make the runaway misaligned optimizer.
I think there remains a disagreement hiding within that last point, though. I think the real update from LLMs is:
We have a means of reaching extreme levels of capability in systems that don’t necessarily exhibit preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
It’s remarkably easy to get this form of extreme capability to guide itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
That core process is learned reliably because the training that yields it leaves no room for anything else: the objective isn’t a sparse/distant reward target, but a profoundly constraining and informative one.
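To gesture at why that density matters, here’s a minimal sketch in generic notation (my own illustration, not anything specific to GPT-N’s actual training recipe), contrasting the pretraining objective with a sparse/distant reward objective:

```latex
% Autoregressive pretraining: every token supplies a full supervised target,
% and the loss is defined over the model's predictive distribution,
% not over external world states.
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Sparse/distant reward for contrast: one scalar per trajectory, with
% everything between the start and the payoff left unconstrained.
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
```

The first objective pins down the learned process at every single token; the second hands back one number per trajectory and leaves a vast space of internal strategies that all cash out to the same scalar.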
In other words, a big part of the update for me was gaining a real foothold on loading the full complexity of “proper targets.”
I don’t think what we have so far constitutes a perfect and complete solution: the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn’t rule out doom, and so on. But diving deeply into this has made many convergent-doom paths appear dramatically less likely to Late2023!porby than they did to Mid2022!porby.