porby

Karma: 1,882

porby Dec 13, 2023, 8:49 PM
4 points
0
on: porby’s Shortform
I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

porby Dec 13, 2023, 8:45 PM
3 points
0
on: Suggestions for net positive LLM research
I’m accumulating a to-do list of experiments much faster than my ability to complete them:
If you wanted to take one of these and run with it or a variant, I wouldn’t mind!
The unifying theme behind many of these is goal agnosticism: understanding it, verifying it, maintaining it, and using it.
Note: I’ve already started some of these experiments, and I will very like start others soon. If you (or anyone reading this, for that matter) sees something they’d like to try, we should chat to avoid doing redundant work. I currently expect to focus on #4 for the next handful of weeks, so that one is probably at the highest risk of redundancy.
Further note: I haven’t done a deep dive on all relevant literature; it could be that some of these have already been done somewhere! (If anyone happens to know of prior art for any of these, please let me know.)

porby Dec 11, 2023, 2:55 AM
4 points
0
on: porby’s Shortform
Retrodicting prompts can be useful for interpretability when dealing with conditions that aren’t natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.
What does a prompt retrodictor look like?
Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there’s nothing special in principle about soft prompts with regard to their impact on conditioning predictions.
Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.
Two obvious approaches:
1. Special case of infilling. Stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively. In other words, the sequence would be:
  [Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
  Or, as the paper points out:
  [Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token] Nothing stopping the prefix sequence from having zero length.
2. Could also specialize training for just previous prediction:
  [Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
But we don’t just want some plausible previous prompts, we want the ones that most precisely match the effect on the suffix’s activations.
This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the outout distribution into a token is a problem; there’s no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe something clever floating around.
Note that retrodicting with an activation objective has some downsides:
1. If the retrodictor’s the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
2. Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
3. While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It’ll likely still have some kind of interpretable “vibe,” assuming the model isn’t too aggressively exploitable.
This class of experiment is expensive for natural language models. I’m not sure how interesting it is at scales realistically trainable on a couple of 4090s.
What links here?
- Suggestions for net positive LLM research by Cole Wyeth (Dec 13, 2023, 5:29 PM; 13 points)

porby Dec 11, 2023, 12:04 AM
2 points
0
in reply to: porby’s comment on: porby’s Shortform
Another potentially useful metric in the space of “fragility,” expanding on #4 above:
The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbations combined with sampling the gradient with respect to some behavioral loss suffices.
This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.
Does internal representational fragility correlate with other notions of “fragility,” like the information-required-to-induce-behavior “fragility” in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?
Given an assumption that the information content of the soft prompts have been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.
If anything, I’d expect anticorrelation; well-learned regions probably have enough training constraints that they’ve been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.
That’d still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.

porby Dec 10, 2023, 10:47 PM
2 points
0
in reply to: porby’s comment on: porby’s Shortform
A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.
Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.
In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That’d be good to know!
This kind of adversarial prompt automation could also be trivially included in an evaluations program.
I can’t imagine that this hasn’t been done before. If anyone has seen something like this, please let me know.

porby Dec 10, 2023, 10:38 PM
2 points
0
in reply to: porby’s comment on: porby’s Shortform
Expanding on #6 from above more explicit, since it seems potentially valuable:
From the goal agnosticism FAQ:
The definition as stated does not put a requirement on how “hard” it needs to be to specify a dangerous agent as a subset of the goal agnostic system’s behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.
From earlier experimentpost:
Figure out how to think about the “fragility” of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There’s clearly a spectrum here in terms of how “chaotic” a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
This can be phrased as “what’s the amount of information required to push a model into behavior X?”
Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:
“What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?”
In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:
Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
This seems like… it’s… an extremely good answer to the “fragility” question? It’s trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.
Conceptually, it’s a quantification of the number of information theoretic mistakes you’d need to make to get bad behavior from the model.

porby Dec 10, 2023, 10:23 PM
5 points
0
on: porby’s Shortform
Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.
Does training the model to recognize properties (e.g. ‘niceness’) explicitly as metatokens via classification make soft prompts better at capturing those properties?
You could test for that explicitly:
1. Pretrain model A with metatokens with a classifier.
2. Pretrain model B without metatokens.
3. Train soft prompts on model A with the same classifier.
4. Train soft prompts on model B with the same classifier.
5. Compare performance of soft prompts in A and B using the classifier.
Notes and extensions:
1. The results of the research are very likely scale sensitive. As the model gets larger, many classifier-relevant distinctions that could be missed by small models lacking metatoken training may naturally get included. In the limit, the metatoken training contribution may become negligible. Is this observable across ~pythia scales? Could do SFT on pythia to get a “model A.”
2. The above description leaves out some complexity. Ideally, the classifier could give scalar scores. This requires scalarized input tokens for the model that pretrains with metatokens.
3. How does soft prompting work when tokens are forced to be smaller? For example, if each token is a character, it’ll likely have a smaller residual dedicated to it compared to tokens that spans ~4 characters to equalize total compute.
4. To what degree does soft prompting verge on a kind of “adversarial” optimization? Does it find fragile representations where small perturbations could produce wildly different results? If so, what kinds of regularization are necessary to push back on that, and what is the net effect of that regularization?
5. There’s no restriction on the nature of the prompt. In principle, the “classifier” could be an RL-style scoring mechanism for any reward. How many tokens does it take to push a given model into particular kinds of “agentic” behavior? For example, how many tokens does it take to encode the prompt corresponding to “maximize the accuracy of the token prediction at index 32 in the sequence”?
6. More generally: the number of tokens required to specify a behavior could be used as a metric for the degree to which a model “bakes in” a particular functionality. More tokens required to specify behavior successfully → more information required in that model to specify that behavior.
What links here?
- Suggestions for net positive LLM research by Cole Wyeth (Dec 13, 2023, 5:29 PM; 13 points)

porby Dec 9, 2023, 8:42 PM
4 points
2
on: porby’s Shortform
Quarter-baked experiment:
1. Stick a sparse autoencoder on the residual stream in each block.
2. Share weights across autoencoder instances across all blocks.
3. Train autoencoder during model pretraining.
4. Allow the gradients from autoencoder loss to flow into the rest of the model.
Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:
1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block’s feature to “mean” the same thing as a later block’s same feature when they’re at different parts of execution?
2. How much does a shared representation across all blocks harm performance? Getting the comparison right is subtle; it would be quite surprising if there is no slowdown on predictive training when combined with the autoencoder training since they’re not necessarily aligned. Could try training very small models to convergence to see if they have different plateaus.
3. If forcing a shared representation doesn’t harm performance, why not? In principle, blocks can execute different sorts of programs with different IO. Forcing the residual stream to obey a format that works for all blocks without loss would suggest that there were sufficient representational degrees of freedom remaining (e.g. via superposition) to “waste” some when the block doesn’t need it. Or the shared “features” mean something completely different at different points in execution.
4. Compare the size of the dictionary required to achieve a particular specificity of feature between the shared autoencoder and a per-block autoencoder. How much larger is the shared autoencoder? In the limit, it could just be BlockCount times larger with some piece of the residual stream acting as a lookup. It’d be a little surprising if there was effectively no sharing.
5. Compare post-trained per-block autoencoders against per-block autoencoders embedded in pretraining that allow gradients to flow into the rest of the model. Are there any interesting differences in representation? Maybe in terms of size of dictionary relative to feature specificity? In other words, does pretraining the feature autoencoder encourage a more decodable native representation?
6. Take a look at the decoded features across blocks. Can you find a pattern for what features are relevant to what blocks? (This doesn’t technically require having a shared autoencoder, but having a single shared dictionary makes it easier to point out when the blocks are acting on the same feature, rather than doing an investigation, squinting, and saying “yeah, that sure looks similar.”)
What links here?
- Suggestions for net positive LLM research by Cole Wyeth (Dec 13, 2023, 5:29 PM; 13 points)

porby Nov 30, 2023, 6:44 PM
5 points
1
in reply to: RogerDearnaley’s comment on: How to Control an LLM’s Behavior (why my P(DOOM) went down)
I think that’d be great!
Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.
I suspect there’s a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point.

porby Nov 29, 2023, 7:18 PM
5 points
1
in reply to: RogerDearnaley’s comment on: How to Control an LLM’s Behavior (why my P(DOOM) went down)
What I’m calling a simulator (following Janus’s terminology) you call a predictor
Yup; I use the terms almost interchangeably. I tend to use “simulator” when referring to predictors used for a simulator-y use case, and “predictor” when I’m referring to how they’re trained and things directly related to that.
I also like your metatoken concept: that’s functionally what I’m suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.
Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.)

porby Nov 29, 2023, 7:00 PM
4 points
0
in reply to: Shiroe’s comment on: How to Control an LLM’s Behavior (why my P(DOOM) went down)
Alas, nope! To my knowledge it hasn’t actually been tried at any notable scale; it’s just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

porby Nov 29, 2023, 4:06 AM
26 points
5
on: How to Control an LLM’s Behavior (why my P(DOOM) went down)
Signal boosted! This is one of those papers that seems less known that it should be. It’s part of the reason why I’m optimistic about dramatic increases in the quality of “prosaic” alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it’s part of a path that’s robust enough to scale.
You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.
It’s also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.
The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It’s an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I’m pretty certain that a model that has very thoroughly learned what “nice” means at the human level can meaningfully generalize it to contexts where it hasn’t seen it directly applied.^[1]
I’m also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn’t be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens.
1. ^
  After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that “embodies the aspirational human trait of being kind to one another.” That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn’t be okay with, say, a superscience plan that would blow up 25% of the earth’s crust.

porby Nov 29, 2023, 2:10 AM
10 points
0
in reply to: aysja’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
Hm, I’m sufficiently surprised at this claim that I’m not sure that I understand what you mean. I’ll attempt a response on the assumption that I do understand; apologies if I don’t:
I think of tools as agents with oddly shaped utility functions. They tend to be conditional in nature.
A common form is to be a mapping between inputs and outputs that isn’t swayed by anything outside of the context of that mapping (which I’ll term “external world states”). You can view a calculator as a coherent agent, but you can’t usefully describe the calculator as a coherent agent with a utility function regarding world states that are external to the calculator’s process.
You could use a calculator within a larger system that is describable as a maximizer over a utility function that includes unconditional terms for external world states, but that doesn’t change the nature of the calculator. Draw the box around the calculator within the system? Pretty obviously a tool. Draw the box around the whole system? Not a tool.
I’ve been using the following two requirements to point at a maximally^[1] tool-like set of agents. This composes what I’ve been calling goal agnosticism:
1. The agent cannot be usefully described^[2] as having unconditional preferences about external world states.
2. Any uniformly random sampling of behavior from the agent has a negligible probability of being a strong and incorrigible optimizer.
Note that this isn’t the same thing as a definition for “tool.” An idle rock uselessly obeys this definition; tools tend to useful for something. This definition is meant to capture the distinction between things that feel like tools and those that feel like “proper” agents.
To phrase it another way, the intuitive degree of “toolness” is a spectrum of how much the agent exhibits unconditional preferences about external world states through instrumental behavior.
Notably, most pretrained LLMs with the usual autoregressive predictive loss and a diverse training set are heavily constrained into fitting this definition. Anything equivalent to RL agents trained with sparse/distant rewards is not. RLHF bakes a condition into the model of peculiar shape. I wouldn’t be surprised if it doesn’t strictly obey the definition anymore, but it’s close enough along the spectrum that it still feels intuitive to call it a tool.
Further, just like in the case of the calculator, you can easily build a system around a goal agnostic “tool” LLM that is not, itself, goal agnostic. Even prompting is enough to elicit a new agent-in-effect that is not necessarily goal agnostic. The ability for a goal agnostic agent to yield non-goal agnostic agents does not break the underlying agent’s properties.^[3]
1. ^
  For one critical axis in the toolishness basis, anyway.
2. ^
  Tricky stuff like having a bunch of terms regarding external world states that just so happen to always cancel don’t count.
3. ^
  This does indeed sound kind of useless, but I promise the distinction does actually end up mattering quite a lot! That discussion goes beyond the scope of this post. The FAQ goes into more depth.

porby Nov 29, 2023, 1:05 AM
3 points
0
in reply to: Nathan Helm-Burger’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
While this probably isn’t the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:
I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and also it seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And do think that these things are the case.
Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnostic systems that aren’t autodoomy. The transition out of goal agnosticism is not something I expect to avoid, nor something that I think should be avoided.
I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won’t be enough to stop catastrophe once someone has defected.
I’d be more worried about this if I thought the path was something that required Virtuous Sacrifice to maintain. In practice, the reason I’m as optimistic (nonmaximally pessimistic?) as I am that I think there are pretty strong convergent pressures to stay on something close enough to the non-autodoom path.
In other words, if my model of capability progress is roughly correct, then there isn’t a notably rewarding option to “defect” architecturally/technologically that yields greater autodoom.
With regard to other kinds of defection:
I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don’t see how to have much hope for humanity.
Yup! Goal agnosticism doesn’t directly solve misuse (broadly construed), which is part of why misuse is ~80%-ish of my p(doom).
And I also don’t see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?
If we muddle along deeply enough into a critical risk period slathered in capability overhangs that TurboDemon.AI v8.5 is accessible to every local death cult and we haven’t yet figured out how to constrain their activity, yup, that’s real bad.
Given my model of capability development, I think there are many incremental messy opportunities to act that could sufficiently secure the future over time. Given the nature of the risk and how it can proliferate, I view it as much harder to handle than nukes or biorisk, but not impossible.

porby Nov 28, 2023, 12:41 AM
6 points
0
on: porby’s Shortform
Another experiment:
1. Train model M.
2. Train sparse autoencoder feature extractor for activations in M.
3. FT = FineTune(M), for some form of fine-tuning function FineTune.
4. For input x, fineTuningBias(x) = FT(x) - M(x)
5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
6. Backpropagate the loss through M(x) into the feature dictionaries.
7. Identify responsible features by large gradients.
8. Identify what those features represent (manually or AI-assisted).
9. To what degree do those identified features line up with the original FineTune function’s intent?
Extensions:
1. The features above are in the context of a single input. Check for larger scopes by sampling more inputs, backpropagating, and averaging the observed feature activations. Check for ~unconditional shifts induced by FineTune by averaging over an extremely broad sampling of inputs.
2. Can check path dependence during RLHF-like fine tuning. Do the features modified across multiple RLHF runs remain similar? Note that this does not require interpreting what features represent, just that they differ. That makes things easier! (Also, note that this doesn’t technically require a feature dictionary; the sparse autoencoder bit just makes it easier to reason about the resulting direction.)
3. Can compare representations learned by decision transformers versus PPO-driven RLHF. Any difference between the features affected? Any difference in the degree of path dependence?
4. Can compare other forms of conditioning. Think [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). In this case, there wouldn’t really be a fine-tuning training stage; rather, the existence of the condition would serve as the runtime FineTune function. Compare the features between the conditioned and unconditioned cases. Presence of the conditions in pretraining could change the expressed features, but that’s not a huge problem.
5. Any way to meaningfully compare against activation steering? Given that the analysis is based directly on the activations to begin with, it would just be a question of where the steering vector came from. The feature dictionary could be used to build a steering vector, in principle.
6. Does RLHF change the feature dictionary? On one hand, conditioning-equivalent RL (with KL divergence penalty) shouldn’t find new sorts of capability-relevant distinctions, but it’s very possible that it collapses some features that are no longer variable in the fine-tuned model. This is trickier to evaluate; could try to train a linear map on the activations of model B before feeding it to an autoencoder trained on model A’s activations.
What links here?
- Suggestions for net positive LLM research by Cole Wyeth (Dec 13, 2023, 5:29 PM; 13 points)

porby Nov 27, 2023, 10:42 PM
6 points
0
on: porby’s Shortform
Some experimental directions I recently wrote up; might as well be public:
1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motivated instrumental behavior across the training distribution; if it’s possible for any nontrivial capability to persist out of distribution at toy scales, even with significant contrivance to train it into existence in the first place, that would be extremely concerning for the potential persistence of deceptive mesaoptimizers at scale.
  
  Ideally, the experiment would examine the difference between OOD capabilities with varying levels of overlap with the training distribution. For example, contrast four cases:
  A: A model is trained on ten different “languages” with zero translation tasks between them. These “languages” would be not human languages, but rather trivial types of sequences that do not share any obvious form or underlying structure. One language could be the sequence generated by f(x) = 2x + 1; another might be to endlessly repeat “brink bronk poot toot.”
  B: A model is trained on ten different languages with significantly different form, but a shared underlying structure. For example, all the languages might involve solving trivial arithmetic, but one language is “3 + 4 = 7″ and another language is “three plus four equals seven.”
  C: Same as B, but now give the model translation tasks.
  D: Same as C, but leave one language pair’s translation tasks unspecified. Any successful translation for that pair would necessarily arise from a generalizing implementation.
  
  For each model, drop parts of the training distribution but continue to perform test evaluations on that discontinued part. Do models with more apparent shared implementation decay more slowly? How does the decay vary with hyperparameters?
  
  Some circuit-level analysis might be helpful here to identify whether capability is lost via trivial gating versus catastrophic scrambling, but it’s probably best to punt that to a separate experiment.
3. I suspect there is an equivalence between conditioning and representational intervention, like activation steering. They may be different interfaces to the same effect. I’d like to poke around metatoken-like approaches (like Pretraining Language Models with Human Preferences) and see if I can find anything compelling from a representational perspective.
4. Assuming goal agnosticism is actually achieved and maintained, it broadens the scope for what kinds of interpretability can be useful by ruling out internal representational adversaries. There may be room for more experiments around motivational interpretability. (Some other work has already been published on special cases.)
Less concretely, I’d also like to:
1. Figure out how to think about the “fragility” of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There’s clearly a spectrum here in terms of how “chaotic” a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
2. More fully ground “Responsible Scaling Policy”-style approaches on a goal agnostic foundation. If a lab can demonstrate that a model is incapable of learning preferences over external world states, and that their method of aiming the model isn’t “fragile” in the above sense, then it’s a good candidate for incremental experimentation.
3. Come up with other ways to connect this research path with policy more generally.
What links here?

porby Nov 27, 2023, 6:22 PM
4 points
0
in reply to: Quintin Pope’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
In retrospect, the example I used was poorly specified. It wouldn’t surprise me if the result of the literal interpretation was “the AI refuses to play chess” rather than any kind of worldeating. The intent was to pick a sparse/distant reward that doesn’t significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn’t actually the most reliable accessible strategy to “never lose at chess” for that broader type of system and I’d expect superior strategies to be found in the limit of optimization.

porby Nov 27, 2023, 6:11 PM
10 points
2
in reply to: paulfchristiano’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
But the point is that in this scenario the LM doesn’t want anything in the behaviorist sense, yet is a perfectly adequate tool for solving long-horizon tasks. This is not the form of wanting you need for AI risk arguments.
My attempt at an ITT-response:

Drawing a box around a goal agnostic LM and analyzing the inputs and outputs of that box would not reveal any concerning wanting in principle. In contrast, drawing a box around a combined system—e.g. an agentic scaffold that incrementally asks a strong inner goal agnostic LM to advance the agent’s process—could still be well-described by a concerning kind of wanting.
Trivially, being better at achieving goals makes achieving goals easier, so there’s pressure to make system-as-agents which are better at removing wrenches. As the problems become more complicated, the system needs to be more responsible for removing wrenches to be efficient, yielding further pressure to give the system-as-agent more ability to act. Repeat this process a sufficient and unknown number of times and, potentially without ever training a neural network describable as having goals with respect to external world states, there’s a system with dangerous optimization power.
(Disclaimer: I think there are strong repellers that prevent this convergent death spiral, I think there are lots of also-attractive-for-capabilities offramps along the worst path, and I think LM-like systems make these offramps particularly accessible. I don’t know if I’m reproducing opposing arguments faithfully and part of the reason I’m trying is to see if someone can correct/improve on them.)

porby Nov 25, 2023, 6:05 PM
9 points
5
in reply to: Logan Zoellner’s comment on: Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense
Trying to respond in what I think the original intended frame was:
A chess AI’s training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn’t clear there is an amount of optimization pressure accessible to SGD which can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as “wanting” to win within the boundaries of the space that it operates.
In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now, the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist “values” the AI can have are far less constrained. It has the ability to route through the world, and with extreme optimization, it seems likely that it will.
(For background, I think it’s relatively easy to relocate where the optimization squeezing is happening to avoid this sort of worldeating outcome, but it remains true that optimization for targets with ill-defined bounds is spooky and to be avoided.)

porby Nov 25, 2023, 12:22 AM
3 points
0
in reply to: Ilio’s comment on: FAQ: What the heck is goal agnosticism?
you mention « restrictive », my understanding is that you want this expression to specifically refers to pure predictors. Correct?
Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.
A subset of predictors are indeed the most powerful known goal agnostic systems. I can’t currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend beyond predictors, so I leave the door open.
Also, by using the term “goal agnosticism” I try to highlight the value that arises directly from the goal-related properties, like statistical passivity and the lack of instrumental representational obfuscation. I could just try to use the more limited and implementation specific “ideal predictors” I’ve used before, but in order to properly specify what I mean by an “ideal” predictor, I’d need to specify goal agnosticism.