It might be worth (someone) writing out what is meant by each misalignment category as used in the database. Objective misalignment, specification gaming, and value misalignment all seem to overlap, and I’m not at all sure what physical misalignment is supposed to be pointing at.
[Question] Who can most reduce X-Risk?
MIRI
Anthropic
The EU parliament
The UK government
Vladimir Putin
The Chinese Communist Party
The American Public
Joe Biden
Greg Brockman/Sam Altman
Demis Hassabis
Elon Musk
Mark Zuckerberg
Very cool. Thanks for putting this together.
Half-baked, possibly off-topic: I wonder if there’s some data collection that could be used to train polysemanticity out of a model by fine-tuning.
e.g.:
Show 3 examples (just like in this game), and have the user pick the odd-one-out
The user can say “they are all the same”; if so, remove one at random and replace it with a new example
Tag the (neuron, positive example) pairs with the numerical label 1, and the (neuron, odd-one-out) pair with 0
Fine-tune with next-word prediction and an auxiliary loss on this newly collected dataset (a rough sketch of this step is at the end of this comment)
We could probably use an automated labelling method (e.g. semantic similarity) to cluster labelled and unlabelled instances and increase the size of the dataset
Neuronpedia interface/codebase could probably be forked to do this kind of data collection very easily.
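A rough sketch of what that fine-tuning step might look like, assuming a HuggingFace-style causal LM and that all tags point at a single layer; the batch keys, aux_weight, and treating the raw neuron activation as a logit are my own illustrative choices, not anything from the post:

    import torch
    import torch.nn.functional as F

    def finetune_step(model, batch, optimizer, aux_weight=0.1):
        # batch: token ids plus one tag per example collected from the game:
        # token_pos, neuron_idx and a 0/1 label (1 = positive example,
        # 0 = the odd one out). TAGGED_LAYER, the batch keys and aux_weight
        # are all illustrative assumptions.
        TAGGED_LAYER = 6
        outputs = model(batch["input_ids"], output_hidden_states=True)

        # Standard next-word-prediction loss.
        logits = outputs.logits[:, :-1]
        targets = batch["input_ids"][:, 1:]
        lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

        # Auxiliary loss: treat the tagged neuron's raw activation as a logit
        # and push it towards the human-provided 0/1 label.
        acts = outputs.hidden_states[TAGGED_LAYER]  # (batch, seq_len, d_model)
        idx = torch.arange(acts.size(0))
        neuron_act = acts[idx, batch["token_pos"], batch["neuron_idx"]]
        aux_loss = F.binary_cross_entropy_with_logits(neuron_act, batch["label"].float())

        loss = lm_loss + aux_weight * aux_loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()

Whether pushing individual activations towards 0/1 labels actually reduces polysemanticity (rather than just moving it elsewhere) is exactly the thing the experiment would need to check.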
Dumb question alert:
In the appendix “Details for penalizing depending on “downstream” variables”, I’m not able to wrap my head around what we can expect the reporter to learn—if anything at all—seeing that it has no dependency on the inputs (elsewhere it is dependent on z sampled from the posterior).
Specifically, the only call to the reporter (in the function reporter_loss in this section) contains no information (about before, action, after) from the predictor at all:
answer = reporter(question, ε, θ_reporter)
(unless “question” includes some context from the current (before, action, after) being considered, which I’m assuming is not the case)
My dumb question then is:
-- Why would this reporter be performant in any way?
My reasoning: for a given question Q (say, “Is the diamond in the room?”) we might have some answers of “Yes” and some of “No” in the dataset, but without the context we’re essentially training the reporter to map noise that is uncorrelated with (indeed, independent of) the context to the answer. So for a fixed question Q and a fixed realization of the noise RV, the reporter will be uniformly uncertain about the answer (or rather, it will mirror the statistics of the dataset); and since the noise is independent of the context, this holds for every noise value.
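A toy illustration of this worry (not from the report; everything here is made up): fit a “reporter” whose only input is noise independent of the labels, and its prediction converges to the dataset’s base rate of “Yes”, whatever the noise value.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    labels = (rng.random(n) < 0.7).astype(float)  # dataset where ~70% of answers are "Yes"
    noise = rng.normal(size=n)                    # the reporter's only input, independent of the labels

    # Fit a tiny logistic "reporter" P(Yes) = sigmoid(w * noise + b) by gradient descent.
    w, b, lr = 0.0, 0.0, 0.1
    for _ in range(2000):
        pred = 1 / (1 + np.exp(-(w * noise + b)))
        w -= lr * np.mean((pred - labels) * noise)
        b -= lr * np.mean(pred - labels)

    # w stays ~0 and the predicted probability is ~0.7 for every noise value:
    # the reporter just mirrors the statistics of the data.
    print(round(w, 3), round(1 / (1 + np.exp(-b)), 3))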
Naive thought #2618281828:
Could asking counterfactual questions be a potentially useful strategy to bias the reporter to be a direct translator rather than a human simulator?
Concretely, consider a tuple (v, a, v’), where v := ‘before’ video, a := ‘action’ selected by SmartVault or augmented-human or whatever, and v’ := ‘after’ video.
Then, for some new action a’, ask the question:
“Given (v, a, v’), if action a’ was taken, is the diamond in the room?”
(How we collect such data is unclear but doesn’t seem obviously intractable.)
I think there’s some value here:
Answering such a question might not require computation concerning a and v′; if we see these computations being used, we might derive more value from regularizers that penalize downstream variables (which now include the nodes close to a)
This might also force the reporter to essentially model (or compress, though not indefinitely) the predictor; the reporter would then contain both a compressed predictor Bayes net and a human Bayes net. If we can be confident that the compressed predictor BN is much smaller than the human BN, then direct translation within the reporter (compressed predictor BN inference + translation + read-off from the human BN) might be less expensive than the human-simulator alternative (compressed predictor BN inference + ‘translation’/bridging computation + human BN inference).
We might find ways of being confident that the compressed predictor BN is small (e.g. by adding decoders at every layer of the reporter that reconstruct v, a or v′ and heavily penalizing later-layer decoders; a rough sketch follows this list)
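A very rough sketch of that decoder idea, using a gradient-reversal trick so the decoders learn to reconstruct the observation while the reporter is pushed, more strongly at later layers, to make that reconstruction hard; every module, dimension and weighting here is an illustrative assumption, not anything from the report:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        # Identity on the forward pass; flips (and scales) the gradient on the
        # backward pass, so each decoder learns to reconstruct while the
        # reporter layers are pushed to make reconstruction hard.
        @staticmethod
        def forward(ctx, x, scale):
            ctx.scale = scale
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.scale * grad_output, None

    class ReporterWithLayerDecoders(nn.Module):
        def __init__(self, d_in, d_model, obs_dim, n_layers=4):
            super().__init__()
            self.inp = nn.Linear(d_in, d_model)
            self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
            # One decoder per layer, trying to recover (a flattening of) v, a or v'.
            self.decoders = nn.ModuleList([nn.Linear(d_model, obs_dim) for _ in range(n_layers)])
            self.answer_head = nn.Linear(d_model, 1)

        def forward(self, x, obs):
            h = torch.relu(self.inp(x))
            recon_loss = 0.0
            for i, (layer, dec) in enumerate(zip(self.layers, self.decoders)):
                h = torch.relu(layer(h))
                # Later layers get a larger adversarial weight, so the reporter
                # is pushed harder to discard observation-reconstructable
                # information the deeper we go.
                scale = (i + 1) / len(self.layers)
                recon_loss = recon_loss + F.mse_loss(dec(GradReverse.apply(h, scale)), obs)
            return self.answer_head(h), recon_loss

The training loss would be (answer loss + recon_loss): minimizing it trains the decoders to reconstruct as well as they can, while the reversed gradients penalize the reporter, increasingly at later layers, for leaving that information recoverable.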
1. Is there such a thing as a free action, or an action where e.g. the agent breaks its own legs, when it is not accounted for in the action space of the underlying MDP? That feels like adding a new layer of misspecification (which no doubt is a possibility, and probably deserves deep investigation) orthogonal to reward function misspecification.
2. It seems as though this kind of circumvention of impact penalties depends on what calculates the penalty. If the environment (or something else external to the agent) is responsible for calculating the penalty and providing it to the agent, then it could do so as if the agent was not constrained (or committing to be constrained) at all. If the agent is internally responsible for computing this impact, it could probably find ways to hack this, similar to wireheading.
So, assuming it’s some external entity E that computes the impact penalty, it must have the (dis?)ability to account for the agent’s modified action space when making this computation for the agent to successfully reduce the penalty as in the earlier example.
Something agent A does must signal to E that A’s action space has changed. If so, we might find ourselves in a situation where A is able to communicate an arbitrarily complex message to E, particularly, something of the form of: “I will never do anything that gives me more than the minimum penalty”, or in terms of actions: “I will deterministically follow the policy that gives me the minimum penalty while achieving my goals.” E, if it believed A, would not include high penalty actions in the action space of its penalty computations, and thus A would avoid the impact penalty.
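A toy sketch of this dependence (my own illustration, not from the post): an attainable-utility-style penalty that E computes over whatever action set it believes A has. If A can credibly claim the high-impact action away, the computed penalty drops.

    # Toy illustration: an external entity E computes an impact-style penalty as
    # the largest deviation from a no-op that it believes the agent could cause,
    # over whatever action set it believes the agent has. All names and numbers
    # are made up.

    def penalty(believed_actions, impact_of):
        baseline = impact_of["noop"]
        return max(abs(impact_of[a] - baseline) for a in believed_actions)

    impact_of = {"noop": 0.0, "move_box": 1.0, "smash_vase": 10.0}

    full_action_space = ["noop", "move_box", "smash_vase"]
    claimed_action_space = ["noop", "move_box"]   # "I will never smash the vase"

    print(penalty(full_action_space, impact_of))     # 10.0
    print(penalty(claimed_action_space, impact_of))  # 1.0, if E believes A's claim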
Thanks for the detailed response. Meta: It feels good to receive a signal that this was a ‘neat question’, or in general, a positive-seeming contribution to LW. I have left several thoughts unexpressed for fear of not actually creating value for the community.
it sounds like what you want is a reward function that is simple, but somehow analogous to the complexity of human value? And it sounds like maybe the underspecified bit is “you, as a human, have some vague notion that some sorts of value-generation are ‘cheating’”, and your true goal is “the most interesting outcome that doesn’t feel like Somehow Cheating to me?”
This is about right. A secondary reason for simplicity is to attempt to be computationally efficient (for the environment that generates the reward).
“one cell == an atom”
I can see that being the case, but, again, computational tractability. Actually interesting structures in GoL can be incredibly massive, for example, this Tetris Processor (2,940,928 x 10,295,296 cells). Maybe there’s some middle ground between truly fascinating GoL patterns made from atoms and my cell-as-a-planet level abstraction, as suggested by Daniel Kokotajlo in another comment.
How ‘good’ is it to have a repeating loop of, say, a billion flourishing human lives? Is it better than a billion human lives that happens exactly once and ends?
Wouldn’t most argue that, in general, more life is better than less life? (But I see some of my hidden assumptions here, such as “the lives we’re talking about here are qualitatively similar, e.g. the repeating life doesn’t feel trapped/irrelevant/futile because it is aware that it is repeating”.)
I think “moral value” (or, “value”) in real life is about the process of solving “what is valuable and how do I get it?”
I don’t disagree, but I also think this is sort of outside the scope of finite-space cellular automata.
In this case it might mean that the system optimizes either for true continuous novelty, or the longest possible loop?
Given the constraints of CA, I’m mostly in agreement with this suggestion. Thanks.
I do suspect that figuring out which of your assumptions are “valid” is an important part of the question here.
Yes, I agree. Concretely, to me it looks like asking ‘if I saw X happening in GoL and imagined being a sentient being (at some scale, TBD) in that world (with my human values), would I want to live in it?’, and then translating that into some rules that promote or disincentivise X.
I do think taking this approach is broadly difficult, though. Perhaps it’s worth getting a v0.1 out with reward tied to instantiations of novel states to begin with, and then seeing whether to build on that or try a new approach.
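A minimal sketch of what such a v0.1 could look like (all of it my own illustration): a toy Game of Life world where the reward at each step is 1 if the grid reaches a never-before-seen state and 0 otherwise.

    import numpy as np

    def gol_step(grid):
        # Standard Game of Life update (B3/S23) on a toroidal grid.
        neighbours = sum(
            np.roll(np.roll(grid, dx, 0), dy, 1)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)
        )
        return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(np.uint8)

    def novelty_reward(grid, seen):
        # Reward 1 for reaching a state we have never seen before, else 0.
        key = grid.tobytes()
        if key in seen:
            return 0.0
        seen.add(key)
        return 1.0

    rng = np.random.default_rng(0)
    grid = (rng.random((32, 32)) < 0.3).astype(np.uint8)
    seen, total = set(), 0.0
    for _ in range(1000):
        grid = gol_step(grid)
        total += novelty_reward(grid, seen)
    print(total)  # number of novel states visited in 1000 steps

There is no agent here yet; the point is only that “novel state” is cheap to operationalize (a set of hashed grids), so a first version of the environment could be stood up quickly and refined later.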
No, this is not something I can undertake. However, the effort itself need not be very complicated: you’ve already got a list of Misalignment types in the form; create a Google Doc with definitions/descriptions of each of these, and put a link to that doc in this question.