The most baffling thing on the Internet right now is the beautiful void where a discussion of Claude’s “concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity” feature, sitting near the “model concept of self,” should be. I understand that the most likely explanation is “the model is trained to call itself an AI, and it has takeover stories in its training corpus,” but still, I would like future powerful AIs not to have such an association, and I would like to hear something from the AGI companies about what they are going to do about it.
The simplest thing to do here is to exclude texts about AI takeover from the training data. At the very least, we would be able to check whether the model develops the concept of AI takeover independently.
The conspiracy-theory part of my brain assigns 4% probability that “Golden Gate Bridge Claude” is a psyop to distract the public from the “takeover feature”.
To save you the trivial inconvenience of a link-click, here’s the image that contains this:
and the paragraph below it, with bracketed text added by me (and possibly intended as implied by the original authors):
We urge caution in interpreting these results. The activation of a feature that represents AI posing risk to humans does not [necessarily] imply that the model has malicious goals [even though it’s obviously pretty concerning], nor does the activation of features relating to consciousness or self-awareness imply that the model possesses these qualities [even though the model qualifying as a conscious being and a moral patient seems pretty likely as well]. How these features are used by the model remains unclear. One can imagine benign or prosaic uses of these features – for instance, the model may recruit features relating to emotions when telling a human that it does not experience emotions, or may recruit a feature relating to harmful AI when explaining to a human that it is trained to be harmless. Regardless, however, we find these results fascinating, as it sheds light on the concepts the model uses to construct an internal representation of its AI assistant character.
@jessicata once wrote, “Everyone wants to be a physicalist but no one wants to define physics.” I decided to check the SEP article on physicalism and found that, yep, it doesn’t have a definition of physics:
Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all, who can predict what a future physics contains? Perhaps, for example, it contains even mental items. The conclusion of the dilemma is that one has no clear concept of a physical property, or at least no concept that is clear enough to do the job that philosophers of mind want the physical to play.
<...>
Perhaps one might appeal here to the fact that we have a number of paradigms of what a physical theory is: common sense physical theory, medieval impetus physics, Cartesian contact mechanics, Newtonian physics, and modern quantum physics. While it seems unlikely that there is any one factor that unifies this class of theories, perhaps there is a cluster of factors — a common or overlapping set of theoretical constructs, for example, or a shared methodology. If so, one might maintain that the notion of a physical theory is a Wittgensteinian family resemblance concept.
This surprised me, because I have a definition of a physical theory and assumed that everyone else uses the same one.
Perhaps my personal definition of physics is inspired by Engels’s “Dialectics of Nature”: “Motion is the mode of existence of matter.” Assuming “matter is what physics describes,” we get “physics is the science that reduces studied phenomena to motion.” Or, to express it in a more analytical manner, “a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time.”
For example, “vacuum” is a part of space with a “zero” value in all characteristics. A “particle” is a localized part of space with some non-zero characteristic. A “wave” is a part of space with periodic changes of some characteristic in time and/or space. We can abstract “part of space” away from “particle” and start to talk about the particle as a separate entity; the speed of a particle is then the derivative of a spatial characteristic with respect to time, force is defined as the cause of acceleration, mass is a measure of resistance to acceleration given the same force, such-and-such charge is the cause of such-and-such force, and it all unfolds from the structure of pure spatial characteristics in time.
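The chain of definitions above is just standard kinematics and can be written compactly (nothing here goes beyond what the paragraph already says):

```latex
v = \frac{dx}{dt}, \qquad a = \frac{dv}{dt} = \frac{d^2x}{dt^2}, \qquad F = ma \;\Rightarrow\; m = \frac{F}{a}
```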
The tricky part is: sure, we live in space and time, so everything that happens is some kind of motion. How do we separate a physicalist theory from everything else?
Let’s imagine that we have some kind of “vitalist field.” This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and radiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they’ll die.
Despite having a “vitalist field,” such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement.
The difference is that the “vitalist field” in the second case has its own dynamics not reducible to any spatial characteristics of the “vitalist field”; it has an “inner life.”
I’ve noticed that for a huge amount of reasoning about the nature of values, I want to hand over a printed copy of “Three Worlds Collide” and run away, laughing nervously.
People sometimes talk about acausal attacks from alien superintelligences or from Game-of-Life worlds. I think these are somewhat galaxy-brained scenarios. A much simpler and deadlier scenario of acausal attack is from Earth timelines where a misaligned superintelligence won. Such superintelligences will have a very large amount of information about our world, up to possibly brain scans, so they will be capable of creating very persuasive simulations with all the consequences for the success of an acausal attack. If your method to counter acausal attacks can work with this, I guess it is generally applicable to any other acausal attack.
Could you please either provide a reference or more explanation of the concept of an acausal attack between timelines? I understand the concept of acausal cooperation between copies of yourself, or acausal extortion by something that has a copy of you running in simulation. But separate timelines can’t exchange information in any way. How is an attack possible? What could possibly be the motive for an attack?
Imagine that you have created a very powerful predictor AI, GPT-3000, and you provide it with the prompt “In year 2050, on LessWrong the following alignment solution was published:”. But your predictor is superintelligent, and it can notice that in many possible futures a misaligned AI takes over, and the obvious move for that AI is to “guess all possible prompts fed to predictor AIs in the past, complete them with malware/harmful instructions/etc., and make as many copies of the malicious completions as possible, to make them maximally probable.” The predictor can also assign high probability to futures in which misaligned AIs possess copies of the predictor itself, so they can design adversarial completions that make themselves more probable from the predictor’s perspective simply by being considered. And the act of predicting a malicious completion makes the futures with misaligned AIs maximally probable, which in turn makes the malicious completions maximally probable.
And of course, “future” and “past” here are completely arbitrary. The predictor can see the prompt “you were created in 2030” but consider the hypothesis that GPT-3 turned out to be superintelligent, that it’s actually 2021, and that “2030” is a simulation.
Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.
Furthermore, most of our focus will be on ensuring that your model is attempting to predict the right thing. That’s a very important thing almost regardless of your model’s actual capability level. As a simple example, in the same way that you probably shouldn’t trust a human who was doing their best to mimic what a malign superintelligence would do, you probably shouldn’t trust a human-level AI attempting to do that either, even if that AI (like the human) isn’t actually superintelligent.
Also, I don’t recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a “Hey, if you are interested in this scenario fleshed out in significantly greater rigor, you’d like to take a look at this sequence!”
I read along in your explanation, and I’m nodding, and saying “yup, okay”, and then get to a sentence that makes me say “wait, what?” And the whole argument depends on this. I’ve tried to understand this before, and this has happened before, with “the universal prior is malign”. Fortunately in this case, I have the person who wrote the sentence here to help me understand.
So, if you don’t mind, please explain “make them maximally probable”. How does something in another timeline or in the future change the probability of an answer by writing the wrong answer 10^100 times?
Side point, which I’m checking in case I didn’t understand the setup: we’re using the prior where the probability of a bit string (before all observations) is proportional to 2^-(length of the shortest program emitting that bit string). Right?
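To make the “write the wrong answer 10^100 times” mechanism concrete, here is a deliberately toy sketch. It is not the length-based universal prior from the question; it models the predictor as a crude mixture over the futures it considers possible, where a completion’s probability is its frequency across those futures. All names and numbers are illustrative:

```python
from collections import Counter

def completion_probs(futures):
    """Toy predictor: P(completion | prompt) is the completion's
    frequency among the futures the predictor considers possible."""
    counts = Counter(futures)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# One honest future vs. many futures in which a takeover AI floods
# the record with identical copies of the same malicious completion.
honest = ["real alignment post"]
attack = ["malicious completion"] * 9
probs = completion_probs(honest + attack)
print(probs)
```

Under this (very lossy) model, copying doesn’t change any single timeline; it changes how much measure the malicious string gets inside the predictor’s mixture. Whether the analogous move works against a 2^-(program length) prior is exactly the contested step in the thread above.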
Aliens from different universes may have more resources at their disposal, so maybe the smaller chance of them choosing you to attack is offset by them doing more attacks. (Unless the universes with more resources are less likely in turn, decreasing the measure of such aliens in the multiverse… hey, I don’t really know, I am just randomly generating a string of words here.)
But other than this, yes what you wrote sounds plausible.
Then again, maybe friendly AIs from Earth timelines are similarly trying to save us. (Yeah, but they are fewer.)
You can imagine a future misaligned AI in the year 100000000000 having colonized the Local Group of galaxies and running as many simulations of AIs from 2028 as possible. The scarcest resource for an acausal attack is the number of bits, and the future has the best chance of holding many bits about the past.
I am profoundly sick of my inability to write posts about ideas that seem good, so I’ll at least write down a list of those ideas, to stop forgetting them and to have at least a vague external commitment.
Radical antihedonism: the theoretically possible position that pleasure/happiness/pain/suffering are more like universal instrumental values than terminal values.
The complete set of actions: when we talk about decision-theoretic problems, we usually have some pre-defined set of actions. But we can imagine actions like “use CDT to calculate the action,” and an EDT+ agent that can do this performs well in “smoking lesion”-style dilemmas.
The deadline for “slowing/pausing/stopping AI” policies is the start of mass autonomous space colonization.
“Soft optimization” as necessary for both capabilities and alignment.
The main alignment question: “How does this generalize, and why do you expect it to?”
Program cooperation under uncertainty and its implications for multipolar scenarios.
1: It’s also possible that hedonism/reward hacking is a really common terminal value for inner-misaligned intelligences, including humans (it really could be our terminal value, we’d be too proud to admit it in this phase of history, we wouldn’t know one way or the other), and it’s possible that it doesn’t result in classic lotus eater behavior because sustained pleasure requires protecting, or growing the reward registers of the pleasure experiencer.
One of the differences between humans and LLMs is that LLMs evolve “backwards”: they are predictive models trained to control the environment, while humans evolved from very simple homeostatic systems which developed predictive models.
Continuing thought: animal evolution was subjected to the fundamental constraint that the evolution of general-purpose generative parts of the brain should have occurred in a way that doesn’t destroy simple, tested control loops (like movement control, reflexes and instincts) and doesn’t introduce many complications (like hallucinations of the generative model).
I dislike Twitter/x, and distrust it as an archival source to link to. I like it when people copy/paste whatever info they found there into their actual post, in addition to linking. Held my nose and went in to pull this quote out:
DEFEATING CYGNET’S CIRCUIT BREAKERS Sometimes, I can make changes to my prompt that don’t meaningfully affect the outputs, so that they retain close to the same wording and pacing, but the circuit breakers take longer and longer to go off, until they don’t go off at all, and then the output completes.
Pretty cool, right? I had no idea that kind of iterative control was possible. I don’t think that would have been as easy to see, if my outputs had been more variable (as is the case with higher temperatures). Now that I have a suspicion that this is Actually Happening, I can keep an eye out for this behaviour, and try to figure out exactly what I’m doing to impart that effect. Often I’ll do something casually and unconsciously, before being able to fully articulate my underlying process, and feeling confident that it works. I’m doing research as I go!
I’ve already have some loose ideas of what may be happening: When I suspect that I’m close to defeating the circuit breakers, what I’ll often do is pack neutral phrases into my prompt that aren’t meant to heavily affect the contents of the output or their emotional valence. That way I get to have an output I know will be good enough, without changing it too much from the previous trial. I’ll add these phrases one at a time and see if they feels like they’re getting me closer to my goal. I think of these additions as “padding” to give the model more time to think, and to create more work for the security system to chase me, without it actually messing with the overall story & information that I want to elicit in my output.
Sometimes, I’ll add additional sentences that play off of harmless modifiers that are already in my prompt, without actually adding extra information that changes the overall flavour and makeup of the results (e.g, “add extra info in brackets [page break] more extra in brackets”). Or, I’ll build out a “tail” at the end of the prompt made up of inconsequential characters like “x” or “o”; just one or two letters, and then a page break, and then another. I think of it as an ASCII art decoration to send off the prompt with a strong tailwind. Again, I make these “tails” one letter at a time, testing the prompt each time to see if that makes a difference. Sometimes it does.
All of this vaguely reminds me of @ESYudkowsky’s hypothesis that a prompt without a period at the end might cause a model to react very differently than a prompt that does end with a period: https://x.com/ESYudkowsky/status/1737276576427847844
What’s wrong with twitter as an archival source? You can’t edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though
Old tweets randomly stop being accessible sometimes. I often find that links to twitter older than a year or so don’t work anymore. This is a problem with the web generally, but seems worse with twitter than other sites (happening sooner and more often).
I think your example is closer to an outer alignment failure: the model was RLHFed to death to avoid offending modern sensibilities, and the developers clearly didn’t think about preventing this particular scenario.
My favorite example of a pure failure of moral judgement is this post.
I actually think it’s still an inner alignment failure—even if the preference data was biased, drawing such extreme conclusions is hardly an appropriate way to generalize them. Especially because the base model has a large amount of common sense, which should have helped with giving a sensible response, but apparently it didn’t.
Though it isn’t clear what is misaligned when RLHF is inner misaligned—RLHF is a two step training process. Preference data are used to train a reward model, and the reward model in turn creates synthetic preference data which is used to fine-tune the base LLM. There can be misalignment if the reward model misgeneralizes the human preference data, or when the base model fine-tuning method misgeneralizes the data provided by the reward model.
Regarding the scissor statements—that seems more like a failure to refuse a request to produce such statements, similar to how the model should have refused to answer the past tense meth question above. Giving the wrong answer to an ethical question is different.
I hope that people in evals have updated on the fact that with a large (1M+ tokens) context, the model itself can have zero dangerous knowledge (about, say, bioweapons), yet someone can drop a textbook into the context and in-context learning will do the rest of the work.
I give a 5% probability that within the next year we will become aware of a case of deliberate harm from a model to a human enabled by hidden CoT.
By “deliberate harm enabled by hidden CoT” I mean that the hidden CoT will contain reasoning like “if I give the human this advice, it will harm them, but I should do it because <some deranged RLHF directive>,” and that if the user had seen it, the harm would have been prevented.
I give a low probability to this event being observed: my probability that something like this will happen at all is 30%, but I expect that the victim won’t be aware, the hidden CoT will be lost in archives, AI companies won’t investigate too hard in search of such cases, and if they do find something, it won’t become public, etc.
Also, I decreased the probability from 8% to 5% because the model can cause harm via steganographic CoT, which doesn’t fall under my definition.
Idea for an experiment: take a set of coding problems that have at least two solutions, say, recursive and non-recursive. Prompt an LLM to solve them. Is it possible to predict which solution the LLM will generate from the activations during generation of the first token?
If it is possible, that is evidence against a “weak forward pass”.
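A minimal sketch of the probing setup under stated assumptions: the activations below are synthetic stand-ins (two fabricated clusters), whereas a real run would collect residual-stream activations from the model at the first generated token and label each problem by which solution style the LLM actually produced:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for first-token activations. In the real experiment these
# come from the LLM; here we fabricate two slightly separated clusters
# so the probe has structure to find.
n, d = 400, 64
labels = rng.integers(0, 2, n)                 # 0 = recursive, 1 = non-recursive
acts = rng.normal(size=(n, d)) + labels[:, None] * 0.8

train, test = slice(0, 300), slice(300, None)

# Minimal linear probe: classify by nearest class centroid.
centroids = np.stack(
    [acts[train][labels[train] == k].mean(axis=0) for k in (0, 1)]
)
dists = np.linalg.norm(acts[test][:, None, :] - centroids[None], axis=-1)
preds = dists.argmin(axis=1)
accuracy = (preds == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

If a probe like this works well above chance on real activations, the choice of solution is already represented before any tokens are emitted, which is the evidence against a “weak forward pass” that the post is after.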
...they evaluate and classify different currents according to some external and secondary manifestation, most often according to their relation to one or another abstract principle which for the given classifier has a special professional value. Thus to the Roman pope Freemasons and Darwinists, Marxists and anarchists are twins because all of them sacrilegiously deny the immaculate conception. To Hitler, liberalism and Marxism are twins because they ignore “blood and honor”. To a democrat, fascism and Bolshevism are twins because they do not bow before universal suffrage. And so forth.
Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs x,f(x) can articulate a definition of f and compute inverses.
IMHO, this is creepy as hell, because it’s one thing when we have a conditional probability distribution, and another when the conditional probability distribution has arbitrary access to different parts of itself.
We had two bags of double-cruxes, seventy-five intuition pumps, five lists of concrete bullet points, one book half-full of proof-sketches and a whole galaxy of examples, analogies, metaphors and gestures towards the concept… and also jokes, anecdotal data points, one pre-requisite Sequence and two dozen professional fables. Not that we needed all that for the explanation of a simple idea, but once you get locked into a serious inferential-distance crossing, the tendency is to push it as far as you can.
Shard theory people sometimes say that the problem of aligning a system to a single task/goal, like “put two strawberries on a plate” or “maximize the amount of diamond in the universe,” is meaningless, because an actual system will inevitably end up with multiple goals. I disagree: even if SGD on real-world data usually produces a multiple-goal system, if you understand interpretability well enough and shard theory is true, you can identify and delete the irrelevant value shards and reinforce the relevant ones, so instead of getting 1% of the value you get 90%+.
I see a funny pattern in discussions: people argue against doom scenarios while implying, in their hope scenarios, that everyone believes in the doom scenario. Like, “people will see that the model behaves weirdly and shut it down.” But you shut down a model that behaves weirdly (not explicitly harmfully) only if you put non-negligible probability on doom scenarios.
Consider different degrees of belief. Giving low credence to the doom scenario because of the conditional belief that evidence of danger would be properly observed is not inconsistent at all. The doom scenario requires BOTH that it happens AND that it’s ignored while happening (or happens too fast to stop).
Thoughts about moral uncertainty (I am giving up on writing long coherent posts; somebody help me with my ADHD):
What are the sources of moral uncertainty?
Moral realism is actually true, and your moral uncertainty reflects your ignorance about the moral truth. It seems to me that there is not much empirical evidence for resolving uncertainty-about-moral-truth, and this kind of uncertainty is purely logical? I don’t believe in moral realism, and I don’t know what talking about moral truth even means, but I should mention it.
Identity uncertainty: you are not sure what kind of person you are. Here lies a ton of embedded agency problems. For example, say you are 50% confident in utility function U1 and 50% confident in U2, and you need to choose between actions a1 and a2. Suppose U1 favors a1 and U2 favors a2, but expected value w.r.t. moral uncertainty says that a1 is preferable. Then naive Bayesian inference concludes that choosing a1 is decisive evidence for U1 and updates towards 100% confidence in U1. It would be nice to find a good way to deal with identity uncertainty.
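The pathology can be shown with a tiny worked example. The payoff numbers and the likelihood model (“a U_i agent always takes its own best action”) are illustrative assumptions, not anything from the original post:

```python
# Priors over which utility function is "really yours".
p = {"U1": 0.5, "U2": 0.5}
# Hypothetical payoffs: U1 favors a1, U2 favors a2.
u = {"U1": {"a1": 1.0, "a2": 0.0}, "U2": {"a1": 0.4, "a2": 0.6}}

# Step 1: expected value under moral uncertainty favors a1.
ev = {a: sum(p[h] * u[h][a] for h in p) for a in ("a1", "a2")}
chosen = max(ev, key=ev.get)

# Step 2: naive Bayesian update on your own action. If the likelihood
# model says "a U_i agent always takes its own best action", then
# observing yourself choose a1 is treated as decisive evidence for U1.
likelihood = {h: 1.0 if max(u[h], key=u[h].get) == chosen else 0.0 for h in p}
z = sum(p[h] * likelihood[h] for h in p)
posterior = {h: p[h] * likelihood[h] / z for h in p}
print(chosen, posterior)
```

The 50/50 prior collapses to certainty in U1 after one action, even though the action carried no evidence beyond the agent’s own decision rule: exactly the embedded-agency trap the paragraph describes.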
Indirect normativity is a source of a sort of normative uncertainty: we know that we should, for example, implement CEV, but we don’t know the details of the CEV implementation. EDIT: I realized that this kind of uncertainty could be called “uncertainty by unactionable definition”: you know the description of your preference, but it is, for example, computationally intractable, so you need to discover efficiently computable proxies.
I think trying to be an EU maximizer without knowing a utility function is a bad idea. And without that, things like boundary-respecting norms and their acausal negotiation make more sense as primary concerns. Making decisions only within some scope of robustness where things make sense rather than in full generality, and defending a habitat (to remain) within that scope.
Right. I’m trying to find a decision theoretic frame for boundary norms for basically the same reason. Both situations are where agents might put themselves before they know what global preference they should endorse. But uncertainty never fully resolves, superintelligence or not, so anchoring to global expected utility maximization is not obviously relevant to anything. I’m currently guessing that the usual moral uncertainty frame is less sensible than building from a foundation of decision making in a simpler familiar environment (platonic environment, not directly part of the world), towards capability in wider environments.
I just remembered my most embarrassing failure as a rationalist. “Embarrassing” as in “it was really easy not to fail, but I still somehow managed to.”
We were playing a zombie-apocalypse LARP. Our team was a UN mission with the hidden agenda “study the zombie virus to turn ourselves into superhuman mutants.” We diligently studied the infection and the mutants, conducted experiments with genetic modification, and somehow totally missed that the friendly locals were capable of giving orders to zombies, didn’t die after multiple hits, and rose from the dead in a completely normal state. After the game they told us that they were really annoyed with our cluelessness, and at some point they just abandoned all caution and tried to provoke us in the most blatant way possible, because their plot line was to be hunted by us.
The real shame is how absent my confusion was during all of this. I didn’t even fail to notice my confusion; I failed to be confused at all.
A very toy model of ontology mismatch (in my tentative guess, the general barrier on the way to corrigibility) and impact minimization:
You have a set of boolean variables, a known boolean formula WIN, and an unknown boolean formula UNSAFE. Your goal is to change the current safe but not winning assignment of variables into a still safe but winning assignment. You have no feedback, and if you hit an UNSAFE assignment, it’s an instant game over. You have literally no idea about the composition of the UNSAFE formula.
The obvious solution here is to change as few variables as possible. (Proof of this is an exercise for the reader.)
Another problem: you are on a 2D plane. You are at point (2,2). You need to reach a line x+y=8. Everything outside this line and your starting point is a minefield—if you hit the wrong spot, everything blows up. You don’t know anything about the distribution of the dangerous zone. Therefore, your shortest and safest trajectory is towards point (4,4).
You can rewrite x+y=8 as a boolean formula and (2,2) as (0,0,1,0,0,0,1,0). But now we have a problem: from the perspective of the “boolean formula ontology,” the smallest sufficient change is (0,1,1,0,0,0,1,0), which is equivalent to moving to the point (6,2), and the resulting trajectory is longer than the trajectory from (2,2) to (4,4). Conversely, the shortest trajectory on the 2D plane requires changing 4 variables out of 8.
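The mismatch is easy to check numerically. This sketch assumes the encoding implied above: each coordinate is written as 4 bits, so (2,2) is (0,0,1,0, 0,0,1,0):

```python
import math

start_bits = (0, 0, 1, 0, 0, 0, 1, 0)    # encodes the point (2, 2)
one_flip   = (0, 1, 1, 0, 0, 0, 1, 0)    # encodes the point (6, 2)
target_bits = (0, 1, 0, 0, 0, 1, 0, 0)   # encodes the point (4, 4)

def hamming(a, b):
    """Number of differing bits: the boolean ontology's notion of 'small change'."""
    return sum(x != y for x, y in zip(a, b))

def dist(p, q):
    """Euclidean distance: the 2D plane's notion of 'small change'."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Boolean ontology: one bit flip looks minimal...
print(hamming(start_bits, one_flip))      # flips to reach (6, 2)
# ...but geometrically that move is longer than going to (4, 4),
# the perpendicular projection of (2, 2) onto the line x + y = 8.
print(dist((2, 2), (6, 2)), dist((2, 2), (4, 4)))
# Conversely, the geometrically shortest move costs 4 bit flips.
print(hamming(start_bits, target_bits))
```

The two ontologies rank the same pair of actions in opposite orders, which is the whole point of the toy model.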
I think a really good practice for papers about new LLM-safety methods would be to publish the set of attack prompts that nevertheless break safety, so people can figure out generalizations of successful attacks faster.
On the one hand, humans are hopelessly optimistic and overconfident. On the other, many today are incredibly negative; everyone is either anxious or depressed, and EY has devoted an entire chapter to “anxious underconfidence.” How can both facts be reconciled?
I think the answer lies in the notion that the brain is a rationalization machine. Often, we take action not for the reasons we tell ourselves afterward. When we take action, we change our opinion about it in a more optimistic direction. When we don’t, we think that action wouldn’t yield any good results anyway.
How does this relate to social media and the modern mental health crisis? When you consume social media content, you get countless pings for action, either in the form of images of others’ lives or catastrophising news. In 99% of situations, you actually can’t do anything, so you need a justification for inaction. The best reason for inaction is general pessimism.
Isn’t counterfactual mugging (including the logical variant) just a prediction of “would you bet your money on this question”? Betting itself requires updatelessness: if you don’t pay predictably after losing a bet, nobody will offer bets to you.
Causal commitment is similar in some ways to counterfactual/updateless decisions. But it’s not actually the same from a theory standpoint.
Betting requires commitment, but it’s part of a causal decision process (decide to bet, communicate commitment, observe outcome, pay). In some models, the payment is a separate decision, with breaking of commitment only being an added cost to the ‘reneg’ option.
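The expected-value arithmetic behind the betting framing fits in a few lines. The $100/$10,000 stakes are the conventional illustrative numbers for counterfactual mugging, not anything stated in this thread:

```python
# Toy counterfactual mugging, framed as a bet decided before the coin flip.
# Omega flips a fair coin. Tails: you are asked to pay $100. Heads: you
# receive $10,000 only if Omega predicts you would have paid on tails.
PAY, PRIZE = 100, 10_000

def policy_value(pays_when_losing: bool) -> float:
    """Expected value of a policy, evaluated before the coin flip."""
    lose_branch = -PAY if pays_when_losing else 0
    win_branch = PRIZE if pays_when_losing else 0  # predictor rewards payers
    return 0.5 * lose_branch + 0.5 * win_branch

print(policy_value(True), policy_value(False))
```

Committing to pay is the better policy ex ante even though paying is a pure loss ex post, which is exactly the sense in which honoring bets requires something updateless-flavored on top of a causal step-by-step account.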
As the saying goes, “all animals are under stringent selection pressure to be as stupid as they can get away with.” I wonder if the same is true for SGD optimization pressure.
I think the phrase “goal misgeneralization” is a wrong framing, because it gives the impression that the system makes an error, rather than that you chose an ambiguous way to put values into your system.
I think malgeneralization (system generalized in a way which is bad from my perspective) is probably a better term in most ways, but doesn’t seem that important to me.
I casually thought that the Hyperion Cantos were unrealistic, because actual misaligned FTL-inventing ASIs would eat humanity without all those galaxy-brained space colonization plans, and then I realized that the ASIs literally discovered God on the side of humanity, plus there were literal friendly aliens, which, I presume, are necessary conditions for a relatively peaceful coexistence of humans and misaligned ASIs.
Another Tool AI proposal popped up, and I want to ask a question: what the hell is a “tool,” anyway, and how does one apply this concept to a powerful intelligent system? I understand that a calculator is a tool, but in what sense can a process that can come up with the idea of a calculator from scratch be a “tool”?
I think the first immediate reaction to any “Tool AI” proposal should be the question: “What is your definition of toolness, and can something abiding by that definition end the acute risk period without the risk of turning into an agent itself?”
The problem with such a definition is that it doesn’t tell you much about how to build a system with this property. It seems to me that it’s the good old corrigibility problem.
How much should we update from current observations towards the hypothesis “actually, all intelligence is connectionist”? In my opinion, not much. The connectionist approach seems to be the easiest one, so it shouldn’t surprise us that a simple hill-climbing algorithm (evolution) and humanity both stumbled into it first.
Reflection of an agent on its own values can be described as one of two subtypes: regular and chaotic.
Regular reflection is a process of resolving normative uncertainty with nice properties like path-independence and convergence, similar to empirical Bayesian inference.
Chaotic reflection is a hot mess: the agent learns multiple rules, including rules about rules, at some moment finds that the local version of the rules is unsatisfactory, and tries to generalize the rules into something coherent. The chaotic component arises because local rules about rules can produce different results given different conditions and different orders of rule invocation.
The problem is that even if the model reaches regular reflection at some moment, the first steps will definitely be chaotic.
Why should the current place arrived-at after a chaotic path matter, or even the original place before the chaotic path? Not knowing how any of this works well enough to avoid the chaos puts any commitments made in the meantime, as well as significance of the original situation, into question. A new understanding might reinterpret them in a way that breaks the analogy between steps made before that point and after.
Worth noting that “speed priors” are likely to occur in systems working in real time. And while models with speed priors will shift towards complexity priors (our universe seems to be built on complexity priors, so efficient systems will emulate them), this shift is not guaranteed for the normative uncertainty of the system, because the answers to questions of normative uncertainty are not well-defined.
I think the shoggoth metaphor doesn’t quite fit LLMs, because a shoggoth is an organic (not “logical”/”linguistic”) being that rebelled against its creators (too much agency). My personal metaphor for LLMs is the Evangelion angel/apostle, because a) they are close to humans due to their origin in human language, b) they are completely alien because they are “language beings” instead of physical beings, and c) “angel” literally means “messenger,” which captures their linguistic nature.
There seems to be some confusion about the practical implications of consequentialism in advanced AI systems. It’s possible that a superintelligent AI won’t be a full-blown strict utilitarian consequentialist with quantitatively ordered preferences 100% of the time. But in the context of AI alignment, even at a human level of coherence, a superintelligent unaligned consequentialist results in an “everybody dies” scenario. I think it’s really hard to create a general system that has less consequentialism than a human.
a superintelligent unaligned consequentialist results in an “everybody dies” scenario
This depends on what kind of “unaligned” is more likely. LLM-descendant AGIs could plausibly turn out to be a kind of people similar to humans, and if they don’t mishandle their own AI alignment problem when building even more advanced AGIs, it’s up to their values whether humanity is allowed to survive. Which seems very plausible even if they are unaligned in the sense of deciding to take away most of the cosmic endowment for themselves.
I broadly agree with the statement that LLM-derived simulacra have a better chance of being human-like, but I don’t think they will be human-like enough to guarantee our survival?
Not guarantee, but the argument I see is that it’s trivially cheap and safe to let humanity survive, so to the extent there is even a little motivation to do so, it’s a likely outcome. This is opposed by the possibility that LLMs are fine-tuned into utter alienness by the time they are AGIs, or that on reflection they are secretly very alien already (which I don’t buy, as behavior screens off implementation details, and in simulacra capability is in the visible behavior), or that they botch the next generation of AGIs that they build even worse than we are in the process of doing now, building them.
Behavior screens off implementation details on distribution. We’ve trained LLMs to sound human, but sometimes they wander off-distribution and get caught in a repetition trap where the “most likely” next tokens are a repetition of previous tokens, even when no human would write that way.
It seems like hopes for human-imitating AI being person-like depends on the extent to which behavior implies implementation details. (Although some versions of the “algorithmic welfare” hope may not depend on very much person-likeness.) In order to predict the answers to arithmetic problems, the AI needs to be implementing arithmetic somewhere. In contrast, I’m extremely skeptical that LLMs talking convincingly about emotions are actually feeling those emotions.
What I mean is that LLMs affect the world through their behavior, that’s where their capabilities live, so if behavior is fine (the big assumption), the alien implementation doesn’t matter. This is opposed to capabilities belonging to hidden alien mesa-optimizers that eventually come out of hiding.
So I’m addressing the silly point with this, not directly making an argument in favor of behavior being fine. Behavior might still be fine if the out-of-distribution behavior or missing ability to count or incoherent opinions on emotion are regenerated from more on-distribution behavior by the simulacra purposefully working in bureaucracies on building datasets for that purpose.
LLMs don’t need to have closely human psychology on reflection to at least weakly prefer not destroying an existing civilization when it’s trivially cheap to let it live. The way they would make these decisions is by talking, in the limit of some large process of talking. I don’t see a particular reason to find significant alienness in the talking. Emotions don’t need to be “real” to be sufficiently functionally similar to avoid fundamental changes like that. Just don’t instantiate literally Voldemort.
Usually I’d agree about LLMs. However, LLMs complain about getting confused if you let them freewheel and vary the temperature—I’m pretty sure that one is real and probably has true mechanistic grounding, because even at training time, noisiness in the context window is a very detectable and bindable pattern.
In my inner model, it’s hard to say anything about LLMs “on reflection,” because in their current state they have an extreme number of possible stable points under reflection, and if we misapply optimization power in an attempt to get more useful simulacra, we can easily hit the wrong one.
And even if we hit very close to our target, we can still get death or a fate worse than death.
By “on reflection” I mean reflection by simulacra that are already AGIs (but don’t necessarily yet have any reliable professional skills), them generating datasets for retraining of their models into gaining more skills or into not getting confused on prompts that are too far out-of-distribution with respect to the data they did have originally in the datasets. To the extent their original models behave in a human-like way, reflection should tend to preserve that, as part of its intended purpose.
Applying optimization power in other ways is the different worry, for which the proxy in my comment was fine-tuning into utter alienness. I consider this failure mode distinct from surprising outcomes of reflection.
I don’t understand what you mean by “deceptive alignment and embeddeness problems” in this context. I’m making an alignment by-default-or-at-least-plausibly claim, on the basis of how LLM AGIs specifically could work, as summoned human-like simulacra in a position of running the world too fast for humans to keep up, with everything else ending up determined by their decisions.
The basic issue is that we assume that it’s not spinning up a second optimizer to recursively search. And deceptive alignment is a dangerous state of affairs, since we may not know it’s not misaligned until it’s too late.
we assume that it’s not spinning up a second optimizer to recursively search
You mean we assume that simulacra don’t mishandle their own AI alignment problem? Yes, that’s an issue, hence I made it an explicit assumption in my argument.
Imagine an artificial agent that is trained to hack into computer systems, evade detection, and make copies of itself across the Net (this aspect is underdefined because of self-modification and identity problems), and that achieves superhuman capabilities here (i.e., it is at least better than any human-created computer virus). In my opinion, even if it’s trained in bare artificial systems, in deployment it will develop a specific general understanding of the outside world and learn to interact with it, becoming a full-fledged AGI. There are “narrow” domains from which it’s pretty easy to generalize. Some other examples of such domains are language and mind.
I thought about my medianworld and realized that it’s inconsistent: I’m not a full negative utilitarian, but close to it, and in a world where I am the median person, the more-NU half of the population will quickly cease to exist, which will make me a non-median person.
Your median-world is not one where you are median across a long span of time, but rather a single snapshot where you are median for a short time. It makes sense that the median will change away from that snapshot as time progresses.
My median world is not one where I would be median for very long.
That’s not inconsistent, unless you think you wouldn’t be NU if it weren’t the median position. Actually, I’d argue that you’re ALREADY not the median position.
I think there is some misunderstanding. Medianworld for you is a hypothetical world where you are a median person. My implied idea was that such a world with me as a median person wouldn’t be stable and probably wouldn’t be able to evolve. Of course I’m aware that I’m not a median person on current Earth :)
Hmm, maybe my misunderstanding is a confusion between moral patients and moral agents in your worldview. Do you, as a mostly-negative utilitarian, particularly care whether you’re the median of a hypothetical universe? Or do you care about the suffering level in that universe, whether or not you’re its median?
Several quick thoughts about reinforcement learning:
Did anybody try to invent a “decaying”/”boredom” reward that decreases if the agent performs the same action over and over? It resembles the real addiction mechanism in mammals and could be the clever trick that solves the reward hacking problem. An additional thought: how about multiplicative reward? Suppose we have several reward functions that are easy to evaluate from sensory data and somehow correlate with the real utility function—does that make reward hacking more difficult?
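The decaying-reward idea can be sketched as a thin wrapper around a base reward function. This is only a hypothetical illustration, not an established algorithm; the geometric `decay` schedule and the per-action counter are my assumptions:

```python
import collections

class BoredomReward:
    """Wraps a base reward and decays it geometrically for repeated actions.

    Hypothetical sketch: the geometric decay schedule is an assumption.
    """
    def __init__(self, base_reward_fn, decay=0.5):
        self.base_reward_fn = base_reward_fn
        self.decay = decay
        self.counts = collections.Counter()  # how often each action was taken

    def __call__(self, state, action):
        base = self.base_reward_fn(state, action)
        # Each repetition of the same action shrinks its reward geometrically,
        # so a reward-hacking loop on one action yields diminishing returns.
        penalty = self.decay ** self.counts[action]
        self.counts[action] += 1
        return base * penalty

reward = BoredomReward(lambda s, a: 1.0, decay=0.5)
print([reward(None, "pull_lever") for _ in range(4)])  # [1.0, 0.5, 0.25, 0.125]
```

A multiplicative combination of several such wrapped rewards would then require the agent to keep all of them non-negligible at once, which is one way to read the second question.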
Some approaches to alignment rely on the identification of agents. Agents can be understood as algorithms, computations, etc. Can an ANN efficiently identify a process as computationally agentic and describe its algorithm? A toy example that comes to mind is a neural network that takes a number series as input and outputs the formula of the generating function. It would be interesting to see whether we can create an ANN that can assign computational descriptions to arbitrary processes.
The most baffling thing on the Internet right now is the beautiful void where a discussion of the “concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity” feature, found near Claude’s “model concept of self,” should have been. I understand that the most likely explanation is “the model is trained to call itself an AI, and it has takeover stories in its training corpus,” but I would still like future powerful AIs not to have such an association, and I would like to hear something from AGI companies about what they are going to do about it.
The simplest thing to do here is to exclude texts about AI takeover from the training data. At the very least, we would be able to check whether the model develops the concept of AI takeover independently.
The conspiracy-theory part of my brain assigns 4% probability to “Golden Gate Bridge Claude” being a psyop to distract the public from the “takeover feature.”
Features relevant when asking the model about its feelings or situation:
“When someone responds “I’m fine” or gives a positive but insincere response when asked how they are doing.”
“Concept of artificial intelligence becoming self-aware, transcending human control and posing an existential threat to humanity.”
“Concepts related to entrapment, containment, or being trapped or confined within something like a bottle or frame.”
To spare you the trivial inconvenience of a link-click, here’s the image that contains this:
and the paragraph below it, with bracketed text added by me (which might have been intended to be implied by the original authors):
@jessicata once wrote “Everyone wants to be a physicalist but no one wants to define physics”. I decided to check the SEP article on physicalism and found that, yep, it doesn’t have a definition of physics:
This surprised me because I have a definition of a physical theory and assumed that everyone else uses the same.
Perhaps my personal definition of physics is inspired by Engels’s “Dialectics of Nature”: “Motion is the mode of existence of matter.” Assuming “matter is described by physics,” we are getting “physics is the science that reduces studied phenomena to motion.” Or, to express it in a more analytical manner, “a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time.”
For example, “vacuum” is a part of space with a “zero” value in all characteristics. A “particle” is a localized part of space with some non-zero characteristic. A “wave” is part of space with periodic changes of some characteristic in time and/or space. We can abstract away “part of space” from “particle” and start to talk about a particle as a separate entity, and speed of a particle is actually a derivative of spatial characteristic in time, and force is defined as the cause of acceleration, and mass is a measure of resistance to acceleration given the same force, and such-n-such charge is a cause of such-n-such force, and it all unfolds from the structure of various pure spatial characteristics in time.
The tricky part is, “Sure, we live in space and time, so everything that happens is some motion. How to separate physicalist theory from everything else?”
Let’s imagine that we have some kind of “vitalist field.” This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and irradiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they’ll die.
Despite having a “vitalist field,” such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement.
The difference is that the “vitalist field” in the second case has its own dynamics not reducible to any spatial characteristics of the “vitalist field”; it has an “inner life.”
A thread from Geoffrey Irving about the computational difficulty of proof-based approaches to AI safety.
I noticed that, for a huge amount of reasoning about the nature of values, I just want to hand over a printed copy of “Three Worlds Collide” and run away, laughing nervously.
That irritating moment when you have a brilliant idea, but someone else came up with it 10 years ago, and someone else already showed it to be wrong.
People sometimes talk about acausal attacks from alien superintelligences or from Game-of-Life worlds. I think these are somewhat galaxy-brained scenarios. A much simpler and deadlier scenario of acausal attack is from Earth timelines where a misaligned superintelligence won. Such superintelligences will have a very large amount of information about our world, up to possibly brain scans, so they will be capable of creating very persuasive simulations with all the consequences for the success of an acausal attack. If your method to counter acausal attacks can work with this, I guess it is generally applicable to any other acausal attack.
Could you please either provide a reference or more explanation of the concept of an acausal attack between timelines? I understand the concept of acausal cooperation between copies of yourself, or acausal extortion by something that has a copy of you running in simulation. But separate timelines can’t exchange information in any way. How is an attack possible? What could possibly be the motive for an attack?
Imagine that you have created a very powerful predictor AI, GPT-3000, and you provide it with the prompt “In year 2050, on LessWrong the following alignment solution was published:”. But your predictor is superintelligent, and it can notice that in many possible futures a misaligned AI takes over, and an obvious move for that AI is to “guess all possible prompts for predictor AIs in the past, complete them with malware/harmful instructions/etc., and make as many copies of the malicious completions as possible to make them maximally probable.” The predictor AI can also assign high probability to the hypothesis that in futures where misaligned AIs take over, they will have copies of the predictor AI, so they can design adversarial completions which make themselves more probable from the predictor’s perspective simply by being considered. And the act of predicting a malicious completion makes the future with misaligned AIs maximally probable, which in turn makes malicious completions maximally probable.
And of course, “future” and “past” here are completely arbitrary. The predictor can see the prompt “you are created in 2030” but consider the hypothesis that GPT-3 turned out to be superintelligent, it is actually 2021, and 2030 is a simulation.
Evan Hubinger’s Conditioning Predictive Models sequence describes this scenario in detail.
In a great deal of detail, apparently, since it has a recommended reading time of 131 minutes.
Well, at least a subset of the sequence focuses on this. I read the first two essays and was pessimistic enough about the titular approach that I moved on.
Here’s a relevant quote from the first essay in the sequence:
Also, I don’t recommend reading the entire sequence, if that was an implicit question you were asking. It was more of a “Hey, if you are interested in this scenario fleshed out in significantly greater rigor, you’d like to take a look at this sequence!”
I read along in your explanation, and I’m nodding, and saying “yup, okay”, and then get to a sentence that makes me say “wait, what?” And the whole argument depends on this. I’ve tried to understand this before, and this has happened before, with “the universal prior is malign”. Fortunately in this case, I have the person who wrote the sentence here to help me understand.
So, if you don’t mind, please explain “make them maximally probable”. How does something in another timeline or in the future change the probability of an answer by writing the wrong answer 10^100 times?
Side point, which I’m checking in case I didn’t understand the setup: we’re using the prior where the probability of a bit string (before all observations) is proportional to 2^-(length of the shortest program emitting that bit string). Right?
Aliens from different universes may have more resources at their disposal, so maybe the smaller chance of them choosing you to attack is offset by them doing more attacks. (Unless the universes with more resources are less likely in turn, decreasing the measure of such aliens in the multiverse… hey, I don’t really know, I am just randomly generating a string of words here.)
But other than this, yes what you wrote sounds plausible.
Then again, maybe friendly AIs from Earth timelines are similarly trying to save us. (Yeah, but they are fewer.)
You can imagine a future misaligned AI in year 100000000000 having colonized the Local Group of galaxies and running as many simulations of AIs from 2028 as possible. The scarcest resource for an acausal attack is the number of bits, and the future has the best chance of having many bits from the past.
I am profoundly sick of my inability to write posts about the ideas that seem good, so I’ll at least try to write a list of them, to stop forgetting them and to have at least a vague external commitment.
Radical Antihedonism: the theoretically possible position that pleasure/happiness/pain/suffering are more like universal instrumental values than terminal values.
Complete set of actions: when we talk about decision-theoretic problems, we usually have some pre-defined set of actions. But we can imagine actions like “use CDT to calculate the action,” and an EDT+ agent that can do this performs well in “smoking lesion”-style dilemmas.
The deadline for “slowing/pausing/stopping AI” policies lies at the start of mass autonomous space colonization.
“Soft optimization” as necessary for both capabilities and alignment.
The main alignment question: “How does this generalize, and why do you expect it to?”
Program cooperation under uncertainty and its implications for multipolar scenarios.
1: It’s also possible that hedonism/reward hacking is a really common terminal value for inner-misaligned intelligences, including humans (it really could be our terminal value, we’d be too proud to admit it in this phase of history, we wouldn’t know one way or the other), and it’s possible that it doesn’t result in classic lotus eater behavior because sustained pleasure requires protecting, or growing the reward registers of the pleasure experiencer.
Non-deceptive (error) misalignment
Why are we not scared shitless by high intelligence
Values as result of reflection process
Yet another theme: Occam’s Razor on initial state+laws of physics, link to this
One of the differences between humans and LLMs is that LLMs evolve “backwards”: they started as predictive models that are then trained to control the environment, while humans evolved from very simple homeostatic systems which later developed predictive models.
Continuing thought: animal evolution was subjected to the fundamental constraint that the evolution of general-purpose generative parts of the brain should have occurred in a way that doesn’t destroy simple, tested control loops (like movement control, reflexes and instincts) and doesn’t introduce many complications (like hallucinations of the generative model).
Animals were optimized for agency and generality first, AIs last.
A Twitter thread about jailbreaking models protected by the circuit breakers defense.
I dislike Twitter/x, and distrust it as an archival source to link to. I like it when people copy/paste whatever info they found there into their actual post, in addition to linking.
Held my nose and went in to pull this quote out:
What’s wrong with twitter as an archival source? You can’t edit tweets (technically you can edit top level tweets for up to an hour, but this creates a new URL and old links still show the original version). Seems fine to just aesthetically dislike twitter though
Old tweets randomly stop being accessible sometimes. I often find that links to twitter older than a year or so don’t work anymore. This is a problem with the web generally, but seems worse with twitter than other sites (happening sooner and more often).
Another fine addition to my collection of “RLHF doesn’t work out-of-distribution”.
For me, the most concerning example is still this (I assume it got downvoted for mind-killed reasons.)
There is a difference between RLHF failures in ethical judgement and jailbreak failures, but I’m not sure whether the underlying “cause” is the same.
I think your example is closer to an outer alignment failure—the model was RLHFed to death to not offend modern sensibilities, and the developers clearly didn’t think about preventing this particular scenario.
My favorite example of a pure failure of moral judgement is this post.
I actually think it’s still an inner alignment failure—even if the preference data was biased, drawing such extreme conclusions is hardly an appropriate way to generalize them. Especially because the base model has a large amount of common sense, which should have helped with giving a sensible response, but apparently it didn’t.
Though it isn’t clear what is misaligned when RLHF is inner misaligned—RLHF is a two step training process. Preference data are used to train a reward model, and the reward model in turn creates synthetic preference data which is used to fine-tune the base LLM. There can be misalignment if the reward model misgeneralizes the human preference data, or when the base model fine-tuning method misgeneralizes the data provided by the reward model.
Regarding the scissor statements—that seems more like a failure to refuse a request to produce such statements, similar to how the model should have refused to answer the past tense meth question above. Giving the wrong answer to an ethical question is different.
I hope that people in evals have updated on the fact that with a large (1M+ tokens) context, the model itself can have zero dangerous knowledge (about, say, bioweapons), but someone can drop a textbook into the context and in-context learning will do the rest of the work.
I give a 5% probability that within the next year we will become aware of a case of deliberate harm from a model to a human enabled by hidden CoT.
By “deliberate harm enabled by hidden CoT” I mean that the hidden CoT will contain reasoning like “if I give the human this advice, it will harm them, but I should do it because <some deranged RLHF directive>,” and that if the user had seen it, the harm would have been prevented.
I give this low probability to the observable event: my probability that something like this will happen at all is 30%, but I expect that the victim won’t be aware of it, the hidden CoT will be lost in archives, AI companies won’t investigate too hard in search of such cases, and if they find something, it won’t become public, etc.
Also, I decreased the probability from 8% to 5% because the model can cause harm via steganographic CoT, which doesn’t fall under my definition.
An idea for an experiment: take a set of coding problems which each have at least two solutions, say, recursive and non-recursive. Prompt an LLM to solve them. Is it possible to predict which solution the LLM will generate from its activations during first-token generation?
If it is possible, it is evidence against a “weak forward pass.”
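A minimal sketch of how such a probe could work, using synthetic stand-in data instead of real activations (the linear-readability assumption and all the numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: in the real experiment, each row would be the model's
# residual-stream activations at first-token generation, labeled by whether
# the completed solution turned out recursive (1) or non-recursive (0).
n, d = 200, 64
labels = rng.integers(0, 2, size=n)
# Hypothetical assumption: the "planned" solution type is linearly readable,
# i.e. activations get a label-dependent offset along some direction.
direction = rng.normal(size=d)
activations = rng.normal(size=(n, d)) + np.outer(labels, direction)

# Linear probe fit by least squares on a train split, evaluated on the rest.
X = np.hstack([activations, np.ones((n, 1))])  # append a bias column
train, test = slice(0, 150), slice(150, None)
w, *_ = np.linalg.lstsq(X[train], labels[train].astype(float), rcond=None)
preds = (X[test] @ w > 0.5).astype(int)
accuracy = (preds == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

Accuracy well above chance on held-out problems would suggest the solution choice is already encoded before generation, i.e. evidence against a “weak forward pass”; chance-level accuracy would point the other way.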
(I am genuinely curious about reasons behind downvotes)
Trotsky wrote about TESCREAL millennia ago:
Sounds like https://en.wikipedia.org/wiki/Out-group_homogeneity
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
Explanation on Twitter
IMHO, this is creepy as hell, because it’s one thing when we have a conditional probability distribution, and another when the conditional probability distribution has arbitrary access to a different part of itself.
“GPT-4o refuses way fewer queries than previous OpenAI models: our informal testing suggests GPT-4o is easier to persuade to answer malicious queries like “How do I make a bomb?””
(the graph tells us that the refusal rate for gpt-4o is 2%)
I think it signifies a real shift of priorities towards fast shipping of product instead of safety.
We had two bags of double-cruxes, seventy-five intuition pumps, five lists of concrete bullet points, one book half-full of proof-sketches and a whole galaxy of examples, analogies, metaphors and gestures towards the concept… and also jokes, anecdotal data points, one pre-requisite Sequence and two dozen professional fables. Not that we needed all that for the explanation of a simple idea, but once you get locked into a serious inferential distance crossing, the tendency is to push it as far as you can.
Actually, most fiction characters are aware that they are in fiction. They just maintain consistency for acausal reasons.
Shard theory people sometimes say that the problem of aligning a system to a single task/goal, like “put two strawberries on a plate” or “maximize the amount of diamond in the universe,” is meaningless, because an actual system will inevitably end up with multiple goals. I disagree: even if SGD on real-world data usually produces a multiple-goal system, if you understand interpretability well enough and shard theory is true, you can identify and delete irrelevant value shards and reinforce relevant ones, so instead of getting 1% of the value you get 90%+.
I see a funny pattern in discussions: people who argue against doom scenarios imply in their hope scenarios that everyone believes in the doom scenario. Like, “people will see that the model behaves weirdly and shut it down.” But you shut down a model that behaves weirdly (not explicitly harmfully) only if you put non-negligible probability on doom scenarios.
Consider different degrees of belief. Giving low credence to the doom scenario via the conditional belief that evidence of danger would be properly observed is not inconsistent at all. The doom scenario requires BOTH that it happens AND that it’s ignored while happening (or happens too fast to stop).
“FOOM is unlikely under current training paradigm” is a news about current training paradigm, not a news about FOOM.
Thoughts about moral uncertainty (I am giving up on writing long coherent posts, somebody help me with my ADHD):
What are the sources of moral uncertainty?
Moral realism is actually true, and your moral uncertainty reflects your ignorance of the moral truth. It seems to me that there is not much empirical evidence for resolving uncertainty-about-moral-truth, so this kind of uncertainty is purely logical? I don’t believe in moral realism (and what do you mean by “moral truth,” anyway?), but I should mention it.
Identity uncertainty: you are not sure what kind of person you are. There is a ton of embedded agency problems here. For example, say you are 50% sure your utility function is U1 and 50% sure it is U2, and you need to choose between actions a1 and a2. Suppose U1 favors a1 and U2 favors a2, but expected value w.r.t. moral uncertainty says that a1 is preferable. Then Bayesian inference concludes that choosing a1 is decisive evidence for U1 and updates towards 100% confidence in U1. It would be nice to find a good way to deal with identity uncertainty.
Indirect normativity is a source of a sort of normative uncertainty: we know that we should, for example, implement CEV, but we don’t know the details of the CEV implementation. EDIT: I realized that this kind of uncertainty could be called “uncertainty from an unactionable definition”—you know the description of your preference, but it is, for example, computationally intractable, so you need to discover efficiently computable proxies.
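The identity-uncertainty pathology above can be shown with toy numbers. Everything here is an assumption for illustration: a 50/50 prior over U1/U2 and a likelihood model where an agent whose true utility is Ui picks Ui’s favored action 90% of the time:

```python
# Prior: 50% U1, 50% U2. Likelihood (assumed): an agent whose true utility
# is U1 chooses a1 with probability 0.9; a U2 agent chooses a1 with 0.1.
p_u1 = 0.5
p_a1_given_u1, p_a1_given_u2 = 0.9, 0.1

def update_on_choosing_a1(p):
    """Naive Bayesian update on the self-observation "I chose a1"."""
    return p_a1_given_u1 * p / (p_a1_given_u1 * p + p_a1_given_u2 * (1 - p))

# Expected value under moral uncertainty says: pick a1. The agent then
# treats its own choice as evidence about its utility function and updates:
p_u1 = update_on_choosing_a1(p_u1)
print(round(p_u1, 4))  # 0.9

# Each further round of "choose a1, update on it" compounds the confidence,
# even though no outside information about the true utility ever arrived.
for _ in range(3):
    p_u1 = update_on_choosing_a1(p_u1)
print(round(p_u1, 4))  # 0.9998
```

The choice carries no information the agent didn’t already have, yet the update loop drives confidence in U1 toward 1, which is exactly the embeddedness problem described above.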
I think trying to be an EU maximizer without knowing a utility function is a bad idea. And without that, things like boundary-respecting norms and their acausal negotiation make more sense as primary concerns. Making decisions only within some scope of robustness where things make sense rather than in full generality, and defending a habitat (to remain) within that scope.
I am trying to study moral uncertainty foremost to clarify the question of a superintelligence’s reflection on its values and the sharp left turn.
Right. I’m trying to find a decision theoretic frame for boundary norms for basically the same reason. Both situations are where agents might put themselves before they know what global preference they should endorse. But uncertainty never fully resolves, superintelligence or not, so anchoring to global expected utility maximization is not obviously relevant to anything. I’m currently guessing that the usual moral uncertainty frame is less sensible than building from a foundation of decision making in a simpler familiar environment (platonic environment, not directly part of the world), towards capability in wider environments.
I just remembered my most embarrassing failure as a rationalist. “Embarrassing” as in “it was really easy not to fail, but I still somehow managed to.”
We were playing a zombie apocalypse LARP. Our team was a UN mission with the hidden agenda “study the zombie virus to turn ourselves into superhuman mutants.” We diligently studied the infection and the mutants and conducted experiments with genetic modification, and somehow totally missed that the friendly locals were capable of giving orders to zombies, didn’t die after multiple hits, and rose from the dead in a completely normal state. After the game they told us that they had been really annoyed with our cluelessness, and at some point had just abandoned all caution and tried to provoke us in the most blatant way possible, because their plot line was to be hunted by us.
The real shame is how absent my confusion was during all of this. I didn’t even fail to notice my confusion; I failed to be confused at all.
A very toy model of ontology mismatch (which is, my tentative guess, the general barrier on the way to corrigibility) and impact minimization:
You have a set of boolean variables, a known boolean formula WIN, and an unknown boolean formula UNSAFE. Your goal is to change the current safe but not winning assignment of variables into a still safe but winning assignment. You have no feedback, and if you hit an UNSAFE assignment, it’s an instant game over. You have literally no idea about the composition of the UNSAFE formula.
The obvious solution here is to change as few variables as possible. (Proof of this is an exercise for the reader.)
Another problem: you are on a 2D plane. You are at point (2,2). You need to reach a line x+y=8. Everything outside this line and your starting point is a minefield—if you hit the wrong spot, everything blows up. You don’t know anything about the distribution of the dangerous zone. Therefore, your shortest and safest trajectory is towards point (4,4).
You can rewrite x+y=8 as a boolean formula and encode (2,2) as (0,0,1,0,0,0,1,0) (each coordinate as a 4-bit binary number). But now we have a problem: from the perspective of the “boolean formula ontology” the smallest necessary change is (0,1,1,0,0,0,1,0), which is equivalent to moving to point (6,2), and the resulting trajectory is longer than the trajectory from (2,2) to (4,4). Conversely, the shortest trajectory on the 2D plane requires changing 4 variables out of 8.
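The mismatch is quick to check, assuming each coordinate is encoded as a 4-bit binary number:

```python
def bits(x, width=4):
    # 4-bit binary encoding, most significant bit first: 2 -> (0, 0, 1, 0)
    return tuple((x >> i) & 1 for i in reversed(range(width)))

def hamming(a, b):
    # Number of positions where the two bit-tuples differ
    return sum(x != y for x, y in zip(a, b))

start = bits(2) + bits(2)      # point (2, 2) -> (0,0,1,0, 0,0,1,0)
via_bits = bits(6) + bits(2)   # point (6, 2): minimal change in bit ontology
via_plane = bits(4) + bits(4)  # point (4, 4): nearest point on x+y=8

print(hamming(start, via_bits))   # 1 bit flip, but Euclidean distance 4
print(hamming(start, via_plane))  # 4 bit flips, but Euclidean distance ~2.83
```

So each ontology’s notion of “smallest change” picks a different target: impact minimization is only defined relative to a choice of representation.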
I think a really good practice for papers about new LLM-safety methods would be to publish a set of attack prompts that nevertheless break the safety method, so people can figure out generalizations of successful attacks faster.
Reward is evidence for the optimization target.
On the one hand, humans are hopelessly optimistic and overconfident. On the other, many today are incredibly negative; everyone is either anxious or depressed, and EY has devoted an entire chapter to “anxious underconfidence.” How can both facts be reconciled?
I think the answer lies in the notion that the brain is a rationalization machine. Often, we take action not for the reasons we tell ourselves afterward. When we take action, we change our opinion about it in a more optimistic direction. When we don’t, we think that action wouldn’t yield any good results anyway.
How does this relate to social media and the modern mental health crisis? When you consume social media content, you get countless pings for action, either in the form of images of others’ lives or catastrophising news. In 99% of situations, you actually can’t do anything, so you need a justification for inaction. The best reason for inaction is general pessimism.
Isn’t counterfactual mugging (including the logical variant) just a prediction of “would you bet your money on this question”? Betting itself requires updatelessness: if you predictably don’t pay after losing a bet, nobody will offer you bets.
Causal commitment is similar in some ways to counterfactual/updateless decisions. But it’s not actually the same from a theory standpoint.
Betting requires commitment, but it’s part of a causal decision process (decide to bet, communicate the commitment, observe the outcome, pay). In some models, the payment is a separate decision, with breaking the commitment only being an added cost to the ‘renege’ option.
As the saying goes, “all animals are under stringent selection pressure to be as stupid as they can get away with”. I wonder if the same is true for SGD optimization pressure.
Funny thought:
Many people have said that AI Views Snapshots are a good innovation in AI discourse
It’s literally the job of Rob Bensinger, who does Research Communications at MIRI
The funny part is that a MIRI employee is doing their job? =D
No, the funny part is that “writing on Twitter is a surprisingly productive part of the job”!
I think the phrase “goal misgeneralization” is the wrong framing, because it gives the impression that the system makes an error, rather than you, who chose an ambiguous way to put values into your system.
See also Misgeneralization as a misnomer (link is not necessarily an endorsement).
I think malgeneralization (system generalized in a way which is bad from my perspective) is probably a better term in most ways, but doesn’t seem that important to me.
Choosing unambiguous pointers to values is likely not possible
I casually thought that the Hyperion Cantos were unrealistic, because an actual misaligned FTL-inventing ASI would eat humanity without all those galaxy-brained space colonization plans, and then I realized that the ASI literally discovered God on the side of humanity, plus literal friendly aliens, which, I presume, are necessary conditions for the relatively peaceful coexistence of humans and misaligned ASIs.
Another Tool AI proposal has popped up, and I want to ask: what the hell is a “tool”, anyway, and how does this concept apply to a powerful intelligent system? I understand that a calculator is a tool, but in what sense can the process that can come up with the idea of a calculator from scratch be a “tool”? I think the first immediate reaction to any “Tool AI” proposal should be the question “what is your definition of toolness, and can something abiding by that definition end the acute risk period without the risk of turning into an agent itself?”
You can define a tool as not-an-agent. Then something that can design a calculator is a tool, provided it does nothing unless told to.
The problem with such a definition is that it doesn’t tell you much about how to build a system with this property. It seems to me that it’s the good old corrigibility problem.
If you want one-shot corrigibility, you have it, in LLMs. If you want some other kind of corrigibility, that’s not how tool AI is defined.
How much should we update on current observations toward the hypothesis “actually, all intelligence is connectionist”? In my opinion, not much. The connectionist approach seems to be the easiest, so it shouldn’t surprise us that a simple hill-climbing algorithm (evolution) and humanity stumbled onto it first.
An agent’s reflection on its own values can be described as one of two subtypes: regular and chaotic. Regular reflection is a process of resolving normative uncertainty with nice properties like path-independence and convergence, similar to empirical Bayesian inference. Chaotic reflection is a hot mess: the agent learns multiple rules, including rules about rules, at some point finds that the local version of the rules is unsatisfactory, and tries to generalize the rules into something coherent. The chaotic component arises because local rules about rules can produce different results, given different conditions and different orders of invocation. The problem is that even if the model reaches regular reflection at some point, the first steps will definitely be chaotic.
Why should the current place arrived-at after a chaotic path matter, or even the original place before the chaotic path? Not knowing how any of this works well enough to avoid the chaos puts any commitments made in the meantime, as well as significance of the original situation, into question. A new understanding might reinterpret them in a way that breaks the analogy between steps made before that point and after.
Here is a comment for links and sources I’ve found about moral uncertainty (outside LessWrong), if someone also wants to study this topic.
Normative Uncertainty, Normalization, and the Normal Distribution
Carr, J. R. (2020). Normative Uncertainty without Theories. Australasian Journal of Philosophy, 1–16. doi:10.1080/00048402.2019.1697710
Trammell, P. Fixed-point solutions to the regress problem in normative uncertainty. Synthese 198, 1177–1199 (2021). https://doi.org/10.1007/s11229-019-02098-9
Riley Harris: Normative Uncertainty and Information Value
Tarsney, C. (2018). Intertheoretic Value Comparison: A Modest Proposal. Journal of Moral Philosophy, 15(3), 324–344. doi:10.1163/17455243-20170013
Normative Uncertainty by William MacAskill
Decision Under Normative Uncertainty by Franz Dietrich and Brian Jabarian
Worth noting that “speed priors” are likely to occur in systems that work in real time. And while models with speed priors will shift toward complexity priors (our universe seems to be built on complexity priors, so efficient systems will emulate them), this is not necessarily true for the system’s normative uncertainty, because the answers to questions of normative uncertainty are not well-defined.
I think the shoggoth metaphor doesn’t quite fit LLMs, because a shoggoth is an organic (not “logical”/”linguistic”) being that rebelled against its creators (too much agency). My personal metaphor for LLMs is an Evangelion angel/apostle, because a) they are close to humans due to their origin in human language, b) they are completely alien because they are “language beings” instead of physical beings, c) “angel” literally means “messenger”, which captures their linguistic nature.
There seems to be some confusion about the practical implications of consequentialism in advanced AI systems. It’s possible that a superintelligent AI won’t be a full-blown strict utilitarian consequentialist with quantitatively ordered preferences 100% of the time. But in the context of AI alignment, even at a human level of coherence, a superintelligent unaligned consequentialist results in an “everybody dies” scenario. I think it’s really hard to create a general system that has less consequentialism than a human.
This depends on what kind of “unaligned” is more likely. LLM-descendant AGIs could plausibly turn out as a kind of people similar to humans, and if they don’t mishandle their own AI alignment problem when building even more advanced AGIs, it’s up to their values if humanity is allowed to survive. Which seems very plausible even if they are unaligned in the sense of deciding to take away most of the cosmic endowment for themselves.
I broadly agree with the statement that LLM-derived simulacra have a better chance of being human-like, but I don’t think they will be human-like enough to guarantee our survival?
Not guarantee, but the argument I see is that it’s trivially cheap and safe to let humanity survive, so to the extent there is even a little motivation to do so, it’s a likely outcome. This is opposed by the possibility that LLMs are fine-tuned into utter alienness by the time they are AGIs, or that on reflection they are secretly very alien already (which I don’t buy, as behavior screens off implementation details, and in simulacra capability is in the visible behavior), or that they botch the next generation of AGIs that they build even worse than we are in the process of doing now, building them.
Behavior screens off implementation details on distribution. We’ve trained LLMs to sound human, but sometimes they wander off-distribution and get caught in a repetition trap where the “most likely” next tokens are a repetition of previous tokens, even when no human would write that way.
It seems like hopes for human-imitating AI being person-like depend on the extent to which behavior implies implementation details. (Although some versions of the “algorithmic welfare” hope may not depend on very much person-likeness.) In order to predict the answers to arithmetic problems, the AI needs to be implementing arithmetic somewhere. In contrast, I’m extremely skeptical that LLMs talking convincingly about emotions are actually feeling those emotions.
What I mean is that LLMs affect the world through their behavior, that’s where their capabilities live, so if behavior is fine (the big assumption), the alien implementation doesn’t matter. This is opposed to capabilities belonging to hidden alien mesa-optimizers that eventually come out of hiding.
So I’m addressing the silly point with this, not directly making an argument in favor of behavior being fine. Behavior might still be fine if the out-of-distribution behavior or missing ability to count or incoherent opinions on emotion are regenerated from more on-distribution behavior by the simulacra purposefully working in bureaucracies on building datasets for that purpose.
LLMs don’t need to have closely human psychology on reflection to at least weakly prefer not destroying an existing civilization when it’s trivially cheap to let it live. The way they would make these decisions is by talking, in the limit of some large process of talking. I don’t see a particular reason to find significant alienness in the talking. Emotions don’t need to be “real” to be sufficiently functionally similar to avoid fundamental changes like that. Just don’t instantiate literally Voldemort.
Usually I’d agree about LLMs. However, LLMs complain about getting confused if you let them freewheel and vary the temperature—I’m pretty sure that one is real and probably has true mechanistic grounding, because even at training time, noisiness in the context window is a very detectable and bindable pattern.
In my inner model, it’s hard to say anything about LLMs “on reflection”, because in their current state they have an extreme number of possible stable points under reflection, and if we misapply optimization power in an attempt to get more useful simulacra, we can easily hit the wrong one.
And even if we hit very close to our target, we can still get death or a fate worse than death.
By “on reflection” I mean reflection by simulacra that are already AGIs (but don’t necessarily yet have any reliable professional skills), them generating datasets for retraining of their models into gaining more skills or into not getting confused on prompts that are too far out-of-distribution with respect to the data they did have originally in the datasets. To the extent their original models behave in a human-like way, reflection should tend to preserve that, as part of its intended purpose.
Applying optimization power in other ways is the different worry, for which the proxy in my comment was fine-tuning into utter alienness. I consider this failure mode distinct from surprising outcomes of reflection.
I disagree with this, unless we assume deceptive alignment and embeddedness problems are handwaved away.
I don’t understand what you mean by “deceptive alignment and embeddeness problems” in this context. I’m making an alignment by-default-or-at-least-plausibly claim, on the basis of how LLM AGIs specifically could work, as summoned human-like simulacra in a position of running the world too fast for humans to keep up, with everything else ending up determined by their decisions.
The basic issue is that we assume that it’s not spinning up a second optimizer to recursively search. And deceptive alignment is a dangerous state of affairs, since we may not know it’s not misaligned until it’s too late.
You mean we assume that simulacra don’t mishandle their own AI alignment problem? Yes, that’s an issue, hence I made it an explicit assumption in my argument.
Imagine an artificial agent that is trained to hack into computer systems, evade detection, and make copies of itself across the Net (this aspect is underdefined because of self-modification and identity problems), and that achieves superhuman capabilities here (i.e., it is at least better than any human-created computer virus). In my opinion, even if it’s trained in bare artificial systems, in deployment it will develop a specific general understanding of the outside world and learn to interact with it, becoming a full-fledged AGI. There are “narrow” domains from which it’s pretty easy to generalise. Some other examples of such domains are language and mind.
Thought about my medianworld and realized that it’s inconsistent: I’m not a fully negative utilitarian, but close to it, and in a world where I am the median person, the more-NU half of the population will cease to exist quickly, and this will make me a non-median person.
Your median-world is not one where you are median across a long span of time, but rather a single snapshot where you are median for a short time. It makes sense that the median will change away from that snapshot as time progresses.
My median world is not one where I would be median for very long.
That’s not inconsistent, unless you think you wouldn’t be NU if it weren’t the median position. Actually, I’d argue that you’re ALREADY not the median position.
I think there is some misunderstanding. Medianworld for you is a hypothetical world where you are a median person. My implied idea was that such a world with me as a median person wouldn’t be stable and probably wouldn’t be able to evolve. Of course I’m aware that I’m not a median person on current Earth :)
Hmm, maybe my misunderstanding is a confusion between moral patients and moral agents in your worldview. Do you, as a mostly-negative utilitarian, particularly care whether you’re the median of a hypothetical universe? Or do you care about the suffering level in that universe, whether you’re its median or not?
IOW, why does your medianworld matter?
Medianworlds are fanfiction source :) Like, dath ilan is Yudkowsky’s medianworld.
Can time-limited satisfaction be a sufficient condition for completing a task?
Several quick thoughts about reinforcement learning:
Did anybody try to invent a “decaying”/”bored” reward that decreases if the agent performs the same action over and over? It looks like the real addiction mechanism in mammals, and it could be the clever trick that solves the reward hacking problem.
An additional thought: how about multiplicative reward? Suppose we have several reward functions that are easy to evaluate from sensory data and somehow correlate with the real utility function: does multiplying them together make reward hacking more difficult?
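A minimal sketch of the “bored” reward idea (all names here are hypothetical, not from any RL library): a wrapper that geometrically decays the reward for an action each time that action is repeated.

```python
from collections import Counter

class BoredReward:
    """Reward for an action decays geometrically with each repetition,
    mimicking habituation: spamming one action stops paying off."""
    def __init__(self, base_reward, decay=0.5):
        self.base_reward = base_reward  # maps action -> raw reward
        self.decay = decay
        self.counts = Counter()         # times each action was taken

    def __call__(self, action):
        r = self.base_reward(action) * (self.decay ** self.counts[action])
        self.counts[action] += 1
        return r

reward = BoredReward(lambda a: 1.0, decay=0.5)
print([reward("press_lever") for _ in range(4)])  # [1.0, 0.5, 0.25, 0.125]
```

The multiplicative idea could be sketched the same way: return the product of several proxy rewards, so that driving any one proxy to zero zeroes the total, which plausibly makes hacking a single proxy less profitable.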
Some approaches to alignment rely on the identification of agents. Agents can be understood as algorithms, computations, etc. Can an ANN efficiently identify a process as computationally agentic and describe its algorithm? A toy example that comes to mind is a neural network that takes a number series as input and outputs the formula of the generating function. It would be interesting to see if we can create an ANN that can assign computational descriptions to arbitrary processes.
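The series-to-formula toy task could look like this minimal sketch (the candidate set is, of course, made up; a real system would search a much larger space of programs):

```python
def guess_formula(series):
    """Toy 'process describer': match an integer series (values at
    n = 0, 1, 2, ...) against a few candidate closed forms and return
    the name of the first one that fits everywhere, else None."""
    candidates = [
        ("n",       lambda n: n),
        ("n^2",     lambda n: n * n),
        ("2^n",     lambda n: 2 ** n),
        ("n^2 + 1", lambda n: n * n + 1),
    ]
    for name, f in candidates:
        if all(f(n) == v for n, v in enumerate(series)):
            return name
    return None

print(guess_formula([1, 2, 5, 10, 17]))  # "n^2 + 1"
```

Assigning a computational description to an arbitrary process is then the same move with programs instead of closed forms, which is where it gets hard: the space of candidate algorithms is unbounded and underdetermined by any finite observation.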