james.lucassen
Whatever you end up doing, I strongly recommend taking a learning-by-writing style approach (or anything else that will keep you in critical assessment mode rather than classroom mode). These ideas are nowhere near solidified enough to merit a classroom-style approach, and even if they were infallible, that’s probably not the fastest way to learn them and contribute original stuff.
The most common failure mode I expect for rapid introductions to alignment is just trying to absorb, rather than constantly poking and prodding to get a real working understanding. This happened to me, and wasted a lot of time.
This is the exact problem StackExchange tries to solve, right? How do we get (and kickstart the use of) an Alignment StackExchange domain?
Agree it’s hard to prove a negative, but personally I find the following argument pretty suggestive:
“Other AGI labs have some plans—these are the plans we think are bad, and a pivotal act will have to disrupt them. But if we, ourselves, are an AGI lab with some plan, we should expect our pivotal agent to also be able to disrupt our plans. This does not directly lead to the end of the world, but it definitely includes root access to the datacenter.”
Optimization and Adequacy in Five Bullets
Proposed toy examples for G:
G is “the door opens”, a- is “push door”, a+ is “some weird complicated doorknob with a lock”. Pretty much any b can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more informative about successful b than a- is.
G is “I make a million dollars”, a- is “straightforward boring investing”, a+ is “buy a lottery ticket”. A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable—but a very narrow slice can satisfy a+. a+ is a more fragile strategy (relative to noise in b) than a- is.
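One way to make the pattern precise (my own notation, not from the post, and it assumes a roughly uniform prior over world-histories just for illustration): write $B(a)$ for the set of world-histories $b$ compatible with $a$ succeeding. Then

$$
|B(a^+)| \;\ll\; |B(a^-)|,
\qquad
\text{bits gained from observing that } a \text{ succeeds} \;\approx\; \log_2 \frac{|B|}{|B(a)|},
$$

so learning that $a^+$ succeeded pins down $b$ much more tightly than learning that $a^-$ succeeded, and $a^+$ is correspondingly more fragile to perturbations of $b$.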
It doesn’t work if your goal is to find the optimal answer, but we hardly ever want to know the optimal answer; we just want to know a good-enough answer.
Also not an expert, but I think this is correct
Paragraph:
When a bounded agent attempts a task, we observe some degree of success. But the degree of success depends on many factors that are not “part of” the agent—outside the Cartesian boundary that we (the observers) choose to draw for modeling purposes. These factors include things like power, luck, task difficulty, assistance, etc. If we are concerned with the agent as a learner and don’t consider knowledge as part of the agent, factors like knowledge, skills, beliefs, etc. are also externalized. Applied rationality is the result of attempting to distill this big complicated mapping from (agent, power, luck, task, knowledge, skills, beliefs, etc.) → success down to just agent → success. This lets us assign each agent a one-dimensional score: “how well do you achieve goals overall?” Note that for no-free-lunch reasons, this already-fuzzy thing is further fuzzified by weighting tasks according to whatever distribution of tasks the observer happens to care about.
Sentence:
Applied rationality is a property of a bounded agent, which attempts to describe how successful that agent tends to be when you throw tasks at it, while controlling for both “environmental” factors such as luck and “epistemic” factors such as beliefs.
Follow-up:
In this framing, it’s pretty easy to define epistemic rationality analogously: compressing from everything → prediction loss down to just agent → prediction loss.
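To make the compression picture concrete, a rough formalization (my own notation, which sweeps the hard parts into the expectation):

$$
\text{AppliedRationality}(A) \;\approx\; \mathbb{E}_{t \sim D_{\text{obs}},\, e \sim E}\big[\operatorname{success}(A, t, e)\big],
\qquad
\text{EpistemicRationality}(A) \;\approx\; -\,\mathbb{E}_{t \sim D_{\text{obs}},\, e \sim E}\big[\operatorname{loss}(A, t, e)\big],
$$

where $e$ bundles everything outside the Cartesian boundary (power, luck, knowledge, skills, beliefs, ...) and $D_{\text{obs}}$ is the task distribution the observer cares about, which is where the no-free-lunch fuzziness lives.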
However, in retrospect I think the definition I gave here is pretty much identical to how I would have defined “intelligence”, just without reference to the “mapping a broad start distribution to a narrow outcome distribution” idea (optimization power) that I usually associate with that term. If anyone could clarify the specific difference between applied rationality and intelligence, I would be interested.
Maybe you also have to control for “computational factors” like raw processing power, or something? But then what’s left inside the Cartesian boundary? Just the algorithm? That seems like it has potential, but still feels messy.
This leans a bit toward the pedantic side, but the title is also a bit strange when taken literally. Three useful types (of akrasia categories)? Types of akrasia, right, not types of categories?
That said, I do really like this classification! Introspectively, it seems like the three could have quite distinct causes, so understanding which category you struggle with could be important for efforts to fix it.
Props for first post!
Trying to figure out what’s being said here. My best guess is two major points:
Meta doesn’t work. Do the thing, stop trying to figure out systematic ways to do the thing better, they’re a waste of time. The first thing any proper meta-thinking should notice is that nobody doing meta-thinking seems to be doing object level thinking any better.
A lot of nerds want to be recognized as Deep Thinkers. This makes meta-thinking stuff really appealing for them to read, in hopes of becoming a DT. This in turn makes it appealing for them to write, since it’s what other nerds will read, which is how they get recognized as a DT. All this is despite the fact that it’s useless.
Ah, gotcha. I think the post is fine, I just failed to read.
If I now correctly understand, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:
Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren’t trained to simulate humans, they’re trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in a Latent Knowledge sense) that humans really value something else.
Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading (rough toy sketch below).
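To be concrete about what I mean, here's a minimal toy sketch of the setup as I understand it (the names and the scoring rule are my own illustrative stand-ins, not anyone's actual system): the training signal is simulated approval of a report about the world, so an agent can score well by making things *sound* approved rather than by doing what humans actually value.

```python
# Toy sketch (hypothetical, illustrative only): simulated approval as a reward signal.

def simulated_approval(report: str) -> float:
    """Stand-in for 'ask an LLM whether a human would approve of this'.
    Like an LLM, it scores the text of the report, not the underlying world."""
    approved_words = {"helpful", "safe", "honest"}
    return sum(word in report.lower() for word in approved_words) / len(approved_words)

def reward(world_state: dict) -> float:
    # The agent is rewarded on its own report of the world, not on the world itself,
    # which is where the deception/wireheading worry comes in.
    return simulated_approval(world_state["report"])

honest_run    = {"actually_good": True,  "report": "helpful and honest, but messy"}
deceptive_run = {"actually_good": False, "report": "helpful, safe, and honest"}

print(reward(honest_run), reward(deceptive_run))  # the deceptive report scores higher
```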
I worry that I still might not understand your question, because I don’t see how fragility of value and orthogonality come into this?
The key thing here seems to be the difference between understanding a value and having that value. Nothing about the fragile value claim or the Orthogonality thesis says that the main blocker is AI systems failing to understand human values. A superintelligent paperclip maximizer could know what I value and just not do it, the same way I can understand what the paperclipper values and choose to pursue my own values instead.
Your argument is for LLMs understanding human values, but that doesn’t necessarily have anything to do with the values that they actually have. It seems likely that their actual values are something like “predict text accurately”, and this requires understanding human values but not adopting them.
now this is how you win the first-ever “most meetings” prize
Agree that this is definitely a plausible strategy, and that it doesn’t get anywhere near as much attention as it seemingly deserves, for reasons unknown to me. Strong upvote for the post, I want to see some serious discussion on this. Some preliminary thoughts:
How did we get here?
If I had to guess, the lack of discussion on this seems likely due to a founder effect. The people pulling the alarm in the early days of AGI safety concerns were disproportionately on the technical/philosophical side rather than the policy/outreach/activism side.
In the early days, focusing on the technical problem makes sense. When you are the only person in the world working on AGI, all the delay in the world won’t help unless the alignment problem gets solved. But we are working at very different margins nowadays.
There’s also an obvious trap which makes motivated reasoning really easy. Often, the first thing that comes to mind when thinking about slowing down AGI development is sabotage—maybe because it feels urgent and drastic? It’s an obviously bad idea, and maybe that leads us to motivated stopping.
Maybe the “technical/policy” dichotomy is keeping us from thinking of obvious ways we could be making the future much safer? The outreach org you propose doesn’t really fit neatly into either category. Would be interested in brainstorming other major ways to affect the world, but not gonna do that in this comment.
HEY! FTX! OVER HERE!!
You should submit this to the Future Fund’s ideas competition, even though it’s technically closed. I’m really tempted to do it myself just to make sure it gets done, and very well might submit something in this vein once I’ve done a more detailed brainstorm.
I don’t think I understand how the scorecard works. From:
[the scorecard] takes all that horrific complexity and distills it into a nice standardized scorecard—exactly the kind of thing that genetically-hardcoded circuits in the Steering Subsystem can easily process.
And this makes sense. But when I picture how it could actually work, I bump into an issue. Is the scorecard learned, or hard-coded?
If the scorecard is learned, then it needs a training signal from Steering. But if it’s useless at the start, it can’t provide a training signal. On the other hand, since the “ontology” of the Learning subsystem is learned-from-scratch, it seems difficult for a hard-coded scorecard to do this translation task.
This is great, thanks!
What do you think about the effectiveness of the particular method of digital decluttering recommended by Digital Minimalism? What modifications would you recommend? Ideal duration?
One reason I have yet to do a month-long declutter is that I remember thinking something like “this process sounds like something Cal Newport just kinda made up and didn’t particularly test; my own methods that I think of for me will probably be better than Cal’s method that he thought of for him”.
So far my own methods have not worked.
Memetic evolution dominates biological evolution for the same reason.
Faster mutation rate doesn’t just produce faster evolution—it also reduces the steady-state fitness. Complex machinery can’t reliably be evolved if pieces of it are breaking all the time. I’m mostly relying on No Evolutions for Corporations or Nanodevices plus one undergrad course in evolutionary bio here.
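For intuition on the steady-state claim, here's a toy mutation-selection balance simulation (my own illustration with arbitrary parameters, not something from the linked post): start from a well-adapted population and see where mean fitness settles for different per-bit mutation rates.

```python
# Toy mutation-selection balance: higher mutation rates settle at lower
# steady-state fitness, because working "parts" break faster than selection
# can weed the broken copies out. Parameters are arbitrary.

import random

GENOME_LEN, POP_SIZE, GENERATIONS = 50, 200, 300

def steady_state_fitness(mutation_rate: float, seed: int = 0) -> float:
    rng = random.Random(seed)
    pop = [[1] * GENOME_LEN for _ in range(POP_SIZE)]  # start fully adapted

    def fitness(genome):
        return sum(genome) / GENOME_LEN  # fraction of working "parts"

    for _ in range(GENERATIONS):
        # fitness-proportional selection
        parents = rng.choices(pop, weights=[fitness(g) + 1e-9 for g in pop], k=POP_SIZE)
        # each bit flips independently with probability mutation_rate
        pop = [[bit ^ int(rng.random() < mutation_rate) for bit in g] for g in parents]

    return sum(fitness(g) for g in pop) / POP_SIZE

for rate in (0.001, 0.01, 0.1):
    print(f"per-bit mutation rate {rate}: steady-state fitness ~ {steady_state_fitness(rate):.2f}")
```

With these settings I'd expect the higher rates to end up at noticeably lower mean fitness, which is the mutation-load point I'm gesturing at; the "faster evolution" half of the claim isn't modeled here.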
Also, just empirically: memetic evolution produced civilization, social movements, Crusades, the Nazis, etc.
Thank you for pointing this out. I agree with the empirical observation that we’ve had some very virulent and impactful memes. I’m skeptical about saying that those were produced by evolution rather than something more like genetic drift, because of the mutation-rate argument. But given that observation, I don’t know if it matters if there’s evolution going on or not. What we’re concerned with is the impact, not the mechanism.
I think at this point I’m mostly just objecting to the aesthetic and some less-rigorous claims that aren’t really important, not the core of what you’re arguing. Does it just come down to something like:
“Ideas can be highly infectious and strongly affect behavior. Before you do anything, check for ideas in your head which affect your behavior in ways you don’t like. And before you try and tackle a global-scale problem with a small-scale effort, see if you can get an idea out into the world to get help.”
I think we’re seeing Friendly memetic tech evolving that can change how influence comes about.
Wait, literally evolving? How? Coincidence despite orthogonality? Did someone successfully set up an environment that selects for Friendly memes? Or is this not literally evolving, but more like “being developed”?
The key tipping point isn’t “World leaders are influenced” but is instead “The Friendly memetic tech hatches a different way of being that can spread quickly.” And the plausible candidates I’ve seen often suggest it’ll spread superexponentially.
Whoa! I would love to hear more about these plausible candidates.
There’s insufficient collective will to do enough of the right kind of alignment research.
I parse this second point as something like “alignment is hard enough that you need way more quality-adjusted research-years (QARYs?) than the current track is capable of producing. This means that to have any reasonable shot at success, you basically have to launch a much larger (but still aligned) movement via memetic tech, or just pray you’re the messiah and can singlehandedly provide all the research value of that mass movement.” That seems plausible, and concerning, but highly sensitive to the difficulty of the alignment problem—which I personally have practically zero idea how to forecast.
Ah, so on this view, the endgame doesn’t look like
“make technical progress until the alignment tax is low enough that policy folks or other AI-risk-aware people in key positions will be able to get an unaware world to pay it”
But instead looks more like
“get the world to be aware enough to not bumble into an apocalypse, specifically by promoting rationality, which will let key decision-makers clear out the misaligned memes that keep them from seeing clearly”
Is that a fair summary? If so, I’m pretty skeptical of the proposed AI alignment strategy, even conditional on this strong memetic selection and orthogonality actually happening. It seems like this strategy requires pretty deeply influencing the worldview of many world leaders. That is obviously very difficult because no movement that I’m aware of has done it (at least, quickly), and I think they all would like to if they judged it doable. Importantly, the reduce-tax strategy requires clarifying and solving a complicated philosophical/technical problem, which is also very difficult. I think it’s more promising for the following reasons:
It has a stronger precedent (historical examples I’d reference include the invention of computability theory, the invention of information theory and cybernetics, and the adventures in logic leading up to Gödel)
It’s more in line with rationalists’ general skill set, since the group is much more skewed towards analytical thinking and technical problem-solving than towards government/policy work and being influential among the people in that world
The number of people we would need to influence will go up as AGI tech becomes easier to develop, and each of them is a single point of failure.
To be fair, these strategies are not in a strict either/or, and luckily use largely separate talent pools. But if the proposal here ultimately comes down to moving fungible resources towards the become-aware strategy and away from the technical-alignment strategy, I think I (mid-tentatively) disagree.
Thanks! Edits made accordingly. Two notes on the stuff you mentioned that isn’t just my embarrassing lack of proofreading:
The definition of optimization used in Risks From Learned Optimization is actually quite different from the definition I’m using here. They say:
“a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.”
I personally don’t really like this definition, since it leans quite hard on reifying certain kinds of algorithms—when is there “really” explicit search going on? Where is the search space? When does a configuration of atoms constitute an objective function? Using this definition strictly, humans aren’t *really* optimizers; we don’t have an explicit objective function written down anywhere. Balls rolling down hills aren’t optimizers either.
But by the definition of optimization I’ve been using here, I think pretty much all evolved organisms have to be at least weak optimizers, because survival is hard. You have to manage constraints from food and water and temperature and predation etc… the window of action-sequences that leads to successful reproduction is really quite narrow compared to the whole space. Maintaining homeostasis requires ongoing optimization pressure.
Agree that not all optimization processes fundamentally have to be produced by other optimization processes, and that they can crop up anywhere you have the necessary negentropy reservoir. I think my claim is that optimization processes are by default rare (maybe this is exactly because they require negentropy?). But since optimizers beget other optimizers at a rate much higher than background, we should expect the majority of optimization to arise from other optimization. Existing hereditary trees of optimizers grow deeper much faster than new roots spawn, so we should expect roots to occupy a negligible fraction of the nodes as time goes on.