See something I’ve written which you disagree with? I’m experimenting with offering cash prizes of up to US$1000 to anyone who changes my mind about something I consider important. Message me our disagreement and I’ll tell you how much I’ll pay if you change my mind + details :-) (EDIT: I’m not logging into Less Wrong very often now, it might take me a while to see your message—I’m still interested though)
John_Maxwell
For what it’s worth, I often find Eliezer’s arguments unpersuasive because they seem shallow. For example:
The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.
This seems like a fuzzy “outside view” sort of argument. (Compare with: “A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways.” On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)
I’m not saying Eliezer’s conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.
(I can provide other examples of shallow-seeming arguments if desired.)
As the proposal stands it seems like the AI’s predictions of human thoughts would offer no relevant information about how the AI is predicting the non-thought story content, since the AI could be predicting these different pieces of content through unrelated mechanisms.
Might depend whether the “thought” part comes before or after particular story text. If the “thought” comes after that story text, then it’s generated conditional on that text, essentially a rationalization of that text from a hypothetical DM’s point of view. If it comes before that story text, then the story is being generated conditional on it.
Personally I think I might go for a two-phase process. Do the task with a lot of transparent detail in phase 1. Summarize that detail and filter out infohazards in phase 2, but link from the summary to the detailed version so a human can check things as needed (flagging links to plausible infohazards). (I guess you could flag links to parts that seemed especially likely to be incorrigible/manipulative cognition, or parts of the summary that the summarizer was less confident in, as well.)
I updated the post to note that if you want voting rights in Google, it seems you should buy $GOOGL not $GOOG. Sorry! Luckily they are about the same price, and you can easily dump your $GOOG for $GOOGL. In fact, it looks like $GOOGL is $6 cheaper than $GOOG right now? Perhaps because it is less liquid?
Fraud also seems like the kind of problem you can address as it comes up. And I suspect just requiring people to take a salary cut is a fairly effective way to filter for idealism.
All you have to do to distract fraudsters is put a list of poorly run software companies where you can get paid more money to work less hard at the top of the application ;-) How many fraudsters would be silly enough to bother with a fraud opportunity that wasn’t on the Pareto frontier?
The problem comes when one tries to pour a lot of money into that sort of approach
It seems to me that the Goodhart effect is actually stronger if you’re granting less money.
Suppose that we have a population of people who are keen to work on AI safety. Suppose every time a person from that population gets an application for funding rejected, they lose a bit of the idealism which initially drew them to the area and they start having a few more cynical thoughts like “my guess is that grantmakers want to fund X, maybe I should try to be more like X even though I don’t personally think X is a great idea.”
In that case, the level of Goodharting seems to be pretty much directly proportional to the number of rejections—and the less funding available, the greater the quantity of rejections.
On the other hand, if the United Nations got together tomorrow and decided to fund a worldwide UBI, there’d be no optimization pressure at all, and people would just do whatever seemed best to them personally.
EDIT: This appears to be a concrete example of what I’m describing
I think if you’re in the early stages of a big project, like founding a pre-paradigmatic field, it often makes sense to be very breadth-first. You can save a lot of time trying to understand the broad contours of solution space before you get too deeply invested in a particular approach.
I think this can even be seen at the microscale (e.g. I was coaching someone on how to solve leetcode problems the other day, and he said my most valuable tip was to brainstorm several different approaches before exploring any one approach in depth). But it really shines at the macroscale (“you built entirely the wrong product because you didn’t spend enough time talking to customers and exploring the space of potential offerings in a breadth-first way”).
One caveat is that breadth-first works best if you have a good heuristic. For example, if someone with less than a year of programming experience was practicing leetcode problems, I wouldn’t emphasize the importance of brainstorming multiple approaches as much, because I wouldn’t expect them to have a well-developed intuition for which approaches will work best. For someone like that, I might recommend going depth-first almost at random until their intuition is developed (random rollouts in the context of Monte Carlo tree search are a related notion). I think there is actually some psych research showing that more experienced engineers will spend more time going breadth-first at the beginning of a project.
A synthesis of the above is: if AI safety is pre-paradigmatic, we want lots of people exploring a lot of different directions. That lets us understand the broad contours better, and also collects data to help refine our intuitions.
IMO the AI safety community has historically not been great at going breadth-first, e.g. investing a lot of effort in the early days into decision theory stuff which has lately become less fashionable. I also think people are overconfident in their intuitions about what will work, relative to the amount of time which has been spent going depth-first and trying to work out details related to “random” proposals.
In terms of turning money into AI safety, this strategy is “embarrassingly parallel” in the sense that it doesn’t require anyone to wait for a standard textbook or training program, or get supervision from some critical person. In fact, having a standard curriculum or a standard supervisor could be counterproductive, since it gets people anchored on a particular frame, which means a less broad area gets explored. If there has to be central coordination, it seems better to make a giant list of literatures which could provide insight, then assign each literature to a particular researcher to acquire expertise in.
After doing parallel exploration, we could do a reduction tree. Imagine if we ran an AI safety tournament where you could sign up as “red team”, “blue team”, or “judge”. At each stage, we generate tuples of (red player, blue player, judge) at random and put them in a video call or a Google Doc. The blue player tries to make a proposal, the red player tries to break it, the judge tries to figure out who won. Select the strongest players on each team at each stage and have them advance to the next stage, until you’re left with the very best proposals and the very most difficult to solve issues. Then focus attention on breaking those proposals / solving those issues.
Yes, I tried it. It gave me a headache but I would guess that’s not common. Think it’s probably a decent place to start.
I didn’t end up sticking to this because of various life disruptions. I think it was a bit helpful but I’m planning to try something more intensive next time.
I’m glad you are thinking about this. I am very optimistic about AI alignment research along these lines. However, I’m inclined to think that the strong form of the natural abstraction hypothesis is pretty much false. Different languages and different cultures, and even different academic fields within a single culture (or different researchers within a single academic field), come up with different abstractions. See for example lsusr’s posts on the color blue or the flexibility of abstract concepts. (The Whorf hypothesis might also be worth looking into.)
This is despite humans having pretty much identical cognitive architectures (and it seems unrealistic to assume we can create a de novo AGI with a cognitive architecture as similar to a human brain as human brains are to each other). Perhaps you could argue that some human-generated abstractions are “natural” and others aren’t, but that leaves the problem of ensuring that the human operating our AI is making use of the correct, “natural” abstractions in their own thinking. (Some ancient cultures lacked a concept of the number 0. From our perspective, and that of a superintelligent AGI, 0 is a ‘natural’ abstraction. But there could be ways in which the superintelligent AGI invents ‘natural’ abstractions that we haven’t yet invented, such that we are living in a “pre-0 culture” with respect to those abstractions, and this would cause an ontological mismatch between us and our AGI.)
But I’m still optimistic about the overall research direction. One reason is if your dataset contains human-generated artifacts, e.g. pictures with captions written in English, then many unsupervised learning methods will naturally be incentivized to learn English-language abstractions to minimize reconstruction error. (For example, if we’re using self-supervised learning, our system will be incentivized to correctly predict the English-language caption beneath an image, which essentially requires the system to understand the picture in terms of English-language abstractions. This incentive would also arise for the more structured supervised learning task of image captioning, but the results might not be as robust.)
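To make that incentive a bit more concrete, here is a minimal sketch of a CLIP-style contrastive objective (my own illustrative choice, not something from the post): the image encoder can only drive this loss down by organizing images according to the same distinctions the English captions draw.

```python
import torch
import torch.nn.functional as F

def caption_contrastive_loss(image_emb, caption_emb, temperature=0.07):
    """Contrastive image/caption loss over a batch of matched pairs."""
    # Cosine-normalize both sets of embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    # Similarity of every image to every caption in the batch.
    logits = image_emb @ caption_emb.T / temperature
    # The true caption for image i sits on the diagonal.
    targets = torch.arange(len(image_emb), device=image_emb.device)
    # Symmetric cross-entropy: images must pick out their captions, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```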
This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.
Social sciences are a notable exception here. And I think social sciences (or even humanities) may be the best model for alignment—‘human values’ and ‘corrigibility’ seem related to the subject matter of these fields.
Anyway, I had a few other comments on the rest of what you wrote, but I realized what they all boiled down to was me having a different set of abstractions in this domain than the ones you presented. So as an object lesson in how people can have different abstractions (heh), I’ll describe my abstractions (as they relate to the topic of abstractions) and then explain how they relate to some of the things you wrote.
I’m thinking in terms of minimizing some sort of loss function that looks vaguely like

`reconstruction_error + other_stuff`

where `reconstruction_error` is a measure of how well we’re able to recreate observed data after running it through our abstractions, and `other_stuff` is the part that is supposed to induce our representations to be "useful" rather than just "predictive". You keep talking about conditional independence as the be-all-end-all of abstraction, but from my perspective, it is an interesting (potentially novel!) option for the `other_stuff` term in the loss function. The same way dropout was once an interesting and novel `other_stuff` which helped supervised learning generalize better (making neural nets "useful" rather than just "predictive" on their training set).

The most conventional choice for `other_stuff` would probably be some measure of the complexity of the abstraction. E.g. a clustering algorithm’s complexity can be controlled through the number of centroids, or an autoencoder’s complexity can be controlled through the number of latent dimensions. Marcus Hutter seems to be as enamored with compression as you are with conditional independence, to the point where he created the Hutter Prize, which offers half a million dollars to the person who can best compress a 1GB file of Wikipedia text.

Another option for `other_stuff` would be denoising, as we discussed here.

You speak of an experiment to "run a reasonably-detailed low-level simulation of something realistic; see if info-at-a-distance is low-dimensional". My guess is if the `other_stuff` in your loss function consists only of conditional independence things, your representation won’t be particularly low-dimensional—your representation will see no reason to avoid the use of 100 practically-redundant dimensions when one would do the job just as well.

Similarly, you speak of "a system which provably learns all learnable abstractions", but I’m not exactly sure what this would look like, seeing as how for pretty much any abstraction, I expect you can add a bit of junk code that marginally decreases the reconstruction error by overfitting some aspect of your training set. Or even junk code that never gets run / other functional equivalences.
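To make the `reconstruction_error + other_stuff` framing concrete, here’s a minimal autoencoder-style sketch (my own toy illustration; the L1 sparsity penalty is just a stand-in for whichever `other_stuff` term you prefer, whether that’s a complexity penalty, denoising, or a conditional-independence term):

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(input_dim, latent_dim)   # data -> abstraction
        self.decoder = nn.Linear(latent_dim, input_dim)   # abstraction -> reconstruction

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def loss_fn(model, x, weight=1e-3):
    x_hat, z = model(x)
    reconstruction_error = ((x - x_hat) ** 2).mean()
    # `other_stuff`: an L1 sparsity penalty on the latent code here, but this is
    # the slot where a complexity, denoising, or conditional-independence term
    # would go instead.
    other_stuff = weight * z.abs().mean()
    return reconstruction_error + other_stuff
```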
The right question in my mind is how much info at a distance you can get for how many additional dimensions. There will probably be some number of dimensions N such that giving your system more than N dimensions to play with for its representation will bring diminishing returns. However, that doesn’t mean the returns will go to 0, e.g. even after you have enough dimensions to implement the ideal gas law, you can probably gain a bit more predictive power by checking for wind currents in your box. See the elbow method (though, the existence of elbows isn’t guaranteed a priori).
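As one concrete (hypothetical) way to look for that N: sweep the number of latent dimensions, plot reconstruction error against it, and look for the elbow. With PCA on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 5 "real" degrees of freedom embedded in 50 observed dimensions, plus noise.
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 50))
data = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))

for n_dims in [1, 2, 5, 10, 20, 40]:
    pca = PCA(n_components=n_dims).fit(data)
    reconstruction = pca.inverse_transform(pca.transform(data))
    error = np.mean((data - reconstruction) ** 2)
    print(f"{n_dims:2d} dimensions -> reconstruction error {error:.4f}")

# Expect a sharp drop up to ~5 dimensions (the elbow), then diminishing but
# nonzero returns as extra dimensions mostly soak up noise.
```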
(I also think that an algorithm to “provably learn all learnable abstractions”, if practical, is a hop and a skip away from a superintelligent AGI. Much of the work of science is learning the correct abstractions from data, and this algorithm sounds a lot like an uberscientist.)
Anyway, in terms of investigating convergence, I’d encourage you to think about the inductive biases induced by both your loss function and also your learning algorithm. (We already know that learning algorithms can have different inductive biases than humans, e.g. it seems that the input-output surfaces for deep neural nets aren’t as biased towards smoothness as human perceptual systems, and this allows for adversarial perturbations.) You might end up proving a theorem which has required preconditions related to the loss function and/or the algorithm’s inductive bias.
Another riff on this bit:
This is the natural abstraction hypothesis in action: across the sciences, we find that low-dimensional summaries of high-dimensional systems suffice for broad classes of “far-away” predictions, like the speed of a sled.
Maybe we could differentiate between the ‘useful abstraction hypothesis’, and the stronger ‘unique abstraction hypothesis’. This statement supports the ‘useful abstraction hypothesis’, but the ‘unique abstraction hypothesis’ is the one where alignment becomes way easier because we and our AGI are using the same abstractions. (Even though I’m only a believer in the useful abstraction hypothesis, I’m still optimistic because I tend to think we can have our AGI cast a net wide enough to capture enough useful abstractions that ours are in there somewhere, and this number will be manageable enough to find the right abstractions from within that net—or something vaguely like that.) In terms of science, the ‘unique abstraction hypothesis’ doesn’t just say scientific theories can be useful, it also says there is only one ‘natural’ scientific theory for any given phenomenon, and the existence of competing scientific schools sorta seems to disprove this.
Anyway, the aspect of your project that I’m most optimistic about is this one:
This raises another algorithmic problem: how do we efficiently check whether a cognitive system has learned particular abstractions? Again, this doesn’t need to be fully general or arbitrarily precise. It just needs to be general enough to use as a tool for the next step.
Since I don’t believe in the “unique abstraction hypothesis”, checking whether a given abstraction corresponds to a human one seems important to me. The problem seems tractable, and a method that’s abstract enough to work across a variety of different learning algorithms/architectures (including stuff that might get invented in the future) could be really useful.
Interesting, thanks for sharing.
I couldn’t figure out how to go backwards easily.
Command-shift-g right?
After practicing Vim for a few months, I timed myself doing the Vim tutorial (vimtutor on the command line) using both Vim with the commands recommended in the tutorial, and a click-and-type editor. The click-and-type editor was significantly faster. Nowadays I just use Vim for the macros, if I want to do a particular operation repeatedly on a file.
I think if you get in the habit of double-clicking to select words and triple-clicking to select lines (triple-click and drag to select blocks of code), click-and-type editors can be pretty fast.
We present a useful toy environment for reasoning about deceptive alignment. In this environment, there is a button. Agents have two actions: to press the button or to refrain. If the agent presses the button, they get +1 reward for this episode and −10 reward next episode. One might note a similarity with the traditional marshmallow test of delayed gratification.
Are you sure that “episode” is the word you’re looking for here?
https://www.quora.com/What-does-the-term-“episode”-mean-in-the-context-of-reinforcement-learning-RL
I’m especially confused because you switched to using the word “timestep” later?
Having an action which modifies the reward on a subsequent episode seems very weird. I don’t even see it as being the same agent across different episodes.
Also...
Suppose instead of one button, there are two. One is labeled “STOP,” and if pressed, it would end the environment but give the agent +1 reward. The other is labeled “DEFERENCE” and, if pressed, gives the previous episode’s agent +10 reward but costs −1 reward for the current agent.
Suppose that an agent finds itself existing. What should it do? It might reason that since it knows it already exists, it should press the STOP button and get +1 utility. However, it might be being simulated by its past self to determine if it is allowed to exist. If this is the case, it presses the DEFERENCE button, giving its past self +10 utility and increasing the chance of its existence. This agent has been counterfactually mugged into deferring.
I think as a practical matter, the result depends entirely on the method you’re using to solve the MDP and the rewards that your simulation delivers.
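For instance, a learner that only credits reward to the episode in which it arrives should end up pressing the button. A toy sketch (my own construction, with hypothetical names, not code from the post):

```python
import random

PRESS, REFRAIN = 0, 1

def run_episode(action, pending_penalty):
    """One-decision episode: +1 for pressing now, -10 charged to the *next* episode."""
    reward = pending_penalty + (1 if action == PRESS else 0)
    next_penalty = -10 if action == PRESS else 0
    return reward, next_penalty

# Tabular value estimate per action; with a single decision per episode this is
# just a running average of per-episode reward.
q = {PRESS: 0.0, REFRAIN: 0.0}
alpha, epsilon = 0.1, 0.1
pending_penalty = 0

for _ in range(5000):
    if random.random() < epsilon:
        action = random.choice([PRESS, REFRAIN])
    else:
        action = max(q, key=q.get)
    reward, pending_penalty = run_episode(action, pending_penalty)
    q[action] += alpha * (reward - q[action])

# The -10 always lands in a *different* episode's return, so PRESS looks about
# +1 better than REFRAIN to this learner and it converges on pressing, even
# though pressing is worse across the sequence of episodes.
print(q)
```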
lsusr had an interesting idea of creating a new Youtube account and explicitly training the recommendation system to recommend particular videos (in his case, music): https://www.lesswrong.com/posts/wQnJ4ZBEbwE9BwCa3/personal-experiment-one-year-without-junk-media
I guess you could also do it for Youtube channels which are informative & entertaining, e.g. CGP Grey and Veritasium. I believe studies have found that laughter tends to be rejuvenating, so optimizing for videos you think are funny is another idea.
I suspect you will be most successful at this if you get in the habit of taking breaks away from your computer when you inevitably start to flag mentally. Some that have worked for me include: going for a walk, talking to friends, taking a nap, reading a magazine, juggling, noodling on a guitar, or just daydreaming.
...When we can state code that would solve the problem given a hypercomputer, we have become less confused. Once we have the unbounded solution we understand, in some basic sense, the kind of work we are trying to perform, and then we can try to figure out how to do it efficiently.
ASHLEY: Which may well require new insights into the structure of the problem, or even a conceptual revolution in how we imagine the work we’re trying to do.
I’m not convinced your chess example, where the practical solution resembles the hypercomputer one, is representative. One way to sort a list using a hypercomputer is to try every possible permutation of the list until we discover one which is sorted. I tend to see Solomonoff induction as being cartoonishly wasteful in a similar way.
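Concretely, the hypercomputer sort I have in mind looks something like this (a deliberately silly sketch):

```python
from itertools import permutations

def hypercomputer_sort(xs):
    """Try every permutation until one happens to be sorted.

    Perfectly correct, and fine if compute is free; cartoonishly wasteful
    (roughly O(n! * n)) if it isn't.
    """
    for candidate in permutations(xs):
        if all(a <= b for a, b in zip(candidate, candidate[1:])):
            return list(candidate)

print(hypercomputer_sort([3, 1, 2]))  # [1, 2, 3]
```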
From a safety standpoint, hoping and praying that SGD won’t stumble across lookahead doesn’t seem very robust, if lookahead represents a way to improve performance. I imagine that whether SGD stumbles across lookahead will end up depending on complicated details of the loss surface that’s being traversed.
Lately I’ve been examining the activities I do to relax and how they might be improved. If you haven’t given much thought to this topic, Meaningful Rest is excellent background reading.
An interesting source of info for me has been lsusr’s posts on cutting out junk media: 1, 2, 3. Although I find lsusr’s posts inspiring, I’m not sure I want to pursue the same approach myself. lsusr says: “The harder a medium is to consume (or create, as applicable) the smarter it makes me.” They responded to this by cutting all the easy-to-consume media out of their life.
But when I relax, I don’t necessarily want to do something hard. I want to do something which rejuvenates me. (See “Meaningful Rest” post linked previously.)
lsusr’s example is inspiring in that it seems they got themselves studying things like quantum field theory for fun in their spare time. But they also noted that “my productivity at work remains unchanged”, and ended up abandoning the experiment 9 months in “due to multiple changes in my life circumstances”. Personally, when I choose to work on something, I usually expect it to be at least 100x as good a use of my time as random productive-seeming stuff like studying quantum field theory. So given a choice, I’d often rather my breaks rejuvenate me a bit more per minute of relaxation, so I can put more time and effort into my 100x tasks, than have the break be slightly useful on its own.
To adopt a different frame… I’m a fan of the wanting/liking/approving framework from this post.
- In some sense, +wanting breaks are easy to engage in because they don’t require willpower to get yourself to do them. But +wanting breaks also tend to be compulsive, and that makes them less rejuvenating (example: arguing online).
- My point above is that I should mostly ignore the +approving or -approving factor in terms of the break’s non-rejuvenating, external effects.
- It seems like the ideal break is +liking, and enough +wanting that it doesn’t require willpower to get myself to do it, and once I get started I can disconnect for hours and be totally engrossed, but not so +wanting that I will be tempted to do it when I should be working or keep doing it late into the night. I think playing the game Civilization might actually meet these criteria for me? I’m not as hooked on it as I used to be, but I still find it easy to get engrossed for hours.
Interested to hear if anyone else wants to share their thinking around this or give examples of breaks which meet the above criteria.
Good to know! I was thinking the application process would be very transparent and non-demanding, but maybe it’s better to ditch it altogether.
I don’t see the “obvious flaw” you’re pointing at and would appreciate a more in-depth explanation.
In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:
You ask your AGI to generate a plan for how it could maximize paperclips.
Your AGI generates a plan. “Step 1: Manipulate human operator into thinking that paperclips are the best thing ever, using the following argument...”
You stop reading the plan at that point, and don’t click “execute” for it.
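In code, the decoupling I have in mind is roughly the following sketch, where every function name is a hypothetical placeholder rather than a real API:

```python
def generate_plan(goal: str) -> str:
    """Stand-in for the AGI's planner; produces a plan but takes no actions."""
    raise NotImplementedError

def human_reviews_and_approves(plan: str) -> bool:
    """A human reads the plan and decides whether it should ever be executed."""
    print(plan)
    return input("Execute this plan? [y/N] ").strip().lower() == "y"

def execute(plan: str) -> None:
    """Only ever reached after explicit human approval."""
    raise NotImplementedError

def main(goal: str) -> None:
    plan = generate_plan(goal)              # plan generation...
    if human_reviews_and_approves(plan):    # ...gated by a human in the loop...
        execute(plan)                       # ...before any execution happens.
```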