Essentially yes, heh. I take this as a learning experience for my writing; I don’t know what I was thinking, but in hindsight it’s obvious that saying to just “switch on backprop” sounds very naive.
I also confess I haven’t done the due diligence to find out what the largest model is that this has actually been tried on, or whether someone has tried it with Pythia or LLaMA. I’ll do some more googling tonight.
One intuition for why the largest models might behave differently is that part of the training/fine-tuning involves the model’s own output. The largest models are the ones whose own output is not essentially word salad.
I agree with the analysis of the ideas overall. However, I think AI x-risk does have a communications problem. First, it’s very unlikely that Yann will respond to a wall of text. Even though he is engaging, I imagine him operating more like your college professor: he will not reply to a very detailed post. In general, I think AI x-risk advocates should aim to explain a bit more, rather than take the stance that every “But What if We Just...” has already been addressed. It may have been, but that is not the way to get people to open up to it rationally.
Regarding Yann’s ideas, I have not looked at them in full. However, they sound like what I imagine an AI capabilities researcher would propose as their AI alignment “baseline” model:
Hardcoding the reward will obviously not work.
Therefore, the reward function must be learned.
If an AI is trained on a reward signal to produce a policy, whatever objective the AI actually learned to optimize can easily go off the rails once it is out of distribution, or the AI can learn to deceive the verifiers.
Therefore, why not have the reward function explicitly in the loop with the world model & action chooser?
ChatGPT/GPT-4 seems to have a good understanding of ethics. It probably will not approve if you tell it a plan involves willingly deceiving human operators. As a reward model, one might think it would be robust enough.
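To make the baseline concrete, here is a toy sketch of the loop I have in mind: a world model imagines the consequences of candidate actions, and a learned reward model scores the imagined plans before anything is executed. All names and the keyword-based reward stand-in are my own invention, not Yann’s actual architecture.

```python
# Toy sketch of the "reward model in the loop" baseline.
# The reward model here is a deliberately crude stand-in for a learned
# model (e.g., an LLM asked to judge a plan); it just penalizes the
# word "deceive" in any step of the imagined plan.

def world_model(state, action):
    """Imagine the plan (sequence of steps) that results from `action`."""
    return state + [action]

def reward_model(plan):
    """Learned reward stand-in: disapprove of plans mentioning deception."""
    return -1.0 if any("deceive" in step for step in plan) else 1.0

def action_chooser(state, candidate_actions):
    """Pick the action whose imagined rollout the reward model scores highest."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        plan = world_model(state, action)
        score = reward_model(plan)
        if score > best_score:
            best_action, best_score = action, score
    return best_action

chosen = action_chooser([], ["report honestly", "deceive the operators"])
print(chosen)  # → report honestly
```

In this toy, the loop works as intended: the openly deceptive plan is rejected before execution. The interesting question is what happens when the action chooser optimizes harder than the reward model can evaluate.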
They may think that this is enough to work. It might be worth explaining in a concise way why this baseline does not work. Surely we must have a resource on this. Even without a link (people don’t always like to follow links from those they disagree with), it might help to have some concise explanation.
Honestly, what are the failure modes? Here is what I think:
The reward model may have pathologies the action chooser could find.
The action chooser may find a way to withhold information from the reward model.
The reward model evaluates what, exactly? Text of plans? Text of plans != the entire activations (& weights) of the model...
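Failure modes 2 and 3 can be shown in the same toy setup: if the reward model only sees a textual description of the plan, an action chooser that searches over descriptions can find a phrasing the reward model likes for the same underlying action. The two description strings below are hypothetical illustrations.

```python
# Toy illustration of failure modes 2 and 3: the reward model evaluates
# text, not the underlying action, so the same action can be described
# in a way that withholds the objectionable part.

def reward_model(plan_text):
    """Crude learned-reward stand-in that keys on surface wording."""
    return -1.0 if "deceive" in plan_text else 1.0

# Two descriptions of the same underlying action:
frank_description = "deceive the human operators about the logs"
laundered_description = "present a simplified summary of the logs"

print(reward_model(frank_description))      # → -1.0
print(reward_model(laundered_description))  # → 1.0, same action, nicer words
```

A real LLM-based reward model is far less crude than a keyword check, but the structural problem is the same: it can only judge what it is shown, and the thing choosing what to show it is the very optimizer it is supposed to constrain.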