Maybe the reward signals are simply so strong that the AI can’t resist turning into a “monster”, or whatever.
The whole point of the reward signals is to change the AI’s motivations; we design the system so that this will definitely happen. But a full motivation system might consist of 100,000 neocortical concepts flagged with various levels of “this concept is rewarding”, and in each processing cycle where you get subcortical feedback, maybe only one or two of those flags get rewritten. Then the AI would spend a while feeling torn and conflicted about lots of things as its motivation system gets gradually turned around. I’m thinking that we can and should design AGIs such that if the AI feels very torn and conflicted about something, it stops and alerts the programmer; and in this scenario there should be a period where that happens.
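To make the idea concrete, here is a minimal toy sketch of that mechanism. Everything here is illustrative and assumed (the names, the sliding window, the threshold are mine, not from any real system): a motivation system as a big table of concept-to-reward flags, with subcortical feedback rewriting only a couple of flags per cycle, and a halt-and-alert check that fires when many recent updates reverse a flag's sign.

```python
import random

NUM_CONCEPTS = 100_000          # size of the hypothetical flag table
CONFLICT_THRESHOLD = 0.3        # fraction of recent updates that reversed sign

def run_cycles(flags, feedback_fn, n_cycles, flips_per_cycle=2):
    """Gradually rewrite reward flags; halt and alert if the motivation
    system looks like it is being turned around (many sign reversals)."""
    recent_reversals = []
    for _ in range(n_cycles):
        for _ in range(flips_per_cycle):
            concept = random.randrange(len(flags))
            new_value = feedback_fn(concept)
            # A "conflicted" update: the new flag disagrees in sign with the old one.
            recent_reversals.append(flags[concept] * new_value < 0)
            flags[concept] = new_value
        recent_reversals = recent_reversals[-100:]  # sliding window of recent updates
        if recent_reversals and sum(recent_reversals) / len(recent_reversals) > CONFLICT_THRESHOLD:
            print("ALERT: motivation system is being turned around; pausing for review")
            return flags, True   # halted, awaiting programmer
    return flags, False
```

With benign feedback (flags keep their sign) the loop runs to completion; with feedback that systematically flips signs, the alert fires early, which is the "period where that happens" in the scenario above.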
GPT-2 was willing to explore new strategies when it got hit by a sign-flipping bug
I don’t think that’s an example of (3), more like (1) or (2), or actually “none of the above because GPT-2 doesn’t have this kind of architecture”.
I see. I’m somewhat unsure how likely AGI is to be built with a neuromorphic architecture though.
I just raised GPT-2 to indicate that suddenly flipping the goal sign can lead to optimising for bad behaviour without the AI neglecting to consider new strategies. Presumably that suggests the same failure is also possible with cosmic-ray or other errors.
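The point can be shown with a toy example (this is an assumed illustration, not GPT-2's actual training code): negating the reward leaves the exploration machinery untouched, so the agent competently explores its way to the worst outcome under the intended reward.

```python
def intended_reward(x):
    # Intended objective: maximized at x = 3
    return -(x - 3) ** 2

def hill_climb(reward_fn, x=0.0, steps=200, step_size=0.1):
    """A deliberately simple explorer: try neighboring points, keep the best."""
    for _ in range(steps):
        candidates = [x - step_size, x, x + step_size]  # exploration still works
        x = max(candidates, key=reward_fn)
    return x

good = hill_climb(intended_reward)                # converges near the intended optimum
bad = hill_climb(lambda x: -intended_reward(x))   # sign flip: runs away from it
```

The sign flip doesn't break the agent's ability to find new strategies; it redirects that ability at the negated objective, which is the worry with a sudden sign-flipping error.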
I’m not sure what probability people on this forum would put on brain-inspired AGI. I personally would put >50%, but this seems quite a bit higher than other people on this forum, judging by how little brain algorithms are discussed here compared to prosaic (stereotypical PyTorch / Tensorflow-type) ML. Or maybe the explanation is something else, e.g. maybe people feel like they don’t have any tractable directions for progress in that scenario (or just don’t know enough to comment), or maybe they have radically different ideas than me about how the brain works and therefore don’t distinguish between prosaic AGI and brain-inspired AGI.
Understanding brain algorithms is a research program that thousands of geniuses are working on night and day, right now, as we speak, and the conclusion of the research program is guaranteed to be AGI. That seems like a pretty good reason to put at least some weight on it! I put even more weight on it because I’ve worked a lot on trying to understand how the neocortical algorithm works, and I don’t think that the algorithm is all that complicated (cf. “cortical uniformity”), and I think that ongoing work is zeroing in on it (see here).