GPT-o1’s extended, more coherent chain of thought (see Ethan Mollick’s crossword puzzle test for a particularly long chain of goal-directed reasoning[1]) seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn’t go so far as to say I expect it (I haven’t even played with o1-preview yet), but it seems significantly more likely than in previous LLMs.
Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We don’t know how accurate the summary is, and it seems likely that it omits any worrying content (OpenAI: ‘We also do not want to make an unaligned chain of thought directly visible to users’).
This is unfortunate from a research perspective. Probably we’ll eventually see capable open models along similar lines, and can do that research then.
[EDIT: to be clear, I’m talking here about very simple forms of instrumental reasoning. ‘Can I take over the world to apply more compute to this problem’ seems incredibly unlikely. I’m thinking about things more like, ‘Could I find the answer online instead of working this problem out myself’ or anything else of the form ‘Can I take actions that will get me to the win, regardless of whether they’re what I was asked to do?’.]
Incidentally, the summarized crossword-solving CoT that Mollick shows is an exceptionally clear demonstration of the model doing search, including backtracking.
Something I hadn’t caught until my second read of OpenAI’s main post today: we do at least get a handful of (apparent) actual chains of thought (search ‘we showcase the chain of thought’ to find them). They’re extremely interesting.
They’re very repetitive, with the model seeming to frequently remind itself of its current hypotheses and intermediate results (alternatively: process supervision rewards saying correct things even if they’re repetitious; presumably that trades off against a length penalty?).
The CoTs immediately suggest a number of concrete & straightforward strategies for improving the process and results; I think we should expect pretty rapid progress for this approach.
It’s fascinating to watch the model repeatedly tripping over the same problem and trying to find a workaround (eg search for ‘Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)’ in the Cipher example, where the model keeps having trouble with the repeated xs at the end). The little bit of my brain that can’t help anthropomorphizing these models really wants to pat it on the head and give it a cookie when it finally succeeds.
Again, it’s unambiguously doing search (at least in the sense of proposing candidate directions, pursuing them, and then backtracking to pursue a different direction if they don’t work out—some might argue that this isn’t sufficient to qualify).
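To be concrete about the sense of ‘search’ meant here (propose a candidate, pursue it, backtrack on failure), here’s a minimal toy backtracking solver. It’s N-queens, nothing to do with o1’s internals; it just shows the generic control flow the CoT transcripts resemble:

```python
def solve(n, placed=()):
    """Depth-first search with backtracking: propose a candidate,
    pursue it recursively, and back up when it hits a dead end."""
    row = len(placed)
    if row == n:                      # every row filled: success
        return placed
    for col in range(n):              # propose a candidate direction
        ok = all(col != c and abs(col - c) != row - r
                 for r, c in enumerate(placed))
        if ok:
            result = solve(n, placed + (col,))   # pursue it
            if result is not None:
                return result
            # the branch below failed: fall through, try the next candidate
    return None                       # no candidate worked: backtrack

print(solve(6))
```

The ‘trying a direction, hitting the repeated-letters snag, and retreating to try another’ pattern in the Cipher transcript is this loop in natural language.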
This is the big takeaway here: search is a notable capabilities improvement on its own, but it still needs compute scaling to get better results.
But the other takeaway is that, based on its performance on several benchmarks, adding search turned out to be far easier than Francois Chollet thought it would be, and it’s looking like compute and data are the hard parts of getting intelligence into LLMs, not the search and algorithm parts.
This is just another point on the trajectory of LLMs being more and more general reasoners, and not just memorizing their training data.
I was just amused to see a tweet from Subbarao Kambhampati in which he essentially speculates that o1 is doing search and planning in a way similar to AlphaGo...accompanied by a link to his ‘LLMs Can’t Plan’ paper.
I think we’re going to see some goalpost-shifting from a number of people in the ‘LLMs can’t reason’ camp.
I agree with this, and I think that o1 is clearly a case where a lot of people will try to shift the goalposts even as AI gets more and more capable and runs more and more of the economy.
It’s looking like the hard part isn’t the algorithmic or data parts, but the compute part of AI.
This is the first model where we have strong evidence that the LLM is actually reasoning/generalizing and not just memorizing its data.
Really? There were many examples where even GPT-3 solved simple logic problems that couldn’t be explained by having the solution memorized. The effectiveness of chain-of-thought prompting was discovered when GPT-3 was current. GPT-4 could do fairly advanced math problems, explain jokes, etc.
The o1-preview model exhibits a substantive improvement in CoT reasoning, but arguably not something fundamentally different.

True enough, and I should probably rewrite the claim.

Though what was the logic problem that was solved without memorization?

I don’t remember exactly, but there were debates (e.g. involving Gary Marcus) on whether GPT-3 was merely a stochastic parrot or not, based on various examples. The consensus here was that it wasn’t. For one, if it were all just memorization, then CoT prompting wouldn’t have provided any improvement, since CoT imitates natural-language reasoning, not a memorization technique.

Yeah, it’s looking like GPT-o1 is just quantitatively better at generalizing than GPT-3, not qualitatively better.
Yeah, I’m getting a little worried that porby’s path to AI safety relies at least a little on AI companies not taking shortcuts/insights like Strawberry/Q*, and this makes me more pessimistic today than yesterday because of METR’s testing on o1, though notably I don’t consider it nearly as large an update as some other people on LW do.
Given the race dynamic and the fact that some major players don’t even recognize safety as a valid concern, it seems extremely likely to me that at least some will take whatever shortcuts they can find (in the absence of adequate legislation, and until/unless we get a large warning shot).
Yeah, one thing I sort of realized is that instrumentally convergent capabilities can come up even without very sparse RL, and I now think that while non-instrumentally-convergent AIs could exist, they will be far less compute-efficient than those that use some instrumental convergence.
To be clear, I learned some new things about AI alignment that keep me quite optimistic mostly regardless of architecture: alignment generalizing further than capabilities for pretty deep reasons, combined with the new path of synthetic data letting us control what the AI learns and values through data. Still, this was a mild violation of my model of how future AI goes.
I think the key thing I didn’t appreciate is that a path to alignment/safety that works technically doesn’t mean it will get used in practice, and, following @Seth Herd, an alignment solution that requires high taxes or that isn’t likely to be implemented is a non-solution in real life.

Do you have a link to a paper / LW post / etc on that? I’d be interested to take a look.

This was the link I was referring to:

https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
I don’t immediately find that piece very convincing; in short I’m skeptical that the author’s claims are true for a) smarter systems that b) are more agentic and RL-ish. A few reasons:
The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
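To illustrate what ‘trivially specified’ means here: pretraining’s loss is just next-token cross-entropy, expressible in a few lines, whereas no comparably short formula captures ‘reward = all the behavior and trade-offs we want’. A sketch (the logits are toy values, not from any real model):

```python
import math

def next_token_loss(logits, target_index):
    """Pretraining's entire objective: negative log-probability of the
    observed next token under a softmax over the model's logits."""
    exps = [math.exp(z) for z in logits]
    prob = exps[target_index] / sum(exps)
    return -math.log(prob)

# Toy example: three-token vocabulary, the observed next token is index 0.
# There is no comparably short expression for a reward function that
# robustly captures everything we want from an agent.
loss = next_token_loss([2.0, 0.5, -1.0], 0)
print(round(loss, 4))
```

The asymmetry in the comment above is exactly this: the pretraining objective fits in one function, while the RL layer’s reward is where all the specification difficulty lives.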
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
This becomes way more acute as capabilities move past the level where humans can quickly and easily choose the better output (eg as the basis for a reward model for RLHF).
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Maybe I’ve just failed to cross the inferential distance here, but on first read I’m pretty skeptical.
Some thoughts on this comment:

The core difficulty isn’t with how hard reward models are to train, it’s with specifying a reward function in the first place in a way that’s robust enough to capture all the behavior and trade-offs we want. LLMs aren’t a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it’ll be interesting to see how prosaically-aligned it is.
This is right, but I think it’s addressable by making large synthetic datasets, and I also think that in practice we can define reward functions densely enough to capture all of the behavior we want.
We have very many examples of reward misspecification and goal misgeneralization in RL; it’s historically been quite difficult to adequately specify a reward function for agents acting in environments.
I agree with this, but I will also say that the examples listed point to a strong reason why RL also wasn’t as capable as people thought, and a lot of the hacks also decreased capabilities as they decreased alignment, so any solution to that problem would help capabilities and alignment massively.
That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it’s enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It’s not obvious to me that that works in practice, or is efficient enough to be practical.
Yeah, I think the big question for my views is whether the LLM solution has low enough taxes to be practical, and my answer at this point is probably, but it’s not a sure thing, as it requires labs to slow down in the race a little (though training runs will get longer, so there’s a countervailing force to this).
I think there are reasons to be optimistic here, mainly due to updating against evopsych views on how humans got their capabilities and values, combined with updating against complexity and fragility of value due to LLM successes, though it will require real work to bring about.
I’m pretty skeptical of: ‘...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.’ I think there’s a lot of missing work there to be able to claim that mapping.
I think that the verification-generation gap is pervasive in a lot of fields, from workers in many industries being verified by bosses to make sure their job is done right, to people who buy air conditioners being able to find efficient air-conditioning for their needs despite not verifying very hard, to researchers verifying papers that were generated, to social reformers having correct critiques of various aspects of society but not being able to generate a new societal norm, and more.
Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.
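The classic formal version of this gap is NP: checking a proposed certificate is cheap even when finding one is, as far as we know, exponentially hard. A toy subset-sum sketch (the numbers are arbitrary):

```python
from itertools import combinations

def verify(nums, target, certificate):
    """Verification: linear in the size of the proposed solution."""
    pool = list(nums)
    return all(x in pool for x in certificate) and sum(certificate) == target

def generate(nums, target):
    """Generation: brute force over up to 2^n subsets in the worst case."""
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return combo
    return None

nums = [267, 493, 869, 961, 1153, 1598, 1766, 1922]
target = 267 + 961 + 1598          # constructed so a solution exists
cert = generate(nums, target)
print(cert, verify(nums, target, cert))
```

Whether the alignment problem actually lands in this reference class is, of course, the disputed mapping; the asymmetry itself is uncontroversial for problems like this one.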
‘Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.’ I think this is false. Consider ‘Biden (/Trump) was a great president.’ The world is full of situations where humans differ wildly on whether they’re good or bad.
Note I’m referring to incorrectly judging relative to one’s own internal value system, not relative to another person’s values.
I think the crux is whether verification is easier than generation in alignment, since if verification is just as hard as generation, this hurts plans like scalable oversight a lot.
Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.
Note I’m referring to incorrectly judging relative to one’s own internal value system, not relative to another person’s values.
Thanks for the thoughtful responses.

I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:

The Collatz conjecture is true.

The net effect of SB-1047 will be positive [given x values].

Trump will win the upcoming election.

The 10th Busy Beaver number is <number>.

Such and such a software system is not vulnerable to hacking[1].
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn’t know or hasn’t noticed, that means that A will have very large negative effects.
This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier’s Law: ‘Anyone can invent a security system so clever that she or he can’t think of how to break it.’
To address your examples:

‘The Collatz conjecture is true’ can’t be a statement that is harder to prove than to verify, for the reason given by Vanessa Kosoy here, though you might be right that in practice it’s genuinely hard to verify the proof that was generated:

https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general#feTSDufEqXozChSbB
The same response can be given to the fourth example (the Busy Beaver number).
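For concreteness on the Collatz example: checking the Collatz behavior of any single number is mechanical; the hard object is the universal claim over all positive integers, which is why ‘the conjecture is true’ isn’t an easy-to-verify statement. A quick sketch:

```python
def collatz_steps(n, limit=10_000):
    """Count 3n+1 steps until n reaches 1, or give up at `limit`.
    Easy for any particular n; the conjecture asserts termination for
    *every* positive integer, which finite checking can't establish."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
        if steps >= limit:
            return None   # undecided within the step budget
    return steps

print(collatz_steps(27))   # 27 famously takes a long ride before reaching 1
```

Any finite battery of such checks verifies only instances, never the conjecture; verifying a purported *proof* is a different, and potentially very hard, task.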
On election outcomes, polls on Labor Day are actually reasonably predictive of what happens in November, mostly because by that point voters have heard a lot more about the prospective candidates and are starting to form opinions.
For the SB-1047 case, the one prediction I will make right now is that the law will have essentially no positive or negative effect, for a lot of value systems, solely because it’s a rather weak AI bill after amendments.
I usually don’t focus on the case where we try to align an AI that is already misaligned, but rather on trying to get the model into a basin of alignment early via data.
Re Schneier’s Law and security mindset, I’ve become more skeptical of security mindset being useful in general, for two reasons:

I think there are enough disanalogies, like the fact that you can randomly change some of the parameters in a model and in most cases still get the same or improved performance, something that notably doesn’t exist in the actual security field, or even in fields that have to deal with highly fragile systems.

There is good reason to believe that a lot of the security exploits discovered, however magical they seem, don’t actually matter in practice because of ridiculous preconditions, and the computer security field is selection-biased toward saying that an exploit dooms us forever (even when it can’t):
These posts and comments are helpful pointers to my view:

https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment

https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#ogt6CZkMNZ6oReuTk

https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/#MFqdexvnuuRKY6Tbx
I think we’re more aware of problems in NP, the kind you’re talking about, because they’re ones that we can get any traction on.
So ‘many problems are hard to solve but easy to verify’ isn’t much evidence that capabilities/alignment falls into that reference class.
True, but I do think there is real traction on the problem already; IMO one of the cooler results is Pretraining Language Models from Human Feedback. And note that even a problem in NP can get really intractable in the worst case (though we don’t have proof of that), so there’s a strained analogy to be made here.
For this:
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’.
Yeah, I do actually think that in practice this problem is in that reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually producing them, as evidenced by the very large number of people who do the former compared to the latter.
Indeed, one of the traps for social reformers IRL is to think that just because verifying whether something is right or wrong is easy, generating a new social outcome, perhaps via new norms, must also be easy. It isn’t, because the verification side is much easier than the generation side.
I think that in particular, one kind of alignment problem that’s clearly not in that reference class is: ‘Given utility function U, will action A have net-positive consequences?’.
Yeah, I do actually think that in practice this problem is in that reference class, and that we are much better at judging and critiquing/verifying outcomes than at actually producing them, as evidenced by the very large number of people who do the former compared to the latter.
I’m talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that’s a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, ‘Run this, it’ll generate a cure for cancer and definitely not do bad things’, and in many cases it would be difficult-to-impossible to confirm that.
We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we’re back to counting on all players to behave responsibly (although I can certainly imagine legislation / treaties that would help a lot there).