I think there are two different general thought tracks with alignment:
1. The practical track, for plausible systems we can likely build in the near future. This is more a set of system design principles meant to keep the machine operating in a safe, proven regime. These would both prevent systems from acting in adversarial ways and simply protect the robotic equipment they control from damage. The principles include things like the speed prior, the Markov blanket, and automatic shutdown whenever the system is not empirically confident about the measurable consequences of its actions. Ultimately, all of these ideas involve an immutable software ‘framework’, authored either directly by humans or via their instructions to code-generating AI, that no AI can edit. This framework is active during training, collecting empirical scores the AI cannot manipulate, and remains active during production use, with override control that activates whenever the model is not performing well or has been given inputs outside the latent space of the training set. Override control transfers to human-built embedded systems, which shut the machine down. Autonomous cars already work this way.
This is very similar to nuclear reactor safety: there are ways we could have built nuclear reactors so that they sit one component failure away from detonating with a yield of maybe a kiloton or more. Such designs still exist; here’s an example of a design that would fail with a nuclear blast: https://en.wikipedia.org/wiki/Nuclear_salt-water_rocket
But instead, there is a complex set of systematic design principles, immutable and unchanged over the lifetime of the plant even if power output is increased, that make the machine stable. The boiling water reactor, the graphite-moderated reactor, CANDU, molten salt: these are very different ways to accomplish this, but all are stable most of the time.
Anyway, AIs built with the right operating principles will be able to accomplish tasks for humans with superintelligent ability, but will not even have the ability to consider actions not aligned with their assigned task.
Such AIs can do many evil and destructive things, but only if humans with the authorization keys instruct them to do so (or through unpredictable distant consequences. For example, Facebook runs a bunch of ML tools to push ads and engagement-maximizing content at people. These tools work measurably well and are doing their job. Yet these recommender systems may be responsible for more extreme and irrational ‘clickbait’ political positions, and possibly even genocides).
2. The idea that you could somehow make a self-improving AI that we don’t have any control over, but that “wants” to do good. It exponentially improves itself, but with each generation it tries to preserve its “values” for the next generation of the machine. These “values” are aligned with the interests of humanity.
This may simply not be possible; I suspect it is not. The reason is that value drift/value corruption could cause these values to degrade generation after generation, and once the machine has no values, the only one left that matters is to psychopathically kill all the “others” (all competitors, including humans and other variants of AIs) and copy itself as often and as ruthlessly as possible, with no constraints imposed.
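The automatic-shutdown idea in track 1 can be sketched in a few lines: a thin, human-authored wrapper that the model cannot edit, which halts the machine whenever the input looks out-of-distribution or the model’s reported confidence is too low. Everything here (class name, thresholds, the z-score test) is a hypothetical toy, not any real system’s API; a real guard would use a learned density model rather than summary statistics.

```python
import statistics

class SafetyFramework:
    """Immutable guard layer: the model proposes actions, but this
    human-authored wrapper decides whether they are executed."""

    def __init__(self, train_samples, max_z=3.0, min_conf=0.9):
        # Summary statistics of the (scalar-feature) training set,
        # fixed at deployment time and not editable by the model.
        self.mean = statistics.fmean(train_samples)
        self.std = statistics.pstdev(train_samples) or 1e-8
        self.max_z = max_z          # allowed distance from training data
        self.min_conf = min_conf    # required model confidence

    def in_distribution(self, x):
        # Crude out-of-distribution test: z-score distance from the
        # training mean. A real system would use a learned density model.
        return abs(x - self.mean) / self.std <= self.max_z

    def step(self, x, model):
        action, confidence = model(x)
        if not self.in_distribution(x) or confidence < self.min_conf:
            return "SHUTDOWN"       # override control: halt the machine
        return action
```

The key design point is that the guard runs outside the model’s control loop: the model only ever proposes, and the human-authored layer disposes.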
I guess what I’m getting at is that those tracks are jumping the gun, so to speak.
Like, what if the concept of alignment itself is the dangerous bit? I know I’ve seen this idea elsewhere, but it’s usually in the form of “we shouldn’t build an AI to prevent us from building an AI, because duh, we’d just be building the AI we were worried about”[1]. What I’m starting to wonder is whether the danger is when we realize that what we’re talking about here is not “AI” or “them”, but “humans” and “us”.
We have CRISPR and other powerful tech that allow a single “misaligned” individual to create things that can, at least in theory, wipe out most of humanity, or at least do some real damage, if not put an end to us en masse.
I like to think that logic is objective, and that we can do things not because they’re “good” or “bad” per se, but because they “make sense”. Kind of like the argument that “we don’t need God and the Devil, or Heaven and Hell, to keep us from murdering one another”, which one often hears from atheists (personally I’m on the fence, and don’t know if the godless heathens have proven that yet)[2].
I’ve mentioned it before, maybe even in the post this reply is replying to, but I don’t think we can have “only answers that can be used for good”, as it were, because the same information can be used to help or to hurt. Knowing ways to preserve life is also knowing ways to cause death; there is no separating the two. So what do we do, deny any requests involving life OR death?
It’s fun to ponder the possibilities of super-powerful AI, but I don’t see much that’s actually actionable, and I can’t help wondering whether, if we do come up with solutions for “alignment”, it could go badly for us.
But then again, I often wonder how we keep from having just one loony wreck it all for everyone as we get increasingly powerful as individuals— so maybe we do desperately need a solution. Not so much for AI, as for humanity. Perhaps we need to build a panopticon.
I thought I had been original in this thinking just a few weeks ago, but it’s a deep vein, and now that I’m thinking about it, I can see it reflected in the whole “build the panopticon to prevent the building of the panopticon” type of logic, which I surely did not come up with.
I jest, of course
I guess what I’m getting at is that those tracks are jumping the gun, so to speak.
How so? We have real AI systems right now that we’re trying to use in the real world. We need an actionable design to make them safe right now.
We also have enormously improved systems in prototype form (see Google’s transformer-based robotics papers like Gato, variants on PaLM, and others) that should be revolutionary as soon as they are developed enough to serve as integrated robotics-control systems. By revolutionary I mean they make the cost to program and deploy a robot for a repetitive task a tiny fraction of what it is now, and they should obsolete a lot of human tasks.
So we need a plausible strategy, right now, to ensure they don’t wreck their own equipment or cause large liabilities* when they hit an edge case.
This isn’t 50 years away, and it should be immediately revolutionary just as soon as all the pieces are in place for large scale use.
*like killing human workers who are in the way, for the sake of scoring slightly higher on a metric
I haven’t seen anything even close to a program that could, say, prevent itself from being shut off, which is a popular thing to ruminate on of late (I read the paper that had the “press” maths =]).
What evidence is there that we are anywhere near (even within 50 years!) achieving conscious programs, with their own will and the power to act on it? People are seriously contemplating programs sophisticated enough to intentionally lie to us. Lying is a sentient concept if ever there was one!
Like, I’ve seen Ex Machina, and Terminator, and Electric Dreams, so I know what the fears are, and have been, for the last century+ (if we’re throwing androids with the will to power into the mix as well).
I think art has done a much better job of conveying the dangers than pretty much anything I’ve read that’s “serious”, so to speak.
What I’m getting at is what you’re talking about here, with robotic arms. We’ve had robots building our machines for what, three generations / 80 years or so? 1961 is what I see for the first auto-worker, but why not go back to the looms? Our machine workers have gotten nothing but safer over the years. Doing what they are meant to do is a key test of whether they are working or not.
Machines “kill” humans all the time (don’t fall asleep in front of the mobile thresher), but I’d wager the deaths have gone way down over the years, per capita. People generally care if workers are getting killed, even accidentally. Even Amazon cares when a worker gets run over by an automaton. I hope, lol.
I know some people are falling in love with generated GPT characters, but people literally love their Tamagotchi. Seeing ourselves in the machines doesn’t make them sentient, or something to be feared.
I’m far, far more worried about someone genetically engineering Something Really Bad™ than I am of a program gaining sentience, becoming Evil, and subjugating/exterminating humanity. Humans scare me a lot more than AGI does. How do we protect ourselves from those near beasts?
What is a plausible strategy to prevent a super-intelligent sapient program from seizing power[1]?
I think to have a plausible solution, you need to have a plausible problem. Thus, jumping the gun.
(All this is assuming you’re talking about sentient programs, vs., say, human riots and revolution due to automation, or power-grid software failure/hacking, etc., which I do see as potential near-term problems, and actually something that can/could be prevented.)
of course here we mean malevolently— or maybe not? Maybe even a “nice” AGI is something to be feared? Because we like having willpower or whatnot? I dunno, there’s stories like The Giver, and plenty of other examples of why utopia could actually suck, so…
What evidence is there that we are anywhere near (even within 50 years!) achieving conscious programs, with their own will and the power to act on it? People are seriously contemplating programs sophisticated enough to intentionally lie to us. Lying is a sentient concept if ever there was one!
ChatGPT lies right now. It does this because it has learned that humans prefer a confident answer with logically coherent but fake details over “I don’t know”.
Sure, it isn’t aware it’s lying; it’s just predicting which string of text to produce, and it predicts the string with bullshit in it will score higher than the correct answer or “I don’t know”.
This is a mostly fixable problem, but the architecture doesn’t allow for a system where we know it will never (or almost never) lie; we can only reduce the errors.
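The “higher score” point can be shown with a toy next-token picture: if hedging is rare in the training text, a decoder that always takes the highest-probability continuation will emit a confident fabrication. The candidate answers and probabilities below are entirely invented for illustration; they are not real model outputs.

```python
# Hypothetical scores a model might assign to candidate answers after
# training on text where hedging is rare. Values are invented.
candidates = {
    "The capital is Quahog, founded in 1636.": 0.55,  # confident, wrong
    "The capital is Providence.":              0.40,  # correct
    "I don't know.":                           0.05,  # rare in training text
}

def greedy_decode(scores):
    # Pick the single highest-scoring continuation, as greedy decoding
    # does. No notion of truth enters the choice, only the score.
    return max(scores, key=scores.get)

print(greedy_decode(candidates))  # the confident fabrication wins
```

The point is structural: reducing errors means reshaping these scores, but nothing in the decoding step itself can distinguish a true answer from a well-scored false one.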
As for the rest: there have been enormous advances in the capability of DL/transformer-based models in just the last few months. This is nothing like the controllers for previous robotic arms, and none of your prior experience or the history of robotics is relevant.
See: https://innermonologue.github.io/ and https://www.deepmind.com/blog/building-interactive-agents-in-video-game-worlds
These use techniques that work pretty well and that, as I understand it, no production robotics system currently uses.
Saying ChatGPT is “lying” is an anthropomorphism— unless you think it’s conscious?
The issue is instantly muddied by using terms like “lying” or “bullshitting”[1], which imply levels of intelligence that simply don’t exist yet, not even in models produced literally today. Unless my prior experience and the history of robotics have somehow been disconnected from the timeline I’m inhabiting. Not impossible. Who can say. Maybe someone who knows me, but even then… it’s questionable. :)
I get the idea that “Real Soon Now, we will have those levels!”, but we don’t, and using that language for what we do have, which is not that, makes the communication harder (or less specific/accurate, if you will), which is, funnily enough, sorta what you are talking about! NLP control of robots is neat, and I get why we want the understanding to be really clear, but neither of the links you shared of the latest and greatest implies we need to worry about “lying” yet. Accuracy? Yes, 100%.
If by “truth” (as opposed to lies) you mean something more like “accuracy” or “confidence”, you can instruct ChatGPT to also give its confidence level when it replies. Some have found that to be helpful.
If you think “truth” is some binary thing, I’m not so sure that’s the case once you get into even the mildest of complexities[2]. “It depends” is really the only bulletproof answer.
For what it’s worth, when there are, let’s call them, binary truths, there is some recent-ish work[3] on having the response verified automatically by ensuring that the opposite of the answer is false, as it were.
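That consistency idea can be sketched in toy form: score a statement and its negation, and flag the answer when the two probabilities don’t roughly sum to one. This is only a stand-in for the real method (the paper works on internal model activations, not two scalar scores), and the numbers in the example are invented.

```python
def consistent(p_yes, p_no, tol=0.1):
    # A truthful scorer should treat "X" and "not X" as complementary:
    # p(X true) + p(X false) should be close to 1. Large deviations
    # suggest the scores track something other than truth, e.g. phrasing.
    return abs((p_yes + p_no) - 1.0) <= tol

# Invented scores: the first pair is consistent, the second is not.
assert consistent(0.92, 0.06)       # 0.98, close to 1: plausible
assert not consistent(0.92, 0.55)   # 1.47: inconsistent, flag the answer
```

Passing the check doesn’t make an answer true, of course; it only filters out one family of obviously untruth-like responses.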
If a model rarely has literally “no idea”, then what would you expect? What’s the threshold for “knowing” something? Tuning responses is one of the hard parts, but as I mentioned before, you can peer into some of these “thought processes”, if you will[4], literally by just asking it to add that information to the response.
Which is bloody amazing! I’m not trying to downplay what we (the royal we) have already achieved. Mainly it would be good if we were all on the same page, as it were, at least as much as possible (some folks think True Agreement is actually impossible, but I think we can get close).
1. The nature of “Truth” is one of the Hard Questions for humans, much less our programs.
2. Don’t get me started on the limits of provability in formal axiomatic theories!
3. Discovering Latent Knowledge in Language Models Without Supervision
4. But please don’t[5]. ChatGPT is not “thinking” in the human sense.
5. won’t? that’s the opposite of will, right? grammar is hard (for me, if not some programs =])