Depends on the original AI’s value function. If it cares about humanity, or at least its own safety, then yes, making smarter AIs is not a convergent goal. But if it’s some kind of roboaccelerationist with a goal like “maximize intelligence in the universe”, it will make smarter AIs even knowing that it means being paperclipped.
AI is prosperous and all-knowing. No people, hence zero suffering.
Yes, but training AI to try to fix errors is not that hard.
How many of those Ukrainians are draft-evaders? I mean, so far it looks like this money-for-not-fighting program is already implemented, but for the opposite side...
Yes. And it is also about the importance of the human/worker. While there is still some part of the work that a machine can’t do, the human who can do the remaining part is important. Once the machine can do everything, the human is disposable.
If a machine can do 99% of a human’s work, it multiplies the human’s power by x100.
If a machine can do 100% of a human’s work, it multiplies the human’s power by x0.
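A rough way to see the arithmetic behind those two numbers (a toy sketch; the multiplier function is my own framing of the point, not something stated above): if the machine handles a fraction f of the work and the human still has to supply the remaining 1 − f, then each unit of human effort is amplified by 1/(1 − f); at f = 1 the human’s contribution is no longer needed at all, so the multiplier collapses to zero instead of diverging.

```python
def human_power_multiplier(machine_fraction: float) -> float:
    """Illustrative only: how much one unit of human effort is amplified
    when a machine does `machine_fraction` of the total work and the human
    does the rest. At 1.0 the human's contribution is no longer needed,
    so the multiplier drops to 0 rather than going to infinity."""
    if not 0.0 <= machine_fraction <= 1.0:
        raise ValueError("machine_fraction must be between 0 and 1")
    if machine_fraction == 1.0:
        return 0.0  # the human is disposable: no remaining work for them
    return 1.0 / (1.0 - machine_fraction)

print(round(human_power_multiplier(0.99)))  # -> 100, the "x100" case above
print(human_power_multiplier(1.00))         # -> 0.0, the "x0" case above
```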
Would be amusing if Russia and China joined the “Yudkowsky’s treaty” and the USA did not.
I think that the keystone human value is about making significant human choices. Individually and collectively, including choosing humanity’s course.
You can’t make a choice if you are dead
You can’t make a choice if you are disempowered
You can’t make a human choice if you are not a human
You can’t make a choice if the world is too alien for your human brain
You can’t make a choice if you are in too much pain or too much bliss
You can’t make a choice if you let AI make all the choices for you
Since there are no humans in the training environment, how do you teach that? Or do you put human-substitutes there (or maybe some RLHF-type thing)?
Yes, probably some human models.
Also, how would such AIs even reason about humans, since they can’t read our thoughts? How are they supposed to know whether we would like to “vote them out” or not?
By being aligned. I.e. understanding human values and complying with them. Seeking to understand other agents’ motives and honestly communicating its own motives and plans to them, to ensure there are no conflicts from misunderstanding. I.e. behaving much like civil and well-meaning people behave when they work together.
And if we come up with a way that allows us to reliably analyze what an AI is thinking, why use this complicated scenario and not just train (RL or something) it directly to “do good things while thinking good thoughts”, if we’re relying on our ability to distinguish “good” and “bad” thoughts anyway?
Because we don’t know how to tell “good” thoughts from “bad” reliably in all possible scenarios.
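To make the “train it directly” idea from the question above concrete (purely illustrative; `thought_is_good` is exactly the reliable classifier this reply says we don’t know how to build for all scenarios):

```python
# Toy sketch of rewarding "do good things while thinking good thoughts".
# The whole scheme hinges on `thought_is_good`, a hypothetical classifier
# of thoughts that we do not actually know how to make reliable.

def combined_reward(action_outcome_reward: float,
                    thoughts: list[str],
                    thought_is_good) -> float:
    """Reward = how good the outcome was, minus a penalty for every
    thought the (hypothetical) classifier flags as 'bad'."""
    penalty = sum(0.0 if thought_is_good(t) else 1.0 for t in thoughts)
    return action_outcome_reward - penalty
```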
An agent is anyone or anything that has intelligence and the means of interacting with the real world. I.e. agents are AIs or humans.
One AI =/= one vote. One human = one vote. AIs only get as much authority as humans, directly or indirectly, entrust them with. So, if an AI needs more authority, it has to justify it to humans and other AIs. And it can’t request too much authority just for itself, as tasks that would require a lot of authority will be split between many AIs and people.
You are right that the authority to “vote out” other AIs may be misused. That’s where logs would be handy—for other agents to analyse the “minds” of both sides and see who was doing right.
It’s not completely foolproof, of course, but it means that power grabs are unlikely to happen completely under the radar.
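A minimal sketch of the bookkeeping this implies (all names, numbers and the majority rule are hypothetical, just to make the “one human = one vote, AIs hold only delegated authority” idea concrete): humans each start with one unit of authority, AIs hold only what is delegated to them, and an AI gets switched off when agents holding a majority of the total authority decide to, presumably after inspecting its logs.

```python
# Hypothetical illustration of "one human = one vote, AIs hold only
# delegated authority". Not a proposal for a real protocol.

class Agent:
    def __init__(self, name: str, is_human: bool):
        self.name = name
        self.is_human = is_human
        # Humans start with exactly one unit of authority; AIs start with none.
        self.authority = 1.0 if is_human else 0.0
        self.active = True

    def delegate(self, other: "Agent", amount: float) -> None:
        """Hand part of your own authority to another agent (e.g. an AI)."""
        amount = min(amount, self.authority)
        self.authority -= amount
        other.authority += amount


def wants_out(voter: Agent, target: Agent) -> bool:
    # Placeholder: in the framing above, each agent would inspect the
    # target's action/thought logs and decide whether it is misbehaving.
    return False


def vote_out(target: Agent, agents: list[Agent]) -> bool:
    """Deactivate `target` if agents holding a majority of total authority agree."""
    total = sum(a.authority for a in agents if a.active)
    in_favor = sum(a.authority for a in agents
                   if a.active and a is not target and wants_out(a, target))
    if in_favor > total / 2:
        target.active = False
        target.authority = 0.0
        return True
    return False
```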
Our value function is complex and fragile, but we know of a lot of world states where it is pretty high: our current world and a few thousand years’ worth of its states before.
So, we can assume that world states within a certain neighborhood of our past states have some value.
Also, states far outside this neighborhood probably have little or no value. Because our values were formed in order to make us orient ourselves and thrive in our ancestral environment. So, in worlds too dissimilar from it, our values will likely lose their meaning, and we will lose the ability to normally “function”, the ability to “human”.
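One crude way to picture this (my own toy formalization, not something claimed above): treat the historical world states we endorse as reference points, and give a candidate state a value prior that decays with its distance to the nearest reference state, so states far outside the familiar neighborhood get little or no presumed value.

```python
import math

# Toy illustration only: "value" here is just a prior that decays with
# distance from the neighborhood of known-good (historical) world states.
# The feature vectors and the distance metric are stand-ins.

def value_prior(state, known_good_states, scale=1.0):
    """Higher when `state` is close to some state we know was acceptable,
    approaching zero far outside that neighborhood."""
    nearest = min(math.dist(state, good) for good in known_good_states)
    return math.exp(-nearest / scale)

known_good = [(0.0, 0.0), (1.0, 0.5)]        # hypothetical past world states
print(value_prior((0.2, 0.1), known_good))   # near the familiar neighborhood -> high
print(value_prior((50.0, 50.0), known_good)) # far outside it -> near zero
```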
Point is to make “cooperate” a more convergent instrumental goal than “defect”. And yes, not just in training, but in real world too. And yes, it’s more fine-grained than a binary choice.
There are many more ways to see how cooperative an AI is, compared to how well we can currently check how cooperative a human is. Including checking the complete logs of the AI’s actions, knowledge and thinking process.
And there are objective measures of cooperation. It’s how its actions affect other agents’ success in pursuing their goals. I.e. whether other agents want to “vote out” this particular AI from being able to make decisions and use resources or not.
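A sketch of what such an “objective measure” might look like in toy form (the scoring rule, field names and numbers are my own illustration, not a claim about how it would actually be computed): score an AI by how much its logged actions advanced or set back other agents’ progress toward their own goals; a sufficiently negative score is what the “vote it out” decision would key on.

```python
# Hypothetical toy scoring of cooperativeness: sum, over other agents, of how
# much this AI's actions changed their progress toward their own goals.
# The delta estimates would have to come from the agents' own evaluations/logs.

def cooperation_score(ai_name, action_log):
    """`action_log` is a list of records like
    {"actor": "ai-1", "affected": "human-7", "delta_progress": +0.3},
    where delta_progress is the affected agent's own estimate of how much
    the action helped (+) or hurt (-) its goals."""
    return sum(rec["delta_progress"]
               for rec in action_log
               if rec["actor"] == ai_name)

log = [
    {"actor": "ai-1", "affected": "human-7", "delta_progress": 0.3},
    {"actor": "ai-1", "affected": "ai-2",    "delta_progress": -0.1},
    {"actor": "ai-2", "affected": "human-7", "delta_progress": 0.5},
]
print(cooperation_score("ai-1", log))  # 0.2: mildly cooperative on this log
```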
While having lower intelligence, humans may have bigger authority. And AIs’ terminal goals should be about assisting specifically humans, too.
GPT4 and ChatGPT seem to be getting gradually better at working on the letter level in some cases. For example, they can count to the n-th word or letter in a sentence now. But not from the end.
I just mean that “wildly different levels of intelligence” is probably not necessary, and maybe even harmful. Because then there would be a few very smart AIs at the top, which could usurp power without the smaller AIs even noticing.
Though it might work if those AIs are the smartest but have little authority. For example, they can monitor other AIs and raise the alarm/switch them off if they misbehave, but nothing else.
I think it could work better if the AIs are of roughly the same power. Then if some of them tried to grab more power, or otherwise misbehaved, the others could join forces to oppose it together.
Ideally, there should be a way for AIs to stop each other fast, without having to resort to an actual fight.
My theory is that the core of human values is about what the human brain was made for: making decisions. Making meaningful decisions individually and as a group. Including collectively making decisions about the human fate.
Math problems, physical problems, doing stuff in simulations, playing games.
Human values are complex and fragile. We don’t know yet how to make AI pursue such goals.
Any sufficiently complex plan would require pursuing complex and fragile instrumental goals. AGI should be able to implement complex plans. Hence, it’s near certain that AGI will be able to understand complex and fragile values (for its instrumental goals).
If we make an AI which is able to successfully pursue complex and fragile goals, that will likely be enough to make it AGI.
Hence, a complete solution to Alignment will very likely have solving AGI as a side effect. And solving AGI will solve some parts of Alignment, maybe even the hardest ones, but not all of them.
Our brains were not trained for image generation (much). They were trained for converting 2D images into an understanding of the situation. Which AI still struggles with, needing the help of LLMs to get anywhere near good results.