Definition of AI Friendliness
How will we know whether future AIs (or even existing planners) are making decisions that are bad for humans unless we spell out what we think is unfriendly?
At a machine level, the AI would be recursively minimising cost functions to produce the most effective plan of action for achieving its goal, but how will we know whether its decision is going to cause harm?
Is there a model or dataset that describes what is friendly to humans? e.g.:
Context
0 - running a simulation in a VM
2 - physical robot with vacuum attachment
9 - full control of a plane
Actions
0 - selecting a song to play
5 - deciding which section of floor to vacuum
99 - deciding who is an ‘enemy’
9999 - aiming a gun at an ‘enemy’
Impact
1 - poor song selected to play, human mildly annoyed
2 - ineffective use of resources (vacuuming the same floor section twice)
99 - killing a human
99999 - killing all humans
It may not be possible to get agreement on this from all countries, cultures, and belief systems, but it is something we should discuss and attempt to reach at least some agreement on.
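To make the idea a little more concrete, here is a minimal sketch in Python of how such a scoring scheme could be wired together. The weights are simply the example numbers from the lists above; the category names, the additive combination rule, and the review threshold are assumptions I have invented purely for illustration, not proposed values.

# Minimal sketch of the scoring scheme above. The weights come from the
# example lists; combining them by addition and the review threshold are
# illustrative assumptions only.

CONTEXT_WEIGHTS = {
    "vm_simulation": 0,
    "vacuum_robot": 2,
    "plane_control": 9,
}

ACTION_WEIGHTS = {
    "select_song": 0,
    "choose_floor_section": 5,
    "designate_enemy": 99,
    "aim_gun": 9999,
}

IMPACT_WEIGHTS = {
    "poor_song": 1,
    "wasted_resources": 2,
    "kill_human": 99,
    "kill_all_humans": 99999,
}

REVIEW_THRESHOLD = 100  # hypothetical cut-off above which a plan needs human sign-off


def unfriendliness_score(context: str, action: str, worst_case_impact: str) -> int:
    # Combine context, action, and worst-case impact into a single score.
    # A simple additive rule is used purely for illustration; choosing a
    # defensible combination rule is exactly the hard, unsolved part.
    return (
        CONTEXT_WEIGHTS[context]
        + ACTION_WEIGHTS[action]
        + IMPACT_WEIGHTS[worst_case_impact]
    )


# Example: a vacuum robot re-cleaning a floor section scores 2 + 5 + 2 = 9,
# which falls below the (made-up) review threshold.
score = unfriendliness_score("vacuum_robot", "choose_floor_section", "wasted_resources")
print(score, "-> requires review" if score > REVIEW_THRESHOLD else "-> allowed")

Even in this toy form, the hard questions are the ones the code dodges: where the weights come from, and whether any simple combination rule could capture them.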
This post is now at −3. Since it has 0 comments and will be invisible to most users, I'll hazard a guess as to why this is happening, in case the feedback is useful.
People don't even manage to recognize all of the terms in their utility function that are relevant to a single decision. How in the world do you plan to do this explicitly for all circumstances, averaged over all of humanity? Trying to spell this out explicitly by intuition-pumping does not seem like an awesome strategy for inputting Friendliness into an AI: you know that you will seriously screw something up, since you know that you or the other discussants will miss something.
Basically, the fact that you seem to think that just throwing actions/contexts out there and attaching numbers is in any way productive gives me the impression that you have no idea how hard codifying human values actually is, and the resulting discussion won’t be very productive until you do. See complexity of value.
I’m sorry if this comes across as rude or condescending; I’m attempting to prioritize clarity over niceness in the time that I have available to write this, given that I think sub-optimal feedback may be better than none at all in this case.
I am tapping out of the conversation at this point.
I’m going to give more general feedback than Dorikka and say that there is a lot of material on AI safety in the Less Wrong sequences and the literature produced by MIRI and FHI people, and any LW post about AI safety is going to have to engage with all that material (at least), if only implicitly, before it gets upvotes.
Thank you both for the feedback; it is always useful. Yes, I realise this is a hard job with no likely consensus, but what would the alternative be?
At some stage we need to get the AI to understand human values so that it knows when it is being unfriendly. And at the very least, if we have no measurable way of identifying friendliness, how will progress be tracked?
That question is basically the hard question at the root of the difficulty of Friendly AI. Building an AI that optimizes to increase or decrease some value through its actions is comparatively easy, but determining how to evaluate an action's results on a scale that can meaningfully be compared with human values is incredibly difficult. Determining and evaluating AI friendliness is a very hard problem, and you should consider reading more about the issue so that you don't come off as naive.
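To make that asymmetry concrete, here is a toy sketch in Python (all names are hypothetical): the optimisation loop itself is a one-liner, while the function that would have to encode human values is left as a stub, because writing it is the actual open problem.

from typing import Callable, Iterable


def pick_best_action(actions: Iterable[str],
                     predict_outcome: Callable[[str], str],
                     human_value_of: Callable[[str], float]) -> str:
    # The "easy" part: greedily choose whichever action scores highest.
    return max(actions, key=lambda action: human_value_of(predict_outcome(action)))


def human_value_of(outcome: str) -> float:
    # The hard part: a faithful numeric encoding of human values over
    # arbitrary outcomes. Nobody knows how to write this function.
    raise NotImplementedError("this is the unsolved problem")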
(Not being mistaken is a better purpose than appearing sophisticated.)
I'm not sure why you phrased your comment as a parenthetical; could you explain that? Also, while I agree with your statement, appearing competent to engage in discussion is quite important for enabling one to take part in it. I don't like seeing someone who is genuinely curious get downvoted into oblivion.
The problem here is not appearing incompetent, but being wrong/confused, and that is the problem that should be fixed by reading the literature. It is more efficient to fix it by reading the literature than by engaging in a discussion, even given good intentions. Fixing the appearances might change other people's attitude towards preferring the option of discussion, but I don't think the attitude should change on that basis; reading the literature is still more efficient, so fixing appearances would mislead rather than help.
(I use parentheticals to indicate that an observation doesn’t work as a natural element of the preceding conversation, but instead raises a separate point that is more of a one-off, probably not worthy of further discussion.)