Hi ARC Team,
Thanks for your valuable work. I’ve been thinking about this problem, and my current view is that one portion of the ELK problem is solvable, and another portion is fundamentally impossible. This is a sketch of my argument; if you think it is worth writing up in more detail (or worth addressing an objection you raise), let me know.
Let’s divide facts about the world into two categories: those that are verifiable by some sensor humans can create and understand, and those that are not. My claim is that ELK is achievable for the first set of facts, but not for the second.
As for how to achieve ELK for the first set of facts, I point you towards the ‘terrifyer’ approach discussed here: https://www.lesswrong.com/posts/QEYWkRoCn4fZxXQAY/prizes-for-elk-proposals?commentId=JfmnRZNgrJn8Jnosx . I think that approach adequately handles the verifiable case.
However, I think that proposal is over-optimistic when it assumes that the set of facts verifiable with a human-understandable sensor includes all important facts. This is a logical positivist approach to knowledge, which has well-known problems. To name a few counterexamples: humans care about things that are physically intangible, like justice and consciousness; certain bads might take a very long time to manifest (and therefore never set off any human-understandable sensor); and there might exist bads that humans can’t understand at all, given our mental capacities. I don’t think it is possible to train a computer to act on or report these kinds of bads, even though the computer might ‘know about them’ in some sense.
Here’s an illustration of what I think the contradiction is in the ‘Diamond Safe’ setting.
Humans train an AI to protect a diamond and report whether the diamond is ‘Safe’. The AI may instead learn the concept ‘Shmafe’ (which in human language means ‘safe from all attacks understandable or detectable by humans’), which is indistinguishable from Safe on any possible training data. Humans cannot distinguish or communicate the difference between Safe and Shmafe, and therefore cannot elicit from the AI whether a diamond is Safe or merely Shmafe.
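Here is a minimal sketch of this indistinguishability. The attack names and sets are invented purely for illustration; the point is that if human-generated training labels can only test human-detectable attacks, then Shmafe coincides with the labels everywhere, while Safe diverges only in cases no label can reach.

```python
# Toy illustration of Safe vs. Shmafe. All names and attack sets here are
# hypothetical, invented for illustration; this is a sketch, not a real ELK setup.
from itertools import chain, combinations

HUMAN_DETECTABLE_ATTACKS = {"smash", "steal", "drill"}
ALL_ATTACKS = HUMAN_DETECTABLE_ATTACKS | {"nanotech_swap"}  # one attack humans cannot detect

def safe(resists):
    """'Safe': the protection system resists every attack, detectable or not."""
    return ALL_ATTACKS <= resists

def shmafe(resists):
    """'Shmafe': the system resists every human-detectable attack."""
    return HUMAN_DETECTABLE_ATTACKS <= resists

def human_label(resists):
    """Humans label a system 'safe' iff every test they can actually run passes."""
    return HUMAN_DETECTABLE_ATTACKS <= resists

# Enumerate every possible protection system, represented as the set of
# attacks it resists.
systems = [set(s) for s in chain.from_iterable(
    combinations(sorted(ALL_ATTACKS), r) for r in range(len(ALL_ATTACKS) + 1))]

# Shmafe agrees with the human-generated label on *every* system, so no
# training datum can ever separate the two concepts...
assert all(shmafe(s) == human_label(s) for s in systems)

# ...even though Safe genuinely differs from the label on some systems
# (namely those that resist only the human-detectable attacks).
assert any(safe(s) != human_label(s) for s in systems)
```

The divergence between the two concepts occurs exactly on the systems humans cannot test, which is why training on human labels can never reward the AI for reporting Safe rather than Shmafe.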
Here’s a proof sketch, which relies on a theory of what words mean:
A1) Humans cannot fully specify, in a training set, which outcomes count as true/false, good/bad, or success/failure, even with unlimited sensors.
A2) Strong AIs have a ‘better’ predictive model of the world than humans.
A3) One of the two leading theories of semantic externalism is true: either words are defined by their reference (i.e. a complete descriptive definition of when they apply), or words arise from objects being ‘baptized’ with a name, with the meaning of the word gradually discovered as we learn more about the thing’s essence (i.e. we initially said “water is whatever the main constituent of that lake we drink from is”, and this eventually evolved into “water is H2O”). The first view is classical; the second is associated with Putnam and the Twin Earth thought experiment. Alternatively, semantic internalism (the view that words are defined by what the speaker intends them to mean) is true.
The proof proceeds by showing that, under every theory of meaning in (A3), divergence between a human’s and a different agent’s understanding of a term’s meaning is unpreventable.
C1) If words are defined by their reference (i.e. a definition), then, by A1, humans cannot fully articulate what words mean in general. It is impossible for an AI and a human to reach agreement on the ambiguous remainder of a term’s meaning, so there is no sense in which this problem can be ‘fixed’.
C2) If words are defined by baptisms, and we learn more about the essence of the baptized object to arrive at better definitions over time, then unless humans already have a perfect understanding of a substance, a more sophisticated AI will have a better, and therefore different, definition of terms than humans do.
C3) If words are defined by the intentions of their speakers, then humans may intend an open-ended definition (e.g. a safe diamond is whatever an optimally knowledgeable version of myself would consider a safe diamond). How would a strong AI model an optimally knowledgeable human’s ontology? It would probably use the best ontology it has access to, i.e. the one it is already using. So this doesn’t solve the problem either.
Is this proof too strong? Does it mean that any communication between a pair of agents is impossible? Well, kind of. The key issue is that we have trouble communicating about things that are not physically verifiable. Most humans have access to roughly the same set of empirically verifiable facts, so most of the time this isn’t an issue. On the other hand, when they talk about “justice” or “God” with each other, they may mean extremely different things, and there might be no good way to get them “on the same page”, especially if one of the pair has only a fuzzy idea of one of these concepts. If one human were far more powerful than another, I can see how the inability to communicate using these concepts could lead (and historically has led) to conflict.
(Note—I spoke to Yonadav S. and benefited greatly from conversation with him before submitting this).