Problem: there is no non-fiction about human-level AIs. Everything in an LLM's training data concerning human-level AIs is fiction. So consider the hypotheses available to ChatGPT: in what context in its training data would it be most likely to encounter text like "you are Agent, a friendly aligned AI..." followed by humans asking it to perform various tasks? Probably some kind of weird ARG. So in current interactions, ChatGPT may well just be LARPing as a human LARPing as a friendly AI. I don't know whether this is good or bad for safety, but I have a feeling it's a hypothesis we can test.