Thank you again for your response! Someone taking the time to discuss this proposal really means a lot to me.
I fully agree with your conclusion of “unnecessary complexity” given the premise that the method for aligning the judge is then somehow used to align the model, which of course doesn’t solve anything. That said, I believe there may have been a misunderstanding, because that isn’t at all what this system is about. The judge, when controlling a model in the real world or when aligning a model that is already reasonably smart (more on this in the following paragraph), is always a human.
The part about using a model trained via supervised learning to classify actions as good or bad isn’t a core part of the system, but only an extension to make the training process easier. It could be used at the start of training, when the agent, police, and defendant models possess only a very low level of intelligence (meaning the police and defendant models mostly agree). As soon as the models show even basic levels of intelligence, the judge will immediately need to be a human. This should have been explained better in the post, sorry.
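To make the handoff described above concrete, here is a minimal sketch. All names, scores, and the agreement threshold are my own illustrative assumptions, not part of the original proposal: the supervised classifier is trusted only while the police and defendant models still mostly agree, and every verdict is escalated to a human judge once they diverge.

```python
# Illustrative sketch (hypothetical names/threshold): a supervised "judge"
# classifier handles verdicts early in training, while police and defendant
# scores still mostly agree; once they start to diverge, verdicts are
# escalated to a human judge.

AGREEMENT_THRESHOLD = 0.95  # hypothetical cutoff for "mostly agree"

def judge_verdict(action, police_score, defendant_score, classifier, human_judge):
    # Agreement is high when the two models assign similar scores.
    agreement = 1.0 - abs(police_score - defendant_score)
    if agreement >= AGREEMENT_THRESHOLD:
        # Early training: the cheap supervised classifier is good enough.
        return classifier(action)
    # The models have started to diverge: hand off to the human judge.
    return human_judge(action)
```

The design choice this sketch highlights is that the classifier is a bootstrapping convenience, not a load-bearing component: the escalation path to a human exists from the start.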
Of course there is a point to be made about the models pretending to be dumber than they actually are to avoid being replaced by a human, but this part of the system is only optional, so I would prefer to discuss the other parts first; the system would still work without this step. (If you want, I would love to come back to this later.)
Mutual self-preservation
No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.
At first I also thought this would be the case, but after thinking more about it, I concluded that this would go against the cognitive grooves instilled in the models during reinforcement learning. This is based on the “Reward is not the optimisation target” post. This conclusion can definitely be debated.
Also, you mentioned that there are other assumptions; if you listed them, I could try to clarify what I meant. Thank you again for your time.
I also don’t agree with the other assumptions you’ve made.
I read Reward is not the optimisation target as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it.
Near the top of the responses was this quote:
… If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations. …
Emphasis mine.
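The quoted objection can be made concrete with a deliberately toy sketch. The action names and reward values below are my own invented illustration, not from either post: a model-based agent that can predict per-action reward simply takes the highest-reward action, regardless of how past gradient updates shaped it.

```python
# Toy illustration of the quoted argument: an agent that can model/predict
# the reward of each candidate action just takes the argmax ("goes for the
# blueberry"). Action names and reward values are invented for illustration.

def pick_action(actions, predicted_reward):
    # A model-based agent uses its reward predictions directly.
    return max(actions, key=predicted_reward)

predicted = {"eat_blueberry": 1.0, "ignore_blueberry": 0.1}
choice = pick_action(predicted, predicted.get)  # the agent goes for the blueberry
```

The point of the sketch is the asymmetry the quote calls bizarre: if the agent is capable enough to predict its own future gradient updates, it is certainly capable of predicting future reward, and a reward predictor this simple already suffices to choose the blueberry.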
I tend to be suspicious of people who insist their assumptions are valid without being willing to point to work that proves the hypothesis.
In the end, your proposal has a test plan. Do the test, show the results. My prediction is that your theory will not be supported by the test results, but if you show your work, and it runs counter to my current model and predictions, then you could sway me. But not until then, given the assumptions you made and the assumptions you’re importing via related theories. Until you have test results, I’ll remain skeptical.
Don’t get me wrong, I applaud the intent behind searching for an alignment solution. I don’t have a solution or even a working hypothesis myself. I don’t agree with everything in this article (that I’m about to link), but it relates to something I’ve been thinking for a while: that it’s unsafe to abstract away the messiness of humanity in pursuit of alignment. Humans are not aligned with one another, and trying to create alignment where none exists naturally is inherently problematic.
You might argue that humans cope with misalignment, and that that’s our “alignment goal” for AI… but I would propose that humans cope due to power imbalance, and that the adage “power corrupts, and absolute power corrupts absolutely” is relevant. Said another way: if you want to know the true nature of a person, give them power over another and observe their actions.
[I’m not anthropomorphizing the AI. I’m merely saying that if one intelligence (humans) can display this behavior, and deceptive behaviors can be observed in less intelligent entities, then an intelligence of similar level to a human might possess similar traits. Not as a certainty, but as a non-negligible possibility.]
If the AI is deceptive so long as humans maintain power over it, and then behaves differently once that power imbalance changes, that’s not “the alignment solution” we’re looking for.