During RL finetuning, given this post-unsupervised initialization, there’s now an inductive bias towards just hooking human-like criteria up to bid on internal AI plans. I.e., humans give approval-based reinforcement, and an inductively easy way of upweighting the logits on those reinforced actions is to just hook the human-like plan-criteria into the AI’s planning process, so the AI gets a humanlike “care about people” shard. P(3 | 2, 1) = .55; due to plurality of value, I expect this to be only one of the ways it learns to make decisions.
This is where I’d put a significantly low probability. Could you elaborate on why there’s an inductive bias towards “just hooking human-like criteria up to bid on internal AI plans”? As far as I can tell, an inductive bias towards human-like values would require something that at least seems closer to the human brain’s structure than any arbitrary ML architecture we have right now. Rewarding a system for better modeling human beings’ desires doesn’t seem to me to lead it towards having similar desires. I’d use the “instrumental versus terminal desires” concept here, but I expect you would consider that something that adds confusion instead of removing it.
Because it’s a shorter edit distance in its internal ontology; it’s plausibly NN-simple to take existing plan-grading procedures, internal to the model, and hook those more directly into its logit-controllers.
Also note that it probably internally hooks up lots of ways to make decisions, and this only has to be one (substantial) component. Though possibly I’d now put .3 or .45 instead of .55.
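To make the “shorter edit distance” intuition concrete, here is a minimal toy sketch, not a claim about any real model’s internals: all names (ToyBidder, approval_gate, etc.) are made up for illustration. The idea is that if pretraining already produced an internal “would a human approve of this plan?” feature as part of modeling humans, then RL finetuning only has to learn a tiny connection (here, one scalar) to make that existing criterion bid on plans, rather than grow a new evaluation circuit from scratch.

```python
# Toy sketch (hypothetical architecture, for illustration only).
import torch
import torch.nn as nn

class ToyBidder(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_model)   # stand-in for pretrained plan representations
        self.approval_head = nn.Linear(d_model, 1)   # already learned during pretraining:
                                                     # "would a human approve of this plan?"
        self.bid_head = nn.Linear(d_model, 1)        # the planner's own bid (logit) for each plan
        # The "hook": a single scalar connecting the approval estimate to the bid.
        # Initialized at 0, so turning this pathway on is a tiny parameter edit for SGD.
        self.approval_gate = nn.Parameter(torch.zeros(()))

    def forward(self, plan_features):                 # (n_candidate_plans, d_model)
        h = torch.relu(self.encoder(plan_features))
        bids = self.bid_head(h).squeeze(-1)           # one logit per candidate plan
        approval = self.approval_head(h).squeeze(-1)  # reused human-modeling criterion
        # Plans the internal human-model predicts humans would approve of get upweighted.
        return bids + self.approval_gate * approval

planner = ToyBidder()
plan_logits = planner(torch.randn(5, 64))             # 5 candidate plans -> 5 biased logits
plan_probs = torch.softmax(plan_logits, dim=0)        # e.g. sample which plan to execute
```

Under approval-based reinforcement, gradient descent plausibly finds the one-parameter path (increasing approval_gate) long before it finds a freshly learned grading circuit; that asymmetry is the whole content of the “short edit distance” claim in this toy setting.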