I really like that you’re putting your proposal up for all to read. It lets stumps like me learn a lot. So, thanks for sharing it!
You asked for some ideas about things you want clarity on. I can’t help you yet, but I’d like to ask some questions first, if I can.
Some are about the technical implementation and some have more to do with the logic behind your proposal.
First, some technical questions:
1) What if the original AI is not some recurrent ‘thinking’ AI using compute steps, but something that learns a single function mapping from input to output? For example, a neural net with one hidden layer. Would this still work?
2) When exactly do we train the encoder/decoder? We train it on the predictor states, right? But wouldn’t adding the human imitator degrade our predictor’s performance through dimensionality reduction, making our predictor a worse variant of the one we are interested in?
3) What is the difference between an encoder and a reporter that reports on all of the state? Why wouldn’t all misalignment issues with the reporter occur in the encoder/human imitator in an identical way?
And some other questions:
1) I assume your plan with the human imitator step is to apply or incorporate the idea of a ‘human prior’ (thinking like a human)? My question is: if you have some static function sandwiched between two unbounded, priorless optimizing parts, would it actually guarantee that the human prior is applied? Or could the encoder/reporter misuse the human imitator? In other words, you assume a faithful encoder, but why can we trust that assumption? (I know you say it would be implausible for the AI to distinguish between training and the real world, for example, but exactly that shift from simple to complex is what makes simple learned functions generalize badly.)
2) Say you have this human imitator that you trained on an understandable dataset using the states of the super AI. For simplicity, let’s say it trained on a set of x’s and y’s. And let’s say our human could only count to 10 (10 fingers, after all), so it’s all simple numbers between 0 and 10 (representing our simple scenarios). And let’s say it learns that y is some function of x, like f(x) = 2x+1 for example. Now this encoder comes along and tries to explain this infinitely complex super AI to this small human imitator. It tries to ‘dumb it down’, but complex things are simply not something you can count on your fingers. In an effort to make the human imitator understand, it starts forcing big numbers like eleventy-one down its function. It needs to get enough information through for the decoder to recover most of it, after all. With some effort our human imitator reproduces some numbers along its line, something it usually does not do. Can it still be said that it thinks ‘like a human’?
My question, put more formally: when the dataset (actions, in the case of the ELK example) shifts from simple to complex, the presumed (already exponential) mismatch between the super AI and the human imitator grows larger ((thoughts of) states of the diamond yet unheard of), and the input to the human imitator shifts in complexity as well. As it’s processing this more complex information, how do we know that ‘think like a human’ generalizes to a space unknown to humans?
3) Your proposal requires a dimensionality reduction of the state space of a super AI, which might be unfathomably big, done by an encoder. In the encoder/decoder setup, this has to be done without too much information loss, but it also has to lose enough to fit into the human imitator. This leaves some window. Couldn’t it be said that in the worst case there is no such window? (As described in the mentioned report, in the example using a simple interpretable net.)
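To make the toy example in question 2 concrete, here is a minimal Python sketch. The two imitators are hypothetical stand-ins invented purely for illustration (they are not part of the proposal): both match f(x) = 2x+1 on every training input from 0 to 10, yet they disagree once the encoder pushes in something like eleventy-one (111).

```python
# Toy sketch with hypothetical names: a "human imitator" that only saw x in 0..10.

def true_rule(x):
    # The relationship used in the example above: y = 2x + 1.
    return 2 * x + 1

# Training data: the simple scenarios a human can count on ten fingers.
train = {x: true_rule(x) for x in range(11)}

def lookup_imitator(x):
    # One function consistent with the data: answer with the nearest seen example.
    nearest = min(train, key=lambda seen: abs(seen - x))
    return train[nearest]

def linear_imitator(x):
    # Another function consistent with the data: extrapolate the straight line.
    return 2 * x + 1

# On the training range the two imitators are indistinguishable.
in_dist_agree = all(lookup_imitator(x) == linear_imitator(x) for x in range(11))

# Far outside the training range they give very different answers.
far_input = 111
lookup_out = lookup_imitator(far_input)
linear_out = linear_imitator(far_input)

print(in_dist_agree, lookup_out, linear_out)
```

Both functions fit the training data perfectly, so the data alone cannot say which one the imitator has actually learned. That is the generalization worry in miniature: ‘thinks like a human’ is only pinned down on the inputs humans could label.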
I think the proposal is not to force the AI to funnel all its state through the human, but instead to make the human-simulation available as extra computing resources that the most-effective AIs should take advantage of.