Given that much of the discussion revolves around the tool/agent issue, I’m wondering if anyone can point me to a mathematically precise definition of each, in whatever limited context it applies.
It’s mostly a question for philosophy of mind; specifically, I think, a question about intentionality. I think the closest you’ll get to a mathematical framework is control theory; controllers are a weird edge case between tools and very simple agents. Control theory is mathematically related to Bayesian optimization, which I think Eliezer believes is fundamental to intelligence: thus identifying the cases where a controller is a tool and the cases where it is an agent would be directly relevant. But I don’t see how the mathematics, or any mathematics really, could help you here. It’s possible that someone has mathematized arguments about intentionality using information theory or some such; you could Google that. Even so, I think that at this point the ideas are imprecise enough that plain ol’ philosophy is what we have to work with. Unfortunately, AFAIK very few people on LW are familiar with the relevant parts of philosophy of mind.
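To make the “controllers are a weird edge case” point concrete, here is a minimal sketch of a proportional controller (my own toy illustration in Python, not anything from the comment above): it senses the world and acts on it every cycle, yet it contains no explicit prediction of what would happen under alternative actions, which is why it sits awkwardly between “tool” and “very simple agent”.

```python
# A minimal proportional controller: sense, compute error, act.
# There is no model of "what happens if I choose action a", which is
# what makes its tool/agent status ambiguous under the definitions below.
def proportional_controller(setpoint, gain=0.5):
    def step(measurement):
        error = setpoint - measurement
        return gain * error  # control signal, e.g. heater power
    return step

thermostat = proportional_controller(setpoint=21.0)
print(thermostat(18.0))  # 1.5  -> positive signal: heat
print(thermostat(23.0))  # -1.0 -> negative signal: cool
```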
It is EY’s announced intention to work toward an AI that is provably Friendly. “Provably” means that said AI must first be defined in some mathematical framework. I don’t see how one can make much progress in that area before rigorously defining intentionality.
I guess I am getting ahead of myself here. What would a relevant mathematical framework entail, to begin with?
I don’t think that idiom means what you think it means.
Thank you, fixed.
You were probably fishing for “jumping the gun”.
Yeah, should have been shooting instead of fishing.
It could be said that you shot yourself in the foot by jumping the shark while fishing for a gun.
(It’s possible that intentionality isn’t the sharpest distinction between “tools” and “agents”, but it’s the one I see most often emphasized in philosophy of mind, especially with regard to the necessary preconditions for the development of any “strong AI”.)
It seems that one could write an AI that is in some sense “provably Friendly” even while remaining agnostic as to whether the described AI is or will ultimately become a tool or an agent. It might be that a proposed AI couldn’t be an agent because it couldn’t solve the symbol grounding problem, i.e. because it lacked intentionality, and thus wouldn’t be an effective FAI, but would nonetheless be Friendly in a certain limited sense. However if effectiveness is considered a requirement of Friendliness then one would indeed have to prove in advance that one’s proposed AI could solve the grounding problem in order to prove that said AI was Friendly, or alternatively, prove that the grounding problem as such isn’t a meaningful concept. I’m not sure what Eliezer would say about this; given his thinking about “outcome pumps” and so on, I doubt he thinks symbol grounding is a fundamental or meaningful problem, and so I doubt that he has or is planning to develop any formal argument that symbol grounding isn’t a fundamental roadblock for his preferred attack on AGI.
Your question about what a relevant mathematical framework would entail seems too vague for me to parse; my apologies, it’s likely my exhaustion. But anyway: if minds leave certain characteristic marks on their environment by virtue of having intentional (mental) states, then how precise and deep you can make your distinguishing mathematical framework depends on how sharp a cutoff there is in reality between intentional and non-intentional states. It’s possible that the cutoff isn’t sharp at all, in which case it’s questionable whether the supposed distinction exists or is meaningful. If that’s the case, it may simply not be possible to formulate a deep theory that could distinguish agents from tools, or intentional states from non-intentional ones. I think it likely that most AGI researchers, including Eliezer, hold the position that it is indeed impossible to do so. I don’t think it would be possible to prove the non-existence of a sharp cutoff, so I think Eliezer could justifiably conclude that he didn’t have to prove that his AI would be an “agent” or a “tool”, because he could deny, even without mathematical justification, that such a distinction is meaningful.
I’m tired, apologies for any errors.
Focusing on intentionality seems interesting since it lets us look at black box actors (whose agent-ness or tool-ness we don’t have to carefully define) and ask if they are acting in an apparently goal-directed manner. I’ve just skimmed [1] and barely remember [2] but it looks like you can make the inference work in simple cases and also prove some intractability results.
Obviously, FAI can’t be solved by just building some AI, modeling P(AI has goal “destroy humanity” | AI’s actions, state of world) and pulling the plug when that number gets too high. But maybe something else of value can be gained from a mathematical formalization like this.
[1] I. Van Rooij, J. Kwisthout, M. Blokpoel, J. Szymanik, T. Wareham, and I. Toni, “Intentional communication: Computationally easy or difficult?,” Frontiers in Human Neuroscience, vol. 5, 2011.
[2] C. L. Baker, R. R. Saxe, and J. B. Tenenbaum, “Bayesian theory of mind: Modeling joint belief-desire attribution,” Proceedings of the Thirty-Second Annual Conference of the Cognitive Science Society, 2011.
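As a rough illustration of the kind of inference gestured at in [1] and [2], and of the P(AI has goal | actions, state of world) framing above, here is a toy Bayesian goal-inference sketch. Everything in it (the one-dimensional world, the two candidate goals, the softmax “noisy rationality” assumption, the beta parameter) is a made-up stand-in for illustration, not anything taken from the cited papers.

```python
import math

# Hypothetical toy world: an actor on a number line, starting at 0,
# with two candidate goals. Priors over goals are uniform.
goals = {"reach_+5": +5, "reach_-5": -5}
prior = {g: 1.0 / len(goals) for g in goals}

def action_likelihood(action, position, goal_pos, beta=2.0):
    """Softmax ("noisy rational") choice between stepping -1 or +1."""
    def utility(a):
        return -abs((position + a) - goal_pos)  # closer to the goal is better
    weights = {a: math.exp(beta * utility(a)) for a in (-1, +1)}
    return weights[action] / sum(weights.values())

def posterior_over_goals(actions, prior=prior):
    """P(goal | observed action sequence) by sequential Bayesian updating."""
    post = dict(prior)
    position = 0
    for a in actions:
        for g, goal_pos in goals.items():
            post[g] *= action_likelihood(a, position, goal_pos)
        total = sum(post.values())
        post = {g: p / total for g, p in post.items()}
        position += a
    return post

# An actor that keeps stepping right looks increasingly goal-directed
# toward +5, without us ever opening the black box.
print(posterior_over_goals([+1, +1, +1]))
```

Nothing here touches the intractability results mentioned above; the point is only that “apparent goal-directedness” can be scored from the outside, given a model of noisily rational action.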
Tenenbaum’s papers and related inductive approaches to detecting agency were the first attacks that came to mind, but I’m not sure that such statistical evidence could even in principle supply the sort of proof-strength support and precision that shminux seems to be looking for. I suppose I say this because I doubt someone like Searle would be convinced that an AI had intentional states in the relevant sense on the basis that it displayed sufficiently computationally complex communication, because such intentionality could easily be considered derived intentionality and thus not proof of the AI’s own agency. The point at which this objection loses its force unfortunately seems to be exactly the point at which you could actually run the AGI and watch it self-improve and so on, and so I’m not sure that it’s possible to prove hypothetical-Searle wrong in advance of actually running a full-blown AGI. Or is my model wrong?
I am not sure I agree with Holden that there’s a meaningful distinction between tools and agents. However, one definition I could think of is this:
“A tool, unlike an agent, includes blocking human input in its perceive/decide/act loop.”
Thus, an agent may work entirely autonomously, whereas a tool would wait for a human to make a decision before performing an action.
Of course, under this definition, Google’s webcrawler would be an agent, not a tool—which is one of the reasons I might disagree with Holden.
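To make the “blocking human input” definition concrete, here is a minimal sketch (an assumed illustration, not code from the comment; perceive, decide, act, and ask_human are hypothetical callables): the tool’s loop stalls on a human approval step, while the agent’s loop closes on its own.

```python
# Tool: a blocking human decision sits inside the perceive/decide/act loop.
def tool_loop(perceive, decide, act, ask_human):
    while True:
        observation = perceive()
        proposal = decide(observation)
        if ask_human(proposal):  # blocks until a human approves or rejects
            act(proposal)

# Agent: the same loop closes without waiting on anyone.
def agent_loop(perceive, decide, act):
    while True:
        observation = perceive()
        act(decide(observation))
```

Under this carving, a crawler that fetches and follows links without waiting for approval lands on the agent side, which is exactly the awkward consequence noted above.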
I don’t think anyone will be able to. Here is my attempt at a more precise definition than what we have on the table:
An agent models the world and selects actions in a way that depends on what its modeling says will happen if it selects a given action.
A tool may model the world, and may select actions depending on its modeling, but may not select actions in a way that depends on what its modeling says will happen if it selects a given action.
A consequence of this definition is that some very simple AIs that can be thought of as “doing something,” such as some very simple checkers programs or a program that waters your plants if and only if its model says it didn’t rain, would count as tools rather than agents. I think that is a helpful way of carving things up.
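A toy contrast in code may make that carving clearer; the plant-watering case follows the example above, while the model dictionary and the predict_outcome helper are assumptions introduced purely for illustration.

```python
# Tool: consults its world model, but never asks "what happens if I act?".
def tool_waterer(model):
    return "water" if not model["rained_today"] else "do_nothing"

# Agent: chooses by comparing what its model predicts about its own actions.
def agent_waterer(model, predict_outcome):
    actions = ["water", "do_nothing"]
    return max(actions, key=lambda a: predict_outcome(model, a))

# Example usage with a stub predictor (hypothetical numbers).
model = {"rained_today": False, "soil_moisture": 0.2}
predict = lambda m, a: m["soil_moisture"] + (0.5 if a == "water" else 0.0)
print(tool_waterer(model))            # "water", triggered by a fact about the world
print(agent_waterer(model, predict))  # "water", chosen by comparing predicted outcomes
```

Both programs end up watering here, but only the second does so because of what its model says will happen if it waters, which is the line the definition draws.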
So if the question is about the future (such as “will it rain tomorrow?”), does it essentially mean that a tool will model the counterfactual future that would happen if the tool did not provide any answer?
This would be OK for situations where the AI’s answer does not make a big difference (such as “will it rain tomorrow?”).
It would be less OK for situations where the mere knowledge of what the AI said would itself influence the result, such as asking the AI about important social or political topics where the answer is likely to be published. (In these situations the question being considered would get mixed up with specific events of the counterfactual world, such as a worldwide panic of “our superhuman AI seems to be broken, we are all doomed!”)
I think that you’re describing a real hurdle, though it seems like a hurdle that could be overcome.