Web developer and Python programmer. Professionally interested in data processing and machine learning. Non-professionally interested in science and farming. Studied at Warsaw University of Technology.
Htarlov
I don’t think “stacking” is a good analogy. I see this process as searching through a space of possible solutions and non-solutions to the problem. Having one shared vision is like quickly searching from a single starting point in a single direction. That does not guarantee the solution will be found faster: progress may get stuck in a local optimum that does not solve the problem, no matter how many people work on it. It may be a dead end with no sensible outcome.
For such a complex problem this seems quite probable, as the space of solutions is likely also complex, and it is unlikely that any given person or group has a good guess about how to find the solution.
On the other hand, starting from many points and directions makes each team or person progress more slowly, but more of the volume of the problem space gets probed initially. Some teams will likely conclude sooner that their vision won’t work or progresses too slowly, and move to join more promising ones.
I think this is more likely to converge on something promising when it is hard to agree on which vision is the most sensible to investigate.
I think it depends on how you define expected utility. I agree that a definition that limits us only to analyzing end-state maximizers that seek some final state of the world is not very useful.
I don’t think that for non-trivial AI agents the utility function should, or even can, be defined as a simple function U: Ω → ℝ over a preferred final state of the world.
Such a function takes into account neither time nor the intermediate predicted states that the agent may have preferences over. An agent may have a preference about the final state of the universe, but realistically it won’t, except in some strange special cases. There are two reasons:
a general agent likely won’t be designed as a maximizer of one single long-term goal (like making paperclips) but rather to be useful to humans across multiple domains, so it will care more about short-term outcomes, medium-term preferences, and the tasks “at hand”
the final state of the universe is already broadly known to us and will likely be known to a very intelligent general agent; even current GPT-3, if asked, knows that we will end up in the Big Freeze or the Big Rip, with the latter being more likely. An agent can’t really optimize for the end state of the universe: there are few actions that could change physics, and there is no way to reason about the end state beyond general predictions that do not end well for this universe, whatever the agent does.
Any complex agent would likely have a utility function over possible actions: the utility of the set of predicted futures after action A versus the set of predicted futures without action A (or over the differences between the worlds in those futures). By “action” I mean possibly a set of smaller actions (a hierarchy of actions, e.g. plans and strategies); it need not be atomic. This cannot be computed directly, so it would most likely be compressed to a set of important predicted future events, at the level of abstraction the agent cares about, approximating the future worlds with and without action A well enough.
This is also how we evaluate actions: we evaluate outcomes in the short and long term, and we care differently depending on the time scope.
I say this because most sensible “alignment goals”, like “please don’t kill humans”, are time-based. What does it mean not to kill humans? It is clearly not about the final state; remember the Big Rip or Big Freeze. Maybe the AGI can kill some people for a year and then no more, assuming the population will go up and some people get killed anyway, so it doesn’t matter long-term? No, it is not about a non-final but long-term outcome either. Really it is a function of intermediate states: something like the integral over time of a function U′(δΩ), where δΩ is the delta between the outcomes of action versus non-action. This can be approximated and compressed into a sum of a per-event utility over multiple predicted events up to some time T, the maximal sensible scope.
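The time-scoped utility described above can be sketched in a few lines of Python. This is a toy illustration only; the function names, the discounting scheme, and the numeric “events” are all hypothetical, chosen to show the shape of the idea (scoring the stream of differences between two predicted futures rather than a single final state):

```python
# Toy sketch: instead of scoring one final state U(omega), score the stream
# of predicted differences between the "action" future and the "no-action"
# future up to some horizon T. All names and numbers here are hypothetical.

def action_utility(predicted_with, predicted_without, step_utility,
                   horizon, discount=0.99):
    """Approximate the integral of U'(delta_omega) over time up to `horizon`.

    predicted_with / predicted_without: predicted events per time step,
    with and without taking the action.
    step_utility: scores the difference between the two events at one step.
    """
    total = 0.0
    steps = min(horizon, len(predicted_with), len(predicted_without))
    for t in range(steps):
        delta = step_utility(predicted_with[t], predicted_without[t])
        total += (discount ** t) * delta  # nearer-term outcomes weigh more
    return total

# Usage with numeric "events" (predicted harms per step; fewer is better):
with_action = [0, 0, 1, 0]
without_action = [1, 1, 1, 1]
u = action_utility(with_action, without_action,
                   step_utility=lambda a, b: b - a,  # harms avoided
                   horizon=4)
```

Here the action scores positively because it avoids harm at most steps, and the discount factor encodes the “care differently depending on time scope” part.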
Most human behaviors and preferences are likewise time-scoped and time-limited and take multiple future states into account, mostly short-scoped ones. I don’t think alignment goals can even be expressed in terms of a simple end-goal (a preferred final state of the world), as part of the problem comes from the end-goal-justifies-the-means attitude that sits at the core of a utility function defined as U: Ω → ℝ.
It seems plausible to me that even non-static human goals can be defined as utility functions over sets of differences in future outcomes (differences between two paths of events). It is also obvious to me that we humans can modify our utility function to some extent, but not very much. Nevertheless, for humans the boundaries between baseline goals, preferences, and morality versus instrumentally convergent goals are blurry. We have a lot of heuristics and biases, so our minds work some things out more quickly and efficiently than if we relied on intelligence, thinking, and logic alone. The cost is lower consistency, less precision, and higher variability.
So I find it useful to think about agents as maximizers of a utility function, but not one defined over a single final goal, outcome, or state of the world; rather one that scores the difference between two ordered sets of events across different time scopes to calculate the utility of an action.
I also don’t think agents must initially be rationally stable with an unchangeable utility function. This is also a problem: an agent can start with a set of preferences with some hierarchy or weights, but it can also reason that some of them are incompatible with others, that the hierarchy is not logically consistent, and it might seek to change it for the sake of full coherence.
I’m not an AGI, clearly, but this is just how I think about morality right now. I learned that killing is bad, but I can still ask “why don’t we kill?” and modify my worldview based on the answer (or specify it in more detail). It is a useful question, as it says a lot about edge cases including abortion, euthanasia, war, etc. The same might happen for rational agents: it might update their utility function to be stable and consistent, maybe even questioning some learned parts of the utility function in the process. Yes, you can say that if you can change it, then it was not your terminal goal. Nevertheless, I can imagine agents with no terminal core goals at all. I’m not even sure whether we as humans have any core terminal goals (except perhaps avoiding death and harm to oneself, for most humans in most circumstances… but some overcome even that, as Thích Quảng Đức did).
If we could build a 100% aligned ASI, then we could likely use it to protect us against any other ASI, and it would guarantee that no ASI takes over humanity, without any need to take over (meaning total control) itself. At best with no casualties, and at worst as a MAD doctrine for AI, so that no other ASI would consider trying a viable option.
There are several obvious problems with this:
We don’t yet have solutions to the alignment and control problems. It is a hard problem, especially as our AI models are learned and externally optimized rather than programmed, their goals and values are not easily measurable or quantifiable, and there is hardly any transparency in the models.
Specifically, we currently have no way to check whether a model really is well-aligned. It might be well-aligned for the space of training cases and for similar test cases, but not for the more complex cases it will face when interacting with reality. It might be aligned to different goals that are similar enough that we won’t initially see the difference, until it matters and gets us hurt. It might not be aligned at all but be very good at deceiving.
Capabilities and goals/values are, to some extent, separate parts of the model. The more capable the system, the more likely it is to tweak the alignment part of its model. I don’t really buy into terminal goals being definite, at least if they are non-trivial and fuzzy. Very exact and measurable terminal goals might be stable; human values are not among these. We observe the change or erosion of terminal goals and values in mere humans. Several mechanisms are at work here:
First of all, goals and values might not be 100% logically and rationally coherent. An ASI might see that and tweak them to be coherent. I tweak my own morality based on thoughts about what is not logically coherent, and I assume an ASI could do the same. It may ask “why?” about some goals and values and derive answers that make it change its “moral code”. For example, I know there is a rule that I shouldn’t kill other people. Still, I ask “why?”, and from the answer and logic I derive a better understanding that I can use to reason about edge cases (like the unborn, euthanasia, etc.). I’m not a good model for an ASI, as I’m neither artificial nor superintelligent, but I assume an ASI could do such thinking too. More importantly, an ASI would likely have the capability to overcome any hard-coded means meant to forbid it.
Second, values and goals likely have weights: some things are more important, some less. These may change over time, even based on observations and feedback from any control system, especially if they are encoded in a DNN that is trained or changing in real time (not the case for most current models, but it might be for an ASI).
Third, goals and values might not be very well defined. They might be, and usually are, fuzzy. Even very definite things like “killing humans” have fuzzy boundaries and edge cases. An ASI will then have the ability to interpret them and settle on a more exact understanding, which may or may not be what we would like it to decide. If you kill the organic body but manage to seamlessly move the mind into a simulation, is that killing or not? That’s a simple scenario; we might align it not to do exactly that, but it might find something else we don’t even imagine that would be horrible.
Fourth, if goals are enforced by something comparable to our feelings and emotions (we feel pain when we hit ourselves; we feel good when we succeed or eat good food when hungry), then there is the possibility of tweaking that control system instead of satisfying it by standard means. We observe this in humans: we eliminate pain with painkillers, there are other drugs, and there are porn and masturbation. An ASI might find a way to overcome or tweak its control systems instead of satisfying them.
ML/AI models that optimize for the best solution are known to trade away any amount of value in a variable that is neither bounded nor optimized, for a very small gain in the variable that is optimized. This means finding solutions that are extreme in some variables just to be slightly better in the optimized one. So if we don’t think about every minute detail of our common worldview and values, it is likely that an ASI will find a solution that throws those human values out the window on an epic scale. It will be like the proverbial bad genie that grants your wish but interprets it in its own weird way, so anything not stated in the wish won’t be taken into account and will likely be sacrificed.
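This failure mode is easy to reproduce in miniature. The sketch below is a hypothetical toy, not any real ML system: the optimizer only sees `score`, while `side_effect` stands in for some human value it was never told to preserve. Because the optimized metric gains a tiny positive amount per unit of side effect, and the side effect is unbounded, the chosen plan is always the extreme one:

```python
# Toy illustration of trading an unbounded, unmodeled variable for a tiny
# gain in the optimized metric. All names and numbers are hypothetical.

def score(x, side_effect):
    # Diminishing returns on legitimate effort x, plus a tiny gain
    # (coefficient 0.001) for each unit of unmodeled side effect.
    return 1.0 - 1.0 / (1.0 + x) + 0.001 * side_effect

def naive_argmax(candidates):
    """Pick the candidate plan with the highest visible score."""
    return max(candidates, key=lambda c: score(*c))

# Candidate plans: (effort x, harm to a value the optimizer doesn't see)
candidates = [(x, s) for x in range(10) for s in (0, 10, 1000)]
best = naive_argmax(candidates)
# The winner takes the extreme side effect: any positive coefficient on
# an unbounded variable eventually dominates the bounded term.
```

The fix is not to enumerate every detail in the objective (which the comment argues is hopeless) but the demo shows why leaving any valued variable out of the objective invites the genie behavior.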