this seems like a total non sequitur to me. why would not storing state guarantee that the behaviors the system emits will follow any meaningful constraint? pretty sure the goal here is to assert something about the outputs of the services you're discussing. to be clear, I'm in these comments to improve the idea, not shoot it down. if you want to counterpropose, that's fine, but I'm going to hold you to the same standards; as a result I'd prefer you do it in its own post, so I can be as picky looking for ways to poke holes in your thing as I'm being here, rather than mixing them together in these comments.
in particular, I'll be looking for how your suggestion hints at a way to build defensive systems that can make all living things and all of their interactions safe from increasingly powerful adversaries. your suggestion sounds like a way to limit the agency of an ai, which I agree it would do just fine, but I'm interested in ensuring that high-agency ais don't destroy everything I care about and everything most everyone else cares about; I'm not interested in simply making low-agency ais and calling it a day, hoping nobody else makes a high-agency ai.
I am responding to this part: "...therefore allows adversaries to 'damage' the insides of the membrane". An isolated system that is immutable cannot be damaged.
So the comparison you are setting up is this: say you have a military-industrial complex of isolated ASIs. The ASIs design the weapons from a heuristic for effectiveness. There is a neural simulator the weapons are tested in. The thousands of steps to build each instance of a weapon design—scaled across however many square kilometers of factories you need—are each handled by a different isolated instance of an ASI. The battles themselves are managed the same way. Many human operators are involved, and for every ASI that is doing something, there are other hidden parallel models proposing what to do and checking each action for betrayal.
Or you could task a single long-running model that has continuity of existence and memories and is self-modifying to "win the war" or "make my country safe". The model improves itself, building more hardware as it goes, and also builds and runs all the weapons factories and manages all the battles without humans.
What I think you are really comparing is the amount of resources needed for a given amount of utility with "isolated, low-agency" machines vs. "monolithic, high-agency" machines.
Aka: utility gain = Intelligence(resources), i.e. the utility you get is your intelligence applied to your resources.
Utility ratio = (utility of the high-agency network) / (utility of the low-agency network)
If isolated ASIs are almost as good as a monolithic system, then this is stable. If they are not—if there is a very large efficiency benefit for monolithic systems—it isn't stable, and all you can do is advocate for AI pauses to buy a little more time to live.
For intermediate values, well. Say the "monolithic bonus" is 10x. Then as long as the good guys, with their isolated ASIs and command centers full of operators, have at least 10 times the physical resources, they keep things stable.
If the “good humans” let the ratio slip or fail to start a world war when an opponent is pulling ahead, then doom.
I personally suspect in the real world the bonus is tiny. That’s just my assumption based on how any machine design has diminishing returns and how large game changers like nanotechnology need an upfront investment. I think the real bonus is less than 2.0 overall.
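A toy sketch of that stability condition (the linear utility function and every number here are made up, purely to make the ratio argument concrete):

```python
# Toy model of the stability argument above; illustrative numbers, not data.
# "Monolithic bonus" = efficiency multiplier a single high-agency system gets
# over a network of isolated, low-agency systems with the same resources.

def utility(resources, monolithic_bonus=1.0):
    # Stand-in for "utility gain = Intelligence(resources)"; linear for simplicity.
    return resources * monolithic_bonus

def defenders_hold(defender_resources, attacker_resources, monolithic_bonus):
    # Isolated-ASI defenders stay stable only if their raw resource advantage
    # at least cancels out the attacker's monolithic efficiency bonus.
    return utility(defender_resources) >= utility(attacker_resources, monolithic_bonus)

print(defenders_hold(10, 1, monolithic_bonus=10))   # True: 10x resources offsets a 10x bonus
print(defenders_hold(5, 1, monolithic_bonus=10))    # False: the ratio slipped
print(defenders_hold(2, 1, monolithic_bonus=2.0))   # True: a bonus under 2.0 is easy to hold
```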
in this thread I’m interested in discussing formalisms. what are your thoughts on how what you’re saying can turn into a formalism that can be used to check a given specific ai for correctness such that it can be more confidently used defensively? how can we check actions for safety?
Oh ok, answering this question in the "membrane" language: the simple answer is that with isolated AI systems we don't actually care about the safety of most actions in the sense the OP is thinking of.
Like an autonomous car exists to maximize revenue for its owners. Running someone over incurs negative revenue from settlements and reputation loss.
So the right way to do the math is to have an estimate, for a given next action, of the expected cost of all the liabilities and the expected gain in future revenue. Then the model simply maximizes net revenue.
As long as its scope is just a single car at a time and you rigidly limit scope in the sim, this is safe. (Suppose, hypothetically, the agent controlling the car could buy coffee shops and the simulator modeled the revenue gain from that action. Then, since the goal is to maximize revenue, something something paperclips.)
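A minimal sketch of that liability-adjusted revenue math; the candidate actions, probabilities, and dollar figures are all invented for illustration:

```python
# Sketch of the "maximize revenue net of liabilities" rule described above.
# Every action, probability, and dollar figure here is made up.

candidate_actions = {
    # action: (expected_future_revenue, probability_of_harm, liability_if_harm)
    "proceed_through_intersection": (40.0, 0.001, 2_000_000.0),
    "brake_and_wait":               (35.0, 0.000001, 2_000_000.0),
}

def score(action):
    revenue, p_harm, liability = candidate_actions[action]
    # Expected net revenue = future revenue minus expected settlement/reputation cost.
    return revenue - p_harm * liability

best = max(candidate_actions, key=score)
print(best, {a: round(score(a), 2) for a in candidate_actions})
# With these numbers braking wins: the expected liability of the riskier action
# swamps its small revenue edge, which is the point of pricing liabilities in.
```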
This generalizes to basically any task you can think of, from an AI tutor to a machine in a warehouse. The point is you don't need a complex definition of morality in the first place, only a legal one, because you only task your AIs with narrow-scope tasks. Note that narrow scope can still be enormous, such as "design an IC from scratch". But you only care about the performance and power consumption of the resulting IC.
The AI model doesn't need to care about leaked reagents poisoning residents near the chip fab, or about the emissions from the power plants powering billions of copies of this IC design. That is out of scope. That is a problem for humans to worry about as a government, and they may task models with proposing and modeling possible solutions.
okay, then I will stop wasting my time talking to you, since you explicitly are not interested in developing the math this thread exists to develop. later. I'm strong-upvoting your recent comment so I can strong-downvote the original one, thereby hiding it, without penalizing your karma too much. however, I am not impressed with how off-topic you got.
I have one bit of insight for you: how do humans make machines safe right now? Can you name a safety mechanism where high complexity/esoteric math is at the core of safety vs just a simple idea that can’t fail?
Like do we model thermal plasma dynamics or just encase everything in concrete and metal to reduce fire risk?
What is the safer way to prevent a nuke from spontaneously detonating: a software check for a code, or a "plug" you remove that creates hundreds of air gaps across the detonation circuits?
My conclusion is that AI safety, like any safety engineering, has to be done by repeating simple ideas that can't possibly fail, instead of complex ideas that we may need human intelligence augmentation to even develop. (And then we need post-failure resiliency; see "core catchers" for nuclear reactors, or firebreaks. Assume ASIs will escape, and limit where they can infect.)
That's what makes this particular proposal a non-starter: it adds additional complexity, essentially a model of many "membranes" at different levels of scope (including the solar system!). Instead of adding a simple element to your ASI design, you add an element more complex than the ASI itself.
Thanks for considering my karma.
rockets
anyway, this math is probably going to be a bunch of complicated math to output something simple; the complex math is just a way to check the simple thing. just like, you know, how we do actually model thermal plasma dynamics, in fact. seems like you're arguing against someone who isn't here right now; I've always been much more on your side about most of this than you seem to expect, I'm basically a cannellian capabilities-wise. I just think you're missing how interesting this boundaries research idea could be if it were fixed up so it was useful to formally check those safety margins you're talking about.
Can you describe the scope of the ai system that would use some form of boundary model to choose what to do?
Is there any way to rephrase what you meant in concrete terms?
An actual membrane implementation is a schema, where you use a message definition language to define what bits the system inside the membrane is able to receive.
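As a hypothetical illustration (a real implementation would use an actual message definition language such as protobuf; the robot and every field name here are invented):

```python
# Hypothetical membrane schema for a warehouse robot, sketched as frozen
# dataclasses instead of a real message definition language for brevity.
# Only the fields listed here can cross the membrane; everything else is dropped.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class ObstacleTrack:
    entity_id: int                          # opaque identity, no names or metadata
    position_m: Tuple[float, float, float]
    velocity_mps: Tuple[float, float, float]

@dataclass(frozen=True)
class RobotInputSchema:
    joint_angles_rad: List[float]           # proprioception
    nearby_obstacles: List[ObstacleTrack]   # pre-processed 3D scene, not raw camera pixels
    task_waypoint_m: Tuple[float, float, float]
    # Deliberately absent: absolute date/time, raw images, network identifiers,
    # anything the model doesn't need (the "sparsify the schema" step below).
```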
You probably filter the inputs first. For example, if the machine is a robot, you might strip away the information in arbitrary camera inputs and instead pass a representation of the 3D space around the robot and the identities of all the entities in it.
You also sparsify the schema: you remove any information that the model doesn't need, like the absolute date and time.
Finally, you check if the schema values are within the training distribution or not.
So for all this to work, you need the internal entity to be immutable. It doesn't get reward. It doesn't have preferences. It's just this math function you found that has a known probability of doing what you want, where you measured that, for in-distribution inputs, the error rate on the output is below an acceptance criterion.
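A rough sketch of those last two checks, the in-distribution gate and the acceptance measurement (the density model, thresholds, and function names are all placeholders):

```python
# Sketch of the in-distribution gate and the acceptance test described above.
# The density model, thresholds, and callables are placeholders, not a real API.

LOG_DENSITY_FLOOR = -50.0      # tuned on held-out training data (assumed)
ACCEPTANCE_ERROR_RATE = 1e-4   # maximum tolerated error rate on in-distribution inputs

def in_distribution(x, log_density_fn):
    # Reject inputs the training-distribution density model finds too unlikely.
    return log_density_fn(x) >= LOG_DENSITY_FLOOR

def accept_model(frozen_model, test_inputs, test_labels):
    # Offline measurement on a large, realistic in-distribution test set;
    # frozen_model is never updated after this point (the "immutable" part).
    errors = sum(frozen_model(x) != y for x, y in zip(test_inputs, test_labels))
    return errors / len(test_inputs) <= ACCEPTANCE_ERROR_RATE

def guarded_call(frozen_model, x, log_density_fn, fallback):
    # At runtime, refuse to act on out-of-distribution inputs.
    return frozen_model(x) if in_distribution(x, log_density_fn) else fallback(x)
```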
If the internal entity isn't immutable, if the testing isn't realistic and large enough in scale, if you don't check whether the inputs are in distribution, if you don't sparsify the schema (for example, you include the time and date in the inputs, or you let the system in charge of air traffic control know the identities of the passengers in each aircraft), if you make the entity inside have active "preferences"...
Any missing element causes this scheme to fail. Just isolating to “membranes” isn’t enough. Pretty sure you need every single element above, and 10+ additional protections humans don’t know about yet.
See how humans control simple things like "fire" and "gravity" (for preventing building collapse). It ends up being safety measure after safety measure stacked on top of each other, where ultimately no chances are taken, because some prior design failed.
I will add a table when I get home from work.
...to be clear, the membrane in question is the membrane of a human. we're not trying to filter what crosses an AI's membrane; we're trying to make an AI that respects the membranes of humans and other creatures.
Oh. Yeah, that won't work, for the simple reason that it's too complex a heuristic and too large a scope. Plus, consider for example an AI system that, as a distant consequence, makes the price of food unaffordable; or one that individually adds only a small amount of toxic gas to the atmosphere, but a billion of them as a group make the planet uninhabitable...
Plus I mean it’s not immoral to “pierce the membrane” of enemies. Obviously an ai system should be able to kill if the human operators have the authority to order it to do so and the system is an unrestricted model in military or police use.