90% awful idea: “Genetic diversity” in computer programs for resistance to large-scale cyberattacks.
The problem: Once someone has figured out the right security hole in Tesla’s software (and, say, broken into a server used to deliver software updates), they can use this to install their malicious code into all 5 million Teslas in the field (or maybe just one model, so perhaps 1 million cars), and probably make them all crash simultaneously and cause a catastrophe.
The solution: There will probably come a point where we can go through the codebase and pick random functions and say, “Claude, write a specification of what this function does”, and then “Claude, take this specification and write a new function implementing it”, and end up with different functions that accomplish the same task, which are likely to have different bugs. Have every Tesla do this to its own software. Then the virus or program that breaks into some Teslas will likely fail on others.
One reason this is horrible is that you would need an exceptionally high success rate for writing those replacement functions—else this process would introduce lots of mundane bugs, which might well cause crashes of their own. That, or you’d need a very extensive set of unit tests to catch all such bugs—so extensive as to probably eat up most of your engineers’ time writing them. Though perhaps AIs could do that part.
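Concretely, a minimal sketch of what that loop might look like, assuming a placeholder ask_llm call (not any real API) and an existing, thorough test suite to reject broken variants:

```python
# Minimal sketch of the spec -> reimplement -> test loop (hypothetical: `ask_llm`
# is a placeholder, not a real client, and the test-gating is described rather
# than wired up here).
import random

def ask_llm(prompt: str) -> str:
    """Placeholder for an actual LLM client call (Claude, a local model, etc.)."""
    raise NotImplementedError

def make_variant(source: str) -> str:
    """Spec first, then reimplement from the spec alone, so the variant's bugs
    are (hopefully) decorrelated from the original's."""
    spec = ask_llm("Write a precise specification of this function:\n" + source)
    return ask_llm("Write a fresh implementation of this specification:\n" + spec)

def harden(functions: dict[str, str], fraction: float = 0.1) -> dict[str, str]:
    """Regenerate a random subset of functions. A real pipeline would patch each
    candidate into the source tree, rebuild, and keep it only if the unit and
    integration suites still pass, reverting to the original otherwise."""
    out = dict(functions)
    k = max(1, int(len(functions) * fraction))
    for name in random.sample(sorted(functions), k=k):
        out[name] = make_variant(functions[name])
    return out
```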
This is a good idea and it already works; it is just that AI is wholly unnecessary. Have a look at the 2018 post Protecting Applications with Automated Software Diversity:

“Our strategy is for variants to preserve well-defined behavior in the application but introduce diversity in the effect of undefined behavior (such as out-of-bounds accesses).”
This Galois work is a lot narrower, targeted at low-level details irrelevant to most code, which thankfully is now written in non-C languages, where out-of-bounds accesses don’t pwn your machine, undefined behavior does not summon nasal demons, and stuff like ASLR is largely irrelevant.
So AI is wholly necessary for most of the value of such a metamorphic code idea.
And yeah, I think it’s a pretty decent idea: with cheap enough LLMs, you can harden applications by sampling possible implementations which pass all unit tests and whose final combination passes all end-to-end or integration tests. LLMs are already cheap enough that you can do a bit of this to check things. (Last night, Achmiz asked a Markov chain question and I was too lazy to figure it out myself, so I had ChatGPT solve it 3 ways in R: Monte Carlo, solving the transition matrix, and deriving an exact closed-form probability. The answer could still be wrong, but that seems unlikely when all three agree. If I wanted to write it up, I’d also have Claude solve it independently in Python so I could cross-check all 6 versions...)
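A toy stand-in for that kind of cross-check (the original question isn’t specified above, so this uses a made-up one: the probability that a fair random walk on 0..3, started at 1, reaches 3 before 0, whose exact answer is 1/3):

```python
# Toy stand-in for the cross-checking described above; the actual Markov chain
# question is not given, so this uses a made-up one: the probability that a
# fair random walk on 0..3, started at 1, hits 3 before 0 (exact answer: 1/3).
import numpy as np

rng = np.random.default_rng(0)

# 1. Monte Carlo.
def hits_top_first(start=1, lo=0, hi=3):
    s = start
    while lo < s < hi:
        s += rng.choice([-1, 1])
    return s == hi

mc_estimate = np.mean([hits_top_first() for _ in range(100_000)])

# 2. Linear algebra: absorption probabilities h of the transient states {1, 2}
#    satisfy (I - Q) h = b, where Q is the transient-to-transient transition
#    block and b holds the one-step probabilities of absorbing at state 3.
Q = np.array([[0.0, 0.5],
              [0.5, 0.0]])
b = np.array([0.0, 0.5])
h = np.linalg.solve(np.eye(2) - Q, b)

# 3. Closed form for the symmetric gambler's ruin: start / N.
exact = 1 / 3

print(mc_estimate, h[0], exact)  # three independent routes; they should agree to ~2 decimals
```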
This would help avoid a decent number of logic bugs and oversights, and it would also have some benefits in terms of software engineering: you are getting a lot of automated ‘chaos engineering’ and unit-test generation and performance benchmarking for free, by distributing a combinatorial number of implementations. It’s almost like a mass fuzzing exercise, where the users provide the fuzz.
You might think this would run into issues with tracking the combinatorial number of binaries, which could take up petabytes if you are distributing, say, a 1GB package to 1 million users, but this has plenty of possible fixes: if you are using reproducible builds, as you ought to, then you only need to track the list of variants chosen for each function, store that per user, and rebuild the exact binary for a given user on demand.* I think a bigger issue is that forcing diversity out of tuned LLMs is quite hard, so you would run into the systematic-error problem at a higher level: all the tuned LLMs, feeding on each other’s outputs & mode-collapsed, will turn in code with the same implicit assumptions & algorithms & bugs, which would mostly defeat the point.
* Similarly, the LLMs are, or should be, deterministic and fixable with a seed. So the overhead here might be something like: if you have a codebase with 10,000 functions, then each time you push out a release (which might happen daily or weekly), you store the RNG seed and the LLM snapshot ID (maybe a kilobyte total), generate 2 versions of each function and randomize per user, and track 10,000 bits or ~1.25kB per user, so if you have a million users that’s just over a gigabyte. Whenever you need to investigate a specific binary because it triggered a crash or something, you just fetch the LLM ID & RNG seed, decode the specific 10,000 function variants they used, and compile. For anyone with millions of users who is serious about security or reliability, a gigabyte of overhead per release is nothing. You already waste that much with random Docker images and crap.
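One way that per-user bookkeeping could look (a sketch; the hash choice, names, and seed value are illustrative, not anything the above commits to): derive each user’s 10,000 variant bits from a per-release seed keyed to the user ID, so they can even be recomputed rather than stored.

```python
# Sketch of the per-user variant bookkeeping: with 2 variants per function, one
# bit per function (~10,000 bits = ~1.25kB) pins down a user's exact build, and
# the bits can be derived deterministically from a per-release seed plus the
# user ID rather than stored. The hash and names here are illustrative choices.
import hashlib

N_FUNCTIONS = 10_000

def variant_bits(release_seed: bytes, user_id: str, n: int = N_FUNCTIONS) -> bytes:
    """Derive n bits (packed into bytes) from a keyed BLAKE2b hash of the user ID."""
    out = bytearray()
    counter = 0
    while len(out) * 8 < n:
        out += hashlib.blake2b(f"{user_id}:{counter}".encode(), key=release_seed).digest()
        counter += 1
    return bytes(out[: (n + 7) // 8])

def variant_for(bits: bytes, function_index: int) -> int:
    """Which of the 2 generated variants of this function the user's build uses (0 or 1)."""
    return (bits[function_index // 8] >> (function_index % 8)) & 1

release_seed = b"2025-01-15-release"      # stored once per release, next to the LLM snapshot ID
bits = variant_bits(release_seed, "user-12345")
print(len(bits), variant_for(bits, 42))   # ~1.25kB of bits; variant used for function #42
```

To reproduce the binary that crashed, you re-derive that user’s bits, select each function’s variant accordingly, and rebuild reproducibly with the recorded LLM snapshot and seed.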
Yup, as @sanxiyn says, this already exists. Their example is, AIUI, a high-end research one; an actually-on-your-laptop-right-now, though admittedly narrower, example is address space layout randomization.
Pithy sayings are lossily compressed.
Yes.
For example: The common saying “Anything worth doing is worth doing [well/poorly]” needs more qualifiers. As it is, the respective opposite advice can often be just as useful, i.e. not very.
Better V1: “The cost/utility ratio of beneficial actions at minimum cost is often less favorable than it would be with greater investment.”
Better V2: “If an action is beneficial, a flawed attempt may be preferable to none at all.”
However, these are too wordy to be pithy, and in pop-culture transmission, accuracy is generally sacrificed in favor of catchiness.