This is a good idea and it already works; it is just that AI is wholly unnecessary. Have a look at the 2018 post Protecting Applications with Automated Software Diversity.
Our strategy is for variants to preserve well-defined behavior in the application but introduce diversity in the effect of undefined behavior (such as out-of-bounds accesses).
This Galois work is a lot narrower and targeted at low-level details irrelevant to most code, which thankfully is now written in non-C languages, where out-of-bounds accesses don’t pwn your machine, undefined behavior does not summon nasal demons, and stuff like ASLR is largely irrelevant.
So AI is wholly necessary for most of the value of such a metamorphic code idea.
And yeah, I think it’s a pretty decent idea: with cheap enough LLMs, you can harden applications by sampling possible implementations which pass all unit tests, and whose final combination passes all end-to-end or integration tests. You can already do a bit of this to check things, LLMs being so cheap. (Last night, Achmiz asked a Markov chain question and I was too lazy to try to figure it out myself, so I had ChatGPT solve it 3 ways in R: Monte Carlo, solving the transition matrix, and deriving an exact closed-form probability. The answer could be wrong, but that seems unlikely when all three agree. If I wanted to write it up, I’d also have Claude solve it independently in Python so I could cross-check all 6 versions...)
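A minimal sketch of that sampling loop, assuming a deterministic LLM call (stubbed out as `generate_variant` below) and a per-function unit-test suite; every name here is illustrative rather than anything from an actual tool:

```python
import random

# Illustrative only: `generate_variant` stands in for whatever LLM call you use,
# and `unit_tests` maps each function name to a callable that exercises one
# candidate implementation in isolation.

def generate_variant(spec: str, seed: int) -> str:
    """Ask an LLM (deterministically, given `seed`) for one implementation of
    the function described by `spec`. Stubbed out here."""
    raise NotImplementedError("plug in your LLM client of choice")

def passes(test, source: str) -> bool:
    """Compile one candidate implementation and run its unit test against it."""
    namespace: dict = {}
    try:
        exec(source, namespace)   # define the candidate function(s)
        test(namespace)           # the test is expected to raise on failure
        return True
    except Exception:
        return False

def sample_pool(specs: dict, unit_tests: dict, n: int = 4) -> dict:
    """For each function, keep every sampled variant that passes its unit tests."""
    pool = {}
    for name, spec in specs.items():
        survivors = []
        for seed in range(n):
            src = generate_variant(spec, seed)
            if passes(unit_tests[name], src):
                survivors.append(src)
        if not survivors:
            raise RuntimeError(f"no variant of {name} survived its unit tests")
        pool[name] = survivors
    return pool

def assemble_for_user(pool: dict, user_seed: int) -> dict:
    """Pick one surviving variant per function for a given user; the chosen
    combination still has to pass the end-to-end/integration suite before shipping."""
    rng = random.Random(user_seed)
    return {name: rng.choice(variants) for name, variants in pool.items()}
```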
This would help avoid a decent number of logic bugs and oversights, and it would also have some benefits in terms of software engineering: you are getting a lot of automated ‘chaos engineering’ and unit-test generation and performance benchmarking for free, by distributing a combinatorial number of implementations. It’s almost like a mass fuzzing exercise, where the users provide the fuzz.
You might think this would run into issues with tracking the combinatorial number of binaries, which could take up petabytes if you are distributing, say, a 1GB package to 1 million users, but this has plenty of possible fixes: if you are using reproducible builds, as you ought to, then you only need to track a list of the variants for each function and store that per user, and then you can rebuild the exact binary for a given user on-demand.* I think a bigger issue is that forcing diversity out of tuned LLMs is quite hard, and so you would run into the systematic error problem at a higher level: all the tuned LLMs, feeding on each other’s outputs & mode-collapsed, will turn in code with the same implicit assumptions & algorithms & bugs, which would mostly defeat the point.
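Under the reproducible-builds assumption, the bookkeeping can be as small as a per-user manifest; a sketch (all names invented for illustration), with the footnote below covering where the variants themselves are regenerated from:

```python
import hashlib
import json

# Illustrative bookkeeping only. `variant_store` maps (function name, variant id)
# to that variant's source text; in practice it would be regenerated from the
# LLM snapshot & seed (see the footnote) or kept in a content-addressed store.

def manifest_for_user(choices: dict[str, int]) -> str:
    """The only per-user state: which variant of each function this user got."""
    return json.dumps(choices, sort_keys=True)

def build_id(manifest: str) -> str:
    """Stable identifier to stamp into the binary and attach to crash reports."""
    return hashlib.sha256(manifest.encode()).hexdigest()[:16]

def rebuild_source(manifest: str, variant_store: dict[tuple[str, int], str]) -> str:
    """Reassemble the exact source this user was shipped; with reproducible
    builds, compiling it reproduces their binary bit-for-bit."""
    choices = json.loads(manifest)
    return "\n\n".join(variant_store[(name, vid)]
                       for name, vid in sorted(choices.items()))
```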
* Similarly, the LLMs are, or should be, deterministic given a fixed seed. So the overhead here might be something like: if you have a codebase with 10,000 functions, each time you push out a release—which might happen daily or weekly—you store the RNG seed & the LLM snapshot ID (maybe a kilobyte total), generate 2 versions of each function and randomize per user, and track 10,000 bits or ~1.25KB per user, so if you have a million users, that’s just a gigabyte or so. Whenever you need to investigate a specific binary because it triggered a crash or something, you just fetch the LLM snapshot ID & RNG seed, regenerate and decode the specific 10,000 function variants they used, and compile. For anyone with millions of users who is serious about security or reliability, a gigabyte of overhead per release is nothing. You already waste that much with random Docker images and crap.
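To make the footnote’s arithmetic explicit (same numbers as above; the bit-packing is just the obvious encoding, not something prescribed here):

```python
# With 2 variants per function, a user's build is one bit per function.
functions = 10_000
users = 1_000_000

bytes_per_user = functions / 8               # 10,000 bits = 1.25KB per user
total_gb = bytes_per_user * users / 1e9      # ~1.25GB per release across all users
print(f"{bytes_per_user:.0f} bytes per user, {total_gb:.2f} GB total")

# Reconstruction path for a crash report: the release metadata (LLM snapshot ID
# + RNG seed) regenerates both candidates of every function; the user's packed
# bitstring then selects the one they actually ran.
def decode_choices(bits: bytes, n_functions: int) -> list[int]:
    """Unpack the per-user bitstring into a 0/1 variant choice per function."""
    return [(bits[i // 8] >> (i % 8)) & 1 for i in range(n_functions)]
```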