I’ve also been thinking about answer-generator-generators (STV makers, as you’ve termed them) as a potentially important path to explore.
That’s cool to hear :)
I tried to skim through some of your stuff (hope to read more of it later). I assume that far from all of your thoughts have been put into writing, but I like e.g. this excerpt from here:
“Paper 2 is something interesting and different. Imagine you had a powerful complex abstruse optimizer that wrote code to solve problems. Say you had this optimizer create you an agent or set of agents written in code you could understand. You could enforce the constraint that you must be able to fully understand these simpler agents in order to trust them enough to use them in the real world. Once you are happy with the performance of the agents in the test environment, you can copy them, edit them as you see fit, and use them for real world tasks.
The complex ‘black box’ ai itself never gets to directly interact with the world. You may not fully understand all the computation going on in the code creator, but maybe that’s ok if you fully understand the programs it writes. The computer science field already has a lot of well-defined techniques for analyzing untrusted code and validating its safety. There are certainly limits to the efficacy we might maximally expect from code which we could guarantee to be safe, but I think there’s a lot of value we could derive from code below our max-complexity-for-validation threshold. Importantly, it should be doable to determine whether a given set of code can be validated or not. Knowing what we can trust and what we can’t trust would be a big step forward.”
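To make the workflow in that excerpt a bit more concrete, here is a minimal Python sketch of the kind of gate it describes. Every name here (CandidateAgent, can_be_validated, MAX_COMPLEXITY_FOR_VALIDATION, and so on) is invented for illustration; nothing below is taken from the paper itself.

```python
# A minimal sketch (not from the quoted paper) of the gating loop the excerpt
# describes: a black-box optimizer proposes agent code, and only code we can
# fully validate ever touches the real world. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class CandidateAgent:
    source_code: str          # the code the black-box optimizer wrote
    complexity_score: float   # some proxy for how hard the code is to audit

MAX_COMPLEXITY_FOR_VALIDATION = 100.0  # hypothetical threshold

def can_be_validated(agent: CandidateAgent) -> bool:
    """Reject anything above our max-complexity-for-validation threshold."""
    return agent.complexity_score <= MAX_COMPLEXITY_FOR_VALIDATION

def deploy_if_trustworthy(agent: CandidateAgent,
                          passes_test_suite,       # callable: CandidateAgent -> bool
                          human_signoff) -> bool:  # callable: CandidateAgent -> bool
    """Only deploy code we could audit, that passed tests, and that a human approved."""
    if not can_be_validated(agent):
        return False  # too complex to trust; send back to the optimizer
    if not passes_test_suite(agent):
        return False  # fails in the sandboxed test environment
    return human_signoff(agent)  # the black box itself never acts in the world
```

The only point of the sketch is that the gate sits between the optimizer and deployment; how the complexity score, test suite, or sign-off would actually work is left entirely open.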
The more of our notions about programs/code that we can make formal (maybe even computer-checkable?), the better, I’d think. For example, when it comes to interpretability, one question is “how can we make the systems we build ourselves more interpretable?”. But another question is “if we have a superhuman AI-system that can code for us, and we ask it to make other AI-systems that are more interpretable, what might proofs/verifications of various interpretability-properties (that we would accept) look like?”. Could it, for example, be proven/verified that certain sub-routines/modules think about only certain parts/aspects of some domain, or only do a certain type of thinking? If we want to verify not only that an agent doesn’t think about something directly, but also that it doesn’t take something into account indirectly somehow, do we have ideas of what proofs/verifications for this might look like? (And if we think something like that can’t be proven/verified, why?)
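As a very concrete (and very weakened) example of what one such check could look like: below is a toy Python sketch of a behavioural non-dependence probe, checking that a hypothetical loan_decision module’s output doesn’t change when a supposedly-ignored input (postcode) is varied. All names are made up, and this is sampling-based evidence rather than anything like a proof, which is exactly the gap that formal verification would need to close.

```python
# A toy illustration (my own sketch, not a real verification tool) of one very
# weak version of "this module doesn't take X into account": check that the
# module's output is unchanged when the supposedly-ignored field is varied.
# A real guarantee would need static analysis or formal information-flow
# methods; this only probes behaviour on sampled inputs.

def loan_decision(income: float, postcode: str) -> bool:
    """Hypothetical module that is claimed not to use `postcode`."""
    return income > 50_000

def appears_independent_of_postcode(module, incomes, postcodes) -> bool:
    """Return False if any two postcodes change the decision for the same income."""
    for income in incomes:
        decisions = {module(income, pc) for pc in postcodes}
        if len(decisions) > 1:
            return False  # output varied with postcode: dependence detected
    return True

# Usage: probe a handful of inputs; passing this is evidence, not proof.
print(appears_independent_of_postcode(
    loan_decision,
    incomes=[10_000, 60_000, 120_000],
    postcodes=["0001", "9999"],
))
```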
I’ve got a couple of links to work by others that I think is relevant progress towards this. Their ambitions sound rather more modest and short-term, but I think it’s good to have something concrete to look at and discuss.
Thanks, those were interesting :)
In my case, I unfortunately don’t have a deep/detailed/precise technical understanding of neural networks and modern machine learning (although I’m not totally without understanding either). My “abstraction” (not thinking much about the details of AGI internals) is partly because I think that kind of abstract thinking can also be valuable, but also out of necessity. The best, of course, is to be able to do both :)