SimonBiggs comments on Interpretability/Tool-ness/Alignment/Corrigibility are not Composable

SimonBiggs 25 May 2023 2:02 UTC
3 points
This reminds me of the problems that STPA is trying to solve in safe systems design:
https://psas.scripts.mit.edu/home/get_file.php?name=STPA_handbook.pdf

And, for those who prefer video, here’s a good video intro to STPA:
Their approach is designed to handle complex systems by decomposing the system into parts. However, the system is decomposed not into functions or tasks but into a control structure.
They treat a system as a graph of controllers (internal mesa optimisers, potentially nested) that control processes and then receive feedback (internal loss functions) from those processes. From there, they can work through the system one controller at a time and lay out the ways in which the overall system can become unsafe because of that particular controller.
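To make the per-controller decomposition concrete, here is a minimal sketch rather than an official STPA tool. It assumes a control structure can be stored as controllers with named control actions and feedback channels, and it crosses each control action with the four unsafe-control-action categories from the STPA handbook. The `Controller` class, the `candidate_ucas` helper, and the planner/policy example are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import product

# The four ways a control action can be unsafe, per the STPA handbook.
UCA_TYPES = (
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)


@dataclass
class Controller:
    """One node in the control structure (hypothetical representation)."""
    name: str
    # Maps each controlled process (or lower-level controller) to the
    # control actions this controller can issue to it.
    control_actions: dict[str, list[str]] = field(default_factory=dict)
    # Maps each controlled process to the feedback it sends back up.
    feedback: dict[str, list[str]] = field(default_factory=dict)


def candidate_ucas(controllers):
    """Yield (controller, target, action, uca_type) rows for analyst review.

    These are prompts to consider, not verdicts: an analyst still has to
    decide which rows describe a real hazard in a given context.
    """
    for c in controllers:
        for target, actions in c.control_actions.items():
            for action, uca in product(actions, UCA_TYPES):
                yield (c.name, target, action, uca)


# Hypothetical nested structure: a planner controls a policy,
# and the policy controls the environment.
planner = Controller(
    "planner",
    control_actions={"policy": ["set_goal"]},
    feedback={"policy": ["progress_report"]},
)
policy = Controller(
    "policy",
    control_actions={"environment": ["act", "halt"]},
    feedback={"environment": ["observation"]},
)

rows = list(candidate_ucas([planner, policy]))
for row in rows:
    print(row)
```

Each printed row names one controller and one way its control action could contribute to a hazard, which is the "per controller" view described above. The feedback channels are stored but not analysed here; a fuller analysis would also ask how missing or wrong feedback could lead a controller to issue an unsafe action.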

Wouldn’t it be amazing if one day we could train a neural network whose result could then be verifiably mapped, via mech-int, onto an STPA control structure? We could then potentially have verifiable systems in place that themselves run STPA analyses on yet larger systems, in order to flag potential hazards given a scenario and given its current control structure.
Maybe this could look something like this?