Ricardo Meneghin comments on What I Was Thinking About Before Alignment

Ricardo Meneghin 7 Apr 2022 13:03 UTC
8 points
I don’t think we have hope of developing such tools, at least not in a way that looks like anything we had in the past. In the past we have been able to analyse large systems by throwing away an immense amount of detail—it turns out that you don’t need the specific position of atoms to predict the movement of the planets, and you don’t need the details to predict all of the other things we have successfully predicted with traditional math.
With the systems you are describing, this is simply impossible. Changing a single bit in a computer can change its output completely, so you can’t build a simple abstraction that predicts it, you need to simulate it completely.
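To make the single-bit claim concrete, here is a minimal sketch using a cryptographic hash as a stand-in for "a computer". The hash is my choice of illustration, not something from the post, and it is an extreme case: SHA-256 is deliberately designed so that a one-bit change in the input scrambles the output. The helper names `flip_bit` and `differing_bits` and the sample message are made up for this example.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical input standing in for the state of some program.
message = b"the state of some computer program"
flipped = flip_bit(message, 0)

digest_original = hashlib.sha256(message).digest()
digest_flipped = hashlib.sha256(flipped).digest()

changed = differing_bits(digest_original, digest_flipped)
print(f"input bits changed:  {differing_bits(message, flipped)}")
print(f"output bits changed: {changed} of {len(digest_original) * 8}")
```

Running this should show roughly half of the 256 output bits changing after one input bit is flipped. Not every program is this sensitive, so treat it as an illustration of the worst case rather than a claim about all software.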
We already have a way of taking immense amounts of complicated data and finding patterns in it: machine learning itself. If you want to translate what it learned into human-readable descriptions, you just have to incorporate language into it. Humans, after all, can describe their reasoning steps and why they believe what they believe (though maybe not easily).
> Google throws tremendous amounts of data and computational resources into training neural networks, but decoding the internal models used by those networks? We lack the mathematical tools to even know where to start.
I predict this will be done in the coming years by using large multimodal models to analyse neural network parameters, or to explain their own workings.
- Thomas Kwa 8 Apr 2022 2:11 UTC
  4 points
  Parent
  > Changing a single bit in a computer can change its output completely, so you can’t build a simple abstraction that predicts it, you need to simulate it completely.
  Biology is complex, but changing a single molecule in a bacterium or a single neuron in a brain doesn’t completely change the output, because they have evolved to be robust to such changes.