This is basically how I view the DeepMind Flamingo model training to have operated, where a few stitching layers learn to translate the outputs of a frozen vision encoder into “subroutine calls” into the frozen language model, such that visual concept circuits ping their corresponding text token output circuits.
This is basically how I view the DeepMind Flamingo model training to have operated, where a few stitching layers learn to translate the outputs of a frozen vision encoder into “subroutine calls” into the frozen language model, such that visual concept circuits ping their corresponding text token output circuits.