The intuitive idea is to share activations as well as weights, i.e. to have two heads (or, more realistically, one head consulted twice) on top of the same model. There is a fair amount of uncertainty about this kind of “detail”, but I think that for now it’s smaller than the fundamental uncertainty about whether anything in this vague direction will work.
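To make "one head consulted twice" concrete, here is a minimal sketch of what sharing activations could look like. Everything in it is an assumption made for illustration: the names `SharedModel`, `trunk`, and `head`, the idea that the head is distinguished between its two uses by a small `query` vector, and the toy dimensions. It is not meant as the actual proposal, only as one way the wiring might go.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible


def make_matrix(rows, cols):
    # Random Gaussian weights; a stand-in for learned parameters.
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedModel:
    """Hypothetical model: one trunk, one head that can be consulted repeatedly."""

    def __init__(self, d_in, d_hidden, d_query, d_out):
        self.trunk_w = make_matrix(d_hidden, d_in)
        self.head_w = make_matrix(d_out, d_hidden + d_query)
        self.trunk_calls = 0  # counts how often activations are recomputed

    def trunk(self, x):
        # The expensive shared computation; its output is the shared activations.
        self.trunk_calls += 1
        return [math.tanh(a) for a in matvec(self.trunk_w, x)]

    def head(self, activations, query):
        # The same head weights are used every time; only the query differs.
        return matvec(self.head_w, activations + query)


model = SharedModel(d_in=4, d_hidden=8, d_query=2, d_out=3)
x = [0.5, -1.0, 0.25, 2.0]

h = model.trunk(x)                  # activations computed once
out_a = model.head(h, [1.0, 0.0])   # first consultation of the head
out_b = model.head(h, [0.0, 1.0])   # second consultation, same activations
```

Both outputs are read off the same activations `h`, so the two consultations share the trunk's computation as well as its weights. Whether the two uses should be distinguished by a query vector, by separate head weights, or by something else is exactly the kind of "detail" left open above.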