Any network big enough to be interesting is big enough that the programmers don’t have the time to write decorative labels. If you had some algorithm that magically produced a Bayes net with a billion intermediate nodes that accurately did some task, it would still be an obvious black box: no one will have come up with a list of a billion decorative labels.
You could theoretically finagle a language model to produce them.
That is pretty much how text-to-image and multimodal language models work now, really. You get a giant inscrutable vector embedding from one modality’s model, and another giant inscrutable model spits out the corresponding output in a different modality. It seems to work well for all the usual tasks like captioning or question-answering...
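To make the captioning case concrete, here is a minimal sketch with off-the-shelf parts (the BLIP checkpoint, the `photo.jpg` path, and the shapes in the comments are my own illustrative assumptions, not anything claimed above): a vision encoder maps the image to a big unlabeled tensor, and a separate text decoder turns that tensor into words.

```python
# Minimal captioning sketch with a BLIP-style model.
# Assumes `transformers`, `torch`, and `Pillow` are installed; the checkpoint
# and image path are just for illustration.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("photo.jpg").convert("RGB")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")

# Step 1: the vision encoder maps the image to a big tensor of activations.
# None of its entries carry labels; it is just numbers.
with torch.no_grad():
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
print(vision_out.last_hidden_state.shape)  # roughly (batch, num_patches + 1, hidden_dim)

# Step 2: a separate text decoder, conditioned on that tensor, emits a caption.
with torch.no_grad():
    caption_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```

The only thing passed between the two halves is that unlabeled activation tensor, which is the "giant inscrutable vector embedding" in question.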