Idea for using current AI to accelerate medical research: suppose you were to take a VLM and train it to verbally explain the differences between two image data distributions. E.g., you could take 100 dog images, split them into two classes, insert tiny rectangles into the class 1 images, feed all 100 images into the VLM, and then train it to generate the text “class 1 has tiny rectangles in the images”. Repeat this for a bunch of different augmented datasets where we know exactly how they differ, aiming for a VLM with a general ability to in-context learn and verbally describe the differences between two sets of images. As training progresses, keep increasing the number and subtlety of the differences, while training the VLM to describe all of them.
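To make the synthetic-data step concrete, here is a minimal sketch of generating one such training example. The rectangle augmentation, split sizes, file layout, and target text are illustrative placeholders, not a fixed recipe:

```python
# Sketch: build one synthetic "describe the difference" training example.
# Assumes a directory of dog images; augmentation and label text are placeholders.
import random
from pathlib import Path
from PIL import Image, ImageDraw

def add_tiny_rectangle(img: Image.Image) -> Image.Image:
    """Insert a small rectangle at a random location (the class-1 augmentation)."""
    img = img.copy()
    draw = ImageDraw.Draw(img)
    w, h = img.size
    x, y = random.randint(0, w - 10), random.randint(0, h - 10)
    draw.rectangle([x, y, x + 8, y + 8], fill="black")
    return img

def make_example(image_dir: str, n_per_class: int = 50):
    paths = random.sample(list(Path(image_dir).glob("*.jpg")), 2 * n_per_class)
    class_0 = [Image.open(p).convert("RGB") for p in paths[:n_per_class]]
    class_1 = [add_tiny_rectangle(Image.open(p).convert("RGB")) for p in paths[n_per_class:]]
    target_text = "class 1 has tiny rectangles in the images"
    # The VLM is trained to map (class_0 images, class_1 images) -> target_text.
    return class_0, class_1, target_text
```

Each augmented dataset like this gives one (image sets, description) pair; the curriculum just swaps in different augmentations and longer, subtler descriptions.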
Then, apply the model to various medical images. E.g., brain scans of people who are about to develop dementia versus those who aren’t, skin photos of malignant and non-malignant blemishes, electron microscope images of cancer cells that can / can’t survive some drug regimen, etc. See if the VLM can describe any new, human-interpretable features.
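Applying the trained model would just mean feeding it two real image sets instead of an augmented pair. A sketch of what that query might look like; `vlm`, its `generate` interface, and the prompt are hypothetical placeholders rather than an existing API:

```python
# Sketch: query the trained difference-describing VLM on two real image sets.
# The vlm object and its generate() signature are hypothetical placeholders.
from pathlib import Path
from PIL import Image

def describe_difference(vlm, dir_a: str, dir_b: str, max_images: int = 50) -> str:
    set_a = [Image.open(p).convert("RGB") for p in sorted(Path(dir_a).glob("*.png"))[:max_images]]
    set_b = [Image.open(p).convert("RGB") for p in sorted(Path(dir_b).glob("*.png"))[:max_images]]
    prompt = "Describe every visual feature that distinguishes set B from set A."
    return vlm.generate(images_a=set_a, images_b=set_b, prompt=prompt)

# e.g., brain scans of people who later do / don't develop dementia:
# candidate_features = describe_difference(vlm, "scans/no_dementia", "scans/pre_dementia")
```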
The VLM would generate a lot of false positives, obviously. But once you know about a possible feature, you can manually check whether it actually distinguishes other examples of the classes you’re interested in. Once you find valid features, you can add them to the VLM’s training data, so it’s no longer trained only on synthetic augmentations.
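That manual check could be partly automated: score each candidate feature on held-out images and see whether it separates the classes better than chance. In this sketch, `feature_is_present` is a placeholder judge (a human rater, a separate classifier, or another VLM query):

```python
# Sketch: check whether a candidate feature actually separates held-out examples.
# feature_is_present is a placeholder judge, not an existing function.
from typing import Callable, Sequence

def validate_feature(
    feature_description: str,
    positives: Sequence,           # held-out images from the class the feature should mark
    negatives: Sequence,           # held-out images from the other class
    feature_is_present: Callable,  # (image, feature_description) -> bool
) -> float:
    """Return balanced accuracy of the feature used as a one-rule classifier."""
    tpr = sum(feature_is_present(img, feature_description) for img in positives) / len(positives)
    tnr = sum(not feature_is_present(img, feature_description) for img in negatives) / len(negatives)
    return (tpr + tnr) / 2

# Features scoring well above 0.5 on fresh data are candidates to add to the training set.
```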
You might have to start with real datasets that are particularly easy to tell apart, in order to jumpstart your VLM’s ability to accurately describe the differences in real data.
The other issue with this proposal is that the difference detection currently happens entirely via in-context learning. This is inefficient and expensive (100 images is a lot to fit in a single context!). Ideally, the VLM would learn the difference between the classes by actually being trained on images from those classes, and would learn to connect the resulting knowledge to language descriptions of the associated differences through some sort of meta-learning setup. Not sure how best to do that, though.
I think the fact that image models struggle with hands and text actually points to convergence between human and NN learning dynamics. Human visual cortices are also bad at hands and text, to the point that lucid dreamers often look for issues with their hands or with nearby text to check whether they’re dreaming.
One thing that I think causes people to underestimate the degree of convergence between brain and NN learning is comparing the behaviors of entire brains to the behaviors of individual NNs. Brains consist of many different regions which are “trained” on different internal objectives, then interact with each other to collectively produce human outputs. In contrast, most current NNs contain only one “region”, which is trained end to end on the single objective of imitating certain subsets of human behavior.
We should thus expect NN learning dynamics to most closely resemble those of single brain regions, and expect that the best match for humanlike generalization patterns will come from putting together multiple NNs that interact with each other in roughly the way human brain regions do.