The “dark horse” of AI i.e. Apple has started to show its capabilities with MM1 (a family of multimodal models of upto 30B params) trained on synthetic data generated from GPT-4V. The quite interesting bit is the advocacy of different training techniques; both MoE and dense variants, using diverse data mixtures.
From the paper:
It finds image resolution, model size, and pre-training data richness crucial for image encoders, whereas vision-language connector architecture has a minimal impact.
The details are quite neat and too specific for a company like Apple known for being less open as Jim Fan noted compared to the others which is pretty amazing. I think this is just the start. I am convinced they have more in store considering the research they have been putting out.
The “dark horse” of AI i.e. Apple has started to show its capabilities with MM1 (a family of multimodal models of upto 30B params) trained on synthetic data generated from GPT-4V. The quite interesting bit is the advocacy of different training techniques; both MoE and dense variants, using diverse data mixtures.
From the paper:
The details are quite neat and too specific for a company like Apple known for being less open as Jim Fan noted compared to the others which is pretty amazing. I think this is just the start. I am convinced they have more in store considering the research they have been putting out.