Assessing AlephAlpha's Multimodal Model
The most impressive large language model at the moment is, imho, probably Flamingo, which can answer questions about images. Flamingo is not accessible to the general public; however, the German startup AlephAlpha has a very similar model on offer.
I signed up and received 2 credits for free, which is apparently roughly 40 cents? (The pricing information is far from complete, so it seems one has to find out via trial and error what a specific use case is going to cost.)
I decided to use my free credits to investigate the model a little bit. This is about as unscientific an investigation as it can get. AlephAlpha occasionally logged me out and apparently reset things to the base model, which I didn’t notice immediately, so I can’t even say with certainty which model was used for which prompt.
Anyway, it still gives a flavor of what these models can and cannot do, which I think is valuable information, especially when it comes to the difference from future models.
First I presented it with this picture of 7 hedgehogs playing musical chairs (courtesy of Dalle2 via Swimmer963).
The model luminous-extended had no difficulty discerning the game:
Question: What are these hedgehogs playing?
Answer: Musical chairs.
Maybe it’s not surprising that the prompts of generated pictures can be reverse-engineered with such a model; after all, the training data was probably very similar. However, dalle2 generally had trouble counting (the above was the only picture out of ten with the correct number of hedgehogs and chairs). So I asked:
Question: How many hedgehogs are in that picture?
Answer: Seven.
This surprised me initially. It also correctly counted 5 dogs, but then badly failed at counting 11 dogs (6), 8 dogs (4), and finally 4 cats (3), so ultimately it seems to be not much better at counting than dalle.
Then I tried out this dalle2 mishap (it’s supposed to be an angel and a wild boar).
Building on an example prompt given by AlephAlpha, I let the model complete:
If you see this picture, you really wonder if the artist is crazy.
Ok, makes sense, but I wanted to see whether Luminous understands how strange the picture is, so I asked:
Question: What is weird about this picture?
Answer: The pigs
True, but unsatisfying. Let’s try to zero in a bit:
Question: What is weird about the pigs in this picture?
Answer: They are not real.
Ok, I think we have to ask a very concrete question:
Question: In which way is the left pig different from a real pig?
Answer: It has wings.
Finally. The knowledge is there, but getting it out seems difficult. Which makes sense, because the model hasn’t been trained for question answering sessions about pictures.
Next, I tried something a little more complicated.
Question: What is this?
Answer: A clock face.
This time I kept the first question and answer around:
Question: What is this?
Answer: A clock face.
Question: Where are the clock’s hands?
A: On the clock
Ok, seems like I managed to confuse it. Luminous has probably seen such clocks before and knows they are clocks. But the LM part insists that there should be hands somewhere.
A more concrete check confirmed this:
Question: What is this?
Answer: A clock face.
Question: What color are the clock’s hands and how many are there?
A: Two.
A different approach:
Question: This clock doesn’t have hands, how does it tell the time?
Answer: It has numbers.
Ok, maybe one could lead it slowly to the correct realisation, but time for something else.
Question: What does this picture show?
Answer: Muhammad Ali in the ring with Joe Frazier.
At first I was pretty impressed with that answer; then I realised that, after looking at Frazier-Ali pictures, I had actually switched to the even more iconic Rumble in the Jungle with George Foreman.
It’s an easy prediction to make that these models will probably soon recognise all famous pictures.
One of my pet peeves with dalle2 and dalle mini is that these models don’t know who Prince Valiant is. So the day when I start doing my own Prince Valiant fan-fiction comic still seems somewhat farther away.
This is a picture of a cartoon character.
The name of this cartoon character is “The Viking”
That’s wrong but has some connection to the story, so possibly there is something there.
The name of this cartoon character is Prince Valiant.
Ok, with some help it turns out Luminous is not as uncultured as it seemed.
The new Google model Parti also provided a few interesting images.
A detailed description of the picture above in literary style:
“A koala in a leather jacket sings into the microphone. The background is blurred and there are rocks on it”
From a human point of view, a pretty massive failure, despite getting almost everything else right. Well, at least the continent is correct.
Question: What is in this picture?
Answer: A light bulb with a picture of the Milky Way galaxy inside.
Question: Does this lightbulb function?
Answer: No.
Question: How big is the sailing ship in this picture?
Answer: The sailing ship is the size of a light bulb.
That worked out pretty well. Generally, Luminous doesn’t necessarily mention everything in a picture it is asked to describe, but the answer about the size was quite clever.
Next I checked whether Luminous understands human emotions.
This man is feeling stressed.
Ok, not really.
Finally, I wondered whether Luminous had any concept of human attractiveness. I googled “male model face” and “average guy face” and asked the model for its opinion:
This is Joe:
This is Jack:
Question: Who is more handsome, Joe or Jack?
Answer: Joe
Maybe the true appeal of the AI girlfriend will turn out to be that she literally can’t tell that you are not hotter than male model face.
Nice post. I generally recommend looking at the model probabilities or taking multiple samples when evaluating a model. For example, does the model give the answer “Joe” a probability of 99% or closer to 50%?
I agree. But I wanted to avoid burning through the credits too quickly. I also wonder whether Joe and Jack would be more realistically assessed with a few-shot prompt.
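For what it’s worth, here is a minimal sketch of how one could approximate that probability by repeated sampling when the interface doesn’t expose log-probabilities. The `complete()` helper, the file name, and the prompt format are hypothetical placeholders, not AlephAlpha’s actual API, and of course each sample costs credits.

```python
from collections import Counter

# Hypothetical helper: sends the image plus the question prompt to the
# multimodal model and returns one sampled completion string. This is an
# assumption -- swap in the real client call of whichever API is used.
def complete(image_path: str, prompt: str, temperature: float = 1.0) -> str:
    raise NotImplementedError("plug in the actual multimodal API call here")

def answer_distribution(image_path: str, prompt: str, n_samples: int = 20) -> Counter:
    """Sample the model several times and count which answers come back."""
    counts = Counter()
    for _ in range(n_samples):
        answer = complete(image_path, prompt).strip().rstrip(".")
        counts[answer] += 1
    return counts

# Example usage (hypothetical file name and prompt; costs n_samples completions):
# dist = answer_distribution(
#     "joe_and_jack.png",
#     "Question: Who is more handsome, Joe or Jack?\nAnswer:",
# )
# print(dist.most_common())  # e.g. a near 50/50 split would look like guessing
```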
that dude looks pretty stressed out about his confusion to me
Interesting tests, and thanks for sharing. One question: using the model to answer questions without context looks to me like just checking whether the learned knowledge needed to answer your question is there. That’s kind of a knowledge-machine approach, which none of these models are built as, and it therefore comes down to just a scaling question imho: more training data means more knowledge means more questions being answered. I would be interested: did you try other approaches like few-shot prompting in order to probe the learned conceptual and contextual understanding?
No, I haven’t tried few-shot prompting yet because it uses many more tokens, i.e. credits.
But I also don’t think that few-shot prompting reveals a different kind of understanding. The main advantage seems to be that you can point more robustly towards what you want.
But maybe you can give an example of what you have in mind.