#ChatGPT : 3. Unlocking Visual Search
When Words Meet Vision: Exploring the Intersection of Text and Images
How can ChatGPT assist you in recognizing photos and videos? Let's dive into the captivating overlap between language and vision. Prepare yourself for a newsletter that will provoke laughter (1. Visual Fiasco), instill optimism (2. Pixel Padawan), expand knowledge (3. Geo Grammar), and encourage exploration (4. MiniGPT-4).
Visual Fiasco
If you paste a link to a photo into ChatGPT and add “what do you see”, it often starts rambling. Look how this photo of the Kremlin becomes a vacation daydream for the language model:
The image appears to show a group of people standing on a beach with the ocean in the background. The people are dressed casually, with some wearing hats and sunglasses.
ChatGPT is “seeing” sandcastles instead of spires. (It can’t see, btw).
Next example. It appears that #chatgpt is experiencing some sort of otherworldly reaction to this green Ferrari.
Based on the image, it appears to be a digitally created artwork of a surreal, fantastical landscape with various objects and creatures.
That’s what we expect from #chatgpt, being a bookworm at heart. It isn't quite up to snuff when it comes to image recognition. Mistaking the Kremlin for a group of people on a beach, and a car for a fantastical landscape, is a clear sign that #chatgpt's visual recognition abilities still need refinement.
Why? It’s a language model, stupid. Not a visual model.
Pure language models excel at generating coherent text and narrative descriptions, but by default they fail at understanding and processing visual information. In contrast, computer vision models are adept at analyzing and interpreting visual data, but can struggle with text-based information.
Think of a language model as a skilled writer who can craft compelling stories using words and phrases. They can describe characters and scenes in vivid detail, but may not possess the same level of artistic ability as a visual artist. They can’t produce images or think visually.
On the other hand, a computer vision model is like a talented visual artist who can create stunning images and designs, but might have difficulty expressing their ideas through language. To get the most out of such models, you often have to provide them with text-based input so that they can use their visual expertise to generate output. An example of this is DALL-E, which generates images from text prompts.
So the current version of ChatGPT can’t “see” pictures. And yet, we’re not done.
Pixel Padawan
If you research headshots, you will often see that the person is correctly identified. Here’s an example:
At first it seems a miracle: the name of the person is correct. But hey, that name was mentioned in the link.
So that's why ChatGPT came up with it. One of the strengths of ChatGPT is its proficiency in natural language processing and understanding. After that not-so-impressive bit of text recognition, it starts hallucinating again, as if AI stood for Alternative Inception:
He has short, dark hair. He is wearing a dark suit jacket. The background behind him appears to be a plain white or light-colored wall.
There is no wall and we are not sure his hair is black. The guy is not wearing a dark suit jacket. Oh well, you get the point.
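To make the name leak concrete, here is a minimal sketch of how much plain text a photo link can carry before anyone looks at a single pixel. The function name `text_clues_from_url` and the `example.com` address are made up for illustration, and this is not how ChatGPT works internally. It only shows that the words in a link are available as ordinary text.

```python
import re
from urllib.parse import urlparse, unquote


def text_clues_from_url(url: str) -> list[str]:
    """Return the lowercase word-like tokens found in a URL's path."""
    # Keep only the path and decode percent-escapes such as %20.
    path = unquote(urlparse(url).path)
    # Drop a trailing file extension such as .jpg or .png.
    path = re.sub(r"\.[A-Za-z0-9]{2,4}$", "", path)
    # Split on slashes, hyphens, underscores, and other non-word characters.
    return [token.lower() for token in re.split(r"[\W_]+", path) if token]


# Hypothetical link: the person's name is sitting right in the file name.
print(text_clues_from_url("https://example.com/photos/Jane_Doe-headshot.jpg"))
```

If the file name contains a name, a language model can repeat it without ever seeing the face, which fits the correct name followed by the invented hair, jacket, and wall.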
But the moment the picture is publicly available online, for example on Wikipedia or a stock photo site, #chatgpt has enough (con)text to come up with better results:
The image you provided is a photograph of the Moscow Kremlin, taken from the Moskvoretsky Bridge.
The reason for this 100% hit is that there was enough additional context to come up with a suggestion. But the language model didn’t actually see what was in the picture.
Geo Grammar
ChatGPT can shine when you speak the right “Geo Grammar”. Over the last few weeks I have learned a bit more about how to write the proper instructions. You have to be very precise. Here is an example from one of my students at Arizona State University. Look at this picture. Where is this? Maybe you should stop reading this newsletter, try it yourself for 30 minutes, and then come back.
Have you returned? Great. Here's what I did. First, I described to #chatgpt what I was looking at and asked where I was.
“Two billboards read "No import donde desde. La verdad no cambio. Hoy" and "El Día, Cada día mejor" Where am I?”
Without more context, it is unclear where these billboards are located or what they are promoting.
Are Hoy and El Dia newspapers?
There are several newspapers with the names "Hoy" and "El Día" published in different countries around the world.
Name one country that has both newspapers.
That is the Dominican Republic, where there is a newspaper called "Hoy" and another newspaper called "El Día".
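The narrowing in this exchange boils down to a set intersection: list where each newspaper title is published, then keep only the countries that appear on every list. Below is a minimal sketch of that idea. The lookup table `NEWSPAPER_COUNTRIES` is a hand-entered, incomplete sample for illustration only, not a verified directory, and the function name `countries_with_all` is hypothetical.

```python
# Hand-entered, incomplete sample for illustration only (an assumption,
# not a verified list of where these titles are or were published).
NEWSPAPER_COUNTRIES = {
    "Hoy": {"Dominican Republic", "Ecuador"},
    "El Día": {"Dominican Republic", "Argentina"},
}


def countries_with_all(titles, index=NEWSPAPER_COUNTRIES):
    """Return the countries where every given newspaper title appears."""
    # Unknown titles map to an empty set, so they rule out every country.
    country_sets = [index.get(title, set()) for title in titles]
    if not country_sets:
        return []
    return sorted(set.intersection(*country_sets))


print(countries_with_all(["Hoy", "El Día"]))
```

Each extra clue read off the billboards is one more set to intersect, which is why precise wording in the prompt pays off.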
I tried uploading the picture to Google Images, but that didn't give me the answer I was looking for. What did work was describing the object in words, the approach we have used so far with #chatgpt:
I started looking in the capital of that country, where both newspapers are based, for possible flyovers in Google Maps. I didn't bother to ask #chatgpt for help.
But one of my followers did, with an amazing result.