Can you do better than top-level AI models on these basic vision tests?

Whatever you do, don’t ask the AI how many horizontal lines are in this image.

Getty Images

In the last couple of years, we’ve seen amazing advancements in AI systems when it comes to recognizing and analyzing the contents of complicated images. But a new paper highlights how state-of-the-art “vision language models” (VLMs) often fail at simple, low-level visual analysis tasks that are trivially easy for a human.

In the provocatively titled pre-print paper “Vision language models are blind” (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta create eight simple visual acuity tests with objectively correct answers. These range from identifying how often two colored lines intersect to identifying which letter in a long word has been circled to counting how many nested shapes exist in an image (representative examples and results can be viewed on the research team’s webpage).

If you can solve these kinds of puzzles, you may have better visual reasoning than state-of-the-art AIs.
The puzzles on the right are like something out of Highlights magazine.
A representative sample shows AI models failing at a task that most human children would find trivial.

Crucially, these tests are generated by custom code and don’t rely on pre-existing images or tests that could be found on the public Internet, thereby “minimiz[ing] the chance that VLMs can solve by memorization,” according to the researchers. The tests also “require minimal to zero world knowledge” beyond basic 2D shapes, making it difficult for the answer to be inferred from “textual question and choices alone” (which has been identified as an issue for some other visual AI benchmarks).
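To make the "generated by custom code" point concrete, here is a minimal sketch of how a grid-counting puzzle with a known answer could be produced on the fly. It is not the researchers' code: the function names, the SVG output format, the cell size, and the 3-to-9 size range are all assumptions chosen for illustration.

```python
import random


def make_grid_puzzle(rows, cols, cell=40, margin=10):
    """Return (svg_text, answer) for a blank rows x cols grid.

    Hypothetical helper: the drawing parameters are illustrative only.
    """
    width = cols * cell
    height = rows * cell
    lines = []
    # rows + 1 horizontal rules enclose `rows` rows of cells.
    for r in range(rows + 1):
        y = margin + r * cell
        lines.append(
            f'<line x1="{margin}" y1="{y}" x2="{margin + width}" y2="{y}" stroke="black"/>'
        )
    # cols + 1 vertical rules enclose `cols` columns of cells.
    for c in range(cols + 1):
        x = margin + c * cell
        lines.append(
            f'<line x1="{x}" y1="{margin}" x2="{x}" y2="{margin + height}" stroke="black"/>'
        )
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width + 2 * margin}" height="{height + 2 * margin}">'
        + "".join(lines)
        + "</svg>"
    )
    # The answer is known by construction, so no human labeling is needed.
    return svg, {"rows": rows, "cols": cols}


def random_grid_puzzle(rng, low=3, high=9):
    """Draw a fresh puzzle; the size range is an assumed example."""
    return make_grid_puzzle(rng.randint(low, high), rng.randint(low, high))


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the ground truth."""
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)
```

Because each image is drawn fresh from random parameters, there is no published copy for a model to have memorized, and scoring reduces to an exact comparison against the answer the generator already knows.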

Are you smarter than a fifth grader?

After running multiple tests across four different visual models (GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5), the researchers found all four fell well short of the 100 percent accuracy you might expect for such simple visual analysis tasks (and which most sighted humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for instance, even the best-performing model gave an accurate answer less than 60 percent of the time. On the other hand, Gemini-1.5 Pro hit nearly 93 percent accuracy in identifying circled letters, approaching human-level performance.

For some reason, the models tend to incorrectly guess the “o” is circled a lot more often than all the other letters in this test.
The models performed perfectly in counting five interlocking circles, a pattern they might be familiar with from common images of the Olympic rings.
Do you have an easier time counting columns than rows in a grid? If so, you probably aren’t an AI.

Even small changes to the tasks could lead to huge changes in results. While all four tested models were able to correctly identify five overlapping hollow circles, the accuracy across all models dropped to well below 50 percent when six to nine circles were involved. The researchers hypothesize that this “suggests that VLMs are biased towards the well-known Olympic logo, which has 5 circles.” In other cases, models occasionally hallucinated nonsensical answers, such as guessing “9,” “n,” or “©” as the circled letter in the word “Subdermatoglyphic.”

Overall, the results highlight how AI models that can perform well at high-level visual reasoning have some significant “blind spots” (sorry) when it comes to low-level abstract images. It’s all somewhat reminiscent of similar capability gaps that we often see in state-of-the-art large language models, which can create extremely cogent summaries of lengthy texts while at the same time failing extremely basic math and spelling questions.

These gaps in VLM capabilities could come down to the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using specific images drawn from one of their tasks (the “are two circles touching?” test), that model showed only modest improvement, from 17 percent accuracy up to around 37 percent. “The loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize,” the researchers write.

The researchers propose that the VLM capability gap may be related to the so-called “late fusion” of vision encoders onto pre-trained large language models. An “early fusion” training approach that integrates visual encoding alongside language training could lead to better results on these low-level tasks, the researchers suggest (without providing any sort of analysis of this question).

Source: Can you do better than top-level AI models on these basic vision tests?
