Humans Outperform AI In Situational Awareness

Hyderabad: Despite their sophistication, the world’s most advanced artificial intelligence (AI) models have fallen far short of human understanding when analysing short video clips.

According to a study presented at the prestigious Conference on Computer Vision and Pattern Recognition (CVPR 2025) by researchers from IIIT-Hyderabad, these state-of-the-art video-language models correctly interpreted short clips less than half the time, whereas humans achieved over 90 per cent accuracy when answering basic questions such as who is doing what and when.

The institute’s research at the conference spanned road safety, sign language, image generation and sketch-based communication. One paper evaluated leading video-language models, such as GPT-4o and Gemini-1.5-Pro, on their ability to understand short video clips with human-level reasoning.

“Despite all the sophistication, these models fail at fundamental visual reasoning tasks,” said Darshana Saravanan, who presented the study, VELOCITI, at the conference held in Nashville last month. The research highlighted the gap between flashy generative capabilities and true understanding of dynamic scenes.

“People were adding missing context, local knowledge, visual details the camera didn’t catch, expert takes. It’s a goldmine for AI training,” said Prof Ravikiran Sarvadevabhatla, who led the work.

Prof Sarvadevabhatla and Mohd Hozaifa Khan explored the complexities of human interaction through Sketchtopia, a Pictionary-inspired dataset that uses sketches and asynchronous feedback to model multi-modal communication.

“Everyone at CVPR was excited about the idea of benchmarking AI’s ability to hold a conversation not just through words, but through visuals and interaction,” said Khan.

Another standout paper, TIDE, tackled a core issue in generalisation: what happens when an AI trained on one type of image (say, indoor photos) encounters sketches, cartoons or new environments?

The research proposed zooming in on ‘concept parts’ like the legs of a chair or beak of a bird, and showed that this part-based learning outperformed traditional models by a 12 per cent margin.

A paper titled PALS, which won Best Paper at a workshop on visual categorisation, introduced a method to train image classifiers even when data labelling is noisy or incomplete, a scenario common in fields like wildlife conservation. “We trained the model to learn from uncertain guesses, which then improved over time,” said Saravanan, who co-authored the work.

Notably, some of the most promising papers came from undergraduates. One explored how to generate images with multiple objects in precise orientations. Another proposed a better method to translate sign language videos by incorporating the surrounding visual and textual context. Both works addressed gaps in model control and accessibility, pushing the boundaries of how AI understands and generates human communication.

While the research at CVPR 2025 showed that progress is being made, it also made clear that AI systems, especially those mimicking human vision and language, are still prone to confusion, bias and brittleness. The proposed benchmarks, datasets and algorithms aim to create tougher tests, better learning setups and, ultimately, more reliable systems in the wild.

(Source: Deccan Chronicle)