A recent study from Johns Hopkins University has uncovered a significant gap in how artificial intelligence models perceive human social behavior.

07.05.2025

While AI has shown impressive progress in tasks like object recognition, it still falls short when trying to understand the complexities of social interactions, an essential skill for systems like autonomous vehicles and social robots that need to operate in real-world environments.

The researchers found that current AI systems struggle with interpreting subtle cues such as gestures, shared attention, or the intentions behind people’s actions. Leyla Isik, who led the study, noted that for AI to function safely and intuitively around humans, it must recognize what people are likely to do next—whether someone is about to cross the street or simply talking with a friend. At this stage, AI systems are still far from making such distinctions accurately.

In their experiments, human participants watched short video clips and rated the social dynamics within them. These same clips were analyzed by over 350 AI models, including language, video, and image-based systems. While human judgments were largely consistent, the AI predictions varied widely and often failed to match human perception. Language models were slightly better at interpreting human behavior, but video and image models frequently misread the situations or missed the interactions altogether.
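To make the comparison concrete, here is a minimal sketch of one way agreement between human ratings and model predictions could be scored. It is an illustration only, not the study's actual method: the use of Pearson correlation, the rating values, and the names `human_ratings`, `model_scores`, and `pearson` are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical data: each row is one human rater, each column one video clip,
# scored on a made-up 1-5 scale for "how social is this scene".
human_ratings = [
    [1, 2, 4, 5],
    [1, 3, 4, 5],
    [2, 2, 5, 4],
]

# Average across raters to get one human score per clip.
human_mean = [mean(clip) for clip in zip(*human_ratings)]

# Hypothetical per-clip predictions from two made-up models.
model_scores = {
    "model_a": [1.2, 2.1, 4.3, 4.8],
    "model_b": [4.0, 1.0, 2.0, 3.0],
}

# Higher correlation means the model's ranking of clips is closer to the human one.
agreement = {
    name: pearson(scores, human_mean) for name, scores in model_scores.items()
}

for name, r in sorted(agreement.items()):
    print(f"{name}: r = {r:.2f}")
```

In this toy data, `model_a` tracks the averaged human ratings closely while `model_b` does not, which mirrors the kind of spread across models the researchers describe.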

The researchers believe this shortcoming may be rooted in how current AI architectures are designed. Many are modeled after how the brain processes static images, but understanding social interactions requires dynamic, context-sensitive interpretation, something human brains are highly specialized for. As a result, AI tends to miss the ongoing narrative that defines real human engagement.

This research underscores a critical limitation in AI development: recognizing people and objects in a frame is no longer enough. To be truly effective in human-centered environments, AI needs to understand relationships, intentions, and unfolding events. Bridging this gap may require rethinking both how AI is trained and the cognitive functions it’s designed to emulate.