Reasoning in Vision-Language Models
Evaluated the reasoning ability of VLMs on visual abductive tasks, using a custom dataset that spans multiple visual styles and prompt phrasings.
Project Details
This project investigated the reasoning capacity of modern Vision-Language Models (VLMs) by presenting them with abductive visual inference tasks. A custom dataset was created featuring a range of scenarios and visual styles, including sketches, anime, paper cuts, and photo-realism.
Prompt engineering played a central role in evaluating how different phrasings affected the models' outputs. The analysis revealed a significant bias in most models toward selecting the first hypothesis presented, regardless of its plausibility, highlighting weaknesses in both reasoning and response consistency. The work offered insights into the limitations of current VLMs and pointed to directions for improvement.
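To illustrate the kind of prompt-order probe that can surface such a first-option bias, the sketch below shuffles the order of candidate hypotheses across repeated queries and counts which position the model selects. This is a minimal illustration rather than the project's actual evaluation code: the `query_vlm` helper, the prompt wording, and the example hypotheses are placeholders (here the helper simulates a model that always picks option 1, so the output shows what a positional skew looks like).

```python
import random
from collections import Counter

def query_vlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the real VLM call. Here it simulates a model
    that always answers '1', i.e. the first hypothesis listed, so the probe
    below demonstrates what a first-position bias looks like."""
    return "1"

def probe_position_bias(image_path: str, hypotheses: list[str], trials: int = 20) -> Counter:
    """Shuffle the candidate hypotheses across repeated prompts and count which
    *position* the model picks. A content-driven model should spread its choices
    roughly uniformly over positions (the plausible hypothesis lands in a random
    slot each trial); a position-biased model concentrates its answers on '1'."""
    counts = Counter()
    for _ in range(trials):
        order = random.sample(hypotheses, k=len(hypotheses))
        options = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(order))
        prompt = (
            "Given the image, which explanation is most plausible? "
            "Answer with the option number only.\n" + options
        )
        answer = query_vlm(image_path, prompt).strip()
        counts[answer if answer.isdigit() else "unparsed"] += 1
    return counts

if __name__ == "__main__":
    # Example hypotheses (illustrative only, not from the project's dataset).
    hyps = [
        "The glass fell off the table and shattered.",
        "The glass was placed there after being broken elsewhere.",
        "The glass melted in the heat.",
    ]
    print(probe_position_bias("scene.jpg", hyps))  # e.g. Counter({'1': 20})
```

Running the same probe against a real model, with the hypothesis order randomized per trial, makes it straightforward to compare how often each model's choice tracks the content versus the listing order.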
Poster summarizing the VLM reasoning project.