Gaslight, gatekeep, V1-V3: early visual cortex alignment shields vision-language models from sycophantic manipulation

Silpasuwanchai, Chaklam

doi:10.48550/arXiv.2604.13803

Source

arXiv

ISSN

2331-8422

Date Issued

2026-04-01

Author(s)

Shah, Arya

Tripathi, Vaibhav

Singh, Mayank

Silpasuwanchai, Chaklam

DOI

10.48550/arXiv.2604.13803

Abstract

Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40× parameter range (256M-10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1-V3) is a reliable negative predictor of sycophancy (r = -0.441, BCa 95% CI [-0.740, -0.031]), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks (r = -0.597, p = 0.040). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on GitHub and our dataset on Hugging Face.
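The model-level statistics named in the abstract (a Pearson correlation across 12 models, a bootstrap confidence interval, and leave-one-out correlations) can be illustrated with a minimal sketch. This is not the authors' released code. The function names and the `alignment` and `sycophancy` values are hypothetical placeholders rather than results from the paper, and a plain percentile bootstrap is used here in place of the BCa interval the abstract reports.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_r(x, y):
    """Recompute r with each model dropped in turn (one value per model)."""
    return [
        pearson_r(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


def percentile_bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r, resampling models with replacement.

    Assumption: this is the simple percentile method, not the
    bias-corrected and accelerated (BCa) interval cited in the abstract.
    """
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        # Skip degenerate resamples where either variable has zero variance.
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue
        stats.append(pearson_r(xs, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical placeholder data for 12 models; NOT values from the paper.
alignment = [0.10, 0.12, 0.15, 0.18, 0.20, 0.22,
             0.25, 0.27, 0.30, 0.32, 0.35, 0.38]
sycophancy = [0.62, 0.55, 0.60, 0.48, 0.58, 0.45,
              0.50, 0.52, 0.40, 0.47, 0.38, 0.44]

r = pearson_r(alignment, sycophancy)
loo = leave_one_out_r(alignment, sycophancy)
ci = percentile_bootstrap_ci(alignment, sycophancy)
print(f"r = {r:.3f}, CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
print("all leave-one-out r negative:", all(v < 0 for v in loo))
```

With only 12 data points, each leave-one-out correlation shows how much a single model drives the overall estimate, which is presumably why the abstract reports the sign of all 12 alongside the interval.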

URI

https://repository.iitgn.ac.in/handle/IITG2025/35128

Subjects

Vision-Language Models

Brain Alignment

Sycophancy

Neural Predictivity

Adversarial Robustness

fMRI