[Article] Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction

Jan 6, 2026

Summary: This paper introduces a framework that uses a Vision Transformer 3D (ViT3D) to preserve the 3D structure of brain data for better visual decoding. It removes the need for subject-specific models, enabling high-quality reconstruction from a single experimental trial across subjects. By integrating with Large Language Models (LLMs), the system can also perform tasks such as brain captioning and complex reasoning in natural language.
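
To make the idea of "preserving the 3D structure" concrete, the sketch below shows one common way a 3D transformer can tokenize a volume: cutting it into non-overlapping cubic patches, so each token corresponds to a contiguous spatial block rather than a slice of a flattened voxel vector. This is an illustration only, not the authors' implementation. The function name `patchify_3d`, the patch size, and the toy 8x8x8 volume are all assumptions chosen for the example.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch**3), one flattened token per cube.
    """
    d, h, w = volume.shape
    p = patch
    if d % p or h % p or w % p:
        raise ValueError("each dimension must be divisible by the patch size")
    # Split every axis into (grid index, offset inside the patch).
    blocks = volume.reshape(d // p, p, h // p, p, w // p, p)
    # Bring the three grid axes to the front, the three offsets to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    # One row per patch, voxels of that patch flattened in (D, H, W) order.
    return blocks.reshape(-1, p * p * p)

# Toy volume standing in for a preprocessed brain recording (assumption).
volume = np.arange(8 * 8 * 8, dtype=np.float32).reshape(8, 8, 8)
tokens = patchify_3d(volume, patch=4)
print(tokens.shape)  # (8, 64)
```

Each row of `tokens` can then be linearly projected and given a 3D positional encoding before entering a transformer encoder. Because every token is a spatially contiguous cube, neighboring voxels stay together, which is the property a plain 1D flattening would lose.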

Shen, Guobin, et al. “Neuro-vision to language: Enhancing brain recording-based visual reconstruction and language interaction.” Advances in Neural Information Processing Systems 37 (2024): 98083-98110.