[Article] Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
Summary: This paper introduces a framework that uses a Vision Transformer 3D (ViT3D) to preserve the three-dimensional structure of brain recordings, which improves visual decoding. It removes the need for subject-specific models and supports high-quality reconstruction from a single experimental trial across different subjects. By integrating with Large Language Models (LLMs), the system can also perform tasks such as brain captioning and complex reasoning in natural language.
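
To make the idea of preserving 3D structure concrete, the following is a minimal sketch of how a volumetric recording could be cut into cubic patches and turned into transformer tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names `patchify_3d` and `embed_tokens`, the 8x8x8 toy volume, the patch size of 4, the embedding width of 16, and the random projection weights are all hypothetical choices made here for demonstration.

```python
import numpy as np


def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Returns an array of shape (num_patches, patch**3), one flattened
    cube per row, so spatially adjacent voxels stay in the same token.
    """
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each dimension must be divisible by the patch size")
    # Separate each axis into (grid index, offset inside the patch).
    blocks = volume.reshape(d // patch, patch, h // patch, patch, w // patch, patch)
    # Move the three grid axes first, then the three within-patch axes.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(-1, patch ** 3)


def embed_tokens(tokens, weight, positions):
    """Linearly project patch tokens and add a per-token position vector."""
    return tokens @ weight + positions


# Toy example (all sizes are assumptions chosen for illustration).
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))      # stand-in for a 3D brain volume
tokens = patchify_3d(volume, patch=4)        # 2 * 2 * 2 = 8 tokens of 64 voxels
weight = rng.standard_normal((64, 16))       # hypothetical projection matrix
positions = rng.standard_normal((8, 16))     # hypothetical position embeddings
embedded = embed_tokens(tokens, weight, positions)
print(tokens.shape, embedded.shape)
```

The resulting token sequence is the kind of input a standard transformer encoder consumes. The contrast with flattening the whole volume into one long vector is that each token here corresponds to a contiguous cube of voxels, so neighborhood information along all three axes is kept.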