Soumya Shamarao Jahagirdar

I am a PhD student at the University of Tübingen, advised by Prof. Hilde Kuehne. I work on multimodal learning, especially efficient vision-language models, video understanding, and multimodal reasoning. I am also part of the MIT-IBM Watson AI Sight and Sound Project.

I completed my Master's by Research at IIIT Hyderabad, India, under the guidance of Prof. C V Jawahar and Prof. Dimosthenis Karatzas from the Computer Vision Center (CVC), UAB, Spain. My thesis centered on understanding and combining information in videos through visual and textual modalities for question answering. Previously, during my undergraduate research, I worked with Prof. Shankar Gangisetty from KLE Technological University and Prof. Anand Mishra on text-based multimodal learning, specifically on using scene text in images for Text-based Visual Question Generation. I also worked with Prof. Uma Mudenagudi and Samsung R&D Institute India-Bangalore on Depth Estimation and Densification. During my undergrad, I also worked as a Research Assistant with Prof. B A Patil at Think & Ink Education and Research Foundation.

Email / GitHub / Google Scholar / LinkedIn / CV / Twitter

Research

My research interests lie in building better multimodal models!

News

May 2026: We are organizing a challenge on Time-Logic at the Second Workshop on VideoLLMs at CVPR 2026!

April 2026: Mask-LLaVA and TTA-Vid are now on arXiv!

March 2026: VisualOverload got accepted at CVPR!

September 2024: Started my PhD at the University of Tübingen!

August 2024: Our competition, the ICDAR 2024 Competition on Reading Documents Through Aria Glasses, was presented by Dr. Ajoy Mondal at ICDAR in Athens, Greece!

June 2024: Our paper "Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos" got accepted at the International Conference on Computer Vision & Image Processing (CVIP)!

April 2024: Organizing a competition on Reading Documents Through Aria Glasses with Meta Reality Labs!

March 2024: Successfully defended my MS thesis, titled "Text-based Video Question Answering"!

February 2024: Participated in Google Research Week in Bangalore!

September 2023: Attended the International Space Conference, held in collaboration with ISRO, InSpace, and NSIL!

August 2023: Our paper "Understanding Video Scenes through Text: Insights from Text-based Video Question Answering" got accepted at the ICCV Workshop (VLAR) 2023 (Spotlight).

March 2023: Organized a competition on Video Question Answering on News Videos at ICDAR 2023.

February 2023: Two papers got accepted at the CVPR 2023 O-DRUM Workshop.

December 2022: Student volunteer at the ICFHR 2022 conference.

October 2022: Our paper "Watching the News: Towards VideoQA Models that can Read" got accepted at WACV, 2023.

September 2022: I started my internship at the Computer Vision Center (CVC), UAB, Barcelona, Spain.

July 2022: Conducted a tutorial on transformers at the Summer School of AI, CVIT.

May 2022: My first patent, on Single Image Depth Estimation with SRIB-Bangalore, got accepted.

Publications

TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
Soumya Shamarao Jahagirdar, Edson Araujo, Anna Kukleva, M Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R Glass, Hilde Kuehne
arXiv, 2026
paper
This paper introduces TTA-Vid, a test-time reinforcement learning framework for video-language reasoning that adapts pretrained models to new video samples during inference without requiring labeled data. The method combines step-by-step reasoning over multiple frame subsets with a frequency-based reward mechanism and an adaptive frame selection strategy to improve temporal and multimodal understanding. Experiments show that TTA-Vid consistently improves performance across video reasoning tasks and can outperform state-of-the-art supervised approaches while requiring only test-time adaptation.

When LLaVA Meets Objects: Token Composition for Vision-Language-Models
Soumya Shamarao Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
arXiv, 2026
paper
This paper presents Mask-LLaVA, a token-efficient framework for autoregressive vision-language models that combines mask-based object representations with global and local visual features to create compact image representations. The approach enables dynamic reduction of visual tokens during inference without retraining and with minimal performance degradation. Experiments on standard benchmarks show competitive results compared to existing efficient VLM methods while using substantially fewer visual tokens.

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
arXiv, 2026
paper / website / dataset / Evaluation Server
This paper introduces VisualOverload, a VQA benchmark designed to evaluate fine-grained visual understanding in densely populated scenes. Using high-resolution paintings annotated with diverse question types, the benchmark reveals that even state-of-the-art vision-language models struggle with basic perception tasks such as counting, OCR, and logical reasoning under visual complexity. The results suggest that current VLM benchmarks may overestimate model capabilities and highlight significant limitations in detailed scene understanding.

Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Shamarao Jahagirdar, Jayasree Saha, C. V. Jawahar
International Conference on Computer Vision & Image Processing (CVIP) Workshops, VLAR, 2024
paper / code
This paper introduces a multimodal dataset of long-form lecture and news videos aimed at advancing video understanding in domains where manual annotation is costly and requires subject expertise. The work explores the use of Large Language Models together with ASR and OCR signals to automatically capture informative video content and evaluates baseline methods to identify current limitations. The authors further motivate the need for improved prompt engineering techniques for comprehensive understanding of long-form multimodal videos.

Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Soumya Shamarao Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
International Conference on Computer Vision (ICCV) Workshops, VLAR, 2023
paper
This paper analyzes two video question answering datasets, NewsVideoQA and M4-ViteVQA, which focus on understanding textual content within videos. The authors examine the extent to which these datasets require genuine visual and temporal reasoning and show that a text-only BERT-QA model achieves performance comparable to multimodal approaches, revealing limitations in dataset design. They also study cross-dataset domain adaptation, highlighting both the challenges and potential benefits of out-of-domain training for video QA tasks.

Weakly Supervised Visual Question Answer Generation
Charani Alampalle, Shamanthak Hegde, Soumya Shamarao Jahagirdar, Shankar Gangisetty
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, O-DRUM, 2023
paper
We propose a weakly supervised visual question answer generation method that generates relevant question-answer pairs for a given input image and its associated caption.

Making the V in Text-VQA Matter
Shamanthak Hegde, Soumya Shamarao Jahagirdar, Shankar Gangisetty
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, O-DRUM, 2023
paper
We propose a method to learn visual features (making V matter in TextVQA) along with the OCR features and question features, using a VQA dataset as external knowledge for Text-based VQA.

Watching the News: Towards VideoQA Models that can Read
Soumya Shamarao Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
Winter Conference on Applications of Computer Vision (WACV), 2023
paper / code / website / youtube
We propose a novel VideoQA task that requires reading and understanding the text in the video. We focus on news videos and require QA systems to comprehend and answer questions about the topics presented by combining visual and textual cues in the video. We introduce the “NewsVideoQA” dataset that comprises more than 8,600 QA pairs on 3,000+ news videos obtained from diverse news channels from around the world.

Look, Read and Ask: Learning to Ask Questions by Reading Text in Images
Soumya Shamarao Jahagirdar, Shankar Gangisetty, Anand Mishra
International Conference on Document Analysis and Recognition (ICDAR), 2021
paper / code / website / youtube
We present a novel problem of text-based visual question generation, or TextVQG in short. Given the recent growing interest of the document image analysis community in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and a piece of text automatically extracted from it (also known as an OCR token), such that the OCR token is an answer to the generated question.

DeepDNet: Deep Dense Network for Depth Completion Task
Girish Hegde, Tushar Pharale, Soumya Shamarao Jahagirdar, Vaishakh Nargund, Ramesh Ashok Tabib, Uma Mudenagudi, Basavaraja Vandrotti, Ankit Dhiman
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, WiCV, 2021
paper
We propose a Deep Dense Network for Depth Completion Task (DeepDNet) for generating a dense depth map from sparse depth and the captured view. We propose a Dense-Residual-Skip (DRS) Autoencoder, with attention to edge preservation through a Gradient Aware Mean Squared Error (GAMSE) loss.

Patents

Method and Device of Depth Densification using RGB Image and Sparse Depth

patent
2022-05-05
website

PATENT NUMBER: WO2022103171A1; PATENT OFFICE: US; PUBLICATION DATE: 2022/05/19; Inventors: Suhas MUDENAGUDI Uma, HEGDE Girish Dattatray, Tabib Ramesh Ashok, JAHAGIRDAR Soumya Shamarao, PHARALE Tushar Irappa, Vandrotti Basavaraja Shanthappa, Dhiman Ankit, NARGUND Vaishakh

Design and source code from Jon Barron's website
