Research
My research interests lie in building better multimodal models!
|
News
May 2026: We are organizing a challenge on Time-Logic at the Second Workshop on Video-Language Models for Long Videos at CVPR 2026!
April 2026: MaskLLaVA and TTA-Vid are now on arXiv!
March 2026: VisualOverload got accepted at CVPR!
September 2024: Started my PhD at the University of Tübingen!
August 2024: Our ICDAR 2024 Competition on Reading Documents Through Aria Glasses was presented by Dr. Ajoy Mondal at ICDAR in Athens, Greece!
June 2024: Our paper "Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos" got accepted at the International Conference on Computer Vision & Image Processing (CVIP)!
April 2024: Organizing a competition on reading documents through Aria glasses with Meta Reality Labs!
March 2024: Successfully defended my MS thesis, titled "Text-based Video Question Answering"!
February 2024: Participated in Google Research Week in Bangalore!
September 2023: Attended International Space Conference in collaboration with ISRO, InSpace, and NSIL!
August 2023: Our paper "Understanding Video Scenes through Text: Insights from Text-based Video Question Answering" got accepted at ICCV Workshops, VLAR!
March 2023: Organized a competition on Video Question Answering on news videos at ICDAR 2023.
February 2023: Two papers got accepted at the CVPR 2023 O-DRUM Workshop.
December 2022: Student volunteer at the ICFHR 2023 conference.
October 2022: Our paper "Watching the News: Towards VideoQA Models that can Read" got accepted at WACV, 2023.
September 2022: I started my internship at Computer Vision Center (CVC), UAB, Barcelona, Spain.
July 2022: Conducted a tutorial on transformers at the CVIT Summer School of AI.
May 2022: First patent, on Single Image Depth Estimation, with SRIB-Bangalore got accepted.
|
|
TTA-Vid: Generalized Test-Time Adaptation for Video Reasoning
Soumya Shamarao Jahagirdar*, Edson Araujo*, Anna Kukleva, M Jehanzeb Mirza, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Rogerio Feris, James R Glass, Hilde Kuehne
arXiv, 2026
paper
This paper introduces TTA-Vid, a test-time reinforcement learning framework for video-language reasoning that adapts pretrained models to new video samples during inference without retraining.
|
|
When LLaVA Meets Objects: Token Composition for Vision-Language Models
Soumya Shamarao Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne
arXiv, 2026
paper
This paper presents Mask-LLaVA, a token-efficient framework for autoregressive vision-language models that combines mask-based object representations with global and local visual features.
|
|
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
Conference on Computer Vision and Pattern Recognition (CVPR), 2026
paper
/ website
/ dataset
/ evaluation server
This paper introduces VisualOverload, a VQA benchmark designed to evaluate fine-grained visual understanding in densely populated scenes.
|
|
Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
Soumya Shamarao Jahagirdar, Jayasree Saha, C. V. Jawahar
International Conference on Computer Vision & Image Processing (CVIP), 2024
paper
/ code
This paper introduces a multimodal dataset of long-form lecture and news videos aimed at advancing video understanding in domains where manual annotation is costly and requires subject expertise.
|
|
Understanding Video Scenes through Text: Insights from Text-based Video Question Answering
Soumya Shamarao Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
International Conference on Computer Vision (ICCV) Workshops, VLAR, 2023
paper
This paper analyzes two video question answering datasets, NewsVideoQA and M4-ViteVQA, which focus on understanding textual content within videos.
|
|
Weakly Supervised Visual Question Answer Generation
Charani Alampalle, Shamanthak Hegde, Soumya Shamarao Jahagirdar, Shankar Gangisetty
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, ODRUM, 2023
paper
We propose a weakly-supervised visual question-answer generation method that generates relevant question-answer pairs for a given input image and its associated caption.
|
|
Making the V in Text-VQA Matter
Shamanthak Hegde, Soumya Shamarao Jahagirdar, Shankar Gangisetty
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, ODRUM, 2023
paper
We propose a method to learn visual features (making the V matter in TextVQA) alongside OCR features and question features, using the VQA dataset as external knowledge for text-based VQA.
|
|
Watching the News: Towards VideoQA Models that can Read
Soumya Shamarao Jahagirdar, Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar
Winter Conference on Applications of Computer Vision (WACV), 2023
paper
/ code
/ website
/ youtube
We propose a novel VideoQA task that requires reading and understanding the text in the video. We focus on news videos and require QA systems to comprehend and answer questions about the scene and text.
|
|
Look, Read and Ask: Learning to Ask Questions by Reading Text in Images
Soumya Shamarao Jahagirdar, Shankar Gangisetty, Anand Mishra
International Conference on Document Analysis and Recognition (ICDAR), 2021
paper
/ code
/ website
/ youtube
We present a novel problem of text-based visual question generation or TextVQG in short. Given the recent growing interest of the document image analysis community in combining text understanding with visual reasoning, this task studies question generation from images containing text.
|
|
DeepDNet: Deep Dense Network for Depth Completion Task
Girish Hegde, Tushar Pharale, Soumya Shamarao Jahagirdar, Vaishakh Nargund, Ramesh Ashok Tabib, Uma Mudenagudi, Basavaraja Vandrotti, Ankit Dhiman
Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, WiCV, 2021
paper
We propose DeepDNet, a Deep Dense Network for the Depth Completion Task, which generates a dense depth map from sparse depth and the captured view. We propose a Dense-Residual-Skip (DRS) autoencoder to estimate scene depth.
|
|
Method and Device of Depth Densification using RGB Image and Sparse Depth
patent
2022-05-05
website
Patent number: WO2022103171A1; Patent office: US; Publication date: 2022-05-19.
Inventors: Uma Suhas Mudenagudi, Girish Dattatray Hegde, Ramesh Ashok Tabib, Soumya Shamarao Jahagirdar, Tushar Irappa Pharale, Basavaraja Shanthappa Vandrotti, Ankit Dhiman, Vaishakh Nargund
|
|