I am a Research Scientist working on multimodal foundation models, with a focus on vision-language learning, multimodal post-training, and retrieval-augmented reasoning. My research aims to build scalable multimodal systems that are more grounded, factual, and capable of complex image-text understanding.
I hold a Ph.D. in Artificial Intelligence from the University of Modena and Reggio Emilia, where I worked at AImageLab on scaling knowledge-grounded vision-language systems, spanning image captioning, visual reasoning, and large multimodal language models.
My work has been published at top-tier venues including CVPR, ACL, ICLR, and BMVC. I previously conducted industrial research at Amazon AGI (Nova), and I am currently building multimodal and video foundation models at Tether.

In Annual Meeting of the Association for Computational Linguistics, 2026

In Conference on Computer Vision and Pattern Recognition, 2025

In International Conference on Learning Representations, 2025

In British Machine Vision Conference, 2024

In Conference on Computer Vision and Pattern Recognition Workshops, 2024

In Findings of the Association for Computational Linguistics, 2024

In International Conference on Pattern Recognition, 2024

In International Journal of Computer Vision, 2025

In British Machine Vision Conference, 2025

In International Conference on Computer Vision, 2025

In European Conference on Computer Vision and Pattern Recognition, 2024

In Sensors (MDPI), 2023

In IEEE Intelligent Systems, 2024
A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning.
I'm always open to discussing research ideas, potential collaborations, or opportunities to apply AI in innovative ways.