Current multimodal retrieval-augmented generation (RAG) benchmarks primarily focus on textual knowledge retrieval for question answering, which presents significant limitations. In…
Vision-language models (VLMs) have demonstrated impressive capabilities in general image understanding, but face significant challenges when processing text-rich visual content…