VLM-R1 adopts the same GRPO algorithm that powers Deepseek R1 and applies it to vision-language models. Today, let's explore how this algorithm improves performance on visual tasks.

What is VLM-R1?
GRPO (Group Relative Policy Optimization) is the reinforcement-learning method that helped Deepseek R1 improve its reasoning abilities. The VLM-R1 team found that GRPO also helps vision-language models (VLMs) perform better on general computer vision tasks, and that its generalization ability surpasses that of traditional SFT (Supervised Fine-Tuning).
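To make the core idea concrete, here is a minimal sketch (not the VLM-R1 team's actual code) of the group-relative advantage at the heart of GRPO: for each prompt, the policy samples a group of completions, scores them with a rule-based reward, and normalizes each reward against the group's mean and standard deviation. The function name and tensor shapes below are illustrative assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Hypothetical sketch of GRPO's advantage computation.

    `rewards` has shape (num_prompts, group_size): one row per prompt,
    one column per sampled completion for that prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each completion's advantage is its reward relative to the group average,
    # scaled by the group's spread; no learned value network is needed.
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled answers each, scored by a rule-based reward.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.5, 0.5, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

Because the baseline comes from the group itself rather than a separate critic model, this keeps training simple and cheap, which is part of GRPO's appeal for VLMs.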
So far, VLM-R1 has performed impressively, and the project's GitHub star count is growing rapidly.

Trial link
https://huggingface.co/spaces/omlab/VLM-R1-Referral-Expression

Evaluation
The team trained the Qwen2.5-VL 3B model on RefCOCO (a visual grounding task) and evaluated it on the RefCOCO val set (in-domain) and on RefGTA (an out-of-domain, OOD, task).
Specifically, for the Referring Expression Comprehension (REC) task, the Qwen2.5-VL model was trained with both the R1 (GRPO) and SFT methods. On in-domain test data, the SFT model performs slightly worse than the R1 model.
However, on out-of-domain test data, the performance of the SFT model significantly decreases as the number of training steps increases, whereas the R1 model demonstrates stable improvement.
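For a REC setup like this, the reward that drives GRPO can be a simple rule: score 1.0 if the predicted bounding box overlaps the annotated box closely enough, else 0.0. The sketch below assumes an IoU threshold of 0.5 and omits any format reward; it illustrates the idea rather than reproducing the team's exact implementation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def rec_reward(predicted_box, ground_truth_box, threshold=0.5):
    """Rule-based reward for REC: 1.0 if the prediction is close enough
    to the ground-truth box, else 0.0. (Threshold is an assumption.)"""
    return 1.0 if iou(predicted_box, ground_truth_box) >= threshold else 0.0

# Example: a prediction close to the annotated box earns reward 1.0.
print(rec_reward((10, 10, 110, 110), (12, 8, 108, 112)))
```

A rule-based reward of this kind needs no reward model and cannot be gamed the way a learned reward can, which is one plausible reason the R1-trained model generalizes better out of domain than its SFT counterpart.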