CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models

Wuhan University
*Equal Contribution

Corresponding Author

Abstract

The rapid advancement of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing, has demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating these capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 23 leaf tasks to ensure well-rounded coverage. CHOICE guarantees the quality of its 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The newly curated data and the multiple-choice format with definitive answers allow for an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations within this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.
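This page does not include the evaluation code itself. As a rough illustration of how accuracy over multiple-choice problems with definitive answers might be computed, the Python sketch below formats a problem, parses the option letter from a model's reply, and tallies accuracy. The field names, prompt wording, and letter-extraction rule are assumptions made for illustration, not the benchmark's actual schema or protocol.

```python
# Minimal sketch of a multiple-choice evaluation loop (illustrative only).
# Field names ("question", "options", "answer") and the prompt format are
# assumptions, not the benchmark's actual data schema or prompting protocol.
import re
from typing import Callable, Optional

def build_prompt(problem: dict) -> str:
    """Render one problem as a single-answer multiple-choice prompt."""
    letters = "ABCDEFGH"
    lines = [problem["question"]]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(problem["options"])]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)

def extract_choice(response: str) -> Optional[str]:
    """Pull the first standalone option letter out of a model response."""
    match = re.search(r"\b([A-H])\b", response.strip().upper())
    return match.group(1) if match else None

def accuracy(problems: list[dict], model: Callable[[str], str]) -> float:
    """Fraction of problems where the parsed letter matches the ground truth."""
    correct = 0
    for problem in problems:
        pred = extract_choice(model(build_prompt(problem)))
        correct += int(pred == problem["answer"])
    return correct / len(problems)
```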

CHOICE Leaderboard

Secondary (L-2) capability dimensions. ILC, SII, and CID are the three perception dimensions (Image-Level Comprehension, Single-Instance Identification, and Cross-Instance Discernment); AttR, AssR, and CSR are the three reasoning dimensions reported in the Reasoning table below. A sketch of how leaf-task scores might roll up into these dimension scores follows the table.

Model ILC SII CID AttR AssR CSR
Proprietary Large Vision-Language Models
GPT-4o-mini 0.800 0.588 0.448 0.494 0.474 0.876
GPT-4o-2024-11-20 0.845 0.616 0.591 0.536 0.277 0.900
Gemini-1.5-Pro 0.867 0.585 0.636 0.590 0.611 0.876
General-domain Large Vision-Language Models
Qwen2-VL-7B 0.806 0.638 0.675 0.600 0.550 0.884
Qwen2-VL-72B 0.855 0.702 0.742 0.634 0.580 0.914
Ovis1.6-Gemma2-9B 0.828 0.606 0.632 0.598 0.645 0.890
InternVL2-8B 0.789 0.566 0.648 0.664 0.580 0.816
InternVL2-26B 0.820 0.578 0.595 0.594 0.638 0.890
InternVL2-40B 0.838 0.629 0.721 0.694 0.530 0.946
LLaVA-1.6-7B 0.686 0.463 0.574 0.450 0.433 0.749
LLaVA-1.6-13B 0.719 0.522 0.626 0.474 0.470 0.733
Llama3.2-11B 0.746 0.502 0.504 0.460 0.388 0.803
GLM-4V-9B 0.785 0.555 0.644 0.564 0.450 0.863
DeepSeek-VL-7B 0.764 0.561 0.595 0.528 0.373 0.840
MiniCPM-V-2.5 0.757 0.546 0.631 0.414 0.493 0.857
Phi3-Vision 0.710 0.503 0.582 0.428 0.253 0.700
mPLUG-Owl3-7B 0.766 0.524 0.487 0.422 0.340 0.850
Molmo-7B-D 0.720 0.553 0.648 0.536 0.548 0.724
Remote Sensing Large Vision-Language Models
GeoChat 0.642 0.469 0.480 0.384 0.368 0.697
LHRS-Bot 0.633 0.290 0.171 0.366 0.426 0.610
LHRS-Bot-nova 0.688 0.530 0.526 0.450 0.120 0.644
VHM 0.751 0.703 0.436 0.392 0.348 0.744
RemoteCLIP 0.657 0.283 0.552 0.326 0.364 0.739
GeoRSCLIP 0.745 0.198 0.285 0.210 0.397 0.884
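The relationship between the dimension-level scores above and the per-leaf-task tables below is presumably an aggregation of leaf-task scores within each secondary dimension. The sketch below assumes an unweighted mean per dimension; the benchmark's actual aggregation (for instance, weighting by the number of questions per task) may differ, and the grouping passed in is purely illustrative.

```python
# Sketch of rolling leaf-task scores up into secondary (L-2) dimension scores.
# Assumes an unweighted mean per dimension; the benchmark's actual weighting
# may differ, and the task grouping used here is hypothetical.
from statistics import mean

def dimension_scores(leaf_scores: dict[str, float],
                     grouping: dict[str, list[str]]) -> dict[str, float]:
    """Average the available leaf-task scores within each L-2 dimension."""
    result = {}
    for dim, tasks in grouping.items():
        present = [leaf_scores[t] for t in tasks if t in leaf_scores]
        result[dim] = mean(present) if present else float("nan")
    return result

# Example with dummy leaf scores and a hypothetical grouping:
print(dimension_scores({"AC": 0.50, "SR": 0.40, "CD": 0.60},
                       {"CID": ["AC", "SR", "CD"]}))
# -> {'CID': 0.5} (the mean of the three dummy scores)
```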

Perception

Model IM IQ MR SC IC LR OC OL OP AR VG HD AC SR CD
(leaf-task columns are grouped, left to right, under Image-Level Comprehension, Single-Instance Identification, and Cross-Instance Discernment)
Proprietary Large Vision-Language Models
GPT-4o-mini 0.781 0.500 0.490 0.937 0.940 0.940 0.542 0.612 0.914 0.604 0.000 0.906 0.426 0.340 0.650
GPT-4o-2024-11-20 0.868 0.564 0.567 0.954 1.000 0.940 0.594 0.768 0.918 0.610 0.028 0.830 0.509 0.567 0.770
Gemini-1.5-Pro 0.874 0.678 0.790 0.937 0.955 0.960 0.634 0.578 0.952 0.548 0.027 0.810 0.797 0.413 0.690
General-domain Large Vision-Language Models
Qwen2-VL-7B 0.720 0.540 0.800 0.930 0.980 0.980 0.560 0.790 0.950 0.570 0.372 0.570 0.750 0.510 0.790
Qwen2-VL-72B 0.860 0.560 0.800 0.960 1.000 0.970 0.620 0.840 0.950 0.580 0.530 0.670 0.800 0.610 0.840
Ovis1.6-Gemma2-9B 0.800 0.570 0.660 0.940 0.990 0.880 0.500 0.720 0.940 0.570 0.060 0.900 0.680 0.490 0.760
InternVL2-8B 0.730 0.520 0.430 0.930 0.970 0.770 0.400 0.590 0.890 0.590 0.173 0.790 0.850 0.370 0.710
InternVL2-26B 0.780 0.540 0.610 0.950 0.960 0.980 0.460 0.680 0.950 0.620 0.052 0.730 0.660 0.410 0.760
InternVL2-40B 0.820 0.570 0.550 0.960 0.990 0.970 0.530 0.770 0.980 0.610 0.228 0.670 0.870 0.500 0.790
LLaVA-1.6-7B 0.340 0.460 0.380 0.920 0.880 0.740 0.220 0.690 0.970 0.590 0.237 0.060 0.730 0.340 0.650
LLaVA-1.6-13B 0.480 0.490 0.400 0.910 0.940 0.700 0.400 0.670 0.860 0.610 0.302 0.300 0.790 0.400 0.680
Llama3.2-11B 0.650 0.400 0.650 0.910 0.940 0.890 0.510 0.650 0.920 0.590 0.002 0.360 0.580 0.310 0.660
GLM-4V-9B 0.660 0.500 0.680 0.940 0.950 0.980 0.530 0.670 0.930 0.600 0.003 0.620 0.840 0.370 0.710
DeepSeek-VL-7B 0.570 0.540 0.670 0.920 0.940 0.890 0.460 0.740 0.950 0.530 0.253 0.430 0.770 0.340 0.670
MiniCPM-V-2.5 0.610 0.430 0.690 0.930 0.980 0.900 0.460 0.600 0.940 0.610 0.055 0.640 0.790 0.400 0.700
Phi3-Vision 0.510 0.480 0.480 0.880 0.910 0.660 0.350 0.760 0.910 0.560 0.105 0.380 0.770 0.370 0.570
mPLUG-Owl3-7B 0.770 0.420 0.670 0.890 0.960 0.920 0.500 0.660 0.950 0.590 0.073 0.380 0.520 0.300 0.710
Molmo-7B-D 0.560 0.530 0.350 0.870 0.920 0.740 0.390 0.630 0.920 0.670 0.015 0.760 0.800 0.490 0.620
Remote Sensing Large Vision-Language Models
GeoChat 0.313 0.299 0.433 0.922 0.710 0.790 0.218 0.762 0.934 0.510 0.297 0.064 0.671 0.300 0.415
LHRS-Bot 0.288 0.306 0.318 0.915 0.755 0.500 0.252 0.164 0.936 0.330 0.076 0.267 0.325
LHRS-Bot-nova 0.479 0.271 0.350 0.950 0.840 0.690 0.412 0.642 0.972 0.528 0.271 0.372 0.737 0.257 0.560
VHM 0.621 0.428 0.299 0.966 0.765 0.760 0.342 0.872 0.932 0.564 0.598 0.922 0.431 0.347 0.580
RemoteCLIP 0.510 0.303 0.369 0.891 0.775 0.700 0.435 0.295 0.715 0.500 0.050 0.688 0.248 0.545
GeoRSCLIP 0.705 0.303 0.471 0.944 0.860 0.980 0.230 0.360 0.750 0.250 0.050 0.361 0.248 0.570

Reasoning

Model TP PP EA RA DD GD SI
(leaf-task columns are grouped, left to right, under AttR, AssR, and CSR)
Proprietary Large Vision-Language Models
GPT-4o-mini 0.420 0.543 0.420 0.492 0.870 0.925 0.847
GPT-4o-2024-11-20 0.565 0.517 0.470 0.213 0.890 0.950 0.873
Gemini-1.5-Pro 0.520 0.637 0.460 0.662 0.835 0.985 0.830
General-domain Large Vision-Language Models
Qwen2-VL-7B 0.510 0.660 0.580 0.540 0.880 0.970 0.830
Qwen2-VL-72B 0.580 0.670 0.490 0.610 0.900 0.980 0.880
Ovis1.6-Gemma2-9B 0.580 0.610 0.600 0.660 0.910 0.930 0.850
InternVL2-8B 0.520 0.760 0.400 0.640 0.880 0.790 0.790
InternVL2-26B 0.510 0.650 0.690 0.620 0.850 0.960 0.870
InternVL2-40B 0.610 0.750 0.560 0.520 0.950 0.980 0.920
LLaVA-1.6-7B 0.300 0.550 0.320 0.470 0.740 0.710 0.780
LLaVA-1.6-13B 0.330 0.570 0.560 0.440 0.740 0.640 0.790
Llama3.2-11B 0.430 0.480 0.440 0.370 0.810 0.890 0.740
GLM-4V-9B 0.420 0.660 0.360 0.480 0.860 0.990 0.780
DeepSeek-VL-7B 0.390 0.620 0.170 0.440 0.860 0.880 0.800
MiniCPM-V-2.5 0.390 0.430 0.620 0.450 0.920 0.880 0.800
Phi3-Vision 0.380 0.460 0.290 0.240 0.850 0.580 0.680
mPLUG-Owl3-7B 0.350 0.470 0.520 0.280 0.900 0.860 0.810
Molmo-7B-D 0.500 0.560 0.600 0.530 0.940 0.530 0.710
Remote Sensing Large Vision-Language Models
GeoChat 0.255 0.470 0.105 0.455 0.660 0.810 0.647
LHRS-Bot 0.260 0.437 0.480 0.408 0.550 0.435 0.767
LHRS-Bot-nova 0.315 0.540 0.185 0.098 0.525 0.520 0.807
VHM 0.405 0.383 0.340 0.350 0.740 0.760 0.737
RemoteCLIP 0.335 0.210 0.035 0.245 0.820 0.650 0.740
GeoRSCLIP 0.400 0.310 0.310 0.260 0.945 0.935 0.880

BibTeX


@misc{an2025choicebenchmarkingremotesensing,
  title={CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models},
  author={Xiao An and Jiaxing Sun and Zihan Gui and Wei He},
  year={2025},
  eprint={2411.18145},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.18145},
}