The rapid advancement of Large Vision-Language Models (VLMs), both general-domain models and those tailored specifically for remote sensing, has demonstrated exceptional perception and reasoning capabilities in Earth observation tasks. However, a benchmark for systematically evaluating these capabilities in this domain is still lacking. To bridge this gap, we propose CHOICE, an extensive benchmark designed to objectively evaluate the hierarchical remote sensing capabilities of VLMs. Focusing on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 23 leaf tasks to ensure well-rounded assessment coverage. CHOICE guarantees the quality of its 10,507 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The newly curated data and the multiple-choice format with definitive answers enable an objective and straightforward performance assessment. Our evaluation of 3 proprietary and 21 open-source VLMs highlights their critical limitations in this specialized context. We hope that CHOICE will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.
Model | ILC | SII | CID | AttR | AssR | CSR |
---|---|---|---|---|---|---|
Proprietary Large Vision-Language Models | ||||||
GPT-4o-mini | 0.800 | 0.588 | 0.448 | 0.494 | 0.474 | 0.876 |
GPT-4o-2024-11-20 | 0.845 | 0.616 | 0.591 | 0.536 | 0.277 | 0.900 |
Gemini-1.5-Pro | 0.867 | 0.585 | 0.636 | 0.590 | 0.611 | 0.876 |
General-domain Large Vision-Language Models | ||||||
Qwen2-VL-7B | 0.806 | 0.638 | 0.675 | 0.600 | 0.550 | 0.884 |
Qwen2-VL-72B | 0.855 | 0.702 | 0.742 | 0.634 | 0.580 | 0.914 |
Ovis1.6-Gemma2-9B | 0.828 | 0.606 | 0.632 | 0.598 | 0.645 | 0.890 |
InternVL2-8B | 0.789 | 0.566 | 0.648 | 0.664 | 0.580 | 0.816 |
InternVL2-26B | 0.820 | 0.578 | 0.595 | 0.594 | 0.638 | 0.890 |
InternVL2-40B | 0.838 | 0.629 | 0.721 | 0.694 | 0.530 | 0.946 |
LLaVA-1.6-7B | 0.686 | 0.463 | 0.574 | 0.450 | 0.433 | 0.749 |
LLaVA-1.6-13B | 0.719 | 0.522 | 0.626 | 0.474 | 0.470 | 0.733 |
Llama3.2-11B | 0.746 | 0.502 | 0.504 | 0.460 | 0.388 | 0.803 |
GLM-4V-9B | 0.785 | 0.555 | 0.644 | 0.564 | 0.450 | 0.863 |
DeepSeek-VL-7B | 0.764 | 0.561 | 0.595 | 0.528 | 0.373 | 0.840 |
MiniCPM-V-2.5 | 0.757 | 0.546 | 0.631 | 0.414 | 0.493 | 0.857 |
Phi3-Vision | 0.710 | 0.503 | 0.582 | 0.428 | 0.253 | 0.700 |
mPLUG-Owl3-7B | 0.766 | 0.524 | 0.487 | 0.422 | 0.340 | 0.850 |
Molmo-7B-D | 0.720 | 0.553 | 0.648 | 0.536 | 0.548 | 0.724 |
Remote Sensing Large Vision-Language Models | ||||||
GeoChat | 0.642 | 0.469 | 0.480 | 0.384 | 0.368 | 0.697 |
LHRS-Bot | 0.633 | 0.290 | 0.171 | 0.366 | 0.426 | 0.610 |
LHRS-Bot-nova | 0.688 | 0.530 | 0.526 | 0.450 | 0.120 | 0.644 |
VHM | 0.751 | 0.703 | 0.436 | 0.392 | 0.348 | 0.744 |
RemoteCLIP | 0.657 | 0.283 | 0.552 | 0.326 | 0.364 | 0.739 |
GeoRSCLIP | 0.745 | 0.198 | 0.285 | 0.210 | 0.397 | 0.884 |
Leaf-task results for the perception dimensions (columns grouped, left to right, under Image-Level Comprehension, Single-Instance Identification, and Cross-Instance Discernment):

Model | IM | IQ | MR | SC | IC | LR | OC | OL | OP | AR | VG | HD | AC | SR | CD
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Proprietary Large Vision-Language Models | |||||||||||||||
GPT-4o-mini | 0.781 | 0.500 | 0.490 | 0.937 | 0.940 | 0.940 | 0.542 | 0.612 | 0.914 | 0.604 | 0.000 | 0.906 | 0.426 | 0.340 | 0.650 |
GPT-4o-2024-11-20 | 0.868 | 0.564 | 0.567 | 0.954 | 1.000 | 0.940 | 0.594 | 0.768 | 0.918 | 0.610 | 0.028 | 0.830 | 0.509 | 0.567 | 0.770 |
Gemini-1.5-Pro | 0.874 | 0.678 | 0.790 | 0.937 | 0.955 | 0.960 | 0.634 | 0.578 | 0.952 | 0.548 | 0.027 | 0.810 | 0.797 | 0.413 | 0.690 |
General-domain Large Vision-Language Models | |||||||||||||||
Qwen2-VL-7B | 0.720 | 0.540 | 0.800 | 0.930 | 0.980 | 0.980 | 0.560 | 0.790 | 0.950 | 0.570 | 0.372 | 0.570 | 0.750 | 0.510 | 0.790 |
Qwen2-VL-72B | 0.860 | 0.560 | 0.800 | 0.960 | 1.000 | 0.970 | 0.620 | 0.840 | 0.950 | 0.580 | 0.530 | 0.670 | 0.800 | 0.610 | 0.840 |
Ovis1.6-Gemma2-9B | 0.800 | 0.570 | 0.660 | 0.940 | 0.990 | 0.880 | 0.500 | 0.720 | 0.940 | 0.570 | 0.060 | 0.900 | 0.680 | 0.490 | 0.760 |
InternVL2-8B | 0.730 | 0.520 | 0.430 | 0.930 | 0.970 | 0.770 | 0.400 | 0.590 | 0.890 | 0.590 | 0.173 | 0.790 | 0.850 | 0.370 | 0.710 |
InternVL2-26B | 0.780 | 0.540 | 0.610 | 0.950 | 0.960 | 0.980 | 0.460 | 0.680 | 0.950 | 0.620 | 0.052 | 0.730 | 0.660 | 0.410 | 0.760 |
InternVL2-40B | 0.820 | 0.570 | 0.550 | 0.960 | 0.990 | 0.970 | 0.530 | 0.770 | 0.980 | 0.610 | 0.228 | 0.670 | 0.870 | 0.500 | 0.790 |
LLaVA-1.6-7B | 0.340 | 0.460 | 0.380 | 0.920 | 0.880 | 0.740 | 0.220 | 0.690 | 0.970 | 0.590 | 0.237 | 0.060 | 0.730 | 0.340 | 0.650 |
LLaVA-1.6-13B | 0.480 | 0.490 | 0.400 | 0.910 | 0.940 | 0.700 | 0.400 | 0.670 | 0.860 | 0.610 | 0.302 | 0.300 | 0.790 | 0.400 | 0.680 |
Llama3.2-11B | 0.650 | 0.400 | 0.650 | 0.910 | 0.940 | 0.890 | 0.510 | 0.650 | 0.920 | 0.590 | 0.002 | 0.360 | 0.580 | 0.310 | 0.660 |
GLM-4V-9B | 0.660 | 0.500 | 0.680 | 0.940 | 0.950 | 0.980 | 0.530 | 0.670 | 0.930 | 0.600 | 0.003 | 0.620 | 0.840 | 0.370 | 0.710 |
DeepSeek-VL-7B | 0.570 | 0.540 | 0.670 | 0.920 | 0.940 | 0.890 | 0.460 | 0.740 | 0.950 | 0.530 | 0.253 | 0.430 | 0.770 | 0.340 | 0.670 |
MiniCPM-V-2.5 | 0.610 | 0.430 | 0.690 | 0.930 | 0.980 | 0.900 | 0.460 | 0.600 | 0.940 | 0.610 | 0.055 | 0.640 | 0.790 | 0.400 | 0.700 |
Phi3-Vision | 0.510 | 0.480 | 0.480 | 0.880 | 0.910 | 0.660 | 0.350 | 0.760 | 0.910 | 0.560 | 0.105 | 0.380 | 0.770 | 0.370 | 0.570 |
mPLUG-Owl3-7B | 0.770 | 0.420 | 0.670 | 0.890 | 0.960 | 0.920 | 0.500 | 0.660 | 0.950 | 0.590 | 0.073 | 0.380 | 0.520 | 0.300 | 0.710 |
Molmo-7B-D | 0.560 | 0.530 | 0.350 | 0.870 | 0.920 | 0.740 | 0.390 | 0.630 | 0.920 | 0.670 | 0.015 | 0.760 | 0.800 | 0.490 | 0.620 |
Remote Sensing Large Vision-Language Models | |||||||||||||||
GeoChat | 0.313 | 0.299 | 0.433 | 0.922 | 0.710 | 0.790 | 0.218 | 0.762 | 0.934 | 0.510 | 0.297 | 0.064 | 0.671 | 0.300 | 0.415 |
LHRS-Bot | 0.288 | 0.306 | 0.318 | 0.915 | 0.755 | 0.500 | 0.252 | 0.164 | 0.936 | 0.330 | – | 0.076 | – | 0.267 | 0.325 |
LHRS-Bot-nova | 0.479 | 0.271 | 0.350 | 0.950 | 0.840 | 0.690 | 0.412 | 0.642 | 0.972 | 0.528 | 0.271 | 0.372 | 0.737 | 0.257 | 0.560 |
VHM | 0.621 | 0.428 | 0.299 | 0.966 | 0.765 | 0.760 | 0.342 | 0.872 | 0.932 | 0.564 | 0.598 | 0.922 | 0.431 | 0.347 | 0.580 |
RemoteCLIP | 0.510 | 0.303 | 0.369 | 0.891 | 0.775 | 0.700 | 0.435 | 0.295 | 0.715 | 0.500 | – | 0.050 | 0.688 | 0.248 | 0.545 |
GeoRSCLIP | 0.705 | 0.303 | 0.471 | 0.944 | 0.860 | 0.980 | 0.230 | 0.360 | 0.750 | 0.250 | – | 0.050 | 0.361 | 0.248 | 0.570 |
Leaf-task results for the reasoning dimensions (columns grouped, left to right, under AttR, AssR, and CSR):

Model | TP | PP | EA | RA | DD | GD | SI
---|---|---|---|---|---|---|---
Proprietary Large Vision-Language Models | |||||||
GPT-4o-mini | 0.420 | 0.543 | 0.420 | 0.492 | 0.870 | 0.925 | 0.847 |
GPT-4o-2024-11-20 | 0.565 | 0.517 | 0.470 | 0.213 | 0.890 | 0.950 | 0.873 |
Gemini-1.5-Pro | 0.520 | 0.637 | 0.460 | 0.662 | 0.835 | 0.985 | 0.830 |
General-domain Large Vision-Language Models | |||||||
Qwen2-VL-7B | 0.510 | 0.660 | 0.580 | 0.540 | 0.880 | 0.970 | 0.830 |
Qwen2-VL-72B | 0.580 | 0.670 | 0.490 | 0.610 | 0.900 | 0.980 | 0.880 |
Ovis1.6-Gemma2-9B | 0.580 | 0.610 | 0.600 | 0.660 | 0.910 | 0.930 | 0.850 |
InternVL2-8B | 0.520 | 0.760 | 0.400 | 0.640 | 0.880 | 0.790 | 0.790 |
InternVL2-26B | 0.510 | 0.650 | 0.690 | 0.620 | 0.850 | 0.960 | 0.870 |
InternVL2-40B | 0.610 | 0.750 | 0.560 | 0.520 | 0.950 | 0.980 | 0.920 |
LLaVA-1.6-7B | 0.300 | 0.550 | 0.320 | 0.470 | 0.740 | 0.710 | 0.780 |
LLaVA-1.6-13B | 0.330 | 0.570 | 0.560 | 0.440 | 0.740 | 0.640 | 0.790 |
Llama3.2-11B | 0.430 | 0.480 | 0.440 | 0.370 | 0.810 | 0.890 | 0.740 |
GLM-4V-9B | 0.420 | 0.660 | 0.360 | 0.480 | 0.860 | 0.990 | 0.780 |
DeepSeek-VL-7B | 0.390 | 0.620 | 0.170 | 0.440 | 0.860 | 0.880 | 0.800 |
MiniCPM-V-2.5 | 0.390 | 0.430 | 0.620 | 0.450 | 0.920 | 0.880 | 0.800 |
Phi3-Vision | 0.380 | 0.460 | 0.290 | 0.240 | 0.850 | 0.580 | 0.680 |
mPLUG-Owl3-7B | 0.350 | 0.470 | 0.520 | 0.280 | 0.900 | 0.860 | 0.810 |
Molmo-7B-D | 0.500 | 0.560 | 0.600 | 0.530 | 0.940 | 0.530 | 0.710 |
Remote Sensing Large Vision-Language Models | |||||||
GeoChat | 0.255 | 0.470 | 0.105 | 0.455 | 0.660 | 0.810 | 0.647 |
LHRS-Bot | 0.260 | 0.437 | 0.480 | 0.408 | 0.550 | 0.435 | 0.767 |
LHRS-Bot-nova | 0.315 | 0.540 | 0.185 | 0.098 | 0.525 | 0.520 | 0.807 |
VHM | 0.405 | 0.383 | 0.340 | 0.350 | 0.740 | 0.760 | 0.737 |
RemoteCLIP | 0.335 | 0.210 | 0.035 | 0.245 | 0.820 | 0.650 | 0.740 |
GeoRSCLIP | 0.400 | 0.310 | 0.310 | 0.260 | 0.945 | 0.935 | 0.880 |
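One plausible way to roll leaf-task scores up into a secondary-dimension score is an unweighted mean. Note that both the leaf-to-dimension mapping and the equal weighting used below are assumptions: for example, GPT-4o-mini's DD, GD, and SI leaves average to about 0.881 versus its reported CSR score of 0.876, which suggests the official aggregation weights tasks differently (e.g., by question count):

```python
# Leaf scores for GPT-4o-mini (DD, GD, SI), copied from the table above.
# Treating these as the CSR leaves is an assumption, not the paper's stated mapping.
LEAF_SCORES = {"DD": 0.870, "GD": 0.925, "SI": 0.847}

def dimension_score(leaf_scores: dict[str, float]) -> float:
    """Unweighted mean of leaf-task accuracies."""
    return sum(leaf_scores.values()) / len(leaf_scores)

print(round(dimension_score(LEAF_SCORES), 3))  # 0.881, vs the reported 0.876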
If you find CHOICE useful, please cite:

```bibtex
@misc{an2025choicebenchmarkingremotesensing,
  title={CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models},
  author={Xiao An and Jiaxing Sun and Zihan Gui and Wei He},
  year={2025},
  eprint={2411.18145},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.18145},
}
```