GeoVista:

Web-Augmented Agentic Visual Reasoning for Geolocalization

Yikun Wang1,4, Zuyan Liu3, Ziyi Wang3, Han Hu2, Pengfei Liu4, Yongming Rao2
1Fudan University 2Tencent Hunyuan 3Tsinghua University 4Shanghai Innovation Institute
GeoVista teaser

Abstract

Current research on agentic visual reasoning enables deep multimodal understanding but focuses primarily on image-manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocation task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocation benchmarks lack the high-resolution imagery and the level of localization difficulty needed to exercise deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocation ability of agentic models.

We also propose GeoVista, an agentic model that seamlessly integrates tool invocation into the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it: a cold-start supervised fine-tuning (SFT) stage that teaches reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage that further enhances reasoning ability. We adopt a hierarchical reward that leverages multi-level geographical information to improve overall geolocation performance.
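The hierarchical reward can be sketched as follows. The administrative levels (`country`, `region`, `city`) and their weights are illustrative assumptions for this sketch, not the exact configuration used by GeoVista.

```python
# Illustrative sketch of a hierarchical geolocation reward.
# The levels and weights below are assumptions, not GeoVista's actual values.
LEVEL_WEIGHTS = {"country": 0.2, "region": 0.3, "city": 0.5}

def hierarchical_reward(pred: dict, gold: dict) -> float:
    """Sum the weight of every administrative level the prediction gets right,
    so coarser correctness (e.g. right country, wrong city) still earns partial credit."""
    return sum(
        w for level, w in LEVEL_WEIGHTS.items()
        if level in pred and pred.get(level) == gold.get(level)
    )
```

A prediction that matches all three levels earns the full reward, while one that only names the correct country still receives a small positive signal, which is the point of a multi-level reward during RL.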

Experimental results show that GeoVista substantially surpasses other open-source agentic models on the geolocation task and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.

GeoVista

GeoVista demo video.

GeoVista agentic pipeline

Image examples from GeoBench and the training data, and the agentic pipeline of GeoVista. Given a query and an image, the policy model iteratively generates thoughts and actions; each action is parsed and executed, yielding a new observation, and the loop repeats until the model outputs a final geolocation prediction or reaches the maximum number of interaction turns.
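The thought–action–observation loop described above can be sketched roughly as below. The `policy` callable, the action format, and the tool names are assumptions made for illustration, not GeoVista's actual interface.

```python
# Rough sketch of the agentic loop: the policy alternates thoughts and tool
# calls until it emits a final answer or hits the turn limit.
# `policy`, the (thought, action, arg) format, and the tool set are
# illustrative assumptions, not GeoVista's real API.
MAX_TURNS = 8

def run_agent(policy, tools, query, image):
    history = [("query", query), ("image", image)]
    for _ in range(MAX_TURNS):
        thought, action, arg = policy(history)   # one reasoning step
        history.append(("thought", thought))
        if action == "answer":                   # final geolocation prediction
            return arg
        observation = tools[action](arg)         # e.g. image zoom or web search
        history.append((action, arg))
        history.append(("observation", observation))
    return None  # turn limit reached without a final prediction
```

The key design point is that tool outputs are appended back into the history, so each subsequent policy step conditions on everything observed so far.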

Benchmark

Level-wise evaluation

The evaluation pipeline of the GeoBench dataset. The evaluation system consists of (1) level-wise evaluation, which employs both rule-based and model-based verifiers to determine correctness at different administrative levels, and (2) nuanced evaluation, which extracts the predicted address, applies geocoding to obtain the predicted geolocation point, and computes the haversine distance to the ground-truth location.
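The haversine distance used in the nuanced evaluation is the standard great-circle distance between two latitude/longitude points; a minimal implementation looks like this (using a mean Earth radius of 6371 km):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points,
    given in degrees, via the haversine formula."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

For example, Paris to London comes out at roughly 340 km, which matches the known great-circle distance between the two cities.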

Nuanced evaluation

GeoBench is the first benchmark to evaluate the general geolocalization ability of agentic models.

Experiments

Experiments level-wise

Comparison on GeoBench. Bold figures indicate the best performance among closed-source and open-source models, respectively; underlined figures indicate open-source results that surpass at least one closed-source counterpart.

Experiments nuanced evaluation

Nuanced distance statistics of different models on GeoBench. Bold figures indicate the best performance among closed-source and open-source models, respectively.

Citation


@misc{wang2025geovistawebaugmentedagenticvisual,
      title={GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization}, 
      author={Yikun Wang and Zuyan Liu and Ziyi Wang and Pengfei Liu and Han Hu and Yongming Rao},
      year={2025},
      eprint={2511.15705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15705}, 
}