CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data

CVPR 2022

Qi Yan, Jianhao Zheng, Simon Reding, Shanci Li, Iordan Doytchinov

How to Obtain Data and Models for Absolute Visual Re-localization?


Synthetic training data generation workflow utilizing open source 3D topography models.



The CrossLoc datasets include 50K+ annotated images for sim-to-real aerial localization.



Real-data efficient absolute visual re-localization algorithm through cross-modal representation learning.



We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data.

Despite significant progress in recent years, most learning-based approaches to visual localization target a single domain and require a dense database of geo-tagged images to function well. To mitigate the data scarcity issue and improve the scalability of neural localization models, we introduce TOPO-DataGen, a versatile synthetic data generation tool that traverses smoothly between the real and virtual worlds, hinged on the geographic camera viewpoint. We propose new large-scale sim-to-real benchmark datasets to showcase and evaluate the utility of this synthetic data. Our experiments reveal that synthetic data generally enhances neural network performance on real data. Furthermore, we introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation that makes full use of the scene coordinate ground truth via self-supervision. Without any extra data, CrossLoc significantly outperforms state-of-the-art methods and achieves substantially higher real-data sample efficiency.


An open-source multimodal synthetic data generation tool tailored to aerial scenes.
It takes common geo-data as input and outputs diverse synthetic visual data such as 2D images, 3D geometry, semantics, and camera poses. The rendering engine is built on CesiumJS. See the TOPO-DataGen code for details and feel free to try the demo to generate synthetic data in Genève.

TOPO-DataGen is generically useful for enhancing aerial visual localization networks. The proposed data generation workflow outputs both 2D RGB images and the associated 3D labels such as the pixel-wise scene coordinates. It has the following advantages:
Dense and accurate 2D-3D correspondences without much engineering effort, unlike structure-from-motion.
Seamless traversal between the real and virtual worlds, hinged on the camera viewpoint.
Support for both image-based methods such as absolute pose regression and structure-based methods that require 3D labels.

The TOPO-DataGen workflow can be used to improve the training of most visual localization networks, including both the image-based methods (by cross-domain data augmentation) and the structure-based methods (by multi-modal visual data and cross-modal representations).
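
To make the notion of a pixel-wise scene coordinate label concrete, the minimal sketch below back-projects a single pixel with known depth into the world frame. It is an illustration under stated assumptions, not code from TOPO-DataGen: it assumes a pinhole camera, z-depth measured along the optical axis, and a pose given as a camera-to-world rotation plus camera center. The function name `backproject`, the intrinsics, and all numbers are hypothetical.

```python
def backproject(u, v, depth, K, R_c2w, cam_center):
    """Return the world-frame scene coordinate seen at pixel (u, v).

    K is (fx, fy, cx, cy); R_c2w is a 3x3 camera-to-world rotation given as
    nested lists; cam_center is the camera position in the world frame.
    All conventions here are assumptions made for this sketch.
    """
    fx, fy, cx, cy = K
    # 3D point in the camera frame (z-depth along the optical axis).
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rotate into the world frame and translate by the camera center.
    return tuple(
        sum(R_c2w[i][j] * p_cam[j] for j in range(3)) + cam_center[i]
        for i in range(3)
    )


# Hypothetical intrinsics and pose, chosen only so the result is easy to check.
K = (500.0, 500.0, 320.0, 240.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
centre = (10.0, 20.0, 30.0)

point = backproject(420.0, 240.0, 50.0, K, R_identity, centre)
print(point)
```

The sketch is written in plain Python so that it runs without dependencies; applying it to a full image would normally be vectorized over all pixels.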

CrossLoc Benchmark Datasets

Large-scale benchmark datasets for sim-to-real visual localization,
including synthetic and real images with 3D and semantic labels on urban and natural sites.


Synthetic Images


Real Images





We introduce two large-scale sim-to-real benchmark datasets to exemplify the utility of TOPO-DataGen. Please visit the CrossLoc Benchmark Datasets page hosted by Dryad for download. See the GitHub repo for further dataset details and setup instructions.
All 7k+ real images are accurately geo-referenced using a professional drone, numerous ground control points, and a GNSS base.
All real images have matching synthetic RGB images with a validated positioning accuracy at cm-level.
Most generated 3D labels have a reprojection error of no more than 2 pixels.
All synthetic data is created from fully open geo-data published by swisstopo, the Swiss Federal Office of Topography. Semantic maps are generated as well.
Carefully calibrated camera intrinsic parameters and the 6D camera poses are provided.
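
The reprojection error quoted above can be checked per label with a few lines of code. The sketch below projects a 3D scene coordinate back into the image and measures its pixel distance to the pixel it is attached to. It assumes a pinhole model without lens distortion and a pose stored as a camera-to-world rotation plus camera center; the helper names `project` and `reprojection_error` and all numbers are hypothetical, and the dataset's actual file layout and conventions should be taken from the GitHub repo.

```python
import math


def project(point_w, K, R_c2w, cam_center):
    """Project a world-frame 3D point to pixel coordinates (u, v)."""
    fx, fy, cx, cy = K
    diff = [point_w[i] - cam_center[i] for i in range(3)]
    # World -> camera uses the transpose of the camera-to-world rotation.
    p_cam = [sum(R_c2w[j][i] * diff[j] for j in range(3)) for i in range(3)]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    return (fx * p_cam[0] / p_cam[2] + cx, fy * p_cam[1] / p_cam[2] + cy)


def reprojection_error(pixel, point_w, K, R_c2w, cam_center):
    """Pixel distance between a labelled pixel and its reprojected 3D label."""
    u, v = project(point_w, K, R_c2w, cam_center)
    return math.hypot(u - pixel[0], v - pixel[1])


# Hypothetical example: the label projects to (420, 240), one pixel away
# from the pixel it is attached to.
K = (500.0, 500.0, 320.0, 240.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
error = reprojection_error((421.0, 240.0), (10.0, 0.0, 50.0),
                           K, R_identity, (0.0, 0.0, 0.0))
within_threshold = error <= 2.0  # the 2-pixel bound mentioned above
print(error, within_threshold)
```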

(Note: quality may be compromised due to GIF compression)

CrossLoc localization

A cross-modal visual representation learning method via self-supervision for absolute localization.

CrossLoc learns to localize a query image by predicting its scene coordinates with a set of cross-modal encoders and then estimating the camera pose with a PnP solver. Similar to self-supervised learning, it exploits the structure of the data to create additional supervisory signals. Specifically, it uses the coordinate-depth-normal geometric hierarchy for self-supervision: from the 3D scene coordinate labels, one can compute an accurate 6D camera pose and subsequently the depth and surface normals without any external labels. The coordinate, depth, and normal regression tasks are observed to be geometrically closely related, so the aggregated cross-modal representations can be used to enhance the final task of coordinate regression.
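
The coordinate-depth-normal hierarchy can be illustrated with a small example. The sketch below derives a depth map and a surface normal from a tiny scene coordinate map and a known camera pose. It shows one common way to do this (a change of frame for depth, a finite-difference cross product for normals) and is not necessarily how the CrossLoc code computes them. The pose convention (camera-to-world rotation plus camera center), the choice of z-depth, the world-frame normals oriented toward the camera, and the names `to_camera`, `depth_map`, and `normal_at` are all assumptions.

```python
import math


def to_camera(point_w, R_c2w, cam_center):
    """Express a world-frame point in the camera frame."""
    diff = [point_w[i] - cam_center[i] for i in range(3)]
    return [sum(R_c2w[j][i] * diff[j] for j in range(3)) for i in range(3)]


def depth_map(coords, R_c2w, cam_center):
    """z-depth of every scene coordinate in a row-major coordinate map."""
    return [[to_camera(p, R_c2w, cam_center)[2] for p in row] for row in coords]


def normal_at(coords, r, c, cam_center):
    """World-frame unit normal at (r, c) from forward finite differences."""
    p, right, down = coords[r][c], coords[r][c + 1], coords[r + 1][c]
    a = [right[i] - p[i] for i in range(3)]
    b = [down[i] - p[i] for i in range(3)]
    n = [a[1] * b[2] - a[2] * b[1],
         a[2] * b[0] - a[0] * b[2],
         a[0] * b[1] - a[1] * b[0]]
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("degenerate neighbourhood, normal undefined")
    n = [x / length for x in n]
    # Orient the normal toward the camera (a convention assumed here).
    view = [cam_center[i] - p[i] for i in range(3)]
    if sum(n[i] * view[i] for i in range(3)) < 0:
        n = [-x for x in n]
    return n


# Hypothetical scene: a flat ground plane at z = 0 seen by a camera 100 m
# above it that looks straight down (camera z-axis = world -z).
R_down = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
centre = (0.0, 0.0, 100.0)
coords = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(0.0, -1.0, 0.0), (1.0, -1.0, 0.0)]]

depths = depth_map(coords, R_down, centre)
normal = normal_at(coords, 0, 0, centre)
print(depths, normal)
```

In this toy scene every pixel has a depth of 100 and the normal points straight up, which is what one expects for a flat ground plane viewed from directly above.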

See the CrossLoc localization code for training code, pretrained models, and baseline implementations.



GitHub repositories:

Baseline implementation:

CVPR Camera Ready


        title={CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data},
        author={Yan, Qi and Zheng, Jianhao and Reding, Simon and Li, Shanci and Doytchinov, Iordan},
        journal={arXiv preprint arXiv:2112.09081},

        title={CrossLoc Benchmark Datasets},
        author={Doytchinov, Iordan and Yan, Qi and Zheng, Jianhao and Reding, Simon and Li, Shanci},

