CVPR 2026
TIGeR learns a shared geo-temporal embedding for images, GPS coordinates, and timestamps, enabling geo-time aware image retrieval, time-of-capture prediction, and geo-localization within one unified framework.
Many real-world applications need to jointly reason about how a place looks, where it is, and when it was captured. TIGeR formalizes this as geo-time aware image retrieval: retrieve an image of the same place at a specified target time.
The model preserves location identity across large seasonal, lighting, and structural changes, making it useful for digital forensics, environmental monitoring, and long-term visual analysis.
TIGeR consistently improves over strong baselines, with up to 16% better time-of-year prediction, 8% better time-of-day prediction, and 14% better geo-time retrieval recall.
Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, location, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time.
TIGeR is a unified framework for Time, Images and Geo-location Retrieval. It supports flexible input configurations across single-modality and multi-modality queries and uses one shared representation to perform geo-localization, time-of-capture prediction, and geo-time aware retrieval.
To support this setting, the paper introduces a multistage curation pipeline and a new benchmark containing 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. By preserving location identity despite large appearance changes, TIGeR retrieves images based on where and when a scene was captured, not just how it looks.
TIGeR embeds image, geo-location, and time into a shared geo-temporal feature space. A frozen CLIP visual encoder produces image tokens, while dedicated location and time encoders represent GPS coordinates and timestamps using random Fourier features. These unimodal and bimodal token sets are processed by a shared multimodal transformer that learns explicit interactions between where, when, and how a place appears.
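As a rough illustration of how random Fourier features can turn low-dimensional GPS coordinates or timestamps into a richer vector, consider the minimal sketch below. The function names, the normalization of inputs, the bandwidth `sigma`, and the feature count are all assumptions made for this example and are not taken from the paper.

```python
import math
import random

def make_rff_encoder(in_dim, num_features, sigma, seed=0):
    """Build a random Fourier feature map from in_dim inputs to 2 * num_features outputs."""
    rng = random.Random(seed)
    # Frequencies are sampled once from N(0, sigma^2) and then kept fixed.
    # sigma is a hypothetical bandwidth, not a value reported by the authors.
    freqs = [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
             for _ in range(num_features)]

    def encode(x):
        projections = [2.0 * math.pi * sum(w * v for w, v in zip(row, x))
                       for row in freqs]
        # Concatenate all cosine features followed by all sine features.
        return [math.cos(p) for p in projections] + [math.sin(p) for p in projections]

    return encode

def normalize_location(lat_deg, lon_deg):
    # Assumed scaling of latitude and longitude to roughly [-1, 1].
    return [lat_deg / 90.0, lon_deg / 180.0]

def normalize_time(day_of_year, hour_of_day):
    # Assumed scaling of time of year and time of day to roughly [0, 1].
    return [day_of_year / 365.0, hour_of_day / 24.0]

encode_location = make_rff_encoder(in_dim=2, num_features=8, sigma=4.0, seed=0)
encode_time = make_rff_encoder(in_dim=2, num_features=8, sigma=4.0, seed=1)

loc_feat = encode_location(normalize_location(28.60, -81.20))
time_feat = encode_time(normalize_time(172, 14.5))
```

In a full model these feature vectors would typically pass through learned layers before entering the shared transformer; that part is omitted here.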
The refined tokens are pooled and projected into a shared embedding space. TIGeR is trained with cross-modal contrastive objectives so that compatible image, location, and time inputs align, while auxiliary classification heads with soft metric targets regularize the space to vary smoothly over nearby places and nearby times.
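The exact form of the contrastive objective is not given here. One common choice for aligning two sets of paired embeddings is a symmetric InfoNCE loss, sketched below under that assumption. The temperature of 0.07 and all function names are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE over a batch where emb_a[i] and emb_b[i] form a matching pair."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # Cosine similarity between every a[i] and b[j], scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Direction a -> b: each row should put its mass on the diagonal entry.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: the same computation on the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_ba = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)
```

With several modality pairings (for example image with location, image with time, and fused queries with images), one such term per pairing could be summed, though how the terms are weighted is not specified in this text.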
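The same representation supports single-modality and fused queries such as I -> t, I -> l, and It -> I, where I, l, and t denote image, location, and time. For example, It -> I fuses a query image with a target time to retrieve an image.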
Instead of aligning modalities only at the end, the shared transformer lets image, location, and time tokens attend to each other directly.
Soft targets over discretized location and time classes encourage embeddings to respect geographic and temporal neighborhood structure.
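To make the idea of soft metric targets concrete, the sketch below builds a target distribution over 12 month bins that decays with circular distance from the true time of year, then scores predicted logits against it with a soft cross-entropy. The bin layout, the exponential kernel, and the scale `tau` are assumptions chosen for illustration.

```python
import math

def circular_distance(a, b, period):
    # Shortest distance between two points on a cycle of the given period.
    d = abs(a - b) % period
    return min(d, period - d)

def soft_targets(true_value, bin_centers, period, tau):
    """Distribution over bins that decays with circular distance from true_value."""
    weights = [math.exp(-circular_distance(true_value, c, period) / tau)
               for c in bin_centers]
    total = sum(weights)
    return [w / total for w in weights]

def soft_cross_entropy(logits, targets):
    # Cross-entropy between a soft target distribution and softmax(logits).
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - lse) for t, z in zip(targets, logits))

# Twelve month bins with centers at 0.5, 1.5, ..., 11.5 (time measured in months).
month_centers = [i + 0.5 for i in range(12)]
# An early-January capture: December receives more weight than February,
# because the distance wraps around the end of the year.
targets = soft_targets(0.2, month_centers, period=12.0, tau=1.0)
uniform_loss = soft_cross_entropy([0.0] * 12, targets)
```

Compared with a one-hot label, a prediction that lands in a neighboring bin is penalized less than one that lands half a year away, which is the smoothness property described above.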
The benchmark is curated from AMOS webcam data using a semi-supervised quality filtering pipeline. The pipeline identifies corruption modes, trains an image-quality classifier on labeled examples, removes low-quality frames, and then builds geographically balanced train and test splits with no camera overlap.
The resulting dataset contains 4.5M training images and 86k evaluation images with broad geographic coverage and substantial diversity across months and times of day.
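The no-camera-overlap property can be obtained by splitting at the camera level rather than the frame level. The sketch below shows only that step; the record layout, the `camera_id` field name, and the test fraction are hypothetical, and geographic balancing is deliberately left out.

```python
import random

def split_by_camera(records, test_fraction=0.1, seed=0):
    """Assign whole cameras to train or test so no camera appears in both splits."""
    camera_ids = sorted({r["camera_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(camera_ids)
    num_test = max(1, round(test_fraction * len(camera_ids)))
    test_cameras = set(camera_ids[:num_test])
    train = [r for r in records if r["camera_id"] not in test_cameras]
    test = [r for r in records if r["camera_id"] in test_cameras]
    return train, test

# Toy data: 10 cameras with 3 frames each.
records = [{"camera_id": c, "frame": f} for c in range(10) for f in range(3)]
train, test = split_by_camera(records, test_fraction=0.2, seed=0)
```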
TIGeR outperforms prior geo-temporal baselines across geo-time retrieval, time prediction, and geo-localization on both TIGeR-test-86k and CVT.
Geo-time Retrieval
37.51%
R@10 on TIGeR-test-86k
TIGeR reaches 37.51% R@10 on TIGeR-test-86k and 29.98% on CVT, delivering up to 14% higher geo-time retrieval recall than prior methods.
Time Prediction (ToY)
48.86 days
Best ToY error on TIGeR-test-86k with Il -> t
TIGeR reduces ToY error to 48.86 days on TIGeR-test-86k and 47.00 days on CVT, yielding up to 16% better time-of-year prediction.
Time Prediction (ToD)
3.06 hours
Best ToD error on TIGeR-test-86k with Il -> t
TIGeR lowers ToD error to 3.06 hours on TIGeR-test-86k and 2.35 hours on CVT, corresponding to up to 8% better time-of-day prediction.
Given a query image and target time, TIGeR retrieves an image from the same location at the requested time. It substantially improves recall on the new benchmark and remains strong on CVT.
| Dataset | R@1 | R@5 | R@10 |
|---|---|---|---|
| TIGeR-test-86k | 3.51 | 23.30 | 37.51 |
| CVT | 14.55 | 24.46 | 29.98 |
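For readers unfamiliar with the R@K columns, the sketch below ranks a gallery by cosine similarity to each query embedding and reports the percentage of queries with at least one relevant item in the top K. What counts as relevant (for instance, same location within some tolerance of the target time) is an assumption here, since the evaluation protocol is not spelled out in this text. The embeddings are toy values.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(rankings, relevant_sets, k):
    """Percentage of queries whose top-k results contain at least one relevant index."""
    hits = sum(1 for ranking, relevant in zip(rankings, relevant_sets)
               if any(i in relevant for i in ranking[:k]))
    return 100.0 * hits / len(rankings)

# Toy example: two fused query embeddings and a three-image gallery.
gallery = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {1}]  # hypothetical ground-truth gallery indices per query
rankings = [rank_gallery(q, gallery) for q in queries]
r_at_1 = recall_at_k(rankings, relevant, k=1)
r_at_2 = recall_at_k(rankings, relevant, k=2)
```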
TIGeR improves both time-of-year and time-of-day prediction, and using location as auxiliary context yields further gains.
| Query | Dataset | ToY error (days) | ToD error (hours) |
|---|---|---|---|
| I -> t | TIGeR-test-86k | 51.49 | 3.13 |
| I -> t | CVT | 62.88 | 2.73 |
| Il -> t | TIGeR-test-86k | 48.86 | 3.06 |
| Il -> t | CVT | 47.00 | 2.35 |
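Time of year and time of day are both cyclic, so a natural way to score predictions is the shortest distance around the cycle. The sketch below assumes that convention, with periods of 365 days and 24 hours; whether the reported numbers use exactly this definition is not stated here.

```python
def cyclic_error(pred, target, period):
    # Shortest distance between prediction and target on a cycle.
    diff = abs(pred - target) % period
    return min(diff, period - diff)

def mean_cyclic_error(preds, targets, period):
    return sum(cyclic_error(p, t, period) for p, t in zip(preds, targets)) / len(preds)

# Day 360 versus day 5 counts as 10 days apart, not 355.
toy_error_days = mean_cyclic_error([360.0, 100.0], [5.0, 90.0], period=365.0)
# 23:00 versus 01:00 counts as 2 hours apart, not 22.
tod_error_hours = mean_cyclic_error([23.0, 12.0], [1.0, 13.0], period=24.0)
```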
TIGeR achieves the strongest localization accuracy on both datasets, with especially large gains on the new TIGeR benchmark.
| Query | Dataset | Accuracy @ 200 km | Accuracy @ 750 km |
|---|---|---|---|
| I -> l | TIGeR-test-86k | 48.63 | 65.61 |
| I -> l | CVT | 51.90 | 66.53 |
| It -> l | TIGeR-test-86k | 48.34 | 64.51 |
| It -> l | CVT | 53.40 | 67.15 |
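Threshold-based localization accuracy is usually computed from the great-circle distance between predicted and true coordinates. The sketch below uses the haversine formula with a mean Earth radius of 6371 km and reports the percentage of predictions within a given distance. The function names and sample coordinates are illustrative, and the distance formula is an assumption about the evaluation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_at_threshold(preds, targets, threshold_km):
    """Percentage of predictions that fall within threshold_km of the true location."""
    hits = sum(1 for (plat, plon), (tlat, tlon) in zip(preds, targets)
               if haversine_km(plat, plon, tlat, tlon) <= threshold_km)
    return 100.0 * hits / len(preds)

# Two toy predictions at (0, 0); the true points are 1 and 10 degrees of longitude away.
preds = [(0.0, 0.0), (0.0, 0.0)]
targets = [(0.0, 1.0), (0.0, 10.0)]
acc_200 = accuracy_at_threshold(preds, targets, 200.0)
acc_750 = accuracy_at_threshold(preds, targets, 750.0)
```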
@inproceedings{shatwell2026tiger,
title = {TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval},
author = {Shatwell, David G. and Swetha, Sirnam and Shah, Mubarak},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}