CVPR 2026

TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval

David G. Shatwell, Sirnam Swetha, Mubarak Shah

Institute of Artificial Intelligence, University of Central Florida

TIGeR learns a shared geo-temporal embedding for images, GPS coordinates, and timestamps, enabling geo-time aware image retrieval, time-of-capture prediction, and geo-localization within one unified framework.


Overview of TIGeR geo-time aware image retrieval. — Given a query image and a target time, TIGeR retrieves an image from the same physical location at the specified time rather than relying on raw visual similarity alone.

What TIGeR Solves

Many real-world applications need to jointly reason about how a place looks, where it is, and when the image was captured. TIGeR formalizes this as geo-time aware image retrieval: retrieve an image of the same place at a specified target time.

Why It Matters

The model preserves location identity across large seasonal, lighting, and structural changes, making it useful for digital forensics, environmental monitoring, and long-term visual analysis.

Main Outcome

TIGeR consistently improves over strong baselines, with up to 16% lower time-of-year prediction error, 8% lower time-of-day prediction error, and 14% higher geo-time retrieval recall.

Abstract

Many real-world applications in digital forensics, urban monitoring, and environmental analysis require jointly reasoning about visual appearance, location, and time. Beyond standard geo-localization and time-of-capture prediction, these applications increasingly demand more complex capabilities, such as retrieving an image captured at the same location as a query image but at a specified target time.

TIGeR is a unified framework for Time, Images and Geo-location Retrieval. It supports flexible input configurations across single-modality and multi-modality queries and uses one shared representation to perform geo-localization, time-of-capture prediction, and geo-time aware retrieval.

To support this setting, the paper introduces a multistage curation pipeline and a new benchmark containing 4.5M paired image-location-time triplets for training and 86k high-quality triplets for evaluation. By preserving location identity despite large appearance changes, TIGeR retrieves images based on where and when a scene was captured, not just how it looks.

Method

TIGeR embeds image, geo-location, and time into a shared geo-temporal feature space. A frozen CLIP visual encoder produces image tokens, while dedicated location and time encoders represent GPS coordinates and timestamps using random Fourier features. These unimodal and bimodal token sets are processed by a shared multimodal transformer that learns explicit interactions between where, when, and how a place appears.
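As a rough illustration of the random Fourier feature encoding mentioned above, the sketch below maps a GPS coordinate to a fixed-size vector of cosines and sines. The helper name `make_rff_encoder`, the frequency scale `sigma`, the feature count, and the input normalization are all illustrative assumptions; the page does not state the settings TIGeR actually uses.

```python
import numpy as np

def make_rff_encoder(in_dim, num_freqs, sigma, seed=0):
    """Return a function mapping (..., in_dim) inputs to (..., 2 * num_freqs) features."""
    rng = np.random.default_rng(seed)
    # Frequencies are sampled once from a Gaussian and then kept fixed.
    freqs = rng.normal(0.0, sigma, size=(in_dim, num_freqs))

    def encode(x):
        proj = 2.0 * np.pi * np.asarray(x, dtype=np.float64) @ freqs
        return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

    return encode

# Hypothetical normalization: latitude and longitude scaled to [-1, 1].
encode_location = make_rff_encoder(in_dim=2, num_freqs=64, sigma=4.0)
latlon = np.array([[28.60, -81.20]]) / np.array([90.0, 180.0])
location_features = encode_location(latlon)  # shape (1, 128)
```

A larger `sigma` makes the features sensitive to finer spatial differences; a timestamp encoder could reuse the same helper with `in_dim` set to the number of time components.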

The refined tokens are pooled and projected into a shared embedding space. TIGeR is trained with cross-modal contrastive objectives so that compatible image, location, and time inputs align, while auxiliary classification heads with soft metric targets regularize the space to vary smoothly over nearby places and nearby times.
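The page does not give the exact form of the contrastive objective. One common choice, assumed here purely for illustration, is a symmetric InfoNCE-style loss in which row i of one modality's embeddings is the positive match for row i of the other. The function names and the temperature value below are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE-style loss; row i of emb_a is the positive for row i of emb_b."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    diag = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))  # stand-in for pooled image embeddings
aligned_loss = symmetric_contrastive_loss(image_emb, image_emb)
shuffled_loss = symmetric_contrastive_loss(image_emb, image_emb[::-1])
```

Correctly paired embeddings give a lower loss than mismatched ones, which is the behavior that pulls compatible image, location, and time inputs together.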

Architecture diagram of TIGeR. — TIGeR uses modality-specific encoders, a shared multimodal transformer, and joint contrastive plus soft classification losses to build a unified geo-temporal embedding space.

Flexible Querying

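The same representation supports single-modality and fused queries, where I, l, and t denote image, location, and time. Examples include I -> t (time prediction from an image), I -> l (geo-localization from an image), and It -> I (retrieving an image from a query image plus a target time).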
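Because every query type lands in the same embedding space, retrieval can reduce to a nearest-neighbor search. The sketch below assumes cosine similarity as the ranking score and uses random vectors as stand-ins for real embeddings; `rank_gallery` is a hypothetical helper, not part of any released API.

```python
import numpy as np

def rank_gallery(query_emb, gallery_emb, top_k=5):
    """Return indices of the top_k gallery rows by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    scores = g @ q
    return np.argsort(-scores)[:top_k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 16))               # stand-in for candidate embeddings
query = gallery[42] + 0.01 * rng.normal(size=16)   # stand-in for a fused query embedding
top = rank_gallery(query, gallery, top_k=5)
```

The same function would serve any query configuration, since only the way the query embedding is produced changes.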

Cross-Modal Fusion

Instead of aligning modalities only at the end, the shared transformer lets image, location, and time tokens attend to each other directly.

Structured Supervision

Soft targets over discretized location and time classes encourage embeddings to respect geographic and temporal neighborhood structure.
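To make the idea of soft targets concrete, the sketch below spreads label mass over hour-of-day bins according to circular distance, so that bins on either side of midnight count as neighbors. The Gaussian kernel, the bin count, and the `bandwidth` parameter are assumptions chosen for illustration; the page does not specify how the metric targets are constructed.

```python
import math

def soft_cyclic_targets(value, num_bins, period, bandwidth):
    """Distribute label mass over bins by circular distance to `value`."""
    bin_width = period / num_bins
    weights = []
    for k in range(num_bins):
        center = (k + 0.5) * bin_width
        gap = abs(value - center) % period
        circular = min(gap, period - gap)  # distance wraps around the period
        weights.append(math.exp(-0.5 * (circular / bandwidth) ** 2))
    total = sum(weights)
    return [w / total for w in weights]

# 23:30 sits at the center of the last hourly bin; the first bin of the
# next day receives the same weight as the 22:00 bin because of wrap-around.
targets = soft_cyclic_targets(value=23.5, num_bins=24, period=24.0, bandwidth=1.0)
```

Training a classification head against such targets penalizes a prediction one bin away far less than one twelve hours away, unlike a one-hot label.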

Dataset and Benchmark Curation

The benchmark is curated from AMOS webcam data using a semi-supervised quality filtering pipeline. The pipeline identifies corruption modes, trains an image-quality classifier on labeled examples, removes low-quality frames, and then builds geographically balanced train and test splits with no camera overlap.

The resulting dataset contains 4.5M training images and 86k evaluation images with broad geographic coverage and substantial diversity across months and times of day.
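The no-camera-overlap constraint described above can be sketched as a split over camera identifiers rather than over individual frames. The record layout and the `camera_id` field are hypothetical, and this sketch does not attempt the geographic balancing that the actual pipeline performs.

```python
import random

def split_by_camera(records, test_fraction=0.2, seed=0):
    """Assign whole cameras to train or test so no camera appears in both."""
    cameras = sorted({r["camera_id"] for r in records})
    random.Random(seed).shuffle(cameras)
    num_test = max(1, round(len(cameras) * test_fraction))
    test_cameras = set(cameras[:num_test])
    train = [r for r in records if r["camera_id"] not in test_cameras]
    test = [r for r in records if r["camera_id"] in test_cameras]
    return train, test

# Toy data: 10 cameras with 3 frames each.
records = [{"camera_id": cam, "frame": i} for cam in range(10) for i in range(3)]
train, test = split_by_camera(records)
```

Splitting at the camera level keeps evaluation honest, because a model cannot score well simply by memorizing the fixed viewpoint of a training camera.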

Dataset distribution visualization for TIGeR. — The benchmark covers diverse locations worldwide and maintains broad month and hour distributions.

Dataset curation pipeline for TIGeR. — Multistage curation filters corrupted imagery and produces balanced train and benchmark splits.

Main Results

TIGeR outperforms prior geo-temporal baselines across geo-time retrieval, time prediction, and geo-localization on both TIGeR-test-86k and CVT.

Geo-time Retrieval

TIGeR reaches 37.51% R@10 on TIGeR-test-86k and 29.98% on CVT, delivering up to 14% higher geo-time retrieval recall than prior methods.

Time Prediction (ToY)

TIGeR reduces ToY error to 48.86 days on TIGeR-test-86k and 47.00 days on CVT, both with Il -> t queries, yielding up to 16% better time-of-year prediction.

Time Prediction (ToD)

TIGeR lowers ToD error to 3.06 hours on TIGeR-test-86k and 2.35 hours on CVT, both with Il -> t queries, corresponding to up to 8% better time-of-day prediction.

Qualitative geo-time aware retrieval examples: TIGeR retrieves images from the same place at the requested time more reliably than prior approaches.

Geo-time Aware Image Retrieval

Given a query image and target time, TIGeR retrieves an image from the same location at the requested time. It substantially improves recall on the new benchmark and remains strong on CVT.

Dataset	R@1 (%)	R@5 (%)	R@10 (%)
TIGeR-test-86k	3.51	23.30	37.51
CVT	14.55	24.46	29.98
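For readers unfamiliar with the R@K columns, the sketch below shows the usual definition: the percentage of queries with at least one correct result among the top K. What counts as a correct result for geo-time retrieval is not defined on this page, so the boolean relevance flags here are a placeholder assumption.

```python
def recall_at_k(ranked_hits, k):
    """ranked_hits[i] holds booleans for query i's ranked results, best first."""
    hits = sum(1 for row in ranked_hits if any(row[:k]))
    return 100.0 * hits / len(ranked_hits)

# Toy relevance flags for four queries with three ranked results each.
ranked_hits = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, False],
]
r1 = recall_at_k(ranked_hits, 1)
r3 = recall_at_k(ranked_hits, 3)
```

By construction R@K can only grow with K, which matches the left-to-right increase in each table row.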

Time Prediction

TIGeR improves both time-of-year and time-of-day prediction, and using location as auxiliary context yields further gains.

Query	Dataset	ToY error (days)	ToD error (hours)
I -> t	TIGeR-test-86k	51.49	3.13
I -> t	CVT	62.88	2.73
Il -> t	TIGeR-test-86k	48.86	3.06
Il -> t	CVT	47.00	2.35
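Time-of-year and time-of-day are periodic quantities, so a natural error measure wraps around the period. The page does not state whether the reported errors are computed this way; the sketch below assumes they are and uses a 365-day year and a 24-hour day.

```python
def cyclic_error(pred, target, period):
    """Absolute error between two values on a circle of length `period`."""
    gap = abs(pred - target) % period
    return min(gap, period - gap)

# 23:00 versus 01:00 is two hours apart, not twenty-two.
tod_error = cyclic_error(23.0, 1.0, 24.0)
# Day 360 versus day 10 is fifteen days apart across the new year.
toy_error = cyclic_error(360.0, 10.0, 365.0)
```

Under this definition the largest possible error is half the period, roughly 182 days for time-of-year and 12 hours for time-of-day.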

Time prediction distributions for TIGeR. — TIGeR produces sharp, well-localized temporal distributions for both time-of-year and time-of-day prediction.

Geo-localization examples for TIGeR. — Retrieved geo-location candidates concentrate near the ground truth across diverse scenes.

Geo-localization Accuracy

TIGeR achieves the strongest localization accuracy on both datasets, with especially large gains on the new TIGeR benchmark.

Query	Dataset	Acc. @ 200 km (%)	Acc. @ 750 km (%)
I -> l	TIGeR-test-86k	48.63	65.61
I -> l	CVT	51.90	66.53
It -> l	TIGeR-test-86k	48.34	64.51
It -> l	CVT	53.40	67.15
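Threshold accuracies such as these are typically the percentage of predictions that fall within a given great-circle distance of the ground truth. The sketch below assumes the haversine formula with a mean Earth radius of 6371 km; the evaluation code behind the table may differ in detail, and the coordinates are approximate city locations used only as toy data.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_within(preds, truths, threshold_km):
    """Percentage of predictions within threshold_km of the ground truth."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, truths))
    return 100.0 * hits / len(preds)

truths = [(28.54, -81.38), (40.71, -74.01)]   # roughly Orlando and New York
preds = [(27.95, -82.46), (34.05, -118.24)]   # roughly Tampa and Los Angeles
acc_200 = accuracy_within(preds, truths, 200.0)
```

In this toy case only the first prediction is within 200 km of its target, so half of the predictions count as correct at that threshold.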

BibTeX

@inproceedings{shatwell2026tiger,
  title     = {TIGeR: A Unified Framework for Time, Images and Geo-location Retrieval},
  author    = {Shatwell, David G. and Swetha, Sirnam and Shah, Mubarak},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}