GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space

ICCV 2025 (Oral Presentation)

1Center for Research in Computer Vision, 2Adobe

TL;DR

GT-Loc is a geo-temporal retrieval model capable of:

Time Prediction & Geo‑Localization: Given a query image, GT‑Loc encodes its visual content and compares it to two separate galleries: one of timestamp embeddings and one of GPS‑location embeddings. The closest matches yield an estimated capture time (e.g. “Dec 18, 1:50 PM”) and an estimated geographic coordinate (e.g. “41° 42′ 9″ N, 86° 14′ 16″ W”).

Geo‑Time Composed Image Retrieval: GT‑Loc can also work in reverse: you specify a desired time and place, each encoded into its own embedding. Those two embeddings are fused into a single query, which is then used to search an image gallery. The model returns the image whose embedding best matches the combined time‑and‑location request. Both retrieval directions are sketched in code below.
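Both use cases reduce to cosine-similarity search in the shared embedding space. The sketch below illustrates them in PyTorch; the encoder modules, the gallery formats, and the simple additive fusion of the time and location embeddings are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn.functional as F

# image_encoder, time_encoder, location_encoder are assumed to be trained
# nn.Module's that map their inputs into the shared d-dimensional space.

def predict_time_and_location(image, time_gallery, gps_gallery,
                              image_encoder, time_encoder, location_encoder):
    # Image -> (time, location): nearest-neighbor lookup in two galleries.
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)   # (1, d)
    time_embs = F.normalize(time_encoder(time_gallery), dim=-1)        # (Nt, d)
    gps_embs = F.normalize(location_encoder(gps_gallery), dim=-1)      # (Ng, d)
    best_time = time_gallery[(img_emb @ time_embs.T).argmax()]
    best_gps = gps_gallery[(img_emb @ gps_embs.T).argmax()]
    return best_time, best_gps

def retrieve_images(query_time, query_gps, image_gallery_embs,
                    time_encoder, location_encoder, k=5):
    # (time, location) -> images: fuse the two query embeddings, then rank.
    t = F.normalize(time_encoder(query_time.unsqueeze(0)), dim=-1)
    g = F.normalize(location_encoder(query_gps.unsqueeze(0)), dim=-1)
    q = F.normalize(t + g, dim=-1)          # additive fusion (an assumption)
    scores = (q @ image_gallery_embs.T).squeeze(0)
    return scores.topk(k).indices           # indices of the top-k gallery images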

Abstract

Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, instead of conventional contrastive learning with hard positives and negatives, we propose a temporal metric-learning objective providing soft targets by modeling pairwise time differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses previous time prediction methods, even those using the ground-truth geo-location as an input during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, and the unified embedding space facilitates compositional and text-based image retrieval.

Method

GT-Loc Model

GT-Loc is built around three parallel encoders that project each input into a single shared feature space: one for images, one for timestamps, and one for GPS coordinates. During training, triplets of an image, its true capture time, and its true location are encoded and then aligned through two complementary retrieval losses: a novel temporal metric learning (TML) loss that softly models cyclic time differences (hour and month) on a toroidal surface, and a geo‑localization loss that brings matching images and coordinates together. By jointly optimizing these objectives, the model learns an embedding space where any modality can retrieve the others: at test time an image can be used to look up its best‑matching time and place, or conversely a specified time and location can be fused into a combined query to retrieve the most relevant image.
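A minimal sketch of one training step under these two objectives is shown below, assuming a batch of (image, time, location) triplets; the temperature, the symmetric InfoNCE form of the geo-localization loss, and the loss weighting lambda_geo are our assumptions for illustration.

import torch
import torch.nn.functional as F

def training_step(images, times, gps, cyclic_dist,
                  image_enc, time_enc, loc_enc, tml_loss,
                  temperature=0.07, lambda_geo=1.0):
    # cyclic_dist: (B, B) pairwise toroidal time distances for the batch.
    # tml_loss: the temporal metric-learning loss (sketched further below).
    v = F.normalize(image_enc(images), dim=-1)   # (B, d) image embeddings
    t = F.normalize(time_enc(times), dim=-1)     # (B, d) time embeddings
    g = F.normalize(loc_enc(gps), dim=-1)        # (B, d) location embeddings

    # Temporal metric learning: image-time similarities supervised with
    # soft targets derived from cyclic time differences.
    loss_time = tml_loss(v @ t.T / temperature, cyclic_dist)

    # Geo-localization: contrastive (InfoNCE-style) alignment of images
    # with their paired GPS embeddings.
    logits = v @ g.T / temperature
    labels = torch.arange(len(images), device=logits.device)
    loss_geo = 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

    return loss_time + lambda_geo * loss_geo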


Temporal Metric Learning

The model first computes cosine similarities between image embeddings and their paired time embeddings, then applies a soft cross‑entropy loss that pushes each row of similarities toward a target distribution derived from a precomputed matrix of true cyclic time differences. In this way, the supervision signal aligns embedding similarities with the actual temporal distances between samples.
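A minimal sketch of this loss: given the scaled image-time similarity matrix and the pairwise cyclic distance matrix for the batch, the soft targets are taken here as a softmax over negative distances (our assumption), so temporally closer pairs receive more probability mass. Computing the cyclic distances themselves is sketched in the Cyclic Distance section below.

import torch
import torch.nn.functional as F

def tml_loss(logits, cyclic_dist, tau=1.0):
    # logits      : (B, B) scaled image-time cosine similarities.
    # cyclic_dist : (B, B) pairwise toroidal time distances.
    # Soft targets from negative distances (temperature tau is an assumption).
    targets = F.softmax(-cyclic_dist / tau, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Soft cross-entropy: match the similarity distribution to the targets.
    return -(targets * log_probs).sum(dim=-1).mean()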

Cyclic Distance

The two cyclic time units (hour of day and month of year) are mapped onto a 2D cyclic surface, a torus, so that the true separation between two timestamps follows the shortest wrap‑around path. Straight‑line (L2) distances on the flattened plane cut across the cycle boundary and therefore overestimate real time differences; the toroidal distance instead measures the wrap‑around interval correctly (for example, 11 PM and 1 AM are two hours apart on the torus, not 22).
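A minimal sketch of this wrap-around distance, assuming times given as float (hour, month) pairs; combining the two wrapped axes with a plain Euclidean norm is our assumption.

import torch

def toroidal_distance(a, b, periods=(24.0, 12.0)):
    # a: (N, 2), b: (M, 2) float tensors of (hour, month).
    # Per axis, the wrap-around distance is min(|d|, period - |d|);
    # the two axes are then combined Euclidean-style on the torus.
    d = (a[:, None, :] - b[None, :, :]).abs()                  # (N, M, 2)
    periods_t = torch.tensor(periods, dtype=a.dtype, device=a.device)
    wrapped = torch.minimum(d, periods_t - d)                  # shortest path per axis
    return wrapped.norm(dim=-1)                                # (N, M)

For example, hours 23 and 1 differ by 2 under this distance, whereas the flat distance along the hour axis would report 22.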

Qualitative Results

GT-Loc predictions from the SkyFinder test set. The top-1 predicted location and time are shown below the distributions.

Quantitative Results

Time-of-Capture Prediction

Zero-shot time prediction on the unseen cameras of the SkyFinder dataset. Rows marked with * indicate methods we replicated, closely following the protocols outlined in prior work.


Geo-Localization

Zero-shot geo-localization on the Im2GPS3k and GWS15k datasets, reported as the fraction of samples whose predicted location falls within a 1 km radius of the ground-truth location.
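For reference, this metric can be computed as sketched below, using the standard haversine great-circle distance; the function names here are ours.

import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # Great-circle distance between two GPS coordinates, in kilometers.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def accuracy_at_threshold(preds, gts, threshold_km=1.0):
    # Fraction of samples whose prediction lies within threshold_km of the truth.
    hits = sum(haversine_km(*p, *g) <= threshold_km for p, g in zip(preds, gts))
    return hits / len(gts)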


Image Retrieval

Zero-shot image retrieval (T+L→I) on unseen cameras of the SkyFinder dataset. Given a query location and time, we retrieve the top-k candidate images from a gallery and report recall@k.
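A sketch of the recall@k computation, assuming precomputed, L2-normalized query (fused time+location) and gallery image embeddings; the positive_mask format is our assumption.

import torch

def recall_at_k(query_embs, gallery_embs, positive_mask, k=5):
    # query_embs    : (Q, d) fused time+location query embeddings.
    # gallery_embs  : (G, d) image gallery embeddings.
    # positive_mask : (Q, G) boolean, True where a gallery image matches the query.
    scores = query_embs @ gallery_embs.T                 # cosine similarities
    topk = scores.topk(k, dim=-1).indices                # (Q, k)
    hits = positive_mask.gather(1, topk).any(dim=-1)     # hit if any top-k is a match
    return hits.float().mean().item()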


BibTeX

@misc{shatwell2025gtloc,
      title={GT-Loc: Unifying When and Where in Images Through a Joint Embedding Space}, 
      author={David G. Shatwell and Ishan Rajendrakumar Dave and Sirnam Swetha and Mubarak Shah},
      year={2025},
      eprint={2507.10473},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.10473}, 
}