January 2026

Isaac Corley

The Technical Debt of Earth Embedding Products

Geospatial foundation models are impressive feats of research. The problem is what happens after the model is trained: distribution, access, and interoperability.

Last month our team spent three days debugging why AlphaEarth embeddings loaded upside-down and how best to handle it. The fix required patches to GDAL, Rasterio, and TorchGeo. These aren't independent: TorchGeo depends on Rasterio, which depends on GDAL. All three need updates, all three need version pins, and now your users can't run older environments. One flipped coordinate killed backwards compatibility across the stack.
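For context, the sketch below shows the general shape of an orientation fix like this one: a raster whose affine transform has a positive y pixel size is stored south-up, so its rows need flipping and its origin needs recomputing. This is a pure-Python illustration of the idea, not the actual patch that landed in GDAL, Rasterio, or TorchGeo.

```python
def normalize_orientation(rows, transform):
    """Flip a south-up raster to north-up.

    `rows` is a list of pixel rows; `transform` is a rasterio-style
    affine tuple (a, b, c, d, e, f), where e is the y pixel size
    (negative for north-up imagery). If e > 0 the data is stored
    bottom-row-first, so we flip it and move the origin to the top edge.
    """
    a, b, c, d, e, f = transform
    if e > 0:  # south-up: flip vertically and recompute the y origin
        rows = rows[::-1]
        f = f + e * len(rows)  # y coordinate of the new top edge
        e = -e                 # pixel height is now negative (north-up)
    return rows, (a, b, c, d, e, f)
```

A north-up raster passes through unchanged, which is why a check like this is cheap to run on every load.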

This is the pattern. Every new Earth embedding product ships like a snowflake. If you want to compare them or stack them, you become the integrator for half a dozen geospatial libraries. Our new paper formalizes this with a taxonomy and TorchGeo integration. This post is about what keeps breaking and why the ecosystem still needs some work.

Everything ships, nothing plugs in

Embeddings are scattered across Source Cooperative, Hugging Face, Earth Engine, private servers, and one-off GitHub repos. Each has its own tile scheme, CRS assumptions, file layout, and storage format. The teams behind these products did the hard part: petabyte-scale processing, cloud cover filtering, reprojection, model inference, etc. The distribution layer is where it falls apart.

Here's what we hit integrating each product into TorchGeo:

  • Clay: Non-standard tile naming; had to reverse-engineer the grid layout from file paths.
  • Major TOM: Parquet with nested geometry columns; required custom deserialization.
  • Earth Index: Clean GeoParquet. This is what all of them should look like. Also check out the source imagery.
  • Copernicus-Embed: 0.25° resolution is ~25km at mid-latitudes. Too coarse for most applications.
  • Presto: GeoTIFF but with implicit CRS assumptions that differ from the imagery it was derived from. See the embeddings on Hugging Face.
  • Tessera: Hidden behind an API running on a university server. Returns raw numpy arrays, not geospatial data. No CRS, no bounds, no metadata. You get numbers and a prayer.
  • AlphaEarth: Originally locked inside Earth Engine. Moving 465 TB to Source Cooperative cost tens of thousands of dollars in egress fees. Taylor Geospatial Engine and Radiant Earth paid that bill so the rest of us don't have to.
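Tessera's bare arrays illustrate the minimum metadata an embedding tile needs before it counts as geospatial data. A hypothetical schema (the field names are ours, not any product's actual API):

```python
from dataclasses import dataclass


@dataclass
class EmbeddingTile:
    """Minimal metadata that makes an embedding array geospatial."""

    values: list        # H x W x D embedding array (nested lists here)
    crs: str            # e.g. "EPSG:32733"
    bounds: tuple       # (west, south, east, north) in CRS units
    pixel_size: float   # CRS units per embedding cell
    model_version: str  # provenance: which model produced this


def pixel_to_coord(tile, row, col):
    """Centre coordinate of embedding cell (row, col), assuming north-up."""
    west, south, east, north = tile.bounds
    x = west + (col + 0.5) * tile.pixel_size
    y = north - (row + 0.5) * tile.pixel_size
    return x, y
```

Without the `crs`, `bounds`, and `pixel_size` fields, a function like `pixel_to_coord` is impossible to write, which is exactly the position a raw-numpy API puts its consumers in.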

The problem: every team solves distribution independently, and the integration tax compounds across products.

Three layers, one tradeoff

The data layer is where most decisions get made. Patch embeddings are manageable and cheap, but they throw away spatial detail. Pixel embeddings are finer-grained, but they blow up storage and bandwidth. Once you see that tradeoff, the rest of the ecosystem starts to make sense.

The tools layer is where you figure out if embeddings are any good: benchmarks, intrinsic dimension analysis, and the open challenges nobody has solved yet. The value layer is what you actually do with them: mapping, retrieval, time-series analysis. Most teams jump straight to value without building the tools to know if their approach is working.

Data: Embeddings
  • Location embeddings
  • Patch embeddings
  • Pixel embeddings

Tools: Analysis
  • Benchmarks
  • Intrinsic dimension
  • Open challenges

Value: Applications
  • Mapping
  • Retrieval
  • Time-series

What's actually out there right now

Here's the full landscape. This is the part that looks clean on paper, but every row hides a different file format, spatial grid, and distribution story. Patch vs pixel, snapshot vs annual coverage, and licenses that don't always play nice together. You can pick any one of these and make progress. The moment you try to compare them, the hidden assumptions start to matter.

Product           Kind   Spatial  Dims  Dtype    License
Clay              Patch  5.12 km  768   float32  ODC-By-1.0
Copernicus-Embed  Patch  0.25°    768   float32  CC-BY-4.0
Major TOM         Patch  ~3 km    2048  float32  CC-BY-SA-4.0
Earth Index       Patch  320 m    384   float32  CC-BY-4.0
AlphaEarth        Pixel  10 m     64    int8     CC-BY-4.0
Tessera           Pixel  10 m     128   int8     CC-BY-4.0
Presto            Pixel  10 m     128   uint16   CC-BY-4.0

If you only care about a city-scale workflow, almost any of these will get you there. The moment you care about global coverage or consistent evaluation, the missing standards become the bottleneck.

The part everyone underestimates: storage

The storage math is where enthusiasm dies: number of cells × embedding dimension × bytes per value compounds fast. A city-scale analysis is fine. Continent-scale? Pixel embeddings explode. This is the part that never shows up in model cards.
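The arithmetic behind the figures below is just cells × dims × bytes per value. A quick sketch (uncompressed sizes, ignoring tiling overhead and compression):

```python
DTYPE_BYTES = {"float32": 4, "uint16": 2, "int8": 1}


def storage_gb(area_km2, cell_size_m, dims, dtype):
    """Uncompressed embedding storage in GB for a region.

    cells = area / cell footprint; total = cells x dims x bytes per value.
    Back-of-envelope only; real products add format overhead.
    """
    cells = area_km2 * 1e6 / cell_size_m**2
    return cells * dims * DTYPE_BYTES[dtype] / 1e9


# Africa (~30M km2): Clay 5.12 km patches vs Presto 10 m pixels
clay_gb = storage_gb(30e6, 5120, 768, "float32")  # ~3.5 GB
presto_gb = storage_gb(30e6, 10, 128, "uint16")   # ~76,800 GB, i.e. 76.8 TB
```

The four-orders-of-magnitude gap between those two numbers is the patch-vs-pixel tradeoff made concrete.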

Continent-scale storage + cost (Africa, 30M km²)

Product           Storage   Storage/mo  Egress
Clay              3.5 GB    $0.08       $0.32
Copernicus-Embed  147.5 MB  $0.00       $0.01
Major TOM         27.3 GB   $0.63      $2
Earth Index       450.0 GB  $10        $41
AlphaEarth        19.2 TB   $442       $1.7k
Tessera           38.4 TB   $883       $3.5k
Presto            76.8 TB   $1.8k      $6.9k

Presto and Tessera at 10m resolution mean 300 billion embeddings for Africa alone. That's 77 TB for Presto (uint16) and 38 TB for Tessera (int8). Patch products like Clay and Copernicus-Embed stay under 4 GB, but you pay for that with spatial detail. This is why so many "global" embeddings end up being theoretical rather than something you can actually download and use.
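The dollar figures above are consistent with standard cloud object-storage pricing. A sketch assuming S3-like rates (both rates are our assumptions, not any vendor's quoted price):

```python
STORAGE_PER_GB_MO = 0.023  # assumed S3 Standard storage rate, USD/GB-month
EGRESS_PER_GB = 0.09       # assumed internet egress rate, USD/GB


def monthly_cost_usd(size_gb):
    """(storage cost per month, cost of one full download) for a dataset."""
    return size_gb * STORAGE_PER_GB_MO, size_gb * EGRESS_PER_GB


# Presto over Africa: ~76.8 TB
store, egress = monthly_cost_usd(76_800)  # ~$1.8k/mo, ~$6.9k per full egress
```

Egress is the hidden multiplier: every downstream team that downloads a pixel-embedding product pays the full egress cost again, which is why hosting programs like the AlphaEarth transfer matter.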

Hard truths

  • Stop over-indexing on Sentinel-1/2. The oceans, atmosphere, and hyperspectral exist. We can't keep claiming we model the Earth if the majority of Earth is out of scope.
  • Cloud-native formats are table stakes. GeoParquet, COG, GeoZarr. Pick one and commit. Bespoke formats are a tax on every downstream user and they compound across products.
  • Benchmarks must ship with models. Private benchmarks kill reproducibility. If I can't run your eval, your numbers don't exist.
  • Embeddings need provenance. Not just vectors: uncertainty, source imagery hashes, model versions. The metadata matters because the underlying data is a moving target.
  • Temporal embeddings are still undercooked. AlphaEarth, Tessera, and Presto do encode time, but most released embeddings are annual composites. Sub-annual dynamics (phenology, flooding, urban growth) get averaged out. The number of timesteps used to create these composites varies arbitrarily between products. Until we have embeddings at native temporal resolution, change detection stays manual.
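The provenance point can be made concrete: a small sidecar record can travel with every tile. A minimal sketch (field names are illustrative, not a proposed standard):

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class EmbeddingProvenance:
    model_name: str
    model_version: str
    source_scene_ids: list
    source_sha256: str  # hash of the exact input imagery bytes
    timesteps: int      # observations per composite (varies by product)


def provenance_for(imagery_bytes, model_name, model_version, scene_ids, timesteps):
    """Build a provenance record, hashing the source imagery for later audits."""
    digest = hashlib.sha256(imagery_bytes).hexdigest()
    return EmbeddingProvenance(model_name, model_version, scene_ids, digest, timesteps)


def sidecar_json(prov):
    """Serialize to a deterministic JSON sidecar (sorted keys for diffing)."""
    return json.dumps(asdict(prov), sort_keys=True)
```

With a record like this, "which imagery and which model produced these vectors?" becomes a lookup instead of an archaeology project.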

What you can do

If you're producing embeddings: use GeoParquet for patch embeddings, COG or Zarr for pixel embeddings. Include CRS metadata. Document your tile scheme. Create a tile index. Make it boring.
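"Create a tile index" can be as simple as one GeoJSON file mapping each tile file to its footprint, so consumers never have to reverse-engineer filenames. A sketch (the schema is illustrative; in practice GeoParquet via geopandas would be a natural fit):

```python
import json


def build_tile_index(tiles):
    """Build a GeoJSON FeatureCollection indexing embedding tiles.

    `tiles` maps filename -> (west, south, east, north, crs).
    One flat index file replaces reverse-engineering a naming scheme.
    """
    features = []
    for name, (w, s, e, n, crs) in tiles.items():
        features.append({
            "type": "Feature",
            "properties": {"href": name, "crs": crs},
            "geometry": {
                "type": "Polygon",
                # Closed ring around the tile footprint, counter-clockwise
                "coordinates": [[[w, s], [e, s], [e, n], [w, n], [w, s]]],
            },
        })
    return json.dumps({"type": "FeatureCollection", "features": features})
```

Anyone with the index can answer "which file covers this point?" without touching your tile naming convention, which is exactly the boring behaviour you want.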

If you're consuming embeddings: try the TorchGeo loaders. File issues when things break. The only way this gets better is if the pain is visible.

Read the paper

Fang, H., Stewart, A. J., Corley, I., Zhu, X. X., & Azizpour, H. (2026). Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access. arXiv:2601.13134.

View on arXiv

By: Isaac Corley, 2026