earth-labs research / proposal · v0.10 · march 2026

§ 00 abstract

jennifer-h2 — a joint-embedding foundation model for the Earth's crust.

joint embedding neural network interpolation for earth resources · h2

Beneath observable geophysical measurements lies a lower-dimensional manifold of latent geological processes. JENNIFER-H2 is a self-supervised, multi-modal foundation model that learns this manifold from petabyte-scale, real-world subsurface data — seismic volumes, borehole logs, gridded gravity and magnetics, heat flow, and free-text geological descriptions — and exposes it as a continuous embedding space.

Inspired by Joint Embedding Predictive Architectures (JEPA), the model is trained to predict the latent representation of masked subsurface regions rather than their raw measured values. This is expected to yield stable, physically meaningful descriptors that survive acquisition noise and survey artifacts, and to enable probabilistic zero-shot inversion of subsurface properties, a task that is otherwise largely intractable.

The first benchmark application is natural ("white") hydrogen, a zero-carbon primary energy source generated continuously by serpentinization and radiolysis. Hydrogen exploration is chosen as the first benchmark precisely because it presents the greatest density of unsolved subsurface problems: sparse observations, ephemeral signals, and no established exploration workflow. A model that performs well under these conditions is expected to transfer to better-constrained subsurface tasks: geothermal, critical minerals, carbon storage, conventional imaging.

↓ download proposal · pdf · 1.2 mb · 11 pages · 47 references

§ 01 objectives

two scientific objectives. framed around natural hydrogen and its subsurface expression.

· so1

zero-shot prediction with quantified uncertainty

Predict hydrogen system components — sources, reservoirs, seals, migration pathways — across diverse geological settings. Confidently abstain when extrapolating beyond the training manifold. Drilling decisions involve multi-million-euro investments; calibrated uncertainty is non-negotiable.
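One way such abstention could work is a distance check in embedding space. The sketch below is a minimal illustration, not the proposal's method: it assumes that distance to the nearest training embedding is a usable proxy for leaving the training manifold. The names `predict_or_abstain` and `max_distance`, the toy embeddings, and the stand-in model are all hypothetical.

```python
import math


def nearest_distance(query, train_embeddings):
    """Euclidean distance from `query` to its nearest training embedding."""
    return min(math.dist(query, e) for e in train_embeddings)


def predict_or_abstain(query, train_embeddings, predict_fn, max_distance):
    """Return (prediction, distance); prediction is None when abstaining."""
    distance = nearest_distance(query, train_embeddings)
    if distance > max_distance:
        return None, distance  # too far from anything seen in training
    return predict_fn(query), distance


# Toy 2-D embeddings and a constant stand-in model (hypothetical values).
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
toy_model = lambda embedding: 0.8

inside = predict_or_abstain((0.2, 0.1), train, toy_model, max_distance=0.5)
outside = predict_or_abstain((5.0, 5.0), train, toy_model, max_distance=0.5)
print(inside[0], outside[0])
```

The threshold would have to be calibrated on held-out provinces; a fixed value like the one above is only a placeholder.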

· so2

universal geophysical embeddings

A latent space general enough that simple downstream models — logistic regressors, shallow nets, random forests — recover state-of-the-art performance on hydrogen prospectivity, geothermal assessment, and critical mineral prediction with minimal task-specific fine-tuning.
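The "simple downstream model" idea can be illustrated with a linear probe on frozen embeddings. This is a hedged sketch using a hand-written logistic regressor so that it runs with the standard library alone; the embeddings, labels, learning rate, and epoch count are hypothetical, and a real evaluation would more likely use an established library.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_proba(w, b, x):
    """Probability of the positive class for one embedding."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic_probe(embeddings, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent; embeddings stay frozen."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = predict_proba(w, b, x) - y  # gradient of the log loss
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical frozen embeddings with binary prospectivity labels.
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
y = [0, 0, 1, 1]
w, b = fit_logistic_probe(X, y)
print(predict_proba(w, b, (0.9, 0.9)) > 0.5)
```

Only `w` and `b` are trained here, which is the point of SO2: the quality of the result rests on the embeddings rather than on the downstream model.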

§ 02 architecture

asymmetric encoder–predictor. jepa-style training in latent space, not pixel space.

The encoder f_θ processes only the visible patches of a multi-modal input — colocated borehole logs, seismic traces, gridded gravity and magnetic anomalies, text descriptions. A lightweight predictor g_φ then predicts the latent representation of the masked regions, conditioned on the visible context and a small latent z capturing residual uncertainty.

The encoder is deliberately much larger than the predictor: after pretraining, the encoder is what transfers to downstream tasks, while the predictor is discarded. This asymmetry encourages the encoder to learn rich, transferable descriptors rather than survey-specific shortcuts.

Predicting in latent space — rather than reconstructing raw measurements — is the critical design choice. It avoids overfitting to acquisition noise, polarity conventions, and vintage-specific artifacts that are abundant in real subsurface data and would dominate any pixel-level reconstruction loss.
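The latent-space objective can be sketched in a few lines. This is a toy illustration under stated assumptions, not the JENNIFER-H2 implementation: encoders are single linear layers, mean pooling stands in for attention, the conditioning latent z and positional information are omitted, and the separate target encoder follows the I-JEPA pattern rather than anything the proposal specifies. All function names are hypothetical.

```python
def linear(weights, x):
    """Apply a dense layer, given as a list of rows, to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def jepa_loss(patches, mask, encoder_w, target_w, predictor_w):
    """Mean squared error between predicted and target latents of masked patches."""
    visible = [p for p, m in zip(patches, mask) if not m]
    masked = [p for p, m in zip(patches, mask) if m]
    # f_theta sees only the visible patches; mean pooling stands in for attention.
    context = mean_vector([linear(encoder_w, p) for p in visible])
    # g_phi maps the pooled context to one predicted latent for the masked region.
    predicted = linear(predictor_w, context)
    # Target latents come from a separate encoder (assumed, as in I-JEPA).
    targets = [linear(target_w, p) for p in masked]
    # The error is measured between latents, never between raw samples.
    errors = [(a - b) ** 2 for t in targets for a, b in zip(predicted, t)]
    return sum(errors) / len(errors)


identity = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 2.0], [3.0, 2.0], [1.0, 2.0]]
mask = [False, True, False]  # hide the middle patch
loss = jepa_loss(patches, mask, identity, identity, identity)
print(loss)
```

With identity weights the latents equal the inputs, so the example only checks the bookkeeping; the benefit described above appears once the encoders are trained to discard nuisance variation.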

fig. 1 · jepa pipeline · adapted from assran et al. 2023

masking strategy

the tolerable masking ratio is modality-specific: natural images tolerate 75% masking, video 90%, sea-surface temperature 80%; training on the LILY borehole dataset becomes unstable above 20%. JENNIFER will run systematic masking ablations on hydrogen-fertile ultramafic provinces and adopt a curriculum that increases the masking ratio as the encoder stabilizes.
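a masking curriculum of this kind could look like the sketch below. it is an assumption-laden illustration: the ramp is tied to the training step rather than to a measured stability signal, the 10% starting ratio is hypothetical, and the 20% end point simply echoes the LILY figure above. `masking_ratio` and `sample_mask` are illustrative names.

```python
import random


def masking_ratio(step, total_steps, start=0.10, end=0.20):
    """Linearly ramp the masking ratio from `start` to `end` over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def sample_mask(num_patches, ratio, rng):
    """Return a boolean list in which True marks a masked patch."""
    num_masked = round(ratio * num_patches)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


rng = random.Random(0)  # seeded so the example is reproducible
early = sample_mask(100, masking_ratio(0, 1000), rng)
late = sample_mask(100, masking_ratio(1000, 1000), rng)
print(sum(early), sum(late))
```

a stability-driven variant would replace `step / total_steps` with a signal such as a plateau in validation loss; that choice is left open here.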

multi-modal fusion

modality-specific patch/token embeddings — separate encoders for seismic, borehole logs, gridded geophysical fields, and text-based lithology — followed by fusion blocks that operate in the shared latent space. early curriculum: predict masked logs from logs + local seismic. later: predict logs from global gravity and magnetics + sparse seismic.
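the embedding step that precedes fusion can be sketched as follows. this is a simplified illustration: each modality gets its own linear projection into a shared latent dimension, and the fusion blocks themselves are not shown. the `Token` type, the projection matrices, and the patch values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    modality: str
    vector: list  # embedding in the shared latent space


def project(weights, x):
    """Apply a modality-specific linear projection (list of rows) to x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def embed_patches(patches_by_modality, projections):
    """Embed every patch with its own modality's projection into one token list."""
    tokens = []
    for modality, patches in patches_by_modality.items():
        weights = projections[modality]
        tokens.extend(Token(modality, project(weights, p)) for p in patches)
    return tokens


projections = {
    "seismic": [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],  # 4 samples -> 2-D
    "logs": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],               # 3 curves -> 2-D
}
batch = {
    "seismic": [[0.2, 0.4, -0.1, 0.3]],
    "logs": [[2.3, 0.1, 0.2], [2.4, 0.2, 0.1]],
}
tokens = embed_patches(batch, projections)
print([(t.modality, len(t.vector)) for t in tokens])
```

because every token ends up with the same dimension, later blocks can attend across modalities without knowing the original sampling of each input.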

release plan

model weights on huggingface. globally-evaluated embeddings released as a continuous subsurface feature dataset, mirroring the AlphaEarth release pattern, so other groups can build hydrogen and subsurface applications on top of JENNIFER without retraining the foundation model.

§ 03 data

tens of petabytes, spanning four open subsurface databases plus globally continuous geophysical context fields.

A core challenge of WP1 is reducing the raw archive — tens of petabytes across SEG-Y, LAS, DLIS, miniSEED, and CSV variants — to a quality-controlled, harmonized training set comparable in size to modern multi-modal vision corpora (hundreds of terabytes). An AI ingestion agent will parse format variants, run physics-based plausibility checks, and flag anomalies for human review; it doubles as a natural-language interface to the constructed database.
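A physics-based plausibility check of the kind described could be as simple as a range test per log curve. The sketch below is illustrative only: the value ranges are rough assumptions chosen for the example, `VP` and `PHI` are hypothetical curve names, and `-999.25` is the customary LAS null value, which individual files may override.

```python
LAS_NULL = -999.25  # customary LAS null value; a file may declare another

# Illustrative plausibility ranges (assumptions, not taken from the proposal).
PLAUSIBLE_RANGES = {
    "RHOB": (1.0, 3.5),  # bulk density, g/cm3
    "VP": (1.4, 8.5),    # P-wave velocity, km/s
    "PHI": (0.0, 1.0),   # porosity, fraction
}


def check_samples(curve, samples, null_value=LAS_NULL):
    """Return (index, value) pairs that fall outside the plausible range."""
    low, high = PLAUSIBLE_RANGES[curve]
    flagged = []
    for index, value in enumerate(samples):
        if value == null_value:
            continue  # missing data is not an anomaly
        if not (low <= value <= high):
            flagged.append((index, value))
    return flagged


flags = check_samples("RHOB", [2.35, -999.25, 2.41, 23.9, 0.2])
print(flags)
```

Flagged samples would go to human review rather than being dropped automatically, in line with the workflow described above.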

· diskos

norwegian offshore directorate

12,601 wellbores · 5,817 seismic surveys

public repository · north sea coverage · seg-y + las

· iodp · lily

international ocean discovery program

petabyte-scale ocean borehole record

ocean drilling cores · log data · iodp lims

· namss

national archive of marine seismic surveys

continental-scale us coverage

usgs · seg-y volumes · all major basins

· epos

european plate observing system

european coverage · multidisciplinary

seismicity · geomagnetism · geodesy · in-situ

globally continuous context fields

gravity

grace / goce satellite anomalies

magnetics

emag2v3 global anomaly grid

topo · bathy

macferrin et al. 2025

crustal model

pasyanos et al. 2014

§ 04 benchmarks

two focused tasks. spatial cross-validation across geologically-distinct provinces.

· bt1

continental hydrogen prospectivity

given JENNIFER embeddings on a grid across europe and the united states, can a simple downstream model predict the composite chance-of-sufficiency (cos) score for hydrogen prospectivity? compared against (i) a raw-indicator baseline, (ii) zero-shot embeddings, (iii) embeddings + 1/5/10 labeled high-prospectivity locations per country.

auc · binary classification · spatial cv across countries
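the evaluation protocol for bt1 can be sketched as leave-one-country-out folds scored by auc. this is a minimal illustration with hypothetical labels and scores; in a real run a downstream model would be refit on each training fold, whereas here the scores are given directly to keep the example short.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx), holding out whole groups."""
    for group in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != group]
        test = [i for i, g in enumerate(groups) if g == group]
        yield group, train, test


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical grid cells: country code, binary label, model score.
countries = ["NO", "NO", "FR", "FR", "US", "US"]
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.7]

fold_auc = {}
for country, train_idx, test_idx in leave_one_group_out(countries):
    fold_auc[country] = roc_auc([labels[i] for i in test_idx],
                                [scores[i] for i in test_idx])
print(fold_auc)
```

holding out entire countries keeps spatially correlated neighbours from appearing on both sides of a split, which is what the spatial cross-validation is meant to prevent.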

· bt2

hydrogen seep location prediction

given embeddings on a 50 km grid around documented seeps, can a shallow classifier predict the seep cell from the spatial pattern of embeddings alone? trained on, e.g., balkan ophiolites; evaluated on held-out scandinavian cratonic seeps.

precision–recall auc · highly imbalanced positives
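with few positives, the precision–recall curve is often summarised by average precision, a step-wise estimate of its area. the sketch below assumes that estimator, which the proposal does not specify, and uses hypothetical labels and scores for ten grid cells with two seeps.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, running = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            running += hits / rank  # precision at this recall step
    return running / total_pos


# Ten hypothetical grid cells, two of which contain a documented seep.
labels = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.1, 0.3, 0.9, 0.2, 0.8, 0.05, 0.15, 0.6, 0.4, 0.25]
ap = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)  # score expected from a random ranking
print(round(ap, 3), prevalence)
```

comparing the result with the positive prevalence gives a more honest baseline than the 0.5 used for roc auc, because a random ranking scores roughly the prevalence.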

§ 05 compute

tacc horizon. 4,000 nvidia h100 gpus, infiniband (ib) interconnect.

primary allocation

dedicated allocation on TACC's Horizon supercomputer, sited adjacent to UTIG. 4,000 H100s + IB fabric. JENNIFER's training requirements would saturate any norwegian compute resource; running training at the U.S. partner site makes training at this scale feasible.

academic grant

NVIDIA Academic Grant — call for proposals: simulation and modeling — providing up to 30,000 H100 GPU-hours. complementary to the TACC allocation; covers ablations, fusion-strategy sweeps, and curriculum-learning experiments.
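for scale, the grant can be compared with the primary allocation by simple arithmetic. the sketch below uses only the two figures quoted above; the 8-gpu job size is a hypothetical example.

```python
GRANT_GPU_HOURS = 30_000  # upper bound quoted for the academic grant
HORIZON_GPUS = 4_000      # size of the primary allocation


def wall_clock_hours(gpu_hours, gpus):
    """Hours of wall-clock time a GPU-hour budget buys on `gpus` devices."""
    return gpu_hours / gpus


full_machine_hours = wall_clock_hours(GRANT_GPU_HOURS, HORIZON_GPUS)
eight_gpu_days = wall_clock_hours(GRANT_GPU_HOURS, 8) / 24  # hypothetical job size
print(full_machine_hours, eight_gpu_days)
```

the full grant equals 7.5 hours on all 4,000 gpus, or roughly 156 days on an 8-gpu job, which is why it suits ablations and sweeps rather than the main training run.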

storage · ingestion

multi-petabyte staged ingest from IODP LIMS, DISKOS, NAMSS, EPOS into the TACC corral filesystem. ZFS-checksummed cold storage; warm tier of curated, harmonized training shards for the encoder.

§ 06 project team

oslo · austin · utrecht. geophysics, hydrogen, critical minerals, ai/ml.

John M. Aiken, PhD

principal investigator

njord centre · university of oslo

ai + subsurface geophysics · prior pi of serprateai

Thorsten Becker, PhD

co-investigator

institute for geophysics · ut austin

geodynamics · plate tectonics · global modeling

Dunyu Liu, PhD

co-investigator

institute for geophysics · ut austin

computational seismology · earthquake physics

William Gilpin, PhD

co-investigator

department of physics · ut austin

machine learning · nonlinear dynamics

Charlie Beard, PhD

co-investigator

department of geosciences · utrecht university

natural hydrogen · serpentinization petrology

+ postdoc · strong computing background · to be hired

§ 07 deliverables

open by default. data, weights, embeddings, benchmark.

· wp1

multi-modal training database

constructed corpus from IODP, DISKOS, NAMSS, EPOS, plus globally continuous context fields. python ingestion libraries. an LLM-backed agent for natural-language queries against the database. accessible through NIRD.

· wp2

trained jennifer-h2 model

at least one full-scale multi-modal model. weights hosted on huggingface. ablation suite quantifying masking ratio, fusion strategy, and JEPA vs. MAE training. globally-evaluated embeddings released as an AlphaEarth-style feature dataset.

· wp3

hydrogen benchmark suite

peer-reviewed benchmark paper (target: jgr solid earth, neurips main track). open-source evaluation code. trained baseline models. withheld test sets enabling future subsurface foundation models to report performance on the same benchmarks.

· wp1–3

application case study

real-world hydrogen exploration case study in a high-priority under-explored region (hungary, denmark, or poland) demonstrating improved targeting via JENNIFER embeddings.

The complete proposal — including references, work-package timeline, and ethical-issues statement — is available as a single PDF.

↓ download · jennifer-h2.pdf