Act 1 — The problem
1. The construction sheet that won’t pretrain itself
A construction sheet looks nothing like ImageNet. Open one in a viewer and you see thin black lines on white space, a few standardized symbols (door swings, dimension witnesses, stair rises), the occasional block of dimension text, and a wide field of empty paper. There’s no figure-ground in the natural-image sense: no sky, no foreground, no plausible boundary between an object and its surroundings. The information per square inch is sparse and structural.
A general contractor’s archive easily runs to tens of thousands of these sheets. Floor plans, sections, details, schedules. Often unlabeled, often in PDFs. If you want a model that can read them (pull out a wall type, count plumbing fixtures, segment slab regions), the labeled set you can afford is small. A few thousand sheets, maybe ten thousand if you spend on it. The unlabeled corpus is in a different order of magnitude.
This is the practitioner’s setup most SSL papers don’t cover. MAE trained on ImageNet-1K. DINOv3’s largest run trained on LVD-1689M, a curated 1.7B-image scrape biased heavily toward natural photographs. C-RADIOv4 distills from SigLIP2, DINOv3, and SAM3 teachers. None of these papers ship a construction-document checkpoint, and none report a benchmark on construction-document segmentation. Some natural-image SSL backbones do ship domain-specialized variants. DINOv3 has SAT-493M, a satellite specialization with Earth-observation benchmark numbers. No comparable construction-document specialization or transfer benchmark is reported in the recipes this post covers.
The defensible framing is narrower than “SSL hasn’t seen this kind of data.” Web-scale pretraining sets very plausibly include some line drawings; the open question is what the tested-and-reported transfer surface actually covers, and that surface excludes construction documents specifically. As the practitioner with this corpus, you’re being asked to pick a recipe (fine-tune a natural-image checkpoint zero-shot? continue self-supervised pretraining on top of one? train SSL from scratch on your own corpus?) without a published comparison that tells you which to bet on for this domain.
This post walks through seven recipe families and grades each against that question. Each section asks: what does the recipe extract from a sheet like this; where does the signal break down; what do the published numbers say about its dense-feature quality. The closing section is a routing-questions tree. For your data shape (corpus scale, distance from natural images, downstream task density), it points at what the published evidence supports as a starting point, and it flags where you’d be running an experiment instead of following a recipe.
The motivating use case throughout is construction documents. The tree generalizes to other domains the canonical SSL backbones haven’t reported transfer to: scanned reports, low-color line-art, niche scientific imagery. If your corpus is satellite imagery or 3D medical volumes, you have published evidence to lean on; the tree’s terminals point at it.
2. Dense-feature quality is the load-bearing axis
A ViT trained on ImageNet outputs a CLS token plus a grid of patch embeddings, one per non-overlapping patch. For classification, you train a head on the CLS token and that’s the model. For segmentation, you train a head on the per-patch embeddings, one prediction per patch (or per upsampled position). The two downstream tasks read different parts of the encoder’s output, and they reward different things.
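To make the split concrete, here is a minimal sketch (plain PyTorch, not any particular library's API) of how the two downstream heads read different parts of the encoder's output. The shapes and the `cls_head` / `seg_head` names are illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Minimal sketch: a ViT returning a CLS token plus a grid of patch embeddings,
# and the two kinds of heads that read them. Shapes are illustrative.
B, H, W, D, C = 2, 14, 14, 768, 10           # batch, patch grid, embed dim, classes

cls_token    = torch.randn(B, D)              # stand-in for the encoder's CLS output
patch_tokens = torch.randn(B, H * W, D)       # one embedding per 16x16 patch

# Classification reads only the CLS token.
cls_head = nn.Linear(D, C)
class_logits = cls_head(cls_token)            # (B, C)

# Dense prediction reads the patch grid: one prediction per patch position,
# then upsample to pixel resolution.
seg_head = nn.Linear(D, C)
patch_logits = seg_head(patch_tokens)                        # (B, H*W, C)
patch_logits = patch_logits.transpose(1, 2).reshape(B, C, H, W)
pixel_logits = nn.functional.interpolate(
    patch_logits, scale_factor=16, mode="bilinear", align_corners=False
)                                                             # (B, C, 224, 224)
```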
This sounds obvious until you look at what SSL recipes actually optimize. MAE optimizes pixel reconstruction loss on masked patches. DINO optimizes a similarity between teacher and student class probabilities under different augmentations. These objectives shape both the CLS token and the per-patch features, but not symmetrically. A recipe can produce a CLS token that classifies ImageNet at 85% while leaving per-patch features that don’t separate adjacent objects cleanly, or the reverse.
DINO’s original paper noticed this empirically: self-supervised ViT features carry “explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.” That observation is the seed for treating the DINO line as the dense-feature SSL family. It’s also why the same ImageNet linear-probe percentage means different things across recipes once you use them for segmentation downstream.
Two later observations sharpen the picture. MIM-Refiner showed that MAE’s encoder organizes itself in three regimes: early blocks learn general features, middle blocks form abstractions (k-NN accuracy peaks here), late blocks specialize for the reconstruction task (k-NN accuracy drops). If you plug the last block of a frozen MAE into a segmentation head, you’re reading from the wrong place: the dense-feature signal is in the middle. Off-the-shelf last-layer features under-deliver on dense tasks not because MAE is bad at SSL, but because the loss specializes the late blocks to a target the segmentation head doesn’t care about.
DINOv3 documented the dual problem on the self-distillation side: long-trained DINO/iBOT shows “cosine similarity between the CLS token and the patch outputs gradually increases during training… the locality of the patch features diminishes.” The CLS token leaks into the patch features. Same encoder, later in training, but the patches stop being patch-specific and start carrying CLS-like information. Different failure mode from MAE, same consequence: dense-feature quality is fragile.
So when this post asks which SSL recipe gives the best fine-tuning starting point for construction-document segmentation, the question can’t be reduced to “best ImageNet number wins.” The recipe shapes how the encoder allocates representational capacity between CLS-shaped and patch-shaped features, and the segmentation head reads only from the patch-shaped half. Recipes that already produce strong dense features (DINOv3 with Gram anchoring; iBOT with its online tokenizer) have the lead before fine-tuning starts. Recipes with weak dense features can sometimes be salvaged with layer selection and head design, but not for free.
Three factors route your starting-point recipe choice in act 3. Corpus scale: some recipes only work at natural-image-pretraining scale (1B+ images), others succeed on 39k volumes; your data shape gates which is realistic. Domain distance from natural images: DINOv3 Web’s frozen features are state-of-the-art on satellite, but they have no measured number on construction-document line-art; distance gates how much you can lean on a natural-image checkpoint. Downstream task density: classification can ride a strong CLS token alone; dense prediction needs the patch features to be good and in the right encoder layer. Each act-2 recipe section walks one family across these three factors. Act 3 turns them into a routing tree.
3. The question we can’t answer head-on
There is no canonical bake-off for construction-document SSL. Adjacent domains have published comparisons; construction documents do not. Before the post walks through recipe families, you need to see exactly what’s measured and what isn’t.
Satellite imagery has the most evidence. DINOv3 §8.3 reports that DINOv3 Web ViT-7B (frozen, no satellite-specific fine-tune) sets state-of-the-art on LoveDA (56.2 mIoU) and DIOR (80.5 mAP), beating both DINOv3’s own SAT-493M satellite specialization and prior satellite-specialized models on those benchmarks. iSAID is a hedge: DINOv3 Web 71.4 < SkySense V2 71.9. The headline is that natural-image self-distillation at scale transfers cleanly to satellite without a satellite-specific recipe.
Lahrichi 2025 ran the head-to-head you’d actually want to see. The authors pretrained MAE and SwAV on GeoNet (a satellite corpus) versus ImageNet, then evaluated across six downstream remote-sensing tasks (four segmentation: SEN12MS, DeepGlobe, Field Delineation, LandCoverNet; plus two classification: BigEarthNet multi-label and EuroSAT multi-class). The conclusion: “no consistent advantage to pre-training with GeoNet as compared to ImageNet, regardless of whether SwAV or MAE was used.” Two-stage MAE-IN→GN beats from-scratch MAE-GN on five of six tasks, with a modest 1–2% advantage. On satellite, ImageNet pretraining is at minimum competitive with domain pretraining; sometimes better.
GLARE 2026 measures continual pretraining specifically. Starting from a UDI-initialized natural-image-SSL checkpoint, GLARE trains an adapter via SSL on top and reports modest +0.2 to +0.6 mIoU gains over UDI alone across natural and satellite segmentation benchmarks (ADE20K 41.2→41.6, Pascal Context 49.1→49.3, Cityscapes 74.7→75.3, LoveDA 50.9→51.5). So the “continue pretraining on top of a natural-image checkpoint” recipe is documented to help, modestly, on the domains GLARE evaluates.
3D medical imaging has its own story. Medical 3D MAE ran MAE on 39,168 unlabeled MRI volumes inside an nnU-Net architecture and reports approximately +3 DSC over a state-of-the-art dynamically-optimized nnU-Net baseline. The medical paper does not compare against continual SSL on a natural-image checkpoint, so the head-to-head against the natural-image-init recipe is absent. What you have is “from-scratch MAE on 39k medical volumes beats no pretraining.”
Documents have DiT. DiT runs BEiT-style MIM on IIT-CDIP (42M document images) and reports gains on PubLayNet layout detection (91.0→94.9 mAP), ICDAR cTDaR table detection (94.23→96.55 F1), and RVL-CDIP document classification (91.11→92.69%). The downstream tasks are layout, table, classification. Not vision-only segmentation on line-art. DiT’s “documents” are scanned reports and forms, text-heavy and layout-driven. Construction documents share the scanned, mostly-empty-paper surface texture but the dense task is different.
Construction-document line-art is not in any of these comparisons. None of the canonical SSL papers (MAE, BEiT family, MaskFeat, data2vec, MIM-Refiner, DINO line, JEPA family, AIM/AIMv2, AM-RADIO/RADIOv2.5/C-RADIOv4) report transfer to construction documents. None of the domain-adaptive papers (DiT, Medical 3D MAE, the satellite line) measure construction-document line-art either. The closest published surface texture is DiT’s document corpus, but its dense task is layout / table detection on text-heavy documents, not segmentation on line-art floor plans.
This is the open question the post explores, and it’s the reason act 3’s deliverable is a routing-questions tree rather than a measured ranking. For domain shapes with published evidence (satellite, medical, document-layout), the tree’s terminals point at specific recipes the literature backs. For construction-document line-art, the tree’s terminal is “start from DINOv3 frozen probe; whether continual pretraining on top helps is an experiment you’d run, not a recipe with published numbers.” A practitioner working on construction documents is at the frontier of what published SSL has measured.
Act 2 walks each recipe family across the construction sheet. For each family it asks the same three questions: what signal does the recipe extract from a sheet like this; where does that signal break down on whitespace-heavy line-art; what do the published dense-prediction numbers say. The construction sheet stays in view through every recipe section. The closing routing-questions tree in act 3 routes you to the leaf that fits your data shape.
Act 2 — The recipes
4. Pixel-reconstruction MIM (MAE)
Take the construction sheet. Cut it into a grid of 16×16 patches. Mask 75% of them at random; the remaining 25% are the visible set. MAE feeds only the visible 25% into a ViT encoder, then injects mask tokens at the held-out positions and runs the result through a small decoder. The decoder reconstructs the original pixel values at the masked positions. The training loss is mean squared error between predicted and ground-truth pixels, computed only on masked patches.
Two design choices are doing most of the work. First, the encoder sees only the visible patches: for a 75% mask ratio, that’s 4× fewer tokens, which is what makes MAE pretraining cheap enough to be practical at scale. Second, the architecture is asymmetric: the decoder is much smaller than the encoder. Since the decoder’s only job is pixel reconstruction (not a downstream task), it gets discarded after pretraining. Only the encoder transfers downstream.
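For readers who want the recipe in code shape, here is a compressed sketch of one MAE training step. It assumes patches are already flattened to pixel vectors and treats `encoder` and `decoder` as stand-in token-sequence modules; the reference implementation additionally uses a learnable mask token and decoder-side positional embeddings that this sketch omits.

```python
import torch

def mae_step(encoder, decoder, patches, mask_ratio=0.75):
    """One MAE training step, simplified. `patches` is (B, N, P): each of the
    N patches already flattened to P pixel values. `encoder` maps a token
    sequence to (B, n, D); `decoder` is assumed to end in a linear projection
    back to P pixel values. Names and shapes are illustrative stand-ins."""
    B, N, P = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    # Random per-sample shuffle; keep the first n_keep patches as the visible set.
    noise = torch.rand(B, N)
    shuffle = noise.argsort(dim=1)                 # random permutation per sample
    restore = shuffle.argsort(dim=1)               # inverse permutation
    keep_idx = shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, P))

    # The encoder only ever sees the visible 25% of tokens.
    latent = encoder(visible)                      # (B, n_keep, D)
    D = latent.shape[-1]

    # Insert mask tokens at the held-out positions (zeros stand in for MAE's
    # learnable mask token), restore the original order, and decode pixels.
    mask_tokens = torch.zeros(B, N - n_keep, D)
    full = torch.cat([latent, mask_tokens], dim=1)
    full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, D))
    pred = decoder(full)                           # (B, N, P) pixel predictions

    # Loss is MSE on the masked positions only.
    mask = torch.ones(B, N)
    mask[:, :n_keep] = 0                           # 0 = visible in shuffled order
    mask = torch.gather(mask, 1, restore)          # 1 = masked, in original order
    loss = (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()
    return loss
```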
Trained on ImageNet-1K, MAE produces an encoder that fine-tunes to ADE20K mIoU 48.1 (ViT-B) and 53.6 (ViT-L) under a UperNet head. These are decent dense numbers for a recipe whose loss never touches segmentation; reconstruction is enough of a pretext to land usable patch features.
Now feed it the construction sheet. The 75% mask falls across a sheet where most patches carry little visual signal beyond white paper. Many of the visible patches are blank; many of the masked patches the decoder is asked to reconstruct are also mostly blank, with only a small subset containing a dimension witness, a wall hatch, or the corner of a door swing. The reconstruction loss is plausibly dominated by trivially recovering whitespace, with the patches that contain real structure contributing a small fraction of the total error.
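Before committing to a pretraining run, this concern can at least be sized on your own corpus. A minimal diagnostic sketch follows, with arbitrary (not published) thresholds: what fraction of patches are near-blank paper, and how much of the per-patch variance, a rough proxy for reconstruction signal, sits in the inked ones.

```python
import torch

def blank_patch_fraction(sheet, patch=16, blank_thresh=0.95):
    """For a grayscale sheet (H, W) in [0, 1]: the fraction of patches that are
    near-blank paper, and the share of per-patch pixel variance (a rough proxy
    for reconstruction signal) held by the non-blank ones. The 0.95 threshold
    is an arbitrary assumption, not a published value."""
    H, W = sheet.shape
    cropped = sheet[: H - H % patch, : W - W % patch]
    patches = cropped.unfold(0, patch, patch).unfold(1, patch, patch)   # (h, w, p, p)
    patches = patches.reshape(-1, patch * patch)

    mean_brightness = patches.mean(dim=1)
    variance = patches.var(dim=1)

    blank = mean_brightness > blank_thresh
    frac_blank = blank.float().mean().item()
    signal_in_inked = variance[~blank].sum().item() / max(variance.sum().item(), 1e-8)
    return frac_blank, signal_in_inked

# Example: a mostly-white synthetic sheet with a single dark stroke.
sheet = torch.ones(1024, 1024)
sheet[200:205, 100:900] = 0.0       # a horizontal wall line
frac_blank, signal = blank_patch_fraction(sheet)
print(f"{frac_blank:.1%} of patches are near-blank; "
      f"{signal:.1%} of per-patch variance sits in the inked ones")
```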
Whether this signal is enough to drive useful representation learning on construction documents specifically is a question no published paper answers. MAE itself is honest about a related concern: “Images are merely recorded light without a semantic decomposition into the visual analogue of words. Instead of attempting to remove objects, we remove random patches that most likely do not form a semantic segment.” On natural images that’s a complaint about the proxy task; on construction-sheet line-art it’s a complaint with an extra layer, because the proxy task is reconstruction, and the things you’d actually want reconstructed (symbols, lines, dimension text) are a small minority of the input. The post raises this as a plausible low-information-signal concern. It does not measure the failure. The practitioner who runs MAE on a construction-document corpus and compares against the alternatives in §11 is the one who would.
Adjacent line-art SSL work supports the mechanism without measuring construction-document transfer. Sketch and handwriting SSL (Vectorization and Rasterization, 2021) explicitly calls out that the standard photo augmentation menu (color jitter, Gaussian blur) is poorly matched to sparse-stroke inputs: jitter on monochrome is a no-op, blur erases the thin lines that carry the signal. Drawing-domain detection (Domain-Adaptive SSL for face/body in drawings, 2022) lands on a teacher-student continued-pretraining shape with adapted augmentations, the same recipe shape this post discusses for natural-image SSL transfer. Construction sheets share the sparse-stroke failure mode that motivates these papers, but add canonical orientation (titleblocks, north arrows, dimension reading direction), standardized symbols whose semantics are sensitive to small perturbations, grid and dimension-text overlays, OCR-able callout text, and aspect ratios that exceed anything the sketch literature trains at. Treat these papers as background for augmentation design, not as evidence of measured transfer to construction-document line-art.
The recipe choices act 2 explores from here can be read as different answers to “is pixel reconstruction the right pretext, and if not, what should we do instead?” §5 asks a complementary question one level up, on natural images: even when MAE does learn useful representations, where do they live in the encoder?
5. The MIM block regime
The construction sheet helped frame §4’s open question, but the next question is upstream of construction-document specifics. Even on natural images, where do MAE’s useful features actually live inside the encoder?
MIM-Refiner put k-NN accuracy and reconstruction loss on the same axis (block index) and showed three distinct regimes:
“1. In early ViT blocks, general purpose features are learned, which improve the reconstruction loss and the k-NN accuracy simultaneously. 2. In middle ViT blocks, abstractions are formed. The reconstruction loss improves only slightly, while the k-NN accuracy improves drastically. 3. In late ViT blocks, features are prepared for the reconstruction task. The reconstruction loss improves at a faster rate, while the k-NN accuracy decreases.”
Read the curves left to right. Early blocks are doing what every encoder does: learning low-level structure. Middle blocks are doing what segmentation downstream actually wants: forming patch-level abstractions that separate objects. Late blocks are doing what the loss function asks for: collapsing the patch features back into something the small decoder can reconstruct pixels from. The reconstruction objective rewards the model for forgetting high-level structure in the late layers, because a reconstruction-friendly representation sits closer to pixel space than to the abstractions a segmentation head wants.
The MIM-Refiner authors put it more bluntly: “as models increase in size, the decoder eventually reaches a point where it cannot further improve the pre-training objective on its own. Consequently, it begins to delegate a portion of the reconstruction task back to the last encoder blocks. This transfer adversely affects the feature quality for downstream tasks associated with those blocks.” The bigger the model, the more aggressively the loss specializes the late blocks against your downstream task.
This explains a lot of MAE’s reputation. “MAE underperforms DINO on dense tasks” reads as a recipe-level verdict but is partly a layer-selection artifact: if you take the last block of a frozen MAE and run a linear probe, you’re reading from the regime that’s been optimized against you. If you take a middle block, or run UperNet which sees multiple block depths, the verdict is much closer to even. MAE’s ADE20K 53.6 at ViT-L (UperNet, fine-tuned) is one number; MAE’s k-NN accuracy at the last block is another; they’re not telling you the same thing.
For the construction-document practitioner this matters in a specific way. The block regime is reported on natural-image pretraining; the construction-document version of the same analysis would be the published figure this post can’t include. But the regime is a generic property of what reconstruction objectives do to encoder layers, not a property specific to the natural-image domain. If MAE pretrains usefully on a construction-document corpus at all, the same three regimes are very likely to emerge inside the encoder, and the practitioner’s segmentation head is going to want to read from the middle blocks (or use a multi-scale head like UperNet that does it for them).
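In practice, reading from the middle blocks is a few lines of code. The sketch below assumes timm-style attribute names (`patch_embed`, `blocks`, `cls_token`, `pos_embed`); adapt them to whatever checkpoint you're probing.

```python
import torch

@torch.no_grad()
def collect_block_features(vit, images, block_ids=(4, 6, 8, 10)):
    """Collect per-patch features from several encoder depths of a frozen
    ViT-style model. Assumes timm-style attributes; a multi-scale head like
    UperNet does essentially this internally by tapping several depths."""
    x = vit.patch_embed(images)                               # (B, N, D)
    cls = vit.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat([cls, x], dim=1) + vit.pos_embed

    feats = {}
    for i, blk in enumerate(vit.blocks):
        x = blk(x)
        if i in block_ids:
            feats[i] = x[:, 1:, :]                            # drop CLS, keep patches
    return feats                                              # block index -> (B, N, D)

# For a frozen-MAE probe, a middle block (regime 2) is usually a better read
# than the last block (regime 3); sweep block_ids and probe each depth.
```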
§5 is also the point where this post stops treating MAE as a single recipe and starts treating it as a family with knobs. The next two sections turn the most consequential knobs: §6 changes the prediction target (HOG features, EMA latents, discrete tokens) while keeping the masking shape, and §7 swaps the entire framing from masked prediction to multi-crop self-distillation.
6. Changing the MIM target: features and tokens
§5 showed that MAE’s late blocks are specialized to pixel reconstruction, and that specialization actively hurts dense-task transfer. The natural next question is to keep the masking shape (mask 60–75% of patches at random; predict at masked positions) but change what the encoder is asked to predict. If the prediction target is something other than pixels, do the late blocks still specialize against the downstream task?
Three target swaps have been published, each with a different bet.
MaskFeat predicts hand-designed HOG (Histogram of Oriented Gradients) features at masked positions. Per the paper, HOG “works particularly well in terms of both performance and efficiency” relative to the other targets MaskFeat tested (raw pixels, deep features, sparse codes). HOG is a fixed, parameter-free target: orientation-aware enough to capture local structure, but stripped of color and lighting noise. The intuition is that asking the model to predict edge gradients instead of raw pixels keeps the high-frequency structural signal while skipping the appearance noise the late blocks would otherwise model.
data2vec predicts the EMA-teacher’s own latent features at masked positions, averaged across the top K teacher FFN blocks. The target is dynamic: it’s whatever representation the teacher (a slow-moving copy of the student) currently produces. The bet is that a learned, semantic latent target keeps the model from wasting capacity on pixel detail that doesn’t matter downstream. data2vec 2.0 is the same idea, faster.
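A minimal sketch of the data2vec-style target construction, simplified from the paper: the EMA teacher runs the unmasked sequence, the top-K block outputs are normalized and averaged, and the student is penalized only at masked positions. The function names and the specific normalization (plain layer norm here) are stand-ins, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def ema_latent_targets(teacher_blocks, embed, top_k=8):
    """data2vec-style targets: run the EMA teacher on the *unmasked* token
    sequence and average its top-K block outputs. `teacher_blocks` is a list of
    transformer blocks (an EMA copy of the student's); `embed` is the already-
    embedded unmasked sequence (B, N, D)."""
    x, collected = embed, []
    for blk in teacher_blocks:
        x = blk(x)
        collected.append(F.layer_norm(x, x.shape[-1:]))   # per-block normalization
    target = torch.stack(collected[-top_k:], dim=0).mean(dim=0)
    return target.detach()

def masked_feature_loss(student_out, target, mask):
    """MSE between student predictions and teacher targets at masked positions.
    `mask` is (B, N) with 1 at masked token positions."""
    diff = (student_out - target) ** 2
    return (diff.mean(dim=-1) * mask).sum() / mask.sum().clamp(min=1)
```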
BEiT v1 predicts discrete visual tokens from a pretrained DALL-E dVAE (vocab size 8192). The model masks 40% of patches and minimizes cross-entropy against the dVAE’s tokenization of the original image. BEiT v2 replaces the dVAE with a VQ-KD tokenizer trained by knowledge-distilling CLIP features: “the output vectors aim at reconstructing the semantic features of a teacher model, e.g., DINO, and CLIP.”
The dense-prediction numbers tell a partial story. At ViT-L on IN1K with a UperNet head (head explicitly matched), MAE lands at 53.6 mIoU on ADE20K, D2V2-Refined (MIM-Refiner applied to data2vec 2.0) at 54.4, and BEiT v2 at 56.7. BEiT v1 reports 53.3 at ViT-L, but under a SETR-PUP head rather than UperNet, so it sits adjacent to that bin rather than head-to-head with it. Within the MIM family at matched UperNet protocol, BEiT v2 is the leader by 3.1 points over MAE and 2.3 points over D2V2-Refined. MaskFeat and data2vec do not report image dense-prediction transfer at all in their published evaluation sets. On images, both report only ImageNet classification; MaskFeat additionally reports video transfer (Kinetics, AVA, SSv2).
The BEiT v2 lead carries a caveat. The VQ-KD tokenizer is itself distilled from CLIP, which means BEiT v2’s pretraining inherits CLIP’s web-pretrained semantic structure through the tokenizer. That doesn’t make BEiT v2 “not really SSL,” but it does mean the empirical pairing (CLIP-distilled tokenizer plus 56.7 mIoU) doesn’t cleanly separate “BEiT v2’s recipe is the best MIM target” from “BEiT v2’s tokenizer carried in CLIP’s representational quality.” The post’s claim-source matrix backs the pairing as a fact; it does not back a causal “v2 only works because of CLIP” reading.
For the construction-document practitioner this section narrows the MIM target choice. Pixel reconstruction (MAE) is the most-tested baseline and the cheapest to run from scratch on a custom corpus, but its late-block degradation is the worst. Feature targets (MaskFeat, data2vec) skip pixel detail but don’t have published dense-prediction numbers, so transfer to construction-document segmentation is even more an open question than for MAE. Token targets (BEiT v1/v2) have the highest published dense numbers, but the highest entry (BEiT v2) requires running a CLIP-distilled VQ-KD tokenizer, which itself is a natural-image artifact that may or may not work as a target distribution for line-art. The practitioner who picks BEiT v2 as a starting recipe should note that the tokenizer is the load-bearing piece, not the masked-prediction loss.
§7 leaves the MIM framing entirely. Self-distillation skips the “what should we predict at masked positions?” question by not having explicit prediction targets at all.
7. Self-distillation: DINO, iBOT, DINOv2
§6 showed that MIM’s dense-feature signal varies with the prediction target, and that BEiT v2’s lead carries a CLIP-distillation caveat. Self-distillation skips the target-choice question entirely. There is no reconstruction target, no tokenizer, no pretext task involving “predict X at masked positions.” The signal comes from somewhere else.
Take the construction sheet. Apply two augmentations to it (two random crops, color jitter, blur). Run both through a student ViT. Run one of them through a teacher ViT, which is just an exponentially-moving average of the student’s own weights. Train the student so that its output distribution on the augmented view matches the teacher’s output distribution on the other view, under a sharp temperature on the teacher and a softer one on the student. That’s DINO. The supervision signal is “two augmented views of the same image should produce similar embeddings”; the model isn’t told what those embeddings should look like, only that they should be invariant to the augmentations.
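The core of the DINO objective fits in a few lines. This sketch keeps one student/teacher view pair plus the teacher's centering and temperature sharpening; the published recipe adds multi-crop, schedules on the momentum and temperatures, and a projection head this sketch assumes already exists.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              t_student=0.1, t_teacher=0.04):
    """Core DINO objective, simplified to one student/teacher view pair.
    `*_logits` are (B, K) projection-head outputs; `center` is the running mean
    used to center the teacher (updated elsewhere via EMA)."""
    # Teacher: center then sharpen with a low temperature; no gradient.
    t = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    # Student: softer temperature; cross-entropy against the teacher distribution.
    log_s = F.log_softmax(student_logits / t_student, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """The teacher is an exponential moving average of the student's weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1 - momentum)
```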
The first surprising result from this recipe: when you visualize the student’s self-attention at the last block, the attention maps light up segmentation-shaped regions that nobody explicitly trained the model to find. The DINO paper put it directly: “self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets.” That observation is the seed for treating the DINO line as the dense-feature SSL family. Invariance-to-augmentation, applied at scale to ViTs, produces patch-level structure as a side effect.
iBOT extends DINO with masked-patch self-distillation. The student sees a masked image; the teacher (still an EMA of the student) sees the unmasked version. The student’s masked-patch features have to match the teacher’s corresponding patch features. iBOT keeps DINO’s class-token loss but adds the patch-token loss; it also dispenses with a separate tokenizer because the teacher itself plays that role online. Per the paper: “the online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand.” iBOT lands ADE20K mIoU 50.0 at ViT-B/16 under UperNet, beating MAE’s 48.1 at the same scale.
DINOv2 is DINO plus iBOT plus two pieces of glue. The first is a KoLeo regularizer that uses the Kozachenko-Leonenko entropy estimator to push the feature distribution toward uniform spread; the second is Sinkhorn-Knopp centering that replaces DINO/iBOT’s softmax centering with a doubly-stochastic batch normalization. Heads for the DINO and iBOT objectives are kept untied. The loss is unchanged in spirit; the regularization makes it more stable at scale.
The other thing DINOv2 changes is the data. The recipe is trained on LVD-142M, a 142M-image curated set built by retrieving, from a 1.2B-image uncurated pool, the nearest neighbors (via an inverted-file index) of a small curated seed corpus. A ViT-g (1.1B params) is trained on LVD-142M and distilled into smaller students. ADE20K mIoU under a frozen-backbone linear probe lands at 49.0 single-scale and 53.0 multi-scale at ViT-g, comparable to MAE’s full fine-tune at ViT-L but starting from frozen features rather than fine-tuning.
For the construction-document practitioner, self-distillation has two specific properties. There’s no reconstruction target to pick, so the section’s worry about pixel reconstruction on whitespace-heavy data (§4) doesn’t apply. The supervision signal is “two augmentations of this construction sheet should embed similarly,” which is well-defined regardless of how much of the sheet is empty paper. There’s also no pretrained tokenizer involved (DINO and iBOT use the model itself as the tokenizer), so the “the tokenizer is doing the work” caveat from §6 doesn’t apply either. The trade-off is that self-distillation needs scale to produce strong dense features. DINO at small ViT scales is fine; DINOv2 at ViT-g on 142M curated images is what produces the 53.0 multi-scale frozen-linear ADE20K number that the rest of this post compares against.
§8 turns to what happens when you push self-distillation further than DINOv2. Two failure modes show up at scale, and the post’s most recent recipe (DINOv3) is engineered specifically against them.
8. Dense-feature collapse and how the latest recipes fix it
§7 ended on DINOv2 as the recipe that produces the strongest frozen-backbone dense features in the post’s coverage so far. What §7 didn’t say is that pushing self-distillation further than DINOv2 reveals two failure modes the recipe’s authors didn’t anticipate at first. Both are observed on natural-image pretraining; the construction-document version of either failure would be the figure this post can’t include. But the failure modes themselves are generic enough that they’re worth understanding before deciding whether to use a self-distillation recipe at all.
The first failure mode is artifact tokens. Darcet et al. 2024 noticed that large self-distilled ViTs (DINOv2 ViT-L and ViT-g, but also OpenCLIP and DeiT-III) produce a small fraction of output tokens with much higher norm than the rest. From the paper: “tokens with roughly 10x higher norm at the output and correspond to a small fraction of the total sequence (around 2%).” The artifacts emerge “after one third of training” and only on the three largest model sizes; smaller DINO ViTs don’t show them. Spatially, the artifact tokens land in low-information background regions of the image: patches the model evidently doesn’t need for the task at hand and has repurposed for internal computation.
The fix is small. Add four extra tokens to the input sequence that the model can use as scratchpad space (“registers”), and discard them at the output. The artifact-token rate drops; ADE20K linear probe goes from 46.6 to 47.9 on DINOv2 with registers; object discovery (LOST corloc on VOC2007) jumps from 35.3 to 55.4. The FLOP overhead is under 2%. Registers don’t change the loss or the data; they change the architecture by giving the model an output-side workspace it doesn’t have to take from the patches.
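The fix is small enough to show in full. Below is a sketch of a ViT wrapper with register tokens; `backbone_blocks` stands in for the transformer stack, and the initialization follows common ViT practice rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Sketch of the registers fix: append a few learnable scratchpad tokens to
    the token sequence and drop them at the output. Nothing about the loss or
    the data changes; only the architecture gains an output-side workspace."""
    def __init__(self, backbone_blocks, dim, num_registers=4):
        super().__init__()
        self.blocks = backbone_blocks
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, tokens):                      # tokens: (B, 1 + N, D) = CLS + patches
        B = tokens.shape[0]
        reg = self.registers.expand(B, -1, -1)
        x = torch.cat([tokens, reg], dim=1)         # append registers to the sequence
        for blk in self.blocks:
            x = blk(x)
        return x[:, : tokens.shape[1], :]           # discard the register outputs
```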
The second failure mode is patch-locality decay. DINOv3 documented it directly: “cosine similarity between the CLS token and the patch outputs gradually increases during training… the locality of the patch features diminishes.” Long-trained DINO/iBOT shows the patch features drifting toward the CLS token’s geometry; the patches stop being patch-specific and start carrying CLS-like information. This is a different failure from artifact tokens (which only afflict ~2% of tokens) and more pernicious: it’s a uniform drift in the patch-feature distribution.
DINOv3’s fix is a regularizer called Gram anchoring. The Gram matrix of a set of feature vectors is the matrix of all pairwise dot products: it captures the geometry of the feature distribution without caring about absolute values. Gram anchoring computes a target Gram matrix from an early-training snapshot of the model (the “Gram teacher”) and adds a Frobenius-norm loss against the live student’s Gram matrix:
ℒ_Gram = ‖X_S · X_S^⊤ − X_G · X_G^⊤‖_F²
The patch-patch similarities are pinned to the early snapshot; individual patch features can drift, but only as long as the relative similarity structure stays close to the teacher’s. Trained with this loss in the refinement phase, DINOv3 hits ADE20K mIoU 63.0 on a frozen ViT-7B backbone, the highest dense number in the post’s matrix.
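A sketch of that loss as code, under the assumption that patch features are L2-normalized before the Gram matrices are formed; the paper's exact normalization and the schedule for refreshing the Gram teacher are details this sketch skips.

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches, gram_teacher_patches):
    """Frobenius loss between the patch-patch Gram matrices of the live student
    and an early-training 'Gram teacher' snapshot, per the formula above.
    Both inputs are (B, N, D) patch features."""
    xs = F.normalize(student_patches, dim=-1)            # assumption: cosine-style Gram
    xg = F.normalize(gram_teacher_patches, dim=-1).detach()
    gram_s = xs @ xs.transpose(1, 2)                     # (B, N, N) pairwise similarities
    gram_g = xg @ xg.transpose(1, 2)
    return ((gram_s - gram_g) ** 2).sum(dim=(1, 2)).mean()
```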
Both failure modes are described and fixed on natural images. Whether long DINOv3-style training on a construction-document corpus would produce the same failures is plausible (the artifact-token paper specifically blames “low-information background regions” for the artifact emergence, and a construction sheet has more low-information regions than most natural images) but unmeasured. The post raises the failure modes as a generic property of long-trained self-distillation, not as a measured construction-sheet result. The construction-document version of either analysis would require running the model on a construction-document corpus and looking at the patch-feature norms and Gram structure across training. Nobody has published that.
For the construction-document practitioner deciding whether to use a self-distillation recipe at all, §8’s takeaway is that DINOv3 is the most engineered-against-collapse self-distillation recipe currently published. The recipe is mature: registers, Gram anchoring, the LVD-1689M data curation pipeline, and a 7B-parameter teacher distilled into smaller students all serve the dense-feature signal. Whether that maturity carries to construction documents is the open question the post can’t answer.
§9 swaps frames again. JEPA-family recipes don’t predict pixels, don’t predict tokens, and don’t match teacher distributions; they predict the model’s own latent features at masked positions in a different way. The bet is that latent-space prediction sidesteps both the pixel-target waste and the self-distillation collapse modes. The published evidence is thinner.
9. JEPA: predicting in latent space
§8 closed on DINOv3 as the most engineered-against-collapse self-distillation recipe currently published. The JEPA family takes a different bet. Pixel reconstruction wastes capacity on appearance detail (§4); self-distillation needs collapse-fix engineering at scale (§8). What if you skip both and predict the encoder’s own latent features at held-out positions instead?
Take the construction sheet. Pick a context block (some contiguous region of patches). Pick several target blocks (other regions, randomly chosen). Run the context block through a context encoder. Run the original full image through a target encoder, which is an EMA of the context encoder. The training task is for a small predictor module to take the context-encoder output and predict the target-encoder’s features at the target-block positions. The prediction loss is MSE in feature space; nothing is reconstructed in pixel or token space.
That’s I-JEPA. The recipe trades the masked-prediction-on-pixels framing for masked-prediction-on-features, and explicitly argues that learning to predict in latent space sheds detail the model doesn’t need: “JEPAs do not seek representations invariant to a set of hand-crafted data augmentations, but instead seek representations that are predictive of each other.” And: “by predicting in representation space, I-JEPA produces semantic representations while using less compute.” The paper backs the latent-space bet with an ablation: the same recipe with pixel-space prediction “leads to a significant degradation in the linear probing performance.”
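A sketch of one I-JEPA step, simplified to a single target block. The `predictor` signature is a stand-in (the real predictor is conditioned on target positions via positional embeddings), and both encoders are assumed to keep the embedding dimension unchanged.

```python
import torch
import torch.nn.functional as F

def ijepa_step(context_encoder, target_encoder, predictor,
               tokens, context_idx, target_idx):
    """One I-JEPA step, simplified. `tokens` is the full (B, N, D) embedded
    patch sequence; `context_idx` / `target_idx` are (B, k) index tensors for
    the context and target blocks. `target_encoder` is an EMA copy of
    `context_encoder`; `predictor` is a small module that predicts features at
    the target positions from the context features."""
    D = tokens.shape[-1]
    ctx = torch.gather(tokens, 1, context_idx.unsqueeze(-1).expand(-1, -1, D))
    z_ctx = context_encoder(ctx)                      # student sees the context block only

    with torch.no_grad():                             # teacher sees the full image
        z_full = target_encoder(tokens)
        z_tgt = torch.gather(z_full, 1, target_idx.unsqueeze(-1).expand(-1, -1, D))

    pred = predictor(z_ctx, target_idx)               # predict target-block features
    return F.mse_loss(pred, z_tgt)                    # the loss lives in latent space
```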
The compute story is real. I-JEPA’s headline is that “a huge I-JEPA model (ViT-H/14) requires less compute than a small iBOT model (ViT-S/16).” On dense-prediction transfer, the picture is partial. I-JEPA reports ImageNet linear probes (ViT-H/14 79.3%, ViT-H/16₄₄₈ 81.1%) and Clevr/Count + Clevr/Dist for spatial-reasoning evaluation (Clevr/Count 86.7, Clevr/Dist 72.4 vs DINO ViT-B/8 53.4 and iBOT ViT-L/16 62.8). But ADE20K mIoU is not in the paper’s evaluation set. There is no published I-JEPA dense-segmentation transfer number to compare against MAE 53.6 or DINOv3 63.0.
V-JEPA extends the framing to video: spatiotemporal feature prediction across video clips. Headline numbers are video-only (Kinetics-400 81.9%, SSv2 72.2%) plus an ImageNet attentive probe (77.9%) from video-only pretraining. V-JEPA 2 is a two-phase recipe: phase 1 mask-denoising feature prediction on VideoMix22M (~22M video samples covering >1M hours) plus 1M ImageNet, phase 2 action-conditioned post-training on 62 hours of robot videos. Headline: ImageNet attentive probe 85.1% at ViT-g384. Neither V-JEPA nor V-JEPA 2 reports image dense-prediction transfer. The evaluation sets confine to video classification (Kinetics, SSv2, AVA) and image classification.
For the construction-document practitioner, the JEPA family is a recipe shape with a clean theoretical case but no published evidence for dense-prediction transfer on images, let alone construction documents. The bet that latent-space prediction sidesteps pixel-target waste is plausible and partly backed by the I-JEPA pixel-vs-latent ablation; whether that translates to higher segmentation transfer than self-distillation does is an empirical question nobody in the JEPA line has measured. Picking JEPA as a starting recipe for construction-document segmentation means running the experiment yourself.
§10 leaves SSL-as-pretext-task altogether. Multi-teacher distillation trains a student to imitate a fixed ensemble of foundation models; there’s no masked-prediction loss, no augmentation invariance, no latent-space prediction. The supervision shape is categorically different.
10. Multi-teacher distillation: the RADIO line
Every recipe in §4 through §9 answers the same question: what pretext task generates the supervision signal? Pixel reconstruction, feature reconstruction, token reconstruction, augmentation invariance, latent-space prediction. The student in each case has to invent its own supervision from the unlabeled corpus.
Multi-teacher distillation answers a different question. Suppose somebody else has already trained good foundation models on the data shape you care about. Can you compress an ensemble of those models into one student that has all of their abilities at the price of one inference?
AM-RADIO, the first paper in the RADIO line, said yes. The student is a ViT. The teachers are an ensemble of pretrained foundation models (the original paper used DFN CLIP ViT-H/14, OpenAI CLIP ViT-L/14, and DINOv2 ViT-g/14; SAM ViTDet-H entered later). For each input image, the student has to match the teachers’ summary outputs (CLS-token-shaped) and their spatial outputs (per-patch features). The losses:
Summary: L_summary(x) = Σᵢ λᵢ · L_cos(yᵢ^(s), zᵢ^(s))
Spatial: L_features(x) = Σᵢ γᵢ · L_match(hᵢ^(v)(x′ | Θᵢ^(v)), tᵢ^(v)(x | Φᵢ^(v))), with L_match = α·L_cos + β·L_smooth_L1, α = 0.9, β = 0.1
In AM-RADIO, there is no SSL auxiliary loss on the student. The entire training signal is “match these teachers’ outputs.” Whatever representational structure the teachers have, the student is being trained to inherit it directly. (Later RADIO-line variants add a self-supervised regularizer, MESA, on top of teacher imitation; §11 reaches that shape after the AM-RADIO base.)
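A sketch of that loss shape, simplified to one summary and one spatial target per teacher. The per-teacher adaptor heads that map the student's features into each teacher's space are assumed to exist upstream; the weights and the cosine + smooth-L1 mix follow the formulas above.

```python
import torch
import torch.nn.functional as F

def radio_distill_loss(student_summary, student_spatial, teachers,
                       alpha=0.9, beta=0.1):
    """AM-RADIO-style multi-teacher loss, simplified. `student_summary` /
    `student_spatial` map teacher name -> the student's per-teacher adaptor
    outputs (summary: (B, D_i); spatial: (B, N, D_i)); `teachers` maps name ->
    (summary_target, spatial_target, lambda_i, gamma_i)."""
    def cos_dist(a, b):
        return 1 - F.cosine_similarity(a, b, dim=-1).mean()

    loss = 0.0
    for name, (sum_t, spat_t, lam, gam) in teachers.items():
        # Summary loss: cosine distance against the teacher's CLS-shaped output.
        loss = loss + lam * cos_dist(student_summary[name], sum_t.detach())
        # Spatial loss: cosine + smooth-L1 mix against the teacher's patch features.
        s, t = student_spatial[name], spat_t.detach()
        loss = loss + gam * (alpha * cos_dist(s, t) + beta * F.smooth_l1_loss(s, t))
    return loss
```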
RADIOv2.5 added two pieces. First, a diagnosis: the original RADIO has a “mode-switching” pathology where the features behave like DINOv2 at resolutions ≤512² and like SAM at higher resolutions. Per the paper: “at resolutions lower than or equal to 512², the features most closely resemble those of DINOv2… At higher resolutions, the model starts to behave more like SAM.” The cause: at high resolutions, the student in earlier RADIO training only saw SAM features. Second, a remedy: balanced multi-teacher loss schedules and PHI-S loss balancing (more in a moment). RADIOv2.5 lands ADE20K mIoU 54.56 at the g-scale student.
C-RADIOv4 is the most recent step in the line. The teachers shifted to the current SOTA: SigLIP2, DINOv3, SAM3. The loss formulations changed too. Spatial loss is now squared error against PHI-S-normalized teacher outputs (per the paper’s Equation 1, §2.3.1):
L_spatial(x, ŷ) = (1/|Ω|) Σ_(u∈Ω) (𝓕_(S→T)[x]_u − ŷ_u)²
Cosine distance is explicitly dropped from the summary loss in favor of an angular-cone-normalized formulation:
L_angle(x, y) = Θ(x, y)² / Disp(Θ_y)
C-RADIOv4 also adds MESA, a shift-equivariant regularizer where the student matches its own EMA on a shifted crop of the same image:
L_mesa(x, x̃) = (1/|Ω|) Σ (𝓕_(S→S̃)[LN(x)]_u − LN(x̃)_u)²
MESA is genuinely self-supervised (student’s EMA, not a teacher), but it’s a regularizer that enforces shift equivariance, not a primary pretext task. Without MESA, C-RADIOv4 is still trained primarily by teacher imitation. C-RADIOv4-H lands ADE20K mIoU 55.20 at 512px.
A piece of the recipe worth knowing about across the line: PHI-S. When you distill from teachers whose activation statistics differ wildly (CLIP’s cosine-distance-shaped features vs DINOv2’s distributional features vs SAM’s segmentation-shaped features), the per-teacher losses can dominate each other in ways that don’t reflect the teachers’ relative importance. PHI-S applies a Hadamard isotropic standardization to balance the activation distributions before the loss runs, “where each dimension of a multivariate distribution is standardized using the same scale” via Hadamard matrices. The PHI-S paper reports it “produces the best student model across the suite of methods studied.”
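The mathematical core of PHI-S is small enough to sketch: rotate the teacher's feature distribution into its PCA basis, apply a normalized Hadamard matrix so every dimension ends up with the same variance (the mean eigenvalue), then standardize with that one shared scale. This is a sketch of the idea on a batch of features, not the official implementation, which estimates statistics over the dataset and inverts the transform where needed.

```python
import torch
from scipy.linalg import hadamard

def phi_s(teacher_feats, eps=1e-6):
    """PHI-S-style standardization, sketched. `teacher_feats` is (M, D) with D a
    power of two (required by scipy's Hadamard construction). After rotating
    into the PCA basis and applying the normalized Hadamard matrix, every
    dimension has variance equal to the mean eigenvalue, so one scalar scale
    standardizes all of them."""
    M, D = teacher_feats.shape
    mean = teacher_feats.mean(dim=0, keepdim=True)
    x = teacher_feats - mean

    cov = (x.T @ x) / (M - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)                 # cov = V diag(L) V^T
    H = torch.tensor(hadamard(D), dtype=x.dtype) / D ** 0.5   # orthonormal Hadamard
    R = H @ eigvecs.T                                         # rotation: PCA, then Hadamard

    scale = (eigvals.mean() + eps).sqrt()                     # one scalar for all dims
    return (x @ R.T) / scale
```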
The categorical observation across the line is this: multi-teacher distillation occupies a different position in the SSL recipe taxonomy than the prior recipes. Supervision shape is external: the labels are another model’s activations, not a pretext task on corrupted or multi-view inputs of the unlabeled corpus. This isn’t “RADIO is not SSL”; it’s a different kind of self-supervision, where the label-source is a foundation-model checkpoint rather than a transformation of the input. The student needs no labeled data, but it needs working teachers.
That’s where the recipe meets the construction-document practitioner’s reality. RADIO’s pretraining starting point is the teacher set: CLIP, DINOv2, SAM, and their successors. If the practitioner has access to teacher checkpoints that produce useful features on construction documents (perhaps a CLIP-style model fine-tuned on construction-document image-caption pairs, or a DINOv2-style model continued-pretrained on construction sheets), RADIO can compress that ensemble into a fast student. If they don’t, RADIO has nothing to compress.
Whether suitable teacher checkpoints exist is a question the practitioner answers from outside this post’s matrix. The post does not assert “no teacher set ships with construction-document weights” as an absence claim, since that’s a statement about all possible teachers in the world, which the matrix cannot cover. What the post does say: §14’s decision tree treats teacher availability as a hard practitioner-facing gate, not a confounder. If you have teachers that work on your domain, RADIO is a recipe; if you don’t, the section’s recipe doesn’t apply, and you fall back to one of §11’s options.
§11 turns to those options. With multi-teacher distillation off the table for practitioners without a working teacher set, the remaining recipe shapes are from-scratch SSL on the practitioner’s own corpus, or continual SSL on top of a natural-image checkpoint.
11. Domain-adaptive recipes: from-scratch and continual-from-natural
§10 closed multi-teacher distillation when the practitioner doesn’t have working teachers. With that path closed, two recipe shapes remain that you can actually run: train SSL from scratch on your own domain corpus, or continue SSL pretraining on top of a natural-image checkpoint. Both are documented in adjacent domains; neither has a published comparison on construction-document line-art specifically.
Take the from-scratch shape first. DiT ran BEiT-style MIM on IIT-CDIP, a 42M-image corpus of scanned business documents, and reports gains on three downstream tasks. PubLayNet layout detection: 91.0 → 94.9 mAP. ICDAR cTDaR table detection: 94.23 → 96.55 F1. RVL-CDIP document classification: 91.11 → 92.69%. The headline is that BEiT-style pretraining on a domain corpus, at scale, transfers cleanly to document-shaped downstream tasks.
Notice the downstream task list. Layout, table, classification. Not vision-only segmentation on line-art. DiT’s “documents” are scanned reports, forms, invoices, contracts: text-heavy and layout-driven. Construction documents share the surface texture (scanned, mostly-empty paper) but the dense task is different. A floor plan’s segmentation labels are spatial regions of structure, not bounding boxes for text columns or tables. DiT’s evidence is for document-layout pretraining specifically. It does not generalize to “vision-only SSL works on line-art.”
Medical 3D MAE takes the from-scratch shape into a different domain: MAE on 39,168 unlabeled 3D MRI brain volumes, inside an nnU-Net architecture. The result is approximately +3 DSC over a dynamically-optimized nnU-Net baseline (“the first work to demonstrate that SSL pretraining with a fixed architecture can consistently outperform a state-of-the-art, dynamically optimized nnU-Net baseline”; abstract headline of “approximately 3 Dice points”). The medical paper does not compare against continual SSL on a natural-image checkpoint, so the head-to-head against the natural-image-init recipe is absent. What you have is: from-scratch MAE on 39k medical volumes beats no pretraining.
The two from-scratch papers cover different scales (DiT at 42M, Medical 3D MAE at 39k) and different downstream tasks. Both work, in their own setting. Neither tells you what would happen if you ran the same recipe on construction-document line-art and compared it against continual pretraining on a natural-image checkpoint.
Continual-from-natural is the second recipe shape. GLARE 2026 trains a small adapter on top of a UDI-initialized natural-image-SSL checkpoint, using SSL as the adapter’s training objective. The original natural-image-SSL features are kept frozen; only the adapter’s parameters update. The result, across natural and satellite segmentation benchmarks at ViT-S/16, is modest gains over the UDI initialization: ADE20K 41.2 → 41.6, Pascal Context 49.1 → 49.3, Cityscapes 74.7 → 75.3, LoveDA 50.9 → 51.5. Each gain is +0.2 to +0.6 mIoU. Modest, but consistent across the four benchmarks GLARE evaluates.
Lahrichi 2025 ran the more demanding head-to-head: pretrain MAE and SwAV on GeoNet (a satellite corpus) versus ImageNet, then evaluate across six downstream remote-sensing tasks (four segmentation, two classification). Conclusion: “no consistent advantage to pre-training with GeoNet as compared to ImageNet, regardless of whether SwAV or MAE was used.” Two-stage MAE-IN→GN beats from-scratch MAE-GN on five of six tasks, with a modest 1–2% advantage. The two-stage result is the relevant data point for continual-from-natural: pretraining first on natural images, then continuing on the domain corpus, beats pretraining only on the domain corpus, modestly, in this satellite setting (the gain is reported across the mixed segmentation/classification task set, not segmentation alone).
For the construction-document practitioner, §11 is the most directly applicable section in act 2. Both recipe shapes are runnable on a construction-document corpus without external teachers; both have published evidence in adjacent domains showing the recipe shape can produce useful representations. The trade-offs cut in different directions.
From-scratch domain SSL is the cleanest experimental setup. You take MAE or BEiT-style MIM, run it on your construction-document corpus, and you get a checkpoint that has only seen construction documents. The downside is that you start from scratch, with no inherited representational structure from natural images. At a corpus scale of “tens of thousands of unlabeled sheets,” whether from-scratch SSL produces useful representations at all is an open question. Medical 3D MAE worked at 39k volumes, but those are 3D MRI volumes with full volumetric structure; a 39k-volume construction-document corpus is a different shape.
Continual SSL on a natural-image checkpoint is the “safe” shape. You start from DINOv3 (or whichever natural-image SSL recipe you pick), continue self-supervised training on your construction-document corpus, and inherit whatever representational structure DINOv3 already has. The downside is that you don’t know whether DINOv3’s natural-image structure helps or hurts on construction-document line-art. GLARE’s modest gains over UDI on natural + satellite suggest the recipe shape works in adjacent domains, modestly. Lahrichi’s two-stage MAE-IN→GN result suggests the continual recipe shape beats from-scratch, modestly, in satellite. Whether either result transfers to construction documents is the open question this post can’t answer.
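As a concrete shape, the continual-from-natural setup is mostly plumbing: load the natural-image checkpoint, keep training with your chosen SSL objective on the domain corpus, and optionally give inherited weights a lower learning rate than anything newly added. The learning rates and weight decay below are illustrative values, not published settings.

```python
import torch

def build_continual_pretrain_optimizer(model, pretrained_state_dict,
                                        backbone_lr=1e-5, new_param_lr=1e-4):
    """Sketch of the continual-from-natural shape: start from a natural-image
    SSL checkpoint and keep training with a self-supervised objective on the
    domain corpus. strict=False tolerates heads the checkpoint doesn't carry;
    parameters missing from the checkpoint get the higher learning rate."""
    missing, unexpected = model.load_state_dict(pretrained_state_dict, strict=False)
    newly_added = set(missing)                  # params not present in the checkpoint

    pretrained_params, new_params = [], []
    for name, p in model.named_parameters():
        (new_params if name in newly_added else pretrained_params).append(p)

    return torch.optim.AdamW([
        {"params": pretrained_params, "lr": backbone_lr},
        {"params": new_params, "lr": new_param_lr},
    ], weight_decay=0.05)
```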
Neither recipe shape has a published bake-off on construction-document line-art. The practitioner who runs both and compares is the one who would produce that bake-off.
§12 leaves act 2 and turns to the verdict. Act 3 starts by putting all the published dense-prediction numbers from this act onto a single chart, with caveats about cross-protocol comparison. Then it walks the OOD-domain transfer evidence one paper at a time. Then it closes with the routing-questions tree.
Act 3 — The verdict
12. Where each recipe lands on dense prediction (with caveats)
Act 2 walked seven recipe families across the construction sheet. Two of them, JEPA and autoregressive image pretraining (iGPT, AIM, AIMv2), do not report image dense-prediction transfer in their published evaluation sets. The remaining five report ADE20K mIoU somewhere in their papers. The natural temptation is to put all those numbers on a single chart and read off the ranking. The chart is real, and it’s coming, but the ranking it implies is not a clean cross-recipe verdict.
The reason is protocol drift. Different recipes report different combinations of (model size, pretraining dataset, segmentation head, training regimen). Some report fine-tuned UperNet at IN1K-pretrained ViT-L. Some report frozen-backbone linear probes at ViT-g. Some report frozen-backbone at ViT-7B with a paper-specific segmentation head. Comparing 53.6 (MAE ViT-L UperNet, fine-tuned) to 63.0 (DINOv3 ViT-7B frozen, segmentation head) requires holding constant variables that aren’t actually held constant.
Fig 11 stratifies the published numbers into five loose groupings:
(a) ViT-L IN1K UperNet, head explicitly matched: MAE 53.6, D2V2-Refined 54.4, BEiT v2 56.7. These three sit in the same protocol bin and the within-bin ranking is meaningful. BEiT v2 leads.
(a’) BEiT v1 at ViT-L (53.3) sits adjacent to bin (a) at the same model scale, but reports under SETR-PUP rather than UperNet. The number is real, and it’s roughly in the same ballpark, but it doesn’t slot directly into bin (a)‘s head-matched comparison.
(b) ViT-B IN1K UperNet, head explicitly matched: MAE 48.1, iBOT 50.0, BEiT v2 53.1. Same protocol as (a) but at smaller scale. iBOT leads MAE; BEiT v2 leads iBOT.
(c) DINO-line frozen-backbone, mixed heads: DINOv2-g 53.0 (multi-scale linear probe), DINOv3-7B 63.0 (DINOv3’s segmentation head). These two are not protocol-matched to (a) or (b), or to each other. The numbers tell you the DINO line scales, not that DINOv3 outperforms BEiT v2 at matched protocol.
(d) RADIO line under each paper’s own protocol: RADIOv2.5-g 54.56, C-RADIOv4-H 55.20 at 512px. RADIO papers have their own evaluation regimens (model-size axis varies; the resolution at which C-RADIOv4-H reports matters). These numbers are best read as RADIO-line-internal comparisons.
The construction sheet has no bar in this chart, because no canonical bake-off exists for it. Treat these bars as proxy evidence for what to start from, not as direct construction-document scores. The reader who walks away with “DINOv3 wins outright” has read more into Fig 11 than the matrix supports.
A note on the recipe families that don’t appear here. Autoregressive image pretraining (iGPT, AIM, AIMv2) reports ImageNet classification only. AIM specifically frames itself as a scaling-laws result for classification (“performance of the visual features scale with both the model capacity and the quantity of data”); it does not report ADE20K, COCO segmentation, or depth transfer. The JEPA family (I-JEPA, V-JEPA, V-JEPA 2) reports image classification and video classification; none of the three reports image dense-prediction transfer. For the practitioner with a dense downstream task, these two families don’t currently have a published number to compare against MAE 53.6 or DINOv3 63.0. They might, eventually. They don’t yet.
What Fig 11 does support, when you read it carefully:
Within the MIM family at ViT-L UperNet IN1K (bin a), BEiT v2 leads MAE by 3.1 points and D2V2-Refined by 2.3 points. This is the cleanest within-recipe comparison the post has. The DINO line scales from 53.0 (frozen ViT-g multi-scale) to 63.0 (frozen ViT-7B with segmentation head); the 10-point spread reflects mostly model size and protocol differences, not a 10-point recipe gap. The RADIO line at C-RADIOv4-H lands at 55.20 at 512px under its own protocol, competitive with BEiT v2 56.7 within MIM, but the comparison isn’t head-to-head. Five recipes have published dense numbers; two don’t.
§13 turns from natural-image dense numbers to out-of-domain transfer. Different question, different evidence, different chart.
13. The published OOD evidence
§12 read across natural-image dense numbers. §13 reads across out-of-domain transfer. The question is different: not “which recipe makes the best ADE20K segmentation,” but “which recipe shape produces the best fine-tuning starting point for a domain that’s not in any of these papers’ training sets.”
The headline result for satellite is the one §3 introduced, expanded. DINOv3 §8.3 reports that DINOv3 Web ViT-7B (frozen backbone, no satellite SSL pretraining) sets state-of-the-art on LoveDA (56.2 mIoU) and DIOR (80.5 mAP), beating both DINOv3’s own SAT-493M satellite specialization and prior satellite-specialized models. iSAID is the hedge: DINOv3 Web 71.4 < SkySense V2 71.9. Two takeaways. First, natural-image self-distillation at scale transfers cleanly to satellite imagery without a domain-specific SSL recipe; DINOv3 Web is doing frozen-backbone transfer (the supervised benchmark heads are still trained on the satellite labels) and beating models that ran satellite-specific SSL pretraining. Second, “domain-adaptive SSL strictly wins” is the wrong intuition for satellite. DINOv3 Sat-493M, the satellite specialization, doesn’t beat its own natural-image starting point on the segmentation tasks in DINOv3’s evaluation set.
This intuition is not unique to DINOv3. Lahrichi 2025 ran the head-to-head you’d want to see one level up, before specialization. Pretrain MAE and SwAV on GeoNet (a satellite corpus) versus ImageNet, evaluate across six downstream remote-sensing tasks (four segmentation: SEN12MS, DeepGlobe, Field Delineation, LandCoverNet; two classification: BigEarthNet multi-label, EuroSAT multi-class). The conclusion: “no consistent advantage to pre-training with GeoNet as compared to ImageNet, regardless of whether SwAV or MAE was used.” The two-stage MAE-IN→GN result is the relevant data point for continual-from-natural: pretraining first on natural images, then continuing on the domain corpus, beats pretraining only on the domain corpus on 5 of 6 tasks, by a modest 1-2%. So Lahrichi reports two findings, scoped narrowly across the mixed task set: pretraining on ImageNet vs only on GeoNet shows no consistent winner, and the two-stage continual MAE-IN→GN beats from-scratch MAE-GN modestly. The paper does not rank natural-image-only pretraining against from-scratch domain SSL on segmentation alone; the only consistent improvement Lahrichi reports is the continual-from-natural step on top of the domain corpus.
GLARE 2026 measures the continual-from-natural recipe shape directly. Adapter-based continual pretraining on top of a UDI initialization, evaluated across natural and satellite segmentation: ADE20K 41.2 → 41.6, Pascal Context 49.1 → 49.3, Cityscapes 74.7 → 75.3, LoveDA 50.9 → 51.5. Each gain is +0.2 to +0.6 mIoU. The pattern is consistent across the four benchmarks GLARE evaluates. Continual SSL on top of a natural-image checkpoint helps, modestly, when the recipe is well-tuned.
Medical imaging gives a different shape of evidence. Medical 3D MAE reports approximately +3 DSC over a dynamically-optimized nnU-Net baseline on 39k MRI volumes. The medical paper does not compare against continual SSL on a natural-image checkpoint. It compares against “no pretraining,” which is the right baseline to argue “SSL is worth doing” but not the right baseline to argue “domain-SSL beats continual-from-natural.” For 3D MRI volumes, no published natural-image SSL recipe applies directly (you can’t run DINOv3 on volumetric data without an architectural adapter), so the comparison would have to be against continual SSL on a 3D-medical-init checkpoint, which doesn’t exist in published form. Medical’s published evidence is “from-scratch SSL on 39k volumes beats no pretraining.”
Documents have DiT for from-scratch domain SSL at 42M images. The downstream tasks are document-layout (PubLayNet), table detection (ICDAR cTDaR), and document classification (RVL-CDIP). DiT does not evaluate continual-from-natural either. The published evidence is “from-scratch BEiT-style on 42M document images beats supervised baselines on document-layout, table, and classification tasks.”
The published evidence does not support a simple “domain-adaptive SSL strictly wins” thesis. The closest the literature comes to that thesis is Medical 3D MAE, which compares from-scratch domain SSL to no pretraining, not to continual-from-natural. Where there is direct head-to-head evidence (Lahrichi 2025), the two-stage continual-from-natural beats from-scratch-on-domain modestly across a mixed task set (four segmentation + two classification, not segmentation alone). Where there is frozen-backbone evidence (DINOv3 Web on satellite), natural-image SSL transfers without domain-specific SSL pretraining. Where there is continual-pretraining evidence (GLARE 2026), the recipe shape works modestly. None of these results is on construction-document line-art.
For the construction-document practitioner, §13’s takeaway is that the published evidence points toward “start from a natural-image checkpoint” as the most-supported recipe shape across the domains where head-to-head evidence exists. DINOv3 has the highest published frozen-backbone dense numbers; it has documented frozen-backbone transfer to satellite (no satellite SSL pretraining); it has not been tested on construction-document line-art. Whether it works on construction documents is the experiment a practitioner would run, not a recipe with a published number.
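That starting point is also the cheapest thing to stand up. The sketch below is a generic frozen-backbone dense probe, assuming a ViT encoder that returns per-patch embeddings of shape (B, N, D); it is not DINOv3’s published segmentation protocol, and the function and argument names are placeholders.

```python
# Minimal frozen-backbone per-patch probe. The encoder is assumed to return
# patch tokens of shape (B, N, D); adapt `extract_patch_tokens` to your backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearPatchProbe(nn.Module):
    """Per-patch linear classifier on top of frozen patch embeddings."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens, grid_hw, out_hw):
        logits = self.head(patch_tokens)                        # (B, N, C)
        b, n, c = logits.shape
        gh, gw = grid_hw
        logits = logits.transpose(1, 2).reshape(b, c, gh, gw)   # (B, C, gh, gw)
        # Upsample patch-level predictions to the label resolution.
        return F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)

@torch.no_grad()
def extract_patch_tokens(encoder: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Assumes `encoder(images)` returns (B, N, D) patch tokens."""
    encoder.eval()
    return encoder(images)
```

Only `self.head` trains; the encoder stays in eval mode with no gradients, which is what makes the probe a measurement of the pretrained dense features rather than of fine-tuning capacity.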
§14 turns this into the routing-questions tree. Three axes (corpus scale, domain distance from natural images, downstream task density), one gate (teacher availability), and four leaves with explicit per-leaf evidence flags.
14. The decision tree (what to do, hedged)
The construction sheet that opened §1 has now been carried through seven recipe families. It has no published bake-off; it has no canonical SSL backbone trained on it; the practitioner picking a recipe for it is at the frontier of what’s measured. §14 takes everything the post has covered and lays it out as a decision tree.
A note on what kind of tree this is. The three axes (corpus scale, domain distance from natural images, downstream task density) and the teacher-availability gate that sits outside the hierarchy are a post-hoc descriptive synthesis of how the covered evidence factors. The post is not claiming these are the optimal axes, the only axes, or axes that have been head-to-head measured against alternative split orders. It claims they are the relevant practitioner factors implied by what the published evidence actually establishes:
- Corpus scale gates whether SSL is feasible at all. Medical 3D MAE worked at 39k volumes; DiT at 42M document images; DINOv3 at 1.7B natural images. Different recipes have different scale floors.
- Domain distance from natural images gates how much you can lean on a natural-image checkpoint. DINOv3 Web → satellite is documented transfer; construction documents have no measured transfer.
- Downstream task density gates whether dense-feature quality dominates. AIM/iGPT’s classification-only evaluation sits at one end; ADE20K-shaped segmentation sits at the other.
- Teacher availability is binary practitioner access, not a domain-shape axis. RADIO needs working teachers; if you don’t have them, that branch closes.
Treat the tree as descriptive guidance, not prescriptive truth. It’s a way to organize the published evidence into a routing map.
The four leaves:
1. Construction-document leaf (line-art, segmentation downstream, tens of thousands of unlabeled sheets). Starting point: DINOv3 frozen probe (matrix row 18: ADE20K mIoU 63.0 at frozen ViT-7B, the highest published frozen-backbone dense number). Continuation: continual self-distillation on top is an A/B the practitioner would run, not a recipe with a published number. Flag: experiment, no measured bake-off. Row 32 (Lahrichi + GLARE) backs the absence by establishing what published comparisons exist for adjacent domains; none for construction documents specifically.
Per-leaf baseline disclaimer: this leaf’s evidence is ViT-7B frozen-backbone with DINOv3’s segmentation head (row 18). The practitioner’s continual-pretraining A/B is what would produce the construction-document number. Cross-leaf comparison to the satellite, medical, or document-corpus leaves below is not supported; their baselines differ.
2. Satellite leaf. Starting point: DINOv3 Web ViT-7B frozen (row 31: LoveDA 56.2 / DIOR 80.5 SOTA; iSAID 71.4 hedge). Continuation: optional adapter-based continual pretraining for marginal gains (row 32 / GLARE deltas of +0.2 to +0.6 mIoU). Flag: published evidence backs the starting point. From-scratch GeoNet pretraining shows no consistent advantage over ImageNet (Lahrichi 2025); two-stage MAE-IN→GN beats from-scratch MAE-GN modestly.
Per-leaf baseline disclaimer: starting-point evidence is at ViT-7B frozen-backbone (row 31, DINOv3 Web on LoveDA / DIOR / iSAID); continuation evidence (GLARE) is adapter-based at ViT-S/16 on UDI initialization (row 32). The starting-point and continuation evidence run on different protocols within the same leaf; neither is a unified ViT-frozen baseline against the other leaves.
3. Medical (3D MRI volumes) leaf. Starting point: from-scratch 3D MAE on the domain (row 30: approximately +3 DSC over a dynamically-optimized nnU-Net baseline). Continuation: no published comparison against continual SSL on a natural-image checkpoint (a head-to-head doesn’t exist for volumetric medical data). Flag: published evidence vs no-pretraining; experiment vs continual-from-natural. This leaf does NOT claim “domain-SSL beats continual-from-natural”; the comparison hasn’t been published.
Per-leaf baseline disclaimer: 3D Residual Encoder U-Net within nnU-Net; DSC against no-pretraining (row 30). Not a ViT frozen / linear dense-feature probe. Cross-leaf comparison to the satellite or construction-document leaf is not supported; the model architecture and the metric are both different.
4. Document-corpus (text-heavy, layout-heavy) leaf. Applies to scanned reports, forms, invoices, contracts where the dense-prediction task is layout / table detection or document classification, NOT to line-art construction documents. Starting point: DiT BEiT-style on IIT-CDIP-style corpus (row 29: PubLayNet 91.0 → 94.9 mAP, ICDAR cTDaR 94.23 → 96.55 F1, RVL-CDIP 91.11 → 92.69%). Continuation: no published continual-from-natural comparison. Flag: published evidence (from-scratch domain-SSL vs supervised, at 42M corpus scale, on document-layout tasks specifically). Construction-sheet line-art routes to leaf 1, not this leaf.
Per-leaf baseline disclaimer: BEiT-style MIM on a 42M document corpus; downstream evaluation mixes layout (PubLayNet), table detection (ICDAR cTDaR), and document classification (RVL-CDIP) deltas. Not a unified dense-feature probe. The leaf reports DiT’s result on its own protocol.
The multi-teacher gate sits outside the three-axis hierarchy. If the practitioner has access to teacher checkpoints that work on their domain (perhaps a CLIP fine-tuned on construction-document image-caption pairs, or a DINOv2-style model continued-pretrained on construction sheets), RADIO-style multi-teacher distillation is the recipe (rows 23-28, 35, 36 establish the loss formulations and the central thesis). If they don’t, fall back to whichever leaf the practitioner’s data shape routes to. The post does not claim “no teacher set ships with construction-document weights” as an absence claim; teacher availability is a question the practitioner answers from their own context.
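For readers who prefer the routing map as code, here is a hedged sketch of the tree with the gate checked first. The numeric corpus-scale floor and the domain keys are illustrative labels, not published thresholds, and the returned strings are routing destinations, not ranked recommendations.

```python
# Hedged sketch of the routing-questions tree. Thresholds and domain keys are
# illustrative; the post publishes no numeric cutoffs, and the four leaves are
# not comparable points on a single decision surface.

def route(corpus_scale: int, domain: str, dense_downstream: bool,
          has_working_teachers: bool) -> str:
    # Gate (outside the three-axis hierarchy): multi-teacher distillation
    # only applies if teachers that work on the domain actually exist.
    if has_working_teachers:
        return "RADIO-style multi-teacher distillation"
    # Axis 1: corpus scale gates whether any SSL run is feasible at all.
    # The 10k floor is an illustrative placeholder, not a published number.
    if corpus_scale < 10_000:
        return "Frozen natural-image backbone + task head; no domain SSL"
    # Axis 3: task density. If the downstream task reads the CLS token,
    # dense-feature quality stops being the deciding factor.
    if not dense_downstream:
        return "Any strong natural-image checkpoint; probe the CLS token"
    # Axis 2: domain distance, expressed here as named routing destinations.
    leaves = {
        "construction_line_art": "Leaf 1: DINOv3 frozen probe + continual-SSL A/B (experiment flag)",
        "satellite": "Leaf 2: DINOv3 Web frozen; optional adapter continual pretraining",
        "medical_3d": "Leaf 3: from-scratch 3D MAE (evidence vs no-pretraining only)",
        "document_layout": "Leaf 4: DiT-style BEiT MIM on the document corpus",
    }
    return leaves.get(domain, "Unmapped domain: frozen probe first, then budget an A/B")
```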
The post-level claim is narrower than “follow the tree to the right recipe.” Each leaf names a defensible STARTING POINT under that domain’s published protocol, with an explicit flag for what the practitioner would have to run themselves. The four leaves are NOT four comparable points on a single decision surface; they are four routing destinations whose evidence is intra-domain. A practitioner whose corpus is construction-document line-art is being routed to leaf 1, with the experiment flag, not given a measured comparison against leaves 2, 3, or 4.
§15 closes on what the matrix doesn’t yet cover, and what’s likely to move in the next twelve months.
15. Coda: what we don’t know
The construction-document SSL bake-off is the figure this post couldn’t include. The decision tree’s construction-document leaf is a starting point with an experiment flag, not a measured number. Whether DINOv3 frozen-probe features are useful on a floor plan; whether continual self-distillation on top of DINOv3 helps or hurts compared to from-scratch domain SSL; whether the failure modes from §8 (artifact tokens, patch-locality decay) emerge on a corpus where most patches are whitespace. Every one of those questions is an experiment a practitioner has to run, not a number they can look up.
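One of those questions is at least cheap to scope before any training run: whether the corpus really is whitespace-dominated at the patch level. The check below is not from any cited paper; the 16-pixel patch size and the ink threshold are illustrative defaults.

```python
# Estimate what fraction of patches on a rendered sheet are effectively blank.
# Helps decide whether the whitespace-dominated failure modes above are in play.
import numpy as np

def blank_patch_fraction(page: np.ndarray, patch: int = 16, ink_thresh: float = 0.02) -> float:
    """page: grayscale array in [0, 1], white background near 1.0, ink near 0.0."""
    h, w = page.shape
    h, w = h - h % patch, w - w % patch                    # crop to a whole number of patches
    patches = page[:h, :w].reshape(h // patch, patch, w // patch, patch)
    ink = (patches < 0.5).mean(axis=(1, 3))                # dark-pixel fraction per patch
    return float((ink < ink_thresh).mean())                # share of near-blank patches
```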
What this post does not do: prescribe a specific continual-pretraining recipe for construction documents. The matrix doesn’t back a prescription, and the post shouldn’t make one up. Any practitioner running an A/B on top of DINOv3 with continual self-distillation on construction sheets is running an experiment, not following a published recipe. What they would report (corpus scale, training schedule, downstream segmentation numbers) would be the figure this post couldn’t include.
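If someone does run that A/B, a minimal record of it might look like the following; every field name is an assumption about a sensible experiment log, not a protocol any cited paper prescribes.

```python
# Hypothetical reporting schema for the continual-pretraining A/B; field names
# are assumptions, not a published protocol.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContinualPretrainingABResult:
    corpus_sheets: int                   # unlabeled construction sheets used for continual SSL
    init_checkpoint: str                 # the natural-image checkpoint the run started from
    continual_epochs: int
    schedule: str                        # optimizer, learning rate, resolution, EMA/teacher settings
    frozen_probe_miou: float             # downstream segmentation, frozen natural-image baseline
    continual_probe_miou: float          # same probe after continual SSL on the sheets
    from_scratch_miou: Optional[float] = None  # optional third arm: domain-only SSL, if budget allows
```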
The next twelve months will probably move some of these questions. C-RADIOv4 is a few months old at the time of writing; its successors will explore different teacher sets, possibly including domain-specific teachers. DINOv3 is about nine months old; its satellite specialization (SAT-493M) suggests more domain-tuned variants are likely. JEPA recipes still don’t have published image dense-prediction transfer, but if a future V-JEPA or I-JEPA paper reports ADE20K numbers, the JEPA-vs-self-distillation question becomes more tractable. None of that timeline is on construction-document line-art specifically.
If you’re a practitioner picking an SSL recipe for a domain that isn’t natural images, the shape of the decision the matrix backs is: start with the highest-published-frozen-backbone recipe that fits your scale (DINOv3 frozen probe at the high end, MAE or BEiT-style MIM if you can only run a smaller recipe), measure your downstream task density honestly (is it really segmentation, or could a strong CLS token work?), and treat a continual-pretraining A/B on top of the natural-image checkpoint as the experiment worth budgeting for, if your corpus and compute allow it. The published evidence the post draws on is adjacent-domain (Lahrichi 2025 on satellite, GLARE 2026 across natural plus satellite); whether the same shape generalizes to construction-document line-art is the experiment, not a recipe with published numbers. The question of whether construction-document SSL deserves its own published checkpoint is one a future paper would have to answer; this post couldn’t.
References
- AIM: Scalable Pre-training of Large Autoregressive Image Models, El-Nouby et al., Apple (2024).
- AIMv2: Multimodal Autoregressive Pre-training of Large Vision Encoders, El-Nouby et al., Apple (2024).
- AM-RADIO: Agglomerative Vision Foundation Model, Ranzinger et al., NVIDIA (2023; CVPR 2024).
- BEiT: BERT Pre-Training of Image Transformers, Bao, Dong, Wei, MSRA (2021).
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers, Peng et al. (2022).
- BEiT v3: Image as a Foreign Language for Vision and Vision-Language Tasks, Wang et al. (2022).
- C-RADIOv4: NVIDIA Multi-teacher Distillation Foundation Model, NVIDIA (2026).
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, Baevski et al., FAIR (2022).
- Efficient Self-supervised Learning with Contextualized Target Representations (data2vec 2.0), Baevski et al., FAIR (2022).
- DINO: Emerging Properties in Self-Supervised Vision Transformers, Caron et al., FAIR (2021).
- DINOv2: Learning Robust Visual Features without Supervision, Oquab et al., Meta (2023).
- Vision Transformers Need Registers (DINOv2 with Registers), Darcet et al., Meta (2023; ICLR 2024 Oral).
- DINOv3, Meta (2025).
- DiT: Self-supervised Pre-training for Document Image Transformer, Microsoft (2022).
- Domain-Adaptive Self-Supervised Pre-Training for Face & Body Detection in Drawings (2022).
- GLARE: Adapter-based continual SSL pretraining over UDI for natural and satellite segmentation, TMLR-formatted submission (2025; v2 2026).
- iBOT: Image BERT Pre-Training with Online Tokenizer, Zhou et al. (2021).
- Generative Pretraining from Pixels (iGPT), Chen et al., OpenAI (ICML 2020).
- I-JEPA: Self-Supervised Learning from Images with a Joint Embedding Predictive Architecture, Assran et al., Meta (2023).
- Lahrichi 2025: ImageNet vs GeoNet pretraining for satellite segmentation, Lahrichi et al. (2025).
- Masked Autoencoders Are Scalable Vision Learners (MAE), He et al., FAIR (2021).
- MaskFeat: Masked Feature Prediction for Self-Supervised Visual Pre-Training, Wei et al., FAIR (2021).
- Self-supervised pretraining at scale for 3D medical image segmentation (Medical 3D MAE), (2024).
- MIM-Refiner: A Contrastive Learning Boost from Intermediate Pre-Trained Representations, Alkin et al. (2024; ICLR 2025).
- PHI-S: Distribution Balancing for Label-Free Multi-Teacher Distillation, NVIDIA (2024).
- RADIOv2.5: Improving Multi-Teacher Distillation with Better Loss Schedules, Heinrich, Ranzinger et al., NVIDIA (2024; CVPR 2025).
- SimMIM: A Simple Framework for Masked Image Modeling, Xie et al., MSRA (2021).
- V-JEPA: Revisiting Feature Prediction for Learning Visual Representations from Video, Bardes et al., Meta (2024).
- V-JEPA 2: Self-Supervised Video Models for Understanding, Prediction, and Planning, Meta (2025).
- Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting (2021).