SATtxt - Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella
La Trobe University, Cisco Research
| Date | Update |
|---|---|
| Mar 9, 2026 | We have released model code and weights. |
| Feb 23, 2026 | SATtxt is accepted at CVPR 2026. We thank the reviewers and ACs. |
SATtxt is a vision-language foundation model for satellite imagery. It aligns a frozen DINOv3 vision encoder with a frozen LLM2Vec text encoder; only the lightweight projection heads are trained.
| Component | Architecture | Status |
|---|---|---|
| Vision Encoder | DINOv3 ViT-L/16 | Frozen |
| Text Encoder | LLM2Vec Llama-3-8B | Frozen |
| Vision Head | Transformer Projection | Trained |
| Text Head | Linear Projection | Trained |
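As a rough sketch of this design (all dimensions and weights below are illustrative stand-ins, not the repo's actual internals): the frozen encoders produce fixed features, and only two small projection heads map them into a shared embedding space, where cosine similarity compares images and text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the actual model sizes)
d_vision, d_text, d_shared = 1024, 4096, 512

# Frozen encoder outputs: fixed features for one image and two captions
img_feat = rng.standard_normal(d_vision)
txt_feats = rng.standard_normal((2, d_text))

# The only trainable parameters: two projection heads
W_vision = rng.standard_normal((d_vision, d_shared)) * 0.02
W_text = rng.standard_normal((d_text, d_shared)) * 0.02

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb = normalize(img_feat @ W_vision)   # shape (d_shared,)
txt_embs = normalize(txt_feats @ W_text)   # shape (2, d_shared)

# Cosine similarities between the image and each caption (unit vectors)
sims = txt_embs @ img_emb
print(sims.shape)
```

Since both encoders stay frozen, only the heads receive gradients during alignment training, which keeps the trainable parameter count small.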
```shell
git clone https://github.com/ikhado/sattxt.git
cd sattxt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # required for LLM2Vec
```

Download the required weights:
| Component | Source |
|---|---|
| DINOv3 ViT-L/16 | facebookresearch/dinov3 → dinov3_vitl16_pretrain_sat493m.pth |
| LLM2Vec | McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse |
| Vision Head | sattxt_vision_head.pt |
| Text Head | sattxt_text_head.pt |
Clone DINOv3 into the thirdparty folder:

```shell
cd thirdparty && git clone https://github.com/facebookresearch/dinov3.git
```

```python
import sys
from pathlib import Path

import torch

# Make the vendored DINOv3 repo importable
sys.path.insert(0, str(Path(__file__).resolve().parent / "thirdparty" / "dinov3"))

from sattxt.model import SATtxt
from sattxt.utils import image_loader, get_preprocess, zero_shot_classify

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = SATtxt(
    dinov3_weights_path="/PATH/TO/dinov3_vitl16_pretrain_sat493m-eadcf0ff.pth",
    sattxt_vision_head_pretrain_weights="/PATH/TO/sattxt_vision_head.pt",
    text_encoder_id="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    sattxt_text_head_pretrain_weights="/PATH/TO/sattxt_text_head.pt",
).to(device).eval()

categories = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]

image = image_loader("./asset/Residential_167.jpg")
image_tensor = get_preprocess(is_ms=False, all_bands=False)(image).unsqueeze(0).to(device)

logits, pred_idx = zero_shot_classify(model, image_tensor, categories)
print("Prediction:", categories[pred_idx.item()])
```

Please check demo.py for more details.
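Under the hood, CLIP-style zero-shot classification typically embeds one text prompt per class and picks the class whose embedding is most similar to the image embedding. A minimal sketch of that logic follows; the prompt template and the encoder stub are illustrative assumptions, not SATtxt's actual implementation.

```python
import numpy as np

def encode_text(prompt: str, dim: int = 512) -> np.ndarray:
    """Stand-in for a text tower: deterministic pseudo-embedding per prompt."""
    seed = abs(hash(prompt)) % (2**32)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def zero_shot_classify_sketch(image_emb, categories):
    # One prompt per class; this template is an assumption, not the repo's.
    text_embs = np.stack([encode_text(f"a satellite image of {c}") for c in categories])
    logits = text_embs @ image_emb  # cosine similarities (all unit-norm vectors)
    return logits, int(np.argmax(logits))

categories = ["Forest", "Highway", "Residential"]
# Pretend image embedding: identical to the "Residential" prompt embedding,
# so that class is guaranteed to score highest.
image_emb = encode_text("a satellite image of Residential")
logits, pred = zero_shot_classify_sketch(image_emb, categories)
print(categories[pred])  # Residential
```

The real `zero_shot_classify` operates on model embeddings rather than these stubs, but the prompt-then-argmax structure is the standard pattern.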
```bibtex
@misc{do2026sattxt,
  title={Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery},
  author={Minh Kha Do and Wei Xiang and Kang Han and Di Wu and Khoa Phan and Yi-Ping Phoebe Chen and Gaowen Liu and Ramana Rao Kompella},
  year={2026},
  eprint={2602.22613},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.22613},
}
```

We pretrained the model with: Lightning-Hydra-Template
We use evaluation scripts from: MS-CLIP and Pangaea-Bench
We also use LLMs (such as ChatGPT and Claude) for code refactoring.
This work was supported in part by the Australian Government through the Australian Research Council’s Discovery Projects Funding Scheme under Project DP220101634, and by the NVIDIA Academic Grant Program.
We welcome contributions and issues to further improve SATtxt.
