Automated wafer map defect classification with a 3-stage AI review pipeline —
ResNet18 confidence gate → local VLM briefing → self-contained HTML report
| | |
|---|---|
| 📦 Dataset | WM-811K — 811,457 wafer maps, 9 defect classes |
| 🏆 Best model | ResNet18 — macro-F1 0.9109 on held-out test set |
| ⚡ Auto-acceptance | 96.6 % of samples cleared without human review |
| 🤖 VLM triage | Qwen2.5-VL-3B-Instruct runs locally — no API calls |
| 📄 Report | Single self-contained HTML file, no server required |
| 🔁 Modes | Full · Mock (layout test) · Disabled (scores only) |
- Motivation
- Model Results
- Quick Start
- Review Pipeline
- All Scripts
- Project Structure
- Dataset
- Environment Notes
- Roadmap
Semiconductor wafer manufacturing involves hundreds of interdependent process steps. When a defect pattern appears on a wafer map, its spatial arrangement directly indicates which process step or equipment caused the failure. Accurate, automated classification enables faster root-cause analysis, reduces engineer workload, and tightens process control feedback loops — directly impacting yield and cost.
This project builds a configurable deep-learning pipeline on the WM-811K benchmark that goes beyond classification: uncertain predictions are automatically routed to a local VLM which writes a structured briefing card for the human reviewer.
All models were trained on a 70/15/15 stratified split of WM-811K with focal loss and
weighted sampling to handle severe class imbalance (none class × 985 vs Near-Full × 149).
| Model | Params | Macro-F1 | Accuracy |
|---|---|---|---|
| ResNet18 ⭐ | 11 M | 0.9109 | ~97 % |
| ConvNeXt-Tiny | 28 M | — | — |
| ViT-Tiny | 5.7 M | 0.8366 | — |
| SimpleCNN | 0.3 M | — | — |
ResNet18 is the production model used by the review pipeline.
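The two imbalance techniques named above, focal loss and weighted sampling, can be illustrated with a minimal pure-Python sketch. The function names and `gamma=2.0` are assumptions for illustration; the project's own training code in `src/training/` may differ.

```python
import math
from collections import Counter


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is a commonly used value, assumed here for illustration.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def sample_weights(labels):
    """Inverse-frequency weight per sample, so rare classes are drawn more often."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Toy labels: class 0 plays the role of the majority "none" class.
labels = [0] * 8 + [1] * 2
weights = sample_weights(labels)

# A confident correct prediction contributes far less loss than an uncertain one.
easy = focal_loss([0.95, 0.05], target=0)
hard = focal_loss([0.55, 0.45], target=0)
print(f"easy={easy:.5f} hard={hard:.5f}")
print(f"majority weight={weights[0]:.3f} minority weight={weights[-1]:.3f}")
```

With inverse-frequency weights each class carries the same total sampling mass, while the focal term keeps the many easy majority-class samples from dominating the loss.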
| Class | F1 | Notes |
|---|---|---|
| Center | ~0.97 | Strong spatial signal |
| Donut | ~0.95 | Distinctive ring pattern |
| Edge-Ring | ~0.98 | High contrast at edge |
| Edge-Loc | ~0.89 | Confused with Loc |
| Loc | ~0.86 | Hardest class |
| Near-Full | ~0.91 | Rare (149 samples) |
| Random | ~0.90 | Diffuse pattern |
| Scratch | ~0.95 | Linear feature |
| none | ~0.99 | Majority class |
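Review policies are scored with a cost function that charges 1 for every sample sent to review and k for every misclassification that is auto-accepted:
total_cost = |reviewed| + k × |missed_errors|, with k ∈ {1, 5, 10}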
| k | Optimal policy | Cost | vs. baseline (conf=0.85/margin=0.20) |
|---|---|---|---|
| 1 | conf=0.50, margin=OFF | 542 | −503 (−48 %) |
| 5 | conf=0.85, margin=OFF | 1,701 | tied (margin = dead weight) |
| 10 | conf=0.95, margin=OFF | 2,305 | −216 (−9 %) |
Finding: the margin gate never improves on confidence-only gating.
Default is --margin-thresh 0.00 (disabled).
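As a rough illustration of how the cost function above selects a threshold, here is a small sketch for a confidence-only gate. The helper names and the toy data are made up for this example and are not WM-811K results; `scripts/cost_optimization.py` may be organized differently.

```python
def total_cost(confidences, correct, conf_thresh, k):
    """Return |reviewed| + k * |missed_errors| for a confidence-only gate."""
    reviewed = sum(1 for c in confidences if c < conf_thresh)
    missed = sum(
        1 for c, ok in zip(confidences, correct) if c >= conf_thresh and not ok
    )
    return reviewed + k * missed


def best_threshold(confidences, correct, thresholds, k):
    """Pick the threshold with the lowest total cost (first one wins on ties)."""
    return min(thresholds, key=lambda t: total_cost(confidences, correct, t, k))


# Made-up toy data: top-1 confidence and whether the prediction was right.
confidences = [0.99, 0.97, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89, 0.88, 0.80, 0.60, 0.55]
correct = [True, True, True, True, True, True, False, True, True, True, False, True]
thresholds = [0.50, 0.85, 0.95]

for k in (1, 5, 10):
    t = best_threshold(confidences, correct, thresholds, k)
    print(f"k={k}: conf_thresh={t:.2f}, cost={total_cost(confidences, correct, t, k)}")
```

As k grows, a missed error becomes more expensive than a manual review, so the cheapest policy moves toward a stricter confidence threshold.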
git clone https://github.com/yz847zzz/WaferDefectClassifier.git
cd WaferDefectClassifier
python -m venv .venv
# Windows
.venv\Scripts\activate
# Linux / macOS
source .venv/bin/activate
pip install -r requirements.txt
# For VLM triage (Stage 2):
pip install transformers accelerate qwen-vl-utils
cp .env.example .env
# Edit .env: set TORCH_HOME and HF_HOME to a drive with ≥ 15 GB free
# Download LSWMD.pkl from Kaggle and place at data/raw/LSWMD.pkl
# https://www.kaggle.com/datasets/qingyi/wm811k-wafer-map
python scripts/preprocess_data.py --config configs/resnet18.yaml
python scripts/train.py --config configs/resnet18.yaml
# Full pipeline: ResNet18 gate + VLM briefing + HTML report
python scripts/run_pipeline.py
# Fast first-pass — no VLM, 2-column cards
python scripts/run_pipeline.py --disable-vlm
# Layout test — mock VLM, no GPU needed for Stage 2
python scripts/run_pipeline.py --mock-vlm --max-review 50
Open outputs/reports/review_queue.html in any browser.
A 3-stage system that converts raw wafer maps into a priority-sorted HTML review queue.
Input
──────────────────────────────────────────────────────────────────────────
.npy array (N, H, W) float32 | folder of .npy files | test split
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ STAGE 1 — NNFilter (ResNet18 confidence gate) │
│ │
│ Accept condition: top1_prob ≥ conf_thresh (default 0.85) │
│ margin ≥ margin_thresh (default 0.00 = OFF) │
│ │
│ ┌──────────────────────┐ ┌───────────────────────────────┐ │
│ │ AUTO-ACCEPTED │ │ UNCERTAIN → Stage 2 │ │
│ │ 96.6 % of samples │ │ 3.4 % of samples │ │
│ └──────────────────────┘ └───────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
│
▼ (capped at --max-review)
┌───────────────────────────────────────────────────────────────────────┐
│ STAGE 2 — VLMTriage (Qwen2.5-VL-3B-Instruct, local GPU) │
│ │
│ Per sample the VLM receives: │
│ • Color-coded wafer map image (320 × 320 px) │
│ • ResNet18 top-3 predictions + confidence scores │
│ • Flag reason (low confidence / margin) │
│ │
│ VLM writes a structured briefing card: │
│ DESCRIPTION what it sees in the image │
│ PATTERN name of the visible spatial pattern │
│ AGREEMENT agree / partial / disagree with ResNet18 │
│ BEST_CLASS which of the top-3 looks most plausible │
│ RECOMMENDATION accept / review_carefully / reject │
│ REASONING one-sentence justification │
│ │
│ Priority: HIGH (disagree) · MEDIUM (partial) · LOW (agree) │
│ │
│ ⚠ The VLM is ADVISORY ONLY — it does not override ResNet18. │
│ Human reviewers make the final decision. │
└───────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────────┐
│ STAGE 3 — ReportBuilder (self-contained HTML) │
│ │
│ • Priority-sorted queue: HIGH → MEDIUM → LOW │
│ • 3-column cards: [wafer image] [score bars] [VLM briefing] │
│ • All images base64-embedded — open in any browser, offline │
│ • Sticky header: thresholds · counts · timestamp │
└───────────────────────────────────────────────────────────────────────┘
│
▼
outputs/reports/review_queue.html
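The Stage 1 accept condition in the diagram can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and defining the margin as the gap between the top-1 and top-2 probabilities is an assumption, since the diagram does not spell it out.

```python
def gate(probs, conf_thresh=0.85, margin_thresh=0.00):
    """Return (accepted, reason) for one softmax probability vector.

    Assumption: margin = top-1 probability minus top-2 probability.
    """
    ranked = sorted(probs, reverse=True)
    top1, top2 = ranked[0], ranked[1]
    margin = top1 - top2
    if top1 < conf_thresh:
        return False, "low confidence"
    if margin < margin_thresh:
        return False, "margin"
    return True, "auto-accepted"


print(gate([0.90, 0.05, 0.05]))                      # confident prediction
print(gate([0.60, 0.30, 0.10]))                      # below conf_thresh
print(gate([0.88, 0.10, 0.02], margin_thresh=0.80))  # margin gate switched on
```

With `margin_thresh=0.00` the margin check can never fail, which is why that default amounts to confidence-only gating.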
| Flag | Stage 2 | Layout | Use case |
|---|---|---|---|
| (none) | Qwen2.5-VL-3B | 3-column | Production |
| `--mock-vlm` | Placeholder text | 3-column + [MOCK] badge | CI / layout testing |
| `--disable-vlm` | Skipped | 2-column | Fast triage / no GPU |
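For the full mode, the Stage 2 briefing card has to be turned into a priority before Stage 3 can sort the queue. A minimal sketch follows, assuming the VLM returns one `KEY: value` line per field; that output format, the fallback to MEDIUM, and all names below are assumptions for illustration and may not match `vlm_reviewer.py`.

```python
FIELDS = (
    "DESCRIPTION", "PATTERN", "AGREEMENT",
    "BEST_CLASS", "RECOMMENDATION", "REASONING",
)
PRIORITY = {"disagree": "HIGH", "partial": "MEDIUM", "agree": "LOW"}
ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def parse_card(text):
    """Parse 'KEY: value' lines into a dict and attach a review priority."""
    card = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().upper()
        if sep and key in FIELDS:
            card[key] = value.strip()
    # Missing or unrecognized agreement falls back to MEDIUM (assumption).
    agreement = card.get("AGREEMENT", "").lower()
    card["PRIORITY"] = PRIORITY.get(agreement, "MEDIUM")
    return card


# Hypothetical VLM outputs, shortened for the example.
raw_cards = [
    "PATTERN: ring\nAGREEMENT: agree\nRECOMMENDATION: accept",
    "PATTERN: edge arc\nAGREEMENT: disagree\nBEST_CLASS: Edge-Loc",
    "PATTERN: small cluster\nAGREEMENT: partial\nRECOMMENDATION: review_carefully",
]
queue = sorted((parse_card(t) for t in raw_cards), key=lambda c: ORDER[c["PRIORITY"]])
print([c["PRIORITY"] for c in queue])
```

The priority only orders the queue; consistent with the advisory role described above, nothing here changes the ResNet18 prediction.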
python scripts/run_pipeline.py [options]
Input:
--input-npy FILE .npy array (N,H,W) float32
--indices FILE index subset of --input-npy
--input-dir DIR folder of per-sample .npy files
(none) demo mode: preprocessed test split
Thresholds:
--conf-thresh F auto-accept threshold (default 0.85)
--margin-thresh F margin gate (default 0.00 = disabled)
--max-review N VLM sample cap (default 200)
VLM mode:
--disable-vlm skip VLM, ResNet18 scores only
--mock-vlm placeholder cards, no Qwen loading
Output:
--out FILE HTML path (default outputs/reports/review_queue.html)
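For readers who want to wrap or extend the CLI, the options listed above map onto a standard `argparse` definition roughly as follows. This is a sketch that mirrors the documented flags and defaults; the real `run_pipeline.py` may define them differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")

    src = parser.add_argument_group("Input")
    src.add_argument("--input-npy", metavar="FILE", help=".npy array (N,H,W) float32")
    src.add_argument("--indices", metavar="FILE", help="index subset of --input-npy")
    src.add_argument("--input-dir", metavar="DIR", help="folder of per-sample .npy files")

    thr = parser.add_argument_group("Thresholds")
    thr.add_argument("--conf-thresh", type=float, default=0.85)
    thr.add_argument("--margin-thresh", type=float, default=0.00)
    thr.add_argument("--max-review", type=int, default=200)

    vlm = parser.add_argument_group("VLM mode")
    vlm.add_argument("--disable-vlm", action="store_true")
    vlm.add_argument("--mock-vlm", action="store_true")

    out = parser.add_argument_group("Output")
    out.add_argument("--out", metavar="FILE", default="outputs/reports/review_queue.html")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the example testable.
args = build_parser().parse_args(["--mock-vlm", "--max-review", "50"])
print(args.mock_vlm, args.max_review, args.conf_thresh, args.out)
```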
| Script | Purpose |
|---|---|
| `preprocess_data.py` | Parse WM-811K pickle → float32 .npy arrays + stratified splits |
| `train.py` | Training loop with focal loss, LR scheduler, best-F1 checkpointing |
| `evaluate.py` | Per-class metrics, confusion matrix, multi-model comparison plots |
| `error_analysis.py` | Misclassification grids, CSV, markdown report (ResNet18) |
| `review_policy_analysis.py` | Confidence/margin histogram analysis, ROC-style curves |
| `threshold_sweep.py` | 110-config Pareto sweep over (conf, margin) pairs |
| `cost_optimization.py` | Cost-function optimization for k = 1, 5, 10 |
| `run_pipeline.py` | End-to-end: NNFilter → VLMTriage → HTML report |
WaferDefectClassifier/
├── configs/ YAML training configs per model
│ ├── simple_cnn.yaml
│ ├── resnet18.yaml
│ ├── convnext_tiny.yaml
│ └── vit_tiny.yaml
├── data/ raw/, processed/, splits/ (not tracked)
├── scripts/ CLI entry points (see table above)
├── src/
│ ├── datasets/ WaferDataset, CLASS_NAMES, data loading
│ ├── models/ SimpleCNN, ResNet18, ConvNeXt, ViT + factory
│ ├── training/ Trainer, focal loss, Evaluator
│ ├── explainability/ Grad-CAM, attention rollout (stubs)
│ ├── agents/
│ │ └── vlm_reviewer.py LocalVLMReviewer — prompt, call, parse
│ ├── pipeline/
│ │ ├── nn_filter.py Stage 1 — ResNet18 confidence gate
│ │ ├── vlm_triage.py Stage 2 — VLM briefing + MockVLMTriage
│ │ └── report_builder.py Stage 3 — self-contained HTML renderer
│ └── utils/ config, paths, seed, logging, metrics, viz
├── tests/ pytest unit tests (no full dataset needed)
├── .env.example template for TORCH_HOME / HF_HOME
└── requirements.txt
WM-811K contains 811,457 real production wafer maps from TSMC. The labels cover 9 classes: 8 defect pattern types plus none.
| Class | Train count | Notes |
|---|---|---|
| none | ~103 K | Background / no defect |
| Center | ~4.3 K | |
| Donut | ~555 | |
| Edge-Loc | ~5.2 K | |
| Edge-Ring | ~9.7 K | |
| Loc | ~3.6 K | |
| Near-Full | ~104 | Rarest class |
| Random | ~866 | |
| Scratch | ~432 | |
- Go to https://www.kaggle.com/datasets/qingyi/wm811k-wafer-map
- Download `LSWMD.pkl`
- Place at `data/raw/LSWMD.pkl`
- Run `python scripts/preprocess_data.py --config configs/resnet18.yaml`
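The preprocessing step produces stratified 70/15/15 splits. The idea can be sketched in pure Python as below; the function name, the seed, and the rounding rule are assumptions for illustration, and the actual script may rely on a library helper instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split sample indices so each class keeps roughly the same proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


# Toy labels: 100 majority samples and 20 rare ones.
labels = ["none"] * 100 + ["Scratch"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting per class guarantees that even the rarest classes appear in every split, which a purely random split would not ensure.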
All large files must stay on the drive you configure — do not cache to C:.
This project sets TORCH_HOME and HF_HOME from .env before any import:
| Variable | Default | What it caches |
|---|---|---|
| `TORCH_HOME` | `E:/cache/torch` | PyTorch hub weights |
| `HF_HOME` | `E:/cache/huggingface` | Qwen2.5-VL-3B (~7 GB) + tokenizer |
Edit .env to point these at any drive with ≥ 15 GB free.
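The reason these variables must be set before any import is that PyTorch and Hugging Face libraries typically resolve their cache directories when they are first loaded. A minimal sketch of the idea follows; `load_env` is a hypothetical helper written for this example, and the project may use a library such as python-dotenv instead.

```python
import os
import tempfile


def load_env(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ."""
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded


# Demo with a throwaway .env file; a real script would call load_env() first
# and only afterwards import torch or transformers.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# cache locations\n")
        handle.write("TORCH_HOME=E:/cache/torch\n")
        handle.write("HF_HOME=E:/cache/huggingface\n")
    loaded = load_env(env_path)

print(loaded)
```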
Windows note: num_workers=0 is set in all YAML configs (DataLoader worker processes
use the spawn start method on Windows, which is slow to start and error-prone).
| Phase | Status | Feature |
|---|---|---|
| 1 | ✅ | Project skeleton, config utilities, dataset loader |
| 2 | ✅ | Model zoo: SimpleCNN · ResNet18 · ConvNeXt-Tiny · ViT-Tiny |
| 3 | ✅ | Training loop, checkpointing, LR scheduler |
| 4 | ✅ | Evaluation: confusion matrix, per-class metrics, multi-model comparison |
| 5 | ✅ | Class imbalance: focal loss + weighted sampler |
| 6 | ✅ | Error analysis: misclassification grids, CSV, markdown report |
| 7 | ✅ | Cost-sensitive review policy + 110-config threshold sweep |
| 8 | ✅ | End-to-end pipeline: NNFilter → VLMTriage → HTML report |
| 9 | 🔲 | Grad-CAM / attention rollout for explainability |
| 10 | 🔲 | Similar-case retrieval (embedding + ANN index) |
| 11 | 🔲 | Active-learning loop: reviewer corrections → retraining |
| 12 | 🔲 | Diffusion-based rare-class augmentation |
| 13 | 🔲 | Transfer to SEM / optical defect inspection |
MIT — see LICENSE for details.