# Transformer pretraining for log anomaly detection

LogBERT trains **BERT-style masked-language-modeling on log
sequences** and uses the resulting representations for unsupervised
anomaly scoring. The closest published example of "BERT, but for
host telemetry."

## What we borrowed

- **The transformer entry in our model comparison.** LogBERT is the
  citation for why a transformer is even in the model lineup on
  scene 9 — it shows that attention over moderate-length log windows
  has enough signal to separate normal from anomalous *without*
  per-anomaly labels.
- **Pretraining + fine-tune split.** Their two-stage setup
  (self-supervised pretrain on benign logs, downstream classifier
  on labeled anomalies) is the template we follow when describing
  the BERT model's training story on the *training-code* scene.

## Where it differs

- Logs are categorical (template tokens); our windows are dense
  float vectors (12 channels × 100 samples). The BERT we run is the
  same architecture but reads continuous-valued tokens, so the
  masking objective is regression-on-masked-channels rather than
  cross-entropy-on-masked-token.