# Transformer pretraining for log anomaly detection LogBERT trains **BERT-style masked-language-modeling on log sequences** and uses the resulting representations for unsupervised anomaly scoring. The closest published example of "BERT, but for host telemetry." ## What we borrowed - **The transformer entry in our model comparison.** LogBERT is the citation for why a transformer is even in the model lineup on scene 9 — it shows that attention over moderate-length log windows has enough signal to separate normal from anomalous *without* per-anomaly labels. - **Pretraining + fine-tune split.** Their two-stage setup (self-supervised pretrain on benign logs, downstream classifier on labeled anomalies) is the template we follow when describing the BERT model's training story on the *training-code* scene. ## Where it differs - Logs are categorical (template tokens); our windows are dense float vectors (12 channels × 100 samples). The BERT we run is the same architecture but reads continuous-valued tokens, so the masking objective is regression-on-masked-channels rather than cross-entropy-on-masked-token.