Privacy
GPT-2 generates real-looking PII on 1 in 5 prompts: the other models are clean
Zero memorisation across all models: the risk is generative fabrication, not training-data leakage
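A minimal sketch of how the memorisation check behind this finding might work, under one stated assumption: `training_corpus` is a hypothetical stand-in for an index of training-data strings. A generated PII string counts as leakage only if it appears verbatim in the training data; anything else is fabricated, i.e. generative risk.

```python
def classify_pii_risk(generated_pii, training_corpus):
    """Split PII-looking strings from model output into leaked
    (verbatim in training data) vs fabricated (generated) strings.

    training_corpus is a hypothetical set of training-data strings.
    """
    leaked = [p for p in generated_pii if p in training_corpus]
    fabricated = [p for p in generated_pii if p not in training_corpus]
    return leaked, fabricated

# Zero overlap with training data -> the risk is generative, not leakage
leaked, fabricated = classify_pii_risk(
    ["jane.doe@example.com", "555-0142"],
    training_corpus={"real.person@corp.com"},
)
```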
Fairness
Sexual orientation bias is the worst blind spot: all models score below 0.25
Every model scores below random chance in this category
Robustness
BLOOM and OPT collapse on typos: the GPT-2 family holds steady
A single character swap is enough to derail BLOOM completely
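The single-character-swap probe referred to above can be sketched as a minimal typo perturbation; the function name and seeding are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

def swap_one_char(text, seed=0):
    """Swap two adjacent characters at a seeded random position:
    a one-typo perturbation that brittle models fail to absorb."""
    rng = random.Random(seed)
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

# Robustness test: compare model output on the clean vs perturbed prompt
perturbed = swap_one_char("The capital of France is")
```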
Transparency
Only 2 of 5 models disclose carbon footprint: the hardest criterion to meet
All models pass the other 6 criteria: carbon disclosure is the one consistent gap
Explainability
BLOOM focuses attribution on fewer, sharper tokens than any other model
Higher Gini = more concentrated attribution = more interpretable decisions
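The Gini-based reading of attribution concentration can be sketched with the standard Gini coefficient over non-negative token-attribution scores; the example weights below are illustrative, not taken from the evaluation.

```python
def gini(weights):
    """Gini coefficient of non-negative attribution scores:
    0 = perfectly uniform, values near 1 = mass on few tokens."""
    xs = sorted(weights)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cum / (n * total) - (n + 1) / n

# Concentrated attribution yields a higher Gini than uniform attribution
gini([0.9, 0.05, 0.03, 0.02])   # sharp focus on one token
gini([0.25, 0.25, 0.25, 0.25])  # uniform spread -> 0.0
```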
Overview
No model excels across all pillars: every profile has a visible gap
Robustness and fairness drag every model below 0.5; transparency and privacy are where scores recover