Meta released Llama 4.5 as open weights under the standard Llama community license, continuing one of the stranger long-term strategic bets in the industry: spend billions training frontier-adjacent models, then put them on Hugging Face. The architecture is a 600B-parameter sparse mixture-of-experts with roughly 17B active per token, which is the number that actually decides what hardware can serve it. On a representative slice of reasoning and code evals, 4.5 lands within striking distance of the closed flagships. Closer than any open-weight release to date.
The deployment implication is where this stops being a benchmark conversation and starts being a budget conversation. 17B active parameters means a single 8x H200 node serves the model at production latency without aggressive quantization. That is the exact inflection a lot of regulated buyers have been waiting on: open weights, frontier-adjacent quality, single-node inference, on-prem feasible. Expect the next 60 days to be heavy on private-cloud announcements from vendors who have spent two years getting blocked by their customers’ data-residency lawyers.
The open question is fine-tuning ergonomics. Sparse MoE fine-tuning is still rougher than dense models in the OSS tooling stack, and that gap will decide whether 4.5 displaces dense Llama 3.x in production workloads or just stays a frontier-curiosity download for people who like watching their GPU fans spin up.