Thoughts on Open Models This Week
America needs to step up its Open Model game.

I wrote some thoughts about what I was reading this week (8/8/25, republished much later because we didn’t have a blog yet). These are my opinions, not a reflection of my employer’s views, even if they circulate them because I asked nicely.
Introduction
I will assume you know what a model is, what open-source means, and the difference between open, open-weight, and closed models. If not, the FTC’s July 2024 explainer is a useful primer.
I am not here to make grand proclamations that open-source models are poised to disrupt or match closed-source incumbents like GPT-4, Claude, or Gemini. I don’t think that’s true. The companies behind these frontier closed models have achieved near full vertical integration, and this quasi-monopolistic control is precisely why competitive, open alternatives are essential. If we value functioning (and free?!) markets, then we should care about having open-source alternatives.
But who’s building those alternatives? Certainly not the United States.
I. Open Models Are Important
I think American open models suck, and gpt-oss has not changed my opinion.
Up until this week, the trending models on Hugging Face were dominated by Chinese releases: DeepSeek, Qwen, Kimi, and their derivatives. This leaderboard doesn’t measure quality directly, but it does signal developer sentiment; hence my opinion above.
The launch of the American Truly Open Models (ATOM) Project this week mirrors my position (albeit politely). ATOM’s framing is clear: the U.S. once led the world in open AI research thanks to the combined strength of universities, national labs, and private research. This ecosystem produced much of the foundational work closed providers now monetize. But if other countries, especially geopolitical rivals, lead in open models, U.S. capabilities could be throttled by a handful of domestic, closed providers. In the face of a high-stakes confrontation, this consolidation poses a national security risk. ATOM’s call to invest in open AI is a call to shore up America’s AI security posture. As they note, a closed-only ecosystem risks replicating the worst dynamics of other defense-industrial complexes but for intelligence infrastructure.
II. gpt-oss is an Underwhelming Half-Step Forward
To me, OpenAI’s recent release of gpt-oss was merely a gesture toward openness: symbolically significant, but strategically less so. The model’s early evaluations suggest it’s not competitive with the best Chinese open models. Even without full third-party benchmarks (my feelings on the quality of most benchmarks aside), we can compare available performance data from model cards, Vectara, and Hugging Face.
Estimation note: DeepSeek-R1-0528 reports LiveCodeBench 73.3 and SWE-Verified 57.6 (ratio ≈ 0.79). Qwen2.5-72B reports LiveCodeBench 55.5; applying the same ratio yields ~43–44 pass@1. This is a clearly labeled heuristic, not a number reported by the model’s owners, and it matches practitioner expectations that LiveCodeBench→SWE-Verified transfer is imperfect.
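To make the heuristic concrete, here is the arithmetic spelled out as a tiny script. The benchmark figures are the ones quoted above; the function name is mine and nothing any vendor publishes.

```python
# Back-of-the-envelope estimate from the note above: scale a model's LiveCodeBench
# score by the LCB -> SWE-Verified ratio observed for DeepSeek-R1-0528.
# Illustrative only; not a vendor-reported number.

def estimate_swe_verified(lcb_score: float,
                          ref_lcb: float = 73.3,    # DeepSeek-R1-0528 LiveCodeBench
                          ref_swe: float = 57.6) -> float:  # DeepSeek-R1-0528 SWE-Verified
    ratio = ref_swe / ref_lcb                        # ~0.79
    return lcb_score * ratio

print(round(estimate_swe_verified(55.5), 1))         # Qwen2.5-72B: 55.5 -> ~43.6 pass@1
```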
Gpt-oss looks pretty damn good, right? Looks can be deceiving...
I wanted to give you a table that captures what I believe are three key characteristics for comparing gpt-oss to the most popular open models, since these were the stated strengths of gpt-oss:
- “Can it reason about science?” (GPQA/GPQA-Diamond)
- “Can it fix real software reliably?” (SWE-Verified)
- “Does it fabricate in standardized summarization tests?” (HHEM)
From the tests I’ve outlined, DeepSeek-R1 sits on top for hard science reasoning, while Kimi-Dev is a coding specialist with a strong SWE-Verified score and a very low summarization hallucination rate. Gpt-oss-120B holds its own on GPQA-Diamond and SWE-Verified. But I thought it was important to highlight the hallucination measurements, because my initial reaction to gpt-oss’ model card led me down a rabbit hole about how hallucination rates are reported and assessed, and what those numbers imply.
The metrics OpenAI provides in gpt-oss’ model card were measured with the model not connected to the internet, also known as non-tooled or retrieval-less operation. When you compare the hallucination rates of retrieval-less gpt-oss to the rates of gpt-oss with tools, the contrast is striking!
The reported retrieval-less gpt-oss-120B hallucination rates are 78.2% and 49.1% (accuracy 16.8% and 29.8%) under two evaluation methods. That is an order of magnitude worse than what tool-connected tests show. While connecting gpt-oss to the internet mitigates its egregious hallucination rates, its internet-less hallucination rates are still unacceptable.
I’ll outline three scenarios wherein enterprises might disconnect gpt-oss from the internet:
1. Tool brittleness: if production retrieval, web-browsing, or calculator tools fail through network errors, rate limits, or sandboxing, the model reverts to internal knowledge. If the base model has high QA hallucination without tools, you inherit brittle failure modes (see the sketch after this list).
2. Cost and latency budgets: teams often disable RAG on low-value calls to save tokens/latency. Your risk then is contingent on the base model’s retrieval-less reliability.
3. Security and compliance: secure environments sometimes forbid external calls.
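To make the failure mode concrete, here is a minimal sketch of the fallback path these scenarios share. `search_web` and `chat` are hypothetical stand-ins for a retrieval tool and an inference client, not any specific vendor API; the point is that whenever the tool call fails or is skipped, answer quality drops to the base model’s retrieval-less floor.

```python
# Sketch of the tool-brittleness scenario: when retrieval fails (timeout, rate limit,
# sandbox denial) or is disabled to save cost/latency, the request silently falls back
# to the model's internal knowledge. Both helpers below are hypothetical placeholders.

def search_web(query: str) -> str:
    """Placeholder retrieval tool; a real deployment would call a search or RAG backend."""
    raise TimeoutError("simulated network failure")

def chat(model: str, prompt: str) -> str:
    """Placeholder chat-completion call; swap in your actual inference client."""
    return f"[{model}] answer to: {prompt[:60]}"

def answer(question: str, use_retrieval: bool = True) -> str:
    context = ""
    if use_retrieval:
        try:
            context = search_web(question)   # may raise on network error / 429 / sandboxing
        except Exception:
            context = ""                     # brittle failure: revert to internal knowledge
    prompt = f"Context:\n{context}\n\nQuestion: {question}" if context else question
    # With no context, factuality is bounded by the model's retrieval-less
    # hallucination rate: the performance floor discussed below.
    return chat(model="gpt-oss-120b", prompt=prompt)

print(answer("Who maintains the HHEM leaderboard?"))
```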
The reality is that retrieval-less truthfulness is an important performance floor. When tool use is taken away, gpt-oss collapses on factuality by OpenAI’s own disclosure. Sure, OpenAI offers fine-tuning via Hugging Face Transformers, but fine-tuning can increase hallucinations if done poorly, and with a base hallucination rate far above its peers, the margin for error is nontrivial. So much so that it would not make sense for a company to deploy gpt-oss over existing Chinese open models.
Why bother releasing this model? My guess is that gpt-oss was designed not to cannibalize ChatGPT: it is a political and performative product, not a competitive one. Gpt-oss appeases calls for openness without offering an alternative that could sway customers away from OpenAI’s paid platform. The opportunity cost here is real: resources spent on gpt-oss could have advanced genuinely competitive open research.
III. Bigger Isn’t Always Better
A fun paper from NVIDIA and Georgia Tech surfaced on my Twitter feed this week. These researchers make a strong case for Small Language Models (SLMs) as a viable future for otherwise LLM-based systems*. Their core argument boils down to:
- SLMs are sufficiently powerful to handle language modeling tasks in agentic applications
- SLMs are inherently more operationally suitable for agentic systems than LLMs
- SLMs are necessarily more economical for the vast majority of language model uses in agentic systems
Lately I’ve found myself talking more and more about the benefits of edge fine-tuning (LoRA and similar techniques) for domain-specific deployments, and this paper reinforces my priors. Put simply, you don’t need a 70B-parameter generalist to solve most specialized problems. The paper outlines the tremendous savings in memory, compute, and energy that come with moving from LLM generalists to SLM specialists.
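For concreteness, here is a minimal sketch of the kind of edge fine-tuning I mean, using Hugging Face’s `peft` library. The base checkpoint and hyperparameters are illustrative placeholders, not recommendations, and the training loop itself is omitted.

```python
# Minimal LoRA sketch of the "SLM specialist" idea: adapt a small open model to a
# domain with a low-rank adapter instead of full fine-tuning. Model name and
# hyperparameters are illustrative placeholders, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "Qwen/Qwen2.5-1.5B-Instruct"           # any small open checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()            # typically well under 1% of the base weights
# From here, train with your usual Trainer and dataset; only the adapter weights
# update, which is what makes domain specialization cheap enough for labs and startups.
```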
This accessibility matters; smaller models can be trained, adapted, and deployed by academic labs, startups, and individuals without hyperscaler budgets. This could be how the U.S. wins in the open model marketplace: building an ecosystem of specialized, efficient, interoperable models rather than chasing China’s massive generalists.
IV. Where This Leaves the U.S.
We’re already confronting the reality that America is lagging in open-weight model competitiveness. Gpt-oss doesn’t change that. But we have advantages in infrastructure, tooling, and ecosystem depth that remain underleveraged.
The strategic question isn’t “How do we out-China China?”; it’s “What would an AI ecosystem optimized for American strengths look like?” The answer may be smaller, more specialized, and more distributed: a network of models whose collective capability exceeds any single behemoth.
If the ATOM Project can align government investment with private innovation in specialized frameworks, the U.S. could define the future of open AI on its own terms.
* I should note the authors of this paper discuss SLM viability for agentic AI systems, but their research is applicable to my position as described. I think agents, and the emphasis on model agency as it exists today, are an overhyped abstraction layered over current LLM limitations. Until we figure out how to have unreasonably large context windows that don't break the bank, "agents" are mostly elaborate prompt engineering with extra steps.