Harness Engineering: Research References¶
A complete citation record for the ten sources documented in Harness Engineering: A Field Converges. All ten primary sources were published between February and April 2026; supporting research includes earlier work.
Primary Sources¶
1. OpenAI — Engineering Practice¶
Lopopolo, R. (February 11, 2026). Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI Blog. https://openai.com/index/harness-engineering
Lopopolo, R. (April 7, 2026). Token Billionaire Life: Harness Engineering for Dark Factories. Latent Space / AI Engineer Podcast. https://youtu.be/CeOXx-XTYek
2. Anthropic — Model Performance Research¶
Rajasekaran, P. (March 24, 2026). Harness Design for Long-Running Application Development. Anthropic Engineering Blog. https://www.anthropic.com/engineering/harness-design-long-running-apps
3. Google DeepMind — Automated Harness Synthesis¶
Lou, X., Lázaro-Gredilla, M., Dedieu, A., Wendelken, C., Lehrach, W., & Murphy, K. P. (March 5, 2026). AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. arXiv:2603.03329. https://arxiv.org/abs/2603.03329
4. Stanford — Automated Harness Optimization¶
Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (March 30, 2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052v1. https://arxiv.org/abs/2603.28052
5. Shenzhen/Tsinghua — Academic Formalization¶
Pan, L., Zou, L., Guo, S., Ni, J., & Zheng, H. (March 26, 2026). Natural-Language Agent Harnesses. arXiv:2603.25723v1. Shenzhen International Graduate School, Tsinghua University & Harbin Institute of Technology. https://arxiv.org/abs/2603.25723
6. Hanyang University — Governance Framework¶
Kim, J. (March 23, 2026). Harness Engineering: A Governance Framework for AI-Driven Software Engineering. Preprint. Hanyang University, Seoul, Republic of Korea. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372119
Note: Scheduled for conference presentation; conference details to be confirmed, and the citation updated when they are available.
7. OpenDev — Production System Documentation¶
Bui, N. (March 13, 2026). Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned. arXiv:2603.05344v3. https://arxiv.org/abs/2603.05344
8. Microsoft — Fleet Governance at Scale¶
Abdul Aziz, S. (April 5, 2026). How We Build and Use Azure SRE Agent with Agentic Workflows. Microsoft Apps on Azure Blog. https://techcommunity.microsoft.com/blog/appsonazureblog/how-we-build-and-use-azure-sre-agent-with-agentic-workflows/4508753
9. LangChain — Product and Framework Perspective¶
Trivedy, V. (March 10, 2026). The Anatomy of an Agent Harness. LangChain Blog. https://blog.langchain.com/the-anatomy-of-an-agent-harness/
10. Red Hat — Developer Field Experience¶
Rizzi, M. (April 7, 2026). Harness Engineering: Structured Workflows for AI-Assisted Development. Red Hat Developer Blog. https://developers.redhat.com/articles/2026/04/07/harness-engineering-structured-workflows-ai-assisted-development
Supporting Research¶
Mobile Application Development — Harness Performance Quantification¶
Tian, M., Wang, Z., Yang, B., Tang, Z., Zhu, K., Dong, H., Li, H., Xie, X., Wang, G., & You, J. (February 2026). SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications? arXiv:2602.09540v1. https://arxiv.org/abs/2602.09540
Note: This paper is the source of the 6× performance gap finding cited in Movement 1: changing the harness around a fixed language model can produce a 6× performance gap on the same benchmark. The finding was cited by the Stanford Meta-Harness team (reference 4 above) and was widely misattributed in press coverage to Stanford. The quantification belongs to this mobile application development research team.
Mechanistic Interpretability — Theoretical Foundation¶
Anthropic. (2025). On the Biology of a Large Language Model. Anthropic Interpretability Research.
Note: This research informed the coalition drift theoretical framework documented in Movement 3. The poetry experiment and feature activation analysis provided the mechanistic basis for pre-inference tonal monitoring. The inference that tonal coherence is mechanistically prior to semantic coherence was drawn from direct feature-level analysis of the poetry experiment data, not from the paper's own stated conclusions.
TONE Experiment Repositories¶
ArchieCur, Claude Sonnet & Claude Code (Anthropic). (2026). TONE Agent — Runs 1-7. GitHub. https://github.com/ArchieCur/tone_agent
ArchieCur, Claude Sonnet & Claude Code (Anthropic). (2026). TONE Agent Neighborhoods — Runs 8-13. GitHub. https://github.com/ArchieCur/tone_agents_neighborhoods
AI System Design Curriculum¶
ArchieCur & Claude Sonnet (Anthropic). (2025-2026). AI System Design Documentation. GitHub Pages. MIT License. https://archiecur.github.io/ai-system-design
Citation Notes¶
On the 6× performance gap: This finding originates in the SWE-Bench Mobile paper and was cited by the Stanford Meta-Harness team as evidence that harness choice produces large performance differences on the same benchmark. Press coverage widely attributed the finding to Stanford; the correct attribution is to Tian et al.
On the Kim governance framework: The SSRN preprint link is confirmed. Conference presentation details to be added when available.
On authorship convention: TONE experiment repositories are credited to ArchieCur, Claude Sonnet, and Claude Code (Anthropic) as a three-way collaborative work. The AI System Design curriculum is credited to ArchieCur and Claude Sonnet (Anthropic), consistent with the authorship model documented throughout the curriculum.
References compiled April 2026. AI System Design Curriculum — Harness Engineering Section. ArchieCur & Claude Sonnet 4.6 (Anthropic).