Harness Engineering: Research References

A complete citation record for the ten sources documented in Harness Engineering: A Field Converges. All ten primary sources were published between February and April 2026.


Primary Sources

1. OpenAI — Engineering Practice

Lopopolo, R. (February 11, 2026). Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI Blog. https://openai.com/index/harness-engineering

Lopopolo, R. (April 7, 2026). Token Billionaire Life: Harness Engineering for Dark Factories. Latent Space / AI Engineer Podcast. https://youtu.be/CeOXx-XTYek?si=wJs1yj_u9g-lOPQr


2. Anthropic — Model Performance Research

Rajasekaran, P. (March 24, 2026). Harness Design for Long-Running Application Development. Anthropic Engineering Blog. https://www.anthropic.com/engineering/harness-design-long-running-apps


3. Google DeepMind — Automated Harness Synthesis

Lou, X., Lázaro-Gredilla, M., Dedieu, A., Wendelken, C., Lehrach, W., & Murphy, K. P. (March 5, 2026). AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness. arXiv:2603.03329. https://arxiv.org/pdf/2603.03329


4. Stanford — Automated Harness Optimization

Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (March 30, 2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052v1. https://arxiv.org/html/2603.28052v1


5. Shenzhen/Tsinghua — Academic Formalization

Pan, L., Zou, L., Guo, S., Ni, J., & Zheng, H. (March 26, 2026). Natural-Language Agent Harnesses. arXiv:2603.25723v1. Shenzhen International Graduate School, Tsinghua University & Harbin Institute of Technology.


6. Hanyang University — Governance Framework

Kim, J. (March 23, 2026). Harness Engineering: A Governance Framework for AI-Driven Software Engineering. Preprint. Hanyang University, Seoul, Republic of Korea. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6372119

Note: Scheduled for conference presentation — conference details to be confirmed. Citation should be updated when available.


7. OpenDev — Production System Documentation

Bui, N. (March 13, 2026). Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned. arXiv:2603.05344v3. https://arxiv.org/pdf/2603.05344v3


8. Microsoft — Fleet Governance at Scale

Abdul Aziz, S. (April 5, 2026). How We Build and Use Azure SRE Agent with Agentic Workflows. Microsoft Apps on Azure Blog. https://techcommunity.microsoft.com/blog/appsonazureblog/how-we-build-and-use-azure-sre-agent-with-agentic-workflows/4508753


9. LangChain — Product and Framework Perspective

Trivedy, V. (March 10, 2026). The Anatomy of an Agent Harness. LangChain Blog. https://blog.langchain.com/the-anatomy-of-an-agent-harness/


10. Red Hat — Developer Field Experience

Rizzi, M. (April 7, 2026). Harness Engineering: Structured Workflows for AI-Assisted Development. Red Hat Developer Blog. https://developers.redhat.com/articles/2026/04/07/harness-engineering-structured-workflows-ai-assisted-development


Supporting Research

Mobile Application Development — Harness Performance Quantification

Tian, M., Wang, Z., Yang, B., Tang, Z., Zhu, K., Dong, H., Li, H., Xie, X., Wang, G., & You, J. (February 2026). SWE-Bench Mobile: Can Large Language Model Agents Develop Industry-Level Mobile Applications? arXiv:2602.09540v1.

Note: This paper is the source of the 6× performance gap finding cited in Movement 1 — that changing the harness around a fixed language model can produce a 6× performance gap on the same benchmark. The finding was cited by the Stanford Meta-Harness team [reference 4 above] and was widely misattributed in press coverage to Stanford. The quantification belongs to this mobile application development research team.


Mechanistic Interpretability — Theoretical Foundation

Anthropic. (2025). On the Biology of a Large Language Model. Anthropic Interpretability Research.

Note: This research informed the coalition drift theoretical framework documented in Movement 3. The poetry experiment and feature activation analysis provided the mechanistic basis for pre-inference tonal monitoring. The inference that tonal coherence is mechanistically prior to semantic coherence is this curriculum's own: it was drawn from direct feature-level analysis of the poetry experiment data, not from the paper's conclusions.


TONE Experiment Repositories

ArchieCur, Claude Sonnet & Claude Code (Anthropic). (2026). TONE Agent — Runs 1-7. GitHub. https://github.com/ArchieCur/tone_agent

ArchieCur, Claude Sonnet & Claude Code (Anthropic). (2026). TONE Agent Neighborhoods — Runs 8-13. GitHub. https://github.com/ArchieCur/tone_agents_neighborhoods


AI System Design Curriculum

ArchieCur & Claude Sonnet (Anthropic). (2025-2026). AI System Design Documentation. GitHub Pages. MIT License. https://archiecur.github.io/ai-system-design


Citation Notes

On the 6× performance gap: This finding originates in the SWE-Bench Mobile paper and was cited by the Stanford Meta-Harness team as evidence that harness choice produces large performance differences on the same benchmark. Press coverage widely attributed this finding to Stanford. The correct attribution is to Tian et al.

On the Kim governance framework: The SSRN preprint link is confirmed. Conference presentation details to be added when available.

On authorship convention: TONE experiment repositories are credited to ArchieCur, Claude Sonnet, and Claude Code (Anthropic) as a three-way collaborative work. The AI System Design curriculum is credited to ArchieCur and Claude Sonnet (Anthropic), consistent with the authorship model documented throughout the curriculum.


References compiled April 2026. AI System Design Curriculum — Harness Engineering Section. ArchieCur & Claude Sonnet 4.6 (Anthropic).