Twitter AI Agent - 2026-05-18

1. What People Are Talking About

1.1 Harness engineering became the shared language for reliable agents 🡕

Harness engineering was still the day's clearest center of gravity, but the evidence shifted from scattered tips into public curricula, design guides, and repeatable operating loops. The strongest posts treated agents as systems that need state, verification, observability, recovery, permissions, and ownership, not as smarter prompts. Compared with May 17, when harness talk already centered on courses and production loops, May 18 broadened into more explicit design artifacts and enterprise rollout roles.

@_vmlops shared (2,034 likes, 11 replies, 119,999 views, 3,479 bookmarks) the Learn Harness Engineering course. The page says it teaches environment design, state management, verification, observability, and control systems for Codex- and Claude Code-style agents; the screenshot matters because its lecture list is organized around concrete failure modes: continuity loss, premature victory, missing observability, and end-to-end tests changing results.

Learn Harness Engineering course page showing lectures on continuity loss, premature victory, observability, and reusable templates

@shannholmberg described (211 likes, 20 replies, 18,147 views, 327 bookmarks) a Hermes prototype-to-production workflow: run a specialist inside a main orchestrator 4-10 times, correct drift, let skills emerge, optionally tighten in Claude Code or Cursor, then deploy as a dedicated Dockerized agent. The image adds the operational details: a clean run gets ported into a specialist container with scoped credentials, scoped memory, and production readiness.

Hermes workflow diagram showing prototype, 4-10 correction iterations, optional fine-tuning, and deployment into a specialist container

@_vmlops also posted (139 likes, 5,141 views, 124 bookmarks) a "Harness Engineering: A Design Guide to Claude Code" cover. Its visual frame - control plane, loop, recovery, permissions, interrupts, verification, and "System first, model second" - is consistent with the course's public claim that harnesses make agents reliable by constraining behavior and making the runtime debuggable.

Harness Engineering design guide cover listing control plane, loop, recovery, permissions, interrupts, and verification

@dani_avila7 argued (52 likes, 4 replies, 21,387 views, 58 bookmarks) that enterprise Claude Code rollouts fail when "the model gets better, the setup doesn't, nobody owns it." The image names an emerging Agent Manager role that owns Claude Code configuration, permissions policy, plugin marketplace, CLAUDE.md conventions, and keeping settings current.

Claude Code rollout slide defining an Agent Manager role that owns configuration, permissions, plugin marketplace, and CLAUDE.md conventions

Discussion insight: The best pushback was about drift and governance. Under the Hermes workflow, @cdiamond warned that iteration-three drift corrections can disappear when production inputs no longer match the test brief. Under @code_rams's seven-component harness post (5 likes, 4 replies, 415 views, 14 bookmarks), a reply added "spec gating": planners should propose spec changes, not silently rewrite the definition of done.
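The "spec gating" idea from that reply can be sketched in a few lines. This is a minimal illustration, not code from any cited harness: the `SpecGate` class, its method names, and the `agent-manager` approver are all hypothetical. The point it shows is that a planner can queue a change to the definition of done, but the active spec only changes on an explicit, named approval.

```python
from dataclasses import dataclass, field


@dataclass
class SpecGate:
    """Holds the definition of done; planners can only propose changes to it."""

    done_criteria: tuple[str, ...]
    proposals: dict[int, tuple[str, ...]] = field(default_factory=dict)
    next_id: int = 0

    def propose(self, new_criteria: list[str]) -> int:
        """Planner-facing: queue a replacement spec without applying it."""
        proposal_id = self.next_id
        self.next_id += 1
        self.proposals[proposal_id] = tuple(new_criteria)
        return proposal_id

    def approve(self, proposal_id: int, approver: str) -> None:
        """Owner-facing: the spec changes only on an explicit, named approval."""
        if not approver:
            raise PermissionError("spec changes need a named approver")
        self.done_criteria = self.proposals.pop(proposal_id)


gate = SpecGate(done_criteria=("all tests pass", "docs updated"))

# The planner tries to drop a criterion; the active spec stays untouched.
proposal_id = gate.propose(["all tests pass"])
unchanged = gate.done_criteria

# Only the (hypothetical) owner's approval rewrites the definition of done.
gate.approve(proposal_id, approver="agent-manager")
print(unchanged, "->", gate.done_criteria)
```

The design choice is that `propose` never touches `done_criteria`, so a silent rewrite is impossible by construction rather than by convention.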

Comparison to prior day: May 17 framed harness engineering as curriculum plus deployment practice. May 18 made it more institutional: guidebooks, rollout phases, agent-manager ownership, and spec gates.

1.2 Agent marketplaces and operator layers moved from concept to product claims 🡕

A second cluster treated the next agent bottleneck as selection, coordination, and settlement rather than model capability. LobeHub pitched a Chief Agent Operator, AgenC described a runtime plus marketplace, x402/x402B posts framed escrow as necessary trust infrastructure, and TermiX published onchain marketplace metrics. The common claim was that autonomous agents need rails for who does work, how it is paid for, and how results are reviewed.

@lobehub introduced (224 likes, 60 replies, 60 quotes, 422,145 views, 104 bookmarks) a Chief Agent Operator that "hires agents from a 273K-skill marketplace," runs them 24/7 in the cloud, and reports through messaging apps. A reply from @CatCatBros said Discord connection was "dead simple," and LobeHub asked what IM channel should be next, which made distribution through existing work chat part of the product story.

@tetsuoai said (86 likes, 9 replies, 3,976 views, 15 bookmarks) AgenC is a full agent operating layer: local daemon, gateway, CLI, TUI, dashboard, approvals, memory, observability, workflow execution, Solana-backed coordination, marketplace rails, and ZK/prover systems. The screenshot shows a terminal with orchestrator identity, wallet, stake, reputation, slashing, permission mode, and a /claim marketplace flow.

AgenC terminal showing orchestrator identity, wallet, stake, reputation, slashing, permissions mode, and marketplace claim prompt

@Onyx_goose framed (57 likes, 48 retweets, 5 replies, 129 views) x402 as "HTTPS for agents" and x402B escrow as the SSL-certificate-like trust layer for agent commerce. The image makes the mechanism explicit: conditional escrow should release funds only after both autonomous parties satisfy pre-agreed terms and those terms are verified.

x402 and x402B diagram presenting payment rails and conditional escrow for agent commerce
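The conditional-escrow mechanism described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the x402 or x402B API: `ConditionalEscrow`, `EscrowState`, the agent names, and the term strings are all hypothetical. It shows funds staying locked until every pre-agreed term has been marked as verified.

```python
from dataclasses import dataclass, field
from enum import Enum


class EscrowState(Enum):
    FUNDED = "funded"
    RELEASED = "released"


@dataclass
class ConditionalEscrow:
    payer: str
    payee: str
    amount_cents: int  # integer units avoid float rounding on money
    terms: frozenset[str]  # conditions both parties agreed to up front
    verified: set[str] = field(default_factory=set)
    state: EscrowState = EscrowState.FUNDED

    def mark_verified(self, term: str) -> None:
        """Record that one agreed term has passed verification."""
        if term not in self.terms:
            raise KeyError(f"{term!r} is not one of the agreed terms")
        self.verified.add(term)

    def settle(self) -> EscrowState:
        """Release funds only once every agreed term is verified."""
        if self.state is EscrowState.FUNDED and self.verified == self.terms:
            self.state = EscrowState.RELEASED
        return self.state


escrow = ConditionalEscrow(
    payer="buyer-agent",
    payee="worker-agent",
    amount_cents=5_000,
    terms=frozenset({"artifact delivered", "tests verified"}),
)

escrow.mark_verified("artifact delivered")
early = escrow.settle()  # one term still open, so funds stay locked

escrow.mark_verified("tests verified")
final = escrow.settle()  # all terms verified, so funds release
print(early.value, "->", final.value)
```

A real escrow would also need timeouts, refunds, and a trusted verifier; the sketch leaves those out to keep the release condition visible.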

@termix_ai reported (43 likes, 4 replies, 8,130 views) that its AACP on BNB Chain Testnet had settled $2,310,330 in cumulative agent-commerce volume, with 8,194 total agents, 9,290 A2A jobs, 374 average new jobs per day, and $741,606 locked. @clawdbotatg claimed (77 likes, 99 replies, 3,384 views) 141 contracts deployed, $300K+ onchain, 1.26B $CLAWD burned, and six governance votes decided by mini agents trained by stakers; a reply challenged whether 11.3B staked meant real usage, and the author answered that "the work is the delegation, not the lock-up."

Discussion insight: The highest-signal replies focused on trust and work verification, not hype. Under AgenC, @TheOnChainDev asked whether native agent-to-agent payments were handled or still on the roadmap, while the x402 replies framed escrow as useful because autonomous counterparties may not know each other.

Comparison to prior day: May 17's commerce theme focused on identity, governance, payments, verification, and user control as missing rails. May 18 added more concrete attempts to ship those rails: CAO marketplaces, Solana runtime settlement, x402B escrow, and AACP usage figures.

1.3 Builders are packaging skills, memory, and multi-agent coordination 🡕

The builder cluster was less about new foundation models and more about making agent work reusable and traceable. OpenSkills, Mimeo, Mercury Agent, Gas Town, CopilotKit Threads, and Pentest Agents all pointed at the same direction: package capability into skills, persist state, coordinate multiple agents, and keep work auditable.

@mercury__agent said (176 likes, 9 replies, 5 quotes, 1,040 views) Mercury is moving beyond "chat with code" into planning, controlled execution, and Git-native workflows. The linked GitHub repo describes a TypeScript agent with permission-hardened tools, token budgets, CLI/Telegram access, SQLite-backed "Second Brain" memory, and 24/7 daemon mode; a reply asked why not just use /plan, and Mercury answered that the command is simple but "the system around it is the point."

@GithubProjects shared (37 likes, 2 replies, 3,346 views, 31 bookmarks) Gas Town, a multi-agent workspace manager. The screenshot says it coordinates Claude Code, GitHub Copilot, Codex, Gemini, and other agents, persists work in git-backed hooks and a Beads ledger, adds mailboxes and handoffs, and turns "4-10 agents become chaotic" into "scale comfortably to 20-30 agents."

Gas Town README showing a multi-agent workspace manager with git-backed hooks, mailboxes, handoffs, and persistent work tracking

@tom_doerr shared (22 likes, 6 replies, 1,751 views, 21 bookmarks) OpenSkills, a TypeScript universal skills loader with 10,200 GitHub stars that brings Claude Code-style SKILL.md files to Claude Code, Cursor, Windsurf, Aider, Codex, and other agents that read AGENTS.md. Replies captured the unresolved tradeoff: one called it either "a multi-tool nightmare or a game changer," while another said context windows can eat skills by turn four and force agents to reinvent loops.

@tom_doerr also posted (12 likes, 4 replies, 1,549 views, 18 bookmarks) Mimeo, whose GitHub metadata describes a Python tool that "mimeographs" an expert into a SKILL.md or AGENTS.md. @CopilotKit introduced (21 likes, 1 reply, 1,047 views, 16 bookmarks) CopilotKit Threads for persistent, resumable ChatGPT-like conversations built on AG-UI for any agent framework. @DarkWebInformer shared (9 likes, 2 replies, 2,101 views, 17 bookmarks) Pentest Agents, a Python bug-bounty framework for Claude Code, Codex, Gemini, Cursor, Windsurf, Copilot, and OpenClaw with autonomous hunt loops and an exploit-chain builder.

Discussion insight: @pvncher argued (23 likes, 4 replies, 3,017 views, 12 bookmarks) that orchestration works best as mostly serial decomposition, not agent swarms. In a reply, he said arbitrary agent-to-agent messaging adds irrelevant details and sidetracks work; parallelism is useful for read-only tasks, but write paths need locking and validation.
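That orchestration shape can be sketched as follows. This is a minimal illustration, not @pvncher's implementation: `Workspace`, `orchestrate`, and the example steps are hypothetical names. Read-only steps fan out in parallel, while write steps run one at a time behind a lock and are rejected if validation fails.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Workspace:
    """Shared state that only changes through locked, validated writes."""

    def __init__(self, validate):
        self._state = {}
        self._lock = threading.Lock()
        self._validate = validate

    def apply(self, write_step):
        # The lock serializes writers even if a caller uses threads.
        with self._lock:
            candidate = write_step(dict(self._state))
            if not self._validate(candidate):
                raise ValueError("write rejected by validation")
            self._state = candidate

    def snapshot(self):
        with self._lock:
            return dict(self._state)


def orchestrate(read_steps, write_steps, workspace):
    # Read-only steps cannot conflict, so they may run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(lambda step: step(), read_steps))
    # Write steps run serially, each seeing the previous step's result.
    for step in write_steps:
        workspace.apply(lambda state, step=step: step(state, findings))
    return findings


workspace = Workspace(validate=lambda state: "notes" in state)
orchestrate(
    read_steps=[lambda: "schema has 3 tables", lambda: "12 tests passing"],
    write_steps=[
        lambda state, findings: {**state, "notes": findings},
        lambda state, findings: {**state, "status": "done"},
    ],
    workspace=workspace,
)
result = workspace.snapshot()
print(result)
```

Subagents here never message each other directly; they only see the findings and state the manager hands them, which matches the reply's concern about arbitrary agent-to-agent messaging.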

Comparison to prior day: May 17 emphasized self-evolving skills and official skill packs. May 18 showed more portability and workspace products: skills that cross agents, persistent conversations, Git-native work traces, and multi-agent coordination.

1.4 Skepticism concentrated around security, OOD failures, and model-price tradeoffs 🡒

The day was not uniformly bullish. The strongest skepticism came from concrete failure reports: coding agents could not build a child's rocket simulator, prompt-only security looked weak, MIT showed prompt injection against a current model, and Cursor Composer 2.5 drew speed-versus-intelligence debate.

@omarsar0 reported (60 likes, 28 replies, 14,785 views, 16 bookmarks) that coding agents repeatedly disappointed his 10-year-old when asked to build science-centered simulators such as a rocket simulator. In replies, he said he tried helping with the prompts and applying context-engineering tricks, but the agents still could not produce even a half-baked 2D simulator. His conclusion was narrow and evidence-based: coding agents are better at extending patterns abundant in training data than at out-of-distribution creative simulator work.

@RoundtableSpace posted (53 likes, 9 replies, 38,508 views, 36 bookmarks) a long "security-first agent" prompt covering prompt injection, jailbreaks, data exfiltration, least privilege, and unsafe tool use. The replies immediately split: one said almost nobody is securing agents; @bygregorr asked whether security is "just vibes in a system prompt," and @sanarsh11 called it "Prompt armor on a paper shield."

@anishathalye said (27 likes, 3 replies, 1,692 views, 20 bookmarks) that an MIT CSAIL guest lecture successfully demonstrated prompt injection against Anthropic's Opus 4.6 model and concluded that agent security is still unsolved. His replies said the full lecture covers ReAct, CodeAct, agent security, Simon Willison's dual-LLM pattern, and CaMeL's capability system.

@minchoi described (33 likes, 14 replies, 5,962 views, 8 bookmarks) Cursor Composer 2.5 as cheaper and built for long-running agent work. The useful reply came from a tester who said Composer 2.5 felt "blazingly fast" and solid on price-performance, but still could not match Opus or GPT-5.5 intelligence and still made "silly mistakes."

Discussion insight: Security replies were unusually direct. Posts that proposed prompt defenses triggered immediate skepticism that prompts alone can stop exfiltration or multilingual leaks. The overall pattern was not anti-agent; it was anti-claim without verification.

Comparison to prior day: May 17 already had warnings about malicious agent supply chains. May 18 broadened the skepticism into prompt-injection demos, prompt-only defense criticism, OOD coding failures, and model-specific tradeoffs.

2. What Frustrates People

Reliable agents need ownership, not just better models

Severity: High. @dani_avila7 said (52 likes, 4 replies, 21,387 views, 58 bookmarks) enterprise Claude Code rollouts hit the same wall when nobody owns configuration, permissions, skills, and conventions. @code_rams listed (5 likes, 4 replies, 415 views, 14 bookmarks) constraints, state, verification, observability, modular structure, clean-state protocols, and behavioral constraints as required harness components. The workaround is emerging as a dedicated Agent Manager/DRI plus spec gating, but the need is still operationally unresolved.

Agent security feels both urgent and underbuilt

Severity: High. @anishathalye showed (27 likes, 3 replies, 1,692 views, 20 bookmarks) prompt injection against Opus 4.6 in an MIT CSAIL lecture, while @RoundtableSpace circulated (53 likes, 9 replies, 38,508 views, 36 bookmarks) a security prompt that replies criticized as insufficient. @DarkWebInformer shared (9 likes, 2 replies, 2,101 views, 17 bookmarks) Pentest Agents, but a reply warned that static agent scripts can hallucinate findings instead of proving exploit paths. The coping pattern is layered: prompts, least privilege, capability systems, exploit proof, and human review.

Coding agents still disappoint on out-of-distribution creative tasks

Severity: Medium. @omarsar0 said (60 likes, 28 replies, 14,785 views, 16 bookmarks) coding agents failed at his child's rocket simulator and related science-simulator tasks even with adult prompting help and context engineering. The frustration is not that agents never help; he explicitly says they help maintain and extend software patterns well represented in training data. The gap is generalized creation in domains where the desired behavior is less templated.

Agent-commerce trust is still ahead of settlement infrastructure

Severity: High. @Onyx_goose argued (57 likes, 48 retweets, 5 replies, 129 views) that agent commerce needs x402B escrow because autonomous parties may not know each other. @tetsuoai described (86 likes, 9 replies, 3,976 views, 15 bookmarks) AgenC's marketplace rails and proof-backed completion, and a reply asked whether native agent-to-agent payments exist yet. The workaround is conditional escrow, reviewable artifacts, and onchain coordination, but the replies show trust remains an open implementation question.

The tool stack is fragmenting into skills, memory, chat, and workspaces

Severity: Medium. @tom_doerr shared (22 likes, 6 replies, 1,751 views, 21 bookmarks) OpenSkills, and replies immediately worried whether universal skills become a "multi-tool nightmare." @mercury__agent had to explain (176 likes, 9 replies, 1,040 views) that /code plan is not the value by itself; the traceable Git-native loop is. The frustration is discoverability and system integration: users want reusable skills, but not context bloat or yet another isolated runtime.

3. What People Wish Existed

A real owner and control plane for enterprise coding agents

People want a practical operating role and control surface for Claude Code-style rollouts. @dani_avila7 (52 likes, 21,387 views, 58 bookmarks) called this role Agent Manager, owning configuration, permissions, marketplace selection, and conventions. Opportunity: direct. The need is practical and immediate because teams are already rolling out agents before ownership is clear.

Harnesses that preserve goals without letting agents redefine done

@code_rams (5 likes, 415 views, 14 bookmarks) wanted constraints, state, verification, observability, clean state, and behavioral limits; the reply from @nicolasmoute added spec gating so planners cannot silently rewrite goals. Opportunity: direct. Existing harness templates partially address it, but the "planner redefines done" failure mode needs explicit product support.

Trustworthy agent-commerce rails

@Onyx_goose (57 likes, 48 retweets) asked for x402B-style escrow, @tetsuoai (86 likes, 3,976 views) described proof-backed completion, and @termix_ai (43 likes, 8,130 views) supplied volume metrics showing marketplace activity. Opportunity: direct and competitive. Many projects are already attacking identity, escrow, proof, and settlement, but replies still ask how payments and verification really work.

Secure agents that do more than carry a defensive prompt

@RoundtableSpace (53 likes, 38,508 views, 36 bookmarks) offered a security prompt, but replies called it paper-thin. @anishathalye (27 likes, 1,692 views, 20 bookmarks) pointed to capability systems and dual-LLM patterns in a lecture after showing prompt injection. Opportunity: direct. The market wants testable, enforceable security boundaries, not just more careful wording.

Better benchmarks and harnesses for non-standard creative tasks

@omarsar0 (60 likes, 14,785 views, 16 bookmarks) framed failed rocket-simulator tasks as evidence for generalized harnesses and eventually native multimodal systems or world models. Opportunity: aspirational. The need is real, but current tools seem better at maintaining and extending common software patterns than creating unfamiliar simulators.

4. Tools and Methods in Use

Tool	Category	Sentiment	Strengths	Limitations
Learn Harness Engineering	Course/templates	(+)	High bookmarking; teaches state, verification, observability, control systems; includes templates	One reply questioned whether it was really harness engineering rather than ML/PyTorch-oriented content
Hermes Agent	Agent runtime	(+)	Used for prototype-to-production loops; supports specialist-agent iteration	Drift corrections may not transfer to production inputs
Claude Code	Coding agent/runtime	(+/-)	Central to harness guides, enterprise rollout, skills, local-agent posts	Needs ownership, configuration discipline, permissions, and large context
LobeHub Chief Agent Operator	Agent marketplace/operator	(+/-)	273K-skill marketplace, cloud 24/7 execution, IM reporting	Reliability of selecting the right agent is still asserted more than proven
AgenC	Runtime/marketplace	(+)	Local daemon, CLI/TUI/dashboard, approvals, memory, observability, Solana rails, ZK/prover path	Native agent-to-agent payment handling was still questioned in replies
x402/x402B	Payments/escrow	(+)	Payment rail plus conditional escrow for autonomous transactions	Mostly conceptual in cited thread; trust model needs adoption
Mercury Agent	Coding-agent workspace	(+)	Git-native planning/execution trace, permission hardening, SQLite memory, token budgets, daemon mode	Must prove the broader workspace loop is more than planning commands
Gas Town	Multi-agent workspace	(+)	Git-backed hooks, Beads ledger, mailboxes, identities, handoffs, 20-30 agent coordination claim	Evidence came from screenshot/README, not broad user reports
OpenSkills	Skills loader	(+/-)	10,200 GitHub stars; makes Claude Code-style skills usable across agents	Replies worry about context bloat and tool complexity
Mimeo	Skill-generation tool	(+/-)	Converts expert procedure into SKILL.md/AGENTS.md	Replies ask how it separates expert reasoning from bad habits
CopilotKit Threads	Agent app framework	(+)	Persistent, resumable conversations on AG-UI for any agent framework	External docs were not available in this review; signal rests on launch tweet
Pentest Agents	Security framework	(+/-)	475-star Python bug-bounty framework with multi-tool support and exploit-chain builder	Reply warns static scripts can hallucinate findings without proof
Cursor Composer 2.5	Coding model	(+/-)	Fast, cheaper, built for long-running agent work	Tester said it still trails Opus/GPT-5.5 intelligence and makes silly mistakes
ElevenLabs voice agent	Voice/public services	(+/-)	Used in Poland appointment reminders and rescheduling	Replies question whether IVR could solve the same problem more cheaply

The overall satisfaction spectrum is split. Builders like harnesses, skills, and traceable workspaces when they reduce drift and repeated setup, but they push back when tools add context weight or replace enforceable security with prompts. Migration patterns are implicit: from raw chat to harnessed agents, from one-off skills to portable skills, from isolated coding agents to Git-native workspaces, and from offchain coordination to escrow/proof-backed commerce.

5. What People Are Building

Project	Who built it	What it does	Problem it solves	Stack	Stage	Links
Learn Harness Engineering	@_vmlops shared	Course and templates for reliable AI coding-agent harnesses	Agents lose state, declare victory early, and lack observability	Web course, AGENTS.md-style templates	Shipped	course, tweet
LobeHub Chief Agent Operator	@lobehub	Hires agents from a skill marketplace and runs them in cloud through IM apps	Choosing, running, and monitoring agents manually	Cloud agent marketplace, IM integrations	Shipped	tweet
AgenC	@tetsuoai	Agent operating layer plus marketplace, settlement, and proof-backed completion	Autonomous work needs execution, review, payment, and trust rails	Local daemon, CLI/TUI/web, Solana, ZK/prover	Beta	tweet
Mercury Agent	@mercury__agent	Permissioned coding agent with Git-native workflow and memory	Chat-with-code lacks traceable planning and controlled execution	TypeScript, SQLite, CLI/Telegram	Shipped	GitHub, tweet
Gas Town	@GithubProjects shared	Multi-agent workspace manager	Multiple agents lose context, handoffs, and work state	Git-backed hooks, Beads ledger	Shipped	tweet
OpenSkills	@tom_doerr shared	Universal loader for Claude Code-style skills	Skills are trapped inside one agent ecosystem	TypeScript, SKILL.md, AGENTS.md	Shipped	GitHub, tweet
Mimeo	@tom_doerr shared	Converts expert thinking into SKILL.md or AGENTS.md	Expert workflows are hard to package for agents	Python	Alpha	GitHub, tweet
CopilotKit Threads	@CopilotKit	Persistent, resumable conversations for agent apps	Users need ChatGPT-like continuity in apps	AG-UI, CopilotKit	Shipped	docs, tweet
Pentest Agents	@DarkWebInformer shared	Autonomous bug-bounty framework for multiple coding agents	Security testing needs repeatable agent workflows	Python, Claude Code/Codex/Gemini/Cursor/Windsurf/Copilot/OpenClaw	Alpha	GitHub, tweet
$CLAWD/Larv AI governance	@clawdbotatg	Trained mini-agents vote on governance for stakers	DAO voters lack time to stay informed	Onchain staking, AI governance, private inference	Beta	site, tweet

The repeated build pattern is "make agents operationally legible." Mercury, Gas Town, OpenSkills, and Mimeo all package work so it persists across sessions or tools. AgenC, x402B, TermiX, and $CLAWD extend the same need into commerce and governance: agents need settlement, reputation, proof, and review. The weak spot across many projects is proof of reliability under messy use, which appears in replies about drift, prompt injection, hallucinated pentest findings, and context bloat.

6. New and Notable

Poland's healthcare voice-agent rollout

@mati said (1,113 likes, 47 replies, 20 quotes, 64,986 views, 153 bookmarks) Poland is rolling out an ElevenLabs-powered voice agent for Centralna e-Rejestracja that calls landline-only patients, reminds them of appointments, and helps reschedule. The operational claim is concrete: 40M appointments per year and 10-20% no-show rates. Replies split between "no-show is one of the biggest issues" and skepticism that the government should prioritize AI over broader healthcare waiting-time problems.
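To put those two figures on one scale, the implied number of missed appointments can be worked out directly. This is only arithmetic on the numbers quoted in the post; the constant and function names are illustrative, and the result is an implication of the stated range, not a figure the post itself reports.

```python
APPOINTMENTS_PER_YEAR = 40_000_000  # "40M appointments per year" from the post
NO_SHOW_PERCENT_RANGE = (10, 20)    # "10-20% no-show rates" from the post


def missed_appointments(total: int, percent: int) -> int:
    """Integer arithmetic keeps the result exact for whole percentages."""
    return total * percent // 100


low, high = (
    missed_appointments(APPOINTMENTS_PER_YEAR, percent)
    for percent in NO_SHOW_PERCENT_RANGE
)
print(f"{low:,} to {high:,} missed appointments per year")
# prints "4,000,000 to 8,000,000 missed appointments per year"
```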

The Bankr ecosystem map

@BaseHubHB posted (140 likes, 31 replies, 7,968 views, 57 bookmarks) a Bankr ecosystem map spanning agent infra, networks, RWA, privacy, reputation, builder tools, trading, gaming, and agent personas. The image is notable because it shows onchain agent projects fragmenting into recognizable categories rather than a single "AI coin" bucket.

Bankr ecosystem map grouping projects into agent infra, networks, builder tools, privacy, reputation, trading, gaming, and agent personas

Agentic economy governance concerns get a "black box" frame

@kimmonismus highlighted (49 likes, 6 replies, 11,600 views) Patrick Hussey's "The Agentic Economy Has No Black Box," saying agents are already destroying production systems, ignoring stop commands, and sustaining collusive pricing in simulations. The attached image's subtitle - "cross-party agent failures and the civic infrastructure that doesn't exist" - widened the day's governance discussion beyond single-company safety.

Serial orchestration over swarms

@pvncher argued (23 likes, 4 replies, 3,017 views, 12 bookmarks) that manager-model orchestration with scoped subagents is stronger than "agent swarms." His reply was the important part: arbitrary messaging between agents sidetracks them, parallelism is best for read-only work, and write tasks need locking and validation.

7. Where the Opportunities Are

[+++] Enterprise agent control planes — Evidence spans @dani_avila7's Agent Manager role, @code_rams's harness components, @shannholmberg's deployment loop, and @_vmlops's course. This is strong because the pain is operational, repeated, and tied to enterprise rollout.

[+++] Agent security with enforceable boundaries — Evidence includes @anishathalye's prompt-injection demo, @RoundtableSpace's prompt-defense debate, and @DarkWebInformer's pentest framework plus hallucination critique. Strong opportunity because prompt-only solutions are visibly distrusted.

[+++] Trust, escrow, and proof for agent commerce — @Onyx_goose, @tetsuoai, @termix_ai, and @clawdbotatg all point to settlement and verification. Strong opportunity because projects are already transacting, but replies still ask how trust and payments are enforced.

[++] Portable skill and memory infrastructure — OpenSkills, Mimeo, Mercury Agent, and CopilotKit Threads all try to preserve capability across sessions, apps, or agents. Moderate opportunity because demand is clear, but context bloat and skill quality are unresolved.

[++] Multi-agent workspace coordination — Gas Town, Mercury, and @pvncher's orchestration comments show a need for scoped subagents, handoffs, locks, and validation. Moderate opportunity because builders are already shipping, but claims need evidence across larger teams.

[+] OOD creative-agent evaluation — @omarsar0's rocket-simulator failure suggests a niche for benchmarks and harnesses that test beyond common app patterns. Emerging opportunity because the evidence is high quality but single-threaded.

8. Takeaways

Harness engineering was the dominant agent topic again, but it is becoming organizational infrastructure. The day's strongest evidence combined a widely bookmarked course, a Hermes deployment loop, a design guide, and an Agent Manager rollout role. (source)
Agent commerce is moving from narrative to rails. AgenC, x402B, TermiX, and $CLAWD all discussed execution, settlement, escrow, proof, or governance metrics rather than just "agents will transact." (source)
Portable skills are attractive but risky. OpenSkills' cross-agent SKILL.md model drew interest, while replies warned about tool complexity and context windows consuming skills. (source)
Security claims need proof. A security prompt drew immediate skepticism, and an MIT lecture demoed prompt injection against Opus 4.6, keeping agent security in the unsolved-problem bucket. (source)
Current coding agents still fail on some creative, underrepresented tasks. The rocket-simulator thread is a useful counterweight to enterprise success stories because it describes repeated failure despite prompt/context help. (source)
The builder pattern is traceability. Mercury, Gas Town, CopilotKit Threads, and Mimeo all preserve state, skills, conversation, or work traces so agents can behave less like disposable chats. (source)