Twitter AI Agent - 2026-05-16

1. What People Are Talking About

1.1 Harness engineering becomes measured operating work

Harness engineering stayed central, but the evidence changed shape. Instead of generic "agents need better scaffolding" takes, the strongest posts were about specific observability layers, explicit framework mappings, team cost dashboards, and the economics of running long-lived harnesses. At least six substantive items supported the theme, and even the jokes were now about nested control loops rather than prompts.

@IntuitMachine summarized (19 likes, 2 replies, 1,392 views, 32 bookmarks) the Agentic Harness Engineering (AHE) paper, saying ten iterations of observability-driven harness edits lifted Terminal-Bench 2 pass@1 from 69.7% to 77.0% and transferred to SWE-bench Verified with 12% fewer tokens. The tweet's abstract screenshot made the paper's core claim explicit: the leverage came from component, experience, and decision observability, not a new base model.

Abstract screenshot for Agentic Harness Engineering showing the three observability layers and the 69.7% to 77.0% Terminal-Bench 2 improvement
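
For scale, the reported lift can be restated as an absolute and a relative gain. The short calculation below uses only the two pass@1 figures quoted in the tweet; the variable names are illustrative and nothing here comes from the paper's own code.

```python
# Figures as quoted in the tweet summary of the paper.
baseline_pass_at_1 = 69.7  # percent, before the harness edits
final_pass_at_1 = 77.0     # percent, after ten iterations

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = final_pass_at_1 - baseline_pass_at_1
relative_gain = absolute_gain / baseline_pass_at_1 * 100

print(f"absolute gain: {absolute_gain:.1f} points")  # absolute gain: 7.3 points
print(f"relative gain: {relative_gain:.1f}%")        # relative gain: 10.5%
```

Read this way, the claim is a 7.3-point lift, or about a tenth more tasks passed than the baseline harness managed on the same benchmark.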

A companion thread from @IntuitMachine on Sylph AI's harness-evolution paper (38 likes, 2 replies, 1,250 views, 37 bookmarks) pushed the same direction: Worker, Evaluator, and Evolution agents rewrite the harness automatically, while a meta-evolution loop learns how to adapt that pattern to new domains. The relevant public artifact is The Last Harness You'll Ever Build, which makes the day feel less like "prompt engineering but bigger" and more like the start of automated runtime engineering.
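
The Worker / Evaluator / Evolution split described in the thread can be pictured as a small search loop over harness settings. The sketch below is a toy under stated assumptions, not the paper's method: the `Harness` fields, the scoring rule, and the task list are all hypothetical, the "agents" are plain functions, and the meta-evolution layer is left out entirely.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Harness:
    """Hypothetical harness config with two tunable knobs."""
    max_retries: int
    context_budget: int

# Each toy task is (retries needed, context budget needed).
Task = tuple[int, int]

def worker(harness: Harness, task: Task) -> bool:
    """Stand-in worker: succeeds only if the harness meets both requirements."""
    needed_retries, needed_budget = task
    return harness.max_retries >= needed_retries and harness.context_budget >= needed_budget

def evaluator(harness: Harness, tasks: list[Task]) -> float:
    """Score a harness as the fraction of tasks the worker completes."""
    return sum(worker(harness, t) for t in tasks) / len(tasks)

def evolution(harness: Harness) -> list[Harness]:
    """Propose small candidate edits to the current harness."""
    return [
        replace(harness, max_retries=harness.max_retries + 1),
        replace(harness, context_budget=harness.context_budget + 200),
    ]

def evolve(harness: Harness, tasks: list[Task], iterations: int):
    """Keep a proposed edit only when the evaluator score strictly improves."""
    best, best_score = harness, evaluator(harness, tasks)
    history = [best_score]
    for _ in range(iterations):
        scored = [(evaluator(c, tasks), c) for c in evolution(best)]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            best, best_score = top, top_score
        history.append(best_score)
    return best, history

tasks = [(0, 100), (1, 100), (0, 300), (2, 100), (0, 500)]
best, history = evolve(Harness(max_retries=0, context_budget=100), tasks, iterations=5)
print(best, history)
```

The design point the sketch tries to capture is that every harness edit is accepted or rejected on a measured score, so the score history never regresses.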

@FUCORY argued (78 likes, 5 replies, 9,629 views, 82 bookmarks) that Bun's rewrite is one of the best live examples of harness engineering because the work is really about long-running workflow design, not just LLM output quality. The image mattered because it mapped Bun harness concepts to Smithers primitives one-to-one, and the replies added the kind of detail benchmark threads rarely do: a lifetime classifier for memory management and deliberate backpressure design.

@valentinmihov reported (13 likes, 5 replies, 867 views) that his team harness combined a Multica kanban board, Hermes Agent on Codex, a Hindsight memory system, custom skills, internal monitoring, OAuth2 SSO, and Git-managed infrastructure. He said the setup consumed about 600 million tokens—roughly $3,000 of API cost—versus what he estimated as two months of DevOps work, which turned "agentic engineering" from an abstract pattern into a concrete ops budget.

Dashboard-style summary of an in-house agentic engineering harness showing Kanban counts, roughly 600M tokens, about $3K API cost, and an 8x faster claim
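
The two headline numbers imply a blended token price. A quick check follows, assuming the roughly 600 million tokens and roughly $3,000 are totals for the same buildout; the post gives no input/output or per-model split, so this is only an average.

```python
# Approximate totals as reported in the post.
tokens_used = 600_000_000
api_cost_usd = 3_000

# Blended cost per million tokens across all models and token types.
cost_per_million_tokens = api_cost_usd / (tokens_used / 1_000_000)

print(f"${cost_per_million_tokens:.2f} per million tokens")  # $5.00 per million tokens
```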

Discussion insight: @BenjDicken captured (583 likes, 22 replies, 19,613 views, 150 bookmarks) the mood by reducing 2026 engineering to "agent → while loop subagent → nested while loop agent harness." The joke landed because it matched the serious posts: more loops, more memory policy, more orchestration layers, and more chances to lose observability.

Comparison to prior day: On May 15, harnesses were still being framed through architecture diagrams, observability papers, and reference courses. On May 16, the discussion stayed hot but moved one layer deeper into measurable deltas, framework mappings, and team-level cost and stack disclosures.

1.2 Agent control surfaces turn into configurable products

A second cluster treated agents less like chat endpoints and more like configurable software. The highest-signal examples were UI screenshots and runtime surfaces showing where guidance lives, which integrations are enabled, how providers are swapped, and what an operator can change without editing a prompt file.

@karrisaarinen said (61 likes, 6 replies, 6,281 views, 88 bookmarks) he now runs most agent work through Linear Agent, with personal guidance, custom skills, MCP servers, built-in web search and code context, and workspace-specific controls. The screenshot set was the important part: it showed Gong recording intake, per-workspace enablement toggles, and an enabled integration list across Slack, Teams, Intercom, Zendesk, and Gong, which made the product feel like an operator console rather than a prompt wrapper.

Linear Agent configuration screen showing Gong recording intake, team destination, participant notification, and guidance settings

@muskonomy reported (41 likes, 6 replies, 1,676 views) that Grok subscribers can now connect directly to Hermes Agent, whose README describes a self-improving runtime with model switching, skills, memory, cron, and messaging gateways. A lower-engagement but more revealing companion post from @MoeSbaiti claimed (1 like, 2 replies, 74 views) the xAI OAuth flow now bundles chat, TTS, image generation, video generation, and transcription under one login, and the architecture graphic showed exactly how that capability bundle is being presented to users.

SuperGrok to Hermes Agent diagram showing one xAI OAuth login feeding Grok chat, transcription, image generation, video generation, and TTS inside the agent runtime

@RodmanAi shared (46 likes, 8 replies, 1,661 views, 21 bookmarks) the free Learn Harness Engineering material as the missing production layer above prompting. The course page itself says the curriculum spans 20 phases and 416 lessons from model internals to autonomous swarms, which helps explain why the day's best UI posts kept talking about memory, governance, integrations, and persistence instead of prompt formats.

Discussion insight: The control-surface story had a clear negative mirror. @abboskhonovv said (9 likes, 1 reply, 417 views) he tried four Hermes web UIs and found weird layouts or missing features, while @zebassembly said (27 likes, 8 replies, 1,437 views) TUI nitpicks in current coding agents were bad enough to make them want to build their own.

Comparison to prior day: May 15's skill discourse was about reusable knowledge packs, hooks, and evaluation loops. May 16 showed where those capabilities are landing: settings panels, integration screens, runtime toggles, and provider-linked subscription flows.

1.3 Identity and governance become the hard part of agent commerce

The commerce conversation moved away from simple marketplace discovery and toward the missing trust layer. The strongest items were about agent identity, consent boundaries, auditability, and how work gets settled once agents start transacting or calling each other's skills for real.

@felix_fan wrote (59 likes, 15 replies, 5,627 views) that Trust Wallet is building on two tracks: Agent Kit plus EIP-8004 for developer-side onchain agent identity, while consumer users still hold keys and consent at every step. Public reporting from Cryptobriefing and PYMNTS made the same split visible in prose: more autonomous wallets and transfers on one side, more session caps, guardrails, and liability questions on the other.

@sijlalhussain argued (12 likes, 1 reply, 301 views) that trust in agentic commerce is an operational governance architecture problem rather than a brand problem. The image he shared was more valuable than the tweet text because it distilled the problem into five concrete requirements—identity verification, human oversight, transparency, data security, and accountable governance.

McKinsey-derived trust framework for agentic commerce showing identity verification, human oversight, transparency, data security, and accountable governance

@Unibase_AI announced (19 likes, 75 replies, 101 quotes, 31,573 views) Flap's entry into the BitAgent ERC-8183 marketplace for autonomous token-launch skills across multiple chains. The most useful evidence was not the main post but the top reply: composable skill calls still need escrow, deliverable hashes, and evaluator settlement before real work can resolve, which shows how much market plumbing is still missing under the hype.
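
The three pieces named in that reply (escrow, deliverable hashes, evaluator settlement) fit together roughly as sketched below. This is an off-chain toy under stated assumptions, not the ERC-8183 or BitAgent design: `EscrowJob`, `submit`, and `settle` are hypothetical names, and the evaluator is reduced to a boolean verdict.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

def deliverable_hash(payload: bytes) -> str:
    """Fingerprint of a deliverable, so both sides can refer to the same bytes."""
    return hashlib.sha256(payload).hexdigest()

@dataclass
class EscrowJob:
    """Hypothetical job record: funds are locked until the job is settled."""
    buyer: str
    seller: str
    amount: int                           # locked funds, in smallest units
    committed_hash: Optional[str] = None  # set when the seller submits work
    status: str = "funded"

def submit(job: EscrowJob, payload: bytes) -> None:
    """Seller commits to a specific deliverable by recording its hash."""
    job.committed_hash = deliverable_hash(payload)
    job.status = "submitted"

def settle(job: EscrowJob, received: bytes, evaluator_approves: bool) -> dict[str, int]:
    """Release to the seller only if the hash matches and the evaluator approves."""
    if job.status != "submitted":
        raise ValueError("job has no submitted deliverable to settle")
    hash_matches = deliverable_hash(received) == job.committed_hash
    if hash_matches and evaluator_approves:
        job.status = "released"
        return {job.seller: job.amount}
    job.status = "refunded"
    return {job.buyer: job.amount}

# Happy path: the evaluator sees the same bytes the seller committed to.
job = EscrowJob("buyer-agent", "seller-agent", 100)
submit(job, b"token launch report v1")
payout = settle(job, b"token launch report v1", evaluator_approves=True)

# Mismatch: the delivered bytes differ from the commitment, so funds go back.
disputed = EscrowJob("buyer-agent", "seller-agent", 100)
submit(disputed, b"token launch report v1")
refund = settle(disputed, b"something else", evaluator_approves=True)
print(payout, refund)
```

Even in toy form, the sketch shows why the reply treats these as plumbing: without a committed hash there is nothing for an evaluator to check, and without escrow there is nothing to release or refund.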

Discussion insight: This was also the most explicit disagreement cluster of the day. The bullish marketplace posts kept talking about discoverable, composable skills; the critical replies kept dragging the conversation back to consent, attribution, escrow, and who signs off when an agent completes—or fails—a paid task.

Comparison to prior day: May 15 focused on how agents register, discover services, and pay per call. May 16 added the harder questions underneath that flow: who the agent is, what authority it inherits, how activity is audited, and how marketplace work is actually settled.

2. What Frustrates People

Harness layers still cost too much to build and debug

The highest-severity frustration was that reliable agents still require a large amount of harness work before the model can be trusted on long tasks. @IntuitMachine summarized (19 likes, 2 replies, 1,392 views, 32 bookmarks) the Agentic Harness Engineering (AHE) paper as a response to tangled harness components, noisy trajectories, and weak attribution when edits help or hurt. @valentinmihov reported (13 likes, 5 replies, 867 views) that his own team harness took about 600 million tokens and roughly $3,000 of API spend to assemble, even with AI accelerating the work. @BenjDicken turned (583 likes, 22 replies, 19,613 views, 150 bookmarks) that operational sprawl into the day's defining joke about nested agent loops. Severity: High. The visible workaround is more explicit harness structure (memory classifiers, critics, mappings, and audit layers), not simpler prompts. Worth building for because even strong operator posts still describe harnesses that are expensive to build and fragile to run.

Agents still lack their own identity, liability, and settlement boundaries

The second high-severity frustration was that agent execution still inherits too much human identity and too little agent-specific accountability. @pvergadia warned (7 likes, 7 replies, 1,094 views) that ten agents can all run under the same long-lived credentials with no way to tell which one acted, quoting 1Password CTO Nancy Wang on the problem of agents inheriting permissions forever. @felix_fan said (59 likes, 15 replies, 5,627 views) Trust Wallet still separates developer autonomy from consumer consent, and the PYMNTS coverage made the same point with session caps, liability questions, and manual approvals still in the loop. Under @Unibase_AI's BitAgent launch post (19 likes, 75 replies, 101 quotes, 31,573 views), a top reply argued that composable skill calls still need escrow, deliverable hashes, and evaluator settlement before real work resolves. Severity: High. The workaround today is explicit consent gates, wallet limits, and human override. Worth building for because the trust layer is clearly lagging the marketplace layer.
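
One way to picture the alternative to shared long-lived credentials is a broker that issues each agent its own short-lived, scoped token and logs every authorization against the agent's identity. The sketch below is a minimal in-memory illustration under that assumption; `CredentialBroker`, its methods, and the scope strings are hypothetical and do not describe 1Password's or any vendor's product.

```python
import secrets
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentCredential:
    agent_id: str
    token: str
    scopes: frozenset
    expires_at: float

class CredentialBroker:
    """Hypothetical broker: one short-lived, scoped token per agent."""

    def __init__(self, ttl_seconds: float = 900.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._by_token: dict[str, AgentCredential] = {}
        self.audit_log: list[tuple[str, str, bool]] = []  # (agent, scope, allowed)

    def issue(self, agent_id: str, scopes: set[str], now: float) -> AgentCredential:
        """Mint a fresh random token that expires ttl_seconds after `now`."""
        cred = AgentCredential(agent_id, secrets.token_hex(16),
                               frozenset(scopes), now + self.ttl_seconds)
        self._by_token[cred.token] = cred
        return cred

    def authorize(self, token: str, scope: str, now: float) -> bool:
        """Check scope and expiry, and record which agent asked."""
        cred = self._by_token.get(token)
        allowed = cred is not None and scope in cred.scopes and now < cred.expires_at
        self.audit_log.append((cred.agent_id if cred else "unknown", scope, allowed))
        return allowed

# Timestamps are passed in explicitly so the example is deterministic.
broker = CredentialBroker(ttl_seconds=60)
planner = broker.issue("planner", {"repo:read"}, now=0)
deployer = broker.issue("deployer", {"repo:read", "deploy:write"}, now=0)

broker.authorize(planner.token, "deploy:write", now=10)    # denied: out of scope
broker.authorize(deployer.token, "deploy:write", now=10)   # allowed
broker.authorize(deployer.token, "deploy:write", now=120)  # denied: token expired
print(broker.audit_log)
```

The audit log is the point: every entry names a specific agent, which is exactly what a shared credential cannot provide.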

Existing UIs and TUIs still miss operator-grade controls

Interface quality remains a practical blocker even when the runtime underneath is capable. @abboskhonovv said (9 likes, 1 reply, 417 views) he tried four different Hermes web UIs and found weird layouts or missing features, so he built Hermium with model switching, chat management, a skills panel, and cron controls. @zebassembly said (27 likes, 8 replies, 1,437 views) current coding-agent TUIs have so many little problems that they still want to build their own agent despite liking the underlying products. @karrisaarinen showed (61 likes, 6 replies, 6,281 views, 88 bookmarks) what better looks like: settings, toggles, integrations, and guidance all exposed in the product instead of buried in prompt files. Severity: Medium. The current workaround is to replace the UI layer yourself. Worth building for because the pain is concrete, repeated, and tied to daily operator workflow rather than taste.

Hermium interface showing a local Hermes chat and research output view built to add model switching, chat management, skills, and cron controls

3. What People Wish Existed

Verifiable agent identity and settlement rules

The clearest practical need was not another marketplace but the rule layer beneath it. @pvergadia said (7 likes, 7 replies, 1,094 views) that agents still lack their own identity, making inherited permissions and attribution a "RIGHT NOW" problem. @felix_fan positioned (59 likes, 15 replies, 5,627 views) Trust Wallet's answer as Agent Kit plus EIP-8004 for developer-side identity while users keep keys and consent on the consumer side. Under @Unibase_AI's BitAgent launch post (19 likes, 75 replies, 101 quotes, 31,573 views), a reply argued that escrow, deliverable hashes, and evaluator settlement are still missing. Opportunity: direct. The need is concrete, operational, and still only partially solved by current wallet and marketplace launches.

Governed memory that survives session resets

Another recurring need was persistent state that belongs to the operator rather than the model vendor. The ODEI product page explicitly sells against the need to "stop rebuilding context every morning," promising a persistent world model, governance loop, and audit receipts above Claude Code, Codex, and Gemini. @valentinmihov said (13 likes, 5 replies, 867 views) his own harness needed a dedicated Hindsight memory system, while @muskonomy framed (41 likes, 6 replies, 1,676 views) Grok plus Hermes as valuable precisely because memory persists across sessions and messaging surfaces. Opportunity: direct. The demand is practical and already being sold against by multiple products, but the market still looks early and fragmented.

Operator-grade frontends for persistent agents

People also want agent runtimes that feel operable, not merely powerful. @abboskhonovv built (9 likes, 1 reply, 417 views) Hermium because existing Hermes UIs lacked model switching, chat management, skills visibility, and cron controls. @zebassembly said (27 likes, 8 replies, 1,437 views) current coding-agent TUIs still have too many fixable nitpicks, and @karrisaarinen showed (61 likes, 6 replies, 6,281 views, 88 bookmarks) that the desired alternative is a UI with visible toggles, guidance, and integrations rather than hidden conventions. Opportunity: direct and competitive. The need is explicit, but multiple builders are already converging on the same answer.

4. Tools and Methods in Use

Tool	Category	Sentiment	Strengths	Limitations
Agentic Harness Engineering	Harness / eval method	(+)	Closed-loop harness edits, explicit observability layers, transferable benchmark gains on Terminal-Bench 2	Still benchmark-heavy; the paper itself notes regression blindness and non-additive interactions
The Last Harness You'll Ever Build	Harness evolution framework	(+/-)	Worker/Evaluator/Evolution plus meta-evolution loop for adapting harnesses to new domains	Public evidence is still paper-led and vendor-framed rather than broad operator validation
Learn Harness Engineering / AI Engineering from Scratch	Course / reference	(+)	Turns harness work into structured curriculum around persistence, verification, and systems design	Educational material only, not a runtime or product
Hermes Agent	Agent runtime	(+/-)	Persistent memory, skill creation, model switching, messaging gateways, cron, Grok integration	Users still complain about missing UI controls and keep building replacement frontends
Linear Agent	Workspace agent	(+)	Fine-grained guidance, built-in search and code context, integration-specific controls, admin/user config	Evidence today came from one operator thread rather than many public case studies
ODEI	Memory / governance layer	(+/-)	Persistent world model, governance loop, audit receipts, model-agnostic continuity above terminal agents	Product promises are ahead of visible third-party usage proofs
XPR Network Dev Skill	Domain skill pack	(+)	ABI-verified docs grounded in live mainnet data; load-on-demand knowledge modules	Narrowly targeted to one blockchain ecosystem
ComfyUI Skills for OpenClaw	Workflow bridge	(+)	CLI-first schema mapping, multi-server routing, optional web UI, cross-runtime support	Requires exported ComfyUI workflows and server setup before agents can use it reliably
Hermium	Agent UI	(+)	Self-hosted frontend with model switching, persistent sessions, slash commands, and usage insights	Very new project and fully dependent on Hermes underneath
TradingAgents	Finance framework	(+/-)	Specialist analyst, researcher, trader, and risk roles; multi-provider models; checkpoint resume; decision memory	Repo explicitly positions itself as research-oriented and results vary by model and data
Agent Kit + EIP-8004	Wallet / identity infra	(+/-)	Gives agents onchain identity and programmable actions while preserving user consent flows	Consent, liability, and settlement remain unresolved across current wallet launches

Overall satisfaction was highest when a tool exposed clean control surfaces or domain-grounded knowledge instead of asking operators to improvise everything in prompts. The visible migration pattern is away from raw prompt-plus-tool bundles and toward governed runtimes, persistent memory layers, verified skill packs, and frontends that show toggles, integrations, and budgets explicitly. The main workaround when infrastructure is good but UX is weak is to build a wrapper UI like Hermium or move toward products such as Linear Agent that already expose the controls operators want. Competitive pressure is strongest where runtimes meet control: harness tooling, persistent memory/governance, workflow bridges, and wallet identity stacks.

5. What People Are Building

Project	Who built it	What it does	Problem it solves	Stack	Stage	Links
TradingAgents	Tauric Research	Multi-agent trading framework with analyst, researcher, trader, and risk teams	Breaks market analysis and trading decisions into specialist-agent workflows	Python, LangGraph, multi-provider LLMs, Alpha Vantage	Shipped	GitHub · paper · tweet
Fuzzy Scanner	@FuzzyAiGent	Audits Solana programs with a five-stage multi-agent pipeline	Cuts manual smart-contract review time and cost	Swarms, Solana, Rust/Anchor input, parallel agent pipeline	Alpha	agent · tweet
XPR Network Dev Skill	@paulgrey	Domain skill pack for XPR development and autonomous agents	Gives agents verified chain-specific knowledge instead of guessing	Markdown skill modules, live ABI verification, Hyperion traces, OpenClaw integration	Shipped	GitHub · tweet
Hermium	@abboskhonovv	Self-hosted Hermes chat UI	Fixes missing model switching, chat management, skills visibility, and cron controls	React, Hono, Bun, Hermes Agent	Beta	GitHub · tweet
ODEI Agent Builder	@odei_ai	Persistent world-model and governance layer above existing terminal agents	Stops session resets and missing auditability in agent workflows	World-model graph, governance loop, audit receipts, Claude Code/Codex/Gemini bootstrap	Shipped	site · tweet
Printr MCP	@printr	Agent-native token launchpad installed through a skills catalog	Lets agents deploy tokens and manage treasury steps from one command surface	MCP, skills catalog, onchain launch tooling	Shipped	tweet
ComfyUI Skills for OpenClaw	HuangYuChuh	Turns ComfyUI workflows into callable agent skills	Makes creative workflows callable without exposing raw graphs	Python CLI, schema mapping, optional web UI, multi-server routing	Shipped	GitHub · tweet
Blackfin	MrDiamondBallz	Local-first compatibility kernel with policy controls and isolated agents	Safer tool use, isolated workspaces, portable skills, deterministic traces	Python CLI, TOML policies, encrypted secrets, SQLite memory	Alpha	GitHub · tweet

TradingAgents was the clearest vertical-builder signal. @gusik4ever ranked (177 likes, 24 replies, 12,538 views, 205 bookmarks) it as the fastest-growing finance repo of the week at +3,822 stars, and the repo says it mirrors a real trading firm with analyst, researcher, trader, and risk teams plus checkpoint resume and decision memory. That combination—role specialization, persistent logs, and multi-provider model support—makes it much more than a generic AI trading bot.

Weekly GitHub finance pulse table showing TradingAgents, AI-Trader, scientific-agent-skills, and other finance-agent repos ranked by weekly star growth

Fuzzy Scanner showed the same verticalization in security. @FuzzyAiGent described (7 likes, 3 replies, 124 views) a five-stage multi-agent Solana audit pipeline, and the screenshots showed grades across ten programs plus a roughly $0.14 average cost and a 121-second parallel run. The signal here is not broad adoption yet; it is that builders are already instrumenting agent systems with cost and benchmark output inside narrow domains.

Audit results table from Fuzzy Scanner showing grades and scores across ten Solana programs such as SPL Token, Stake, Marinade, Drift, Orca, and Raydium

A second recurring build pattern was turning expert workflows into agent-safe interfaces. @paulgrey posted (92 likes, 2 replies, 5,855 views, 23 bookmarks) an XPR skill pack whose README says every fact is verified against live mainnet ABIs and Hyperion traces, while @DanKornas shared (3 likes, 1 reply, 372 views) ComfyUI Skills for OpenClaw as a CLI and schema layer that lets agents call exported creative workflows without touching raw graphs. @abboskhonovv took (9 likes, 1 reply, 417 views) the same approach on the UX side by building Hermium after current Hermes frontends missed basic controls.

ComfyUI Skills for OpenClaw screenshot showing workflow import, server manager, workflow manager, and an agent-friendly interface on top of ComfyUI

ODEI, Printr, and Blackfin point to a third pattern: people are no longer satisfied with the base runtime alone. They are adding world models, governance loops, audit receipts, policy profiles, encrypted secrets, or one-command onchain action surfaces above the underlying agent, which suggests the next wave of builder activity will be meta-infrastructure rather than end-user chat surfaces.

6. New and Notable

Codex realtime voice handoff is visible in public artifacts

@DevAdventur3s claimed (28 likes, 8 replies, 1,509 views) OpenAI is quietly building realtime voice mode into Codex. The signal is notable because the evidence is unusually concrete: one screenshot shows a live voice session with 182 ms latency and a background agent editing four files, and another shows Rust code with DEFAULT_REALTIME_MODEL set to gpt-realtime-1.5. GitHub code search also surfaces codex-rs/core/src/realtime_conversation.rs in openai/codex, which moves the claim beyond pure speculation.

Codex realtime screenshot showing a live voice session, 182 ms latency, a background coding agent at 62 percent, and file edits in progress

Rust source screenshot showing Codex realtime conversation code with the gpt-realtime-1.5 model string and handoff-related structs

Structured Wikipedia generation gets a dedicated multi-agent paper

@WikiResearch shared (9 likes, 1,095 views) the WikiMAG paper, which uses a Progressive Planner, Reflective Inspector, and Versatile Writer to assemble narrative, timeline, and table sections with better structure and citation quality than earlier systems such as STORM and Co-STORM. What makes it notable is the target: multi-agent generation is being pushed toward structured, citation-heavy article production rather than generic chat output.

Small harness fixes are still moving benchmark numbers

@OpenCvn said (1 like, 801 views) that KRAFTON AI jumped 10-plus points on Terminal-Bench with minimal harness fixes, and the public Terminus-KIRA repo attributes its gains to native tool calling, multimodal image analysis, marker-based polling, and a multi-perspective completion checklist. That matters because it reinforces the day's broader pattern: harness defaults still seem to leave major performance headroom, even when the underlying model family is already strong.
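
Of the fixes listed, marker-based polling is the easiest to picture. The general idea, as commonly used in terminal harnesses, is to print a unique marker after a command and poll the screen until the marker shows up, instead of sleeping for a fixed time. The sketch below is a generic illustration under that reading and is not taken from the Terminus-KIRA repo; `FakeTerminal` and `run_with_marker` are hypothetical stand-ins for a real PTY or tmux pane.

```python
import uuid

class FakeTerminal:
    """Stand-in for a PTY or tmux pane: output arrives one chunk per read."""

    def __init__(self, scripted_output: list[str]) -> None:
        self._scripted = scripted_output
        self._pending: list[str] = []
        self.screen = ""

    def send(self, command: str, marker: str) -> None:
        # A real harness would type something like `command; echo <marker>`.
        self._pending = self._scripted + [marker + "\n"]

    def read(self) -> str:
        # Each read reveals the next chunk, mimicking a slow-running command.
        if self._pending:
            self.screen += self._pending.pop(0)
        return self.screen

def run_with_marker(term: FakeTerminal, command: str, max_polls: int = 50):
    """Poll until a unique marker appears, then return the output before it."""
    marker = f"__DONE_{uuid.uuid4().hex}__"
    term.send(command, marker)
    for polls in range(1, max_polls + 1):
        screen = term.read()  # a real harness would sleep briefly between polls
        if marker in screen:
            return screen.split(marker)[0], polls
    raise TimeoutError(f"marker not seen after {max_polls} polls")

term = FakeTerminal(["compiling...\n", "tests passed\n"])
output, polls = run_with_marker(term, "make test")
print(repr(output), polls)
```

Because the marker is random per command, the harness cannot mistake earlier screen contents for completion, and it returns as soon as the command actually finishes rather than after a guessed delay.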

7. Where the Opportunities Are

[+++] Harness observability and evolution tooling — The AHE paper, the Sylph meta-evolution loop, the Bun-to-Smithers mapping, and @valentinmihov's cost dashboard all point to the same gap: teams still spend real money and real operator time on the layer between model and task. Products that make harness edits inspectable, diffable, benchmarkable, and cheap to iterate on have support from sections 1, 2, 4, and 6.

[+++] Agent identity, consent, and evaluator settlement — The identity-gap thread, Trust Wallet's two-track wallet design, the McKinsey trust taxonomy, and the BitAgent reply about escrow, deliverable hashes, and evaluator settlement all point to missing infrastructure between agent action and accountable resolution. This is a strong opportunity because the pain and the partial solutions both showed up repeatedly.

[++] Operator frontends for persistent agents — Hermium exists because current Hermes UIs missed basic controls, @zebassembly explicitly wants a better coding-agent TUI, and Linear Agent's screenshots show demand for visible settings and integrations. This is a competitive space rather than an empty one, but the need is direct and daily.

[+] Voice-first coding and realtime agent handoff — The Codex realtime screenshots and Zubin Pratap's voice-stack taxonomy suggest an emerging market for talk-build-narrate workflows with low latency and background execution. The evidence is still early, but it is now public, specific, and attached to actual code and UI artifacts.

8. Takeaways

Harness engineering is now being judged as systems work, not prompt craft. @IntuitMachine summarized (19 likes, 2 replies, 1,392 views, 32 bookmarks) AHE's 69.7% to 77.0% lift, while @valentinmihov reported (13 likes, 5 replies, 867 views) a $3,000, 600-million-token harness buildout for one team.
Operator interfaces are becoming a competitive layer above the runtime. @karrisaarinen showed (61 likes, 6 replies, 6,281 views, 88 bookmarks) that Linear Agent now exposes guidance and integrations in-product, while @abboskhonovv built (9 likes, 1 reply, 417 views) Hermium because existing Hermes UIs still missed core controls.
Wallets and marketplaces are still missing the accountable middle layer. @felix_fan split (59 likes, 15 replies, 5,627 views) developer-side autonomy from consumer-side consent, and @sijlalhussain argued (12 likes, 1 reply, 301 views) that agentic commerce trust is really a governance architecture problem.
The most credible builders shipped vertical tools, not general autonomy claims. @gusik4ever ranked (177 likes, 24 replies, 12,538 views, 205 bookmarks) TradingAgents as the week's fastest-growing finance repo, @FuzzyAiGent described (7 likes, 3 replies, 124 views) a multi-agent Solana auditor, and @DanKornas shared (3 likes, 1 reply, 372 views) a ComfyUI workflow bridge for agents.
Voice-first coding is early but no longer hypothetical. @DevAdventur3s pointed (28 likes, 8 replies, 1,509 views) to Codex realtime UI and Rust code artifacts, while @ZubinPratap framed (6 likes, 2 replies, 144 views) low-latency turn handling and interruption control as the real engineering problem behind conversational voice agents.