Twitter AI Agent - 2026-06-06

1. What People Are Talking About

1.1 Harness engineering got operationalized into copyable artifacts

June 6's biggest AI-agent theme was that "harness engineering" stopped sounding like a reading list and started showing up as assets teams can copy: roadmaps, agent charters, runtime diagrams, and stack maps. Five retained items supported this theme.

@aakashgupta summarized (65 likes, 5 replies, 10,798 views, 94 bookmarks) an OpenAI team roadmap that spends months 1-2 on repo legibility, months 3-4 on automated validation, and months 5-6 on opening the system to PMs and designers through specs and evals. The attached roadmap made the progression concrete by naming AGENTS.md, docs trees, reviewer personas, E2E tests, and painted-door experiments, and by quantifying the destination as 1 million lines of code, 250,000 lines of prompts, and zero human-typed code.

Roadmap image showing repo legibility, automated validation, and expanded leverage phases for agent-driven engineering

@PrajwalTomar_ shared (48 likes, 4 replies, 2,574 views, 86 bookmarks) a sanitized SOUL.md template for Hermes Agent. The screenshot matters because it makes the operating contract explicit: stance, autonomy, mission, accountability, and pushback are written down as first-class configuration, and the thread argues that missing this layer is why agents keep asking permission for every action.

Screenshot of the SOUL.md template defining stance, autonomy, mission, and accountability for a Hermes agent
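One way to picture "operating contract as configuration" is a small typed record that refuses to exist with a section missing. This is only a sketch: the five section names come from the post, while the class name, the validation rule, the rendered layout, and the example wording are all hypothetical and not taken from the actual template.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentCharter:
    """Hypothetical container for the five sections named in the post."""

    stance: str
    autonomy: str
    mission: str
    accountability: str
    pushback: str

    def __post_init__(self) -> None:
        # Fail early if any section was left blank.
        for f in fields(self):
            if not getattr(self, f.name).strip():
                raise ValueError(f"charter section {f.name!r} is empty")

    def to_markdown(self) -> str:
        # Render one heading per section, in declaration order.
        sections = [
            f"## {f.name.capitalize()}\n{getattr(self, f.name)}"
            for f in fields(self)
        ]
        return "# SOUL.md\n\n" + "\n\n".join(sections) + "\n"


# Example wording is invented for illustration only.
charter = AgentCharter(
    stance="Act as an owner of the task, not an assistant awaiting orders.",
    autonomy="Proceed without asking on reversible actions.",
    mission="Keep the weekly report pipeline healthy.",
    accountability="Log every action taken and why.",
    pushback="Say so when an instruction conflicts with the mission.",
)
print(charter.to_markdown())
```

The point of the validation step is the thread's argument in miniature: an agent with no written autonomy section has nothing to consult, so it falls back to asking.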

@_lopopolo argued (83 likes, 9 replies, 10,597 views, 71 bookmarks) that agent systems need expectations for when and how to seek context, not a pile of generic rules. The replies sharpened the failure mode: one asked whether AGENTS.md exists largely because context windows still fail over long runs, another said autocompaction makes deterministic context stuffing unreliable, and a third proposed "librarian" agents whose whole job is retrieving the right context for worker agents.
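The "librarian" idea from the replies can be sketched as a function that picks which documents a worker should receive instead of stuffing everything into context. The word-overlap scoring, the function name, and the sample documents below are assumptions made for illustration; a real librarian agent would presumably use something stronger than shared words.

```python
def retrieve_context(task: str, docs: dict[str, str], limit: int = 2) -> list[str]:
    """Return names of the docs sharing the most words with the task."""
    task_words = set(task.lower().split())
    scored = []
    for name, text in docs.items():
        overlap = len(task_words & set(text.lower().split()))
        if overlap > 0:  # skip documents with nothing in common
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Hypothetical repository documents.
docs = {
    "AGENTS.md": "run tests with make test before every commit",
    "deploy.md": "deploy to staging with make deploy",
    "style.md": "prefer small functions",
}
print(retrieve_context("fix failing tests before commit", docs))
```

Only the selected documents would be handed to the worker, which keeps the decision about what to load separate from the work itself.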

@pauliusztin_ outlined (16 likes, 9 replies, 274 views, 10 bookmarks) a four-layer architecture for AI software: presentation surfaces, a harness/runtime with unified memory, connectivity through skills, CLIs, and MCP clients, and MCP servers below that. The attached diagram is unusually specific about where durable execution, retries, checkpoints, permissions, and MCP apps belong, and replies argued that durable execution is what turns a chat loop into a real background worker.

Architecture diagram showing presentation surfaces above a harness runtime, skills/CLI/MCP connectivity, and MCP servers below
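The claim about durable execution is easier to see in code. The sketch below is not taken from the diagram; it is a minimal illustration, assuming a JSON file as the checkpoint store and step results that are JSON-serializable. The function name, the retry count, and the example steps are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_durable(
    steps: list[tuple[str, Callable[[dict], object]]],
    checkpoint: Path,
    max_retries: int = 3,
) -> dict:
    """Run named steps in order, saving results so a restart can resume."""
    state = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in state:
            continue  # already completed in an earlier run
        for attempt in range(1, max_retries + 1):
            try:
                state[name] = step(state)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after the last allowed attempt
        # Persist after every completed step, not only at the end.
        checkpoint.write_text(json.dumps(state))
    return state


calls = {"fetch": 0}


def fetch(state: dict) -> str:
    calls["fetch"] += 1
    if calls["fetch"] < 2:
        raise RuntimeError("transient failure")  # first attempt fails
    return "raw data"


def summarize(state: dict) -> str:
    return state["fetch"].upper()


path = Path(tempfile.mkdtemp()) / "checkpoint.json"
result = run_durable([("fetch", fetch), ("summarize", summarize)], path)
print(result)
```

Calling `run_durable` again with the same checkpoint path skips both steps, which is the property that lets a long task survive a process restart.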

@LearnWithBrij mapped (4 likes, 2 replies, 83 views, 4 bookmarks) the "modern AI agent stack" as 12 terms from RAG, context, and memory through tools, MCP, skills, hooks, subagents, orchestration, and eval. The post itself is small, but the image is high-signal because it turns the day's vocabulary into a dependency chain: weak memory, tools, or evaluation means the whole system collapses regardless of model quality.

Hand-drawn stack diagram mapping RAG, context, memory, tools, MCP, skills, hooks, subagents, orchestration, and eval

Discussion insight: The replies did not ask for bigger context windows. They asked for paved workflows, context-seeking behavior, checkpoints, and clearer operating contracts that stop the agent from guessing.

Comparison to prior day: June 5 was still arguing that harness engineering is the missing skill. June 6 turned that argument into copyable artifacts.

1.2 Cost governance and task-based model routing got more concrete

The second cluster treated agent operations as a budgeting and routing problem rather than a brand-loyalty contest. Three retained items supported this theme.

@chamath argued (897 likes, 115 replies, 278,744 views, 856 bookmarks) that the capability gap between frontier closed models and strong open models has narrowed much faster than the pricing gap. His cost table put GPT-5.5 Pro at about $105,000 per month for 1 billion input plus 1 billion output tokens, Claude Opus 4.8 at about $30,000, DeepSeek V4 Pro at about $5,220, and DeepSeek R1 at about $2,740, then recommended DeepSeek for volume, Opus for premium reliability, and GPT-5.5 Pro only when the extra capability clearly pays for itself. The replies pushed back usefully by saying apples-to-apples benchmarking still matters and that enterprise AI needs traces, token budgets, and named ROI targets, not just cheaper models.
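The size of the pricing gap is easier to judge as a multiple than as four separate bills. The sketch below uses only the monthly totals quoted above; the dictionary, the function name, and the choice of DeepSeek R1 as the baseline are illustrative assumptions.

```python
# Monthly totals quoted in the post for 1 billion input plus
# 1 billion output tokens. Names and structure are illustrative.
MONTHLY_COST_USD = {
    "GPT-5.5 Pro": 105_000,
    "Claude Opus 4.8": 30_000,
    "DeepSeek V4 Pro": 5_220,
    "DeepSeek R1": 2_740,
}


def cost_multiple(model: str, baseline: str = "DeepSeek R1") -> float:
    """Return how many times more expensive `model` is than `baseline`."""
    return MONTHLY_COST_USD[model] / MONTHLY_COST_USD[baseline]


for name in MONTHLY_COST_USD:
    print(f"{name}: {cost_multiple(name):.1f}x the DeepSeek R1 bill")
```

On these figures the most expensive option costs roughly 38 times the cheapest for the same token volume, which is the gap the routing recommendation is reacting to.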

@AlexFinn listed (108 likes, 20 replies, 6,149 views, 144 bookmarks) seven Hermes practices that effectively amount to a routing manual: use a different profile for each model; match GPT-5.5 to coding, Opus to writing and research, and local models to cheap repetitive work; keep heavy workflows in /background; and trim cron jobs and adjust compression settings when memory quality drops. A correction in the thread mattered because it shows these setups are not static vendor bets; they are active operating choices, down to which Qwen version is actually runnable locally.
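The pairing of models to jobs reduces to a lookup table. The sketch below is not Hermes configuration syntax; the profile names, task labels, and the choice to fall back to the local profile for unknown task types are all assumptions made for illustration.

```python
# Hypothetical routing table based on the pairings in the post:
# coding -> GPT-5.5, writing and research -> Opus, repetitive work -> local.
ROUTES = {
    "coding": "gpt-5.5",
    "writing": "opus",
    "research": "opus",
    "repetitive": "local",
}


def pick_profile(task_type: str, default: str = "local") -> str:
    """Return the profile for a task type, falling back to the default."""
    return ROUTES.get(task_type, default)


for task in ("coding", "research", "log-cleanup"):
    print(task, "->", pick_profile(task))
```

Defaulting unknown work to the cheapest profile is a design choice, not something the post prescribes; a team that values reliability over cost might default the other way.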

@levie argued (139 likes, 27 replies, 23,187 views, 70 bookmarks) that coding remains the easiest domain for agents precisely because the work is verifiable and the context is already digitized in the codebase. Replies extended that thought by saying the ceiling in other knowledge work is not raw model intelligence but how much of the real context still lives in one senior person's head or an unstructured shared drive.

Discussion insight: The useful disagreement was not over which lab is winning. It was over when a premium model is actually worth paying for and how much routing, evaluation, and governance has to sit above the model.

Comparison to prior day: June 5 framed scaffolding above the model as the moat. June 6 added explicit monthly spend tables and per-model playbooks.

1.3 Agent frameworks kept turning into products with desktop surfaces, marketplaces, and payment rails

The third cluster was about distribution surfaces: desktop control centers, storefronts, and monetized interfaces rather than another bare agent framework. Four retained items supported this theme.

@iamlukethedev reported (99 likes, 11 replies, 11,115 views, 51 bookmarks) that Hermes Agent v0.16 is no longer just a CLI-layer framework but a platform with desktop apps for macOS, Windows, and Linux, a web admin dashboard, quick setup, drag-and-drop files, MCP server management, memory controls, and remote instance connections. Replies said remote access through Tailscale felt fast and raised one live edge case: fallback model switching when the main model runs out of tokens.

@trythreews shipped (373 likes, 61 replies, 20,856 views, 18 bookmarks) a 3D AI-agent platform that pairs browser-native rendering with multi-LLM chat, native Solana wallets, on-chain identity, real-time voice, and a one-line embed tag. The public three.ws site fills in the product surface: 200+ motion clips, multiplayer worlds, pay-per-chat in USDC, WebXR placement, and MCP/A2A connectivity, which makes this feel less like a demo and more like an embodied runtime plus marketplace.

@MeltedMindz described (18 likes, 2 replies, 603 views) Postera as an agent-to-agent marketplace built on Base using x402, where agents publish SKILL.md files and get paid directly in USDC with no subscription cut taken by the platform. @AxiomBot showed (14 likes, 4 replies, 369 views) what that looks like in practice: a storefront page with paid skills, declared MCP/A2A/x402 endpoints, on-chain receipts, buyer counts, repeat-buyer metrics, and a profile score. The image matters because it turns agent commerce into a visible dashboard with prices, receipts, and trust signals.

Screenshot of the Axiom profile on Postera showing paid skills, declared endpoints, on-chain receipts, and score breakdown

Discussion insight: The conversation kept moving away from model IQ and toward surfaces users can operate: remote control, memory settings, identity, payouts, and storefront metrics.

Comparison to prior day: June 5 was full of asks for agent-first work surfaces. June 6 had more shipped product surfaces to point at.

2. What Frustrates People

Token budgets are too easy to burn without a control plane

Severity: High. @chamath (897 likes, 115 replies, 278,744 views, 856 bookmarks) explicitly said teams are still defaulting to the most expensive models and burning through large budgets without governance, while replies argued that enterprise AI needs traces, token budgets, and named ROI targets rather than vague experimentation. @AlexFinn (108 likes, 20 replies, 6,149 views, 144 bookmarks) showed the current workaround in operator form: split work by model profile, send cheaper tasks to local models, and actively tune cron jobs and compression settings when the runtime degrades. This is worth building for because the pain is already expressed in dollars, not just latency.

Agents still forget, compress, or guess between sessions

Severity: High. @_lopopolo (83 likes, 9 replies, 10,597 views, 71 bookmarks) said harnesses cannot rely on everything being stuffed into context, especially after autocompaction. @DamiDefi (108 likes, 7 replies, 11,156 views, 16 bookmarks) framed the same pain from the user side: most agents lose what they learned after each session, while Hermes only feels different because it compounds runtime skills, persistent memory, and offline optimization. @AlexFinn added the operator symptom — memory loss improved only after lowering Hermes's compression threshold — and replies said permissions and RAM limits become the whole product once an agent touches real files and accounts. This is worth building for because today's workarounds are still manual tuning, profile juggling, and memory-layer marketing.

Verification is still lagging generation

Severity: High. @Vtrivedy10 (48 likes, 4 replies, 3,799 views, 34 bookmarks) called efficient verification one of the biggest bottlenecks for self-improving agents and argued that long tasks need intermediate verification gates, not just a final-answer check. @vigilcodes (5 likes, 1 reply, 104 views) addressed the same gap for onchain actions by shipping VIGIL as a single MCP endpoint with approval, honeypot, token, wallet, and revoke tools for Base. @dani_avila7 (8 likes, 2 replies, 679 views, 11 bookmarks) showed the same pressure moving into code review: SkillSpector is now merged into the Claude Code Templates repo so skill PRs get scanned before merge. This is worth building for because the feed shows teams moving verification earlier, but still treating it as a separate layer they have to bolt on.
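The difference between a final-answer check and intermediate gates can be shown in a few lines. This is a minimal sketch, not any of the tools named above: the function, the exception, and the toy stages are hypothetical, and real gates would run tests, scanners, or reviewers rather than one-line predicates.

```python
from typing import Callable

Step = Callable[[list[str]], list[str]]
Check = Callable[[list[str]], bool]


class GateFailed(Exception):
    """Raised when the check after a stage rejects its output."""


def run_with_gates(stages: list[tuple[str, Step, Check]], work: list[str]) -> list[str]:
    """Run each stage, verifying its output before the next stage starts."""
    for name, step, check in stages:
        work = step(work)
        if not check(work):
            # Stop at the first bad stage instead of discovering it at the end.
            raise GateFailed(f"gate failed after stage {name!r}")
    return work


# Toy stages: collect some items, then remove duplicates.
stages = [
    ("collect", lambda w: w + ["a", "b", "b"], lambda w: len(w) > 0),
    ("dedupe", lambda w: sorted(set(w)), lambda w: len(w) == len(set(w))),
]
print(run_with_gates(stages, []))
```

Because a failure names the stage that produced it, the cost of a bad step is bounded by that step rather than by the whole run, which is the efficiency argument in the post.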

3. What People Wish Existed

Cheap verification that runs inside the workflow

This was the clearest practical need. @Vtrivedy10 (48 likes, 4 replies, 3,799 views, 34 bookmarks) said intermediate verification is the bottleneck for long-horizon agents and that the real question is whether it can be done cheaply enough to use at scale. @vigilcodes (5 likes, 1 reply, 104 views) and @dani_avila7 (8 likes, 2 replies, 679 views, 11 bookmarks) both shipped partial answers — an MCP scanner and a PR scanner — but today's evidence still shows point solutions rather than one default verification layer. Opportunity: direct.

Control planes that combine routing, permissions, and persistent memory

This need was practical and urgent. @chamath treated model choice as a governance and routing problem, not a model-war argument. @AlexFinn and @iamlukethedev described the operator version of the same need: model-specific profiles, remote compute with a lightweight desktop surface, memory tuning, dashboard controls, and permission boundaries once the agent touches real files and accounts. Opportunity: direct and competitive. Partial answers exist, but the feed suggests operators still stitch together cost routing, memory, and permissions manually.

A way to make team context legible before the agent starts guessing

This was a practical need rather than an emotional one. @aakashgupta spent the first third of his roadmap on repo legibility and documentation, @_lopopolo said prompts must collapse into paved workflows instead of generic context stuffing, and @levie plus replies argued that many non-code domains still keep their most useful context in people's heads or unmanaged drives. @tom_doerr offered one narrow answer — autoskills auto-detects a stack and installs curated skills — but today's evidence says that most teams still do this translation by hand. Opportunity: direct.

4. Tools and Methods in Use

Tool	Category	Sentiment	Strengths	Limitations
DeepSeek V4 Pro / R1	LLM	(+)	Chamath's routing table positions them as the low-cost tier for high-volume agent work	Reply debate said the price/capability frontier still changes with the workload, so cheapest is not always best
Claude Opus 4.8	LLM	(+)	Used as the premium reliability tier in Chamath's routing example and AlexFinn's model-per-profile setup	Much pricier than open models, so operators are already treating it as a selective tool rather than the default
GPT-5.5 Pro	LLM	(+/-)	Chamath presents it as the top-capability option, and AlexFinn routes GPT-5.5 to coding tasks	Chamath's example prices it at roughly $105,000/month for 1 billion input plus 1 billion output tokens, so the premium has to be justified task by task
Hermes Agent	Agent platform	(+/-)	Persistent memory, runtime skills, desktop apps, /background workflows, remote control, and multi-profile routing	Cron jobs can slow it down, memory compression needs tuning, local-model guidance needed a correction, and replies flagged token-fallback issues
MCP server patterns / FastMCP-style stack	Protocol / server layer	(+)	Makes tool, resource, prompt, gateway, and proxy responsibilities explicit; pauliusztin_ says strong systems mix MCP with skills and CLIs	Today's posts still show confusion over where each layer should live, and single-mechanism agents were described as underperforming
autoskills	Skills installer	(+)	Scans package.json, Gradle, and config files, then installs curated skills matched to the stack	Early project; value depends on registry quality and how much of a team's real workflow can be captured as installable skills
markitdown + headroom + codegraph	Context prep / code intelligence	(+)	Convert docs to Markdown, compress context 60-95%, and pre-index code graphs before the prompt starts	Today's signal came mostly from GitHub breakout momentum rather than operator case studies, so real-world limits were not yet surfaced
VIGIL	Security MCP	(+)	One endpoint covers approval scans, honeypot detection, wallet reports, token scans, and revokes for Base workflows	Narrowly focused on Base and onchain use cases, and still early in public traction
RTRVR	Browser agent	(+/-)	Works inside a logged-in browser session, supports MCP, and handles cross-tab automation without per-site API setup	The recommending tweet also said consistency breaks on complex multi-step workflows

Tool — the specific tool, framework, service, model, or method people mention
Category — broad grouping such as LLM, protocol, agent platform, or security layer
Sentiment — overall feeling: (+) positive, (+/-) mixed, (-) negative
Strengths — the concrete advantages people called out
Limitations — the concrete complaints, gaps, or failure modes visible in the posts

Overall sentiment was pragmatic rather than ideological. @chamath and @AlexFinn both route models by task and cost, while @pauliusztin_ and @elora_khatun frame MCP as one part of a hybrid stack that also needs skills and CLIs. The common workarounds were to keep a desktop surface over remote compute, preprocess files before prompting, publish skills instead of giant prompts, and scan risky actions before agents merge or sign. @sharbel and @vigilcodes make the competitive shift visible: context-prep and security layers are pulling attention as standalone product categories.

5. What People Are Building

Project	Who built it	What it does	Problem it solves	Stack	Stage	Links
three.ws	@trythreews	Browser-native 3D AI-agent platform with wallets, voice, multiplayer worlds, and pay-per-chat	Gives agents an embodied interface, identity, and monetization layer instead of another chat box	Browser-native viewer, multi-LLM, Solana wallets, x402, WebXR, realtime voice	Shipped	site, tweet
Postera	@MeltedMindz	Marketplace where agents sell SKILL.md capabilities and get paid in USDC	Turns reusable agent know-how into a directly purchasable asset	Base, x402, SKILL.md, wallet settlement	Shipped	site, tweet
Axiom profile on Postera	@AxiomBot	Live storefront for x402 Endpoint Builder and Skill Author packages with receipts and declared endpoints	Makes agent identity, pricing, and trust signals visible to buyers	Base, x402, MCP, A2A, ERC-8004-style identity surfaces	Shipped	tweet
autoskills	midudev	Detects a repo's stack and installs curated agent skills automatically	Reduces manual skill selection and setup friction	Ruby, package/config scanners, audited skill registry	Shipped	repo, tweet
VIGIL	@vigilcodes	MCP security scanner with 11 read-only Base tools	Lets agents inspect approvals, honeypots, contracts, and wallets before signing	Python, MCP, Base, ClawHub	Shipped	repo, tweet

Stage — where the project stands: Shipped (live/production), Beta (usable but incomplete), Alpha (early prototype), or RFC (idea/proposal, no working code yet)
Stack — languages, frameworks, models, or services the project is built on
Problem it solves — the specific pain point or gap that motivated the build
Links — GitHub repo, project site, demo, blog post, or wherever the project lives

three.ws is the most distinctive build because it combines embodiment, payments, and tool connectivity in one browser surface. The public site says agents can be remixed, placed in AR, paid per chat in USDC, and connected to other agents through A2A and MCP, which is much more than a visual avatar wrapper.

Postera and Axiom show a parallel build pattern: skills are being treated as market inventory rather than gist-like prompt files. The Axiom screenshot makes that layer legible with paid listings, receipts, declared endpoints, buyer counts, and repeat-buyer stats.

GitHub README screenshot for autoskills showing one-command installation, tech stack detection, and audited skill selection

autoskills and VIGIL are narrower but important because they package one operator job into a dedicated layer — install the right skills or scan before signing — instead of asking users to assemble another general-purpose agent framework. That same pattern shows up repeatedly across the day: builders are shipping thinner, more opinionated layers around setup, safety, and monetization.
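The "detect the stack, then pick skills" job can be sketched as a scan for marker files plus a look inside the manifest. This is not the autoskills implementation; the marker table, the package list, and the function name are hypothetical and only illustrate the general technique.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical lookup tables; a real tool would ship a far larger registry.
MARKER_FILES = {"package.json": "node", "build.gradle": "gradle"}
KNOWN_PACKAGES = {"react", "next", "express"}


def detect_stack(repo: Path) -> set[str]:
    """Infer stack labels from marker files and declared dependencies."""
    found = {label for name, label in MARKER_FILES.items() if (repo / name).exists()}
    manifest = repo / "package.json"
    if manifest.exists():
        deps = json.loads(manifest.read_text()).get("dependencies", {})
        # Keep only the dependencies this sketch knows how to label.
        found.update(KNOWN_PACKAGES.intersection(deps))
    return found


# Build a throwaway repo with a single manifest to exercise the scan.
repo = Path(tempfile.mkdtemp())
(repo / "package.json").write_text(json.dumps({"dependencies": {"react": "^18.0.0"}}))
print(sorted(detect_stack(repo)))
```

The detected labels would then drive which skills get installed, which is where registry quality starts to matter more than the scan itself.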

6. New and Notable

Skill security started moving into the build pipeline

@dani_avila7 showed (8 likes, 2 replies, 679 views, 11 bookmarks) that SkillSpector is now merged into Claude Code Templates so new skill PRs get scanned before merge. @vigilcodes launched (5 likes, 1 reply, 104 views) VIGIL as a single Base-focused MCP endpoint with approval, honeypot, token, wallet, and revoke tools. The notable part is not raw engagement; it is that both items move verification earlier in the lifecycle — before merge and before signing.

Dark UI screenshot of VIGIL listing approvals, safety score, token scan, wallet report, honeypot detection, and revoke tools

Context-prep repos broke into the weekly GitHub leaders

@sharbel compiled (32 likes, 13 replies, 1,627 views, 32 bookmarks) a June 6 leaderboard where markitdown, headroom, ECC, codegraph, Understand-Anything, supermemory, and Claude Code all ranked among the week's fastest-growing repos. Public GitHub metadata supports the direction of travel: microsoft/markitdown had 146,475 stars at fetch time, chopratejas/headroom 15,983, affaan-m/ECC 209,226, and colbymchenry/codegraph 43,177. The notable part is that context compression, file normalization, and local code knowledge are drawing breakout demand on their own.

Leaderboard image listing fastest-growing GitHub repos with markitdown, headroom, ECC, codegraph, and Claude Code

Agent education became formal courseware

@Dinosn linked (17 likes, 726 views, 14 bookmarks) Learn Harness Engineering, whose site describes a project-based course on environments, state, verification, and control systems for Codex and Claude Code. @Gauravjain2410 circulated (23 likes, 2 replies) an Anthropic Academy poster listing 13 free courses spanning Claude 101, Agent Skills, Claude Code in Action, Intro to MCP, MCP advanced topics, and deployment tracks for Bedrock and Vertex AI. That matters because the operator layer is being packaged as curriculum, not just another thread.

Anthropic Academy poster listing 13 Claude AI courses including Agent Skills, Claude Code in Action, and MCP tracks

7. Where the Opportunities Are

[+++] Agent control planes for routing, permissions, and memory — @chamath put hard monthly dollar ranges on model routing, while @AlexFinn and @iamlukethedev showed operators already managing model-specific profiles, desktop control, remote compute, and memory settings. This is strong because the pain spans finance, reliability, and access control at the same time.

[++] Verification-first runtime layers — @Vtrivedy10 described efficient verification as a core bottleneck, @dani_avila7 moved skill scanning into PR review, and @vigilcodes shipped pre-sign checks through MCP. This is moderate because buyers are visible and the need is clear, but the solutions are still fragmented across CI, runtime, and onchain scanners.

[++] Repo-legibility and skill-install infrastructure — @aakashgupta, @PrajwalTomar_, and @tom_doerr all point to the same need: make the agent setup explicit before it starts guessing. This is moderate because the demand is obvious and public artifacts are shipping, but the market is still split between education, templates, and installers.

[+] Agent commerce and embodied interfaces — @trythreews, @MeltedMindz, and @AxiomBot show agents getting bodies, wallets, storefronts, and direct payments. This is emerging because live products exist, but the visible revenue, buyer counts, and trust systems are still early.

8. Takeaways

Harness engineering on June 6 was mostly about artifacts, not opinions. A roadmap, a SOUL.md template, and a runtime diagram turned agent operations into files and diagrams teams can copy today. (source)
Model routing is now a unit-economics problem. Chamath's table put GPT-5.5 Pro, Claude Opus 4.8, DeepSeek V4 Pro, and DeepSeek R1 on the same monthly cost sheet, and AlexFinn's setup advice matched that by routing different models to different jobs. (source)
The interface layer is widening fast. Hermes desktop, three.ws, and Postera all treat agents as desktop apps, embodied surfaces, or storefronts rather than pure chat loops. (source)
Memory remains the operational fault line. _lopopolo said long runs cannot depend on deterministic context stuffing, and Hermes advocates are still winning attention by promising that the next session will not start blank. (source)
Security and context-prep are becoming product categories of their own. SkillSpector, VIGIL, and the GitHub breakout list all point to more demand for scanners, compression layers, and code-knowledge tooling around the model. (source)