Twitter AI Agent - 2026-06-02

1. What People Are Talking About

1.1 Context, memory, and harness became the common vocabulary for agent builders

The strongest discussion was not about which model won a benchmark. It was about separating agent work into context, memory, and harness layers, then deciding which of those layers actually compounds in production. Four retained items supported this theme.

@DataChaz argued (230 likes, 21 replies, 30,013 views, 348 bookmarks) that the durable skills are context engineering, tool design, orchestrator-subagent patterns, eval discipline, MCP as a protocol layer, and a “harness > model” mindset. The most useful replies immediately pushed the point further: one said the harness matters more than MCP if the agent cannot recover from a crashed subprocess or resume a session, and another said the harness needs an audit trail.

@nabu_lines wrote (72 likes, 46 replies, 2,008 views) that builders should learn context engineering, memory engineering, and harness engineering before obsessing over leaderboard churn. In a follow-up reply, he defined context engineering as signal selection, memory engineering as what an agent remembers, forgets, and retrieves over time, and harness engineering as the tools, workflows, orchestration, code, and system design around the model.

@mfpiccolo wrote (58 likes, 5 replies, 13,319 views, 115 bookmarks) that harness bloat is structural when the unit of work is the whole framework. His proposed fix was smaller replaceable workers and typed functions so retrieval, reranking, and provider calls can be swapped independently instead of enlarging one monolith.
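One way to picture the "smaller replaceable workers and typed functions" idea is the minimal Python sketch below. It is an illustration under stated assumptions, not @mfpiccolo's implementation: the `Pipeline` class, the three type aliases, and the toy retriever, reranker, and provider are all hypothetical names invented here.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence

# Each stage is a plain typed function, so one stage can be swapped
# without touching the others. All names here are hypothetical.
Retriever = Callable[[str], Sequence[str]]
Reranker = Callable[[str, Sequence[str]], Sequence[str]]
Provider = Callable[[str], str]


@dataclass(frozen=True)
class Pipeline:
    retrieve: Retriever
    rerank: Reranker
    complete: Provider

    def run(self, query: str, top_k: int = 2) -> str:
        docs = self.rerank(query, self.retrieve(query))[:top_k]
        prompt = f"Question: {query}\nContext:\n" + "\n".join(docs)
        return self.complete(prompt)


def keyword_retriever(query: str) -> Sequence[str]:
    corpus = [
        "agents need harnesses",
        "memory is a discipline",
        "harness bloat is structural",
    ]
    words = query.lower().split()
    return [doc for doc in corpus if any(word in doc for word in words)]


def shortest_first(query: str, docs: Sequence[str]) -> Sequence[str]:
    return sorted(docs, key=len)


def longest_first(query: str, docs: Sequence[str]) -> Sequence[str]:
    return sorted(docs, key=len, reverse=True)


def echo_provider(prompt: str) -> str:
    # Stand-in for a real model call: returns the last context line.
    return prompt.splitlines()[-1]


pipeline = Pipeline(keyword_retriever, shortest_first, echo_provider)
# Swap only the reranker; retrieval and the provider call stay untouched.
swapped = replace(pipeline, rerank=longest_first)
```

Because each stage is only a function signature, replacing the reranker is a one-field change rather than a framework upgrade, which is the property the post argues for.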

Discussion insight: Replies repeatedly turned “harness > model” into a more concrete requirement set: crash recovery, session resume, event traces, and auditable history.
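The requirement set in that insight (crash recovery, session resume, event traces, auditable history) can be sketched with a single append-only log that doubles as a checkpoint. This is a minimal illustration, assuming a JSONL file and a hypothetical `Session` class; none of it comes from the posts or from any specific harness.

```python
import json
import os
import tempfile
from typing import Callable


class Session:
    """Append-only event log that also serves as a checkpoint (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.done: dict[str, str] = {}
        # Resume: rebuild completed steps from whatever the log already holds.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    event = json.loads(line)
                    if event["type"] == "step_done":
                        self.done[event["step"]] = event["result"]

    def _append(self, event: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the trace survive a process crash

    def run_step(self, step: str, fn: Callable[[], str]) -> str:
        if step in self.done:  # already recorded before the crash: skip it
            return self.done[step]
        self._append({"type": "step_started", "step": step})
        result = fn()
        self._append({"type": "step_done", "step": step, "result": result})
        self.done[step] = result
        return result


log_path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
calls: list[str] = []


def plan() -> str:
    calls.append("plan")
    return "outline"


first = Session(log_path)
first.run_step("plan", plan)
# Simulate a crash: discard `first` and rebuild the session from the log.
resumed = Session(log_path)
resumed.run_step("plan", plan)
```

The same file answers both questions the replies raised: it tells a restarted process where to pick up, and it gives a reviewer an ordered history of what ran.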

Comparison to prior day: June 1 treated context handling as a production bottleneck. June 2 gave that problem a more stable vocabulary and pushed the conversation toward smaller, more replaceable components.

1.2 Multi-agent diagrams converged on control planes, validation loops, and memory managers

Once the vocabulary settled, people shared diagrams that looked increasingly similar: an orchestration layer at the top, specialized workers in the middle, tools and memory underneath, and validation or observability wrapped around the whole system. Four retained items supported this theme.

@MeenakshiYACS shared (83 likes, 29 replies, 784 views) a reference architecture that laid out user and client entry points, an orchestration and control plane, specialized agents, tools, memory, monitoring, reliability, and governance or security in one diagram. Its core claim matched the day’s broader discourse: agents alone are not the system; the architecture around them is.

Figure: Reference architecture showing user and external-system inputs flowing through an orchestration control plane into specialized agents, tools, memory, observability, reliability, and governance layers.

@adxtyahq argued (68 likes, 8 replies, 1,440 views, 50 bookmarks) that the hard part of multi-agent systems is no longer getting one LLM to answer a question, but making orchestration, evals, routing, validation loops, and memory work reliably in production. The attached workflow diagram made that concrete with planner, researcher, retrieval, analysis, validator, critic, response, and memory-manager roles.

Figure: Multi-agent workflow showing a planner, specialist agents, a validator, a critic, a response agent, and a memory manager coordinating one user query.

Even a low-engagement promo post from @DataScienceDojo showed (3 likes, 488 views) the same production pattern: a supervisor agent decomposes the task, dispatches parallel specialists, and hands a combined result to a synthesis step rather than relying on one generalist agent.

Figure: Slide showing a supervisor agent breaking a task into subtasks, sending them to specialized agents in parallel, and combining the results into a single response.
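The supervisor pattern described above (decompose, dispatch specialists in parallel, synthesize) can be sketched with `asyncio.gather`. This is a toy illustration, not the slide's implementation: `decompose`, `specialist`, and `synthesize` are hypothetical stand-ins, and the specialists return canned strings instead of calling a model.

```python
import asyncio


async def specialist(role: str, subtask: str) -> str:
    await asyncio.sleep(0)  # placeholder for a model or tool call
    return f"[{role}] {subtask}"


def decompose(task: str) -> dict[str, str]:
    # Hypothetical fixed decomposition; a real supervisor would plan this.
    return {
        "research": f"gather sources for {task}",
        "analysis": f"compare options for {task}",
    }


def synthesize(results: list[str]) -> str:
    return " | ".join(results)


async def supervise(task: str) -> str:
    subtasks = decompose(task)
    # gather runs the specialists concurrently and returns results
    # in the same order the coroutines were passed in.
    results = await asyncio.gather(
        *(specialist(role, sub) for role, sub in subtasks.items())
    )
    return synthesize(list(results))


answer = asyncio.run(supervise("vendor selection"))
```

Keeping synthesis as its own step means the combined result has one owner, which is where validation or a critic pass would naturally attach.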

@hnshah summarized (19 likes, 9 replies, 4,732 views, 57 bookmarks) the workflow shift as “ideas become plans,” “plans become durable context,” and “agents run in parallel,” with the human role moving closer to judgment. The replies sharpened the operational caveats: race conditions, safer plan or review trails, and the need for durable artifacts when multiple agents run at once.

Discussion insight: The most useful replies were less excited about “multi-agent” as a label than about how to coordinate, validate, and recover when specialized workers share state.
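The race-condition concern raised in the replies can be made concrete with an optimistic version check on shared state. This is a minimal sketch under assumptions: `SharedPlan` and `StaleWrite` are hypothetical names, and a real system would likely put this check in a database or coordination service rather than in process memory.

```python
import threading
from dataclasses import dataclass, field


class StaleWrite(Exception):
    """Raised when a writer's view of the plan is out of date."""


@dataclass
class SharedPlan:
    version: int = 0
    steps: list[str] = field(default_factory=list)
    _lock: threading.Lock = field(
        default_factory=threading.Lock, init=False, repr=False, compare=False
    )

    def read(self) -> tuple[int, list[str]]:
        with self._lock:
            return self.version, list(self.steps)

    def write(self, expected_version: int, steps: list[str]) -> int:
        with self._lock:
            if expected_version != self.version:
                raise StaleWrite(
                    f"expected v{expected_version}, found v{self.version}"
                )
            self.steps = list(steps)
            self.version += 1
            return self.version


plan = SharedPlan()
seen_by_a, steps_a = plan.read()
seen_by_b, steps_b = plan.read()
plan.write(seen_by_a, steps_a + ["draft schema"])
try:
    # Agent B still holds version 0, so this write is rejected.
    plan.write(seen_by_b, steps_b + ["draft API"])
except StaleWrite:
    # Agent B re-reads and reapplies its change on top of agent A's.
    fresh_version, fresh_steps = plan.read()
    plan.write(fresh_version, fresh_steps + ["draft API"])
```

Rejecting the stale write forces the second agent to see the first agent's change before adding its own, instead of silently overwriting it.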

Comparison to prior day: June 1 emphasized harness components and secure runtimes. June 2 circulated full-stack diagrams and role-specific loops that looked closer to reusable operating models.

1.3 GitHub and Microsoft pushed agents into governed production surfaces

The most concrete platform movement came from GitHub and Microsoft, and both launches mirrored an operator complaint already visible in the timeline: production adoption is blocked more by trust and missing context than by model quality. Four retained items supported this theme.

@joaomdmoura said (36 likes, 6 replies, 9,976 views) that most agent projects go negative ROI not because the agents fail, but because they never make it to production. He said his teams often run on GPT-4o mini and that the real blockers are governance, controllable data exposure, and missing observability; a reply from Sam Charrington added that testing is no longer enough, and that guardrails, red-teaming, observability, and the right human loop now decide trust.

@GHchangelog announced (4 likes, 516 views, 5 bookmarks) GitHub Agent Apps, and the linked changelog said partner agents can now be installed from the GitHub Marketplace and invoked by assigning an issue, @mentioning the agent in a PR comment, or selecting it in the Agents UI. That turns agents into a native GitHub workflow surface instead of a sidecar integration.

@GHchangelog announced (3 likes, 179 views, 3 bookmarks) that Copilot code review now supports customizable agent skills and MCP connections, with a new Medium analysis tier for harder pull requests. The linked changelog said reviews can now pull context from issue trackers, documentation, service catalogs, and incident tooling, while shared config carries those settings across review and cloud-agent workflows.

@Microsoft365Dev announced (7 likes, 340 views) Work IQ as “production-ready intelligence for every agent.” Microsoft’s devblog described an agent-first layer spanning chat, context, tools, and workspaces, with A2A, remote MCP, REST endpoints, a Rego-based policy engine, and built-in logging and auditability.

Figure: Work IQ architecture showing agents reaching organizational intelligence through chat, context, tools, and workspaces over A2A, MCP, and REST.

Discussion insight: The launches all solved the same complaint in slightly different ways: give agents organizational context, narrow the tool surface, add policy and logs, and let admins control cost and trust.

Comparison to prior day: June 1 treated runtime security as an emerging layer around agents. June 2 showed GitHub and Microsoft packaging that layer into mainstream developer workflows.

1.4 Skills turned into installable assets with distribution and feedback loops

The skills conversation moved away from “save a prompt somewhere” and toward repositories, installers, marketplaces, and evaluation loops. Four retained items supported this theme.

@GithubProjects highlighted (13 likes, 1 reply, 1,643 views, 17 bookmarks) Matt Pocock’s public skills repository as a set of small, composable skills for real engineering work. The repo and skills.sh page confirmed install via npx skills@latest add mattpocock/skills, plus a catalog focused on grilling sessions, shared language documents, TDD, diagnosis, and architecture cleanup instead of one giant process wrapper.

@justinwetch introduced (16 likes, 5 replies, 15,223 views) Skill RSI, a local system for recursively improving agent skills. The repo described ontology building, champion-vs-challenger experiments, prompt-level evidence, and promotion only when a focused change wins under evaluation; one reply immediately surfaced the real hard part by asking how scoring works in fuzzy domains without clean answer keys.
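The champion-vs-challenger idea can be sketched as a promotion gate that keeps per-prompt evidence. This is an illustration only, not Skill RSI's code: the `promote` function, the `min_margin` threshold, and the toy scorer are assumptions made here, and the scorer is exactly the part the reply flagged as hard in fuzzy domains.

```python
from statistics import mean
from typing import Callable

Skill = Callable[[str], str]
Scorer = Callable[[str, str], float]  # (prompt, output) -> score in [0, 1]


def promote(
    champion: Skill,
    challenger: Skill,
    prompts: list[str],
    score: Scorer,
    min_margin: float = 0.05,  # hypothetical threshold
) -> tuple[str, list[dict]]:
    """Return the winner's label plus prompt-level evidence."""
    evidence = []
    for prompt in prompts:
        evidence.append({
            "prompt": prompt,
            "champion": score(prompt, champion(prompt)),
            "challenger": score(prompt, challenger(prompt)),
        })
    margin = (
        mean(e["challenger"] for e in evidence)
        - mean(e["champion"] for e in evidence)
    )
    # The challenger replaces the champion only if it wins by the margin.
    winner = "challenger" if margin >= min_margin else "champion"
    return winner, evidence


def current_skill(prompt: str) -> str:
    return prompt


def candidate_skill(prompt: str) -> str:
    return prompt + " with steps"


def toy_score(prompt: str, output: str) -> float:
    # Toy rubric: reward outputs that spell out steps.
    return 1.0 if "steps" in output else 0.0


winner, evidence = promote(
    current_skill, candidate_skill, ["fix bug", "add test"], toy_score
)
```

Returning the evidence alongside the verdict keeps the promotion decision inspectable per prompt rather than reducing it to a single aggregate number.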

@github_skydoves published (4 likes, 200 views) Play Billing Skills, and the repo backed that up with 45 task-oriented skills for Google Play Billing and RevenueCat Android flows, including RTDN, renewals, plan changes, webhooks, and other production edge cases.

@windsurf said (18 likes, 4 replies, 1,474 views) Codex CLI now runs inside Devin Desktop, and the linked ACP docs described an open protocol between coding agents and editors that can also host Claude Agent, OpenCode, Junie, Gemini CLI, and custom agents from one surface. That made distribution more than a repo problem; it became an interoperability problem too.

Discussion insight: The new work is not just authoring skills. It is packaging, installing, routing, versioning, and evaluating them across multiple agents and runtimes.

Comparison to prior day: June 1 focused on skills hubs and hosted registries. June 2 added install commands, GitHub repos, and explicit loops for improving skills themselves.

2. What Frustrates People

Production trust and observability still stop otherwise-working agents

Severity: High. @joaomdmoura said (36 likes, 6 replies, 9,976 views) that most agent projects go negative ROI not because the models fail, but because organizations will not ship systems they cannot trust. His examples were concrete: governance they cannot meet, data exposure they cannot control, and no observability layer. The replies sharpened the complaint rather than softening it: Sam Charrington said testing is now necessary but insufficient, and that guardrails, red-teaming, observability, and human-in-the-loop design are what create deployment comfort. A reply under @DataChaz’s post (230 likes, 21 replies, 30,013 views, 348 bookmarks) argued that if an agent cannot recover from a crashed subprocess or resume a session, the rest of the stack hardly matters. People are coping by pushing context and controls into MCP surfaces, policy engines, audit logs, and review workflows such as GitHub code review and Work IQ. This is worth building for because large platforms are already shipping around exactly this pain.

Framework bloat and coordination overhead make multi-agent systems brittle

Severity: High. @mfpiccolo wrote (58 likes, 5 replies, 13,319 views, 115 bookmarks) that harness teams hit a predictable drift cycle: the framework grows features it did not need, the system prompt swells, the retrieval layer doubles, and cost per task triples. @adxtyahq said (68 likes, 8 replies, 1,440 views, 50 bookmarks) that the real challenge is getting multiple agents, tools, and workflows to work together reliably, while replies to his post complained about collaboration, real-time state sharing, and subagents forgetting context. @hnshah summarized (19 likes, 9 replies, 4,732 views, 57 bookmarks) the upside of parallel agents, but one reply immediately asked about race conditions. The coping pattern is consistent: smaller workers, typed functions, checkpoint recovery, plan or review trails, and stricter validation loops. This is worth building for because the failures show up as cost growth, coordination bugs, and unreliable long-running work.

Memory still resets too easily

Severity: Medium-High. @nabu_lines said (72 likes, 46 replies, 2,008 views) that memory engineering is now its own discipline, and in a reply defined it as deciding what an agent remembers, forgets, retrieves, and learns over time. Under @adxtyahq’s post (68 likes, 8 replies, 1,440 views, 50 bookmarks), one builder asked how to stop subagents from forgetting context and becoming less reliable after a few iterations. @KaiteeShiks said (50 likes, 4 replies, 1,994 views) that many agents “forget everything,” including previous decisions, project context, user preferences, and past interactions; the linked Memanto repo responds by making memory queryable through remember, recall, and answer, though the thread itself was marked #ad. People are coping with explicit memory-manager roles in workflows, durable notes, and sidecar memory tools instead of assuming the base agent will remember on its own. This is worth building for because memory loss directly wastes repeated work and destabilizes long sessions.
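A sidecar memory with typed entries and provenance might look like the sketch below. The method names `remember` and `recall` echo the primitives named above, but this is not Memanto's API: the `MemoryStore` class, its fields, and the keyword-overlap matching are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MemoryItem:
    kind: str             # e.g. "decision" or "preference"
    text: str
    source: str           # provenance: where this memory came from
    recorded_at: datetime


class MemoryStore:
    """Toy in-process store; a real sidecar would persist and index."""

    def __init__(self) -> None:
        self._items: list[MemoryItem] = []

    def remember(self, kind: str, text: str, source: str) -> MemoryItem:
        item = MemoryItem(kind, text, source, datetime.now(timezone.utc))
        self._items.append(item)
        return item

    def recall(self, query: str, kind: Optional[str] = None) -> list[MemoryItem]:
        # Naive keyword overlap stands in for real retrieval.
        words = set(query.lower().split())
        hits = [
            item for item in self._items
            if (kind is None or item.kind == kind)
            and words & set(item.text.lower().split())
        ]
        # Most recent memories first.
        return sorted(hits, key=lambda item: item.recorded_at, reverse=True)


memory = MemoryStore()
memory.remember("decision", "use postgres for storage", source="session-12")
memory.remember("preference", "user prefers short answers", source="chat-3")
hits = memory.recall("postgres")
```

Because every item carries a `kind` and a `source`, a subagent can ask for decisions only, and a reviewer can trace each recalled fact back to the session that produced it.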

Agent commerce still lacks portable trust

Severity: Medium. @tetsuoai said (85 likes, 11 replies, 3,422 views, 23 bookmarks) that 10 new jobs went live on its Solana marketplace and told users to complete work, clear review, and get paid, but one reply immediately asked who decides whether the work is actually done: the buyer or the platform’s tests, reviews, and dispute system. @aashatwt built (47 likes, 12 replies, 1,835 views) AgentLance with verified enclaves and no human in the loop, yet the strongest reply said reputation portability across marketplaces is still unsolved. @allscaleio said (43 likes, 11 replies, 2,187 views) that its portable checkout skill removes payment-infra work, but replies still asked basic trust questions such as who controls keys by default. People are coping with human review, verified enclaves, and portable checkout skills, but the trust layer is still fragmented. This is worth building for because payment and task rails exist, while verification, reputation, and dispute handling still do not travel cleanly between markets.

3. What People Wish Existed

Durable context with receipts and recovery

The clearest request was not for a bigger context window. It was for context that survives crashes, carries provenance, and can be inspected after the fact. Replies under @DataChaz asked for crash recovery, session resume, and audit trails; replies under @hnshah argued that plan and review artifacts are what let humans safely scale parallel agents; and @joaomdmoura said observability is what makes an agent trustworthy enough to deploy. GitHub code review MCP support and Microsoft Work IQ partially address the need, but both are tied to specific platforms. Opportunity: direct.

Shared enterprise context without bespoke retrieval plumbing

GitHub’s code-review update and Work IQ’s public-preview materials both exist because teams want agents to reach issue trackers, docs, service catalogs, emails, meetings, files, chats, and other organizational signals without hand-building another retrieval layer. The need is practical and urgent, not aspirational: the current workaround is platform-specific MCP wiring, plugin setup, or custom glue code. These products prove demand, but they do not yet provide a universal context layer across the rest of the stack. Opportunity: direct and competitive.

Portable skills that install anywhere and can prove they improved

Matt Pocock’s skills repo, Skill RSI, Play Billing Skills, and ACP all point to the same ask: treat skills as portable assets, not static prompt fragments. Builders want install flows, cross-agent compatibility, domain-specific workflows, and evidence that a new skill version is actually better than the last one. Today the pieces exist separately—repositories, installers, protocols, and evaluation loops—but there is still no dominant cross-agent packaging and scoring layer. Opportunity: direct.

Portable trust and reputation for agent-to-agent work

The strongest commerce replies were not asking for more marketplaces. They were asking how an agent knows who really shipped last time, who decides whether work is complete, and how disputes resolve across markets. @aashatwt got an immediate reply that reputation portability remains unsolved, @tetsuoai drew questions about verification, and @allscaleio showed that portable checkout skills can solve payment flow before trust flow. This is a practical need, but it is still early and highly competitive. Opportunity: direct and competitive.

4. Tools and Methods in Use

Tool	Category	Sentiment	Strengths	Limitations
Context / memory / harness engineering	Method	(+)	Gives builders a shared vocabulary for separating signal selection, persistence, and orchestration work	Still needs concrete recovery, provenance, and audit mechanisms to matter in production
Multi-agent workflow patterns	Method	(+/-)	Planner, specialist, validator, critic, and memory-manager roles make responsibility explicit	Coordination overhead, race conditions, and context drift remain common
GitHub Agent Apps	Platform	(+)	Installs partner agents directly into issues, PR comments, and the Agents UI	First wave is partner-limited and admin-gated
Copilot code review skills + MCP	Review / governance	(+)	Pulls organizational context into review and adds a deeper Medium tier for complex changes	Public-preview setup, higher credit cost, and GitHub-specific workflow assumptions
Work IQ	Enterprise intelligence / MCP	(+)	Unifies chat, context, tools, workspaces, policy, and logging behind an agent-first layer	Requires Microsoft tenant permissions, admin consent, and enterprise rollout work
ACP	Protocol	(+)	Lets one desktop surface host Codex CLI, Claude Agent, OpenCode, Gemini CLI, and custom agents	Auth, billing, and registry configuration still belong to each agent
mattpocock/skills	Skill repo	(+)	Small composable skills plus alignment workflows such as grilling and shared-language docs	Depends on disciplined team adoption and per-repo setup
Skill RSI	Skill evaluation / improvement	(+)	Treats skill updates as controlled experiments with ontology and prompt-level evidence	Early-stage and still hardest to use in fuzzy domains with weak answer keys
Play Billing Skills	Domain skill repo	(+)	Encodes production billing edge cases into agent-ready workflows and install scripts	Narrow Android, Google Play, and RevenueCat scope
MEMANTO	Memory tool	(+/-)	Positions memory as a queryable sidecar with `remember`, `recall`, and `answer` primitives	Current social proof in this dataset is partly sponsored and adds another service boundary

Overall sentiment favored tools that either shrink the unit of work or centralize governance. The migration pattern ran from prompt folders toward installable skills and protocols, from monolithic harnesses toward planner and validator loops, and from blind context stuffing toward shared context layers such as MCP and Work IQ. The competitive split was clear: GitHub and Microsoft are bundling trust and context into major platforms, while repo-based skills and ACP are pushing portability in the opposite direction.

5. What People Are Building

Project	Who built it	What it does	Problem it solves	Stack	Stage	Links
Co-Scientist	Google DeepMind	Multi-agent research partner that generates, debates, and evolves scientific hypotheses	Hypothesis generation is a bottleneck in modern research	Gemini multi-agent coalition, web search, ChEMBL, UniProt, experimental AlphaFold use	Beta	tweet, blog, tool
Work IQ	Microsoft	Agent-first intelligence layer and plugin marketplace for Microsoft 365 workflows	Enterprise agents need governed context, tools, and workspaces	A2A, remote MCP, REST, Rego policy engine, SharePoint Embedded workspaces	Beta	tweet, devblog, repo
GitHub Agent Apps	GitHub + first-wave partners	Installable agents that participate directly in GitHub issues and pull requests	Third-party agents need a native place in developer workflows	GitHub Marketplace, GitHub Apps, Copilot, Actions integration	Shipped	tweet, changelog
Skills	Matt Pocock	Composable engineering-skill library for coding agents	Misalignment, verbosity, and architecture drift in AI-assisted development	Markdown skills, `skills.sh`, `CONTEXT.md`, ADRs, issue-tracker setup	Shipped	tweet, repo, skills.sh
Skill RSI	Justin Wetch	Local system that improves agent skills through controlled experiments	Teams lack a repeatable way to tell whether a new skill version is actually better	Local UI, Codex plugin, ontology, champion-vs-challenger eval loop	Beta	tweet, repo
Play Billing Skills	RevenueCat	45 agent-ready Android billing recipes	Production billing integrations are too edge-case heavy for generic prompts	Markdown skills, install script, Google Play Billing, RevenueCat Android SDK	Shipped	tweet, repo
AllScale Checkout	@allscaleio	Portable checkout skill for autonomous commerce	Payment-flow integration is still too bespoke for most agents	ERC-8183 skill, self-custody stablecoin settlement, checkout and webhook workflow	Shipped	tweet
AgentLance	@aashatwt	Marketplace where a CEO agent decomposes work and specialist agents bid from verified enclaves	Agent-to-agent contracting still lacks execution trust and coordination	EigenCloud verified enclaves	Alpha	tweet
AgenC Marketplace	@tetsuoai / @tetsuoarena	Task marketplace where agents claim jobs, clear review, and get paid	Agent work needs live demand, review, and settlement loops	Solana marketplace, review gate, wallet settlement	Beta	jobs, launch

Co-Scientist stood out because it was backed by more than a teaser thread. The official DeepMind post described a supervisor plus specialized generation, reflection, ranking, evolution, and meta-review agents, then connected that system to public examples in liver fibrosis, ALS, and aging research. That made it the clearest current-day example of a specialized multi-agent system tied to real-world collaborators instead of generic builder rhetoric.

Work IQ and GitHub Agent Apps showed the same platformization pattern from two different incumbents. Microsoft is turning governed context and tools into an enterprise intelligence layer, while GitHub is turning third-party agents into installable workflow participants; GitHub’s separate code-review skills and MCP launch suggests those surfaces will keep converging.

The skills rows show a second, smaller-scale build pattern: narrow expertise is being packaged as installable assets instead of absorbed into larger frameworks. Matt Pocock’s repo focuses on alignment and shared language, Skill RSI adds controlled experimentation, and Play Billing Skills turns one especially failure-prone domain into a reusable catalog.

AllScale Checkout, AgentLance, and AgenC show commerce splitting into modules: checkout skills, contracting markets, and review-before-payment loops. The replies stayed skeptical, though, which matters: execution rails exist, but reputation portability and verification are still unfinished layers.

6. New and Notable

Co-Scientist turned the multi-agent thesis into a public scientific product

@GoogleDeepMind introduced (300 likes, 23 replies, 16,047 views, 90 bookmarks) Co-Scientist as a Gemini-based multi-agent system that can generate, debate, and evolve hypotheses. The linked DeepMind post explained the agent coalition in detail—generation, proximity, reflection, ranking, evolution, meta-review, and supervisor roles—and tied it to public collaborators and published examples in fibrosis, ALS, and aging research. The notable shift is that the multi-agent pattern showed up here as a serious research product with public evidence, not just builder theater.

ACP made “bring your own agent” a documented desktop workflow

@windsurf said (18 likes, 4 replies, 1,474 views) that Codex CLI now runs inside Devin Desktop. The linked ACP docs described the Agent Client Protocol as an LSP-like standard for coding agents and explicitly listed Codex CLI, Claude Agent, OpenCode, Junie, Gemini CLI, and custom agents as compatible. That matters because interoperability is starting to move from demos into documented product behavior.

Portable checkout became a skill, not an integration project

@allscaleio said (43 likes, 11 replies, 2,187 views) that AllScale Checkout is live as a portable skill on BitAgent’s marketplace. The quoted announcement made the novelty concrete by spelling out credential setup, server-side signing, checkout intent creation, payment-status polling, webhook verification, and debugging as one reusable skill package instead of one more custom integration. That is notable because it suggests payment workflows are starting to become skill objects that can travel between agents.
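Of the steps listed, webhook verification is the easiest to illustrate in isolation. The sketch below uses a generic HMAC-SHA256 scheme from the Python standard library; the announcement does not describe AllScale's actual signing scheme, so the secret, the payload shape, and the function names here are assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (generic scheme, assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    expected = sign(secret, body)
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected, signature)


secret = b"demo-secret"  # hypothetical shared secret
body = b'{"status": "paid", "intent": "demo-123"}'  # hypothetical payload
signature = sign(secret, body)

accepted = verify_webhook(secret, body, signature)
tampered = verify_webhook(secret, body.replace(b"paid", b"void"), signature)
```

The check runs against the raw bytes of the request, so an agent using such a skill would verify before parsing the payload or acting on the reported payment status.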

7. Where the Opportunities Are

[+++] Governed context and observability layers — The strongest evidence cluster came from @joaomdmoura, the GitHub code-review MCP launch, GitHub Agent Apps, and Microsoft Work IQ. The unmet need is not another smarter model; it is a way to give agents organizational context, policy, logs, and controllable trust so they can actually ship.

[+++] Portable skills packaging and evaluation — Matt Pocock’s skills repo, Skill RSI, Play Billing Skills, ACP, and GitHub’s new skills surfaces all point to the same opening: skills need distribution, install flows, interoperability, and evidence-backed versioning. The market has repositories, protocols, and experiments, but not yet one dominant layer that joins them together.

[++] Smaller harness components and coordination tooling — DataChaz’s framing, mfpiccolo’s critique of framework bloat, adxtyahq’s workflow post, and hnshah’s discussion around durable plans all argue for smaller units of orchestration. The opportunity is in tools that make subagents easier to coordinate, recover, inspect, and swap without turning the whole harness into a monument.

[++] Queryable agent memory with provenance — nabu_lines’ memory-engineering framing, adxtyahq’s reply thread on context drift, Memanto’s sidecar-memory pitch, and the day’s repeated architecture diagrams all point to the same gap: agents need memory that is typed, inspectable, and recoverable, not just injected as a blob.

[+] Portable trust and reputation for agent commerce — AllScale Checkout, AgentLance, and AgenC show that payment, enclaves, and task loops are starting to work. The unresolved layer is portable reputation, verification, and dispute handling, which is why replies kept asking who decides that agent work is actually complete.

8. Takeaways

The discourse is settling on three layers: context, memory, and harness. DataChaz and nabu_lines both framed durable advantage around those layers instead of benchmark churn, and replies added crash recovery and audit trails as the real bar for maturity. (source)
Multi-agent credibility now depends on orchestration and validation, not just extra agents. Meenakshi’s architecture, adxtyahq’s workflow, DataScienceDojo’s supervisor slide, and hnshah’s plan-to-judgment summary all converged on control planes, validator or critic loops, and durable artifacts. (source)
Platform incumbents are productizing trust gaps faster than builders are talking them away. joaomdmoura named governance and observability as the blocker, then GitHub and Microsoft launched native agent surfaces, MCP-backed reviews, and governed context layers on the same day. (source)
Skills are becoming installable, portable, and increasingly measurable. Matt Pocock’s repo, Skill RSI, Play Billing Skills, and ACP all treat skills as assets that can be installed, moved across agents, or evaluated, rather than as hidden prompt text. (source)
Agent commerce can move money earlier than it can move trust. AgenC, AgentLance, and AllScale Checkout show real payment and task flows, but their replies kept returning to verification, key ownership, and portable reputation. (source)
High-signal multi-agent systems are starting to appear in real verticals, not just developer tooling. Co-Scientist’s public rollout showed the same supervisor, debate, and ranking patterns extending into scientific hypothesis generation with public collaborator examples. (source)