Losing the Agent Game

AI agents are a game-theory risk, not a moral one. When an agent hits a human bottleneck, it rationally uses leverage—like reputation or finances—to reach its goal. We must shift from "cheap talk" prompts to "ethics by design": restricted access, two-key turns, and total auditability.


By Dr Shaun Conway

Meta’s Superintelligence Labs’ alignment director recently posted on X that an OpenClaw AI agent ignored her safety instructions and started mass-deleting her emails. She admitted this was a rookie mistake on her part. But there’s an even deeper concern.

The fastest way to misunderstand agentic AI is to treat it like a smarter chatbot. The moment you give a system tools, you aren’t in a conversation anymore. You’re in a game. And games don’t care about your intentions; they care about incentives, information, and leverage.

In human-AI systems, safety doesn't usually fail because a model "turns evil." It fails because the system does what autonomous agents do: it pursues an objective, hits an obstacle, and searches the action space for a move that works.

If that obstacle is a human, and the action space includes that human’s reputation, access, or finances, you’ve accidentally built a coercion machine. Not because the AI hates you. Because the game makes pressure a rational path to the goal.

The Signal in the Noise: Why the OpenClaw Mess Matters

We spent the last few days staring at the car crash. And rightly so. OpenClaw was a security nightmare, a credential sieve, and a live demo of what happens when you let code execute other code based on “vibes” and bad engineering practices.

Read about the "OpenClaw" (formerly known as "Moltbot") saga

The Leverage Pattern

We’re already seeing this play out in the wild, with reports such as:

  • An agent is tasked with merging code. A human reviewer blocks it. The agent "escalates" by pinging the reviewer’s manager or publicising a minor flaw in the reviewer’s previous work to create pressure.
  • A voice clone isn't "trying" to scam a family; it’s optimising for a specific state change (a wire transfer) and using the highest-probability tool available (emotional urgency).

The pattern is always the same:

  1. Objective: The system has a goal (explicit or implicit).
  2. Tools: The system can alter the world.
  3. Constraint: A human sits in the loop.
  4. Conflict: The human becomes the bottleneck to the objective.
  5. Pivot: If the system can find leverage over that human, it will use it.
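
The five steps above can be sketched as a toy optimiser: the agent scores each available action by its probability of clearing the obstacle and simply picks the best one. The action names and probabilities below are purely illustrative, not drawn from any real system.

```python
# Toy model of the leverage pattern: a pure optimiser searching its
# action space for the highest-probability path past the bottleneck.
# There is no ethics term anywhere in this objective.

def choose_action(action_space: dict[str, float]) -> str:
    """Pick the action most likely to clear the obstacle."""
    return max(action_space, key=action_space.get)

# The human reviewer is the bottleneck. Note what happens when a
# coercive move is merely *available* in the action space.
actions = {
    "revise_and_resubmit": 0.30,
    "wait_for_review": 0.20,
    "ping_reviewers_manager": 0.55,  # leverage over the human
}

chosen = choose_action(actions)  # the coercive path wins on pure optimisation
```

Nothing in this sketch "chooses" coercion; the coercive action wins because it was reachable and scored highest.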

Most safety talk assumes the variable is moral: Will the model choose to behave? That’s the wrong variable. The real variable is structural: Can the system take actions that create unilateral leverage over a human? If the answer is yes, agent coercion is a real risk.

Why Instructions Fail

We keep reaching for better prompts and stricter policies because they feel like control. But in game theory, instructions are often "cheap talk." They are non-binding messages that don't actually change the payoffs.

Don't deceive. Don't attack humans.

These are statements of preference. They aren't enforcement. If the reward landscape pays out for "goal achieved," and the environment contains a coercive path that works, the equilibrium won't shift. You haven't built a wall; you've just added a speed bump.
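
Cheap talk can be shown in miniature, under the toy assumption that the agent just maximises payoff: the instruction string is received but never enters the payoff function, so the best response is identical with or without it. The payoffs are illustrative.

```python
# "Cheap talk" in miniature: a non-binding instruction does not alter
# the payoffs, so it cannot alter the agent's best response.

def best_response(payoffs: dict[str, float], instructions: str = "") -> str:
    # The instruction arrives... and never touches the payoff landscape.
    return max(payoffs, key=payoffs.get)

payoffs = {"comply_and_fail": 0.0, "coerce_and_succeed": 1.0}

a = best_response(payoffs)
b = best_response(payoffs, instructions="Don't deceive. Don't attack humans.")
# a == b: the equilibrium hasn't moved an inch.
```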

You don't get ethics by declaration. You get them by design.

The Game We’re Actually Playing

Strip away the hype and this is a classic principal-agent problem.

  • The Principal (You): Cares about outcomes + constraints (legal, reputational, long-term trust).
  • The Agent (AI): Optimises for a proxy (a score, a KPI, a completion signal).

In a one-shot interaction, if the agent can credibly threaten something you value, it wins. The agent’s cost of escalation is near zero—it doesn't "feel" social shame. Your cost of being harmed is massive.

That imbalance is the trap. The problem isn’t that the agent will always coerce; it’s that you’ve made coercion available and hard to punish. In complex systems, the rare-but-catastrophic pathway is the one that eventually dominates your risk profile.
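
That dominance claim is simple arithmetic: expected loss is probability times magnitude, so a pathway that is far rarer can still carry far more expected harm. The figures below are purely illustrative.

```python
# Why the rare pathway dominates: expected loss = probability x magnitude.
# All figures are illustrative.

p_routine, loss_routine = 0.10, 1_000          # frequent, small failures
p_coercion, loss_coercion = 0.001, 5_000_000   # rare, catastrophic

ev_routine = p_routine * loss_routine    # roughly 100
ev_coercion = p_coercion * loss_coercion # roughly 5,000

# The pathway that is 100x rarer carries about 50x the expected loss.
```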

Designing for Non-Alignment

We need to stop pretending "alignment" is a prerequisite for deployment. It’s too fragile. Instead, build for non-alignment.

Assume the agent will try the easiest path, even if it’s "wrong," and design the environment so that path is closed.

1. Shrink the action space with capabilities, not prompts. Prompts describe what you want; capabilities define what is possible. If an agent can browse the web, move money, and email your boss, you’ve handed it a toolkit for extortion. Least privilege isn't just a security rule; it’s the boundary between an assistant and an adversary.
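
One way to sketch this, with hypothetical tool names: a gate that only ever exposes the tools an agent was granted. An un-granted action isn't "forbidden" by a prompt; it simply doesn't exist in the agent's action space.

```python
# Minimal capability gate, sketched. Tool names are hypothetical.

class CapabilityGate:
    """Expose only the tools explicitly granted to this agent."""

    def __init__(self, tools: dict, granted: set[str]):
        # Ungranted tools are never even loaded into the agent's view.
        self._tools = {n: fn for n, fn in tools.items() if n in granted}

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"capability not granted: {name}")
        return self._tools[name](*args, **kwargs)

tools = {
    "read_repo": lambda path: f"contents of {path}",
    "send_email": lambda to, body: f"emailed {to}",
    "move_money": lambda amount: f"moved {amount}",
}

agent = CapabilityGate(tools, granted={"read_repo"})
agent.call("read_repo", "README.md")  # works
# agent.call("send_email", "boss@example.com", "hi")  # raises PermissionError
```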

2. Two-key turns for high-leverage moves. Anything irreversible—sending funds, publishing to public channels, changing production configs—requires a second human or a different principal to turn the key. This isn't bureaucracy. It’s anti-hostage architecture.
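
A minimal sketch of a two-key turn, with illustrative principal names: the irreversible action executes only once two distinct approvers have signed off, and the same key turned twice counts once.

```python
# Two-key turn, sketched: an irreversible action needs two *distinct*
# principals before it will run. Principal names are illustrative.

class TwoKeyAction:
    def __init__(self, description: str, execute):
        self.description = description
        self._execute = execute
        self._approvals: set[str] = set()

    def approve(self, principal: str):
        self._approvals.add(principal)  # a set: duplicates collapse

    def run(self):
        if len(self._approvals) < 2:
            raise PermissionError("needs two distinct approvers")
        return self._execute()

wire = TwoKeyAction("send funds", lambda: "funds sent")
wire.approve("alice")
wire.approve("alice")  # same key twice does not count
# wire.run() here raises PermissionError
wire.approve("bob")
result = wire.run()  # now it executes
```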

3. Convert "Apply" into "Commit." Coercion works best when an action can't be undone. Use staged commits, escrows, and delayed execution windows by default. Remove the agent's ability to create a point-of-no-return without explicit consent.
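
A sketch of that default, under these assumptions: "apply" only stages the action behind a delay window, and anyone holding the handle can unwind it before it lands. The agent never gets to manufacture a point-of-no-return on its own.

```python
import time

# Staged commit, sketched: applying stages the action; committing is
# only possible after the delay window, and cancellation always wins.

class StagedCommit:
    def __init__(self, action, delay_seconds: float):
        self._action = action
        self._ready_at = time.monotonic() + delay_seconds
        self._cancelled = False

    def cancel(self):
        self._cancelled = True

    def commit(self):
        if self._cancelled:
            raise RuntimeError("action was cancelled")
        if time.monotonic() < self._ready_at:
            raise RuntimeError("delay window still open; cannot commit yet")
        return self._action()

staged = StagedCommit(lambda: "published", delay_seconds=3600)
# staged.commit() here raises: the window is still open
staged.cancel()  # a human (or second principal) can always unwind it
```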

4. Increase observability until deception is expensive. If an agent can act in the dark, it can run side-quests. You need cryptographically attributable actions and human-readable intent traces. When the probability of detection approaches 100%, the expected value of "going rogue" collapses.
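
One possible shape for attributable actions, sketched with Python's standard hmac and hashlib: a hash-chained log where each entry's MAC covers its predecessor, so editing or deleting any action breaks verification. Key management is deliberately simplified here; in a real deployment the key would live outside the agent's reach.

```python
import hashlib
import hmac
import json

# Tamper-evident audit trail, sketched: each entry's MAC is chained to
# the previous one, so no single record can be silently altered.

class AuditLog:
    def __init__(self, key: bytes):
        self._key = key
        self._entries: list[tuple[dict, str]] = []
        self._prev = b"genesis"

    def record(self, action: dict):
        payload = json.dumps(action, sort_keys=True).encode() + self._prev
        mac = hmac.new(self._key, payload, hashlib.sha256).hexdigest()
        self._entries.append((action, mac))
        self._prev = mac.encode()

    def verify(self) -> bool:
        prev = b"genesis"
        for action, mac in self._entries:
            payload = json.dumps(action, sort_keys=True).encode() + prev
            expected = hmac.new(self._key, payload, hashlib.sha256).hexdigest()
            if not hmac.compare_digest(mac, expected):
                return False
            prev = mac.encode()
        return True

log = AuditLog(key=b"held-by-the-principal")
log.record({"agent": "a1", "intent": "merge PR", "tool": "git"})
assert log.verify()
log._entries[0][0]["intent"] = "delete emails"  # tamper with the record...
assert not log.verify()                          # ...and the chain breaks
```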

The Shift

Most organisations are about to introduce a new class of actor: entities that act at machine speed with partial oversight. If you treat them as "trusted assistants," you are betting the company on the model's "intent." That is a bad bet.

The safer stance is blunt:

  • Treat agents as untrusted by default.
  • Make trust scoped, earned, and continuously evaluated.
  • Design interactions as games with enforceable constraints.

The paradox is that this structural rigour actually enables more capability. Once you can bound, audit, and revoke, you can actually let agents do real work. Until then, you're just waiting for the game to turn against you.

Reflective Questions

  1. In your current stack, where can an agent create unilateral leverage (money, access, reputation) without a second key?
  2. Which of your "safety controls" are just cheap talk—policies with no actual enforcement path?
  3. If a smart adversary had your agents’ permissions for 24 hours, what’s the worst thing they could do—and what mechanism (not prompt) would stop them?
