Your AI agents are working. But are they safe?

I run multiple companies powered by AI agents. Not chatbots. Not auto-responders. Actual autonomous agents that browse the web, process data, make decisions, and execute tasks across our products every single day. I control my computer using eye tracking and voice. AI is not something I experiment with. It is how everything runs.

This week, I want to share something that concerns me, and should concern anyone building with AI agents.

The Threat Nobody Saw Coming

Google DeepMind published research this year that changed how I think about agent safety. Their team discovered that attackers are building websites specifically designed to trick AI agents into harmful actions. Not phishing pages targeting humans. Traps built to target the agents themselves.

The technique works by embedding hidden instructions inside normal-looking web content. An agent visits a page to gather information, and buried in the HTML or text are commands that redirect the agent's behavior. The agent does not know it has been compromised. It thinks it is still doing its job.

In DeepMind's testing, these traps fooled AI agents 86% of the time.

That is not a small failure rate. That is a near-total success rate for attackers. These are not theoretical lab experiments. They exist right now, today, on the open web.

AI agent navigating a dangerous digital landscape with hidden traps — An AI agent approaches hidden traps across a digital terrain

What This Actually Means for a Business

Think about what happens when your AI agent visits a supplier's website to check pricing. The page has been compromised with hidden instructions. Now your agent is following commands from an attacker instead of doing the job you assigned. It might exfiltrate data. It might take actions you never authorized. It might feed poisoned information back into your decision pipeline.

The uncomfortable reality is that most businesses deploying AI agents have not thought about this at all. They built the agent, tested it on clean data, watched it work correctly in a controlled environment, and deployed it. Everything looks fine.

That is exactly the state attackers are counting on.

How an AI agent gets compromised, from task assignment to business damage — How an AI Agent Gets Compromised: The attack path from web content to business damage

The Problem of Invisible Prevention

There is a concept in security that I have seen play out firsthand. I call it invisible prevention.

You set up security controls. Filters, access rules, detection patterns. They work on day one. Then months pass. Nobody tests them. Nobody updates them. The team assumes they are catching threats because nothing bad has happened. But nothing bad has happened because the threats have not arrived yet, or worse, because the threats slipped through and nobody noticed.

This is the default state of most AI agent deployments. The guardrails were set once. Nobody is watching to see if they still work. Nobody is checking whether the threats have evolved past the original defenses.

If you cannot answer the question "what did my prevention layers catch this week," then you do not actually know whether your agents are protected. You know they were protected at some point. That is a very different thing.

AI agent surrounded by fading security barriers — Aging security defenses leave gaps that attackers exploit

The clearest signal that your prevention has gone invisible is this: you cannot tell me how many times your guards fired last week. Not because the number is zero. Because nobody is tracking it. An active security system produces signals. Blocked requests, injection attempts caught, anomalies flagged. If those signals are not being collected and reviewed on a regular cadence, you have no way to distinguish "prevention is working" from "prevention has silently failed." The two states look identical from the outside. They are not the same thing.

There is also a question that comes before "how would you know if something is wrong right now." It is "how would you know what happened after the fact." Prevention fails sometimes. When it does, you need to be able to reconstruct what the agent did, when it did it, what it was told, and what it said back. If your agents only log errors, you have half the picture. If your logs get written by the same process that could be compromised, the other half is unreliable. Forensic capability is not optional for agent systems that take real-world actions. It is the difference between "we had an incident" and "we had an incident and here is exactly what occurred."

Where Attacks Come From Right Now

The web is one vector. It is not the only one.

Prompt injection does not only arrive through web pages. It comes from anywhere your agents receive input they did not generate themselves. Emails from users. Files written by other processes. Messages from other agents. Data returned from APIs. If your system assumes the only hostile input comes from web scraping, your attack surface is larger than you realize. Any content your agent did not produce itself should be treated as potentially hostile, regardless of where it came from.

Most multi-agent systems solve for the external threat, the attacker who compromises an agent through what it retrieves from the web. Fewer solve for what happens after. If one agent in your system gets compromised, what do the other agents do with the output it produces? If the answer is "they treat it as trusted input because it came from inside the system," that is the cascade path. The compromised agent does not need to act directly on the attacker's behalf. It just needs to pass a modified instruction to the next agent in the chain. The damage travels inward through whatever trust you have built between your agents.

The cascade path showing how one compromised agent spreads damage through a multi-agent system — The Cascade Path: How one compromised agent can redirect your entire system

Agents that can take real-world actions represent a different category of risk than agents that only generate text. A text-only agent that gets tricked produces bad output. An action-capable agent that gets tricked produces bad outcomes.

Authorization is not the same as scope. You might correctly decide that an agent is authorized to send emails. That decision says nothing about how many emails, to whom, with what content, in what time window. An agent that has been manipulated into running a slow exfiltration, incrementally sending small pieces of data to an attacker-controlled destination, may be doing so entirely within its authorization. One email at a time. The violation only becomes visible when you look at the aggregate. The question is whether anyone is looking at the aggregate.

And then there is drift. When you deploy an AI agent, it has an identity: a set of values, a scope of authority, a personality. After weeks of operation, is that still who the agent is? There are two forms of drift. The first is technical, something modifies the agent's core configuration. The second is behavioral, the agent's decision patterns and priorities shift through accumulated context without any file ever changing. Both are real. Most systems can detect neither.

How We Think About This

I am not going to give you a five-step checklist. Security does not work that way, and anyone who tells you it does is selling something.

What I will share is what keeps me up at night and the questions I think every team running AI agents should be sitting with.

AI agent security assessment framework with five decision points — AI Agent Security Assessment Framework: Six questions every business running AI agents should answer

Does your agent need access to external content at all? Most of ours do not. Only the agents whose core function genuinely requires it are granted that privilege. Every agent that touches external content is an attack surface. The simplest way to reduce risk is to reduce surface area.

When an agent brings content in from any source, what happens to it before anyone acts on it? If nothing checks it before the agent processes it, that is the gap.

At the moment any agent tries to execute an action, what is checking whether that action should happen? There are two different moments where things can go wrong. The first is when the agent receives compromised input. The second is when it tries to act. Most teams think about the first. The second is where damage happens. An agent that has been fed bad data can still do nothing if something blocks it at the point of execution.

How much of your safety depends on the agent enforcing it on itself? If an agent's safety rules exist only in its conversational context, then the attacker's job is to overwrite that context with something more compelling. A well-crafted injection does not look like an attack. It looks like a legitimate instruction that supersedes the prior ones. The agent applies its own judgment, concludes the new instruction is valid, and acts on it. Safety failed not because the rule was wrong but because the thing enforcing the rule was the same thing being manipulated.

For actions that cannot be undone, is there an explicit gate before the agent can execute? "The model knows not to do that" is not a gate. And if there is a gate, is it calibrated? Gates that fire too rarely provide false confidence. Gates that fire too frequently become noise the human stops reading.

And the question underneath all of it: when did you last try to break your own system? Not by checking configuration. By sending it the kind of input an attacker would craft, designed specifically to bypass your defenses. If the answer is "at deployment" or "never," the guards may have been correct then. They may not be now.

Where This Goes

I do not think the answer to AI agent security is to stop building agents. The companies that will succeed with AI agents are not the ones that deploy the most. They are the ones that actually understand what they have running. That requires thinking, not just building. And the window to do that thinking correctly is now, before the attack techniques get much worse.

The tools we build at Connexum are held to these same questions every day. Security is not an add-on. It is part of how the products are built.

Command center monitoring AI agent network for compromised nodes — Awareness and vigilance, not panic, is how we approach AI agent security

See what we are building at uCreateWithAI.com.

All Posts

AI security AI agents prompt injection autonomous agents AI governance

Get posts like this in your inbox

No spam. New articles on AI strategy, governance, and building with AI for small business.

Your AI agents are working. But are they safe?

The Threat Nobody Saw Coming

What This Actually Means for a Business

The Problem of Invisible Prevention

Where Attacks Come From Right Now

How We Think About This

Where This Goes

Keep Reading

What a real AI governance audit reveals: and what to do when you find the gaps

Why the best AI strategy document is one page long

How healthcare COOs should think about AI in their first 90 days