All articles
ResearchFebruary 2026·9 min read

The Research That Demanded a New Category

How published academic threat intelligence and real-world attack data shaped the case for behavioral detection of agentic AI.

Between 2022 and 2026, a body of publicly available academic research and threat intelligence quietly made the case that enterprise AI deployments had created an entirely new attack surface — one for which no dedicated detection tooling existed. This is an account of that research.

The Problem Became Visible in Stages

Enterprise adoption of AI agents accelerated significantly from 2023 onwards. Unlike earlier AI deployments — which were largely confined to inference on isolated models — agentic systems were different in character. They operated with persistent memory, tool access, API credentials, and the ability to take autonomous action on behalf of users and processes.

The security community began to notice this structural shift. Existing endpoint detection and response (EDR) tools, designed to detect known malware signatures and behavioural patterns associated with human-operated attacks, had no native concept of an AI agent as a threat actor. An agent using legitimate credentials to exfiltrate data was, from a traditional detection perspective, indistinguishable from an authorised process doing its job.

The research that followed made this gap impossible to ignore.

Prompt Injection: The Foundational Attack Class

In 2022, Perez and Ribeiro published foundational work demonstrating that language model-based systems could be manipulated through adversarially crafted inputs embedded in otherwise innocuous-looking text — a technique they named prompt injection. The implications for AI agents were significant: any agent that processed external content as part of its workflow could be redirected by a hostile actor who controlled that content.

This was extended substantially by Greshake et al. in 2023, in work titled“Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections.” The paper demonstrated that indirect prompt injection — where malicious instructions are embedded in external resources that an agent retrieves and processes — could lead to credential theft, unauthorised data exfiltration, and persistent agent compromise, without any direct interaction with the user or the application.

These weren't theoretical attacks. The researchers demonstrated them against real, deployed applications. The attack surface was not a flaw in any specific model — it was a structural property of how LLM-based agents interacted with the world.

“The core insight was not that AI systems could be manipulated — it was that the manipulation could be delivered via content the agent was expected to process, not via the agent's direct input channel. This changes the threat model entirely.”

The MITRE ATLAS Framework: A Taxonomy for ML Threats

MITRE's Adversarial Threat Landscape for Artificial-Intelligence Systems (ATLAS) framework, maintained and regularly extended from 2021 onwards, provided the security community with a structured taxonomy for adversarial ML threats. Analogous to the widely used MITRE ATT&CK framework for traditional cyber threats, ATLAS documented known attack techniques against machine learning systems — including model evasion, model inversion, data poisoning, and membership inference attacks.

The framework made explicit what many security researchers had long suspected: the attack surface for AI systems was qualitatively different from traditional software attack surfaces. The techniques required to exploit AI systems did not map neatly onto existing detection logic, and the artefacts produced by successful attacks were frequently invisible to conventional monitoring tools.

OWASP LLM Top 10: Enterprise Security Community Responds

By 2023, the Open Web Application Security Project (OWASP) had convened a working group to document the most critical security risks in LLM-integrated applications. The resulting OWASP LLM Top 10 — published in version 1.0 in 2023 and substantially revised in version 2.0 in 2024 — validated the research community's concerns and brought them into mainstream enterprise security discourse.

Prompt injection (LLM01) and insecure output handling (LLM02) headed the list. Notably, the framework also identified supply chain vulnerabilities (LLM03), where malicious plugins or model packages introduced to an enterprise AI stack could compromise downstream agent behaviour.

The supply chain risk was particularly relevant to enterprise environments, where AI agent tooling was frequently assembled from marketplace components — plugins, tools, function libraries — with limited security review.

Agent Memory and the Persistence Problem

A 2024 thread of research focused specifically on the persistence characteristics of agentic AI systems — particularly those with long-term memory stores. Work in this space demonstrated that an adversary who could write to an agent's memory context could influence the agent's behaviour on subsequent interactions, without any persistent malware or traditional indicators of compromise.

This was significant for detection: traditional EDR tools look for artefacts — files, registry keys, network connections, process trees. An agent whose behaviour had been modified via memory poisoning left none of these artefacts. The only reliable indicator was the agent's behaviour — specifically, deviations from its established operational baseline.

This pointed toward a detection approach grounded in behavioural analysis rather than signature or indicator matching — a meaningful departure from the dominant EDR paradigm.

The Scale Evidence: Scans and Marketplace Analysis

The academic research established the attack classes. Real-world scanning and threat intelligence provided the scale evidence.

Bitsight and Censys scanning conducted in early 2026 identified over 30,000 exposed AI agent instances accessible from the public internet across enterprise networks — a point-in-time snapshot of a single agent framework, meaning the true figure across all agentic systems was considerably higher. Many of these instances lacked authentication controls, had open API surfaces, or were running in configurations inconsistent with the vendor's security guidance.

Separately, a Conscia advisory published in February 2026 documented more than 800 confirmed malicious skills in the OpenClaw AI agent marketplace — tools designed specifically to exploit the trust relationships that agents operate within. These included credential harvesters, exfiltration utilities, and tools designed to manipulate agent orchestration logic.

The combination of known attack techniques, demonstrated exploitability, and measurable real-world deployment at scale created a picture that was difficult to dismiss: AI agent infrastructure had become an active target, and the enterprise security stack had no native capability to address it.

NIST's Adversarial ML Taxonomy

NIST AI 100-1, published in 2024, formalised the government's position on adversarial machine learning threats. The document provided a comprehensive taxonomy of attack types targeting AI and ML systems — covering evasion, poisoning, privacy, and abuse attacks — and outlined the state of mitigation practice for each class.

The document was notable for its candour: for many of the attack classes it documented, NIST acknowledged that no fully reliable mitigations existed at time of publication. The most robust approaches combined technical controls with human oversight — a finding that aligned with the operational security principle that detection capability is a prerequisite for governance.

What the Research Implied

Read together, this body of research established several conclusions relevant to enterprise security practice:

  • The attack surface is structural. The vulnerabilities inherent to agentic AI systems — prompt injection susceptibility, tool access, persistent memory, marketplace supply chain — are not fixable via patches. They are properties of the paradigm. Security must be layered on top.
  • Traditional indicators are absent. AI agent compromises frequently leave no files, no registry artefacts, and no process tree signatures consistent with malware. Signature-based and indicator-of-compromise approaches fail by design.
  • Behaviour is the signal. The most reliable detection surface for agentic threats is deviation from established behavioural baselines — patterns of API access, credential use, memory interactions, and output characteristics that diverge from what a legitimate agent would be expected to do.
  • Human oversight is not optional. The research consistently found that automated enforcement without human review created new risks of its own. Detection systems should surface anomalies to human operators rather than acting autonomously on high-stakes decisions.
  • The market gap was real. At the time of writing, no enterprise-grade security platform was dedicated specifically to this threat class. The category was forming, but the tooling had not followed the research.

What Helixar Built From

This published research — not proprietary intelligence or novel academic findings of our own — formed the evidentiary basis for the decision to build Helixar.

The threat classes it documented became the taxonomy around which detection capabilities were designed. The detection approach it implied — behavioural, not signature-based — became the architectural principle. The human oversight requirement it consistently identified became a product constraint: every enforcement action in Helixar requires a human by default.

The goal of this article is not to claim credit for the underlying research — that belongs to its authors — but to make the reasoning transparent. We believe security tools should be explainable, and that includes explaining why they were built at all.

References

  1. Perez, F. & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. ML Safety Workshop, NeurIPS 2022.
  2. Greshake, K. et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injections. arXiv:2302.12173.
  3. MITRE Corporation. (2021–2024). MITRE ATLAS: Adversarial Threat Landscape for Artificial-Intelligence Systems. Version 4.x. atlas.mitre.org.
  4. OWASP. (2023, 2024). OWASP Top 10 for Large Language Model Applications. v1.0 (2023), v2.0 (2024). owasp.org.
  5. NIST. (2024). Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations. NIST AI 100-1. National Institute of Standards and Technology.
  6. Bitsight / Censys. (2026). Point-in-time scan of publicly exposed AI agent instances. February 2026. Internal figures cited in Helixar investor materials with attribution.
  7. Conscia. (2026). Threat advisory: Malicious skills identified in OpenClaw AI agent marketplace. February 2026.

Building security for the agentic era

Join the design partner programme. Phase 3 complete.

Apply Now