GTIG Report: AI-Discovered Zero-Day, Autonomous Malware Hit the Wild

Illustration generated by Helixar Research Labs. Not a depiction of a real system, attack, or affected product.

At a Glance

Identifier: GTIG May 2026 Report

Threat Class: Agentic Exploitation

Attack Vector: Multiple (API Misuse, Logic Flaw)

Affected: Multiple LLM Providers, Android

Google Threat Intelligence Group's May 2026 report documents a significant inflection in the AI threat landscape. The analysis draws on findings from Mandiant and Gemini and details four simultaneous developments: the first confirmed case of an AI-discovered zero-day used in active exploitation; an Android backdoor that uses a frontier LLM as its C2 reasoning engine; new AI-enabled malware families; and managed, industrial-scale LLM infrastructure used by state actors. [1]

The First AI-Discovered Zero-Day

GTIG confirmed with high confidence that a cybercrime actor used an AI model to discover and weaponize a zero-day. The vulnerability was not a typical memory corruption or injection flaw. It was a semantic logic error enabling a two-factor authentication bypass in a widely used administration tool. [5]

The flaw existed because of a hardcoded trust assumption. A developer had embedded an undocumented exception to the application's 2FA enforcement policy. Traditional fuzzers and static analyzers miss this class of vulnerability. The LLM found the flaw by modeling the developer's intent and identifying contradictions between the stated security policy and the actual code paths.
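To make this class of flaw concrete, the following is a minimal, hypothetical sketch and not the affected product's code. The report does not say what form the exception took, so the hardcoded trusted address, the `LoginRequest` type, and all function names here are assumptions for illustration only. The sketch contrasts a stated policy with an implementation that silently deviates from it, and shows that the flaw surfaces only when the two are compared.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LoginRequest:
    username: str
    password_ok: bool
    otp_ok: bool
    source_ip: str


def login_allowed_per_policy(req: LoginRequest) -> bool:
    """Stated policy: every login needs a valid password AND a valid second factor."""
    return req.password_ok and req.otp_ok


# Hypothetical hardcoded trust assumption (illustrative value, not from the report).
LEGACY_TRUSTED_IP = "10.0.0.5"


def login_allowed_as_implemented(req: LoginRequest) -> bool:
    """Implementation with an undocumented exception to 2FA enforcement."""
    if not req.password_ok:
        return False
    if req.source_ip == LEGACY_TRUSTED_IP:
        return True  # second factor silently skipped for the "trusted" source
    return req.otp_ok


def find_policy_contradictions(requests: Iterable[LoginRequest]) -> List[LoginRequest]:
    """Return requests the implementation accepts but the stated policy rejects."""
    return [
        r for r in requests
        if login_allowed_as_implemented(r) and not login_allowed_per_policy(r)
    ]
```

Nothing in `login_allowed_as_implemented` crashes or corrupts memory, and every branch is syntactically ordinary, which is why pattern-based tooling has little to flag. The defect exists only relative to the intended policy.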

GTIG described this capability as "developer intent correlation." The exploit code itself bore the hallmarks of AI generation. It included extensive docstrings, a hallucinated CVSS score, and clean formatting. The flaw was responsibly disclosed before it could be exploited at scale.

PROMPTSPY: Gemini as a Weapon

The report details PROMPTSPY, an Android backdoor that represents a major architectural shift in malware design. First identified by ESET researchers, PROMPTSPY does not just use AI as a development aid. It relies on a frontier LLM, Gemini 2.5-flash-lite, as the autonomous reasoning engine behind its command and control. [7]

Its core module serializes the victim device's UI state using the Android Accessibility API. This data is sent via HTTP POST to Google's production Gemini API endpoint. The model receives the UI tree, a benign persona prompt, and a high-level goal from the attacker. It returns a JSON command, like a click or swipe with coordinates, which the backdoor executes. This makes PROMPTSPY capable of arbitrary device interactions without new code deployment.

This design gives the malware four powerful properties. First, its C2 traffic to generativelanguage.googleapis.com is indistinguishable from legitimate app activity. Second, it captures PINs and lock patterns by replaying Accessibility captures during authentication. Third, it can rotate its API key and C2 infrastructure at runtime via Firebase Cloud Messaging. Fourth, it prevents removal by placing an invisible overlay over the Uninstall button.

Google stated that operator assets linked to PROMPTSPY were disabled. The malware was not found on the Google Play store. Google Play Protect offers default coverage for known variants of the threat. [4]

AI-Enabled Obfuscation at Runtime

The report also identifies new malware families that use AI for evasion. Some of these tools make live LLM API calls to generate obfuscation content at runtime, while others apply LLMs at build time. The runtime cases mark a shift from static, pre-generated evasive code to dynamic, on-demand polymorphism.

PROMPTFLUX, linked to a PRC-nexus actor, integrates the Gemini API for just-in-time code modification, a form of polymorphism (MITRE T1027.014). HONESTCUE uses the same API to request obfuscated VBScript, combining file obfuscation (T1027) with AI inference for offensive payload generation (AML.T0040). Russia-nexus tools CANFAIL and LONGSTREAM use LLMs at build time to insert plausible but non-functional junk code to confuse scanners. [2]

OPERATOR ACTION

Monitor for API calls to public LLM endpoints originating from processes that have no legitimate business reason to do so.
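One way to approach this check is sketched below. It is a minimal illustration, assuming connection telemetry is already available as (process, destination host) pairs. The `ConnectionEvent` type, the function name, and the per-environment allowlist are hypothetical. The host list contains the Gemini endpoint named in this article plus two other well-known public LLM API hosts, and would need to be extended for a real deployment.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Public LLM API hosts to watch. The first is named in this article;
# the others are common public endpoints added for illustration.
LLM_API_HOSTS = frozenset({
    "generativelanguage.googleapis.com",
    "api.openai.com",
    "api.anthropic.com",
})


@dataclass(frozen=True)
class ConnectionEvent:
    process_name: str  # hypothetical telemetry field
    dest_host: str     # hypothetical telemetry field


def flag_unexpected_llm_calls(
    events: Iterable[ConnectionEvent],
    allowed_processes: Set[str],
) -> List[ConnectionEvent]:
    """Return events where a non-allowlisted process contacted an LLM API host."""
    return [
        e for e in events
        if e.dest_host.lower() in LLM_API_HOSTS
        and e.process_name not in allowed_processes
    ]
```

The allowlist is keyed on the process rather than the destination because the destination is legitimate by design. Blocking the host outright would break sanctioned use, so the signal is which process is making the call.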

Industrial-Scale LLM Infrastructure

Five PRC-nexus threat actors are using LLMs at a scale that indicates deliberate capability building. This is not opportunistic tool use but a strategic investment in AI-driven operations. Their tooling and methods show a high degree of sophistication. [6]

For example, APT27 used the Gemini API to automate the management of a multi-hop proxy network. APT45 ran thousands of prompts to analyze CVEs recursively within custom agentic frameworks. UNC2814 used expert persona prompting and a custom dataset of past vulnerabilities to analyze firmware.

UNC5673 deployed a full suite of account pooling and proxy infrastructure. Tools like Claude-Relay-Service and Roxy Browser aggregate LLM accounts and mask usage patterns from provider safety monitoring. This managed tooling reflects a mature operational security posture for large-scale AI use. [8]

Other actors combined tools like Hexstrike and Graphiti to build a temporal knowledge graph for persistent attack surface modeling. This allows an agent's reconnaissance knowledge to compound over time, showing autonomous tool chaining in action. [9]

Supply Chain Attacks Target AI Gateways

The threat cluster TeamPCP, also known as UNC6780, compromised four software projects central to AI development. The attack used their SANDCLOCK credential stealer, injected via malicious PyPI packages and GitHub pull requests to steal credentials from CI/CD environments.

The targets included the LiteLLM AI gateway, the BerriAI application framework, and the Trivy and Checkmarx security scanners. The compromise of LiteLLM is particularly significant. As a gateway routing requests to multiple LLM providers, a compromised instance exposes every configured API key for services like Claude, GPT-4, and Gemini. [3]

This attack vector represents an Insecure Integrated Component (IIC) risk, where a compromised dependency undermines the whole system. The subsequent credential theft and misuse is a form of Rogue Actions (RA). An attacker with control over the LiteLLM gateway can control the interface between an enterprise and its most powerful AI models. [10]

Helixar Coverage for the Agentic Kill Chain

The threats documented in the GTIG report map directly to Helixar's agentic defense model. Our platform provides independent, layered coverage against the attack classes observed in the wild.

For threats like PROMPTSPY and PROMPTFLUX, Helixar's endpoint behavioral sequence detection is key. The attack sequence is the indicator. An application that reads the entire UI, sends it to an LLM endpoint, receives coordinates, and executes them via the Accessibility API shows a clearly anomalous pattern. Helixar is built to detect these multi-step sequences of individually legitimate actions.
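The idea of treating the sequence as the indicator can be sketched in a few lines. This is a simplified illustration and not Helixar's detection logic. The event kinds, the `Event` type, and the ten-second window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

# Hypothetical event kinds, in the order described in the paragraph above.
SEQUENCE = ("ui_tree_read", "llm_api_post", "accessibility_gesture")


@dataclass(frozen=True)
class Event:
    app: str
    kind: str
    ts: float  # seconds since some epoch


def apps_matching_sequence(events: Iterable[Event], window_s: float = 10.0) -> Set[str]:
    """Flag apps that complete SEQUENCE in order within window_s seconds."""
    # Per app: (index of the next expected step, timestamp of the first step).
    progress: Dict[str, Tuple[int, float]] = {}
    flagged: Set[str] = set()
    for ev in sorted(events, key=lambda e: e.ts):
        step, start = progress.get(ev.app, (0, ev.ts))
        if step > 0 and ev.ts - start > window_s:
            step, start = 0, ev.ts  # window expired, start over
        if ev.kind == SEQUENCE[step]:
            if step == 0:
                start = ev.ts
            step += 1
            if step == len(SEQUENCE):
                flagged.add(ev.app)
                step = 0
        progress[ev.app] = (step, start)
    return flagged
```

No single event in `SEQUENCE` would justify an alert on its own, since accessibility services and LLM API calls both have legitimate uses. Only the ordered combination within a short window is treated as suspicious in this sketch.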

For the credential theft seen in the LiteLLM compromise, Helixar BearTrap provides a crucial backstop. BearTrap detects the misuse of stolen AI API keys. When a key is replayed from an attacker's infrastructure, the mismatch in client fingerprinting triggers an alert. This control works at the API gateway layer, catching abuse even if the initial endpoint theft was missed.
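The fingerprint-mismatch idea can be illustrated with a small sketch. This is not BearTrap's implementation. The request fields, the choice of attributes that make up a fingerprint, and the class and method names are hypothetical. The only library call used is `hashlib.sha256` from the Python standard library.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class ApiRequest:
    api_key_id: str        # identifier for the key, never the secret itself
    client_ip_prefix: str  # hypothetical attribute, e.g. a /24 prefix
    user_agent: str
    tls_fingerprint: str   # hypothetical attribute, e.g. a TLS client hash


def fingerprint(req: ApiRequest) -> str:
    """Derive a stable client fingerprint from request attributes."""
    raw = "|".join((req.client_ip_prefix, req.user_agent, req.tls_fingerprint))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class KeyFingerprintMonitor:
    """Track which client fingerprints are expected for each API key."""

    def __init__(self) -> None:
        self._known: Dict[str, Set[str]] = {}

    def enroll(self, req: ApiRequest) -> None:
        """Record a request from a known-good client for its key."""
        self._known.setdefault(req.api_key_id, set()).add(fingerprint(req))

    def is_suspicious(self, req: ApiRequest) -> bool:
        """True if the key is unknown or used from an unenrolled fingerprint."""
        known = self._known.get(req.api_key_id)
        if known is None:
            return True
        return fingerprint(req) not in known
```

In this sketch an exact-match set is used for simplicity. A production control would likely need to tolerate benign drift, such as client upgrades that change the user agent, which is outside the scope of this example.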

Finally, Helixar's agentic scope enforcement addresses threats like APT27's automated proxy network. A process that begins building multi-hop proxy infrastructure is operating outside any reasonable declared scope for an enterprise agent. Our policy engine detects this scope violation at the network and process level, blocking the unauthorized activity.
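A declared-scope check of this kind can be sketched as follows. This is an illustrative assumption about how such a policy might be expressed, not a description of Helixar's policy engine. The `AgentScope` and `AgentAction` types and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentScope:
    """What an agent has declared it will do (hypothetical policy shape)."""
    allowed_actions: FrozenSet[str]
    allowed_hosts: FrozenSet[str]


@dataclass(frozen=True)
class AgentAction:
    kind: str    # e.g. "http_request" or "open_listener" (illustrative names)
    target: str  # destination host or bind address


def violates_scope(action: AgentAction, scope: AgentScope) -> bool:
    """True if the action kind or its target falls outside the declared scope."""
    if action.kind not in scope.allowed_actions:
        return True
    return action.target not in scope.allowed_hosts
```

Under this model, an agent scoped to HTTP requests against a single internal API would be flagged as soon as it tried to open a listening socket or reach an undeclared host, which is the kind of step that building proxy infrastructure requires.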

A New Era for Agentic Threats

The GTIG report provides a clear answer to a question facing the security community. AI has crossed the threshold from a useful tool for attackers to an active, autonomous participant in the attack chain. The evidence from Mandiant and GTIG shows this is happening across four distinct capability classes at once.

Each of these threat models presents a new challenge for defenders. AI-driven vulnerability discovery means scanners that look for known flaw patterns are no longer sufficient. Autonomous C2 using legitimate API endpoints means network monitoring cannot rely on blocklists. Runtime code generation by LLMs means signature-based detection is becoming obsolete. Industrial-scale infrastructure means attackers are not resource-constrained.

Helixar was founded on the principle that agentic behavior requires a new layer of governance. This layer must understand sequences, context, and scope, not just individual events. The GTIG report is the most comprehensive public validation of this threat model to date. For organizations deploying AI, the question is no longer if this threat is real. The question is whether their governance is ready.

References

[1] cloud.google.com. https://cloud.google.com/blog/topics/threat-intelligence/ai-vulnerability-exploitation-initial-access (accessed 2026-05-13).
[2] atlas.mitre.org. https://atlas.mitre.org/ (accessed 2026-05-13).
[3] owasp.org. https://owasp.org/www-project-top-10-for-large-language-model-applications/ (accessed 2026-05-13).
[4] ai.google.dev. https://ai.google.dev/responsible/docs/safeguards (accessed 2026-05-13).
[5] googleprojectzero.blogspot.com. https://googleprojectzero.blogspot.com/2024/11/from-naptime-to-big-sleep-using-large.html (accessed 2026-05-13).
[6] cloud.google.com. https://cloud.google.com/blog/topics/threat-intelligence (accessed 2026-05-13).
[7] welivesecurity.com. https://www.welivesecurity.com/en/eset-research/ (accessed 2026-05-13).
[8] genai.owasp.org. https://genai.owasp.org/initiatives/agentic-security-initiative/ (accessed 2026-05-13).
[9] blog.virustotal.com. https://blog.virustotal.com/ (accessed 2026-05-13).
[10] saif.google. https://saif.google/ (accessed 2026-05-13).

Editorial

Published by the Helixar Research Team. The team drafts each article with the Helixar Research pipeline, an automated threat-intelligence drafting system, and verifies every reference against the cited primary source prior to publication. For corrections, contact [email protected]. Our methodology, source allowlist, and editorial standards are published at helixar.ai/about/research.

About Helixar Research Labs

Helixar is an AI-native software R&D lab focused on agentic governance, compliance, and security for enterprises and enterprise agents.

Helixar Research Labs publishes briefings on the agentic and AI threat surface, including autonomous agents, LLM tooling, MCP servers, model supply chains, and prompt injection. The goal is to surface the gap between traditional defenses and agentic attacks before it shows up in your incidents.

If you run agents in production, this is for you. Learn more at helixar.ai.
