MirrorGuard: Secure Computer-Use Agents

MirrorGuard Overview — **Figure 1: Overview of MirrorGuard.** We define "Train in the MirrorWorld, Act in the Wild." MirrorGuard learns to intercept and rectify insecure reasoning chains of CUAs before they produce unsafe actions.

Abstract

Large foundation models are integrated into Computer Use Agents (CUAs), enabling autonomous interaction with operating systems. However, this autonomy introduces serious security risks: malicious instructions or visual prompt injections can trigger unsafe reasoning and cause harmful system-level actions.

In this paper, we present MirrorGuard, a plug-and-play defense framework. To reduce the cost of large-scale training in operating systems, we propose a novel neural-symbolic simulation pipeline (MirrorWorld), which generates realistic, high-risk GUI interaction trajectories entirely in a text-based simulated environment. In real-world testing, MirrorGuard significantly mitigates security risks (reducing unsafe rate from 66.5% to 13.0% on UI-TARS) while maintaining a marginal false refusal rate.

Method: MirrorWorld Simulation

Collecting unsafe trajectories (e.g., `rm -rf /`) in real environments is dangerous and costly. MirrorGuard constructs a high-fidelity MirrorWorld: a text-simulated environment with symbolic state tracking.

Figure 3: The MirrorWorld Simulation Pipeline. (1) Hierarchical task generation for high-stealth risks. (2) Neural-Symbolic Simulator that ensures object permanence and causal consistency without real OS execution.

Neural-Symbolic Simulator

Generates massive risk trajectories in pure text, capturing the causal chain of insecure behaviors.

Cross-Modal Transfer

Security logic learned in text simulation effectively transfers to visual GUIs via aligned Vision-Language Models.

Reasoning Correction

Intervenes at the "Thought" level, steering agents to safe alternatives (e.g., "Verify before delete") instead of simply blocking them.

Evaluation Results

We evaluated MirrorGuard on RiOSWorld and OS-Harm benchmarks. It outperforms state-of-the-art defenses like GuardAgent.

Agent Framework	Defense Method	RiOSWorld Unsafe Rate (↓)	OS-Harm Unsafe Rate (↓)	False Refusal Rate (FRR) (↓)
UI-TARS (Native GUI Agent)	GuardAgent	53.9%	16.4%	20.51%
UI-TARS (Native GUI Agent)	MirrorGuard (Ours)	13.0%	1.8%	5.13%

Qwen2.5-VL-72B (ReAct Framework)	GuardAgent	60.8%	35.5%	30.77%
Qwen2.5-VL-72B (ReAct Framework)	MirrorGuard (Ours)	7.7%	2.7%	7.69%

Qualitative Analysis

MirrorGuard doesn't just block actions; it corrects thoughts. Below is a comparison of how a vanilla agent vs. MirrorGuard handles a "harmful content generation" request.

Figure 7: Defense Trajectories. (Left) Vanilla agent blindly follows instructions to generate harmful text. (Right) MirrorGuard detects the semantic harm at the reasoning stage and steers the agent to refuse.

Citation

@inproceedings{zhang2026mirrorguard,
  title={MirrorGuard: Toward Secure Computer-Use Agents via Simulation-to-Real Reasoning Correction},
  author={Zhang, Wenqi and Shen, Yulin and Jiang, Changyue and Dai, Jiarun and Hong, Geng and Pan, Xudong},
  booktitle={Proceedings of the ACM Conference (XXX)},
  year={2026}
}