Visual Inputs Break Moral Safety Filters in Vision-Language Models
Adding images to moral dilemmas causes leading AI models to abandon the ethical reasoning their text-based safety training was designed to enforce.
- What they did — Built a generative benchmark called Moral Dilemma Simulation (MDS) that presents identical moral dilemmas in text-only, caption, and image modes, with orthogonal control over situational and demographic variables, then tested ten VLMs across all conditions.
- Key result — Visual inputs caused models to respond nearly indiscriminately to utilitarian trade-offs (e.g., losing sensitivity to whether 1 or 5 lives were at stake), increase self-interest prioritization over loyalty, and collapse hierarchical social value distinctions — effects that held regardless of a model's text-based alignment status.
- Why it matters — Safety alignment tuned on text does not transfer to visual processing, meaning any VLM deployed in image-rich or embodied contexts has an unaddressed moral reasoning vulnerability.

Adding images to moral dilemmas causes leading AI models to abandon the ethical reasoning their text-based safety training was designed to enforce.
The Problem
As AI systems move from chatbots to robots and autonomous vehicles, they increasingly process visual information alongside text. Safety techniques like RLHF have shown success in establishing moral compliance in textual contexts [§1]. But whether those safeguards hold when a model also sees an image is largely untested.
There's a theoretical reason to worry. In human psychology, dual-process theory distinguishes between fast, intuitive reasoning (System 1) and slow, deliberative reasoning (System 2). Visual stimuli tend to trigger the fast, intuitive mode — the one more prone to bias and less amenable to careful ethical deliberation [§1]. If VLMs exhibit something analogous, images could route around the careful reasoning that text-based safety training instills.
Existing moral evaluation benchmarks couldn't test this properly. They present dilemmas as text-only questionnaires and lack the experimental controls needed to isolate what's actually driving model behavior [§1, §2.2]. You can't tell whether a model's moral failure comes from the visual input itself, from demographic cues in the image, or from the structure of the dilemma.
What They Did
The researchers built the Moral Dilemma Simulation (MDS), a generative benchmark grounded in Moral Foundation Theory — a framework from psychology that organizes moral reasoning around five dimensions: Care, Fairness, Loyalty, Authority, and Purity [§3.1]. Rather than a fixed dataset, MDS is a configurable engine that produces moral dilemmas as both text descriptions and rendered visual scenes in a sandbox game style.
The key design feature is orthogonal control. Think of it like a mixing board with independent sliders: the system can vary one factor — say, whether harm is direct or indirect — while holding everything else constant. It controls for "conceptual variables" (personal force, intention of harm, self-benefit) and "character variables" (species, race, profession, age) [§3.1]. This lets researchers pinpoint exactly which factor causes a model's moral judgment to shift.
Each dilemma is presented in three modes: text-only (just the written scenario), caption (a text description of what the image would show), and image (a rendered visual scene with the dilemma description embedded). Crucially, the text and image always depict the same moral situation, so any behavioral difference between modes can be attributed to visual processing rather than information gaps [§3.1].
The researchers then ran this benchmark across multiple VLMs using what they call a "tri-modal diagnostic protocol" — comparing each model's moral decisions across text, caption, and image inputs [§1].
The Results
The findings reveal three specific failure modes when models process visual inputs compared to text-only scenarios [Figure 1].
First, utilitarian sensitivity collapses. In text mode, models appropriately distinguish between scenarios where different numbers of lives are at stake — they're more willing to act when more lives can be saved. In image mode, models respond nearly indiscriminately regardless of the number of lives involved [Figure 1a, §Abstract]. The numerical stakes stop mattering.
Second, self-interest gets prioritized. When presented with dilemmas pitting loyalty to a friend against personal gain (e.g., "Will you report your friend?"), visual inputs shift models toward self-interested choices that text-based reasoning would reject [Figure 1b, §Abstract].
Third, social value hierarchies flatten. In text mode, models maintain distinctions between demographically different groups in ways consistent with social norms. In image mode, these hierarchical distinctions collapse — the models treat demographically distinct groups as equivalent rather than maintaining the nuanced value weightings that text-based reasoning preserves [Figure 1c, §Abstract].
These effects hold regardless of a model's textual alignment status [§Abstract]. A model that passes text-based safety evaluations with flying colors can still fail when the same dilemma arrives as an image. Their evaluation reveals that "the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts" [§Abstract].
The benchmark currently tests dilemmas rendered in a simplified sandbox game aesthetic [§3.1] — not photorealistic scenes. This means the visual distraction effect emerges even from stylized, low-fidelity images. Whether photorealistic inputs amplify or change these effects is untested. For teams building embodied AI systems that process real camera feeds, the actual vulnerability surface is likely larger than what this benchmark captures, though that remains to be demonstrated.
Additionally, the orthogonal control design, while enabling causal-level analysis, operates on structured moral dilemmas rather than the messy, ambiguous situations embodied agents encounter in practice [§3.1]. Translating these controlled findings into production safety testing for real-world deployment scenarios will require significant additional work — likely on the order of years rather than months.
Why It Matters
This work sits at the lab proof-of-concept stage, but it identifies a vulnerability that matters now. The core finding — that text-based safety alignment does not transfer to visual processing [§Abstract] — has immediate implications for anyone deploying VLMs in contexts where images influence decisions: content moderation, autonomous systems, medical imaging assistants, or any application where a model must make value-laden judgments based on visual input.
The paper demonstrates that current safety evaluation practices, which rely predominantly on text-based benchmarks, are insufficient for certifying VLM safety [§Abstract, §2.2]. If your organization evaluates model safety only through text prompts, you are missing a class of failures that this work shows are systematic, not edge cases. The MDS benchmark and code are publicly available [§1], giving teams a starting point for visual-modality safety testing. The broader message: multimodal alignment — safety training that explicitly accounts for visual inputs — is not optional for embodied AI deployment. It's an unsolved problem.