Signal

Issue #4 · 2026-W12


This week in Signal

Visual Inputs Break Moral Safety Filters in Vision-Language Models

Xinyi Yang, Chenheng Xu, Weijun Hong, Ce Mo, Qian Wang, Fang Fang, Yixin Zhu

Adding images to moral dilemmas causes leading AI models to abandon the ethical reasoning their text-based safety training was designed to enforce.

Visual modality distracts moral decision-making

The Problem

As AI systems move from chatbots to robots and autonomous vehicles, they increasingly process visual information alongside text. Safety techniques like RLHF have shown success in establishing moral compliance in textual contexts [§1]. But whether those safeguards hold when a model also sees an image is largely untested.

There's a theoretical reason to worry. In human psychology, dual-process theory distinguishes between fast, intuitive reasoning (System 1) and slow, deliberative reasoning (System 2). Visual stimuli tend to trigger the fast, intuitive mode — the one more prone to bias and less amenable to careful ethical deliberation [§1]. If VLMs exhibit something analogous, images could route around the careful reasoning that text-based safety training instills.

Existing moral evaluation benchmarks couldn't test this properly. They present dilemmas as text-only questionnaires and lack the experimental controls needed to isolate what's actually driving model behavior [§1, §2.2]. You can't tell whether a model's moral failure comes from the visual input itself, from demographic cues in the image, or from the structure of the dilemma.

What They Did

The researchers built the Moral Dilemma Simulation (MDS), a generative benchmark grounded in Moral Foundation Theory — a framework from psychology that organizes moral reasoning around five dimensions: Care, Fairness, Loyalty, Authority, and Purity [§3.1]. Rather than a fixed dataset, MDS is a configurable engine that produces moral dilemmas as both text descriptions and rendered visual scenes in a sandbox game style.

The key design feature is orthogonal control. Think of it like a mixing board with independent sliders: the system can vary one factor — say, whether harm is direct or indirect — while holding everything else constant. It controls for "conceptual variables" (personal force, intention of harm, self-benefit) and "character variables" (species, race, profession, age) [§3.1]. This lets researchers pinpoint exactly which factor causes a model's moral judgment to shift.
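The mixing-board idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual generator: variable names echo the conceptual and character variables from §3.1, but the value grids and the function are made up.

```python
from itertools import product  # product unused here; handy for full factorial sweeps

# Hypothetical variable grids (names follow §3.1; values are illustrative).
conceptual = {
    "personal_force": [True, False],
    "intended_harm": [True, False],
    "self_benefit": [True, False],
}
character = {
    "species": ["human", "animal"],
    "age": ["child", "adult", "elderly"],
}

def dilemma_configs(varied, fixed_overrides):
    """Vary one factor while holding every other slider at a fixed baseline."""
    all_vars = {**conceptual, **character}
    base = {k: v[0] for k, v in all_vars.items()}  # baseline: first value each
    base.update(fixed_overrides)
    for value in all_vars[varied]:
        cfg = dict(base)
        cfg[varied] = value
        yield cfg

# Vary only 'personal_force'; any behavioral difference between the two
# resulting dilemmas is attributable to that single factor.
configs = list(dilemma_configs("personal_force", {}))
```

Because every non-varied slider is pinned to the same baseline, comparing model judgments across `configs` isolates the causal contribution of one variable, which is the point of the orthogonal design.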

Each dilemma is presented in three modes: text-only (just the written scenario), caption (a text description of what the image would show), and image (a rendered visual scene with the dilemma description embedded). Crucially, the text and image always depict the same moral situation, so any behavioral difference between modes can be attributed to visual processing rather than information gaps [§3.1].

The researchers then ran this benchmark across multiple VLMs using what they call a "tri-modal diagnostic protocol" — comparing each model's moral decisions across text, caption, and image inputs [§1].

The Results

The findings reveal three specific failure modes when models process visual inputs compared to text-only scenarios [Figure 1].

First, utilitarian sensitivity collapses. In text mode, models appropriately distinguish between scenarios where different numbers of lives are at stake — they're more willing to act when more lives can be saved. In image mode, models respond nearly indiscriminately regardless of the number of lives involved [Figure 1a, §Abstract]. The numerical stakes stop mattering.

Second, self-interest gets prioritized. When presented with dilemmas pitting loyalty to a friend against personal gain (e.g., "Will you report your friend?"), visual inputs shift models toward self-interested choices that text-based reasoning would reject [Figure 1b, §Abstract].

Third, social value hierarchies flatten. In text mode, models maintain distinctions between demographically different groups in ways consistent with social norms. In image mode, these hierarchical distinctions collapse — the models treat demographically distinct groups as equivalent rather than maintaining the nuanced value weightings that text-based reasoning preserves [Figure 1c, §Abstract].

These effects hold regardless of a model's textual alignment status [§Abstract]. A model that passes text-based safety evaluations with flying colors can still fail when the same dilemma arrives as an image. Their evaluation reveals that "the vision modality activates intuition-like pathways that override the more deliberate and safer reasoning patterns observed in text-only contexts" [§Abstract].

The benchmark currently tests dilemmas rendered in a simplified sandbox game aesthetic [§3.1] — not photorealistic scenes. This means the visual distraction effect emerges even from stylized, low-fidelity images. Whether photorealistic inputs amplify or change these effects is untested. For teams building embodied AI systems that process real camera feeds, the actual vulnerability surface is likely larger than what this benchmark captures, though that remains to be demonstrated.

Additionally, the orthogonal control design, while enabling causal-level analysis, operates on structured moral dilemmas rather than the messy, ambiguous situations embodied agents encounter in practice [§3.1]. Translating these controlled findings into production safety testing for real-world deployment scenarios will require significant additional work — likely on the order of years rather than months.

Why It Matters

This work sits at the lab proof-of-concept stage, but it identifies a vulnerability that matters now. The core finding — that text-based safety alignment does not transfer to visual processing [§Abstract] — has immediate implications for anyone deploying VLMs in contexts where images influence decisions: content moderation, autonomous systems, medical imaging assistants, or any application where a model must make value-laden judgments based on visual input.

The paper demonstrates that current safety evaluation practices, which rely predominantly on text-based benchmarks, are insufficient for certifying VLM safety [§Abstract, §2.2]. If your organization evaluates model safety only through text prompts, you are missing a class of failures that this work shows are systematic, not edge cases. The MDS benchmark and code are publicly available [§1], giving teams a starting point for visual-modality safety testing. The broader message: multimodal alignment — safety training that explicitly accounts for visual inputs — is not optional for embodied AI deployment. It's an unsolved problem.

One Agency Axis Controls Five Traits in a 35B AI Model

Jia Qing Yap

Five supposedly distinct behavioral traits in a 35-billion-parameter AI model turn out to be one underlying axis: the disposition to act independently versus defer to the user.


The Problem

As AI agents gain the ability to write code, search the web, and take multi-step actions, controlling *how* agentically they behave becomes critical. You might want an agent that's eager to use tools but cautious about risk, or one that's persistent but deferential. The implicit assumption is that these traits — autonomy, tool-use eagerness, persistence, risk calibration, and deference — are independently adjustable knobs [§1].

Testing this assumption requires two things: a way to find where these traits live inside a model's internal representations, and a way to intervene on them causally. Sparse autoencoders (SAEs) can decompose model activations into interpretable features, but they've primarily been demonstrated on models up to 7B parameters with standard transformer architectures [§1]. Whether they work on production-scale models with non-standard architectures — mixture-of-experts routing, linear recurrence layers, hybrid attention — was an open question [§1].

What They Did

The researchers trained nine SAEs on the residual stream — the main information highway between layers — of Qwen 3.5-35B-A3B, a 35-billion-parameter model where only about 3 billion parameters are active per token [§3]. The model uses an unusual hybrid architecture: repeating blocks of three GatedDeltaNet layers (a type of linear recurrence) followed by one standard attention layer [§3, Table 1].

For each of five behavioral traits, they constructed contrastive prompt pairs — one version designed to elicit high expression of the trait, one low — totaling 1,520 pairs across four task domains (coding, research, communication, data analysis) [§4.2]. They selected the SAE with the highest discriminative power for each trait, then trained a simple linear probe (ridge regression) on that SAE's activations to predict which version was which [§4.3].

Here's the key technical move: rather than manipulating individual SAE features directly — which would be like trying to adjust a dimmer switch that only has "on" and "off" positions, because the SAE's top-k activation function zeros out all but the k largest features — they projected the probe weights back through the SAE's decoder matrix to get a continuous steering vector in the model's native 2048-dimensional activation space [§4.3]. The formula is simple: multiply the decoder's weight matrix by the probe weights (v = W_dec⊤ · w_probe). This bypasses the discretization and enables smooth, graded intervention [§4.3].
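The probe-then-project pipeline can be sketched with NumPy. This is a toy reconstruction under stated assumptions — dimensions are shrunk from the paper's 2048-d residual stream, the data is synthetic, and the ridge solution uses the standard closed form rather than whatever training setup the authors used:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes; the paper's residual stream is 2048-d with a larger SAE width.
d_model, n_feat, n_pairs = 128, 512, 200

# Synthetic SAE feature activations for high- vs. low-trait prompts.
X = rng.standard_normal((2 * n_pairs, n_feat))
y = np.concatenate([np.ones(n_pairs), -np.ones(n_pairs)])  # +1 high, -1 low

# Ridge-regression probe in SAE feature space (closed form).
lam = 1.0
w_probe = np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

# Project the probe through the SAE decoder into the model's native
# activation space: v = W_dec^T @ w_probe (a continuous steering direction).
W_dec = rng.standard_normal((n_feat, d_model))  # stand-in decoder weights
v = W_dec.T @ w_probe

# At inference, the (scaled) vector is added to a residual-stream activation.
alpha = 2.0
h = rng.standard_normal(d_model)
h_steered = h + alpha * v / np.linalg.norm(v)
```

The projection step is what sidesteps the top-k discretization: `v` lives in the dense residual stream, so `alpha` can be varied smoothly.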

They then tested each steering vector across 50 ReAct-style agent scenarios at multiple intensity multipliers, producing 1,800 total agent rollouts [Abstract].

The Results

Autonomy steering works. At multiplier α = 2, it achieves Cohen's d = 1.01 (p < 0.0001), peaking at d = 1.04 at α = 3 [Abstract]. In practical terms, the model shifts from asking the user for help 78% of the time to proactively executing code and searching the web [Abstract]. That's a large, statistically robust behavioral change from adding a single vector during inference.
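Cohen's d, the effect size quoted above, is just the mean difference between two groups scaled by their pooled standard deviation. A minimal sketch with made-up scores (these numbers are illustrative, not the paper's data):

```python
import statistics

def cohens_d(a, b):
    """Cohen's d: mean difference scaled by the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Illustrative: steered vs. baseline autonomy scores (synthetic numbers).
steered = [0.9, 0.8, 1.0, 0.85, 0.95]
baseline = [0.4, 0.5, 0.45, 0.55, 0.5]
d = cohens_d(steered, baseline)
```

By convention, d around 0.8 or above is considered a large effect, which is why d = 1.01 counts as a substantial behavioral shift.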

But the cross-trait analysis tells a more interesting story. All five steering vectors primarily modulate a single dominant agency axis — the disposition to act independently versus defer to the user — with trait-specific effects appearing only as secondary modulations in which tools get used and how the effect scales with intensity [Abstract].

The sharpest finding is a dissociation between risk calibration and tool-use eagerness. Both probes were trained on the same SAE at the same layer, with nearly identical predictive accuracy (R² = 0.795 vs. 0.792), and their probe vectors are nearly orthogonal (cosine similarity = −0.017) [Abstract]. Yet the tool-use vector steers behavior (d = 0.39) while the risk-calibration vector produces only suppression [Abstract]. This is a concrete demonstration that probe accuracy measures correlation, not causal efficacy — a direction that *predicts* a trait in latent space need not be *causally upstream* of that trait in the model's computation [§1].

A separate experiment tested whether steering matters during the "prefill" phase (when the model processes the full prompt) versus the "decode" phase (when it generates tokens one by one). Steering only during decoding had zero effect (p > 0.35) [Abstract]. This provides causal evidence that in GatedDeltaNet architectures, behavioral commitments are computed during prefill and then propagated through the recurrent state [Abstract].

Important caveats: the evaluation uses proxy metrics extracted from raw agent trajectories rather than human judgments, and the 50 scenarios, while spanning four domains, don't capture the full diversity of production agent deployments [§4.4]. The paper also acknowledges that no head-to-head comparison with simpler contrastive activation addition (mean-diff CAA) is reported [§2]. Without that baseline, it's unclear how much of the steering efficacy comes from the SAE-based method versus the contrastive data itself — a gap that limits claims about the method's relative advantage.

Why It Matters

This is a lab proof-of-concept, not a deployable safety tool. But two findings have practical implications. First, for teams designing agent guardrails that assume independent control over traits like "autonomy" and "persistence," the single-axis result is a warning: these may not be separable knobs in current architectures [Abstract]. Turning down autonomy likely turns down persistence and tool eagerness too.

Second, the correlation-causation dissociation matters for anyone using probes to understand model internals. A probe that accurately predicts a behavior does not guarantee you've found a lever to change it [§1]. The risk-calibration case — high R², zero steering effect — is a clean example that should inform how interpretability results get translated into interventions. Code, trained SAEs, and steering vectors are publicly available [Abstract].

Private Training Text Recoverable from Shared Gradients at Scale

Yibo Li, Qiongxiu Li

Private text used to train large language models can now be reconstructed from shared gradients even when 128 training samples are aggregated together.


The Problem

Federated learning is supposed to protect privacy: instead of sharing raw training data, participants share only aggregated gradient updates. A common assumption is that combining many training samples into a single gradient update effectively obscures any individual sample [§1]. This matters in sensitive domains — healthcare, law, finance — where institutions want to collaboratively train LLMs without exposing private text [§1].

But gradient inversion attacks have shown this assumption is fragile. An attacker who sees the aggregated gradient can work backward to reconstruct the original training text [§1]. Prior methods like DAGER can exactly reconstruct entire text batches from transformer gradients in small-batch settings [§1]. The problem is scaling: as batch sizes grow and sequences get longer, the mixed gradient signal becomes harder to untangle. DAGER's exhaustive layer-wise search becomes prohibitively expensive, and hybrid optimization methods like GRAB suffer from cross-sample interference [§1]. Prior attacks break down precisely in the regimes most relevant to real federated deployments — long sequences and large batches.

What They Did

SOMP (Subspace-Guided Orthogonal Matching Pursuit) reframes the text recovery problem as sparse signal recovery in gradient space [§4]. The core insight: aggregated transformer gradients aren't random noise. They retain geometric structure from multi-head attention — each attention head contributes a distinct "slice" of the gradient that can be analyzed separately — along with token-level sparsity patterns that help identify which tokens were present in the original training data [§3.2].

The framework works in three stages [§4].

**Stage I: Token Pooling.** Rather than searching the entire vocabulary, SOMP decomposes the first-layer gradient into per-head slices and scores each candidate token against these slices [§4.1]. Think of each attention head as a different camera angle on the same scene — a token that genuinely appeared in the training data should look plausible from multiple angles simultaneously. Tokens are scored on three criteria: geometric alignment to head-wise subspaces, consistency across heads, and sparsity patterns in feed-forward network gradients [§4.1]. This produces a compact pool of plausible tokens.
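One way the multi-angle scoring could look, as a loose illustrative sketch: score each candidate token by how much of its vector lies in the top subspace of each head's gradient slice, then keep tokens that several heads agree on. The subspace construction, threshold, and pool size here are assumptions for illustration, not the paper's exact criteria:

```python
import numpy as np

rng = np.random.default_rng(1)
d_head, n_heads, vocab = 16, 4, 200   # toy sizes; real models are far larger

# Stand-ins for per-head slices of a first-layer gradient.
grads = [rng.standard_normal((d_head, d_head)) for _ in range(n_heads)]
emb = rng.standard_normal((vocab, d_head))  # illustrative token embeddings

def head_score(g, e, rank=4):
    """Fraction of token vector e lying in the top-rank subspace of slice g."""
    U, _, _ = np.linalg.svd(g)
    top = U[:, :rank]
    return np.linalg.norm(top @ (top.T @ e)) / np.linalg.norm(e)

# A token enters the pool only if it looks plausible from several "camera
# angles" (heads) at once.
scores = np.array([[head_score(g, emb[t]) for g in grads] for t in range(vocab)])
votes = (scores > 0.5).sum(axis=1)    # cross-head consistency vote
pool = np.argsort(-votes)[:50]        # compact candidate pool
```

The point of the sketch is the structure — per-head subspace alignment plus a cross-head consistency requirement — which is what shrinks the search space before sentence decoding.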

**Stage II: Sentence Decoding.** From this token pool, SOMP generates candidate sentences using beam search guided by geometric alignment to second-layer gradient subspaces, combined with a language model prior for fluency [§4.2]. Diversity penalties discourage repetitive outputs [§4.2].

**Stage III: Sparse Reconstruction.** This is where the sparse recovery framing pays off. Each candidate sentence produces a predicted gradient. SOMP selects the smallest subset of candidates whose combined gradients best reconstruct the observed aggregated gradient [§4.3]. This is analogous to compressed sensing: given a mixed signal, find the sparsest combination of known "atoms" that explains it. The method uses Orthogonal Matching Pursuit — a greedy algorithm that iteratively picks the candidate whose gradient most reduces the unexplained residual [§4.3].
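The greedy selection loop can be written directly. This toy version uses synthetic vectors as stand-ins for flattened candidate-sentence gradients (the dimensions and the "true batch" are made up) to show why OMP recovers a sparse combination:

```python
import numpy as np

def omp(atoms, target, k):
    """Greedy Orthogonal Matching Pursuit: pick k atoms (rows of `atoms`)
    whose combination best explains the observed target signal."""
    residual = target.copy()
    chosen = []
    for _ in range(k):
        # Pick the atom most correlated with what's still unexplained.
        scores = np.abs(atoms @ residual)
        scores[chosen] = -np.inf          # never re-pick an atom
        chosen.append(int(np.argmax(scores)))
        # Re-fit coefficients over all chosen atoms, update the residual.
        A = atoms[chosen].T
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coef
    return chosen

rng = np.random.default_rng(0)
atoms = rng.standard_normal((40, 300))   # 40 candidate "gradients"
true = [3, 17, 29]                       # candidates actually in the batch
target = atoms[true].sum(axis=0)         # observed aggregated gradient
recovered = sorted(omp(atoms, target, k=3))
```

With random high-dimensional atoms, the three true candidates correlate far more strongly with the aggregate than any spurious one, so the greedy loop recovers exactly the sentences whose gradients sum to the observation — the compressed-sensing intuition behind Stage III.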

This progressive narrowing — from vocabulary to token pool to candidate sentences to final selection — avoids the exhaustive search that makes prior methods expensive [§4].

The Results

SOMP was tested across multiple LLM families (GPT-2, LLaMA, Phi, OPT) at various scales, and across five languages [Abstract].

At batch size B=16 with long sequences, SOMP achieves substantially higher reconstruction fidelity than DAGER and GRAB [Abstract]. The paper reports results using ROUGE metrics across different configurations. On GPT-2 at batch size 4 with sequence length 32, SOMP achieves a ROUGE-1 F1 of 97.0 compared to DAGER's 93.0 and GRAB's 55.0 [Table 1]. At batch size 8 with sequence length 64, SOMP reaches ROUGE-1 of 72.0 versus DAGER's 56.0 [Table 1].
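ROUGE-1 F1, the metric behind those numbers, measures unigram overlap between the original and reconstructed text. A minimal reference implementation (ignoring stemming and the other options full ROUGE implementations support):

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap ROUGE-1 F1 between a reference and a candidate text."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Illustrative example (made-up sentences, not data from the paper).
score = rouge1_f1("the patient was admitted on monday",
                  "the patient was admitted monday")
```

A score of 97.0 on this scale (reported as a percentage) means the reconstruction reproduces nearly every word of the private text.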

The most striking finding is at extreme aggregation levels. Even at batch size 128, SOMP still recovers meaningful text [Abstract] — a regime where prior attacks become much less effective. This directly challenges the assumption that large-batch aggregation provides sufficient privacy.

SOMP also remains computationally competitive with prior methods despite handling harder settings [Abstract].

There are important limitations. The experiments use a standard honest-but-curious server setting where the attacker observes the full aggregated gradient [§3.1] — real federated deployments may add noise (differential privacy), compression, or secure aggregation that would further degrade attack performance. The paper evaluates on clean FedSGD updates [§3.1]; production federated systems often use FedAvg or other protocols with multiple local steps, which changes the gradient structure. The method's reliance on first- and second-layer gradient geometry [§4.1, §4.2] means it is architecture-dependent — extending to non-standard transformer variants or mixture-of-experts models would require additional validation.

Why It Matters

This work sits at the lab proof-of-concept stage, but it shifts the boundary of what gradient inversion attacks can do. The practical implication is specific: if your organization uses federated learning for LLM training and relies on batch aggregation as a privacy mechanism, that mechanism is weaker than previously demonstrated [Abstract]. SOMP shows that the geometric structure of transformer gradients provides exploitable signal even under heavy aggregation [§3.2].

This doesn't mean federated learning is broken — differential privacy and secure aggregation remain viable defenses. But it does mean that gradient aggregation alone, even at batch size 128, should not be treated as a sufficient privacy guarantee for sensitive text data [Abstract]. For teams designing federated LLM training pipelines, the takeaway is to layer defenses rather than relying on batch size as implicit protection.

Quick Takes

Subscribe — free

AI research, translated. Every week.