Web Coding Benchmarks Finally Test What Matters: Visuals, Interaction, and Repair
Current benchmarks for AI web coding miss most of what makes a website work — visual fidelity, interactive behavior, and the ability to fix bugs — and a new 1,526-task evaluation exposes how badly models struggle with aesthetics.
- What they did — Built a 1,526-task multimodal benchmark spanning seven web coding task types (generation, editing, repair across text/image/video inputs) with a browser-based agent evaluation protocol that autonomously tests generated websites.
- Key result — Closed-source models (GPT-5.2, Gemini-3-Pro, Claude-4.5-Opus) score 55–87 across task types, while the best open-source model (Qwen3-VL-235B) drops to 23–69, with aesthetics as the most persistent gap.
- Why it matters — Teams selecting models for web development should expect closed-source models to substantially outperform open-source alternatives, particularly on visual quality and editing tasks.

The Problem
Evaluating whether an AI model can build a good website is fundamentally different from evaluating whether it can solve an algorithm problem. Success depends on visual fidelity, interaction behavior, responsiveness, and user experience — none of which are captured by standard metrics like pass@k on HumanEval or unit-test pass rates on SWE-Bench [§1].
Existing web coding benchmarks each cover only a narrow slice of the problem. Some test generation from text prompts, others from screenshots, but none jointly evaluate generation, editing, and repair across text, image, and video inputs [Table 1]. More critically, most rely on static correctness checks rather than actually running the generated website and testing its interactive behavior [§1].
This matters because real web development is an iterative cycle: you generate code, edit it based on feedback, and repair bugs. A benchmark that only tests generation from text descriptions tells you very little about whether a model can handle the full workflow.
What They Did
WebCompass is a 1,526-task benchmark organized into seven task categories, drawn from combinations of three input modalities (text, image, video) and three task types (generation, editing, repair) [§2.1]. The tasks cover 15 generation domains, 16 editing operation types, and 11 repair defect types, and each task is annotated as Easy, Medium, or Hard [Abstract].
The benchmark construction used a multi-stage, human-in-the-loop pipeline [§2.2]. For text-guided generation, the team collected queries from multiple sources, deduplicated them using embedding-based clustering, then had an LLM expand underspecified requests into structured design documents covering page content, interaction behaviors, and visual appearance [§2.2.1]. This addresses a real problem: vague prompts like "make me a dashboard" produce wildly different outputs across models, making automated comparison nearly impossible.
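To make the deduplication step concrete, here is a minimal sketch of greedy similarity-based deduplication. It is an illustration under stated assumptions, not the team's pipeline: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the 0.8 threshold, the function names, and the sample queries are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.
    A real pipeline would call an embedding model here (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(queries, threshold=0.8):
    """Greedy clustering: keep a query only if it is not too similar
    to any query that has already been kept."""
    kept = []
    for query in queries:
        vec = embed(query)
        if all(cosine(vec, embed(other)) < threshold for other in kept):
            kept.append(query)
    return kept


# Hypothetical sample queries; the second is a near-duplicate of the first.
queries = [
    "build a sales dashboard with charts",
    "Build a sales dashboard with charts",
    "create a personal portfolio page",
]
unique = deduplicate(queries)
```

The greedy pass is order-dependent and quadratic in the number of kept queries, which is acceptable for a sketch but would likely be replaced by a proper clustering method at benchmark scale.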
For editing and repair tasks, the team built on existing codebases. Repair tasks use a reverse-engineering approach: they start with working code, inject specific bugs, and provide exact search/replace annotations mapping buggy code to clean targets [Table 1]. This ensures deterministic ground truth — you know exactly what the correct fix should be.
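The bug-injection idea can be sketched in a few lines. This is a simplified illustration, not the benchmark's actual annotation format: the `Patch` class, the helper names, and the HTML snippet are hypothetical, and the only assumption carried over from the description above is that each annotation is an exact search/replace pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """One search/replace annotation (hypothetical structure)."""
    search: str   # exact snippet expected in the source
    replace: str  # snippet that takes its place


def apply_patch(code, patch):
    """Apply a patch, insisting on exactly one match so the edit is unambiguous."""
    if code.count(patch.search) != 1:
        raise ValueError("search snippet must match exactly once")
    return code.replace(patch.search, patch.replace)


def invert(patch):
    """The fix for an injected bug is the same patch run in reverse."""
    return Patch(search=patch.replace, replace=patch.search)


# Start from working code, then inject a defect (a misspelled handler name).
clean = '<button onclick="openMenu()">Menu</button>'
injection = Patch(search='onclick="openMenu()"', replace='onclick="openMen()"')
buggy = apply_patch(clean, injection)

# The ground-truth fix maps the buggy snippet back to the clean target.
fix = invert(injection)
repaired = apply_patch(buggy, fix)
```

Because the fix is derived mechanically from the injected bug, the expected output of a repair task is known exactly, which is what makes the ground truth deterministic.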
The evaluation design splits by task type. For editing and repair, WebCompass uses a checklist-guided LLM-as-a-Judge protocol — essentially giving a language model a rubric and having it score before/after screenshots and code diffs [§1]. For generation, the team built something more ambitious: an Agent-as-a-Judge protocol where an autonomous agent launches the generated website in a real browser, explores it through the Model Context Protocol (MCP), synthesizes targeted test cases, and scores the result based on actual execution [Abstract]. Think of it as an automated QA tester that clicks through your website, checks if buttons work, and verifies that visual elements render correctly.
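One way to picture how checklist verdicts could turn into a score is a weighted pass rate. The sketch below is an assumption for illustration only: the source does not specify WebCompass's aggregation formula, and the `ChecklistItem` class, the weights, the 0 to 100 scale, and the sample items are all hypothetical. In a real pipeline the `passed` flags would come from the judge model or browser agent rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One rubric entry with a judge verdict (hypothetical structure)."""
    description: str
    weight: float
    passed: bool  # verdict supplied by the judge; hard-coded here


def checklist_score(items):
    """Weighted pass rate on a 0-100 scale (assumed scale, not from the paper)."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.passed)
    return 100.0 * earned / total


# Hypothetical verdicts for one generated page.
items = [
    ChecklistItem("Submit button triggers form validation", 2.0, True),
    ChecklistItem("Navigation menu opens on click", 1.0, True),
    ChecklistItem("Layout matches the requested color scheme", 1.0, False),
]
score = checklist_score(items)  # 3.0 of 4.0 weight earned
```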
The Results
The performance gap between closed-source and open-source models is stark. In the radar chart [Figure 1], the top closed-source models score between roughly 55 and 87 across the seven task types. The best open-source model, Qwen3-VL-235B, ranges from approximately 23 to 69, and the other open-source models perform worse still [Figure 1].
The task-type breakdown reveals interesting patterns. Repair tasks that involve restoring interactivity (Repair-RCT) show the highest scores overall — top models reach scores in the high 70s to mid-80s [Figure 1]. But generation tasks are harder: Vision-Guided Generation (Gen-SPI) shows significantly lower performance across all models, with even the best performers dropping into the 50s [Figure 1].
Aesthetics emerges as the most persistent bottleneck, especially for open-source models [Abstract]. Models can often get the structure and functionality roughly right but produce visually unappealing results. Framework choice also matters materially: Vue consistently proves more challenging for models, while React and Vanilla/HTML perform more strongly depending on task type [Abstract].
Editing tasks show a different difficulty profile from repair. Text-Guided Editing (Edit-ITG) generally scores lower than the structurally similar Diagnostic Repair (Repair-RCT) tasks [Figure 1]. Repair outputs preserve interactivity better than edited ones, but producing a fix that executes correctly remains difficult [Abstract].
The benchmark's 1,526 tasks [Abstract] are a significant expansion over existing web coding evaluations, though like any benchmark they sample only part of the full space of web development challenges. The Agent-as-a-Judge evaluation, while more realistic than static checks, depends on the capability of the judge model itself, so scores become a moving target as models improve. For teams building internal evaluation pipelines, this means the framework is useful for comparative model selection today but will need recalibration as both the tested models and the judge models evolve.
Why It Matters
For practitioners choosing models for web development workflows, the findings are concrete and actionable. If you're building with Vue, expect worse model performance than with React or vanilla HTML [Abstract]. If visual polish matters, as it usually does for customer-facing work, closed-source models remain substantially ahead [Abstract, Figure 1].
This is a lab-stage evaluation framework. The Agent-as-a-Judge protocol is a meaningful step toward automated acceptance testing for AI-generated websites, but it approximates rather than replaces human judgment [Abstract]. The benchmark's public availability (data, code, and project page) [Abstract] means teams can run these evaluations on their own model candidates today, which is its most immediate practical value. The larger contribution is establishing that web coding evaluation requires testing across the full lifecycle — generation, editing, and repair — not just one-shot code generation from a text prompt.