IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
[2025-10]: The IWR-Bench paper is released! Check out our work on arXiv. 🚀
[2025-10]: We've open-sourced the benchmark data and evaluation code. Find them on our GitHub!
Current webpage-to-code benchmarks primarily focus on static screenshot-to-code tasks, overlooking the dynamic interactions fundamental to real-world web applications. To address this gap, we introduce IWR-Bench, a novel benchmark for evaluating the ability of Large Vision-Language Models (LVLMs) to perform interactive webpage reconstruction from video.
IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, featuring diverse interaction complexities (e.g., web games), visual styles, and domains. To align with standard web development practices, each task provides the model with not only a user interaction video but also the complete set of crawled static assets. This setup challenges models on two fronts: comprehensive multi-modal reasoning to infer interaction logic, and advanced code generation to translate this logic into functional code.
Our experiments on 28 leading LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%. Functional correctness (IFS of 24.39%) lags far behind visual fidelity (VFS of 64.25%). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging new frontier for vision-language research.
IWR-BENCH
We are excited to introduce IWR-Bench, a powerful benchmark for a new task we've formalized: Interactive Webpage Reconstruction (IWR). The goal of IWR is to recreate a fully functional webpage from a video of a user interacting with it.
Our benchmark is meticulously designed to cover a wide spectrum of challenges, with tasks categorized by application domain, visual complexity, and interaction logic. To mirror a real-world scenario, each task provides the model with an interaction video and the webpage's original static assets. We then evaluate the reconstructed page's functionality using an automated "agent-as-a-judge" that programmatically executes a sequence of actions. Performance is quantified by two holistic metrics: the Interactive Functionality Score (IFS), assessing operational and logical correctness, and the Visual Fidelity Score (VFS), which measures visual accuracy from pixel details to the overall layout.
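To make the evaluation protocol concrete, below is a minimal sketch of what such an agent-driven functional check could look like, using Playwright to replay a hypothetical action trace against a reconstructed page. The action format, selectors, expected states, and scoring here are illustrative assumptions, not the official IWR-Bench implementation.

```python
# Sketch of an "agent-as-a-judge" style functional check.
# Assumes a hypothetical action/assertion trace; the real benchmark derives
# the action sequence from the original interaction video.
# Requires: pip install playwright && playwright install chromium
from pathlib import Path
from playwright.sync_api import sync_playwright

# Hypothetical trace for one task: each step is an interaction followed by
# an expected observable state on the reconstructed page.
ACTIONS = [
    {"do": ("click", "#start-button"), "expect": ("#status", "Game started")},
    {"do": ("fill", "#player-name", "Alice"), "expect": ("#greeting", "Hello, Alice")},
]

def run_functional_check(html_path: str) -> float:
    """Replay the trace and return the fraction of steps whose expected
    state is observed (a stand-in for a per-task functionality score)."""
    passed = 0
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(Path(html_path).resolve().as_uri())
        for step in ACTIONS:
            kind, *args = step["do"]
            if kind == "click":
                page.click(args[0])
            elif kind == "fill":
                page.fill(*args)
            selector, expected = step["expect"]
            if expected in (page.text_content(selector) or ""):
                passed += 1
        browser.close()
    return passed / len(ACTIONS)

if __name__ == "__main__":
    print(run_functional_check("reconstructed/index.html"))
</br>```

In the benchmark itself, the judge agent executes the action sequence programmatically to compute IFS, while VFS is obtained from separate visual comparisons spanning pixel-level details to overall layout.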
IWR-Bench addresses a critical gap between existing webpage reconstruction and video understanding benchmarks. Previous webpage benchmarks either stop at static image-to-code tasks (e.g., Pix2Code and WebSight) or model interactions as stateless, single-step events without providing the static assets needed for a realistic build (e.g., Interaction2Code). Conversely, general video understanding benchmarks (e.g., MVBench) focus on comprehension tasks like Video QA, not code generation.
In contrast, IWR-Bench is the first to use videos of stateful, full-trajectory workflows from live websites, provide all required assets, and employ a robust agent-based protocol to evaluate true interactive correctness.
We evaluated a diverse set of 28 leading Large Vision-Language Models (LVLMs), including both proprietary and open-source models. For each task, models are provided with the user interaction video and a composite image of all static assets. Their goal is to generate a single, self-contained HTML file that replicates the observed webpage's appearance and functionality. Our evaluation is conducted under a zero-shot setting to assess the models' core capabilities without any task-specific fine-tuning.
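As a rough illustration of this setup, the sketch below packages the two inputs into a single multimodal request and writes the returned HTML to disk. The prompt wording and the `call_lvlm` helper are placeholders for whichever model API is under evaluation, not the exact harness used in our experiments.

```python
# Sketch of the single-file generation setup (illustrative only).
import base64
from pathlib import Path

PROMPT = (
    "Watch the user interaction video and inspect the composite image of static "
    "assets. Reproduce the webpage as ONE self-contained HTML file (inline CSS "
    "and JavaScript) that matches both its appearance and interactive behavior."
)

def encode(path: str) -> str:
    """Base64-encode a local file for inclusion in a multimodal request."""
    return base64.b64encode(Path(path).read_bytes()).decode()

def call_lvlm(prompt: str, video_b64: str, image_b64: str) -> str:
    """Placeholder for the model-specific multimodal API call; swap in the
    actual client SDK of the LVLM being evaluated."""
    raise NotImplementedError

def reconstruct(video_path: str, assets_image_path: str, out_path: str) -> None:
    """Generate the reconstructed page and save it as a single HTML file."""
    html = call_lvlm(PROMPT, encode(video_path), encode(assets_image_path))
    Path(out_path).write_text(html, encoding="utf-8")
</br>```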
Last updated: September 24, 2025
| Model Name | Date | Low-level Visual | High-level Visual | VFS | IFS | Final Score |
|---|---|---|---|---|---|---|
Main evaluation results on IWR-Bench. Thinking models are underlined. IFS: Interactive Functionality Score; VFS: Visual Fidelity Score.
@misc{chen2025iwrbenchlvlmsreconstructinteractive,
title={IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?},
author={Yang Chen and Minghao Liu and Yufan Shen and Yunwen Li and Tianyuan Huang and Xinyu Fang and Tianyu Zheng and Wenxuan Huang and Cheng Yang and Daocheng Fu and Jianbiao Mei and Rong Wu and Yunfei Zhao and Licheng Wen and Xuemeng Yang and Song Mao and Qunshu Lin and Zhi Yu and Yongliang Shen and Yu Qiao and Botian Shi},
year={2025},
eprint={2509.24709},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.24709},
}