IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

IWR-Bench Team

*Yang Chen, Minghao Liu, *Yufan Shen†, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, *Botian Shi†

🔔News

[2025-10]: IWR-Bench paper is released! Check out our work on arXiv. 🚀

[2025-10]: We've open-sourced the benchmark data and evaluation code. Find them on our GitHub!


Introduction

Current webpage-to-code benchmarks primarily focus on static screenshot-to-code tasks, overlooking the dynamic interactions fundamental to real-world web applications. To address this gap, we introduce IWR-Bench, a novel benchmark for evaluating the ability of Large Vision-Language Models (LVLMs) to perform interactive webpage reconstruction from video.

IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, featuring diverse interaction complexities (e.g., web games), visual styles, and domains. To align with standard web development practices, each task provides the model with not only a user interaction video but also the complete set of crawled static assets. This setup challenges models on two fronts: comprehensive multi-modal reasoning to infer interaction logic, and advanced code generation to translate this logic into functional code.

Our experiments on 28 leading LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, with functional correctness (24.39% IFS) lagging far behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging new frontier for vision-language research.
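The headline numbers can be cross-checked with a little arithmetic. The sketch below assumes the Final Score is a weighted sum of IFS and VFS with a weight of 0.7 on IFS; this page does not state the weighting, so `IFS_WEIGHT` and the `final_score` helper are illustrative assumptions rather than the benchmark's official formula. Under that assumption, the IFS and VFS quoted above reproduce the reported overall score.

```python
# Assumed weighting (not stated on this page): 0.7 on IFS, 0.3 on VFS.
IFS_WEIGHT = 0.7


def final_score(ifs: float, vfs: float, ifs_weight: float = IFS_WEIGHT) -> float:
    """Weighted combination of the two metrics, both given in percent."""
    return ifs_weight * ifs + (1.0 - ifs_weight) * vfs


# IFS and VFS values quoted in the introduction for the best model.
best = final_score(ifs=24.39, vfs=64.25)
print(f"{best:.2f}")  # 36.35
```

Because IFS carries most of the weight under this assumption, a model with strong visual fidelity but weak interaction logic still ends up with a low overall score.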


Figure: Performance of representative models on IWR-Bench.

IWR-Bench

Overview

We are excited to introduce IWR-Bench, a powerful benchmark for a new task we've formalized: Interactive Webpage Reconstruction (IWR). The goal of IWR is to recreate a fully functional webpage from a video of a user interacting with it.

Our benchmark is meticulously designed to cover a wide spectrum of challenges, with tasks categorized by application domain, visual complexity, and interaction logic. To mirror a real-world scenario, each task provides the model with an interaction video and the webpage's original static assets. We then evaluate the reconstructed page's functionality using an automated "agent-as-a-judge" that programmatically executes a sequence of actions. Performance is quantified by two holistic metrics: the Interactive Functionality Score (IFS), assessing operational and logical correctness, and the Visual Fidelity Score (VFS), which measures visual accuracy from pixel details to the overall layout.
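To make the functionality metric concrete, here is a minimal sketch of how an IFS-style score could be derived from a judged action sequence. It assumes the score is simply the percentage of actions that both execute and produce the expected page state; the `ActionResult` type, its field names, and the scoring rule are hypothetical illustrations, not the benchmark's actual evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionResult:
    """Outcome of one judged action (hypothetical structure)."""
    description: str
    executed: bool  # operational: the agent could perform the action at all
    state_ok: bool  # logical: the resulting page state matched expectations


def interactive_functionality_score(results: List[ActionResult]) -> float:
    """Percentage of actions that are both operationally and logically correct."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.executed and r.state_ok)
    return 100.0 * passed / len(results)


# Made-up trajectory, for illustration only.
results = [
    ActionResult("click 'Start'", executed=True, state_ok=True),
    ActionResult("press ArrowLeft", executed=True, state_ok=True),
    ActionResult("press ArrowUp", executed=True, state_ok=False),
    ActionResult("click 'Restart'", executed=False, state_ok=False),
]
print(interactive_functionality_score(results))  # 50.0
```

This sketch scores every action independently. A stricter variant could stop counting after the first failure, since later steps in a stateful workflow depend on earlier ones.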


Figure: Overview of the IWR-Bench task and evaluation. The inputs to the model are (a) a user interaction video and (b) composite images of all static assets sniffed from the webpage. The evaluation employs an agent-as-a-judge framework, in which an automated agent assesses the rendered page's interactivity by executing (c) a ground-truth action sequence and its visual fidelity through screenshot comparison.

Tasks

Figure: An overview of the IWR-Bench taxonomy, which organizes tasks along three orthogonal axes: Domain, Visual Complexity, and Interaction Logic.

Comparisons with Existing Benchmarks

IWR-Bench addresses a critical gap between existing webpage reconstruction and video understanding benchmarks. Previous webpage benchmarks are either limited to static image-to-code tasks (like Pix2Code and WebSight) or model interactions as stateless, single-step events without providing the necessary static assets for a realistic build (like Interaction2Code). Conversely, general video understanding benchmarks (e.g., MVBench) focus on comprehension tasks like Video QA, not code generation.

In contrast, IWR-Bench is the first to use videos of stateful, full-trajectory workflows from live websites, provide all required assets, and employ a robust agent-based protocol to evaluate true interactive correctness.

Table: Comparison of IWR-Bench with existing benchmarks. IWR-Bench is unique in its sourcing from live websites, video-based tasks, comprehensive interactive evaluation, and provision of static assets to create a realistic reconstruction task.

Leaderboard

We evaluated a diverse set of 28 leading Large Vision-Language Models (LVLMs), including both proprietary and open-source models. For each task, models are provided with the user interaction video and a composite image of all static assets. Their goal is to generate a single, self-contained HTML file that replicates the observed webpage's appearance and functionality. Our evaluation is conducted under a zero-shot setting to assess the models' core capabilities without any task-specific fine-tuning.

Last updated: September 24, 2025

Model Name | Date | Low-level Visual | High-level Visual | VFS | IFS | Final Score

Main evaluation results on IWR-Bench. Thinking models are underlined. IFS: Interactive Functionality Score, VFS: Visual Fidelity Score.

BibTeX


@misc{chen2025iwrbenchlvlmsreconstructinteractive,
  title={IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?},
  author={Yang Chen and Minghao Liu and Yufan Shen and Yunwen Li and Tianyuan Huang and Xinyu Fang and Tianyu Zheng and Wenxuan Huang and Cheng Yang and Daocheng Fu and Jianbiao Mei and Rong Wu and Yunfei Zhao and Licheng Wen and Xuemeng Yang and Song Mao and Qunshu Lin and Zhi Yu and Yongliang Shen and Yu Qiao and Botian Shi},
  year={2025},
  eprint={2509.24709},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.24709},
}