An MLLM-Assisted Web Crawler Approach for Web Application Fuzzing

Wantong Yang; Enze Wang; Zhiwen Gui; Yuan Zhou; Baosheng Wang; Wei Xie

doi:10.3390/app15020962

Applied Sciences (Jan 2025)

Affiliations

Wantong Yang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Enze Wang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Zhiwen Gui: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Yuan Zhou: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Baosheng Wang: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Wei Xie: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

DOI: https://doi.org/10.3390/app15020962
Journal volume & issue: Vol. 15, no. 2, p. 962

Abstract

Web application fuzzing struggles to achieve comprehensive coverage of the test interface (attack surface), primarily because of complex user interactions and dynamic website architectures. Web crawlers can automatically access a website and extract critical information, including form fields and request parameters, that is essential for generating effective fuzzing test cases. However, current crawler technologies exhibit three primary limitations: (i) insufficient capability to analyze page relationships and determine page states; (ii) no functionality-aware exploration, so generated inputs have poor contextual relevance; and (iii) unstructured operation sequences that fail to execute effectively because they are incompatible with state-based testing logic. To address these challenges, we propose CrawlMLLM, a framework that uses multi-modal large language models (MLLMs) to simulate human web browsing. It comprises three core components: page state mining, functionality analysis, and automatic operation generation. Evaluations show a 163% improvement in code coverage over state-of-the-art (SOTA) work. When integrated with vulnerability audit tools, CrawlMLLM found 44 vulnerabilities in three vulnerable web applications, versus 34 found by the baseline. In six real-world applications, CrawlMLLM detected 20 vulnerabilities, while the next-best method found six.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
