IEEE Access (Jan 2024)
OHiFormer: Object-Wise Hierarchical Dependency-Based Transformer for Screen Summarization
Abstract
Screen summarization aims to generate concise textual descriptions that communicate the crucial contents and functionalities of a mobile user interface (UI) screen. A UI screen consists of tightly interconnected objects arranged in a hierarchical structure, and each object contains multimodal data such as images, texts, and bounding boxes. Considering these characteristics, previous works encoded the absolute position of objects in the view hierarchy to extract the semantic representation of the UI screen; however, they overlooked the importance of the hierarchical dependencies between objects in the UI structure. In this study, we propose an object-wise hierarchical dependency-based Transformer named OHiFormer. OHiFormer treats the objects on a UI screen analogously to tokens in natural language processing and leverages the Transformer to capture the mutual relationships between objects. Moreover, OHiFormer includes a modified self-attention mechanism using structural relative position encoding to represent the hierarchically connected UI. Experimental results demonstrate that OHiFormer outperforms benchmark models on the Screen Summarization dataset, improving BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, and CIDEr by 3.63%, 2.1%, 0.12%, 1.8%, 2.38%, and 17.58%, respectively. Furthermore, our proposed UI structural representation method achieves remarkable performance on complex UIs with numerous objects compared with other structural position encoding methods. Finally, a visualization of the self-attention heatmaps demonstrates how OHiFormer reflects the hierarchical dependencies between objects. By reflecting the hierarchical dependencies hidden in the visual layout of the UI, OHiFormer not only improves the quality of summaries but also offers potential applications in mobile apps and systems containing numerous interactive objects.
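To illustrate the core idea of structural relative position encoding mentioned above, the following is a minimal sketch of self-attention whose scores are biased by the hierarchical (tree) distance between UI objects. This is a generic relative-position scheme written for illustration; the function names, the toy distance matrix, and the bias table are assumptions and do not reproduce OHiFormer's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structural_attention(Q, K, V, dist, bias_table):
    """Self-attention with an additive bias indexed by the hierarchical
    distance between object pairs (hypothetical sketch, not the paper's
    exact mechanism)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # content-based scores
    scores = scores + bias_table[dist]  # structural bias per object pair
    return softmax(scores) @ V

# Toy example: 3 UI objects; dist[i, j] is the tree distance between
# objects i and j in a small view hierarchy (assumed values).
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
dist = np.array([[0, 1, 2],
                 [1, 0, 1],
                 [2, 1, 0]])
# Hypothetical learned bias table, one entry per distance value;
# here, more distant objects are attended to less.
bias_table = np.array([0.0, -1.0, -2.0])
out = structural_attention(Q, K, V, dist, bias_table)
```

In a trained model, `bias_table` would be a learned parameter, so the attention pattern itself adapts to how strongly hierarchical proximity should influence each pair of objects.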
Keywords