Alexandria Engineering Journal (Jul 2024)
Vision transformer-based visual language understanding of the construction process
Abstract
The widespread implementation of surveillance systems on construction sites has led to the accumulation of vast amounts of visual data, highlighting the need for an effective semantic analysis methodology. Natural language, as the most intuitive mode of expression, can significantly enhance the interpretability of such data. The adoption of multi-modality models promotes interaction between surveillance video and textual data, enabling managers to swiftly comprehend on-site dynamics. This study introduces a Visual Question Answering (VQA) approach for the construction industry and presents a specialized dataset that addresses the unique requirements of on-site management. Built on a Vision Transformer (ViT) architecture, the proposed model performs feature extraction together with the fusion and interaction of visual and textual features. An additional projection layer establishes a transfer learning strategy optimized for construction-site data; this design enables rapid alignment of visual and language features in the model and is validated through ablation studies. The proposed approach achieves a testing accuracy of 83.8%, effectively converting image data from construction sites into natural language descriptions that enhance the analysis of construction processes. Unlike existing methods, this approach does not rely on object detection and extracts deep-level semantic information directly from on-site images. The study further discusses the feasibility of applying VQA within the architecture, engineering and construction (AEC) industry, examines its limitations, and offers suggestions for viable directions of future development.
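The abstract gives no implementation details, but the transfer-learning idea it describes, a pretrained ViT whose visual features are mapped through an added projection layer into the text-feature space before fusion, can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' code: the torchvision ViT-B/16 backbone, the 512-dimensional text features, the concatenation-based fusion, and all module and parameter names are assumptions.

```python
# Minimal sketch (illustrative assumptions, not the paper's released code):
# a frozen pretrained ViT backbone plus a small trainable projection layer
# that aligns visual features with text features for answer classification.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class ConstructionVQASketch(nn.Module):
    def __init__(self, text_dim=512, num_answers=100):
        super().__init__()
        # Pretrained ViT used as a frozen visual feature extractor.
        self.backbone = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.backbone.heads = nn.Identity()  # drop the ImageNet classifier
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Added projection layer: maps the 768-d ViT features into the
        # text-feature space; this is the trainable alignment step the
        # abstract attributes to the extra projection layer.
        self.proj = nn.Linear(768, text_dim)
        # Simple fusion (concatenation) followed by an answer classifier.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim * 2, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, num_answers),
        )

    def forward(self, images, text_features):
        visual = self.proj(self.backbone(images))        # (B, text_dim)
        fused = torch.cat([visual, text_features], dim=-1)
        return self.classifier(fused)                    # answer logits

model = ConstructionVQASketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 100])
```

Because only the projection layer and classifier receive gradients, fine-tuning on a modest construction-site dataset stays cheap while the pretrained visual representation is preserved, which is consistent with the rapid feature alignment the abstract reports.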