Large-scale Vietnamese point-of-interest classification using weak labeling

Van Trung Tran; Quang Dao Le; Bao Son Pham; Viet Hung Luu; Quang Hung Bui

doi:10.3389/frai.2022.1020532

Frontiers in Artificial Intelligence (Dec 2022)

Affiliations

Van Trung Tran: Center of Multidisciplinary Integrated Technologies for Field Monitoring, Vietnam National University of Engineering and Technology, Hanoi, Vietnam
Van Trung Tran: NTT Hi-Tech Institute, Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam
Quang Dao Le: Center of Multidisciplinary Integrated Technologies for Field Monitoring, Vietnam National University of Engineering and Technology, Hanoi, Vietnam
Quang Dao Le: NTT Hi-Tech Institute, Nguyen Tat Thanh University, Ho Chi Minh City, Vietnam
Bao Son Pham: Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam
Viet Hung Luu: Center of Multidisciplinary Integrated Technologies for Field Monitoring, Vietnam National University of Engineering and Technology, Hanoi, Vietnam
Viet Hung Luu: FIMO, Hanoi, Vietnam
Quang Hung Bui: Center of Multidisciplinary Integrated Technologies for Field Monitoring, Vietnam National University of Engineering and Technology, Hanoi, Vietnam

Journal volume & issue: Vol. 5

Abstract

Points of interest (POIs) represent geographic locations grouped into categories (e.g., tourist sites, amenities, or shops) and play a prominent role in many location-based applications. However, most POI category labels are crowd-sourced by the community and are therefore often of low quality. In this paper, we introduce the first annotated dataset for the POI category classification task in Vietnamese. A total of 750,000 POIs were collected from WeMap, a Vietnamese digital map. Because large-scale hand-labeling is inherently time-consuming and labor-intensive, we propose a new approach based on weak labeling. The resulting dataset covers 15 categories, with 275,000 weakly labeled POIs for training and 30,000 gold-standard POIs for testing, making it larger than existing Vietnamese POI datasets. We conduct POI category classification experiments on our dataset using a strong baseline (BERT-based fine-tuning) and find that our approach is efficient and applicable at large scale. The proposed baseline achieves an F1 score of 90% on the test set and improves the accuracy of WeMap POI data by 37 percentage points (from 56% to 93%).

Published in Frontiers in Artificial Intelligence

ISSN: 2624-8212 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.frontiersin.org/journals/artificial-intelligence#
