IEEE Access (Jan 2021)

Taxi Passenger Hot Spot Mining Based on a Refined K-Means++ Algorithm

  • Yuanni Wang,
  • Jiansi Ren

DOI
https://doi.org/10.1109/ACCESS.2021.3075682
Journal volume & issue
Vol. 9
pp. 66587 – 66598

Abstract

Advances in information technology make it possible to explore the spatio-temporal distribution of taxi travel demand from taxi GPS location data, and thereby to understand the actual supply and demand levels at different hot spots in different time periods. Existing research on clustering passenger hot spots has performance problems, such as insufficient clustering accuracy and high algorithmic time complexity. This paper proposes a two-level subdivision concept and an improved K-means++ algorithm for the fine-grained clustering of taxi passenger hot spots. The first-level subdivision establishes a dynamically adjustable region defined by time and geographical range. In the second level, a Gaussian mixture model is used to model the data distribution, and the optimal number of subdivision areas is chosen by minimizing the Akaike information criterion and the Bayesian information criterion. The SSE (sum of squared distance errors) is used to determine the optimal cluster number $k$ for each local area. Finally, the K-means++ algorithm is used to cluster each local area. One week of green taxi data from New York City was used to validate the method and to compare it with the traditional K-means and DBSCAN approaches. The proposed method achieved better accuracy with comparable time consumption. This demonstrates the value of the approach for hot spot data mining, although clustering still has some important advantages. In addition, the hot spots during the morning peak and on weekends are displayed visually, which can help guide urban transportation and planning.
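The last two steps of the abstract (computing the SSE for candidate values of $k$, then clustering one local area with K-means++) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sq_dist`, `kmeanspp_init`, `kmeans`, `sse_curve`), the synthetic `demo_points`, and the choice of plain Lloyd iterations after the K-means++ seeding are all assumptions made for illustration. The two-level subdivision and the Gaussian mixture model selection by AIC/BIC are not shown.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point coincides with a center already
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd iterations from a K-means++ start; returns (centers, SSE)."""
    rng = random.Random(seed)
    centers = kmeanspp_init(points, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse

def sse_curve(points, k_values, seed=0):
    """SSE for each candidate k; the 'elbow' of this curve suggests k."""
    return {k: kmeans(points, k, seed)[1] for k in k_values}

# Hypothetical pick-up points for one local area: three synthetic blobs
# in planar coordinates (not real New York City data).
_rng = random.Random(42)
demo_points = [
    (_rng.gauss(cx, 0.3), _rng.gauss(cy, 0.3))
    for cx, cy in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    for _ in range(50)
]

for k, sse in sse_curve(demo_points, range(1, 7)).items():
    print(k, round(sse, 2))
```

On data like this, the SSE typically drops sharply up to the true number of groups and flattens afterwards; reading off that bend is one common way to pick $k$, and the abstract does not specify exactly how the authors select it from the SSE.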

Keywords