大数据 (Jan 2019)
Analysis of HIV high-risk population characteristics with Baidu Tieba data
Abstract
The textual content and temporal patterns of online activity among users of the “Fear of HIV Bar” on Baidu Tieba were analyzed. An LDA topic model was used to identify the main differences between the topics discussed by HIV-infected and non-HIV-infected users. A keyword-based machine learning method was used to classify the sexual orientation of users who started discussions in the “Fear of HIV Bar”, and to estimate HIV prevalence among groups with different sexual orientations. The techniques used in this paper can serve as an important supplementary tool for research on high-risk populations. In particular, automatically classifying a user's sexual orientation with machine learning makes it possible to assess the HIV epidemic in populations with different sexual orientations, which is of great significance for public health agencies.