Heliyon (Oct 2024)
Understanding privacy concerns in ChatGPT: A data-driven approach with LDA topic modeling
Abstract
This study investigates privacy concerns associated with ChatGPT, a prominent generative AI model, through a data-driven approach combining Twitter data analysis and a user survey. Leveraging Latent Dirichlet Allocation (LDA) topic modeling and data categorization techniques, the research identifies three key areas of concern: 1) Privacy Leakage Due to Public Data Exploitation, 2) Privacy Leakage Due to Personal Input Exploitation, and 3) Privacy Leakage Due to Unauthorized Access. Analysis of over 500k tweets, supplemented by a survey of 67 ChatGPT users, reveals nuanced user perceptions and experiences regarding privacy risks. During the data preparation stage, a Python program was used to refine the dataset of 500k tweets referencing “ChatGPT”. Preprocessing steps included converting text to lowercase, removing mentions and hyperlinks, tokenizing, removing stopwords, and keyword matching to extract tweets about ChatGPT's privacy aspects, producing a refined set of 11k tweets. Results highlight significant apprehensions, particularly regarding unauthorized access, underscoring the importance of robust privacy measures in AI systems. The study contributes to understanding user concerns, informing policy decisions, and guiding future research on privacy in generative AI. These findings can help improve the security and privacy of ChatGPT and other AI systems, and offer useful information to the public, corporations, researchers, lawmakers, and AI developers in better understanding and managing privacy threats.
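The preprocessing and topic-modeling pipeline the abstract describes (lowercasing, removing mentions and hyperlinks, tokenizing, removing stopwords, keyword filtering, then LDA) can be sketched as below. This is a minimal illustration assuming the NLTK and gensim libraries; the privacy keyword list, the sample tweets, and the LDA parameters (beyond num_topics=3, which mirrors the three reported concern areas) are hypothetical and not the paper's actual configuration.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Hypothetical keyword list for the privacy filter; the abstract does
# not disclose the actual matching terms used in the study.
PRIVACY_KEYWORDS = {"privacy", "leak", "data", "personal", "breach", "security"}
STOPWORDS = set(stopwords.words("english"))


def preprocess(tweet: str) -> list[str]:
    """Lowercase, strip mentions/URLs, tokenize, and drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"@\w+|https?://\S+", " ", text)  # remove mentions and hyperlinks
    tokens = word_tokenize(text)
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]


def is_privacy_related(tokens: list[str]) -> bool:
    """Keyword matching to keep only privacy-related tweets."""
    return any(t in PRIVACY_KEYWORDS for t in tokens)


# Placeholder tweets; in the study this would be the ~500k scraped
# tweets referencing "ChatGPT".
raw_tweets = [
    "ChatGPT might leak my personal data, worried about privacy!",
    "Is my ChatGPT input used for training? Feels like a data breach risk.",
    "Just used @OpenAI ChatGPT to write a poem https://example.com",
]
docs = [toks for toks in map(preprocess, raw_tweets) if is_privacy_related(toks)]

# Fit LDA on the refined corpus; num_topics=3 matches the three concern
# categories reported, while passes and random_state are illustrative.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3, passes=10, random_state=42)
for topic_id, terms in lda.print_topics(num_words=5):
    print(topic_id, terms)
```

On the real corpus, the same filter would reduce the 500k raw tweets to the 11k privacy-related ones before the topic model is fit, and the top terms per topic would then be interpreted into the three named concern categories.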