Труды Института системного программирования РАН (Oct 2018)
Applying Time Series to The Task of Background User Identification Based on Their Text Data Analysis
Abstract
The paper presents the novel approach of user identification based on behavior analytics of user operations with a text information. It is offered to describe user behavior by content of his text documents. The structured representation of the considered behavioral information is carried out based on representation of documents text content in the user topic space, which is created by non-negative matrix factorization. The topic weights in the document characterize the user’s topic trend during an operating time with this document. The time variation of the topic weight values creates multidimensional time series that describe the history of user behavior when working with text data. Forecasting of such time series will allow for user identification based on estimated deviation of observed topic trend from the predicted topic weight values. This paper also presents the new time series forecasting method based on orthogonal nonnegative matrix factorization (ONMF) which is used within proposed user identification approach. It is worth noting that nonnegative matrix factorization methods were not used before for the time series forecasting task. The proposed user identification approach has been experimentally verified on the example of real corporate email correspondence created from the Enron dataset. In addition, experiments with other today popular forecasting methods have shown the superiority of proposed forecasting method in quality of user’s topic weights classification. Also we investigated two different approaches to estimates of the deviation of a time series point from the predicted value: absolute deviation and p-value estimation. Experiments have shown that both discussed approaches of deviation estimates are applicable in the proposed user identification approach.
Keywords