Journal of Intelligent Systems (Apr 2019)

Analysis of the Use of Background Distribution for Naive Bayes Classifiers

  • Daniel Andrade
  • Akihiro Tamura
  • Masaaki Tsuchida

DOI
https://doi.org/10.1515/jisys-2017-0016
Journal volume & issue
Vol. 28, no. 2
pp. 259 – 273

Abstract


The naive Bayes classifier is a popular classifier, as it is easy to train, requires no cross-validation for parameter tuning, and can be easily extended due to its generative model. Moreover, it was recently shown that word probabilities (a background distribution) estimated from large unlabeled corpora can be used to improve the parameter estimation of naive Bayes. However, previous methods do not explicitly allow controlling how much the background distribution influences the estimation of the naive Bayes parameters. In contrast, we investigate an extension of the graphical model of naive Bayes in which a word is generated either from a background distribution or from a class-specific word distribution. We analyze this model theoretically and show its connection to Jelinek-Mercer smoothing. Experiments on four standard text classification data sets show that the proposed method statistically significantly outperforms previous methods that use the same background distribution.
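To make the Jelinek-Mercer connection mentioned in the abstract concrete, the sketch below shows a multinomial naive Bayes classifier whose class-specific word probabilities are linearly interpolated with a background distribution estimated from a large unlabeled corpus. This is only an illustration of that smoothing baseline, not the paper's graphical model (which treats the choice between background and class-specific distribution as part of the generative process); the function names and the fixed interpolation weight `lam` are assumptions made for the example.

```python
import numpy as np

def train_nb_with_background(X, y, background, lam=0.5):
    """Estimate class priors and Jelinek-Mercer-smoothed word probabilities.

    X          : (n_docs, n_words) array of word counts
    y          : (n_docs,) array of class labels in {0, ..., K-1}
    background : (n_words,) word distribution from a large unlabeled corpus
    lam        : weight given to the background distribution (assumed fixed here)
    """
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    word_probs = []
    for c in classes:
        counts = X[y == c].sum(axis=0)
        mle = counts / counts.sum()                  # class-specific MLE estimate
        # Jelinek-Mercer smoothing: interpolate with the background distribution
        word_probs.append((1.0 - lam) * mle + lam * background)
    return priors, np.vstack(word_probs)

def predict(doc_counts, priors, word_probs):
    """Return the most probable class for a bag-of-words count vector."""
    log_post = np.log(priors) + doc_counts @ np.log(word_probs).T
    return int(np.argmax(log_post))
```

In this baseline, `lam` directly controls how much the background distribution influences the word probabilities; the paper's contribution is to build this trade-off into the generative model itself rather than fixing it by hand.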

Keywords