Труды Института системного программирования РАН (Oct 2018)

Detecting Content Spam on the Web through Text Diversity Analysis

  • Anton S. Pavlov,
  • Boris V. Dobrov

Journal volume & issue
Vol. 21, no. 0

Abstract

Web spam is considered one of the greatest threats to modern search engines. Spammers use a wide range of content generation techniques, collectively known as content spam, to fill search results with low-quality pages. We argue that content spam must be tackled using a broad set of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributions for terms and topics. We combine them with a variety of other content features to produce a content spam classifier that outperforms existing results.

Keywords