Advances in Electrical and Computer Engineering (Feb 2019)
Generic Feature Selection Methodology to Named Entity Detection from Indian and European Languages
Abstract
This paper describes the development of a language- and domain-independent Named Entity Recognition (NER) system that can identify named entities in any given dataset, irrespective of language and domain. The main novelty of the present work is a generic feature selection methodology, which has been applied to 7 Indian languages and 5 European languages. The feature selection was carried out in two ways: first, using a frequency-based approach; second, using the k-means++ clustering algorithm to validate the patterns obtained by the frequency-based approach. The datasets used for the experiments belong to different genres. To the best of our knowledge, we are the first to develop a cross-lingual Named Entity (NE) system covering 12 languages that belong to different language families. We performed 10-fold cross-validation, analyzed the system output for all the languages, and discuss the causes of errors in the error analysis section. The performance of our system is also compared with that of existing systems.
Keywords