Proceedings of the XXth Conference of Open Innovations Association FRUCT (Nov 2024)
A Comprehensive Study of Bengali Embedding Models: Insights and Evaluations
Abstract
Word embeddings in Natural Language Processing (NLP) represent words as vectors, encapsulating both semantic and syntactic meaning. Prominent models such as Word2Vec, FastText, and GloVe play a crucial role in various NLP tasks. This study evaluates these models, trained on a 240-million-word corpus derived from nearly 900,000 Bengali newspaper articles scraped with Scrapy. Our evaluation involved tuning hyperparameters such as vector dimension, number of epochs, window size, and minimum count. We assessed the Continuous Bag of Words (CBOW) and SkipGram architectures for both Word2Vec and FastText and measured their performance. To benchmark these models, we created, for the first time, 133 unique semantic and 103 syntactic Bengali analogy question sets, and we assessed accuracy, cosine similarity, training time, memory usage, and a combined evaluation metric. We also used confusion matrices for concept categorization. Comparing models trained from scratch with models trained using Gensim, we found that 25 epochs, a minimum count of 35, and 300 dimensions delivered optimal performance. Specifically, Gensim Word2Vec with SkipGram achieved the highest semantic task accuracy, while the FastText model trained from scratch excelled in syntactic tasks. All models performed best on concept categorization in both the semantic and syntactic analogy tasks. Models trained with 100 dimensions consistently produced higher cosine similarity values, indicating greater prediction purity, while models trained with 300 dimensions answered the largest number of questions correctly. For FastText, models trained from scratch outperformed those trained with Gensim, and each model exhibited strengths in different aspects of NLP tasks. Future research will expand on this work by introducing more diverse question sets.
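As an illustrative sketch only, the snippet below shows how the reported optimal configuration (300 dimensions, 25 epochs, minimum count 35, SkipGram) could be reproduced with Gensim's Word2Vec; the placeholder sentences, the window size of 5, and the worker count are assumptions for demonstration and are not values fixed by this abstract.

from gensim.models import Word2Vec

# Placeholder corpus: in the study, "sentences" would be an iterable of
# tokenized Bengali sentences from the scraped newspaper articles.
sentences = [["বাংলা", "ভাষা", "সংবাদ", "পত্রিকা"]] * 100

model = Word2Vec(
    sentences=sentences,
    vector_size=300,   # 300-dimensional vectors (reported optimum)
    epochs=25,         # 25 training epochs (reported optimum)
    min_count=35,      # discard words occurring fewer than 35 times (reported optimum)
    window=5,          # window size shown here is an illustrative assumption
    sg=1,              # sg=1 selects the SkipGram architecture (sg=0 would be CBOW)
    workers=4,
)

# Analogy-style query of the kind used in the semantic/syntactic test sets:
# "A is to B as C is to ?" via vector arithmetic ranked by cosine similarity.
# (These Bengali words are placeholders, not items from the actual question sets.)
print(model.wv.most_similar(positive=["বাংলা", "সংবাদ"], negative=["ভাষা"], topn=1))

# Pairwise cosine similarity, the measure reported alongside accuracy.
print(model.wv.similarity("বাংলা", "ভাষা"))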
Keywords