IEEE Access (Jan 2024)
ACRF: Aggregated Conditional Random Field for Out of Vocab (OOV) Token Representation for Hindi NER
Abstract
Named entities are open-ended: new (emerging) entities and complex entities appear constantly. Most large language model tokenizers have a fixed vocabulary; hence, they split out-of-vocabulary (OOV) words into multiple sub-words during tokenization. During fine-tuning for a downstream task, these sub-word tokens make named-entity classification more complex, since an extra entity label must be assigned to each sub-word in order to use its embedding. This work attempts to reduce this complexity by aggregating the token embeddings of each word. We apply an Aggregated CRF (ACRF), in which a conditional random field (CRF) is placed on top of the aggregated token embeddings for named-entity prediction. Aggregation is performed over the embeddings of all tokens that the tokenizer generates for a word. Experiments on two Hindi datasets (HiNER and Hindi MultiCoNER2) show that ACRF outperforms a vanilla CRF (in which token embeddings are not aggregated). Our result also surpasses the previous best result on the HiNER data, which was obtained with a cross-entropy classification layer. Further, we analyze the impact of tokenization, both overall and per entity type for each word in the test data; the results show that ACRF performs better than vanilla CRF on words tokenized into more than one sub-word (OOV words). In addition, this work conducts a comparative analysis between two transformer-based models, MuRIL-large and XLM-RoBERTa-large, and investigates how each model benefits from the OOV-based aggregation strategy.
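The aggregation idea in the abstract can be illustrated with a minimal sketch. The snippet below mean-pools sub-word embeddings into one vector per word; the function name, the use of mean pooling as the aggregation operator, and the word-id alignment convention (as in Hugging Face tokenizers' `word_ids()`) are illustrative assumptions, not details taken from the paper.

```python
def aggregate_token_embeddings(token_embeddings, word_ids):
    """Mean-pool sub-word token embeddings so each word gets one vector.

    token_embeddings: one embedding vector (list of floats) per sub-word token.
    word_ids: for each token, the index of the word it came from, or None
              for special tokens (the convention used by word_ids() in
              Hugging Face tokenizers; an assumption here).
    """
    groups = {}
    for emb, wid in zip(token_embeddings, word_ids):
        if wid is None:  # skip special tokens such as [CLS] / [SEP]
            continue
        groups.setdefault(wid, []).append(emb)
    aggregated = []
    for wid in sorted(groups):
        group = groups[wid]
        dim = len(group[0])
        # element-wise mean over all sub-word embeddings of this word
        aggregated.append([sum(v[d] for v in group) / len(group)
                           for d in range(dim)])
    return aggregated

# A word split into two sub-words (word id 0) and a word kept whole (id 1):
embs = [[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]]
print(aggregate_token_embeddings(embs, [0, 0, 1]))  # [[2.0, 4.0], [2.0, 2.0]]
```

In the ACRF setting described above, the CRF layer would then operate on these per-word vectors instead of on the raw sub-word embeddings, so each word receives exactly one entity label.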
Keywords