Zhejiang dianli (Jun 2024)
Research on feature extraction of unstructured large power texts
Abstract
Large power texts contain many abbreviated technical terms, alternative names, and irregular expressions. Existing word segmentation tools often fail to recognize the specialized vocabulary of electrical engineering, which significantly hinders the analysis and use of unstructured texts. To address this problem, this paper proposes a set of indexing rules tailored to the characteristics of unstructured electrical engineering texts. Segmentation based on these rules can significantly improve segmentation accuracy and lays a foundation for feature extraction from power texts. A long-text segmentation algorithm is then used to preserve the semantic information of the original text, and the text-level features extracted by the BERT model are integrated with the word-level features extracted by Word2Vec. This combined approach enables precise feature extraction from large unstructured power texts. Experimental results demonstrate the effectiveness of the proposed method.
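The abstract does not specify the long-text segmentation algorithm or the way the two feature sets are combined. As a minimal sketch only, the following assumes overlapping fixed-length windows for long texts and simple concatenation of a text-level vector with mean-pooled word vectors. The function names, the window parameters, and the toy vectors are all hypothetical, and plain Python lists stand in for real BERT and Word2Vec outputs.

```python
from typing import List, Sequence


def split_into_windows(tokens: Sequence[str], max_len: int, overlap: int) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Assumption: overlapping windows are one common way to keep context across
    chunk boundaries; the paper's actual algorithm may differ.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the text
        start += step
    return windows


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse(text_vec: Sequence[float], word_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate a text-level vector with the mean of the word-level vectors.

    Assumption: concatenation is used here purely for illustration.
    """
    return list(text_vec) + mean_pool(word_vecs)


if __name__ == "__main__":
    tokens = list("abcdefghij")  # stand-in for a segmented long text
    for window in split_into_windows(tokens, max_len=4, overlap=1):
        print(window)
    # Toy 2-dimensional vectors standing in for model outputs.
    print(fuse([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]]))
```

In practice each window would be encoded separately and the per-window vectors aggregated before fusion; how that aggregation is done is likewise an open design choice that the abstract does not describe.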
Keywords