IEEE Access (Jan 2022)

Automatic Information Extraction in the Third-Generation Semiconductor Materials Domain Based on DKNet and MANet

  • Xiaobo Jiang,
  • Kun He,
  • Borui Yang

DOI
https://doi.org/10.1109/ACCESS.2022.3159338
Journal volume & issue
Vol. 10
pp. 29367 – 29376

Abstract

Third-generation semiconductor materials (TGSMs) are a frontier scientific domain in which researchers must consult extensive literature for entity information on materials, devices, preparation methods, and experimental performance, and sort out the complex relations among them. However, the explosion of relevant papers has far exceeded researchers’ capacity to read them. In this article, automatic information extraction for the TGSM field is carried out using entity recognition (ER) and relation extraction (RE) techniques. First, corpora for ER and RE in this field are created. Second, to address the complexity of the entities, a neural network using domain knowledge (DKNet) is proposed to improve ER performance. It uses the keyword sequence of each entity type as prior knowledge, adds a dedicated embedding to encode entity categories, and then combines the prior knowledge and encoded vectors with the context through a gated information fusion module to assist recognition. To address the dependence of entity relations on indicative words, a multi-aspect attention-based network model (MANet) is proposed to strengthen attention to relation-indicative words, thereby improving RE performance. Finally, F1 scores of 74.5 and 85.9 were achieved on the created ER and RE test sets, respectively, outperforming other advanced models by 3.4 to 10.1, which is the best performance for automatic information extraction in the TGSM field.
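
The abstract does not give the equations of the gated information fusion module, so the following is only a minimal sketch of one common form of gated fusion, not the DKNet implementation. It assumes a sigmoid gate computed from a context vector and a prior-knowledge vector, which then blends the two element-wise. The names `gated_fusion`, `W_c`, `W_k`, and `bias` are hypothetical and do not come from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(context, prior, W_c, W_k, bias):
    """Blend a context vector with a prior-knowledge vector.

    Assumed form (not taken from the paper):
        gate  = sigmoid(W_c @ context + W_k @ prior + bias)
        fused = gate * context + (1 - gate) * prior
    """
    a = matvec(W_c, context)
    b = matvec(W_k, prior)
    gate = [sigmoid(ai + bi + ci) for ai, bi, ci in zip(a, b, bias)]
    return [g * c + (1.0 - g) * k for g, c, k in zip(gate, context, prior)]


# Toy usage with random (untrained) weights; all values are illustrative.
rng = random.Random(0)
dim = 4
W_c = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
W_k = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
bias = [0.0] * dim

context = [0.5, -1.0, 2.0, 0.0]  # stand-in for a token's contextual encoding
prior = [1.0, 1.0, -1.0, 0.0]    # stand-in for an encoded keyword/category vector
fused = gated_fusion(context, prior, W_c, W_k, bias)
```

Because the gate lies strictly between 0 and 1, each fused component falls between the corresponding context and prior components, so the gate decides per dimension how much prior knowledge to admit.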

Keywords