Jiàoyù zīliào yǔ túshūguǎn xué (Jun 2005)

自動化研究主題探勘方法及其在計算語言學之應用 An Automatic Method for Topic Exploration in a Subject Domain and Its Application on Computational Linguistics

  • Sung-Chien Lin

Journal volume & issue
Vol. 42, no. 4
pp. 523 – 544

Abstract

As scientific research grows in scale and research tasks become more complex, researchers and science and technology managers need an effective method for exploring the important topics in a research domain. In earlier work, we proposed a series of text processing and text mining techniques to address this problem, chiefly key term extraction and information visualization. Key term extraction takes the text of papers in the examined domain as input and automatically selects key terms to represent the domain's important topics. Information visualization presents these terms and the relationships among them in two-dimensional graphs, so that users can browse the domain's important topics and their development and use the graphs to support research and management decisions. The remaining techniques include estimating term co-occurrence, calculating the degree of relevance between topics, and mapping papers onto the topic graph. This paper integrates these techniques into an automatic method for topic exploration and applies it to computational linguistics research in Taiwan to identify the foci of research and development in the domain.
The results show that early studies in the domain emphasized the computational theorization of various kinds of linguistic knowledge, in response to the needs of machine translation. In the middle and later periods, more applications emerged, such as speech processing and information retrieval, and the techniques applied tended toward statistical approaches, which are robust and easy to implement.

Keywords