The popularity of information technology has fueled a growing interest in smart education and has made it possible to combine online and offline education. Knowledge graphs, an effective technology for knowledge representation and management, have been successfully used to manage massive educational resources. However, existing research on constructing educational knowledge graphs ignores multiple modalities and the relationships among them, such as teacher speech and its relationship with knowledge. To tackle this problem, we propose an automatic approach to constructing multi-modal educational knowledge graphs that integrate speech as a modality, facilitating the reuse of educational resources. Specifically, we first propose EduBERT, a Bidirectional Encoder Representations from Transformers (BERT) model fine-tuned using an education lexicon, which can adaptively capture effective information in the education domain. We also add a Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) layer to identify educational entities. Then, the positional information of the entities is incorporated into BERT to extract educational relations. In addition, to address the limitations of traditional text-based knowledge graphs, we collect teacher speech to construct a multi-modal knowledge graph, and we propose a speech-fusion method that links these speech data into the graph as a class of entities. Numerical results show that the proposed approach can manage and present educational resources in various modalities and can provide better educational services.
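The abstract does not specify how entity positional information is fed into BERT for relation extraction. One common technique is to wrap the head and tail entities in special marker tokens before encoding, so the model sees where each entity sits, and then pool the encoder outputs at the marker positions. The sketch below illustrates only that marker-insertion step under these assumptions: it works on word-level tokens before subword tokenization, the marker strings `[E1]`, `[/E1]`, `[E2]`, `[/E2]` and the function name `mark_entities` are hypothetical, and it is not necessarily the method used for EduBERT.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) word-token indices, end exclusive


def mark_entities(tokens: List[str], head: Span, tail: Span) -> Tuple[List[str], int, int]:
    """Wrap the head and tail entity spans in (hypothetical) marker tokens.

    Returns the marked token list and the indices of the two opening markers,
    whose encoder outputs a relation classifier could pool.
    """
    n = len(tokens)
    for start, end in (head, tail):
        if not 0 <= start < end <= n:
            raise ValueError(f"span {(start, end)} is out of range for {n} tokens")
    # Half-open intervals overlap iff each starts before the other ends.
    if head[0] < tail[1] and tail[0] < head[1]:
        raise ValueError("entity spans must not overlap")

    # Non-overlapping spans never share a start or an end, so no key collides.
    opens: Dict[int, str] = {head[0]: "[E1]", tail[0]: "[E2]"}
    closes: Dict[int, str] = {head[1]: "[/E1]", tail[1]: "[/E2]"}

    marked: List[str] = []
    e1_pos = e2_pos = -1
    for i in range(n + 1):
        # Emit a closing marker before an opening one so adjacent spans nest correctly.
        if i in closes:
            marked.append(closes[i])
        if i in opens:
            if opens[i] == "[E1]":
                e1_pos = len(marked)
            else:
                e2_pos = len(marked)
            marked.append(opens[i])
        if i < n:
            marked.append(tokens[i])
    return marked, e1_pos, e2_pos


if __name__ == "__main__":
    sentence = ["Ohm's", "law", "relates", "voltage", "and", "current"]
    marked, e1, e2 = mark_entities(sentence, head=(0, 2), tail=(3, 4))
    print(marked, e1, e2)
```

In an entity-marker setup of this kind, the marker strings would be added to the tokenizer's vocabulary, and a classifier might concatenate the encoder's hidden states at `e1` and `e2` (after mapping word indices to subword indices) to predict the relation type.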