IEEE Access (Jan 2024)
BioMDSum: An Effective Hybrid Biomedical Multi-Document Summarization Method Based on PageRank and Longformer Encoder-Decoder
Abstract
Biomedical multi-document summarization (BioMDSum) involves automatically generating concise and informative summaries from collections of related biomedical documents. While extractive summarization methods have shown promise, they often produce incoherent summaries. On the other hand, fully abstractive methods yield coherent summaries but demand extensive training datasets and computational resources because biomedical documents are typically lengthy. To address these challenges, we propose a hybrid summarization method that combines the strengths of both approaches. The proposed method consists of two main phases: (i) an extractive summarization phase that uses k-means clustering to group similar sentences by the cosine similarity of their embeddings, which are generated by the Sentence-BERT model, followed by the PageRank algorithm for sentence scoring and selection; and (ii) an abstractive summarization phase that fine-tunes a Longformer Encoder-Decoder (LED) transformer model to generate a concise and coherent summary from the sentences selected during the extractive phase. We conducted several experiments on the standard biomedical multi-document summarization datasets Cochrane and MS^2. The results demonstrate that the proposed method is competitive and outperforms recent state-of-the-art systems on ROUGE evaluation measures. Specifically, our model achieved ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR scores of 29.41%, 6.57%, 18.31%, 85.95%, and 22.15% on the Cochrane dataset, and 28.79%, 8.22%, 17.93%, 85.51%, and 25.17% on the MS^2 dataset, respectively. Furthermore, an ablation analysis shows that integrating the extractive and abstractive phases in our hybrid summarization method enhances the overall performance of the proposed approach.
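The extractive phase described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how clustering and PageRank are combined, so the sketch assumes that PageRank is run separately inside each k-means cluster over a cosine-similarity graph and that the top-ranked sentence of each cluster is kept. The function names (`cosine`, `kmeans`, `pagerank`, `select_sentences`), the damping factor of 0.85, the deterministic centroid initialization, and the toy two-dimensional vectors standing in for Sentence-BERT embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, iters=20):
    """Plain k-means with cosine assignment; centroids start at the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to the most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pagerank(sim, damping=0.85, iters=50):
    """Power-iteration PageRank on a weighted similarity graph without self-loops."""
    n = len(sim)
    weights = [[sim[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    out = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for j in range(n):
            inflow = sum(
                scores[i] * weights[i][j] / out[i] for i in range(n) if out[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * inflow)
        scores = new_scores
    return scores


def select_sentences(embeddings, k, per_cluster=1):
    """Return indices of the top-ranked sentence(s) in each cluster (assumed scheme)."""
    labels = kmeans(embeddings, k)
    chosen = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        # Negative similarities are clipped to zero so edge weights stay non-negative.
        sim = [[max(cosine(embeddings[a], embeddings[b]), 0.0) for b in idx] for a in idx]
        scores = pagerank(sim)
        ranked = sorted(range(len(idx)), key=lambda p: scores[p], reverse=True)
        chosen.extend(idx[p] for p in ranked[:per_cluster])
    return sorted(chosen)


# Toy vectors standing in for sentence embeddings (hypothetical data).
toy_embeddings = [
    [1.0, 0.0], [0.0, 1.0], [0.9, 0.1],
    [0.1, 0.9], [0.8, 0.3], [0.3, 0.8],
]
selected = select_sentences(toy_embeddings, k=2)
```

In this toy run each cluster contributes its most central sentence, which is the behavior one would want before handing the selected sentences to an abstractive model such as LED. In practice the embeddings would come from a sentence encoder, and established library implementations of k-means and PageRank would likely replace these hand-written loops.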
Keywords