Scientific Reports (Jul 2024)

Optimizing word embeddings for small dataset: a case study on patient portal messages from breast cancer patients

  • Qingyuan Song,
  • Congning Ni,
  • Jeremy L. Warner,
  • Qingxia Chen,
  • Lijun Song,
  • S. Trent Rosenbloom,
  • Bradley A. Malin,
  • Zhijun Yin

DOI
https://doi.org/10.1038/s41598-024-66319-z
Journal volume & issue
Vol. 14, no. 1
pp. 1–9

Abstract

Patient portal messages often relate to specific clinical phenomena (e.g., patients undergoing treatment for breast cancer) and, as a result, have received increasing attention in biomedical research. These messages require natural language processing and, while word embedding models, such as word2vec, have the potential to extract meaningful signals from text, they are not readily applicable to patient portal messages. This is because embedding models typically require millions of training samples to sufficiently represent semantics, while the volume of patient portal messages associated with a particular clinical phenomenon is often relatively small. We introduce a novel adaptation of the word2vec model, PK-word2vec (where PK stands for prior knowledge), for small-scale messages. PK-word2vec incorporates the most similar terms for medical words (including problems, treatments, and tests) and non-medical words from two pre-trained embedding models as prior knowledge to improve the training process. We applied PK-word2vec in a case study of patient portal messages in the Vanderbilt University Medical Center electronic health record system sent by patients diagnosed with breast cancer from December 2004 to November 2017. We evaluated the model through a set of 1000 tasks, each of which compared the relevance of a given word to two groups of five words: the most similar words generated by PK-word2vec and the most similar words generated by the standard word2vec model. We recruited 200 Amazon Mechanical Turk (AMT) workers and 7 medical students to perform the tasks. The dataset was composed of 1389 patient records and included 137,554 messages with 10,683 unique words. Prior knowledge was available for 7981 non-medical and 1116 medical words. In over 90% of the tasks, both reviewers indicated that PK-word2vec generated more similar words than standard word2vec (p = 0.01). The difference in the evaluation by AMT workers versus medical students was negligible across all comparisons of task choices between the two groups of reviewers (p = 0.774 under a paired t-test). PK-word2vec can effectively learn word representations from a small message corpus, marking a significant advancement in processing patient portal messages.
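The abstract describes PK-word2vec only at a high level: the most similar terms for in-vocabulary words, drawn from pre-trained embedding models, are injected into training as prior knowledge. The sketch below is one plausible realization of that idea, not the authors' implementation. It assumes gensim (the KeyedVectors and Word2Vec classes), a hypothetical pre-trained vector file "pretrained.kv", and the simplifying choice of injecting the prior knowledge by appending each word's top-5 neighbor list as a synthetic training context.

    # A minimal sketch of the prior-knowledge idea described in the abstract;
    # this is NOT the authors' released implementation. Assumes gensim >= 4
    # and a hypothetical pre-trained vector file "pretrained.kv".
    from gensim.models import KeyedVectors, Word2Vec

    def build_prior_knowledge(vocab, pretrained, topn=5):
        # Map each in-vocabulary word to its top-n neighbors in the
        # pre-trained embedding space (the "prior knowledge").
        prior = {}
        for word in vocab:
            if word in pretrained.key_to_index:
                prior[word] = [w for w, _ in pretrained.most_similar(word, topn=topn)]
        return prior

    def augment_corpus(sentences, prior):
        # Append one short pseudo-sentence per word so that small-corpus
        # training also observes the pre-trained neighborhoods.
        augmented = list(sentences)
        for word, neighbors in prior.items():
            augmented.append([word] + neighbors)
        return augmented

    # Hypothetical usage on a tiny message corpus:
    sentences = [["mastectomy", "scheduled", "for", "next", "week"],
                 ["pain", "after", "radiation", "therapy"]]
    pretrained = KeyedVectors.load("pretrained.kv")  # placeholder path
    vocab = {w for s in sentences for w in s}
    prior = build_prior_knowledge(vocab, pretrained)
    model = Word2Vec(augment_corpus(sentences, prior), vector_size=100, min_count=1)

In the study itself, prior knowledge came from two pre-trained models (one covering medical terms such as problems, treatments, and tests, and one covering general vocabulary); the single-model version above is kept deliberately small for illustration.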

Keywords