Title2Vec: a contextual job title embedding for occupational named entity recognition and other applications

Junhua Liu; Yung Chuen Ng; Zitong Gui; Trisha Singhal; Lucienne T. M. Blessing; Kristin L. Wood; Kwan Hui Lim

doi:10.1186/s40537-022-00649-5

Journal of Big Data (Sep 2022)

Affiliations

Junhua Liu: Singapore University of Technology and Design
Yung Chuen Ng: Singapore University of Technology and Design
Zitong Gui: Singapore University of Technology and Design
Trisha Singhal: Singapore University of Technology and Design
Lucienne T. M. Blessing: Singapore University of Technology and Design
Kristin L. Wood: Singapore University of Technology and Design
Kwan Hui Lim: Singapore University of Technology and Design

DOI: https://doi.org/10.1186/s40537-022-00649-5
Journal volume & issue: Vol. 9, no. 1
pp. 1 – 16

Abstract

Occupational data mining and analysis is an important task in understanding today’s industry and job market. Various machine learning techniques have been proposed and are gradually being deployed to improve companies’ operations for upstream tasks, such as employee churn prediction, career trajectory modelling and automated interviews. Job title analysis and embedding, as fundamental building blocks, are crucial upstream tasks for addressing these occupational data mining and analysis problems. These tasks require a relevant occupational job title dataset, and towards that effort we present the Industrial and Professional Occupations Dataset (IPOD). IPOD contains over 475,073 job titles based on 192,295 user profiles from a major professional networking site. To further facilitate these applications of occupational data mining and analysis, we propose Title2vec, a contextual job title vector representation using a bidirectional Language Model approach. To demonstrate the effectiveness of Title2vec, we also define an occupational Named Entity Recognition (NER) task and propose two methods based on Conditional Random Fields (CRF) and bidirectional Long Short-Term Memory with CRF (LSTM-CRF). Using a large occupational job title dataset, experimental results show that both CRF and LSTM-CRF outperform human and baseline performance in both exact-match accuracy and F1 scores. The dataset and pre-trained embeddings have been made publicly available at https://www.github.com/junhua/ipod.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
