Genomics & Informatics (Sep 2020)
Organizing an in-class hackathon to correct PDF-to-text conversion errors of Genomics & Informatics 1.0
- Sunho Kim,
- Royoung Kim,
- Hee-Jo Nam,
- Ryeo-Gyeong Kim,
- Enjin Ko,
- Han-Su Kim,
- Jihye Shin,
- Daeun Cho,
- Yurhee Jin,
- Soyeon Bae,
- Ye Won Jo,
- San Ah Jeong,
- Yena Kim,
- Seoyeon Ahn,
- Bomi Jang,
- Jiheyon Seong,
- Yujin Lee,
- Si Eun Seo,
- Yujin Kim,
- Ha-Jeong Kim,
- Hyeji Kim,
- Hye-Lynn Sung,
- Hyoyoung Lho,
- Jaywon Koo,
- Jion Chu,
- Juwon Lim,
- Youngju Kim,
- Kyungyeon Lee,
- Yuri Lim,
- Meongeun Kim,
- Seonjeong Hwang,
- Shinhye Han,
- Sohyeun Bae,
- Sua Kim,
- Suhyeon Yoo,
- Yeonjeong Seo,
- Yerim Shin,
- Yonsoo Kim,
- You-Jung Ko,
- Jihee Baek,
- Hyejin Hyun,
- Hyemin Choi,
- Ji-Hye Oh,
- Da-Young Kim,
- Hyun-Seok Park
Affiliations
- Sunho Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Royoung Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hee-Jo Nam
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Ryeo-Gyeong Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Enjin Ko
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Han-Su Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Jihye Shin
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Daeun Cho
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yurhee Jin
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Soyeon Bae
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Ye Won Jo
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- San Ah Jeong
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yena Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Seoyeon Ahn
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Bomi Jang
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Jiheyon Seong
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yujin Lee
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Si Eun Seo
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yujin Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Ha-Jeong Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hyeji Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hye-Lynn Sung
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hyoyoung Lho
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Jaywon Koo
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Jion Chu
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Juwon Lim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Youngju Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Kyungyeon Lee
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yuri Lim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Meongeun Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Seonjeong Hwang
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Shinhye Han
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Sohyeun Bae
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Sua Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Suhyeon Yoo
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yeonjeong Seo
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yerim Shin
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Yonsoo Kim
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- You-Jung Ko
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Jihee Baek
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hyejin Hyun
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Hyemin Choi
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Ji-Hye Oh
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
- Da-Young Kim
- Hyun-Seok Park
- Bioinformatics & Natural Language Processing Laboratory, ELTEC College of Engineering, Ewha Womans University, Seoul 03760, Korea
DOI
- https://doi.org/10.5808/GI.2020.18.3.e33
Journal volume & issue
- Vol. 18, no. 3, p. e33
Abstract
This paper describes a community effort to improve earlier versions of the full-text corpus of Genomics & Informatics by semi-automatically detecting and correcting PDF-to-text conversion errors and optical character recognition errors during the first Genomics & Informatics Annotation Hackathon (GIAH) event. Extracting text from multi-column biomedical documents such as Genomics & Informatics is notoriously difficult. The hackathon was piloted as part of a coding competition at the ELTEC College of Engineering at Ewha Womans University to enable researchers and students to create or annotate their own versions of the Genomics & Informatics corpus, to gain and create knowledge about corpus linguistics, and, at the same time, to acquire tangible and transferable skills. The projects proposed during the hackathon draw on an internal database containing different versions of the corpus and annotations.
Keywords