Sec-Lib: Protecting Scholarly Digital Libraries From Infected Papers Using Active Machine Learning Framework

Nir Nissim; Aviad Cohen; Jian Wu; Andrea Lanzi; Lior Rokach; Yuval Elovici; Lee Giles

doi:10.1109/ACCESS.2019.2933197

IEEE Access (Jan 2019)

Affiliations

Nir Nissim: Malware Lab, Cyber Security Research Center (CSRC), Ben-Gurion University, Beersheba, Israel
Aviad Cohen: Malware Lab, Cyber Security Research Center (CSRC), Ben-Gurion University, Beersheba, Israel
Jian Wu: Computer Science Department, Old Dominion University, Norfolk, VA, USA
Andrea Lanzi: Computer Science Department, University of Milan, Milan, Italy
Lior Rokach: Malware Lab, Cyber Security Research Center (CSRC), Ben-Gurion University, Beersheba, Israel
Yuval Elovici: Malware Lab, Cyber Security Research Center (CSRC), Ben-Gurion University, Beersheba, Israel
Lee Giles: Computer Science and Engineering Department, Pennsylvania State University, State College, PA, USA

DOI: https://doi.org/10.1109/ACCESS.2019.2933197
Journal volume & issue: Vol. 7, pp. 110050–110073

Abstract

Researchers in academia and the corporate sector rely on scholarly digital libraries to access articles. Attackers take advantage of unsuspecting users who assume that article files are safe and therefore open PDF files with little concern. In addition, researchers regard scholarly libraries as a reliable, trusted, and untainted corpus of papers. For these reasons, scholarly digital libraries are an attractive target and inadvertently support the proliferation of cyber-attacks launched via malicious PDF files. In this study, we present the relevant vulnerabilities of scholarly digital libraries and the malware distribution approaches that exploit them. We evaluated over two million scholarly papers in the CiteSeerX library and found the library to be contaminated with a surprisingly large number (0.3-2%) of malicious PDF documents, over 55% of which were crawled from the IP addresses of US universities. We developed Sec-Lib, a two-layered framework that enhances the detection of malicious PDF documents and offers a security solution for large digital libraries. Sec-Lib includes a deterministic layer for detecting known malware and a machine-learning-based layer for detecting unknown malware. Our evaluation showed that scholarly digital libraries can detect 96.9% of malware with Sec-Lib while minimizing the number of PDF files that require labeling, thus reducing the manual inspection effort of security experts by 98%.
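The two-layer idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how either layer works, so everything here is an assumption made for illustration. The deterministic layer is modeled as a SHA-256 lookup against a set of known-malicious hashes, the learning layer as a toy keyword score, and the selection of files for expert labeling as plain uncertainty sampling. The names `triage`, `keyword_score`, and `SUSPICIOUS_KEYS` are hypothetical.

```python
from __future__ import annotations

import hashlib
from typing import Callable

# Hypothetical feature set: PDF dictionary keys often associated with active content.
SUSPICIOUS_KEYS = (b"/JavaScript", b"/OpenAction", b"/Launch")


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a file's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def keyword_score(data: bytes) -> float:
    """Toy stand-in for a trained classifier: fraction of suspicious keys present."""
    hits = sum(1 for key in SUSPICIOUS_KEYS if key in data)
    return hits / len(SUSPICIOUS_KEYS)


def triage(
    files: dict[str, bytes],
    known_bad_hashes: set[str],
    score: Callable[[bytes], float],
    threshold: float = 0.5,
    label_budget: int = 1,
) -> tuple[dict[str, str], list[str]]:
    """Assign a verdict to each file and pick the files an expert should label.

    Layer 1 (assumed): exact hash match against known malware.
    Layer 2 (assumed): score in [0, 1]; at or above the threshold means suspected malicious.
    Labeling (assumed): the files whose score is closest to the threshold.
    """
    verdicts: dict[str, str] = {}
    uncertainty: list[tuple[float, str]] = []
    for name, data in files.items():
        if sha256_hex(data) in known_bad_hashes:
            verdicts[name] = "malicious (known)"
            continue  # known malware needs no further analysis
        p = score(data)
        verdicts[name] = "malicious (suspected)" if p >= threshold else "benign (suspected)"
        uncertainty.append((abs(p - threshold), name))
    uncertainty.sort()  # smallest distance to the threshold comes first
    to_label = [name for _, name in uncertainty[:label_budget]]
    return verdicts, to_label
```

In this sketch only the files returned in `to_label` would go to a security expert; their labels could then be used to retrain the scoring model, which is the general pattern by which active learning reduces manual inspection.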

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
