Developing a Web Scraping Application with Bypass Blocking

Alexey A. Moskalenko; Olga R. Laponina; Vladimir A. Sukhomlin

doi:10.25559/SITITO.15.201902.413-420

Современные информационные технологии и IT-образование (Jul 2019)

Affiliations

Alexey A. Moskalenko: Lomonosov Moscow State University (Russia)
Olga R. Laponina: Lomonosov Moscow State University (Russia)
Vladimir A. Sukhomlin: Lomonosov Moscow State University (Russia)

DOI: https://doi.org/10.25559/SITITO.15.201902.413-420
Journal volume & issue: Vol. 15, no. 2
pp. 413 – 420

Abstract

Web scraping is the process of extracting data from web pages by automating requests to websites. Its importance has grown with the development of the Internet: more than half of Internet traffic (excluding streaming, i.e. audio and video) is generated by automated means, so-called bots. This article studies the web scraping process and the problem of websites blocking web scrapers. We consider the basic principles and concepts of web scraping and a classification of web scrapers. We review existing web scraping solutions, highlighting their main advantages and disadvantages with respect to bypassing blocking. We examine why websites block web scrapers and the signs by which websites detect and block them. We investigate techniques for bypassing such blocking and their impact on the web scraping process. We propose a program, written in Python, that applies these techniques. The program has a graphical interface, built with the Tkinter framework, for creating a web scraping policy. The blocking-bypass techniques rely on Selenium WebDriver, an open source framework for automating user actions in a browser. A comparative analysis of web scrapers showed that the modules developed in this work make it possible to bypass the blocking of web scraping.
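
The abstract does not specify what a web scraping policy contains, so the following is only a minimal sketch of what such a policy record might look like, under stated assumptions. The class name `ScrapingPolicy`, its fields, and the helper `plan_requests` are hypothetical and do not come from the article. Irregular pauses between requests are used here as one commonly discussed way to make automated traffic look less uniform; it is not claimed to be the authors' method.

```python
import random
from dataclasses import dataclass


@dataclass
class ScrapingPolicy:
    """Hypothetical policy record; all field names are illustrative assumptions."""
    start_url: str
    min_delay_s: float = 2.0   # shortest pause between two requests
    max_delay_s: float = 6.0   # longest pause between two requests
    max_pages: int = 10        # upper bound on pages visited in one run

    def next_delay(self, rng: random.Random) -> float:
        # Draw an irregular pause so requests are not evenly spaced.
        return rng.uniform(self.min_delay_s, self.max_delay_s)


def plan_requests(policy, urls, seed=None):
    """Pair each URL (up to max_pages) with the pause to wait before requesting it."""
    rng = random.Random(seed)
    return [(url, policy.next_delay(rng)) for url in urls[:policy.max_pages]]


if __name__ == "__main__":
    policy = ScrapingPolicy(start_url="https://example.com", max_pages=3)
    for url, delay in plan_requests(policy, ["/a", "/b", "/c", "/d"], seed=1):
        print(f"wait {delay:.2f}s, then fetch {url}")
```

A browser automation layer such as Selenium WebDriver would then consume this plan, sleeping for each delay before loading the corresponding page; that part is omitted because the abstract gives no details about it.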

Published in Современные информационные технологии и IT-образование

ISSN: 2411-1473 (Print)
Publisher: The Fund for Promotion of Internet media, IT education, human development «League Internet Media»
Country of publisher: Russian Federation
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://sitito.cs.msu.ru
