Современные информационные технологии и IT-образование (Jul 2019)
Developing a Web Scraping Application with Bypass Blocking
Abstract
Web scraping is the process of extracting data from web pages by automating requests to websites. Its importance has grown with the development of the Internet: more than half of Internet traffic (excluding streaming, i.e. audio and video) is generated by automated means, so-called bots. This article studies the web scraping process and the problem of websites blocking web scrapers. We consider the basic principles and concepts of web scraping and a classification of web scrapers. We review existing web scraping solutions, highlighting their main advantages and disadvantages with respect to bypassing blocking. We examine the reasons why websites block web scrapers and the signs by which websites detect them. We investigate techniques for bypassing the blocking of web scrapers and their impact on the web scraping process. We propose a program, written in Python, that applies these techniques. The program has a graphical interface, built with Tkinter, for creating a web scraping policy. The blocking-bypass techniques rely on Selenium WebDriver, an open-source framework for automating user actions in a browser. A comparative analysis of web scrapers showed that the modules developed in this work make it possible to bypass the blocking of web scraping.
Keywords