IEEE Access (Jan 2024)

Automatic Regular Expression Generation for Extracting Relevant Image Data From Web Pages Using Genetic Algorithms

  • Canan Aslanyurek,
  • Tarik Yerlikaya

DOI
https://doi.org/10.1109/ACCESS.2024.3420734
Journal volume & issue
Vol. 12
pp. 90660 – 90669

Abstract

This study presents a method that uses genetic algorithms to automatically generate regular expressions for extracting relevant images from web pages. Data extraction is usually performed with web scrapers, but it can also be done with regular expressions. Regular expressions are complex, however, and writing them requires expert knowledge. The proposed method automatically creates a regular expression that retrieves the images relevant to news content on websites. Following the principle of genetic algorithms, in which good candidates survive and bad ones are eliminated, it produces a regular expression that reaches the most relevant image. Automatic regular expression generation with genetic algorithms can therefore replace the time-consuming and error-prone task of hand-crafting a pattern for each site with web scraper tools. The study uses a text-based dataset of relevant and irrelevant images collected from 200 websites in 58 countries; of its 635,015 image records, 22,682 are relevant. Regular expressions generated by the genetic algorithm from the relevant image data alone retrieve approximately 98.49% of the relevant images.
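
The general idea of evolving a regular expression by selection, crossover, and mutation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy URLs, the token set, the fitness function, and every parameter value are assumptions made for illustration only. In particular, this sketch scores candidates against both relevant and irrelevant examples, whereas the abstract states that the study's expressions are produced from the relevant image data alone.

```python
import random
import re

# Hypothetical toy data: image URLs labelled relevant or irrelevant.
RELEVANT = [
    "https://example.com/news/2024/01/story-123.jpg",
    "https://example.com/news/2023/11/story-98.jpg",
    "https://example.com/news/2024/02/story-7.png",
]
IRRELEVANT = [
    "https://example.com/static/logo.png",
    "https://example.com/ads/banner-1.jpg",
    "https://example.com/static/icons/share.png",
]

# Assumed building blocks; any concatenation of them is a valid regex.
TOKENS = ["/news/", "/static/", "/ads/", "story-", r"\d+", r"\w+",
          "/", r"\.jpg", r"\.png", ".*"]


def to_pattern(genes):
    """Join a list of tokens into one regular expression string."""
    return "".join(genes)


def fitness(genes, relevant, irrelevant):
    """Relevant URLs matched minus irrelevant URLs matched."""
    rx = re.compile(to_pattern(genes))
    hits = sum(1 for url in relevant if rx.search(url))
    false_hits = sum(1 for url in irrelevant if rx.search(url))
    return hits - false_hits


def evolve(relevant, irrelevant, generations=30, pop_size=20,
           max_len=4, seed=0):
    """Return (pattern, score) for the best candidate found."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible

    def random_genes():
        return [rng.choice(TOKENS) for _ in range(rng.randint(1, max_len))]

    def score(genes):
        return fitness(genes, relevant, irrelevant)

    population = [random_genes() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: prefix of one parent plus suffix of the other.
            child = a[: rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
            child = child[:max_len] or random_genes()
            # Mutation: occasionally replace one token at random.
            if rng.random() < 0.2:
                child[rng.randrange(len(child))] = rng.choice(TOKENS)
            children.append(child)
        population = survivors + children

    best = max(population, key=score)
    return to_pattern(best), score(best)


if __name__ == "__main__":
    pattern, best_score = evolve(RELEVANT, IRRELEVANT)
    print(pattern, best_score)
```

Because the surviving half of each generation is carried over unchanged, the best score never decreases from one generation to the next. A realistic system would need a richer token set and a fitness measure suited to full HTML image markup rather than bare URLs.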

Keywords