IEEE Access (Jan 2024)
Humkinar: Construction of a Large Scale Web Repository and Information System for Low Resource Urdu Language
Abstract
Online content availability, commercial viability, and technological advancements for English and European languages lead mainstream search engines to prioritize search results in these high-resource languages. This makes it challenging for users of low-resource languages to access search results in regional languages, access that is essential to promoting literacy, inclusion, and digital accessibility. In this article, we create Humkinar, an Urdu-language search engine built with open-source tools. Our search engine is designed with five key components: computing infrastructure, data collector, search manager, web analytics engine, and user interface. First, our in-house computing infrastructure offers 160 GB of RAM, 80 cores, and 30 TB of storage to support the operations of the search engine. Next, we customize an open-source web crawler with a specialized Urdu-focused URL selection algorithm, webpage parser, and content selection mechanism to collect Urdu webpages while making efficient use of computing and Internet resources. We also employ specialized content scrapers to collect targeted, high-priority Urdu content such as news articles, Wikipedia, poetry, and books. Overall, our data collector module has curated a repository containing 14 million crawled webpages and 2.2 million scraped Urdu documents. We also design post-processing tools for tasks such as topic classification, de-duplication, profanity assessment, text summarization, and website quality scoring specific to the Urdu language. In addition, acknowledging the limitations of applying conventional ranking signals to the Urdu language, the search manager uses seven ranking signals that we derive to order search results. These signals are tuned to emphasize the richness and quality of Urdu-language websites and content in search results.
Moreover, we incorporate a web analytics engine into our search engine to collect and analyze user actions and metadata, with the aim of improving the overall functionality and effectiveness of the search engine. Our web analytics engine has recorded 400K user interactions from 83 countries, conducted through the interactive user interface. Finally, we conduct usability testing of the search engine with native Urdu speakers to assess its strengths and weaknesses.
Keywords