Computer Science (Jan 2013)

Staged Event-Driven Architecture As A Micro-Architecture Of Distributed And Pluginable Crawling Platform

  • Leszek Siwik,
  • Kamil Wlodarczyk,
  • Mateusz Kluczny

DOI
https://doi.org/10.7494/csci.2013.14.4.645
Journal volume & issue
Vol. 14, no. 4
p. 645

Abstract

Many crawling systems are available on the market, but they are mostly closed systems dedicated to a particular kind and class of tasks, with a predefined scope, strategy, etc. In practice, however, there are significant groups of users (e.g. marketing, criminal, or governmental analysts) who need more than yet another crawling system dedicated to predefined tasks. They need an easy-to-use, user-friendly, all-in-one studio not only for executing and running internet robots and crawlers, but also for (graphically) (re)defining and (re)composing crawlers according to dynamically changing requirements and use cases. The Cassiopeia framework has been designed and developed to realize this idea. One has to remember, however, that because of the enormous size and structural complexity of the WWW, developing effective internet robots, and even more so developing a framework that supports graphical composition of robots, is a really challenging task from a technical and architectural point of view. The crucial aspect in the context of crawling efficiency and scalability is the concurrency model applied. The two most typical concurrency management models are classical concurrency based on a pool of threads and processes, and event-driven concurrency. Neither of them is an ideal approach. That is why research on alternative models is still being conducted, with the aim of proposing an efficient and convenient architecture for concurrent and distributed applications. One promising model is the staged event-driven architecture, which to some extent mixes both of the above-mentioned classical approaches and provides additional benefits, such as splitting an application into separate stages connected by event queues, which is of particular interest given the requirements regarding crawler (re)composition.
The goal of this paper is to present the idea and the proof-of-concept (PoC) implementation of the Cassiopeia framework, with special attention paid to its crucial architectural element, i.e. the design, implementation, and application of the staged event-driven architecture as the micro-architecture of Cassiopeia’s agents, its key computational and processing units.
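
To make the notion of stages connected by event queues more concrete, the following is a minimal sketch of a staged event-driven pipeline. It is not Cassiopeia's implementation: the names `Stage`, `crawl`, `fetch`, `parse`, and the in-memory `FAKE_WEB` dictionary (which stands in for real network access) are all hypothetical and chosen only for illustration. Each stage owns an input queue and a small pool of worker threads, which is one way of combining the thread-pool and event-driven models mentioned above.

```python
import queue
import threading

_STOP = object()  # sentinel telling a stage's workers to shut down

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {"a": "links: b c", "b": "links: c", "c": "links:"}


class Stage:
    """One stage: an input event queue drained by a small pool of threads."""

    def __init__(self, name, handler, workers=2, downstream=None):
        self.name = name
        self.handler = handler        # maps one event to an iterable of events
        self.downstream = downstream  # next Stage, or None for the last stage
        self.inbox = queue.Queue()
        self.threads = [threading.Thread(target=self._run)
                        for _ in range(workers)]

    def start(self):
        for t in self.threads:
            t.start()

    def _run(self):
        while True:
            event = self.inbox.get()
            if event is _STOP:
                break
            for out in self.handler(event):
                if self.downstream is not None:
                    self.downstream.inbox.put(out)

    def stop(self):
        # One sentinel per worker, queued behind all pending events.
        for _ in self.threads:
            self.inbox.put(_STOP)
        for t in self.threads:
            t.join()


def fetch(url):
    """Fetch stage: emit (url, body) for a URL (simulated lookup)."""
    yield (url, FAKE_WEB.get(url, ""))


def parse(page):
    """Parse stage: emit one (source, target) edge per link on the page."""
    url, body = page
    for link in body.split()[1:]:
        yield (url, link)


def crawl(seeds):
    """Wire fetch -> parse -> store and return the sorted link edges."""
    edges, lock = [], threading.Lock()

    def store(edge):
        with lock:
            edges.append(edge)
        return ()  # last stage emits nothing

    store_stage = Stage("store", store, workers=1)
    parse_stage = Stage("parse", parse, downstream=store_stage)
    fetch_stage = Stage("fetch", fetch, downstream=parse_stage)
    pipeline = [fetch_stage, parse_stage, store_stage]

    for stage in pipeline:
        stage.start()
    for seed in seeds:
        fetch_stage.inbox.put(seed)
    # Stopping in pipeline order drains each stage before the next one stops.
    for stage in pipeline:
        stage.stop()
    return sorted(edges)


if __name__ == "__main__":
    print(crawl(["a", "b", "c"]))
```

Because stages communicate only through their queues, a stage can be replaced or re-wired by changing its `downstream` link, which is the property that makes this style of architecture attractive for crawler (re)composition.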