storm-focused-crawler

The storm-focused-crawler manages the URLs extracted from the Items and MediaItems collected by the stream-manager. This process is described in more detail in D4.3. Its main operations are the following:

  • Multimedia Fetcher: For URLs pointing to media content (e.g. links to YouTube or twitpic), the actual media content is downloaded. For videos, only the video thumbnail is downloaded, not the video itself.
  • Article Extractor: For URLs pointing to general web pages, a simple article extraction technique is applied to extract the main article text and title. If a photo is featured in the article, its URL is also extracted and forwarded to the Multimedia Fetcher.
  • VLAD Feature Extractor: In this step, a single feature vector (VLAD) is extracted from the image content. The local features (SURF descriptors) used for its computation are not stored. Further details on the implementation and evaluation of this process are described in D4.2 and D4.3.
  • Feature Indexer: The next step, after feature extraction, is the indexing of the feature vector using Product Quantization (PQ) and Asymmetric Distance Computation (ADC) for fast similarity-based search. Further details are available in D4.2 and D4.3.
  • Location Estimation: In this step, a geographical location is inferred for an input Item or MediaItem based on its textual metadata and its extracted features. This step is not implemented yet, but the research needed for its development has been conducted (described in D4.3).
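
The feature extraction and indexing steps above can be illustrated with a minimal sketch. It is not the project's implementation: it is written in plain Python, it uses toy 2-D descriptors in place of SURF descriptors, and the codebooks are hard-coded, hypothetical values rather than learned ones. The function names (`vlad`, `pq_encode`, `adc_table`, `adc_distance`) are likewise assumptions made for illustration.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: sq_dist(vec, centroids[i]))

def vlad(descriptors, codebook):
    """Aggregate local descriptors into one L2-normalised VLAD vector.

    For each descriptor, the residual to its nearest codebook centroid is
    accumulated; the per-centroid sums are then concatenated.
    """
    dim = len(codebook[0])
    sums = [[0.0] * dim for _ in codebook]
    for d in descriptors:
        k = nearest(d, codebook)
        for j in range(dim):
            sums[k][j] += d[j] - codebook[k][j]
    flat = [v for row in sums for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

def pq_encode(vec, sub_codebooks):
    """Product Quantization: one centroid index per sub-vector."""
    step = len(vec) // len(sub_codebooks)
    return [nearest(vec[i * step:(i + 1) * step], cb)
            for i, cb in enumerate(sub_codebooks)]

def adc_table(query, sub_codebooks):
    """Distances from each (uncompressed) query sub-vector to every sub-centroid."""
    step = len(query) // len(sub_codebooks)
    return [[sq_dist(query[i * step:(i + 1) * step], c) for c in cb]
            for i, cb in enumerate(sub_codebooks)]

def adc_distance(table, code):
    """Asymmetric distance: sum the table entries selected by a stored PQ code."""
    return sum(table[i][c] for i, c in enumerate(code))

# Toy data (hypothetical): 2 visual words, 3 local descriptors of dimension 2.
codebook = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]]
vector = vlad(descriptors, codebook)            # 4-dimensional VLAD vector

# Two sub-quantizers, each covering 2 of the 4 dimensions (hypothetical values).
sub_codebooks = [
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [0.5, 0.0]],
]
code = pq_encode(vector, sub_codebooks)         # compact code kept in the index

query = [0.5, 0.5, 0.5, 0.0]
distance = adc_distance(adc_table(query, sub_codebooks), code)
```

Only the short PQ code needs to be kept per indexed image. At query time the lookup table is computed once per query, so each comparison against an indexed image costs one table lookup per sub-quantizer.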

The aforementioned steps are implemented on top of a Storm topology. The project source code is available in the storm-focused-crawler GitHub project.
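
How a URL might travel through these steps can be sketched as a simple routing function. This is only an illustration in plain Python: it does not use the Storm API, and the host list, step names and the `plan` function are hypothetical. The Location Estimation step is left out because it is not implemented yet.

```python
from urllib.parse import urlparse

# Hypothetical host list; the actual project may recognise media links differently.
MEDIA_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "twitpic.com"}

# Steps applied, in order, to anything that reaches the Multimedia Fetcher.
MEDIA_STEPS = ["multimedia-fetcher", "vlad-feature-extractor", "feature-indexer"]

def is_media_url(url):
    """True if the URL's host is one of the known media hosts."""
    host = urlparse(url).hostname or ""
    return host in MEDIA_HOSTS

def plan(url, featured_photo=None):
    """Return the ordered (step, url) pairs a URL would pass through.

    featured_photo stands for the photo URL that the article extractor
    may find in a web page and forward to the multimedia fetcher.
    """
    if is_media_url(url):
        return [(step, url) for step in MEDIA_STEPS]
    steps = [("article-extractor", url)]
    if featured_photo is not None:
        steps += [(step, featured_photo) for step in MEDIA_STEPS]
    return steps

media_plan = plan("https://www.youtube.com/watch?v=abc")
page_plan = plan("http://example.com/news", "http://example.com/photo.jpg")
```

In a Storm topology each step name would correspond to a separate processing component, and the pairs above would be the messages passed between them.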
