news-orchestrator
The news-orchestrator monitors and controls the analysis and indexing phase of the workflow. Its role is to trigger, in sequence, the modules that participate in DySCO formulation. It starts by synchronizing the stream manager’s output to MongoDB with the analysis workflow, acting as an intermediate buffer that pushes content in batches to the analysis modules. Once the analysis modules have filled in the metadata fields of the Item and DySCO objects, the orchestrator encodes the objects into Solr-compatible documents and feeds them into the Solr server. From this point on, all DySCOs and Items can be retrieved and visualized at the presentation layer through the available User Interfaces.
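The sequential, batch-oriented control flow described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names (`batches`, `run_orchestrator`, `tag_entities`, `tag_sentiment`), the dictionary-based Item representation and the batch size are all hypothetical, and in-memory lists stand in for MongoDB and the Solr server.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-ins: an Item is a plain dict, a stage maps a batch to a batch.
Item = Dict[str, object]
Stage = Callable[[List[Item]], List[Item]]


def batches(items: Iterable[Item], size: int) -> Iterator[List[Item]]:
    """Yield consecutive batches of at most `size` items."""
    batch: List[Item] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def run_orchestrator(
    items: Iterable[Item],
    stages: List[Stage],
    index: Callable[[List[Item]], None],
    batch_size: int = 2,
) -> None:
    """Push each batch through every stage in order, then hand it to the indexer."""
    for batch in batches(items, batch_size):
        for stage in stages:
            batch = stage(batch)
        index(batch)


def tag_entities(batch: List[Item]) -> List[Item]:
    # Toy entity detection (capitalized tokens); the real module relies on Stanford CoreNLP.
    return [
        {**item, "entities": [w for w in str(item["text"]).split() if w.istitle()]}
        for item in batch
    ]


def tag_sentiment(batch: List[Item]) -> List[Item]:
    # Placeholder that labels everything neutral; the real approach is described in D2.2.
    return [{**item, "sentiment": "neutral"} for item in batch]


indexed: List[Item] = []  # stands in for the Solr server
run_orchestrator(
    [
        {"id": 1, "text": "Athens hosts summit"},
        {"id": 2, "text": "rain expected"},
        {"id": 3, "text": "Paris votes"},
    ],
    [tag_entities, tag_sentiment],
    indexed.extend,
)
```

Keeping each stage as a function from batch to batch makes the sequential ordering explicit and lets stages be added or reordered without touching the control loop.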
More specifically, the following components are operated by the news-orchestrator (cf. figure below):
- Entity Extractor: For each incoming Item, the Entity Extractor detects references to named entities. This is based on the Stanford CoreNLP library. Note that this entity extractor is different from the one used by the infotainment-orchestrator.
- Sentiment Analyzer: The Sentiment Analyzer is responsible for the detection of sentiment labels (positive/neutral/negative) for each incoming Item. Details on the adopted approach are provided in D2.2.
- DyscoCreator: The DySCO Creator clusters incoming Items based on the BN-gram method described in D2.2 and in (Aiello et al., 2013). In V2, we plan to explore improvements of the method, as well as additional methods (e.g. the SFPM approach described in D2.2). Several of these topic detection implementations have been made available as open-source projects on GitHub.
- DyscoMatcher: This matches the newly created DySCOs with DySCOs created in previous timeslots (provided their similarity exceeds a certain threshold). In V2, this component might be considerably revised due to the foreseen changes in the DySCO management lifecycle.
- Aggregator: This aggregates the different elements that were extracted per Item (n-grams, keywords, named entities) on a per-DySCO basis.
- Title Extractor: This uses a set of business rules and heuristics to extract a human-readable title for each new DySCO. These rules were revised during the evaluation based on feedback from end users, and are expected to be further updated in V2.
- Ranker: This component (to be created in V2) will associate importance weights to the discovered DySCOs. It will take into account external sources (e.g. RSS feeds, Reddit topics).
- Influencer Extractor: This runs asynchronously (on top of Hadoop) and periodically extracts influencers per keyword (for a set of trending keywords defined on the basis of the created DySCOs).
- Query Creator: This will be responsible for (a) forming appropriate SolrQueries that are used for the retrieval (from the SocialSensor store) of Items, MediaItems and WebPages related to a DySCO of interest, and (b) forming appropriate queries that are used by the stream-manager to fetch (from the wrapped online social networks) additional Items and MediaItems that are related to the newly created DySCO. The source code of the query creator is available in the GitHub socialsensor-query-builder project.
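Two of the components listed above, the DyscoMatcher and the Aggregator, lend themselves to a short illustration. The sketch below is an assumption-laden toy, not the project's code: the text does not state which similarity measure the matcher uses, so Jaccard similarity over keyword sets is swapped in, and the threshold value, the function names (`jaccard`, `match_dysco`, `aggregate_entities`) and the data layout are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (assumed measure, not from the source)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_dysco(
    new_keywords: Set[str],
    previous: Dict[str, Set[str]],
    threshold: float = 0.5,  # hypothetical value
) -> Optional[str]:
    """Return the id of the most similar earlier DySCO whose similarity
    strictly exceeds the threshold, or None if no earlier DySCO qualifies."""
    best_id, best_score = None, threshold
    for dysco_id, keywords in previous.items():
        score = jaccard(new_keywords, keywords)
        if score > best_score:
            best_id, best_score = dysco_id, score
    return best_id


def aggregate_entities(items: List[Dict[str, List[str]]]) -> Counter:
    """Count named entities over all Items that belong to one DySCO."""
    counts: Counter = Counter()
    for item in items:
        counts.update(item.get("entities", []))
    return counts


# DySCOs from earlier timeslots, keyed by a hypothetical id.
previous = {
    "d1": {"election", "paris", "vote"},
    "d2": {"storm", "coast"},
}
matched = match_dysco({"election", "paris", "vote", "result"}, previous)
unmatched = match_dysco({"football"}, previous)
entity_counts = aggregate_entities(
    [{"entities": ["Paris", "Louvre"]}, {"entities": ["Paris"]}]
)
```

Here `matched` is `"d1"` (similarity 3/4), `unmatched` is `None`, and `entity_counts` records `Paris` twice and `Louvre` once. The same counting pattern would apply to n-grams and keywords.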