DySCO

on 23 August 2013.

Concept

Currently, online content is indexed and searched at an atomic level, i.e. each content item is processed and indexed independently of the rest of the collection. SocialSensor attempts to extend this paradigm by performing indexing and search over composite objects relating to a common topic/entity of interest, e.g. an evolving news story (in the case of professional journalists) or a social event in the context of a big festival (in the case of event organizers). In SocialSensor, we call such composite objects DySCOs. The motivation for using DySCOs rather than single items is that DySCOs bring together, in a single informative view, pieces of content that would otherwise be dispersed online. Furthermore, DySCOs make it possible to extract aggregate knowledge and views from the individual pieces of content associated with them. In addition, indexing at the collection level enables a richer representation of contextual information, i.e. instead of considering only the intrinsic features of a single item, the indexing mechanism can also access contextual information about content items.

To sum up, DySCOs are composite objects centered around a particular topic or entity of interest that encode contextual and inferred information about collections of content items that are detected to be related to the given topic of interest.

Structure

Before proceeding with the description of the DySCO, we first provide a definition of the elementary data classes, in particular the classes Item, StreamUser, WebPage and MediaItem, which correspond to the following entities:

Item: A post made in a social platform. The most common Item type is a tweet. Retweets are also stored as distinct Items. Other Items may correspond to a Facebook post, a Flickr image post, etc.
StreamUser: The social network account that created an Item.
WebPage: A URL embedded in the Item.
MediaItem: A content item (image/video) embedded or linked by an Item.
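
As a rough illustration of how these four classes relate, the following Python sketch models them as plain data classes. It is not the actual SocialSensor data model (which is documented in the JavaDoc); all field names, including retweet_of, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StreamUser:
    """The social network account that created an Item."""
    user_id: str
    network: str  # e.g. "Twitter", "Facebook", "Flickr"


@dataclass
class WebPage:
    """A URL embedded in an Item."""
    url: str


@dataclass
class MediaItem:
    """An image or video embedded in, or linked by, an Item."""
    url: str
    media_type: str  # "image" or "video"


@dataclass
class Item:
    """A post made on a social platform (tweet, Facebook post, ...)."""
    item_id: str
    text: str
    author: StreamUser
    web_pages: List[WebPage] = field(default_factory=list)
    media_items: List[MediaItem] = field(default_factory=list)
    # Hypothetical field: retweets are stored as distinct Items, so a
    # retweet keeps a pointer to the id of the original post.
    retweet_of: Optional[str] = None


user = StreamUser(user_id="u1", network="Twitter")
tweet = Item("t1", "Debate tonight http://example.org/a", user,
             web_pages=[WebPage("http://example.org/a")])
retweet = Item("t2", "RT Debate tonight http://example.org/a", user,
               retweet_of="t1")
```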

The following diagram illustrates the relations between these classes and other auxiliary ones, which will not be further detailed here for the sake of brevity.

[Diagram: basic domain elements]

DySCOs can be considered as sets of Items related to some topic/entity of interest. Note that a DySCO is explicitly associated with both the whole set of Items that are related to the topic of interest and the set of unique ones (e.g. with retweets filtered out). To make DySCOs more informative, several pieces of inferred information are attached to them. More specifically, DySCOs are associated with a set of Ngrams, i.e. sets of keywords that are representative of the underlying DySCO content. Furthermore, DySCOs are associated with Entities (persons, organizations, locations), extracted from the text of individual Items and aggregated over the corresponding DySCO. Finally, DySCOs are associated, via the corresponding Items, with Sentiment and Alethiometer scores. A more detailed view of the internal DySCO structure is available in the form of JavaDoc. Note that the structure of a DySCO is still evolving to address new requirements or issues emerging during the project research.
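
The distinction between the full Item set and the unique Items, and the aggregation of Entities over a DySCO, can be sketched as follows. This is a minimal illustration rather than the project's implementation: the class and field names are hypothetical, and counting Entities over the unique Items only (so that retweets do not inflate the counts) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Item:
    item_id: str
    text: str
    retweet_of: Optional[str] = None  # hypothetical: id of the original post
    # Entities extracted from the text, as (name, type) pairs.
    entities: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Dysco:
    dysco_id: str
    items: List[Item] = field(default_factory=list)  # the whole Item set

    def unique_items(self) -> List[Item]:
        """Return the Items with retweets filtered out."""
        return [it for it in self.items if it.retweet_of is None]

    def entity_counts(self) -> Counter:
        """Aggregate per-Item Entities over the DySCO.

        Assumption: only unique Items are counted.
        """
        counts: Counter = Counter()
        for it in self.unique_items():
            counts.update(it.entities)
        return counts
```

Calling entity_counts().most_common() on such an object would list the Entities that dominate the DySCO.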

[Diagram: DySCO in context]

An additional noteworthy attribute of DySCOs is that they are organized in groups. The need to organize DySCOs in groups stems from the way in which they are created (cf. trending DySCOs in the lifecycle section). Trending DySCOs are automatically generated on a timeslot basis. The length of a timeslot is configurable, and meaningful values for it range from 2 to 15 minutes. Note that there is a tradeoff between a short and a long timeslot: when the timeslot is short, the system can better approach real-time topic detection, but at the cost of accuracy, since fewer Items are collected in a short time window (which makes the topic detection algorithms more likely to produce erroneous results); the reverse holds for long timeslots. Given that the same topic may be observed across multiple timeslots, it is necessary to group together those DySCO objects from different timeslots that refer to the same topic. This is achieved by attaching a groupId field to each DySCO, and by employing a DySCO matching process that periodically scans the DySCOs of recent timeslots to identify sets of DySCOs that refer to the same topic.
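
The following sketch illustrates timeslot bucketing and one possible way to propagate a group id across timeslots. The actual matching process is not described here, so the Jaccard similarity over keyword sets, the 0.5 threshold and all names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Dysco:
    dysco_id: str
    timeslot: int
    keywords: Set[str]
    group_id: Optional[str] = None


def timeslot_of(timestamp_s: float, slot_minutes: int = 5) -> int:
    """Map a Unix timestamp (seconds) to the index of its timeslot."""
    return int(timestamp_s // (slot_minutes * 60))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two keyword sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_groups(new_dyscos: List[Dysco], recent_dyscos: List[Dysco],
                  threshold: float = 0.5) -> None:
    """Give each new DySCO the group id of its best recent match.

    Assumption: recent DySCOs already carry a group id. A new DySCO with
    no sufficiently similar predecessor starts its own group.
    """
    for d in new_dyscos:
        best = max(recent_dyscos,
                   key=lambda r: jaccard(d.keywords, r.keywords),
                   default=None)
        if best is not None and jaccard(d.keywords, best.keywords) >= threshold:
            d.group_id = best.group_id
        else:
            d.group_id = d.dysco_id
```

A keyword-overlap measure like this one inherits the impurity and fragmentation issues discussed under the research questions, so it should be read as a baseline only.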

Lifecycle

DySCOs are typically mined from streams of social media content. In SocialSensor, we consider two broad categories of DySCOs:

Trending: These are topics that become very popular and are therefore automatically discovered by the topic discovery algorithms of SocialSensor. End users of the system are interested in being notified about or in exploring such topics, be they breaking news stories, trending social media discussions, or social events of interest. To help them discover the most interesting topics, SocialSensor associates these DySCOs with a "trending" score, so that end users see only the most trending ones.
User-defined: There are cases where the end users are interested in the content and social discussions around a pre-specified topic of interest. In such cases, they need to specify the topic (with help from the UI), so that the system is able to match incoming content with it or to quickly fetch relevant content from different social media sources.
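
The two categories can be illustrated with a small sketch: trending DySCOs are ranked by their score, while a user-defined topic is matched against incoming Items. How the trending score is computed and how a topic is specified in the UI are not covered here; the keyword-set topic and the "all keywords must appear" rule are assumptions, and all names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Dysco:
    title: str
    trending_score: float


def top_trending(dyscos: List[Dysco], k: int = 3) -> List[Dysco]:
    """Return the k DySCOs with the highest trending score."""
    return sorted(dyscos, key=lambda d: d.trending_score, reverse=True)[:k]


def matches_topic(item_text: str, topic_keywords: Set[str]) -> bool:
    """Check whether an incoming Item matches a user-defined topic.

    Assumption: the topic is a set of lowercase keywords, and an Item
    matches only if every keyword occurs in its text.
    """
    tokens = set(re.findall(r"\w+", item_text.lower()))
    return topic_keywords <= tokens
```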

Therefore, the first step carried out by SocialSensor is the grouping of Items into DySCOs around specific topics and entities of interest. This step is implemented differently depending on the category of the DySCO (trending or user-defined). In a second step, the system extracts additional information from the DySCO Items, namely keywords, Entities, and Sentiment and Alethiometer scores. Afterwards, new DySCOs are matched with DySCOs from previous timeslots to form groups that refer to the same topic of interest. Once this information has been extracted, the system indexes the DySCOs, Items, MediaItems and auxiliary data classes (in MongoDB, Solr and the visual indexing framework of SocialSensor). The indexed entries can then be retrieved in a variety of ways, e.g. by id, by keyword query or by timeframe, to serve the requirements of the respective application (professional news, casual news, infotainment).
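
To make the retrieval modes concrete, the sketch below uses a toy in-memory index as a stand-in for the real storage layer. It does not use or imitate the MongoDB or Solr APIs, and every class, method and field name is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Dysco:
    dysco_id: str
    created_at: int      # Unix timestamp in seconds
    keywords: Set[str]   # lowercase keywords extracted from the Items


class InMemoryDyscoIndex:
    """Toy index supporting lookup by id, keyword and timeframe."""

    def __init__(self) -> None:
        self._by_id: Dict[str, Dysco] = {}

    def index(self, dysco: Dysco) -> None:
        self._by_id[dysco.dysco_id] = dysco

    def by_id(self, dysco_id: str) -> Optional[Dysco]:
        return self._by_id.get(dysco_id)

    def by_keyword(self, keyword: str) -> List[Dysco]:
        kw = keyword.lower()
        return [d for d in self._by_id.values() if kw in d.keywords]

    def by_timeframe(self, start: int, end: int) -> List[Dysco]:
        """Return DySCOs created in the half-open interval [start, end)."""
        return [d for d in self._by_id.values()
                if start <= d.created_at < end]
```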

Research Questions

Q1. What is the optimal level of granularity at which DySCOs should be defined?

DySCOs correspond to real-world news stories, entities and social events. However, it is not straightforward to establish the desired level of granularity for them. For instance, a news editor preparing a retrospective summary article on the US Elections may be interested in DySCOs that correspond to the highlights of Obama's presidential campaign, while a journalist working on a more focused article would be interested in DySCOs around specific events (e.g. a debate or a particular speech). Although granularity cannot easily be expressed in terms of discrete values, for the sake of simplicity we identify three levels of granularity for DySCOs:

level 1 (story): short-lived (e.g. few hours, one day), very focused, trending now - gone later;
level 2 (mini-topic): evolving for a few days, focused, may relate to many stories;
level 3 (super-topic): long-lived, broad, associated with many mini-topics or stories.

Coping with the need for multiple levels of granularity for DySCOs is a challenging problem in many ways, since it affects the topic detection algorithms employed by the system, the DySCO representation and indexing, the efficiency and scalability of the system, and the interaction with the end users. At the moment, the system supports a two-level granularity scheme, mostly positioned around level 1 and 2 DySCOs.

Q2. How is it possible to automatically create DySCOs of high quality?

This is a complicated task that is mainly dealt with in the context of the topic detection research of the project. Although several topic detection algorithms have been tested in a variety of datasets, there is always room for improvement. In particular, the following problems have been identified:

Impurity: The automatic grouping of Items into DySCOs inadvertently introduces irrelevant content in the latter. This is due to the fact that the grouping is based on superficial text features. Therefore, a DySCO might consist of Items that are similar in terms of text, but completely unrelated to each other. Imposing stricter constraints on this process (e.g. grouping Items in the same DySCO only if they are very similar to each other) exacerbates the Fragmentation problem (discussed below).
Fragmentation: Since people discuss the same topic in different ways, it is extremely challenging to group together Items that are completely different in terms of text. Some cues that are helpful in that respect are shared URLs and Media Items (i.e. if two Items link to the same URL or image, they are highly likely to pertain to the same topic irrespective of their text).
Noise: A large percentage of online discussions and content is unimportant or misleading/malicious. For instance, many online accounts generate posts around very personal activities (e.g. what am I doing right now) and some accounts spread malicious links by using unrelated text to describe them (e.g. hijacking popular hashtags). There are numerous other cases of unimportant and misleading content. Identifying and filtering out such content presents yet another challenge to the DySCO generation process.
Sampling: A further complication in the DySCO generation process stems from the fact that only a small subset of Items (out of the full set generated at any given moment) are available as input to the system. To mitigate this limitation, the system monitors the Items generated by a selected number of online accounts that are considered more reliable and newsworthy. However, this still means that the DySCO generation algorithms do not have access to the "complete picture" of social network activity at any given moment.
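
The shared-link cue mentioned under Fragmentation can be sketched with a union-find structure: Items that link to the same URL or media item end up in the same group, regardless of their text. This illustrates the cue only and is not the topic detection algorithm used in the project; the function name and the input format (a mapping from Item id to its set of links) are assumptions.

```python
from typing import Dict, List, Set


def group_by_shared_links(item_links: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group Item ids that are connected through at least one shared link."""
    parent = {item_id: item_id for item_id in item_links}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first Item seen for each link, and merge later Items
    # that carry the same link into that Item's group.
    first_seen: Dict[str, str] = {}
    for item_id, links in item_links.items():
        for link in links:
            if link in first_seen:
                union(first_seen[link], item_id)
            else:
                first_seen[link] = item_id

    groups: Dict[str, Set[str]] = {}
    for item_id in item_links:
        groups.setdefault(find(item_id), set()).add(item_id)
    return list(groups.values())
```

Note that grouping is transitive: two Items with no link in common still land in the same group if a third Item shares a link with each of them, which can itself introduce impurity.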

Q3. How is it possible to automatically find a meaningful name for a DySCO?

When presenting DySCOs to users (typically in the form of lists), it is important to represent each of them with a succinct yet informative title. We explored different techniques to achieve this, e.g. computing the most frequent sequence of terms across the set of Items associated with a DySCO, using the title of a linked article, or using the most (statistically) prominent set of keywords from the associated Items. Each of these techniques works well in a number of cases yet fails in others, which complicates the DySCO naming process.
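
The first of these techniques, picking the most frequent sequence of terms, can be sketched as below. The fixed n-gram length, the simple regular-expression tokenizer and the choice to count each n-gram at most once per Item are assumptions of this example, not details of the techniques explored in the project.

```python
import re
from collections import Counter
from typing import List, Optional


def most_frequent_ngram(texts: List[str], n: int = 3) -> Optional[str]:
    """Return the word n-gram that occurs in the most Item texts.

    Returns None if no text contains at least n tokens. Ties are broken
    arbitrarily.
    """
    counts: Counter = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        # Use a set so that an n-gram repeated inside one Item counts once.
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(ngrams)
    if not counts:
        return None
    best, _ = counts.most_common(1)[0]
    return " ".join(best)
```

On short, noisy posts the winning n-gram is often a fragment rather than a readable headline, which is one way this technique can fail.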

Related Resources

A first description of the DySCO concept is available as part of D1.1 (p. 47-51). DySCOs are further discussed in the context of Social Search in D4.2 (p. 11-15). Some contrived use case-specific examples of DySCOs are provided in the respective use case deliverables D7.1 (p. 29-30) and D8.1 (p. 73-80).

Technical documentation about DySCOs is also available in the form of JavaDoc. More specifically, the socialsensor-framework-common project documents the data structure of DySCO and associated classes (Item, MediaItem, StreamUser). Furthermore, the socialsensor-framework-client project offers documentation on methods of storing and retrieving DySCOs and associated data.

Soon, we will publish both of the aforementioned projects on GitHub.