SocialSensor at MediaEval 2014


SocialSensor made a successful and significant contribution to MediaEval 2014 (Barcelona, Spain), participating in three tasks (Social Event Detection, Diverse Social Image Retrieval, and Multimodal Location Estimation) and organizing the Social Event Detection task. Here, we present some of the highlights.

Social Event Detection

The 2014 edition of the Social Event Detection (SED) task was organized by CERTH and was partly supported by SocialSensor. This year, the SED task comprised two subtasks. The first, the detection subtask, asked participants to cluster a large collection of images so that each cluster corresponds to a distinct social event. The second, the retrieval subtask, asked participants to determine the social events that match specific criteria (type of event, location, time, involved entities). Participants could submit to either or both subtasks. Six teams submitted to the first subtask and two teams submitted to the second; five of the teams eventually attended the workshop.

Discussions between the attending teams at the workshop were fruitful and led to useful conclusions about the performance of different approaches and possible directions for future research in the field. In particular, it was recognized that two types of approaches were applied to the first subtask: the first applies a sequence of clustering steps according to specific modalities, whereas the second is based on a learned multimodal similarity measure. Two approaches of the first type ranked in the top two positions, whereas an approach of the second type followed closely in third place. This indicates that good insight into the nature of the task can lead to ad hoc clustering procedures that perform very well, but also that more general multimodal clustering procedures may perform equally well. Regarding future challenges, the participants expressed the opinion that the next step for the clustering subtask is to also include multimedia items that do not belong to any social event, and that the retrieval subtask, although very challenging, is a very interesting direction for future research. The task is described in more detail in the SED organization working notes paper.
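As an illustration of the first type of approach, the following is a minimal Python sketch of a per-modality clustering cascade: photos are first grouped by capture time, and each temporal group is then split by geographic proximity. This is a toy sketch under assumed parameters (the `time_gap` and `max_km` thresholds and the photo dictionary layout are illustrative assumptions), not any team's actual method; real SED systems add further passes over text, uploader, and visual similarity.

```python
import math
from datetime import timedelta

def haversine_km(p, q):
    """Great-circle distance in km between two photos' (lat, lon)."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (p["lat"], p["lon"], q["lat"], q["lon"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def sequential_event_clustering(photos, time_gap=timedelta(hours=8), max_km=1.0):
    """Toy cascade of per-modality clustering passes: pass 1 groups
    photos by capture time, pass 2 splits each temporal group by
    geographic proximity (greedy, order-dependent chaining)."""
    if not photos:
        return []
    photos = sorted(photos, key=lambda p: p["time"])
    # Pass 1: start a new group whenever the time gap is too large.
    temporal = [[photos[0]]]
    for prev, cur in zip(photos, photos[1:]):
        if cur["time"] - prev["time"] > time_gap:
            temporal.append([])
        temporal[-1].append(cur)
    # Pass 2: within each temporal group, attach each photo to the
    # first cluster containing a photo within max_km, else start one.
    events = []
    for group in temporal:
        clusters = []
        for p in group:
            target = next((c for c in clusters
                           if any(haversine_km(p, q) <= max_km for q in c)), None)
            if target is None:
                clusters.append([p])
            else:
                target.append(p)
        events.extend(clusters)
    return events
```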

SocialSensor, represented by CERTH, submitted runs to both subtasks of the SED task. The approach pursued in the first subtask, the full clustering subtask, is based on what we term the same-event model: a model that takes as input the set of per-modality similarities between a pair of images and predicts whether the images belong to the same event, and which is then used to create a same-event graph. We then apply a community detection procedure on the graph to obtain the clusters of images that represent the events. This is the same approach we applied last year; this year, however, we introduced a key tweak that significantly improved performance. In particular, we increased the classification threshold of the same-event model so that the true positive rate improved significantly (0.9999) at the cost of a somewhat lower true negative rate (0.95). This resulted in a much cleaner graph and ultimately in much better clustering performance (F1 without the tweak was 0.4514, whereas with the tweak all of our runs scored at least 0.8312, and our best run achieved an F1 score of 0.9161). Our participation ranked 3rd in the first subtask. For the second subtask, we learned language models for each of the retrieval criteria and used them to classify each event (or each image, in some of our runs) against each criterion; we then returned the events that match the retrieval criteria. The average F1 (over the 10 test queries) achieved by our best run was 0.4604, the best performance in the second subtask. More details on the adopted approach and the obtained results are available in our SED participation working notes paper.
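For concreteness, here is a minimal sketch of the same-event-graph pipeline. The pairwise classifier `same_event_prob` is a hypothetical stand-in for the trained same-event model, and NetworkX's greedy modularity communities substitute for the community detection step actually used; both, along with the threshold value, are illustrative assumptions.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def cluster_by_same_event_graph(photos, same_event_prob, threshold=0.99):
    """Score every photo pair with the learned model and keep only
    edges above a strict threshold (the 'tweak': a stricter threshold
    yields a much cleaner graph), then read the event clusters off as
    graph communities. Scoring all pairs is quadratic; real systems
    restrict comparisons to candidate neighbours."""
    g = nx.Graph()
    g.add_nodes_from(range(len(photos)))
    for i, j in itertools.combinations(range(len(photos)), 2):
        # same_event_prob maps the per-modality similarities of a
        # photo pair to a same-event probability.
        if same_event_prob(photos[i], photos[j]) >= threshold:
            g.add_edge(i, j)
    if g.number_of_edges() == 0:
        return [[i] for i in range(len(photos))]  # all singletons
    # Generic community detection as a stand-in for the actual step.
    return [sorted(c) for c in greedy_modularity_communities(g)]
```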

Diverse Social Image Retrieval

The Diverse Social Image Retrieval task dealt with the problem of result diversification in social photo retrieval. In particular, it considered a tourist use case in which a person tries to obtain a complete visual description of the place he/she is visiting. Currently, social media platforms focus on the relevance of search results; however, diversity is equally important for obtaining a complete visual description of a place. The task therefore aims to foster new technologies that could be implemented as a top layer in the retrieval chain of current social media platforms (e.g., Panoramio, Google Images, Flickr, Webshots, Picasa). To simulate this problem, the task asked participants to refine a ranked list of location photos retrieved from Flickr by providing a subset of the images that are at the same time relevant and constitute a diversified summary (e.g., different views of the location, different times of day/year, different weather conditions, creative views, etc.). The refinement and diversification process could rely on the social metadata associated with the images and/or on their visual characteristics.

The method developed by SocialSensor casts the task as an optimization problem whose objective function jointly accounts for relevance and diversity. For the relevance part, the method uses a machine learning algorithm that automatically determines the relevance of an image based on textual and/or visual cues, while diversity is achieved by excluding from the refined set images with high similarity to the ones already selected. Using different instantiations of this method, SocialSensor created three types of runs, each relying on a different type of features (visual-only, text-only, visual+text). The results achieved by the method (called ReDiv) were impressive. Using the state-of-the-art VLAD+CSURF visual features, the run submitted by SocialSensor in the visual-only category ranked 1st among 13 contestants. The text-only run achieved 3rd position, while the visual+text run was also 1st by a large margin. In terms of overall performance, SocialSensor submitted the 2nd best run, beaten slightly by a run relying on external sources. The approach and the obtained results are detailed in our Diverse Social Image Retrieval working notes paper.
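The exact ReDiv objective is given in the working notes paper; as a generic illustration of the greedy relevance/diversity trade-off described above, here is an MMR-style sketch. The trade-off weight `lam`, the budget `k`, and the data layout are all illustrative assumptions, not the actual ReDiv formulation.

```python
def dot(u, v):
    """Cosine similarity, assuming u and v are L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def greedy_diversify(candidates, relevance, features, k=20, lam=0.7):
    """MMR-style greedy selection: at each step, pick the image that
    maximizes its relevance score minus a penalty for similarity to
    the images already selected.
    - candidates: image ids from the initial ranked list
    - relevance: image id -> relevance score from the learned model
    - features: image id -> L2-normalized feature vector"""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(img):
            # Similarity to the closest already-selected image.
            max_sim = max((dot(features[img], features[s]) for s in selected),
                          default=0.0)
            return lam * relevance[img] - (1 - lam) * max_sim
        best = max(pool, key=mmr)
        pool.remove(best)
        selected.append(best)
    return selected
```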

Placing: Multimodal Location Estimation

The Multimodal Location Estimation (in short, Placing) Task captures the challenge of estimating the geographical location of multimedia items, such as images and videos. The location is estimated based on a massive amount of geo-tagged training data, and the metadata of the multimedia items for the entire dataset are provided by the organizers. Participants may submit up to five runs, each of which has to contain the estimated geographical coordinates for every query item of the test set. Runs are evaluated by the error between the estimated locations and the ground truth provided by Flickr/Yahoo; the evaluation measure is the accuracy of the predicted locations within a series of widening circles, with radii of 10m, 100m, 1km, 10km, 100km, 1000km and 5000km.
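A minimal sketch of this evaluation scheme, assuming the error is measured as the great-circle (haversine) distance between the predicted and ground-truth coordinates:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    (lat1, lon1), (lat2, lon2) = (tuple(map(math.radians, p)) for p in (a, b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# The task's widening circles, expressed in km.
RANGES_KM = [0.01, 0.1, 1, 10, 100, 1000, 5000]

def accuracy_at_ranges(predicted, ground_truth, ranges=RANGES_KM):
    """Fraction of test items whose estimated location falls within
    each radius of the ground-truth coordinates."""
    errors = [haversine_km(p, g) for p, g in zip(predicted, ground_truth)]
    return {r: sum(e <= r for e in errors) / len(errors) for r in ranges}
```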

In our participation, we submitted a total of five runs, three of which were tag-based and the remaining two visual-based. For the tag-based runs, we built upon the scheme of Popescu (used in last year's participation of CEA-LIST), taking its language model as a basis and extending it with Similarity Search, introduced by Van Laere et al. (2013), an Internal Grid technique, and a Spatial Entropy model that we developed. For the visual-based runs, we extracted SURF+VLAD and CS-LBP+VLAD features, concatenated them into a single vector, and trained a linear SVM over a predefined number of spatial clusters and subclusters. Our best performance, in terms of both median error and accuracy at all ranges, was attained by the run that combined the baseline approach with all of our proposed extensions; this was the second best run in the contest for high-precision estimations (100m and 1km). More details on our approach and the obtained results are available in our Placing Task working notes paper.
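As an illustration of the language-model basis underlying the tag-based runs (and not of the Similarity Search, Internal Grid, or Spatial Entropy extensions), here is a toy grid-based tag model; the cell size, the smoothing constant, and the data layout are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train_tag_model(train_items, cell_deg=0.01):
    """Discretize the world into lat/lon grid cells and count how
    often each tag occurs in each cell."""
    cell_tags = defaultdict(Counter)
    for item in train_items:
        cell = (round(item["lat"] / cell_deg), round(item["lon"] / cell_deg))
        cell_tags[cell].update(item["tags"])
    return cell_tags, cell_deg

def estimate_location(tags, cell_tags, cell_deg, alpha=1.0):
    """Pick the cell with the highest add-alpha-smoothed
    log-likelihood of the query tags; return its center."""
    vocab = {t for counts in cell_tags.values() for t in counts}
    best_cell, best_ll = None, float("-inf")
    for cell, counts in cell_tags.items():
        total = sum(counts.values()) + alpha * len(vocab)
        ll = sum(math.log((counts[t] + alpha) / total) for t in tags)
        if ll > best_ll:
            best_cell, best_ll = cell, ll
    return best_cell[0] * cell_deg, best_cell[1] * cell_deg
```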