Twitter topic detection datasets
We have collected tweets for three different events:
In addition to the tweets, we manually selected a list of important topics that appeared in mainstream news sources, and created corresponding ground truth datasets that specify, for a given time interval, the important topics (each represented by a set of keywords). We also developed an evaluation script that, given the output of a topic detection algorithm, computes a list of performance measures.
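To illustrate how detected topics might be compared against keyword-set ground truth for a single time interval, here is a minimal Python sketch. It is not the evaluation script distributed with the datasets. The function names, the matching rule (a ground truth topic counts as found only when all of its keywords appear in one detected topic), and the sample keywords are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def normalize(keywords: Iterable[str]) -> Set[str]:
    """Lowercase and strip keywords so comparisons ignore case and padding."""
    return {k.strip().lower() for k in keywords if k.strip()}


def topic_recall(detected: List[Set[str]], truth: List[Set[str]]) -> float:
    """Share of ground truth topics fully covered by some detected topic.

    Assumption: a ground truth topic counts as found when every one of its
    keywords appears in a single detected topic.
    """
    if not truth:
        return 0.0
    found = sum(1 for t in truth if any(t <= d for d in detected))
    return found / len(truth)


def keyword_scores(detected: List[Set[str]], truth: List[Set[str]]) -> Tuple[float, float]:
    """Keyword precision and recall computed over the pooled keyword sets."""
    detected_kw = set().union(*detected)
    truth_kw = set().union(*truth)
    overlap = detected_kw & truth_kw
    precision = len(overlap) / len(detected_kw) if detected_kw else 0.0
    recall = len(overlap) / len(truth_kw) if truth_kw else 0.0
    return precision, recall


# Hypothetical data for one timeslot (not taken from the datasets).
truth_topics = [normalize(["Goal", "Chelsea"]), normalize(["red", "card"])]
detected_topics = [
    normalize(["chelsea", "goal", "drogba"]),
    normalize(["weather", "rain"]),
]

recall_topics = topic_recall(detected_topics, truth_topics)
kw_precision, kw_recall = keyword_scores(detected_topics, truth_topics)
print(recall_topics, kw_precision, kw_recall)
```

With the sample data, one of the two ground truth topics is fully covered, and two of the five detected keywords appear in the ground truth. A real evaluation would repeat this per timeslot and aggregate the results; the exact measures and matching rules used in [1] may differ.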
In accordance with the TREC practice of sharing Twitter datasets, the above distributions contain only the tweet IDs, since we are not allowed to publicly distribute the original tweets. For convenience, the tweet IDs and the ground truth topics are organized in timeslots, as described in [1].
These datasets were used to conduct the experimental study in [1]. If you use them in your research, please cite [1]. Most of the topic detection methods presented in [1] have also been made publicly available.
For any questions or requests regarding the datasets, please contact any of the following co-authors of the paper: Luca, Giorgos, Carlos, David or Akis.
[1] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia (pre-print), 2013.