TWEETS ON A TREE: INDEX-BASED CLUSTERING OF TWEETS
Mert Kemal Erpam
Computer Science and Engineering, MSc. Thesis, 2017
Prof. Yücel Saygın (Thesis Advisor), Asst. Prof. Kamer KAYA, Assoc. Prof. Şule Gündüz Öğüdücü
Date & Time: January 5th, 2017 – 2:40 PM
Place: FENS G025
Keywords: clustering, twitter, summarization, suffix tree, semantic relatedness
Computer-mediated communication (CMC) is communication that takes place through two or more electronic devices. As technology has advanced, CMC has become an increasingly preferred way for people to communicate. Computer-mediated technologies have given rise to news portals, search engines and social media platforms such as Facebook, Twitter and Reddit. On social media platforms, users can post and discuss their own opinions and read and share the opinions of others. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights into public opinion and culture.
Twitter is a social networking service that was founded in 2006 and became widespread throughout the world within a very short time. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity and variety of Twitter data, it cannot be analyzed with conventional methods. A clustering or sampling method is necessary to reduce the amount of data to be analyzed.
Broadly speaking, two kinds of similarity measure can be used to cluster documents: lexical similarity and semantic similarity. Lexical similarity compares the surface form of documents, that is, the strings and words they contain. It is usually cheap to compute, but it may not be very accurate for clustering because it disregards the meaning of words. Semantic similarity, on the other hand, takes the meaning of words and the relations between them into account; it is generally more accurate than lexical similarity but computationally expensive.
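The contrast between the two kinds of similarity can be illustrated with a minimal sketch. It is not the measure used in the thesis: Jaccard overlap of word sets stands in for lexical similarity, cosine similarity of averaged word vectors stands in for semantic similarity, and the two-dimensional vectors in `TOY_VECTORS` are hypothetical values made up for the example.

```python
import math

# Hypothetical toy word vectors, invented for illustration only.
# A real system would load vectors trained on a large corpus.
TOY_VECTORS = {
    "car": (0.9, 0.1),
    "automobile": (0.85, 0.15),
    "banana": (0.1, 0.9),
}


def jaccard(a, b):
    """Lexical similarity: overlap between the word sets of two texts."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def text_vector(text):
    """Average the toy vectors of the known words in a text."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def semantic(a, b):
    """Semantic similarity: cosine between the averaged word vectors."""
    return cosine(text_vector(a), text_vector(b))


# "car" and "automobile" share no words, so their lexical similarity is zero,
# while their (toy) vectors point in almost the same direction.
lexical_score = jaccard("car", "automobile")
semantic_score = semantic("car", "automobile")
unrelated_score = semantic("car", "banana")
```

Under these assumed vectors, the lexical measure treats the two synonyms as unrelated, while the semantic measure rates them far closer to each other than to an unrelated word.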
In our work, we aim to cluster short documents that have the characteristics of big data in a way that is both computationally light and accurate. We propose a hybrid clustering approach that combines lexical and semantic similarity. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
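The two-stage idea can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: `difflib.SequenceMatcher` is swapped in for the index-based string similarity, the word vectors in `TOY_VECTORS` are hypothetical, and a fixed threshold stands in for the interactive merge decision. The function names and threshold values are likewise assumptions.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy word vectors, invented for illustration only.
TOY_VECTORS = {
    "car": (0.9, 0.1),
    "automobile": (0.85, 0.15),
    "banana": (0.1, 0.9),
}


def lexical_clusters(tweets, threshold=0.6):
    """Stage 1: greedy single-pass clustering by string similarity.

    Each tweet joins the first cluster whose first member is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []
    for tweet in tweets:
        for cluster in clusters:
            ratio = SequenceMatcher(None, tweet.lower(), cluster[0].lower()).ratio()
            if ratio >= threshold:
                cluster.append(tweet)
                break
        else:
            clusters.append([tweet])
    return clusters


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(cluster):
    """Average toy vector over all known words in a cluster."""
    vectors = [TOY_VECTORS[w] for t in cluster for w in t.lower().split()
               if w in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def merge_semantic(clusters, threshold=0.9):
    """Stage 2: merge clusters whose centroids are semantically close.

    The threshold is an assumption standing in for a user's decision.
    """
    merged = []
    for cluster in clusters:
        for target in merged:
            if cosine(centroid(target), centroid(cluster)) >= threshold:
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))
    return merged


tweets = [
    "my car broke down",
    "my car broke down again",
    "new automobile prices",
    "banana bread recipe",
]
lexical = lexical_clusters(tweets)
merged = merge_semantic(lexical)
```

In this sketch the cheap string comparison does most of the grouping, and the more expensive vector comparison runs only once per cluster rather than once per tweet, which is the motivation for combining the two measures.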