PROMOTER ACCOUNT DETECTION IN TWITTER

Twitter is an online social network and micro-blogging service that has become an alternative medium for sharing and obtaining information. In politics, Twitter provides various features that can be used to promote a campaign and build a favorable image for a political party or candidate. To win a good opinion from other users, a candidate can manipulate the appearance of success through massive promotion. Such promotion can lead to a public opinion that is not consistent with the facts, so we need to determine whether an account is a promoter account or not. In this paper, we propose a new framework for promoter account detection. The framework relies on tweet content and detects promoter accounts according to their existence in promotion topics. It employs the k-means approach to cluster promotion topics based on tweet content, and from each cluster we evaluate the existence of promoter accounts. Although the approach is very simple, the experimental results show that the framework is effective for promoter account detection.


INTRODUCTION
Twitter is currently the most popular and fastest-growing micro-blogging service. Although it is a relatively new communication medium compared to traditional media, micro-blogging has gained increasing attention among users, organizations and research scholars in different disciplines. Twitter's popularity as a communication service stems from features such as portability, immediacy and ease of use, which allow users to respond instantly and spread information with few or no restrictions on content. Users can post tweets, retweet (RT) other users' tweets, write responses (replies), mention other users, favorite tweets and follow other users [1].
These features make marketing activities, promotions and massive campaigns very easy to carry out on Twitter. The Twitter social network has become a target platform for promoters to disseminate their messages, and a large number of campaigns carry messages intended to increase popularity and ratings. Canover et al. [2] conducted research during the 2010 U.S. elections to predict the political leanings of Twitter users based on the content and structure of their political communication. They stated that the success of a candidate can be manipulated using Twitter. To obtain a positive opinion, a candidate must not only undertake an intensive campaign but also have promoters post positive comments about the candidate. This promotion is conducted by a campaign team or by promoters in order to build a positive image of the candidate. Twitter campaigns using this model have been shown to be effective in helping a candidate gain significant support from voters [3]. Zhang et al. [4] also noted that promotion or campaigning can have a negative impact, so they built a framework to detect spam and promotional campaigns based on URL-driven estimation, which calculates the similarity of account goals from the URLs that the accounts post. Zhang used URL information to obtain features such as the number of domains and the number of URLs posted by each account. Besides URL information, Zhang also used a timestamp feature to determine the average posting rate. That framework requires much effort in feature extraction.
In this paper, we propose a new framework for automatic promoter account detection. Our approach is based only on tweet content to estimate the goal of promotion. The framework employs the k-means approach to cluster promotion topics based on tweet content. From each cluster, we evaluate the existence of promoter accounts.
By detecting promoter accounts, users can distinguish whether an account is a promoter account or not, and so obtain a more objective view. In addition, a candidate can identify the accounts that are potential supporters.

TWITTER
In this section, we briefly describe Twitter and its users.

Service overview
Twitter is currently the most popular and fastest-growing micro-blogging service, with more than 140 million users producing over 400 million tweets per day (mostly from mobile devices) as of June 2012. Twitter enables users to post status updates, or tweets, of no more than 140 characters to a network of followers using various communication services (e.g. cell phones, e-mail, web interfaces or other third-party applications). While some users consider the 140-character constraint a severe limitation, many argue that it is the feature that sets Twitter apart: short information is easier to consume and faster to spread. Twitter contains concise information that is easy to digest. Users can access Twitter via the website, SMS or mobile apps [9]. Twitter allows users to share any kind of information, ideas, opinions, motivations, etc. Its popularity is also reflected in the increasingly large number of research papers published about Twitter in various fields.
Twitter is easy to access, so it has attracted a very large user base: 115 million active users every month and an average of 58 million tweets per day [11]. Twitter can therefore be used as an important source of real-time information [7]. Each user is free to publish anything, anytime and anywhere, especially to the users connected to them (followers and following). This gives users the opportunity to perform various activities such as promotion (marketing) and campaigns, as well as the spread of spam.
On Twitter, a user is referenced by placing the "@" symbol before the user name. A reply is a special kind of mention in which one user responds to another user's message, starting the tweet with the @username of the user being replied to. Mentions are displayed to the referenced users so that they can keep track of messages mentioning their names. Twitter also allows users to forward, or retweet, someone else's tweet to their followers. This is commonly done by placing the RT prefix before the name of the user who originated the message, "RT @username". Retweeting is a common practice on Twitter to share useful or interesting information while giving credit to the original user [1].

Promoter account
There are many types of accounts. Daniel Gayo [7] classifies Twitter accounts into four types: verified accounts, spammers, aggressive marketers and average users. In this section, we focus on describing promoting activities, which in Daniel Gayo's classification fall under spammers and aggressive marketers.
A promoter account is used to promote an object intensively. It can be identified by its many tweet activities, such as messages, retweets and replies [4]. From our observation, we find two kinds of promoter accounts. The first is the robot account, which automatically posts a large number of tweets and has retweeting as its main activity. The second is the official account, which promotes the candidate manually and performs various activities, such as replying, retweeting and initiating a promotion topic.

FRAMEWORK DESIGN
We design this framework with four main phases: data collection, text processing, tweet clustering and evaluation of promoter existence. The framework design is described in Fig. 1.

Data Collection
We used the Twitter search API (Application Programming Interface), i.e. dev.twitter.com/pages/streaming.api, to create the dataset used in this study. The query was composed of the name of one of the candidates. In this study, we focused on one candidate's name, Dahlan Iskan, with several query variants, i.e. DahlanIskan, iskan_dahlan and Dahlan Iskan. The attributes used from the captured data are the account name and the content of the tweet. Fig. 2 gives an example of captured data, from which we extract the attributes shown in Table 1. We do not distinguish between the types of tweet activity, such as message, RT (retweet), reply and favorite. The tweet corpus used in this framework consists of tweets with a positive or neutral opinion, because a promoter will not give a negative opinion about its object. This phase therefore consists of data collection, data selection and attribute extraction (Fig. 1).
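The selection and attribute-extraction steps can be sketched as follows. This is a minimal illustration, not the collection code used in the study: the record layout (dictionaries with `account` and `text` keys) and the function names are assumptions, and matching is a simple case-insensitive substring test on the three query variants.

```python
# ASSUMPTION: each captured record is a dict with "account" and "text" keys;
# the real captured format (Fig. 2) may differ.
QUERY_VARIANTS = ("dahlaniskan", "iskan_dahlan", "dahlan iskan")


def matches_query(text):
    """Return True if the tweet text contains any query variant."""
    lowered = text.lower()
    return any(variant in lowered for variant in QUERY_VARIANTS)


def extract_attributes(records):
    """Keep matching tweets and reduce them to (account name, content)."""
    return [
        (record["account"], record["text"])
        for record in records
        if matches_query(record["text"])
    ]


records = [
    {"account": "detikcom", "text": "Dahlan Iskan Minta Mahasiswa Jangan Korupsi"},
    {"account": "someone", "text": "Cuaca hari ini cerah"},
    {"account": "fan01", "text": "Dukung #DahlanIskan"},
]
selected = extract_attributes(records)  # two of the three records match
```

Selecting only positive or neutral tweets is a separate step that this sketch does not cover.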

Text Processing
This promoter account detection framework relies on tweet content to estimate the goal of promotion, so text processing is needed. Text processing consists of data cleaning, preprocessing and term weighting (Fig. 1).
Data is cleaned by removing account names (@username in replies, mentions and retweets), hashtags, URLs and stop-words. As an illustration, consider the tweet: RT @detikcom: Dahlan Iskan Minta Mahasiswa Jangan Korupsi Kalau Sudah Bekerja http://t.co/2HMx4C6Pc7 via @detikfinance. After data cleaning, this tweet becomes: Minta Mahasiswa Jangan Korupsi Sudah Bekerja. Each term is then weighted using formula (1),

$$W(d,t) = TF(d,t) \cdot IDF(t) \qquad (1)$$

where TF(d,t) is the frequency of term t in tweet d and IDF(t) is the inverse document frequency of term t, given by formula (2),

$$IDF(t) = \log \frac{N}{df(t)} \qquad (2)$$

where N is the total number of tweets in the corpus and df(t) is the number of tweets in the corpus that contain term t.
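The cleaning and term-weighting steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the stop-word list is a hypothetical one chosen to reproduce the example tweet, and the weight is assumed to be the standard product of term frequency and log(N / df(t)).

```python
import math
import re
from collections import Counter

# ASSUMPTION: illustrative stop-word list (query terms, retweet markers and
# one Indonesian function word); the list used in the study is not given.
STOP_WORDS = {"rt", "via", "kalau", "dahlan", "iskan"}


def clean_tweet(text):
    """Remove URLs, @usernames, hashtags and stop-words from one tweet."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # @username and #hashtag
    tokens = re.findall(r"\w+", text)          # also drops punctuation
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def tf_idf(documents):
    """Weight each term as TF(d,t) * log(N / df(t)).

    documents: list of token lists, one per tweet.
    Returns one {term: weight} dict per tweet.
    """
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted


tweet = ("RT @detikcom: Dahlan Iskan Minta Mahasiswa Jangan Korupsi "
         "Kalau Sudah Bekerja http://t.co/2HMx4C6Pc7 via @detikfinance")
print(" ".join(clean_tweet(tweet)))
# -> Minta Mahasiswa Jangan Korupsi Sudah Bekerja
```

With this weighting, a term that occurs in every tweet of the corpus receives weight zero, so it does not influence the similarity between tweets.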

Tweet Clustering
This framework employs the k-means approach to cluster promotion topics based on tweet content. We also employ cosine similarity to determine the distance between tweet content and the centroids, as shown in formula (3),

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} \qquad (3)$$

where $d_i$ is the i-th tweet in the corpus and $d_j$ is the j-th tweet in the corpus.
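A minimal sketch of k-means with cosine similarity is given below, assuming each tweet is represented as a sparse {term: weight} dictionary. The function names, the `init` and `seed` parameters and the stopping rule (stop when assignments no longer change) are choices made for this illustration; the paper does not specify its initialization or stopping criterion.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans_cosine(vectors, k, init=None, max_iter=50, seed=0):
    """Assign each vector to the centroid with the highest cosine similarity.

    init: optional list of k indices into `vectors` used as starting
    centroids (ASSUMPTION: random tweets are used when it is omitted).
    Returns a list with one cluster index per vector.
    """
    if init is None:
        init = random.Random(seed).sample(range(len(vectors)), k)
    centroids = [dict(vectors[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda j: cosine(vec, centroids[j]))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_vector(members)
    return labels


vectors = [{"a": 1.0}, {"a": 2.0}, {"b": 1.0}, {"b": 3.0}]
labels = kmeans_cosine(vectors, k=2, init=[0, 2])
# labels == [0, 0, 1, 1]
```

Because cosine similarity ignores vector length, the tweets {"a": 1.0} and {"a": 2.0} are treated as identical in direction and fall in the same cluster.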

Evaluating Promoter Existence
A promoter account is detected by calculating the frequency of occurrence of the account in the clusters. Let $C = \{c_1, \dots, c_k\}$ be the set of clusters, where $k$ is the number of clusters, and let $A = \{a_1, \dots, a_n\}$ be the set of accounts, where $n$ is the final number of accounts. The outcome of clustering is represented by a matrix $Activity = [act_{ij}]$, where each entry $act_{ij}$ is the total activity of account $a_i$ in cluster $c_j$.
Based on the activity matrix, we evaluate the existence of each account. Formula (4) gives the existence of account $a_i$ in cluster $c_j$.

(4)
Equation (5) is used to calculate the promoter weight of account $a_i$, where
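How the activity matrix, account existence and promoter weight fit together can be sketched as follows. The sketch rests on two assumptions that go beyond the text: existence is taken as a binary indicator (1 if the account has any activity in the cluster, 0 otherwise), and the promoter weight is taken as the sum of existence over all clusters, following the summary given in the Conclusion. It is not a restatement of formulas (4) and (5), and the function names are hypothetical.

```python
def activity_matrix(accounts, labels, k):
    """Count the tweets of each account in each cluster.

    accounts[i] is the account that posted tweet i and labels[i] is the
    cluster index of tweet i. Returns {account: [count per cluster]}.
    """
    activity = {}
    for account, cluster in zip(accounts, labels):
        row = activity.setdefault(account, [0] * k)
        row[cluster] += 1
    return activity


def promoter_weights(activity):
    """ASSUMPTION: weight = number of clusters in which the account exists,
    where existence means at least one tweet in that cluster."""
    return {
        account: sum(1 for count in row if count > 0)
        for account, row in activity.items()
    }


def top_accounts(weights, n):
    """Return the n accounts with the highest weight (ties by name)."""
    return sorted(weights, key=lambda a: (-weights[a], a))[:n]


accounts = ["bot", "bot", "bot", "fan", "official", "official"]
labels = [0, 1, 2, 0, 1, 1]
activity = activity_matrix(accounts, labels, k=3)
weights = promoter_weights(activity)
# activity == {"bot": [1, 1, 1], "fan": [1, 0, 0], "official": [0, 2, 0]}
# weights  == {"bot": 3, "fan": 1, "official": 1}
```

Under this interpretation, an account that appears in many clusters receives a higher weight than an account that posts heavily in a single cluster.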

RESULT
To show the applicability and potential of the promoter account detection framework, we applied it to real-world data consisting of a Twitter corpus.
In the following sections, the working process of the framework is described step by step, and the results obtained after each step are shown.

Dataset
A preliminary experimental session was carried out by collecting tweets using the Twitter Stream API. The Twitter corpus was collected over a total of 18 hours within 5 days. From the tweets collected in that period, we kept only positive or neutral tweets as the corpus. This process yielded 1,890 tweets from 460 unique accounts, among which we identified fifty promoter accounts. Finally, we applied text preprocessing to the corpus and assigned an index to each term in the tweet messages.

Experiment
The k-means algorithm was applied to cluster the tweet corpus. In this experiment, we used cosine similarity to determine the distance between tweet content and each centroid. Several runs of the algorithm were performed, each trial with a different initial number of clusters (50, 40, 30, 25, 20, 15, 10).
After clustering the tweet corpus, we evaluated the existence of promoter accounts using the promoter weight shown in equation (5). We then retrieved the thirty accounts with the highest promoter weight. Fig. 3 shows the precision value for each trial; the method gives its best result with 10 clusters.
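The precision reported for each trial can be computed as the fraction of retrieved accounts that are known promoters. The sketch below assumes a ranked list of accounts and a set of labeled promoter accounts; the function name and the toy data are hypothetical and do not reproduce the experiment.

```python
def precision_at_n(ranked_accounts, known_promoters, n):
    """Fraction of the top-n ranked accounts that are known promoters."""
    retrieved = ranked_accounts[:n]
    if not retrieved:
        return 0.0
    hits = sum(1 for account in retrieved if account in known_promoters)
    return hits / len(retrieved)


# Toy data: 4 retrieved accounts, 2 of which are labeled promoters.
ranked = ["acc1", "acc2", "acc3", "acc4"]
promoters = {"acc1", "acc3", "acc9"}
print(precision_at_n(ranked, promoters, 4))
# -> 0.5
```

In the setting described above, n would be 30 and the labeled set would hold the fifty promoter accounts identified in the dataset.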

DISCUSSION
This framework is very simple. The results obtained in the study indicate that the framework is effective for promoter account detection. Our simple classification system is able to detect promoters accurately based on tweet content. With this system, we were able to identify a number of clusters with the appropriate promoter accounts. Zhang et al. also built a promoter detection framework, based on URL-driven estimation [4]. That framework can detect promoters, but it requires many aspects to be analyzed, e.g. the number of URLs posted by accounts, the frequency of tweets with URLs posted by each account, the links in the tweets and the timestamps.
We investigated the promoter accounts detected by our framework and found two types of promoter accounts. The first is the robot account, which automatically posts a large number of tweets and has retweeting as its main activity. Robot accounts appear in almost all clusters, because their task is to spread promotion topics to others. The second type is the official account, which promotes the candidate manually. Official accounts perform various activities, such as replying, retweeting and initiating a promotion topic. They usually concentrate on specific topics, so an official account exists in fewer clusters than a robot account.
We suggest improving the clustering method in order to improve the experimental results. In addition, the evaluation of promoter existence should be improved by taking the two different types of accounts into consideration.

CONCLUSION
In this paper, we studied promotional campaigns on Twitter and proposed a promoter account detection framework. First, we collect tweets with a positive or neutral opinion about the promoted object. Next, we apply the k-means algorithm to cluster the tweet corpus. Finally, we determine the sum of account existence in each cluster.
Our framework is very simple: it does not distinguish between types of tweet activity, and it does not require time information. Moreover, the experiment indicates that our proposed approach provides a valid tool for automatically detecting promoters.
As future work, we plan to improve our method for estimating similarity between tweet texts. Furthermore, we intend to add more features to the detection step to obtain a more effective detection framework.