ABSTRACT
Social networking sites engage millions of users around the world. Users’ interactions with these sites, such as Twitter and Facebook, have a tremendous impact on daily life and occasionally undesirable repercussions. The prominent social networking sites have become a target platform for spammers to disperse large amounts of irrelevant and harmful information. Twitter, for example, has become one of the most heavily used platforms and therefore attracts a large amount of spam. Fake users send unwanted tweets to promote services or websites, which not only affects legitimate users but also wastes resources. Moreover, the ability to spread false information through fake identities has increased, resulting in the circulation of harmful content. Recently, the detection of spammers and the identification of fake users on Twitter has become a common area of research in contemporary online social networks (OSNs). In this paper, we review techniques used for detecting spammers on Twitter. Moreover, we present a taxonomy of Twitter spam detection approaches that classifies the techniques based on their ability to detect: (i) fake content, (ii) spam based on URL, (iii) spam in trending topics, and (iv) fake users. The presented techniques are also compared based on various features, such as user features, content features, graph features, structure features, and time features. We hope that the presented study will be a useful resource for researchers seeking the highlights of recent developments in Twitter spam detection in a single place.
SYSTEM ANALYSIS
EXISTING SYSTEM
- Sahami et al. proposed textual, non-textual, and domain-specific features and trained a naive Bayes classifier to separate spam emails from legitimate ones. Schafer proposed metadata-based approaches to detect botnets that use compromised email accounts to spread mail spam. Gao et al. analysed spam campaigns on Facebook using a similarity graph built on the semantic similarity between posts and on URLs that point to the same destination.
- Furthermore, they extracted clusters from a similarity graph, wherein each cluster represents a specific spam campaign. Upon analysis, they determined that most spam sources were hijacked accounts, which exploited the trust of users to redirect legitimate users to phishing sites.
- Yang et al. and Ahmed and Abulaish used content- and interaction-based attributes for learning classifiers to separate spammers from benign users on different OSNs.
- Yang et al. and Ahmed and Abulaish analysed the contribution of each feature to spammer detection, whereas Yang et al. conducted an in-depth empirical analysis of the evasive tactics practiced by spammers to bypass detection systems. They also tested the robustness of newly devised features.
- Zhu et al. used a matrix factorization technique to find the latent features from the sparse activity matrix and adopted social regularization to learn the spam-discriminating power of the classifier on the Renren network, one of the most popular OSNs in China. Another spammer detection approach in social media was proposed by Tan et al.
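The similarity-graph clustering attributed to Gao et al. above can be illustrated with a minimal sketch. This is not their implementation: the `(text, destination_url)` post format, the `cluster_posts` and `jaccard` names, the 0.6 threshold, and the use of token-overlap (Jaccard) similarity as a stand-in for semantic similarity are all assumptions made for illustration. Posts are linked when they share a destination URL or have similar text, and each connected component is treated as one candidate campaign.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_posts(posts, threshold=0.6):
    """Group posts into candidate campaigns.

    posts: list of (text, destination_url) pairs (hypothetical format).
    Two posts are linked if they share a destination URL or if their text
    similarity reaches `threshold`; clusters are the connected components
    of the resulting graph, found here with a small union-find.
    """
    parent = list(range(len(posts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(posts)), 2):
        (text_i, url_i), (text_j, url_j) = posts[i], posts[j]
        same_url = url_i is not None and url_i == url_j
        if same_url or jaccard(text_i, text_j) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(posts)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up example posts (not real data).
posts = [
    ("win a free phone now", "http://a.example/x"),
    ("claim your prize today", "http://a.example/x"),
    ("win a free phone now !!", "http://b.example/y"),
    ("lunch with friends", None),
]
print(cluster_posts(posts))  # → [[0, 1, 2], [3]]
```

The pairwise comparison is quadratic in the number of posts, which is acceptable for a sketch but would need indexing or blocking on a real data set.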
Disadvantages
- There are no hybrid techniques to classify different spam behaviours.
- There are no spambot detection techniques.
PROPOSED SYSTEM
- The proposed system presents a Fake User Identification approach for detecting social spambots on Twitter, which uses a combination of metadata-, content-, interaction-, and community-based features. An analysis of the features used by existing approaches shows that most network-based features are not defined using a user’s followers and the underlying community structures, thereby disregarding the fact that a user’s reputation in a network is inherited from the user’s followers (rather than from the accounts the user follows) and from community members. Therefore, the system emphasizes the use of followers and community structures to define the network-based features of a user.
- The system classifies the set of features into broad categories, namely metadata, content, and network, wherein the network category is further divided into interaction- and community-based features. Metadata features are extracted from the additional information available about a user’s tweets, whereas content-based features capture the message-posting behaviour of a user and the quality of the text that the user uses in posts. Network-based features are extracted from the user interaction network.
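The feature categories described above can be made concrete with a small sketch. It covers only two illustrative content features and two follower-centred network features; metadata and community-based features are omitted. The `User` record, its field names, and the specific features (`url_ratio`, `duplicate_ratio`, `follower_ratio`, `reciprocity`) are assumptions chosen for illustration, not the feature set of the proposed system.

```python
import re
from dataclasses import dataclass, field

URL_RE = re.compile(r"https?://\S+")


@dataclass
class User:
    # Hypothetical record; field names are illustrative, not a Twitter API schema.
    name: str
    tweets: list = field(default_factory=list)
    followers: set = field(default_factory=set)
    following: set = field(default_factory=set)


def content_features(user):
    """Posting-behaviour features computed from tweet text."""
    n = len(user.tweets)
    if n == 0:
        return {"url_ratio": 0.0, "duplicate_ratio": 0.0}
    with_url = sum(1 for t in user.tweets if URL_RE.search(t))
    return {
        "url_ratio": with_url / n,                          # share of tweets containing a URL
        "duplicate_ratio": 1 - len(set(user.tweets)) / n,   # share of repeated tweets
    }


def network_features(user):
    """Features built from followers and the accounts the user follows."""
    n_followers = len(user.followers)
    n_following = len(user.following)
    total = n_followers + n_following
    reciprocated = len(user.followers & user.following)
    return {
        # Low values mean the user follows many accounts but few follow back.
        "follower_ratio": n_followers / total if total else 0.0,
        # Share of followed accounts that follow the user back.
        "reciprocity": reciprocated / n_following if n_following else 0.0,
    }


def feature_vector(user):
    """Merge all feature groups into one flat dictionary."""
    return {**content_features(user), **network_features(user)}


# Made-up example account (not real data).
suspect = User(
    name="suspect",
    tweets=["buy now http://x.example", "buy now http://x.example", "hello"],
    followers={"a"},
    following={"a", "b", "c", "d"},
)
print(feature_vector(suspect))
```

A feature dictionary of this kind is the sort of input a downstream classifier would consume.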
Advantages
- A novel study that uses community-based features with other feature categories, including metadata, content, and interaction, for detecting automated spammers.
- Uses a hybrid of classifiers, such as random forest, decision tree, and Bayesian network, to classify spammers.
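The text does not say how the three classifiers are combined. One common option, assumed here purely for illustration, is majority voting over their individual predictions. In the sketch below, the three `*_stub` functions are hypothetical threshold rules standing in for trained random forest, decision tree, and Bayesian network models; the feature names and thresholds are likewise made up.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most classifiers.

    With three voters and two labels a tie cannot occur.
    """
    return Counter(labels).most_common(1)[0][0]


def classify(features, classifiers):
    """Run every classifier on the same features and combine the votes."""
    votes = [clf(features) for clf in classifiers]
    return majority_vote(votes), votes


# Hypothetical stand-ins for trained models; real models would be fitted
# on labelled data rather than hard-coded.
def forest_stub(f):
    return "spammer" if f["url_ratio"] > 0.5 else "benign"


def tree_stub(f):
    return "spammer" if f["follower_ratio"] < 0.2 else "benign"


def bayes_stub(f):
    return "spammer" if f["duplicate_ratio"] > 0.5 else "benign"


# Made-up feature values for one account.
features = {"url_ratio": 0.9, "follower_ratio": 0.1, "duplicate_ratio": 0.2}
label, votes = classify(features, [forest_stub, tree_stub, bayes_stub])
print(label, votes)  # → spammer ['spammer', 'spammer', 'benign']
```

Majority voting keeps the combination rule simple; weighted voting or stacking would be alternatives if the individual models differ widely in accuracy.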
SYSTEM REQUIREMENTS
SOFTWARE REQUIREMENTS:
• Programming Language : Python
• Front End Technologies : Tkinter/Web (HTML, CSS, JS)
• IDE : Jupyter/Spyder/VS Code
• Operating System : Windows 8/10
HARDWARE REQUIREMENTS:
• Processor : Intel Core i3
• RAM Capacity : 2 GB
• Hard Disk : 250 GB
• Monitor : 15″ Color
• Mouse : 2 or 3 Button Mouse
• Keyboard : Standard keyboard (Windows 8/10 compatible)