Thursday, January 16, 2014

Yelp’s filtering is reasonable and its filtering algorithm seems to be correlated with abnormal spamming behaviors.

http://www.cs.uic.edu/~amukherj/papers/ICWSM-Spam_final_camera-submit.pdf



What Yelp Fake Review Filter Might Be Doing?


Arjun Mukherjee, Vivek Venkataraman, Bing Liu, Natalie Glance
University of Illinois at Chicago; Google Inc.
arjun4787@gmail.com; {vvenka6, liub}@uic.edu; nglance@google.com



Abstract
 
Online reviews have become a valuable resource for decision making. However, their usefulness brings forth a curse: deceptive opinion spam. In recent years, fake review detection has attracted significant attention. However, most review sites still do not publicly filter fake reviews. Yelp is an exception, having filtered reviews over the past few years. However, Yelp’s algorithm is a trade secret. In this work, we attempt to find out what Yelp might be doing by analyzing its filtered reviews. The results will be useful to other review hosting sites in their filtering efforts. There are two main approaches to filtering: supervised and unsupervised learning. In terms of features used, there are also roughly two types: linguistic features and behavioral features. In this work, we take a supervised approach, as we can make use of Yelp’s filtered reviews for training. Existing approaches based on supervised learning all rely on pseudo fake reviews rather than fake reviews filtered by a commercial website. Recently, supervised learning using linguistic n-gram features has been shown to perform extremely well (attaining around 90% accuracy) in detecting crowdsourced fake reviews generated using Amazon Mechanical Turk (AMT). We put these existing research methods to the test and evaluate their performance on the real-life Yelp data. To our surprise, the behavioral features perform very well, but the linguistic features are not as effective. To investigate, a novel information theoretic analysis is proposed to uncover the precise psycholinguistic difference between AMT reviews and Yelp reviews (crowdsourced vs. commercial fake reviews). We find something quite interesting. This analysis and our experimental results allow us to postulate that Yelp’s filtering is reasonable and that its filtering algorithm seems to be correlated with abnormal spamming behaviors.



Introduction
 
 
Online reviews are increasingly being used by individuals and organizations in making purchase and other decisions. Positive reviews can bring significant financial gains and fame to businesses. Unfortunately, this gives strong incentives for imposters to game the system by posting deceptive fake reviews to promote or to discredit target products and services. Such individuals are called opinion spammers and their activities are called opinion spamming (Jindal and Liu, 2008).



The problem of opinion spam or fake reviews has become widespread. Several high-profile cases have been reported in the news (Streitfeld, 2012a). Consumer sites have even put together many clues for people to manually spot fake reviews (Popken, 2010). There have also been media investigations in which fake reviewers admit to having been paid to write fake reviews (Kost, 2012). In fact, the menace has soared to such serious levels that Yelp.com has launched a "sting" operation to publicly shame businesses that buy fake reviews (Streitfeld, 2012b).
 
Deceptive opinion spam was first studied in (Jindal and Liu, 2008). Since then, several dimensions have been explored: detecting individual (Lim et al., 2010) and group (Mukherjee et al., 2012) spammers, and time-series (Xie et al., 2012) and distributional (Feng et al., 2012a) analyses. The main detection technique has been supervised learning using linguistic and/or behavioral features. Existing works have made important progress. However, they mostly rely on ad-hoc fake and non-fake labels for model building. For example, in (Jindal and Liu, 2008), duplicate and near-duplicate reviews were assumed to be fake reviews in model building. An AUC (Area Under the ROC Curve) of 0.78 was reported using logistic regression. The assumption, however, is too restrictive for detecting generic fake reviews. Li et al. (2011) applied a co-training method on a manually labeled dataset of fake and non-fake reviews, attaining an F1-score of 0.63. This result too may not be reliable, as human labeling of fake reviews has been shown to be quite poor (Ott et al., 2011).



In this work, we aim to study how well the existing research methods work in detecting real-life fake reviews on a commercial website. We choose Yelp.com as it is a well-known, large-scale online review site that filters fake or suspicious reviews. However, its filtering algorithm is a trade secret. In this study, we experiment with Yelp’s filtered and unfiltered reviews to find out what Yelp’s filter might be doing. Note that by no means do we claim that Yelp’s fake review filtering is perfect. However, Yelp is a commercial review hosting site that has been performing industrial-scale filtering since 2005 to remove suspicious or fake reviews (Stoppelman, 2009). Our focus is to study Yelp using its filtered reviews, to conjecture its review filtering quality, and to consider what its review filter might be doing.
 
Our starting point is the work of Ott et al. (2011), which is state-of-the-art as it reported an accuracy of about 90%. Ott et al. (2011) used Amazon Mechanical Turk (AMT) to crowdsource anonymous online workers (called Turkers) to write fake hotel reviews (paying $1 per review) that portray some hotels in a positive light. 400 fake positive reviews were crafted using AMT for 20 popular Chicago hotels. 400 positive reviews from Tripadvisor.com on the same 20 Chicago hotels were used as non-fake reviews. Ott et al. (2011) reported an accuracy of 89.6% using only word bigram features. Feng et al. (2012b) boosted the accuracy to 91.2% using deep syntax features. These results are quite encouraging as they achieve very high accuracy using only linguistic features.



We thus first tried the linguistic n-gram based approach to classify Yelp’s filtered and unfiltered reviews. Applying the same n-gram features and the same supervised learning method as in (Ott et al., 2011) on the Yelp data yielded an accuracy of 67.8%, which is significantly lower than the 89.6% reported on the AMT data in (Ott et al., 2011). The significantly lower accuracy on the Yelp data can be due to two reasons: (1) the fake and non-fake labels according to Yelp’s filter are very noisy, or (2) there are some fundamental differences between the Yelp data and the AMT data which are responsible for the big difference in accuracy.
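
To make the setup concrete, the following is a minimal sketch (not the authors’ code) of the n-gram classification pipeline described above. It assumes scikit-learn, with LinearSVC standing in for SVMLight; the function name and inputs are illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def ngram_svm_cv_accuracy(texts, labels):
    """5-fold cross-validated accuracy of a word uni/bigram linear SVM.

    texts: list of review strings; labels: 1 for fake (filtered), 0 for non-fake.
    """
    pipeline = make_pipeline(
        CountVectorizer(ngram_range=(1, 2), binary=True),  # unigram + bigram presence features
        LinearSVC(),                                        # linear-kernel SVM classifier
    )
    return cross_val_score(pipeline, texts, labels, cv=5, scoring="accuracy").mean()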

To investigate the actual cause, we propose a principled information theoretic analysis. Our analysis shows that for the AMT data in (Ott et al., 2011), the word distributions in fake reviews written by Turkers are quite different from the word distributions in non-fake reviews from Tripadvisor. This explains why detecting crowdsourced fake reviews in the AMT data of (Ott et al., 2011) is quite easy, yielding an 89.6% detection accuracy.
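
The kind of word-distribution comparison underlying this analysis can be sketched as follows. This is a simplified illustration only (smoothed unigram distributions compared with KL-divergence), not the exact analysis developed in the paper; the function name and smoothing constant are assumptions.

import math
from collections import Counter

def unigram_kl(fake_reviews, nonfake_reviews, alpha=1.0):
    """KL(P_fake || P_nonfake) over unigrams with add-alpha smoothing (simplified sketch)."""
    fake_counts = Counter(w for r in fake_reviews for w in r.lower().split())
    nonfake_counts = Counter(w for r in nonfake_reviews for w in r.lower().split())
    vocab = set(fake_counts) | set(nonfake_counts)
    n_fake, n_nonfake, v = sum(fake_counts.values()), sum(nonfake_counts.values()), len(vocab)
    kl = 0.0
    for w in vocab:
        p = (fake_counts[w] + alpha) / (n_fake + alpha * v)        # smoothed prob. in fake reviews
        q = (nonfake_counts[w] + alpha) / (n_nonfake + alpha * v)  # smoothed prob. in non-fake reviews
        kl += p * math.log(p / q)
    return kl

A large divergence on the AMT data and a small one on the Yelp data would correspond to the pattern reported above.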

However, in the Yelp data, we found that the suspicious reviewers (spammers) according to Yelp’s filter used language in their (fake) reviews that is very similar to the language of other non-fake (unfiltered) reviews. This makes Yelp’s fake (filtered) and non-fake (unfiltered) reviews linguistically similar, which explains why fake review detection using n-grams on Yelp’s data is much harder. A plausible reason is that the spammers caught by Yelp’s filter made an effort to make their (fake) reviews sound as convincing as other non-fake reviews. However, the spammers in the Yelp data left behind specific psycholinguistic footprints which reveal deception, and these are precisely what our information theoretic analysis uncovers.

The inefficacy of linguistic features in detecting fake reviews filtered by Yelp encouraged us to study reviewer behaviors on Yelp. Our behavioral analysis shows marked distributional divergence between the reviewing behaviors of spammers (authors of filtered reviews) and non-spammers (others). This motivated us to examine the effectiveness of behaviors in detecting Yelp’s fake (filtered) reviews. To our surprise, we found that behaviors are highly effective for detecting fake reviews filtered by Yelp. More importantly, the behavioral features significantly outperform linguistic n-grams in detection performance.
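
As a rough illustration of what behavioral features look like, the sketch below computes a few reviewer-level signals of the kind commonly used in this line of work (posting bursts, share of extreme ratings, review length, rating deviation). The review schema, feature names, and exact definitions are assumptions for illustration, not the paper’s precise feature set.

from collections import Counter
from statistics import mean

def reviewer_behavior_features(reviews, business_avg_rating):
    """Illustrative behavioral features for one reviewer (simplified assumptions).

    reviews: list of dicts with 'date', 'rating', 'text', and 'business' keys;
    business_avg_rating: maps a business id to its average star rating.
    """
    per_day = Counter(r["date"] for r in reviews)
    max_reviews_per_day = max(per_day.values())                 # posting bursts
    pct_extreme = mean(r["rating"] in (1, 5) for r in reviews)  # share of 1- or 5-star ratings
    avg_length = mean(len(r["text"].split()) for r in reviews)  # average review length (words)
    avg_rating_deviation = mean(
        abs(r["rating"] - business_avg_rating[r["business"]]) for r in reviews
    )                                                           # deviation from the business norm
    return [max_reviews_per_day, pct_extreme, avg_length, avg_rating_deviation]

Such per-reviewer feature vectors can then be fed to the same SVM setup used for the n-gram features.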

Finally, using the results of our experimental study, we make some conjectures about the quality of Yelp’s filtering and postulate what Yelp’s fake review filter might be doing. We summarize our main results below:

1. We found that in the AMT data (Ott et al., 2011), the word distributions of fake and non-fake reviews are very different, which explains the high (90%) detection accuracy using n-grams. However, for the Yelp data, word distributions in fake and non-fake reviews are quite similar, which explains why the method in (Ott et al., 2011) is less effective on Yelp’s real-life data.

2. The above point indicates that the linguistic n-gram feature based classification approach in (Ott et al., 2011) does not seem to be the (main) approach used by Yelp.

3. Using abnormal behaviors yields a respectable 86% accuracy in detecting Yelp’s fake (filtered) reviews, showing that abnormal-behavior-based detection results are highly correlated with Yelp’s filtering.

4. These results allow us to postulate that Yelp might be using behavioral features/clues in its filtering.

We will describe the detailed investigations in subsequent sections. We believe this study will be useful to both academia and industry and also to other review sites in their fake review filtering efforts. Before proceeding further, we first review the relevant literature below.
 
Related Work
 
Web spam (Castillo et al., 2007; Spirin and Han, 2012) and email spam (Chirita et al., 2005) are the most widely studied spam activities. Blog (Kolari et al., 2006), network (Jin et al., 2011), and tagging (Koutrika et al., 2007) spam have also been studied. However, the dynamics of opinion spam differ.

Apart from the works mentioned in the introduction, Jindal et al. (2010) studied unexpected reviewing patterns, Wang et al. (2011) investigated graph-based methods for finding fake store reviewers, and Fei et al. (2013) exploited review burstiness for spammer detection.

Studies on review quality (Liu et al., 2007), distortion (Wu et al., 2010), and helpfulness (Kim et al., 2006) have also been conducted. A study of bias, controversy and summarization of research paper reviews is reported in (Lauw et al., 2006; 2007). However, research paper reviews do not (at least not obviously) involve faking.

Also related is the task of psycholinguistic deception detection, which investigates lying words (Hancock et al., 2008; Newman et al., 2003), untrue views (Mihalcea and Strapparava, 2009), computer-mediated deception in role-playing games (Zhou et al., 2008), etc.
 
Detecting Fake Reviews in Yelp
 
This section reports a set of classification experiments using the real-life data from Yelp and the AMT data from (Ott et al., 2011).
 
The Yelp Review Dataset
 
To ensure the credibility of user opinions posted on its site, Yelp uses a filtering algorithm to filter fake/suspicious reviews and puts them in a filtered list. According to its CEO, Jeremy Stoppelman, Yelp’s filtering algorithm has evolved over the years (since the site’s launch in 2005) to filter shill and fake reviews (Stoppelman, 2009). Yelp is also confident enough to make its filtered reviews public. Yelp’s filter has also been claimed to be highly accurate by a study in BusinessWeek (Weise, 2011) [1].

[1] Yelp accepts that its filter may catch some false positives (Holloway, 2011), but it regards the cost of filtering such reviews as preferable to the far higher cost of not having any filter at all, which would render it a laissez-faire review site that people stop using (Luther, 2010).

[2] We also tried naïve Bayes, but it resulted in slightly poorer models.



In this work, we study Yelp’s filtering using its filtered (fake) and unfiltered (non-fake) reviews across 85 hotels and 130 restaurants in the Chicago area. To avoid any bias, we consider a mixture of popular and unpopular hotels and restaurants (based on the number of reviews) in our dataset, summarized in Table 1. We note that the class distribution is imbalanced.
 
Classification Experiments on Yelp
 
We now report the classification results on the Yelp data.
 
Classification Settings: All our classification experiments are based on SVM [2] (SVMLight (Joachims, 1999)) using 5-fold Cross-Validation (CV), as was also done in (Ott et al., 2011). We report linear kernel SVM results as it outperformed the rbf, sigmoid, and polynomial kernels.

From Table 1, we see that the class distribution of the real-life Yelp data is skewed. It is well known that highly imbalanced data often produces poor models (Chawla et al., 2004). Classification using the natural class distribution in Table 1 yielded F1-scores of 31.9 and 32.2 for the hotel and restaurant domains, respectively. For a detailed analysis of detection under the skewed (natural) class distribution, refer to (Mukherjee et al., 2013).

To build a good model for imbalanced data, a well-known technique is to employ under-sampling (Drummond and Holte, 2003): randomly select a subset of instances from the majority class and combine it with the minority class to form a balanced class distribution for model building. Since Ott et al. (2011) reported classification on balanced data (50% class distribution), i.e., 400 fake reviews from AMT and 400 non-fake reviews from Tripadvisor, for a fair comparison we also report results on balanced data.
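
A minimal sketch of the under-sampling step just described, assuming the filtered (fake) and unfiltered (non-fake) reviews are held in two Python lists; the function name and fixed seed are illustrative assumptions, not part of the original setup.

import random

def undersample_balanced(fake_reviews, nonfake_reviews, seed=0):
    """Randomly draw |fake| non-fake reviews to form a 50/50 training set (sketch)."""
    rng = random.Random(seed)
    sampled_nonfake = rng.sample(nonfake_reviews, k=len(fake_reviews))
    texts = list(fake_reviews) + sampled_nonfake
    labels = [1] * len(fake_reviews) + [0] * len(sampled_nonfake)
    return texts, labels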
 

Table 1: Dataset statistics.

Domain       fake    non-fake   % fake   total # reviews   # reviewers
Hotel         802       4876    14.1%              5678          5124
Restaurant   8368      50149    14.3%             58517         35593
