Identifying Google Analytics Referrer Spam Using R
Before beginning any analysis of Google Analytics data, it is important to clean up the referral lists to make sure that you are only analyzing actual visits to your website. Referral spam is a problem that began in about 2014, when webmasters started to notice referrals in Google Analytics that did not appear in web server access logs: no one had actually visited the site. Referral spam operators randomly guess Google Analytics tracking ID codes and impersonate accesses to a site in the hope that a webmaster reviewing the referral list will visit the spammer's site, either to download malicious code or to purchase a product or service of interest to webmasters. A look at Google Trends shows that this became a major problem in 2015, as shown in Figure 1.
Figure 1. Google Trends shows a dramatic uptick in searches for “referral spam” beginning in 2015.
Within Google Analytics, you can set a flag in the admin area for a view to filter out well-known referral spammers, using filters described in an article by Ben Travis at Viget. For analysis in R, you will not be able to use the view filters in Google Analytics and will have to filter out the referral spammers yourself before you do any analysis. The RGoogleAnalytics and rdomains R packages offer a convenient programmatic way to analyze Google Analytics referral spam attacks and remove them from other Google Analytics analysis. The article is divided into the following sections:
- Retrieve Referrer Data from Google Analytics
- Use the urltools Package to Isolate the Domain Name
- Use the rdomains Package to Look up Domains on Dmoz and Shallalist, Two Domain Classification Sites
- Look up Referrer Domain Names on Virustotal, a Domain Classification Site
- Filter the Referrer List to Identify Referral Spammers
Retrieve Referrer Data from Google Analytics
The first step is to use the RGoogleAnalytics package to retrieve the referral data from Google Analytics. In the example shown, the OAuth token has been generated and saved once for use in all scripts. The query parameters are fairly general and include the ga:fullReferrer dimension, as shown in Figure 2. Make sure to set up logic to save the query data and check for its existence before running the query, as the query can take a while; you may also run up against daily retrieval limits.
#
# Retrieve the previously saved OAuth token
#
require(RGoogleAnalytics)
load("~/Consulting_Business/R/working/token_file")
ValidateToken(token)
## Access Token successfully updated
profiles <- GetProfiles(token)
## Access Token is valid
#
# Build a list of all the Query Parameters
#
if (!file.exists("./ga_data_referrer")) {
  query.list <- Init(start.date = "2015-01-12",
                     end.date = "2016-09-30",
                     dimensions = "ga:date,ga:hour,ga:pagePath,ga:sourceMedium,ga:fullReferrer,ga:metro,ga:networkDomain",
                     metrics = "ga:sessions,ga:pageviews,ga:sessionDuration,ga:bounceRate",
                     max.results = 10000,
                     sort = "ga:date,ga:hour",
                     filters = "ga:medium==referral",
                     table.id = paste("ga:",gaProfileID,sep=""))
  ga.query <- QueryBuilder(query.list)
  # Extract the data and store it in a data frame
  ga.data <- GetReportData(ga.query, token)
  save(ga.data,file="./ga_data_referrer")
} else {
  load("./ga_data_referrer")
}
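The save-and-check pattern above recurs in every later step, so it can be wrapped in a small helper. The following is a minimal sketch and not part of the original scripts: `cached_result` and `slow_query` are hypothetical names, and the helper uses base R's `saveRDS`/`readRDS` instead of `save`/`load` so that the cached object is returned as a value.

```r
# Hypothetical helper (not from the original scripts): run an expensive
# step once, store the result on disk, and reuse it on later runs.
cached_result <- function(path, compute) {
  if (file.exists(path)) {
    # A cached copy exists, so skip the expensive call entirely
    return(readRDS(path))
  }
  result <- compute()      # e.g. a function wrapping GetReportData()
  saveRDS(result, path)    # store for subsequent runs
  result
}

# Usage with a stand-in for the slow Google Analytics query
calls <- 0
slow_query <- function() {
  calls <<- calls + 1
  data.frame(fullReferrer = "example.com/", sessions = 1)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_result(cache_file, slow_query)  # runs the query
second <- cached_result(cache_file, slow_query)  # reads the cached file
```

Because the second call finds the file, the stand-in query runs only once, which is the behavior you want when a real query is slow or counts against a daily limit.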
Use the urltools Package to Isolate the Domain Name
The next step in the process is to use the tldextract function, together with the domain function from the urltools package, to isolate the domain name in the referral string. The referrals can then be filtered for visits to the root path (/) or referrers containing try.php, two common patterns found in referral spam, plus any occurrence of darodar, as this was one of the original referral spam domains. The code to isolate the domains is shown in Figure 3; note that the path filter is commented out there, so all referrer domains are carried forward to the classification steps.
require(dplyr)
# tldextract() is assumed here to come from the tldextract package
require(tldextract)
#
# dplyr does not currently like $domain notation...thus [["domain"]]
#
gaDf <- ga.data
gaDf <- gaDf %>%
  mutate(referrerDomain=paste(tldextract(urltools::domain(gaDf$fullReferrer))[["domain"]],
                              ".",
                              tldextract(urltools::domain(gaDf$fullReferrer))[["tld"]],
                              sep=""))
#gaRefSpamDf <- gaDf %>% filter(pagePath == "/" |
#                               grepl("try.php",fullReferrer) |
#                               grepl("darodar",fullReferrer)) %>%
#  group_by(referrerDomain) %>%
#  summarize(attacks=n())
gaRefSpamDf <- gaDf %>%
  group_by(referrerDomain) %>%
  summarize(attacks=n())
gaRefSpamDetailDf <- gaDf %>%
  group_by(referrerDomain,pagePath,fullReferrer) %>%
  summarize(attacks=n())
gaRefSpamDf
## # A tibble: 123 × 2
##    referrerDomain     attacks
##    <chr>                <int>
## 1  100dollars-seo.com       5
## 2  1und1.de                 1
## 3  aafes.com                2
## 4  alhea.com                3
## 5  alot.com                 1
## 6  aol.com                  1
## 7  asana.com                1
## 8  ask.com                 15
## 9  atlassian.net            2
## 10 b1.org                   1
## # ... with 113 more rows
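The path-based filter that is commented out in the listing can be tried on a few rows in isolation. This is a minimal base R sketch on a hand-built data frame; the four rows are copied from the referrer detail shown later in this article, while `sampleDf`, `looksLikeSpam`, and `suspectDf` are names introduced here for illustration only.

```r
# Sample rows taken from the referrer detail shown later in this article
sampleDf <- data.frame(
  pagePath = c("/",
               "/",
               "/Web-Commerce/100dollars-seo-com-referral-spam",
               "/about"),
  fullReferrer = c("100dollars-seo.com/try.php",
                   "forum.topic44008047.darodar.com/",
                   "app.asana.com/0/11812954602745/44396681995798",
                   "www2.delta-search.com/"),
  stringsAsFactors = FALSE
)

# Flag a row when it lands on the site root, references try.php,
# or mentions darodar anywhere in the referrer string
looksLikeSpam <- sampleDf$pagePath == "/" |
  grepl("try.php", sampleDf$fullReferrer, fixed = TRUE) |
  grepl("darodar", sampleDf$fullReferrer, fixed = TRUE)

suspectDf <- sampleDf[looksLikeSpam, ]
suspectDf$fullReferrer
```

Passing `fixed = TRUE` makes `grepl` treat the pattern as a literal string, so the dot in try.php does not act as a regular expression wildcard. A filter like this is only a first pass: the delta-search.com row lands on /about and is not flagged, which is one reason the article goes on to use classification services.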
Use the rdomains Package to Look up Domains on Dmoz and Shallalist, Two Domain Classification Sites
The next step in the process is to use the rdomains package to look up the domains on various domain classification sites, beginning with dmoz and shallalist. The get_dmoz_data and get_shalla_data functions download data sets that are then used by the dmoz_cat and shalla_cat calls, both of which take a vector of domain names and return data frames with classification information, as shown in the example given in Figure 4. For purposes of identifying referrer spam domains, we will look only at domains that neither dmoz nor shallalist classifies.
#
# Retrieve dmoz and shallalist catalogs
#
require(rdomains)
if (!file.exists("./dmoz_domain_category.csv")) {
  get_dmoz_data(outdir = "./", overwrite = FALSE)
}
if (!file.exists("./shalla_domain_cateory.csv") && !file.exists("./shalla_domain_category.csv")) {
  get_shalla_data(outdir = "./", overwrite = FALSE)
}
#
# Query the dmoz catalog
#
if (!file.exists("./gaRefSpam1Df")) {
  gaRefSpam1Df <- gaRefSpamDf %>%
    mutate(dmozCat = dmoz_cat(gaRefSpamDf$referrerDomain, use_file = "dmoz_domain_category.csv")$dmoz_category) %>%
    filter(is.na(dmozCat))
  save(gaRefSpam1Df, file="./gaRefSpam1Df")
} else {
  load("./gaRefSpam1Df")
}
#
# Query the shallalist catalog
#
if (!file.exists("./gaRefSpam2Df")) {
  gaRefSpam2Df <- gaRefSpam1Df %>%
    mutate(shallaCat = shalla_cat(domains = referrerDomain)$shalla_category) %>%
    filter(is.na(shallaCat))
  save(gaRefSpam2Df, file="./gaRefSpam2Df")
} else {
  load("./gaRefSpam2Df")
}
summary(gaRefSpam2Df)
##  referrerDomain        attacks          dmozCat           shallaCat
##  Length:69          Min.   :   1.00   Length:69          Length:69
##  Class :character   1st Qu.:   1.00   Class :character   Class :character
##  Mode  :character   Median :   1.00   Mode  :character   Mode  :character
##                     Mean   :  29.61
##                     3rd Qu.:   5.00
##                     Max.   :1720.00
gaRefSpam2Df
## # A tibble: 69 × 4
##    referrerDomain               attacks dmozCat shallaCat
##    <chr>                          <int> <chr>   <chr>
## 1  100dollars-seo.com                 5 <NA>    <NA>
## 2  alot.com                           1 <NA>    <NA>
## 3  asana.com                          1 <NA>    <NA>
## 4  atlassian.net                      2 <NA>    <NA>
## 5  basecamphq.com                     1 <NA>    <NA>
## 6  best-seo-offer.com                10 <NA>    <NA>
## 7  best-seo-solution.com              7 <NA>    <NA>
## 8  binarystream.com                   1 <NA>    <NA>
## 9  buttons-for-website.com            1 <NA>    <NA>
## 10 buttons-for-your-website.com       6 <NA>    <NA>
## # ... with 59 more rows
This list of domains still includes several that are clearly legitimate by inspection.
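The idea of keeping only the domains that neither catalog recognizes can be illustrated without downloading anything. The sketch below uses two tiny stand-in lookup tables, `dmozLookup` and `shallaLookup`; both tables and their category values are invented for this example, and the real `dmoz_cat` and `shalla_cat` calls return data frames rather than named vectors.

```r
# Stand-in catalogs; the entries are illustrative, not real catalog data
dmozLookup   <- c("aol.com" = "Computers", "ask.com" = "Computers")
shallaLookup <- c("aol.com" = "searchengines", "1und1.de" = "webhosting")

domains <- c("100dollars-seo.com", "1und1.de", "aol.com", "ask.com",
             "best-seo-offer.com")

# Indexing a named vector by a missing name yields NA, which mirrors
# the NA category that marks an unclassified domain
candidateDf <- data.frame(
  referrerDomain = domains,
  dmozCat   = unname(dmozLookup[domains]),
  shallaCat = unname(shallaLookup[domains]),
  stringsAsFactors = FALSE
)

# Keep only domains that neither catalog classifies
unclassifiedDf <- candidateDf[is.na(candidateDf$dmozCat) &
                              is.na(candidateDf$shallaCat), ]
unclassifiedDf$referrerDomain
```

A domain survives only when both category columns are NA, so a single catalog hit is enough to drop it from the candidate list. As the output above shows, being unclassified is a weak signal on its own, since many legitimate but less prominent sites are also missing from both catalogs.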
Look up Referrer Domain Names on Virustotal, a Domain Classification Site
As shown in the output of the code in Figure 4, we still have a few domains that are known, by inspection, to be legitimate. To filter these out, we will go to the Virustotal service for further classification. Virustotal works somewhat differently from the other services: you must have an account and an API key, both of which are free. Calls to Virustotal are limited to four per minute, so the rdomains interface works a little differently; you cannot send a vector of domain names to the virustotal_cat call. To process a group of domains, you will need to write a function similar to the one shown in Figure 5.
#
# Write a function to query Virustotal for a vector of domain
# names and limit the query rate to four per minute
#
getVirustotal <- function(domainDf,VirustotalApiKey) {
  require(rdomains)
  require(dplyr)
  # domainDf is a character vector of domain names.
  # Start from an empty result frame with the expected columns.
  virusDomain <- data.frame(domain=as.character(),
                            bitdefender=as.character(),
                            dr_web=as.character(),
                            alexa=as.character(),
                            google=as.character(),
                            websense=as.character(),
                            trendmicro=as.character())
  # seq_along() avoids running the loop on an empty vector
  for (i in seq_along(domainDf)) {
    # Pause 15 seconds per call to stay within four queries per minute
    Sys.sleep(15)
    # rdomains exports this function in lower case
    thisDomain <- virustotal_cat(domainDf[i],apikey = VirustotalApiKey)
    # Only merge when the lookup actually returned something
    if (!is.null(thisDomain)) {
      virusDomain <- merge(virusDomain,thisDomain,all=TRUE)
    }
  }
  return(virusDomain)
}
#
# Call the function to get the Virustotal info for all domains
#
if (!file.exists("./gaRefSpam3Df")) {
gaRefSpam3Df <- getVirustotal(gaRefSpamDf$referrerDomain,VirustotalApiKey)
save(gaRefSpam3Df,file="./gaRefSpam3Df")
} else {
load("./gaRefSpam3Df")
}
gaRefSpam3Df
## domain bitdefender dr_web alexa google websense trendmicro
## 1 100dollars-seo.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 2 1und1.de hosting <NA> anbieter hosting information technology computers internet
## 3 aafes.com onlineshop <NA> military onlineshop shopping <NA>
## 4 alhea.com searchengines <NA> <NA> searchengines search engines and portals <NA>
## 5 alot.com business <NA> toolbars business society and lifestyles <NA>
## 6 aol.com computersandsoftware <NA> web_portals computersandsoftware search engines and portals search engines portals,news media
## 7 asana.com business <NA> <NA> business hosted business applications <NA>
## 8 ask.com searchengines <NA> ask searchengines search engines and portals search engines portals
## 9 atlassian.net marketing <NA> <NA> marketing educational materials <NA>
## 10 b1.org computersandsoftware not recommended site <NA> computersandsoftware information technology <NA>
## 11 basecamp.com business <NA> hosted business web collaboration computers internet
## 12 basecamphq.com business <NA> <NA> business web collaboration <NA>
## 13 best-seo-offer.com <NA> <NA> <NA> elevated exposure elevated exposure <NA>
## 14 best-seo-solution.com parked <NA> <NA> parked uncategorized <NA>
## 15 binarystream.com <NA> not recommended site <NA> information technology information technology <NA>
## 16 bing.com searchengines <NA> bing searchengines search engines and portals search engines portals
## 17 bt.com business <NA> carriers business business and economy business economy
## 18 buttons-for-website.com <NA> <NA> <NA> suspicious embedded link suspicious embedded link <NA>
## 19 buttons-for-your-website.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 20 centurylink.com business <NA> united_states business business and economy computers internet
## 21 centurylink.net portals <NA> <NA> portals news and media search engines portals
## 22 charter.net business <NA> business_and_economy business news and media news media
## 23 cincinnatibell.net business <NA> public_utilities business search engines and portals <NA>
## 24 clearch.org business <NA> <NA> business business and economy <NA>
## 25 cognizant.com business <NA> c business information technology <NA>
## 26 comcast.net news <NA> <NA> news business and economy news media
## 27 cox.com onlineshop <NA> operators onlineshop business and economy business economy
## 28 crazyguyonabike.com sports <NA> travelogues sports society and lifestyles <NA>
## 29 darodar.com parked <NA> <NA> parked suspicious content <NA>
## 30 delta-search.com searchengines not recommended site <NA> searchengines business and economy <NA>
## 31 desk.com business <NA> saas business hosted business applications blogs web communications
## 32 diigo.com computersandsoftware social networks <NA> computersandsoftware personal network storage and backup <NA>
## 33 disconnect.me education <NA> <NA> education proxy avoidance unknown
## 34 disqus.com computersandsoftware <NA> <NA> computersandsoftware information technology blogs web communications,newsgroups
## 35 dnsrsearch.com <NA> <NA> <NA> search engines and portals search engines and portals <NA>
## 36 dogpile.com searchengines not recommended site/adult content metasearch searchengines search engines and portals <NA>
## 37 duckduckgo.com searchengines <NA> search_engines searchengines search engines and portals search engines portals
## 38 earthlink.net bank e-mail united_states bank information technology <NA>
## 39 ecosia.org searchengines <NA> <NA> searchengines search engines and portals search engines portals
## 40 emailsrvr.com webmail <NA> <NA> webmail web hosting email
## 41 evernote.com computersandsoftware <NA> software computersandsoftware personal network storage and backup computers internet,personal network storage
## 42 facebook.com socialnetworks social networks we_are_the_99_percent socialnetworks social web - facebook social networking
## 43 financemarketing.com computersandsoftware <NA> <NA> computersandsoftware financial data and services <NA>
## 44 findeer.com searchengines <NA> <NA> searchengines search engines and portals <NA>
## 45 godaddy.com marketing <NA> g marketing web hosting web hosting
## 46 google.by searchengines <NA> <NA> searchengines search engines and portals search engines portals
## 47 google.ca searchengines <NA> <NA> searchengines search engines and portals search engines portals
## 48 google.co.id searchengines <NA> <NA> searchengines search engines and portals search engines portals
## 49 google.co.jp searchengines <NA> ガイドとディレクトリ searchengines search engines and portals search engines portals,reference
## 50 google.com searchengines chats google searchengines search engines and portals search engines portals
## 51 google.com.au searchengines <NA> search_engines searchengines search engines and portals search engines portals
## 52 google.com.kw searchengines <NA> <NA> searchengines search engines and portals <NA>
## 53 google.cz searchengines <NA> google searchengines search engines and portals search engines portals
## 54 google.de searchengines <NA> google searchengines search engines and portals <NA>
## 55 google.fr computersandsoftware <NA> google computersandsoftware search engines and portals search engines portals
## 56 google.it searchengines <NA> motori searchengines search engines and portals search engines portals
## 57 google.nl searchengines <NA> google searchengines search engines and portals search engines portals
## 58 hootsuite.com socialnetworks <NA> twitter socialnetworks social networking social networking
## 59 hud.gov business <NA> home business government government legal
## 60 info.com searchengines <NA> metasearch searchengines search engines and portals search engines portals
## 61 informationvine.com <NA> <NA> <NA> search engines and portals search engines and portals <NA>
## 62 isket.jp blogs <NA> <NA> blogs information technology <NA>
## 63 ixenia.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 64 ixquick.com searchengines <NA> metasearch searchengines search engines and portals search engines portals
## 65 ixquick.de business <NA> <NA> business search engines and portals search engines portals
## 66 justprofit.xyz business <NA> <NA> business elevated exposure <NA>
## 67 k9safesearch.com education <NA> <NA> education business and economy <NA>
## 68 larger.io <NA> <NA> <NA> business and economy business and economy <NA>
## 69 linkedin.com socialnetworks social networks social_networking socialnetworks social web - linkedin social networking,business economy
## 70 live.com webmail <NA> internet webmail search engines and portals search engines portals,email
## 71 locatimefree.com business <NA> <NA> business business and economy <NA>
## 72 meetup.com socialnetworks adult content/social networks social_networking socialnetworks social networking social networking,business economy
## 73 microsofttranslator.com education <NA> traductors_automàtics education reference materials translators cached pages
## 74 moz.com computersandsoftware <NA> <NA> computersandsoftware information technology computers internet
## 75 NA.NA <NA> <NA> <NA> <NA> <NA> <NA>
## 76 nextdoor.com socialnetworks social networks <NA> socialnetworks social networking <NA>
## 77 obrazky.cz marketing <NA> služby marketing search engines and portals <NA>
## 78 office365.com computersandsoftware <NA> <NA> computersandsoftware collaboration - office computers internet
## 79 office.com computersandsoftware <NA> groupware computersandsoftware collaboration - office business economy
## 80 pch.com gambling <NA> contests_and_sweepstakes gambling games <NA>
## 81 peoplepc.com computersandsoftware <NA> united_states computersandsoftware information technology <NA>
## 82 pushbullet.com marketing <NA> <NA> marketing information technology disease vector,spam
## 83 qwant.com computersandsoftware <NA> moteurs_de_recherche computersandsoftware search engines and portals <NA>
## 84 rankings-analytics.com <NA> <NA> <NA> suspicious content suspicious content <NA>
## 85 rankscanner.com blogs <NA> <NA> blogs information technology <NA>
## 86 richpasco.org <NA> <NA> <NA> uncategorized uncategorized <NA>
## 87 rof.net business <NA> <NA> business information technology <NA>
## 88 salesforce.com computersandsoftware <NA> contact_management computersandsoftware hosted business applications business economy
## 89 saltpalace.com business <NA> <NA> business business and economy <NA>
## 90 searchlock.com business <NA> <NA> business proxy avoidance <NA>
## 91 securesearch.co business <NA> <NA> business entertainment <NA>
## 92 semaltmedia.com business <NA> <NA> business uncategorized <NA>
## 93 servicepunt71.nl <NA> <NA> <NA> business and economy business and economy <NA>
## 94 seznam.cz searchengines <NA> portály searchengines search engines and portals search engines portals
## 95 shawcable.net business <NA> <NA> business business and economy <NA>
## 96 smarter.com onlineshop <NA> <NA> onlineshop shopping <NA>
## 97 social-buttons.com <NA> not recommended site <NA> uncategorized uncategorized <NA>
## 98 sosodesktop.com <NA> <NA> <NA> information technology information technology <NA>
## 99 stackoverflow.com computersandsoftware <NA> chats_and_forums computersandsoftware information technology computers internet
## 100 startjuno.com business <NA> <NA> business news and media <NA>
## 101 startnetzero.net business <NA> <NA> business news and media <NA>
## 102 startpage.com searchengines <NA> <NA> searchengines search engines and portals search engines portals
## 103 startssl.com computersandsoftware <NA> <NA> computersandsoftware business and economy internet infrastructure
## 104 success-seo.com business <NA> <NA> business uncategorized <NA>
## 105 suddenlink.net portals <NA> <NA> portals search engines and portals <NA>
## 106 t.co computersandsoftware not recommended site <NA> computersandsoftware information technology social networking
## 107 tds.net education <NA> <NA> education information technology news media
## 108 telstra.com.au business <NA> carriers business business and economy business economy
## 109 thegeekspeaks.net <NA> <NA> <NA> uncategorized uncategorized <NA>
## 110 toshiba.com business <NA> <NA> business business and economy <NA>
## 111 twcc.com <NA> <NA> <NA> entertainment entertainment <NA>
## 112 video--production.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 113 videos-for-your-business.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 114 webcrawler.com searchengines <NA> metasearch searchengines search engines and portals <NA>
## 115 web.de portals <NA> startseiten_und_portale portals search engines and portals search engines portals
## 116 webmastercentre.co.uk business <NA> <NA> business information technology <NA>
## 117 windstream.net business <NA> <NA> business search engines and portals search engines portals
## 118 wow.com games <NA> <NA> games search engines and portals <NA>
## 119 wowway.net business <NA> <NA> business information technology computers internet
## 120 xfinity.com business <NA> <NA> business business and economy <NA>
## 121 yahoo.com news <NA> web_portals news search engines and portals search engines portals
## 122 ygask.com <NA> <NA> <NA> uncategorized uncategorized <NA>
## 123 zendesk.com computersandsoftware <NA> saas computersandsoftware hosted business applications business economy
We now have a list where we can clearly identify the referral spammers using only the domain classification services.
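The rate-limited loop in Figure 5 can also be written generically so that the pacing logic is testable without an API key. The sketch below is an illustration rather than the rdomains interface: `rateLimitedLookup` and `fakeLookup` are hypothetical names, the lookup function is passed in as an argument, and a failed lookup is skipped instead of stopping the run.

```r
# Generic rate-limited lookup; lookupFn stands in for a real API call
rateLimitedLookup <- function(domains, lookupFn, delaySec = 15) {
  results <- lapply(seq_along(domains), function(i) {
    # Wait between calls (not before the first) to respect the rate limit
    if (i > 1) Sys.sleep(delaySec)
    # A failed lookup becomes NULL instead of aborting the whole run
    tryCatch(lookupFn(domains[i]), error = function(e) NULL)
  })
  # Drop the failed lookups, then stack the remaining one-row data frames
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Stand-in lookup that fails for one input, to exercise the error path
fakeLookup <- function(domain) {
  if (domain == "NA.NA") stop("lookup failed")
  data.frame(domain = domain, category = "uncategorized",
             stringsAsFactors = FALSE)
}

lookupDf <- rateLimitedLookup(c("darodar.com", "NA.NA", "ygask.com"),
                              fakeLookup, delaySec = 0)
```

With a limit of four calls per minute, the default `delaySec = 15` matches the pause used in Figure 5; the example sets it to 0 only so that it runs instantly. Note that `rbind` assumes every lookup returns the same columns, whereas the `merge(..., all=TRUE)` call in Figure 5 tolerates differing columns.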
Filter the Referrer List to Identify Referral Spammers
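To filter down to the final list of referral spammers, we will use dplyr to join the Virustotal results to the unclassified domains from the previous step and keep only the domains that Dr. Web either does not classify or labels "not recommended site" or "known infection source."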
#
# Filter on characteristics of known referral spam domains
#
if (!file.exists("./gaRefSpam4Df")) {
  gaRefSpam4Df <- gaRefSpam3Df %>%
    inner_join(gaRefSpam2Df,by=c("domain" = "referrerDomain")) %>%
    filter((is.na(dr_web) | dr_web == "not recommended site" | dr_web == "known infection source"))
  save(gaRefSpam4Df,file="./gaRefSpam4Df")
} else {
  load("./gaRefSpam4Df")
}
gaRefSpam4Df
## domain bitdefender dr_web alexa google websense trendmicro attacks dmozCat shallaCat
## 1 100dollars-seo.com <NA> <NA> <NA> uncategorized uncategorized <NA> 5 <NA> <NA>
## 2 alot.com business <NA> toolbars business society and lifestyles <NA> 1 <NA> <NA>
## 3 asana.com business <NA> <NA> business hosted business applications <NA> 1 <NA> <NA>
## 4 atlassian.net marketing <NA> <NA> marketing educational materials <NA> 2 <NA> <NA>
## 5 basecamphq.com business <NA> <NA> business web collaboration <NA> 1 <NA> <NA>
## 6 best-seo-offer.com <NA> <NA> <NA> elevated exposure elevated exposure <NA> 10 <NA> <NA>
## 7 best-seo-solution.com parked <NA> <NA> parked uncategorized <NA> 7 <NA> <NA>
## 8 binarystream.com <NA> not recommended site <NA> information technology information technology <NA> 1 <NA> <NA>
## 9 buttons-for-website.com <NA> <NA> <NA> suspicious embedded link suspicious embedded link <NA> 1 <NA> <NA>
## 10 buttons-for-your-website.com <NA> <NA> <NA> uncategorized uncategorized <NA> 6 <NA> <NA>
## 11 centurylink.net portals <NA> <NA> portals news and media search engines portals 6 <NA> <NA>
## 12 cincinnatibell.net business <NA> public_utilities business search engines and portals <NA> 1 <NA> <NA>
## 13 clearch.org business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 14 cognizant.com business <NA> c business information technology <NA> 1 <NA> <NA>
## 15 darodar.com parked <NA> <NA> parked suspicious content <NA> 2 <NA> <NA>
## 16 delta-search.com searchengines not recommended site <NA> searchengines business and economy <NA> 11 <NA> <NA>
## 17 desk.com business <NA> saas business hosted business applications blogs web communications 1 <NA> <NA>
## 18 disconnect.me education <NA> <NA> education proxy avoidance unknown 2 <NA> <NA>
## 19 dnsrsearch.com <NA> <NA> <NA> search engines and portals search engines and portals <NA> 1 <NA> <NA>
## 20 ecosia.org searchengines <NA> <NA> searchengines search engines and portals search engines portals 5 <NA> <NA>
## 21 evernote.com computersandsoftware <NA> software computersandsoftware personal network storage and backup computers internet,personal network storage 1 <NA> <NA>
## 22 financemarketing.com computersandsoftware <NA> <NA> computersandsoftware financial data and services <NA> 1 <NA> <NA>
## 23 findeer.com searchengines <NA> <NA> searchengines search engines and portals <NA> 1 <NA> <NA>
## 24 hud.gov business <NA> home business government government legal 2 <NA> <NA>
## 25 informationvine.com <NA> <NA> <NA> search engines and portals search engines and portals <NA> 1 <NA> <NA>
## 26 isket.jp blogs <NA> <NA> blogs information technology <NA> 7 <NA> <NA>
## 27 ixenia.com <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
## 28 justprofit.xyz business <NA> <NA> business elevated exposure <NA> 2 <NA> <NA>
## 29 k9safesearch.com education <NA> <NA> education business and economy <NA> 3 <NA> <NA>
## 30 larger.io <NA> <NA> <NA> business and economy business and economy <NA> 3 <NA> <NA>
## 31 live.com webmail <NA> internet webmail search engines and portals search engines portals,email 1 <NA> <NA>
## 32 locatimefree.com business <NA> <NA> business business and economy <NA> 30 <NA> <NA>
## 33 microsofttranslator.com education <NA> traductors_automàtics education reference materials translators cached pages 1 <NA> <NA>
## 34 moz.com computersandsoftware <NA> <NA> computersandsoftware information technology computers internet 1720 <NA> <NA>
## 35 NA.NA <NA> <NA> <NA> <NA> <NA> <NA> 1 <NA> <NA>
## 36 obrazky.cz marketing <NA> služby marketing search engines and portals <NA> 1 <NA> <NA>
## 37 office365.com computersandsoftware <NA> <NA> computersandsoftware collaboration - office computers internet 1 <NA> <NA>
## 38 pushbullet.com marketing <NA> <NA> marketing information technology disease vector,spam 1 <NA> <NA>
## 39 qwant.com computersandsoftware <NA> moteurs_de_recherche computersandsoftware search engines and portals <NA> 1 <NA> <NA>
## 40 rankings-analytics.com <NA> <NA> <NA> suspicious content suspicious content <NA> 3 <NA> <NA>
## 41 rankscanner.com blogs <NA> <NA> blogs information technology <NA> 24 <NA> <NA>
## 42 richpasco.org <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
## 43 salesforce.com computersandsoftware <NA> contact_management computersandsoftware hosted business applications business economy 2 <NA> <NA>
## 44 saltpalace.com business <NA> <NA> business business and economy <NA> 25 <NA> <NA>
## 45 searchlock.com business <NA> <NA> business proxy avoidance <NA> 6 <NA> <NA>
## 46 securesearch.co business <NA> <NA> business entertainment <NA> 1 <NA> <NA>
## 47 semaltmedia.com business <NA> <NA> business uncategorized <NA> 4 <NA> <NA>
## 48 servicepunt71.nl <NA> <NA> <NA> business and economy business and economy <NA> 1 <NA> <NA>
## 49 seznam.cz searchengines <NA> portály searchengines search engines and portals search engines portals 1 <NA> <NA>
## 50 shawcable.net business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 51 social-buttons.com <NA> not recommended site <NA> uncategorized uncategorized <NA> 18 <NA> <NA>
## 52 sosodesktop.com <NA> <NA> <NA> information technology information technology <NA> 1 <NA> <NA>
## 53 startjuno.com business <NA> <NA> business news and media <NA> 1 <NA> <NA>
## 54 startnetzero.net business <NA> <NA> business news and media <NA> 1 <NA> <NA>
## 55 startssl.com computersandsoftware <NA> <NA> computersandsoftware business and economy internet infrastructure 1 <NA> <NA>
## 56 success-seo.com business <NA> <NA> business uncategorized <NA> 48 <NA> <NA>
## 57 suddenlink.net portals <NA> <NA> portals search engines and portals <NA> 3 <NA> <NA>
## 58 tds.net education <NA> <NA> education information technology news media 1 <NA> <NA>
## 59 telstra.com.au business <NA> carriers business business and economy business economy 3 <NA> <NA>
## 60 thegeekspeaks.net <NA> <NA> <NA> uncategorized uncategorized <NA> 2 <NA> <NA>
## 61 toshiba.com business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 62 twcc.com <NA> <NA> <NA> entertainment entertainment <NA> 1 <NA> <NA>
## 63 video--production.com <NA> <NA> <NA> uncategorized uncategorized <NA> 3 <NA> <NA>
## 64 videos-for-your-business.com <NA> <NA> <NA> uncategorized uncategorized <NA> 6 <NA> <NA>
## 65 windstream.net business <NA> <NA> business search engines and portals search engines portals 2 <NA> <NA>
## 66 xfinity.com business <NA> <NA> business business and economy <NA> 29 <NA> <NA>
## 67 ygask.com <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
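The explicit `is.na(dr_web)` test in the filter above matters because comparing NA with `==` yields NA rather than TRUE or FALSE, and dplyr's `filter` drops rows whose condition is NA. The following base R sketch shows the same logic on a toy data frame; the domain names are placeholders, the `dr_web` labels are the ones seen in the output above, and `vtDf`, `naCompare`, `suspiciousLabels`, and `keep` are names introduced for this illustration.

```r
# Toy classification results; dr_web values follow the labels seen above,
# but the domain names are placeholders
vtDf <- data.frame(
  domain = c("a.example", "b.example", "c.example", "d.example"),
  dr_web = c(NA, "not recommended site", "social networks",
             "known infection source"),
  stringsAsFactors = FALSE
)

# Comparing NA with == gives NA, not TRUE or FALSE
naCompare <- vtDf$dr_web == "not recommended site"

# Test for NA explicitly, then match the suspicious labels with %in%,
# which always returns TRUE or FALSE
suspiciousLabels <- c("not recommended site", "known infection source")
keep <- is.na(vtDf$dr_web) | vtDf$dr_web %in% suspiciousLabels
vtDf[keep, "domain"]
```

Using `%in%` with a vector of labels also keeps the condition short if more Dr. Web categories need to be treated as suspicious later.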
Looking only at the Dr. Web classification still leaves two false positives: one for startssl.com and one for moz.com. It is surprising that these two are not classified by Dr. Web, but since they are classified by Trend Micro or Alexa, we can add an additional filter:
#
# Filter on characteristics of known referral spam domains
#
if (!file.exists("./gaRefSpam5Df")) {
  gaRefSpam5Df <- gaRefSpam4Df %>%
    filter(is.na(trendmicro) & is.na(alexa))
  save(gaRefSpam5Df,file="./gaRefSpam5Df")
} else {
  load("./gaRefSpam5Df")
}
gaRefSpam5Df
## domain bitdefender dr_web alexa google websense trendmicro attacks dmozCat shallaCat
## 1 100dollars-seo.com <NA> <NA> <NA> uncategorized uncategorized <NA> 5 <NA> <NA>
## 2 asana.com business <NA> <NA> business hosted business applications <NA> 1 <NA> <NA>
## 3 atlassian.net marketing <NA> <NA> marketing educational materials <NA> 2 <NA> <NA>
## 4 basecamphq.com business <NA> <NA> business web collaboration <NA> 1 <NA> <NA>
## 5 best-seo-offer.com <NA> <NA> <NA> elevated exposure elevated exposure <NA> 10 <NA> <NA>
## 6 best-seo-solution.com parked <NA> <NA> parked uncategorized <NA> 7 <NA> <NA>
## 7 binarystream.com <NA> not recommended site <NA> information technology information technology <NA> 1 <NA> <NA>
## 8 buttons-for-website.com <NA> <NA> <NA> suspicious embedded link suspicious embedded link <NA> 1 <NA> <NA>
## 9 buttons-for-your-website.com <NA> <NA> <NA> uncategorized uncategorized <NA> 6 <NA> <NA>
## 10 clearch.org business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 11 darodar.com parked <NA> <NA> parked suspicious content <NA> 2 <NA> <NA>
## 12 delta-search.com searchengines not recommended site <NA> searchengines business and economy <NA> 11 <NA> <NA>
## 13 dnsrsearch.com <NA> <NA> <NA> search engines and portals search engines and portals <NA> 1 <NA> <NA>
## 14 financemarketing.com computersandsoftware <NA> <NA> computersandsoftware financial data and services <NA> 1 <NA> <NA>
## 15 findeer.com searchengines <NA> <NA> searchengines search engines and portals <NA> 1 <NA> <NA>
## 16 informationvine.com <NA> <NA> <NA> search engines and portals search engines and portals <NA> 1 <NA> <NA>
## 17 isket.jp blogs <NA> <NA> blogs information technology <NA> 7 <NA> <NA>
## 18 ixenia.com <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
## 19 justprofit.xyz business <NA> <NA> business elevated exposure <NA> 2 <NA> <NA>
## 20 k9safesearch.com education <NA> <NA> education business and economy <NA> 3 <NA> <NA>
## 21 larger.io <NA> <NA> <NA> business and economy business and economy <NA> 3 <NA> <NA>
## 22 locatimefree.com business <NA> <NA> business business and economy <NA> 30 <NA> <NA>
## 23 NA.NA <NA> <NA> <NA> <NA> <NA> <NA> 1 <NA> <NA>
## 24 rankings-analytics.com <NA> <NA> <NA> suspicious content suspicious content <NA> 3 <NA> <NA>
## 25 rankscanner.com blogs <NA> <NA> blogs information technology <NA> 24 <NA> <NA>
## 26 richpasco.org <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
## 27 saltpalace.com business <NA> <NA> business business and economy <NA> 25 <NA> <NA>
## 28 searchlock.com business <NA> <NA> business proxy avoidance <NA> 6 <NA> <NA>
## 29 securesearch.co business <NA> <NA> business entertainment <NA> 1 <NA> <NA>
## 30 semaltmedia.com business <NA> <NA> business uncategorized <NA> 4 <NA> <NA>
## 31 servicepunt71.nl <NA> <NA> <NA> business and economy business and economy <NA> 1 <NA> <NA>
## 32 shawcable.net business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 33 social-buttons.com <NA> not recommended site <NA> uncategorized uncategorized <NA> 18 <NA> <NA>
## 34 sosodesktop.com <NA> <NA> <NA> information technology information technology <NA> 1 <NA> <NA>
## 35 startjuno.com business <NA> <NA> business news and media <NA> 1 <NA> <NA>
## 36 startnetzero.net business <NA> <NA> business news and media <NA> 1 <NA> <NA>
## 37 success-seo.com business <NA> <NA> business uncategorized <NA> 48 <NA> <NA>
## 38 suddenlink.net portals <NA> <NA> portals search engines and portals <NA> 3 <NA> <NA>
## 39 thegeekspeaks.net <NA> <NA> <NA> uncategorized uncategorized <NA> 2 <NA> <NA>
## 40 toshiba.com business <NA> <NA> business business and economy <NA> 1 <NA> <NA>
## 41 twcc.com <NA> <NA> <NA> entertainment entertainment <NA> 1 <NA> <NA>
## 42 video--production.com <NA> <NA> <NA> uncategorized uncategorized <NA> 3 <NA> <NA>
## 43 videos-for-your-business.com <NA> <NA> <NA> uncategorized uncategorized <NA> 6 <NA> <NA>
## 44 xfinity.com business <NA> <NA> business business and economy <NA> 29 <NA> <NA>
## 45 ygask.com <NA> <NA> <NA> uncategorized uncategorized <NA> 1 <NA> <NA>
After investigating the domains in the list, I found that this filter has only one clear false positive: saltpalace.com, a convention center in Salt Lake City where I attended a conference and from which I visited my web site to make sure that it was up and running. There is really no good way to filter out this false positive using domain classification information at this point.
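Since the classification data cannot separate this visit from spam, one workaround is a small hand-maintained allowlist applied after the automated filters. The sketch below is only an illustration in base R: `knownGoodDomains` and `candidates` are hypothetical names, and the toy data frame borrows three rows from the table above rather than using the real `gaRefSpam5Df`.

```r
# Hypothetical allowlist of domains verified by hand as genuine referrers.
knownGoodDomains <- c("saltpalace.com")

# Toy stand-in for the candidate spam list (values copied from the table above).
candidates <- data.frame(
  domain  = c("100dollars-seo.com", "saltpalace.com", "success-seo.com"),
  attacks = c(5, 25, 48),
  stringsAsFactors = FALSE
)

# Keep only the candidates that are not on the allowlist.
spamOnly <- candidates[!(candidates$domain %in% knownGoodDomains), ]
print(spamOnly$domain)
```

The allowlist has to be maintained by hand, so it only makes sense for the few domains you can personally vouch for.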
#
# Combine with detail to see if there are other characteristics to use for filtering
#
if (!file.exists("./gaRefSpam6Df")) {
  gaRefSpam6Df <- gaRefSpam5Df %>%
    inner_join(gaRefSpamDetailDf, by = c("domain" = "referrerDomain"))
  save(gaRefSpam6Df, file = "./gaRefSpam6Df")
} else {
  load("./gaRefSpam6Df")
}
summary(gaRefSpam6Df)
##  domain             bitdefender        dr_web             alexa
##  Length:76          Length:76          Length:76          Length:76
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##  google             websense                        trendmicro         attacks.x
##  business     :32   business and economy      :37   Length:76          Min.   : 1.00
##  searchengines:12   uncategorized             :13   Class :character   1st Qu.: 1.00
##  uncategorized:10   information technology    : 4   Mode  :character   Median : 6.00
##  education    : 3   search engines and portals: 4                      Mean   :10.62
##  marketing    : 2   proxy avoidance           : 3                      3rd Qu.:24.25
##  (Other)      :16   (Other)                   :14                      Max.   :48.00
##  NA's         : 1   NA's                      : 1
##
##  dmozCat            shallaCat          pagePath           fullReferrer       attacks.y
##  Length:76          Length:76          Length:76          Length:76          Min.   : 1.000
##  Class :character   Class :character   Class :character   Class :character   1st Qu.: 1.000
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 1.000
##                                                                              Mean   : 3.684
##                                                                              3rd Qu.: 3.000
##                                                                              Max.   :48.000
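The attacks.x and attacks.y columns in the summary appear because both joined data frames carry a column named attacks that is not part of the join key, so the join renames them with suffixes. The sketch below reproduces the effect with base R's `merge()`, which uses the same default `.x`/`.y` suffixes. The data frames `perDomain` and `detail` are hypothetical stand-ins, and the per-page counts are made up for illustration.

```r
# Toy per-domain totals (stand-in for gaRefSpam5Df).
perDomain <- data.frame(
  domain  = c("saltpalace.com", "success-seo.com"),
  attacks = c(25, 48),
  stringsAsFactors = FALSE
)

# Toy per-referral detail (stand-in for gaRefSpamDetailDf); counts are invented.
detail <- data.frame(
  referrerDomain = c("saltpalace.com", "saltpalace.com", "success-seo.com"),
  pagePath       = c("/", "/All-Articles", "/"),
  attacks        = c(3, 1, 48),
  stringsAsFactors = FALSE
)

# Inner join on differently named key columns; the clashing "attacks"
# columns come back as attacks.x (from perDomain) and attacks.y (from detail).
joined <- merge(perDomain, detail, by.x = "domain", by.y = "referrerDomain")
print(names(joined))
```

Read this way, attacks.x is the per-domain total repeated on every detail row, while attacks.y is the count for one domain, page and referrer combination.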
gaRefSpam6Df[,c("domain","pagePath","fullReferrer")]
##    domain                       pagePath  fullReferrer
##  1 100dollars-seo.com           /  100dollars-seo.com/try.php
##  2 asana.com                    /Web-Commerce/100dollars-seo-com-referral-spam  app.asana.com/0/11812954602745/44396681995798
##  3 atlassian.net                /Web-Commerce/social-buttons-com-referrer-spam  lmovim.atlassian.net/browse/VDG-1
##  4 atlassian.net                /Web-Commerce/social-buttons-com-referrer-spam  playhousedigital.atlassian.net/browse/TGCF-76
##  5 basecamphq.com               /Web-Commerce/social-buttons-com-referrer-spam  viminteractive.basecamphq.com/projects/10027358-ee-maintenance/posts/92384734/comments
##  6 best-seo-offer.com           /  best-seo-offer.com/try.php
##  7 best-seo-solution.com        /  best-seo-solution.com/try.php
##  8 binarystream.com             /Loan-Pricing/effective-yield-loan-fee-amortization  crm2015.binarystream.com/_controls/emailbody/msgBody.aspx
##  9 buttons-for-website.com      /  buttons-for-website.com/
## 10 buttons-for-your-website.com /  buttons-for-your-website.com/
## 11 clearch.org                  /Personal-and-Small-Business-Technology/using-multiple-virtual-desktops-on-windows-os-x-and-ubuntu  search.clearch.org/
## 12 darodar.com                  /  forum.topic44008047.darodar.com/
## 13 delta-search.com             /  www2.delta-search.com/
## 14 delta-search.com             /about  www2.delta-search.com/
## 15 delta-search.com             /All-Articles  www2.delta-search.com/
## 16 delta-search.com             /contact  www2.delta-search.com/
## 17 delta-search.com             /people  www2.delta-search.com/
## 18 delta-search.com             /Table/Deposit-Pricing/  www2.delta-search.com/
## 19 delta-search.com             /Table/Loan-Pricing/  www2.delta-search.com/
## 20 delta-search.com             /Table/Loan-Pricing/Charge-offs/  www2.delta-search.com/
## 21 delta-search.com             /Table/Open-Source-Software/  www2.delta-search.com/
## 22 delta-search.com             /Table/Operations-and-Information-Technology/Security/  www2.delta-search.com/
## 23 delta-search.com             /Web-Commerce/social-buttons-com-referrer-spam  www2.delta-search.com/
## 24 dnsrsearch.com               /Web-Commerce/100dollars-seo-com-referral-spam  dnsrsearch.com/index_results.php
## 25 financemarketing.com         /Web-Commerce/social-buttons-com-referrer-spam  projects.financemarketing.com/tasks/3362401
## 26 findeer.com                  /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam  search.findeer.com/it/results
## 27 informationvine.com          /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  informationvine.com/index
## 28 isket.jp                     /Web-Commerce/social-buttons-com-referrer-spam  isket.jp/seo/social-buttons-comからのリファラスパムが大量に・・・/
## 29 ixenia.com                   /Web-Commerce/social-buttons-com-referrer-spam  redmine.ixenia.com/issues/1143
## 30 justprofit.xyz               /  justprofit.xyz/
## 31 k9safesearch.com             /Open-Source-Software/r-open-source-statistical-software  k9safesearch.com/search.jsp
## 32 k9safesearch.com             /Personal-and-Small-Business-Technology/using-multiple-virtual-desktops-on-windows-os-x-and-ubuntu  k9safesearch.com/search.jsp
## 33 k9safesearch.com             /Web-Commerce/social-buttons-com-referrer-spam  k9safesearch.com/search.jsp
## 34 larger.io                    /  larger.io/
## 35 locatimefree.com             /Web-Commerce/social-buttons-com-referrer-spam  locatimefree.com/google-analytics-realtime-rapid-increase-access-referrer-spam/
## 36 NA.NA                        /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam  131.253.14.125/bvsandbox.aspx
## 37 rankings-analytics.com       /  rankings-analytics.com/try.php
## 38 rankscanner.com              /  rankscanner.com/Domain/mooresoftwareservices.com
## 39 richpasco.org                /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  richpasco.org/virus/callerid_spoofing.html
## 40 saltpalace.com               /  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 41 saltpalace.com               /  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 42 saltpalace.com               /All-Articles  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 43 saltpalace.com               /Deposit-Pricing/using-time-variable-fees-to-solve-peak-workload-problems  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 44 saltpalace.com               /Table/Deposit-Pricing/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 45 saltpalace.com               /Table/Deposit-Pricing/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 46 saltpalace.com               /Table/Loan-Pricing/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 47 saltpalace.com               /Table/Loan-Pricing/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 48 saltpalace.com               /Table/Operations-and-Information-Technology/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 49 saltpalace.com               /Table/Operations-and-Information-Technology/Web-Commerce/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 50 saltpalace.com               /Table/Personal-and-Small-Business-Technology/  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 51 saltpalace.com               /Web-Commerce/social-buttons-com-referrer-spam  spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 52 searchlock.com               /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  searchlock.com/
## 53 searchlock.com               /Personal-and-Small-Business-Technology/windows-10-upgrade-experience-for-lenovo-w541  searchlock.com/
## 54 searchlock.com               /Web-Commerce/social-buttons-com-referrer-spam  searchlock.com/
## 55 securesearch.co              /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  securesearch.co/search/
## 56 semaltmedia.com              /  semaltmedia.com/
## 57 servicepunt71.nl             /Web-Commerce/social-buttons-com-referrer-spam  webmail.servicepunt71.nl/owa/redir.aspx
## 58 shawcable.net                /Personal-and-Small-Business-Technology/sales-and-lead-management-with-suitecrm  wm-s.glb.shawcable.net/zimbra/mail
## 59 social-buttons.com           /  site34.social-buttons.com/
## 60 sosodesktop.com              /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam  search.sosodesktop.com/search/web
## 61 startjuno.com                /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  startjuno.com/search/index.php
## 62 startnetzero.net             /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  startnetzero.net/search/index.php
## 63 success-seo.com              /  success-seo.com/try.php
## 64 suddenlink.net               /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  home.suddenlink.net/search/index.php
## 65 thegeekspeaks.net            /Web-Commerce/social-buttons-com-referrer-spam  thegeekspeaks.net/social-buttons-com-spams-google-analytics/
## 66 toshiba.com                  /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  home.toshiba.com/search/index.php
## 67 twcc.com                     /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  search.twcc.com/
## 68 video--production.com        /  video--production.com/
## 69 videos-for-your-business.com /  53275950.videos-for-your-business.com/
## 70 videos-for-your-business.com /  videos-for-your-business.com/
## 71 xfinity.com                  /Loan-Pricing/effective-yield-loan-fee-amortization  search.xfinity.com/
## 72 xfinity.com                  /Personal-and-Small-Business-Technology/downloading-and-preparing-ftc-robocall-complaint-list-for-ncid  search.xfinity.com/
## 73 xfinity.com                  /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services  search.xfinity.com/
## 74 xfinity.com                  /Personal-and-Small-Business-Technology/using-ncid-on-two-phone-lines  search.xfinity.com/
## 75 xfinity.com                  /Web-Commerce/why-and-how-to-set-up-ssl-https-on-your-web-site  search.xfinity.com/
## 76 ygask.com                    /Loan-Pricing/effective-yield-loan-fee-amortization  ygask.com/effective-interest-method-of-amortization-k.html
It looks like many of the referral spam domains reference the root page, since this is the only page that is guaranteed to exist on every site. Next we drop every referral to a page path other than /, which keeps only the domains that hit the root page at least once.
#
# Keep only referrals to the root page and count the distinct pages per domain
#
if (!file.exists("./gaRefSpam7Df")) {
  gaRefSpam7Df <- gaRefSpam6Df %>%
    filter(pagePath == "/") %>%
    #group_by(domain,bitdefender,dr_web,alexa,google,websense,trendmicro,dmozCat,shallaCat,attacks.x) %>%
    group_by(domain, dr_web, alexa, trendmicro) %>%
    summarize(numRefPages = n_distinct(pagePath))
  #gaRefSpam7Df <- gaRefSpam7Df %>% filter(numRefPages <= 1) %>%
  #  group_by(domain,bitdefender,dr_web,alexa,google,websense,trendmicro,dmozCat,shallaCat,attacks.x)
  save(gaRefSpam7Df, file = "./gaRefSpam7Df")
} else {
  load("./gaRefSpam7Df")
}
gaRefSpam7Df
## Source: local data frame [17 x 5]
## Groups: domain, dr_web, alexa [?]
##
##    domain                       dr_web               alexa trendmicro numRefPages
##    <chr>                        <chr>                <chr> <chr>      <int>
##  1 100dollars-seo.com           <NA>                 <NA>  <NA>       1
##  2 best-seo-offer.com           <NA>                 <NA>  <NA>       1
##  3 best-seo-solution.com        <NA>                 <NA>  <NA>       1
##  4 buttons-for-website.com      <NA>                 <NA>  <NA>       1
##  5 buttons-for-your-website.com <NA>                 <NA>  <NA>       1
##  6 darodar.com                  <NA>                 <NA>  <NA>       1
##  7 delta-search.com             not recommended site <NA>  <NA>       1
##  8 justprofit.xyz               <NA>                 <NA>  <NA>       1
##  9 larger.io                    <NA>                 <NA>  <NA>       1
## 10 rankings-analytics.com       <NA>                 <NA>  <NA>       1
## 11 rankscanner.com              <NA>                 <NA>  <NA>       1
## 12 saltpalace.com               <NA>                 <NA>  <NA>       1
## 13 semaltmedia.com              <NA>                 <NA>  <NA>       1
## 14 social-buttons.com           not recommended site <NA>  <NA>       1
## 15 success-seo.com              <NA>                 <NA>  <NA>       1
## 16 video--production.com        <NA>                 <NA>  <NA>       1
## 17 videos-for-your-business.com <NA>                 <NA>  <NA>       1
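Because the page path filter runs before the grouping, numRefPages is always 1 here, and a domain such as saltpalace.com or delta-search.com survives as long as one of its referrals hit the root page. A stricter variant, sketched below in base R as an assumption rather than as the method used above, keeps a domain only when every one of its referrals targets /. The `detail` data frame is a hypothetical stand-in holding a few rows from the listing above.

```r
# Toy referral detail; rows are taken from the domain/pagePath listing above.
detail <- data.frame(
  domain   = c("success-seo.com", "saltpalace.com", "saltpalace.com",
               "delta-search.com", "delta-search.com", "larger.io"),
  pagePath = c("/", "/", "/All-Articles", "/", "/about", "/"),
  stringsAsFactors = FALSE
)

# For each domain, TRUE only when all of its referrals point at the root page.
rootOnly <- vapply(
  split(detail$pagePath == "/", detail$domain),
  all,
  logical(1)
)

# Domains whose referrals never touch anything but "/".
rootOnlyDomains <- sort(names(rootOnly)[rootOnly])
print(rootOnlyDomains)
```

This removes the saltpalace.com false positive, but it also removes delta-search.com, so whether the stricter rule is an improvement depends on how you classify that domain.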
Conclusions
Before doing any work in R using Google Analytics data, you must remove all of the referral spam web sites from your data; this can be done easily using the rdomains package. Because the classification data changes, it will be necessary to revisit this script on a regular basis.
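One practical consequence: the scripts above reuse a saved data frame whenever the file exists, so stale classifications are never refreshed unless the file is deleted by hand. A possible remedy, shown here only as a sketch, is to treat a cache file as expired after a set number of days. The helper `cacheIsStale`, the 30-day default and the demo file name are all hypothetical, and the sketch uses `saveRDS()`/`readRDS()` rather than the `save()`/`load()` calls used earlier.

```r
# Hypothetical helper: TRUE when the cache file is missing or older than maxAgeDays.
cacheIsStale <- function(path, maxAgeDays = 30) {
  if (!file.exists(path)) {
    return(TRUE)
  }
  ageDays <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  ageDays > maxAgeDays
}

# Demo path in the session's temporary directory (not one of the article's files).
cacheFile <- file.path(tempdir(), "gaRefSpamDemo.rds")

if (cacheIsStale(cacheFile)) {
  # In a real script, the domain classification lookups would be re-run here.
  demoDf <- data.frame(domain = "success-seo.com", stringsAsFactors = FALSE)
  saveRDS(demoDf, cacheFile)
} else {
  demoDf <- readRDS(cacheFile)
}
```

Swapping the `!file.exists(...)` tests for a check like this would refresh each intermediate result on a schedule instead of relying on manual cleanup.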