Google searches for 'referral spam' increased dramatically in 2015

Identifying Google Analytics Referrer Spam Using R

Before beginning any analysis of Google Analytics data, it is important to clean up the referral lists so that you analyze only actual visits to your web site. Referral spam is a problem that began in about 2014, when webmasters began to notice referrals in Google Analytics that did not appear in their web server access logs: no one had actually visited the site. Referral spam operators randomly guess Google Analytics tracking ID codes and impersonate accesses to a site in the hope that a webmaster reviewing the referral list will visit the spammer's site and download malicious code or purchase a product or service of interest to webmasters. A look at Google Trends shows that this became a major problem in 2015, as shown in Figure 1.

Figure 1. Google Trends shows a dramatic uptick in searches for “referral spam” beginning in 2015.

Within Google Analytics, you can set a flag in the admin area for a view to filter out well-known referral spammers, using filters described in an article by Ben Travis at Viget. For analysis in R, you will not be able to use the view filters in Google Analytics and will have to filter out the referral spammers yourself before you do any analysis. The RGoogleAnalytics and rdomains R packages offer a convenient, programmatic way to analyze Google Analytics referral spam attacks and remove them from other Google Analytics analysis. The article is divided into the following sections:

Retrieve Referrer Data from Google Analytics

The first step is to use the RGoogleAnalytics package to retrieve the referral data from Google Analytics. In the example shown, the OAuth token has been generated and saved once for use in all scripts. The query parameters are fairly general and include the ga:fullReferrer dimension, as shown in Figure 2. Make sure to set up logic that saves the query data and checks for its existence before running the query: the query can take a while, and you may also run up against daily retrieval limits.

Figure 2. Retrieve Google Analytics referrer data.
#
# Retrieve the previously saved OAuth token
#
require(RGoogleAnalytics)
load("~/Consulting_Business/R/working/token_file")
ValidateToken(token)
## Access Token successfully updated
profiles <- GetProfiles(token)
## Access Token is valid
#
# Build a list of all the Query Parameters
#
if (!file.exists("./ga_data_referrer")) {
  query.list <- Init(start.date = "2015-01-12",
                     end.date = "2016-09-30",
                     dimensions = "ga:date,ga:hour,ga:pagePath,ga:sourceMedium,ga:fullReferrer,ga:metro,ga:networkDomain",
                     metrics = "ga:sessions,ga:pageviews,ga:sessionDuration,ga:bounceRate",
                     max.results = 10000,
                     sort = "ga:date,ga:hour",
                     filters = "ga:medium==referral",
                     table.id = paste("ga:",gaProfileID,sep=""))
  ga.query <- QueryBuilder(query.list)

  # Extract the data and store it in a data frame
  ga.data <- GetReportData(ga.query, token)
  save(ga.data,file="./ga_data_referrer")
} else {
  load("./ga_data_referrer")
}
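
The save-and-check pattern in Figure 2 recurs in the later listings, so it can be factored into a small helper. The following is a minimal sketch in base R, not part of the original scripts: the name cached is hypothetical, and it swaps save/load for saveRDS/readRDS so that the cached object is returned rather than restored under its old name.

```r
# Hypothetical helper (not in the original scripts): run `compute` only
# when the file at `path` does not exist yet, otherwise read the saved copy.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Stand-in for a slow Google Analytics query; it counts how often it runs.
query_runs <- 0
slow_query <- function() {
  query_runs <<- query_runs + 1
  data.frame(sessions = c(3, 5))
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, slow_query)  # runs slow_query and writes the file
second <- cached(cache_file, slow_query)  # reads the file, no second query
```

With a helper like this, each of the if/else blocks in the listings could shrink to a single call, at the cost of using a different file format than the save files shown in the article.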

Use the urltools Package to Isolate the Domain Name

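The next step in the process is to isolate the domain name in the referral string, using the domain function from the urltools package together with the tldextract function (the call signature and the domain and tld columns in the listing suggest this comes from the separate tldextract package rather than from urltools). The intent is then to filter for referrals from the root directory of a domain or from try.php, two common paths found in referral spam. In addition, any occurrence of darodar is included in the filter, as this was one of the original referral spam domains. The code to isolate and filter domains is shown in Figure 3; note that in the listing as shown the filter is commented out, so every referrer domain is counted.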

Figure 3. Isolate the domain name from the full referrer using tldextract and the urltools package.
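#
# Load the packages used below (tldextract is assumed to be the
# source of the tldextract() function)
#
require(dplyr)
require(tldextract)
require(urltools)
#
# dplyr does not currently like $domain notation...thus [["domain"]]
#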
gaDf <- ga.data

gaDf <- gaDf %>% mutate(referrerDomain=paste(tldextract(urltools::domain(gaDf$fullReferrer))[["domain"]],
                                             ".",
                                             tldextract(urltools::domain(gaDf$fullReferrer))[["tld"]]
                                             ,sep=""))
#gaRefSpamDf <- gaDf %>% filter(pagePath == "/" | 
#                                 grepl("try.php",fullReferrer) | 
#                                 grepl("darodar",fullReferrer)) %>% 
#                        group_by(referrerDomain) %>% 
#                        summarize(attacks=n())
gaRefSpamDf <- gaDf %>% group_by(referrerDomain) %>%
                        summarize(attacks=n())
gaRefSpamDetailDf <- gaDf %>% group_by(referrerDomain,pagePath,fullReferrer) %>%
                        summarize(attacks=n())
gaRefSpamDf
## # A tibble: 123 × 2
##        referrerDomain attacks
##                 <chr>   <int>
## 1  100dollars-seo.com       5
## 2            1und1.de       1
## 3           aafes.com       2
## 4           alhea.com       3
## 5            alot.com       1
## 6             aol.com       1
## 7           asana.com       1
## 8             ask.com      15
## 9       atlassian.net       2
## 10             b1.org       1
## # ... with 113 more rows
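
The filter that is commented out in Figure 3 can be illustrated on its own. The sketch below uses base R and a handful of made-up rows (the domains are invented for illustration and are not taken from the data above); it applies the same three conditions: landing on the root path, try.php in the referrer, or darodar in the referrer.

```r
# Made-up rows for illustration only; not taken from the article's data.
refs <- data.frame(
  pagePath = c("/", "/blog/post-1", "/about", "/contact"),
  fullReferrer = c("example-seo.test/",
                   "news.example.org/story",
                   "promo.example.net/try.php?u=1",
                   "forum.darodar.com/"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the literal text, so the "." in "try.php"
# is not treated as a regular-expression wildcard.
is_suspect <- refs$pagePath == "/" |
  grepl("try.php", refs$fullReferrer, fixed = TRUE) |
  grepl("darodar", refs$fullReferrer, fixed = TRUE)

suspects <- refs[is_suspect, ]
```

Here only the second row, an ordinary referral to a blog post, falls outside the filter. A rule this broad will also catch legitimate referrals to the home page, which is one reason to check the domains against classification services afterward.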

Use rdomains Package to Look up Domains on Dmoz and Shallalist, Two Domain Classification Sites

The next step in the process is to use the rdomains package to look up the domains on various domain classification sites, beginning with dmoz and shallalist. The get_dmoz_data and get_shalla_data functions each download a data set that can then be used by the dmoz_cat and shalla_cat calls. Both calls take a vector of domain names and return a data frame with classification information, as shown in the example given in Figure 4. For purposes of identifying referrer spam domains, we will look only at domains that neither dmoz nor shallalist classifies.

Figure 4. Retrieving domain classification data from dmoz.org and shallalist.de.
#
# Retrieve dmoz and shallalist catalogs
#
require(rdomains)
## Loading required package: rdomains
if (!file.exists("./dmoz_domain_category.csv")) {
  get_dmoz_data(outdir = "./", overwrite = FALSE)
}
if (!file.exists("./shalla_domain_cateory.csv") && !file.exists("./shalla_domain_category.csv")) {
  get_shalla_data(outdir = "./", overwrite = FALSE)
}
#
# Query the dmoz catalog
#
if (!file.exists("./gaRefSpam1Df")) {
  gaRefSpam1Df <-
    gaRefSpamDf %>% mutate(dmozCat = dmoz_cat(gaRefSpamDf$referrerDomain, use_file = "dmoz_domain_category.csv")$dmoz_category) %>% filter(is.na(dmozCat))
  save(gaRefSpam1Df, file="./gaRefSpam1Df")
} else {
  load("./gaRefSpam1Df")
}
#
# Query the shallalist catalog
#
if (!file.exists("./gaRefSpam2Df")) {
  gaRefSpam2Df <-
    gaRefSpam1Df %>% mutate(shallaCat = shalla_cat(domains = referrerDomain)$shalla_category) %>% filter(is.na(shallaCat))
  save(gaRefSpam2Df, file="./gaRefSpam2Df")
} else {
  load("./gaRefSpam2Df")
}
summary(gaRefSpam2Df)
##  referrerDomain        attacks          dmozCat           shallaCat        
##  Length:69          Min.   :   1.00   Length:69          Length:69         
##  Class :character   1st Qu.:   1.00   Class :character   Class :character  
##  Mode  :character   Median :   1.00   Mode  :character   Mode  :character  
##                     Mean   :  29.61                                        
##                     3rd Qu.:   5.00                                        
##                     Max.   :1720.00
gaRefSpam2Df
## # A tibble: 69 × 4
##                  referrerDomain attacks dmozCat shallaCat
##                           <chr>   <int>   <chr>     <chr>
## 1            100dollars-seo.com       5    <NA>      <NA>
## 2                      alot.com       1    <NA>      <NA>
## 3                     asana.com       1    <NA>      <NA>
## 4                 atlassian.net       2    <NA>      <NA>
## 5                basecamphq.com       1    <NA>      <NA>
## 6            best-seo-offer.com      10    <NA>      <NA>
## 7         best-seo-solution.com       7    <NA>      <NA>
## 8              binarystream.com       1    <NA>      <NA>
## 9       buttons-for-website.com       1    <NA>      <NA>
## 10 buttons-for-your-website.com       6    <NA>      <NA>
## # ... with 59 more rows
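
The two filter(is.na(...)) steps in Figure 4 amount to keeping only the domains that appear in neither catalog. A minimal base R sketch of that idea follows; the domain names and the two "known" vectors are made up and stand in for the downloaded dmoz and shallalist data.

```r
# Made-up domains and catalog contents, for illustration only.
domains      <- c("example-spam.test", "search.example",
                  "shop.example", "unknown.example")
dmoz_known   <- c("search.example")  # stand-in for domains dmoz classifies
shalla_known <- c("shop.example")    # stand-in for domains shallalist classifies

# Keep a domain only when neither catalog has an entry for it.
unclassified <- domains[!(domains %in% dmoz_known) &
                        !(domains %in% shalla_known)]
```

In this toy example, unclassified holds example-spam.test and unknown.example. Being unclassified is only a hint: obscure but legitimate sites are also missing from both catalogs.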

This list of domains still includes several that are clearly legitimate by inspection.

Look up Referrer Domain Names on Virustotal, a Domain Classification Site

As shown in the output of the code in Figure 4, we still have, by inspection, a few domains that are known to be legitimate. To filter these out, we will go to the Virustotal service for further classification. Virustotal works somewhat differently from the other services: you must have an account and an API key, both of which are free. Calls to Virustotal are limited to four per minute, so the rdomains interface works a little differently; you cannot send a vector of domain names to the virustotal_cat call. To process a group of domains, you will need to write a function similar to the one shown in Figure 5.
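
Before the full listing, the pacing logic can be looked at in isolation. The following is a hedged sketch in base R, not the article's function: lookup_all and fake_lookup are hypothetical names, the lookup is a stub rather than a real Virustotal call, and the tryCatch is an addition so that one failed lookup does not abort a long run.

```r
# Hypothetical rate-limited loop: call `lookup` once per domain and pause
# `delay_seconds` between calls (15 seconds gives four calls per minute).
lookup_all <- function(domains, lookup, delay_seconds = 15) {
  rows <- list()
  for (i in seq_along(domains)) {
    if (i > 1) Sys.sleep(delay_seconds)  # no need to wait before the first call
    row <- tryCatch(lookup(domains[i]), error = function(e) NULL)
    if (!is.null(row)) {
      rows[[length(rows) + 1]] <- row    # keep only successful lookups
    }
  }
  do.call(rbind, rows)
}

# Stub standing in for the real API call; it fails for one made-up domain.
fake_lookup <- function(domain) {
  if (domain == "broken.example") stop("simulated API failure")
  data.frame(domain = domain, category = "uncategorized",
             stringsAsFactors = FALSE)
}

out <- lookup_all(c("a.example", "broken.example", "b.example"),
                  fake_lookup, delay_seconds = 0)
```

At 15 seconds per call, the 123 referrer domains found earlier take roughly half an hour to process, which is another reason to save the result to disk as Figure 5 does.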

Figure 5. Calls to Virustotal must be rate limited to no more than four per minute.
#
# Write a function to query Virustotal for a vector of domain
# names and limit the query rate to four per minute
#
getVirustotal <- function(domainDf, VirustotalApiKey) {
  require(rdomains)
  require(dplyr)
  # Start with an empty data frame that has the expected columns
  virusDomain <- data.frame(domain = character(),
                            bitdefender = character(),
                            dr_web = character(),
                            alexa = character(),
                            google = character(),
                            websense = character(),
                            trendmicro = character(),
                            stringsAsFactors = FALSE)
  # domainDf is a character vector of domain names
  for (i in seq_along(domainDf)) {
    # Wait 15 seconds before each call: at most four calls per minute
    Sys.sleep(15)
    # R is case sensitive; the rdomains function is virustotal_cat
    thisDomain <- virustotal_cat(domainDf[i], apikey = VirustotalApiKey)
    # Accumulate results; all = TRUE keeps the rows from both data frames
    virusDomain <- merge(virusDomain, thisDomain, all = TRUE)
  }
  return(virusDomain)
}
#
# Call the function to get the Virustotal info for all domains
#
if (!file.exists("./gaRefSpam3Df")) {
  gaRefSpam3Df <- getVirustotal(gaRefSpamDf$referrerDomain,VirustotalApiKey)
  save(gaRefSpam3Df,file="./gaRefSpam3Df")
} else {
  load("./gaRefSpam3Df")
}
gaRefSpam3Df
##                           domain          bitdefender                             dr_web                    alexa                     google                            websense                                  trendmicro
## 1             100dollars-seo.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 2                       1und1.de              hosting                               <NA>                 anbieter                    hosting              information technology                          computers internet
## 3                      aafes.com           onlineshop                               <NA>                 military                 onlineshop                            shopping                                        <NA>
## 4                      alhea.com        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                                        <NA>
## 5                       alot.com             business                               <NA>                 toolbars                   business              society and lifestyles                                        <NA>
## 6                        aol.com computersandsoftware                               <NA>              web_portals       computersandsoftware          search engines and portals           search engines portals,news media
## 7                      asana.com             business                               <NA>                     <NA>                   business        hosted business applications                                        <NA>
## 8                        ask.com        searchengines                               <NA>                      ask              searchengines          search engines and portals                      search engines portals
## 9                  atlassian.net            marketing                               <NA>                     <NA>                  marketing               educational materials                                        <NA>
## 10                        b1.org computersandsoftware               not recommended site                     <NA>       computersandsoftware              information technology                                        <NA>
## 11                  basecamp.com             business                               <NA>                   hosted                   business                   web collaboration                          computers internet
## 12                basecamphq.com             business                               <NA>                     <NA>                   business                   web collaboration                                        <NA>
## 13            best-seo-offer.com                 <NA>                               <NA>                     <NA>          elevated exposure                   elevated exposure                                        <NA>
## 14         best-seo-solution.com               parked                               <NA>                     <NA>                     parked                       uncategorized                                        <NA>
## 15              binarystream.com                 <NA>               not recommended site                     <NA>     information technology              information technology                                        <NA>
## 16                      bing.com        searchengines                               <NA>                     bing              searchengines          search engines and portals                      search engines portals
## 17                        bt.com             business                               <NA>                 carriers                   business                business and economy                            business economy
## 18       buttons-for-website.com                 <NA>                               <NA>                     <NA>   suspicious embedded link            suspicious embedded link                                        <NA>
## 19  buttons-for-your-website.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 20               centurylink.com             business                               <NA>            united_states                   business                business and economy                          computers internet
## 21               centurylink.net              portals                               <NA>                     <NA>                    portals                      news and media                      search engines portals
## 22                   charter.net             business                               <NA>     business_and_economy                   business                      news and media                                  news media
## 23            cincinnatibell.net             business                               <NA>         public_utilities                   business          search engines and portals                                        <NA>
## 24                   clearch.org             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 25                 cognizant.com             business                               <NA>                        c                   business              information technology                                        <NA>
## 26                   comcast.net                 news                               <NA>                     <NA>                       news                business and economy                                  news media
## 27                       cox.com           onlineshop                               <NA>                operators                 onlineshop                business and economy                            business economy
## 28           crazyguyonabike.com               sports                               <NA>              travelogues                     sports              society and lifestyles                                        <NA>
## 29                   darodar.com               parked                               <NA>                     <NA>                     parked                  suspicious content                                        <NA>
## 30              delta-search.com        searchengines               not recommended site                     <NA>              searchengines                business and economy                                        <NA>
## 31                      desk.com             business                               <NA>                     saas                   business        hosted business applications                    blogs web communications
## 32                     diigo.com computersandsoftware                    social networks                     <NA>       computersandsoftware personal network storage and backup                                        <NA>
## 33                 disconnect.me            education                               <NA>                     <NA>                  education                     proxy avoidance                                     unknown
## 34                    disqus.com computersandsoftware                               <NA>                     <NA>       computersandsoftware              information technology         blogs web communications,newsgroups
## 35                dnsrsearch.com                 <NA>                               <NA>                     <NA> search engines and portals          search engines and portals                                        <NA>
## 36                   dogpile.com        searchengines not recommended site/adult content               metasearch              searchengines          search engines and portals                                        <NA>
## 37                duckduckgo.com        searchengines                               <NA>           search_engines              searchengines          search engines and portals                      search engines portals
## 38                 earthlink.net                 bank                             e-mail            united_states                       bank              information technology                                        <NA>
## 39                    ecosia.org        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                      search engines portals
## 40                 emailsrvr.com              webmail                               <NA>                     <NA>                    webmail                         web hosting                                       email
## 41                  evernote.com computersandsoftware                               <NA>                 software       computersandsoftware personal network storage and backup computers internet,personal network storage
## 42                  facebook.com       socialnetworks                    social networks    we_are_the_99_percent             socialnetworks               social web - facebook                           social networking
## 43          financemarketing.com computersandsoftware                               <NA>                     <NA>       computersandsoftware         financial data and services                                        <NA>
## 44                   findeer.com        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                                        <NA>
## 45                   godaddy.com            marketing                               <NA>                        g                  marketing                         web hosting                                 web hosting
## 46                     google.by        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                      search engines portals
## 47                     google.ca        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                      search engines portals
## 48                  google.co.id        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                      search engines portals
## 49                  google.co.jp        searchengines                               <NA>     ガイドとディレクトリ              searchengines          search engines and portals            search engines portals,reference
## 50                    google.com        searchengines                              chats                   google              searchengines          search engines and portals                      search engines portals
## 51                 google.com.au        searchengines                               <NA>           search_engines              searchengines          search engines and portals                      search engines portals
## 52                 google.com.kw        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                                        <NA>
## 53                     google.cz        searchengines                               <NA>                   google              searchengines          search engines and portals                      search engines portals
## 54                     google.de        searchengines                               <NA>                   google              searchengines          search engines and portals                                        <NA>
## 55                     google.fr computersandsoftware                               <NA>                   google       computersandsoftware          search engines and portals                      search engines portals
## 56                     google.it        searchengines                               <NA>                   motori              searchengines          search engines and portals                      search engines portals
## 57                     google.nl        searchengines                               <NA>                   google              searchengines          search engines and portals                      search engines portals
## 58                 hootsuite.com       socialnetworks                               <NA>                  twitter             socialnetworks                   social networking                           social networking
## 59                       hud.gov             business                               <NA>                     home                   business                          government                            government legal
## 60                      info.com        searchengines                               <NA>               metasearch              searchengines          search engines and portals                      search engines portals
## 61           informationvine.com                 <NA>                               <NA>                     <NA> search engines and portals          search engines and portals                                        <NA>
## 62                      isket.jp                blogs                               <NA>                     <NA>                      blogs              information technology                                        <NA>
## 63                    ixenia.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 64                   ixquick.com        searchengines                               <NA>               metasearch              searchengines          search engines and portals                      search engines portals
## 65                    ixquick.de             business                               <NA>                     <NA>                   business          search engines and portals                      search engines portals
## 66                justprofit.xyz             business                               <NA>                     <NA>                   business                   elevated exposure                                        <NA>
## 67              k9safesearch.com            education                               <NA>                     <NA>                  education                business and economy                                        <NA>
## 68                     larger.io                 <NA>                               <NA>                     <NA>       business and economy                business and economy                                        <NA>
## 69                  linkedin.com       socialnetworks                    social networks        social_networking             socialnetworks               social web - linkedin          social networking,business economy
## 70                      live.com              webmail                               <NA>                 internet                    webmail          search engines and portals                search engines portals,email
## 71              locatimefree.com             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 72                    meetup.com       socialnetworks      adult content/social networks        social_networking             socialnetworks                   social networking          social networking,business economy
## 73       microsofttranslator.com            education                               <NA>    traductors_automàtics                  education                 reference materials                    translators cached pages
## 74                       moz.com computersandsoftware                               <NA>                     <NA>       computersandsoftware              information technology                          computers internet
## 75                         NA.NA                 <NA>                               <NA>                     <NA>                       <NA>                                <NA>                                        <NA>
## 76                  nextdoor.com       socialnetworks                    social networks                     <NA>             socialnetworks                   social networking                                        <NA>
## 77                    obrazky.cz            marketing                               <NA>                   služby                  marketing          search engines and portals                                        <NA>
## 78                 office365.com computersandsoftware                               <NA>                     <NA>       computersandsoftware              collaboration - office                          computers internet
## 79                    office.com computersandsoftware                               <NA>                groupware       computersandsoftware              collaboration - office                            business economy
## 80                       pch.com             gambling                               <NA> contests_and_sweepstakes                   gambling                               games                                        <NA>
## 81                  peoplepc.com computersandsoftware                               <NA>            united_states       computersandsoftware              information technology                                        <NA>
## 82                pushbullet.com            marketing                               <NA>                     <NA>                  marketing              information technology                         disease vector,spam
## 83                     qwant.com computersandsoftware                               <NA>     moteurs_de_recherche       computersandsoftware          search engines and portals                                        <NA>
## 84        rankings-analytics.com                 <NA>                               <NA>                     <NA>         suspicious content                  suspicious content                                        <NA>
## 85               rankscanner.com                blogs                               <NA>                     <NA>                      blogs              information technology                                        <NA>
## 86                 richpasco.org                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 87                       rof.net             business                               <NA>                     <NA>                   business              information technology                                        <NA>
## 88                salesforce.com computersandsoftware                               <NA>       contact_management       computersandsoftware        hosted business applications                            business economy
## 89                saltpalace.com             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 90                searchlock.com             business                               <NA>                     <NA>                   business                     proxy avoidance                                        <NA>
## 91               securesearch.co             business                               <NA>                     <NA>                   business                       entertainment                                        <NA>
## 92               semaltmedia.com             business                               <NA>                     <NA>                   business                       uncategorized                                        <NA>
## 93              servicepunt71.nl                 <NA>                               <NA>                     <NA>       business and economy                business and economy                                        <NA>
## 94                     seznam.cz        searchengines                               <NA>                  portály              searchengines          search engines and portals                      search engines portals
## 95                 shawcable.net             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 96                   smarter.com           onlineshop                               <NA>                     <NA>                 onlineshop                            shopping                                        <NA>
## 97            social-buttons.com                 <NA>               not recommended site                     <NA>              uncategorized                       uncategorized                                        <NA>
## 98               sosodesktop.com                 <NA>                               <NA>                     <NA>     information technology              information technology                                        <NA>
## 99             stackoverflow.com computersandsoftware                               <NA>         chats_and_forums       computersandsoftware              information technology                          computers internet
## 100                startjuno.com             business                               <NA>                     <NA>                   business                      news and media                                        <NA>
## 101             startnetzero.net             business                               <NA>                     <NA>                   business                      news and media                                        <NA>
## 102                startpage.com        searchengines                               <NA>                     <NA>              searchengines          search engines and portals                      search engines portals
## 103                 startssl.com computersandsoftware                               <NA>                     <NA>       computersandsoftware                business and economy                     internet infrastructure
## 104              success-seo.com             business                               <NA>                     <NA>                   business                       uncategorized                                        <NA>
## 105               suddenlink.net              portals                               <NA>                     <NA>                    portals          search engines and portals                                        <NA>
## 106                         t.co computersandsoftware               not recommended site                     <NA>       computersandsoftware              information technology                           social networking
## 107                      tds.net            education                               <NA>                     <NA>                  education              information technology                                  news media
## 108               telstra.com.au             business                               <NA>                 carriers                   business                business and economy                            business economy
## 109            thegeekspeaks.net                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 110                  toshiba.com             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 111                     twcc.com                 <NA>                               <NA>                     <NA>              entertainment                       entertainment                                        <NA>
## 112        video--production.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 113 videos-for-your-business.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 114               webcrawler.com        searchengines                               <NA>               metasearch              searchengines          search engines and portals                                        <NA>
## 115                       web.de              portals                               <NA>  startseiten_und_portale                    portals          search engines and portals                      search engines portals
## 116        webmastercentre.co.uk             business                               <NA>                     <NA>                   business              information technology                                        <NA>
## 117               windstream.net             business                               <NA>                     <NA>                   business          search engines and portals                      search engines portals
## 118                      wow.com                games                               <NA>                     <NA>                      games          search engines and portals                                        <NA>
## 119                   wowway.net             business                               <NA>                     <NA>                   business              information technology                          computers internet
## 120                  xfinity.com             business                               <NA>                     <NA>                   business                business and economy                                        <NA>
## 121                    yahoo.com                 news                               <NA>              web_portals                       news          search engines and portals                      search engines portals
## 122                    ygask.com                 <NA>                               <NA>                     <NA>              uncategorized                       uncategorized                                        <NA>
## 123                  zendesk.com computersandsoftware                               <NA>                     saas       computersandsoftware        hosted business applications                            business economy

We now have a list where we can clearly identify the referral spammers using only the domain classification services.

Filter the Referrer List to Identify Referral Spammers

To filter down to the final list of referral spammers, we will use dplyr to keep only the domains that Dr. Web either does not classify at all (NA) or flags as "not recommended site" or "known infection source."

#
# Filter on characteristics of known referral spam domains
#
if (!file.exists("./gaRefSpam4Df")) {
  gaRefSpam4Df <- gaRefSpam3Df %>%
    inner_join(gaRefSpam2Df, by = c("domain" = "referrerDomain")) %>%
    filter(is.na(dr_web) |
           dr_web == "not recommended site" |
           dr_web == "known infection source")
  save(gaRefSpam4Df, file = "./gaRefSpam4Df")
} else {
  load("./gaRefSpam4Df")
}
gaRefSpam4Df
##                          domain          bitdefender               dr_web                 alexa                     google                            websense                                  trendmicro attacks dmozCat shallaCat
## 1            100dollars-seo.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       5    <NA>      <NA>
## 2                      alot.com             business                 <NA>              toolbars                   business              society and lifestyles                                        <NA>       1    <NA>      <NA>
## 3                     asana.com             business                 <NA>                  <NA>                   business        hosted business applications                                        <NA>       1    <NA>      <NA>
## 4                 atlassian.net            marketing                 <NA>                  <NA>                  marketing               educational materials                                        <NA>       2    <NA>      <NA>
## 5                basecamphq.com             business                 <NA>                  <NA>                   business                   web collaboration                                        <NA>       1    <NA>      <NA>
## 6            best-seo-offer.com                 <NA>                 <NA>                  <NA>          elevated exposure                   elevated exposure                                        <NA>      10    <NA>      <NA>
## 7         best-seo-solution.com               parked                 <NA>                  <NA>                     parked                       uncategorized                                        <NA>       7    <NA>      <NA>
## 8              binarystream.com                 <NA> not recommended site                  <NA>     information technology              information technology                                        <NA>       1    <NA>      <NA>
## 9       buttons-for-website.com                 <NA>                 <NA>                  <NA>   suspicious embedded link            suspicious embedded link                                        <NA>       1    <NA>      <NA>
## 10 buttons-for-your-website.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       6    <NA>      <NA>
## 11              centurylink.net              portals                 <NA>                  <NA>                    portals                      news and media                      search engines portals       6    <NA>      <NA>
## 12           cincinnatibell.net             business                 <NA>      public_utilities                   business          search engines and portals                                        <NA>       1    <NA>      <NA>
## 13                  clearch.org             business                 <NA>                  <NA>                   business                business and economy                                        <NA>       1    <NA>      <NA>
## 14                cognizant.com             business                 <NA>                     c                   business              information technology                                        <NA>       1    <NA>      <NA>
## 15                  darodar.com               parked                 <NA>                  <NA>                     parked                  suspicious content                                        <NA>       2    <NA>      <NA>
## 16             delta-search.com        searchengines not recommended site                  <NA>              searchengines                business and economy                                        <NA>      11    <NA>      <NA>
## 17                     desk.com             business                 <NA>                  saas                   business        hosted business applications                    blogs web communications       1    <NA>      <NA>
## 18                disconnect.me            education                 <NA>                  <NA>                  education                     proxy avoidance                                     unknown       2    <NA>      <NA>
## 19               dnsrsearch.com                 <NA>                 <NA>                  <NA> search engines and portals          search engines and portals                                        <NA>       1    <NA>      <NA>
## 20                   ecosia.org        searchengines                 <NA>                  <NA>              searchengines          search engines and portals                      search engines portals       5    <NA>      <NA>
## 21                 evernote.com computersandsoftware                 <NA>              software       computersandsoftware personal network storage and backup computers internet,personal network storage       1    <NA>      <NA>
## 22         financemarketing.com computersandsoftware                 <NA>                  <NA>       computersandsoftware         financial data and services                                        <NA>       1    <NA>      <NA>
## 23                  findeer.com        searchengines                 <NA>                  <NA>              searchengines          search engines and portals                                        <NA>       1    <NA>      <NA>
## 24                      hud.gov             business                 <NA>                  home                   business                          government                            government legal       2    <NA>      <NA>
## 25          informationvine.com                 <NA>                 <NA>                  <NA> search engines and portals          search engines and portals                                        <NA>       1    <NA>      <NA>
## 26                     isket.jp                blogs                 <NA>                  <NA>                      blogs              information technology                                        <NA>       7    <NA>      <NA>
## 27                   ixenia.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       1    <NA>      <NA>
## 28               justprofit.xyz             business                 <NA>                  <NA>                   business                   elevated exposure                                        <NA>       2    <NA>      <NA>
## 29             k9safesearch.com            education                 <NA>                  <NA>                  education                business and economy                                        <NA>       3    <NA>      <NA>
## 30                    larger.io                 <NA>                 <NA>                  <NA>       business and economy                business and economy                                        <NA>       3    <NA>      <NA>
## 31                     live.com              webmail                 <NA>              internet                    webmail          search engines and portals                search engines portals,email       1    <NA>      <NA>
## 32             locatimefree.com             business                 <NA>                  <NA>                   business                business and economy                                        <NA>      30    <NA>      <NA>
## 33      microsofttranslator.com            education                 <NA> traductors_automàtics                  education                 reference materials                    translators cached pages       1    <NA>      <NA>
## 34                      moz.com computersandsoftware                 <NA>                  <NA>       computersandsoftware              information technology                          computers internet    1720    <NA>      <NA>
## 35                        NA.NA                 <NA>                 <NA>                  <NA>                       <NA>                                <NA>                                        <NA>       1    <NA>      <NA>
## 36                   obrazky.cz            marketing                 <NA>                služby                  marketing          search engines and portals                                        <NA>       1    <NA>      <NA>
## 37                office365.com computersandsoftware                 <NA>                  <NA>       computersandsoftware              collaboration - office                          computers internet       1    <NA>      <NA>
## 38               pushbullet.com            marketing                 <NA>                  <NA>                  marketing              information technology                         disease vector,spam       1    <NA>      <NA>
## 39                    qwant.com computersandsoftware                 <NA>  moteurs_de_recherche       computersandsoftware          search engines and portals                                        <NA>       1    <NA>      <NA>
## 40       rankings-analytics.com                 <NA>                 <NA>                  <NA>         suspicious content                  suspicious content                                        <NA>       3    <NA>      <NA>
## 41              rankscanner.com                blogs                 <NA>                  <NA>                      blogs              information technology                                        <NA>      24    <NA>      <NA>
## 42                richpasco.org                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       1    <NA>      <NA>
## 43               salesforce.com computersandsoftware                 <NA>    contact_management       computersandsoftware        hosted business applications                            business economy       2    <NA>      <NA>
## 44               saltpalace.com             business                 <NA>                  <NA>                   business                business and economy                                        <NA>      25    <NA>      <NA>
## 45               searchlock.com             business                 <NA>                  <NA>                   business                     proxy avoidance                                        <NA>       6    <NA>      <NA>
## 46              securesearch.co             business                 <NA>                  <NA>                   business                       entertainment                                        <NA>       1    <NA>      <NA>
## 47              semaltmedia.com             business                 <NA>                  <NA>                   business                       uncategorized                                        <NA>       4    <NA>      <NA>
## 48             servicepunt71.nl                 <NA>                 <NA>                  <NA>       business and economy                business and economy                                        <NA>       1    <NA>      <NA>
## 49                    seznam.cz        searchengines                 <NA>               portály              searchengines          search engines and portals                      search engines portals       1    <NA>      <NA>
## 50                shawcable.net             business                 <NA>                  <NA>                   business                business and economy                                        <NA>       1    <NA>      <NA>
## 51           social-buttons.com                 <NA> not recommended site                  <NA>              uncategorized                       uncategorized                                        <NA>      18    <NA>      <NA>
## 52              sosodesktop.com                 <NA>                 <NA>                  <NA>     information technology              information technology                                        <NA>       1    <NA>      <NA>
## 53                startjuno.com             business                 <NA>                  <NA>                   business                      news and media                                        <NA>       1    <NA>      <NA>
## 54             startnetzero.net             business                 <NA>                  <NA>                   business                      news and media                                        <NA>       1    <NA>      <NA>
## 55                 startssl.com computersandsoftware                 <NA>                  <NA>       computersandsoftware                business and economy                     internet infrastructure       1    <NA>      <NA>
## 56              success-seo.com             business                 <NA>                  <NA>                   business                       uncategorized                                        <NA>      48    <NA>      <NA>
## 57               suddenlink.net              portals                 <NA>                  <NA>                    portals          search engines and portals                                        <NA>       3    <NA>      <NA>
## 58                      tds.net            education                 <NA>                  <NA>                  education              information technology                                  news media       1    <NA>      <NA>
## 59               telstra.com.au             business                 <NA>              carriers                   business                business and economy                            business economy       3    <NA>      <NA>
## 60            thegeekspeaks.net                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       2    <NA>      <NA>
## 61                  toshiba.com             business                 <NA>                  <NA>                   business                business and economy                                        <NA>       1    <NA>      <NA>
## 62                     twcc.com                 <NA>                 <NA>                  <NA>              entertainment                       entertainment                                        <NA>       1    <NA>      <NA>
## 63        video--production.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       3    <NA>      <NA>
## 64 videos-for-your-business.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       6    <NA>      <NA>
## 65               windstream.net             business                 <NA>                  <NA>                   business          search engines and portals                      search engines portals       2    <NA>      <NA>
## 66                  xfinity.com             business                 <NA>                  <NA>                   business                business and economy                                        <NA>      29    <NA>      <NA>
## 67                    ygask.com                 <NA>                 <NA>                  <NA>              uncategorized                       uncategorized                                        <NA>       1    <NA>      <NA>
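
The explicit is.na(dr_web) test in the filter above matters because dplyr's filter() keeps only rows where the condition evaluates to TRUE; a comparison against NA yields NA, so the row is dropped. The following is a minimal sketch on a made-up data frame (the toyDf name and its .example domains are hypothetical and not part of the analysis):

```r
library(dplyr)

# Hypothetical stand-in for the joined classification table
toyDf <- data.frame(
  domain = c("spam-a.example", "spam-b.example",
             "shop.example", "infected.example"),
  dr_web = c(NA, "not recommended site",
             "onlineshop", "known infection source"),
  stringsAsFactors = FALSE
)

# Equality tests alone: the NA row is dropped, because NA == "..." is NA
flaggedOnly <- toyDf %>%
  filter(dr_web == "not recommended site" |
         dr_web == "known infection source")

# Adding is.na() keeps the domains that Dr. Web has not classified
withUnclassified <- toyDf %>%
  filter(is.na(dr_web) |
         dr_web == "not recommended site" |
         dr_web == "known infection source")

nrow(flaggedOnly)
nrow(withUnclassified)
```

Keeping the unclassified rows is what makes the list so long at this stage: most referral spam domains have no Dr. Web classification, but neither do many legitimate ones.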

Filtering on the Dr. Web classification alone still leaves at least two clear false positives: startssl.com and moz.com. It is surprising that Dr. Web does not classify these two, but since each is classified by Trend Micro or Alexa, we can add a filter that keeps only the domains that neither service classifies:

#
# Keep only domains that neither Trend Micro nor Alexa classifies
#
if (!file.exists("./gaRefSpam5Df")) {
  gaRefSpam5Df <- gaRefSpam4Df %>% filter(is.na(trendmicro) & is.na(alexa))
  save(gaRefSpam5Df,file="./gaRefSpam5Df")
} else {
  load("./gaRefSpam5Df")
}
gaRefSpam5Df
##                          domain          bitdefender               dr_web alexa                     google                     websense trendmicro attacks dmozCat shallaCat
## 1            100dollars-seo.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       5    <NA>      <NA>
## 2                     asana.com             business                 <NA>  <NA>                   business hosted business applications       <NA>       1    <NA>      <NA>
## 3                 atlassian.net            marketing                 <NA>  <NA>                  marketing        educational materials       <NA>       2    <NA>      <NA>
## 4                basecamphq.com             business                 <NA>  <NA>                   business            web collaboration       <NA>       1    <NA>      <NA>
## 5            best-seo-offer.com                 <NA>                 <NA>  <NA>          elevated exposure            elevated exposure       <NA>      10    <NA>      <NA>
## 6         best-seo-solution.com               parked                 <NA>  <NA>                     parked                uncategorized       <NA>       7    <NA>      <NA>
## 7              binarystream.com                 <NA> not recommended site  <NA>     information technology       information technology       <NA>       1    <NA>      <NA>
## 8       buttons-for-website.com                 <NA>                 <NA>  <NA>   suspicious embedded link     suspicious embedded link       <NA>       1    <NA>      <NA>
## 9  buttons-for-your-website.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       6    <NA>      <NA>
## 10                  clearch.org             business                 <NA>  <NA>                   business         business and economy       <NA>       1    <NA>      <NA>
## 11                  darodar.com               parked                 <NA>  <NA>                     parked           suspicious content       <NA>       2    <NA>      <NA>
## 12             delta-search.com        searchengines not recommended site  <NA>              searchengines         business and economy       <NA>      11    <NA>      <NA>
## 13               dnsrsearch.com                 <NA>                 <NA>  <NA> search engines and portals   search engines and portals       <NA>       1    <NA>      <NA>
## 14         financemarketing.com computersandsoftware                 <NA>  <NA>       computersandsoftware  financial data and services       <NA>       1    <NA>      <NA>
## 15                  findeer.com        searchengines                 <NA>  <NA>              searchengines   search engines and portals       <NA>       1    <NA>      <NA>
## 16          informationvine.com                 <NA>                 <NA>  <NA> search engines and portals   search engines and portals       <NA>       1    <NA>      <NA>
## 17                     isket.jp                blogs                 <NA>  <NA>                      blogs       information technology       <NA>       7    <NA>      <NA>
## 18                   ixenia.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       1    <NA>      <NA>
## 19               justprofit.xyz             business                 <NA>  <NA>                   business            elevated exposure       <NA>       2    <NA>      <NA>
## 20             k9safesearch.com            education                 <NA>  <NA>                  education         business and economy       <NA>       3    <NA>      <NA>
## 21                    larger.io                 <NA>                 <NA>  <NA>       business and economy         business and economy       <NA>       3    <NA>      <NA>
## 22             locatimefree.com             business                 <NA>  <NA>                   business         business and economy       <NA>      30    <NA>      <NA>
## 23                        NA.NA                 <NA>                 <NA>  <NA>                       <NA>                         <NA>       <NA>       1    <NA>      <NA>
## 24       rankings-analytics.com                 <NA>                 <NA>  <NA>         suspicious content           suspicious content       <NA>       3    <NA>      <NA>
## 25              rankscanner.com                blogs                 <NA>  <NA>                      blogs       information technology       <NA>      24    <NA>      <NA>
## 26                richpasco.org                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       1    <NA>      <NA>
## 27               saltpalace.com             business                 <NA>  <NA>                   business         business and economy       <NA>      25    <NA>      <NA>
## 28               searchlock.com             business                 <NA>  <NA>                   business              proxy avoidance       <NA>       6    <NA>      <NA>
## 29              securesearch.co             business                 <NA>  <NA>                   business                entertainment       <NA>       1    <NA>      <NA>
## 30              semaltmedia.com             business                 <NA>  <NA>                   business                uncategorized       <NA>       4    <NA>      <NA>
## 31             servicepunt71.nl                 <NA>                 <NA>  <NA>       business and economy         business and economy       <NA>       1    <NA>      <NA>
## 32                shawcable.net             business                 <NA>  <NA>                   business         business and economy       <NA>       1    <NA>      <NA>
## 33           social-buttons.com                 <NA> not recommended site  <NA>              uncategorized                uncategorized       <NA>      18    <NA>      <NA>
## 34              sosodesktop.com                 <NA>                 <NA>  <NA>     information technology       information technology       <NA>       1    <NA>      <NA>
## 35                startjuno.com             business                 <NA>  <NA>                   business               news and media       <NA>       1    <NA>      <NA>
## 36             startnetzero.net             business                 <NA>  <NA>                   business               news and media       <NA>       1    <NA>      <NA>
## 37              success-seo.com             business                 <NA>  <NA>                   business                uncategorized       <NA>      48    <NA>      <NA>
## 38               suddenlink.net              portals                 <NA>  <NA>                    portals   search engines and portals       <NA>       3    <NA>      <NA>
## 39            thegeekspeaks.net                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       2    <NA>      <NA>
## 40                  toshiba.com             business                 <NA>  <NA>                   business         business and economy       <NA>       1    <NA>      <NA>
## 41                     twcc.com                 <NA>                 <NA>  <NA>              entertainment                entertainment       <NA>       1    <NA>      <NA>
## 42        video--production.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       3    <NA>      <NA>
## 43 videos-for-your-business.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       6    <NA>      <NA>
## 44                  xfinity.com             business                 <NA>  <NA>                   business         business and economy       <NA>      29    <NA>      <NA>
## 45                    ygask.com                 <NA>                 <NA>  <NA>              uncategorized                uncategorized       <NA>       1    <NA>      <NA>

After investigating the domains in the list, I found that this filter leaves only one clear false positive: saltpalace.com, the site of a convention center in Salt Lake City where I attended a conference and from which I visited my web site to make sure that it was up and running. At this point, there is really no good way to filter out this false positive using domain classification information.
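
Outside of the classification data, one possible workaround is a small hand-maintained allowlist of domains that are known to be legitimate. This is only a sketch of that idea, not part of the analysis: the candidatesDf and allowlistDf names are hypothetical, and the three rows are a small excerpt of the domains and attack counts shown above.

```r
library(dplyr)

# Small excerpt standing in for gaRefSpam5Df (hypothetical name)
candidatesDf <- data.frame(
  domain  = c("success-seo.com", "saltpalace.com", "social-buttons.com"),
  attacks = c(48, 25, 18),
  stringsAsFactors = FALSE
)

# Hand-maintained allowlist (an assumption, not in the original analysis)
allowlistDf <- data.frame(
  domain = c("saltpalace.com"),
  stringsAsFactors = FALSE
)

# anti_join() keeps the candidate rows that have no match in the allowlist
spamDf <- candidatesDf %>% anti_join(allowlistDf, by = "domain")
spamDf
```

The tradeoff is that an allowlist has to be reviewed by hand whenever a new legitimate referrer shows up, so it complements the classification filters and does not replace them.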

#
# Combine with detail to see if there are other characteristics to use for filtering
#
if (!file.exists("./gaRefSpam6Df")) {
  gaRefSpam6Df <- gaRefSpam5Df %>% inner_join(gaRefSpamDetailDf,by=c("domain" = "referrerDomain"))
  save(gaRefSpam6Df,file="./gaRefSpam6Df")
} else {
  load("./gaRefSpam6Df")
}
summary(gaRefSpam6Df)
##     domain          bitdefender           dr_web             alexa                     google                         websense   trendmicro          attacks.x       dmozCat           shallaCat           pagePath         fullReferrer         attacks.y     
##  Length:76          Length:76          Length:76          Length:76          business     :32   business and economy      :37   Length:76          Min.   : 1.00   Length:76          Length:76          Length:76          Length:76          Min.   : 1.000  
##  Class :character   Class :character   Class :character   Class :character   searchengines:12   uncategorized             :13   Class :character   1st Qu.: 1.00   Class :character   Class :character   Class :character   Class :character   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   uncategorized:10   information technology    : 4   Mode  :character   Median : 6.00   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median : 1.000  
##                                                                              education    : 3   search engines and portals: 4                      Mean   :10.62                                                                               Mean   : 3.684  
##                                                                              marketing    : 2   proxy avoidance           : 3                      3rd Qu.:24.25                                                                               3rd Qu.: 3.000  
##                                                                              (Other)      :16   (Other)                   :14                      Max.   :48.00                                                                               Max.   :48.000  
##                                                                              NA's         : 1   NA's                      : 1
gaRefSpam6Df[,c("domain","pagePath","fullReferrer")]
##                          domain                                                                                               pagePath                                                                           fullReferrer
## 1            100dollars-seo.com                                                                                                      /                                                             100dollars-seo.com/try.php
## 2                     asana.com                                                         /Web-Commerce/100dollars-seo-com-referral-spam                                          app.asana.com/0/11812954602745/44396681995798
## 3                 atlassian.net                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                      lmovim.atlassian.net/browse/VDG-1
## 4                 atlassian.net                                                         /Web-Commerce/social-buttons-com-referrer-spam                                          playhousedigital.atlassian.net/browse/TGCF-76
## 5                basecamphq.com                                                         /Web-Commerce/social-buttons-com-referrer-spam viminteractive.basecamphq.com/projects/10027358-ee-maintenance/posts/92384734/comments
## 6            best-seo-offer.com                                                                                                      /                                                             best-seo-offer.com/try.php
## 7         best-seo-solution.com                                                                                                      /                                                          best-seo-solution.com/try.php
## 8              binarystream.com                                                    /Loan-Pricing/effective-yield-loan-fee-amortization                              crm2015.binarystream.com/_controls/emailbody/msgBody.aspx
## 9       buttons-for-website.com                                                                                                      /                                                               buttons-for-website.com/
## 10 buttons-for-your-website.com                                                                                                      /                                                          buttons-for-your-website.com/
## 11                  clearch.org     /Personal-and-Small-Business-Technology/using-multiple-virtual-desktops-on-windows-os-x-and-ubuntu                                                                    search.clearch.org/
## 12                  darodar.com                                                                                                      /                                                       forum.topic44008047.darodar.com/
## 13             delta-search.com                                                                                                      /                                                                 www2.delta-search.com/
## 14             delta-search.com                                                                                                 /about                                                                 www2.delta-search.com/
## 15             delta-search.com                                                                                          /All-Articles                                                                 www2.delta-search.com/
## 16             delta-search.com                                                                                               /contact                                                                 www2.delta-search.com/
## 17             delta-search.com                                                                                                /people                                                                 www2.delta-search.com/
## 18             delta-search.com                                                                                /Table/Deposit-Pricing/                                                                 www2.delta-search.com/
## 19             delta-search.com                                                                                   /Table/Loan-Pricing/                                                                 www2.delta-search.com/
## 20             delta-search.com                                                                       /Table/Loan-Pricing/Charge-offs/                                                                 www2.delta-search.com/
## 21             delta-search.com                                                                           /Table/Open-Source-Software/                                                                 www2.delta-search.com/
## 22             delta-search.com                                                 /Table/Operations-and-Information-Technology/Security/                                                                 www2.delta-search.com/
## 23             delta-search.com                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                                 www2.delta-search.com/
## 24               dnsrsearch.com                                                         /Web-Commerce/100dollars-seo-com-referral-spam                                                       dnsrsearch.com/index_results.php
## 25         financemarketing.com                                                         /Web-Commerce/social-buttons-com-referrer-spam                                            projects.financemarketing.com/tasks/3362401
## 26                  findeer.com                                          /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam                                                          search.findeer.com/it/results
## 27          informationvine.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                              informationvine.com/index
## 28                     isket.jp                                                         /Web-Commerce/social-buttons-com-referrer-spam                     isket.jp/seo/social-buttons-comからのリファラスパムが大量に・・・/
## 29                   ixenia.com                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                         redmine.ixenia.com/issues/1143
## 30               justprofit.xyz                                                                                                      /                                                                        justprofit.xyz/
## 31             k9safesearch.com                                               /Open-Source-Software/r-open-source-statistical-software                                                            k9safesearch.com/search.jsp
## 32             k9safesearch.com     /Personal-and-Small-Business-Technology/using-multiple-virtual-desktops-on-windows-os-x-and-ubuntu                                                            k9safesearch.com/search.jsp
## 33             k9safesearch.com                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                            k9safesearch.com/search.jsp
## 34                    larger.io                                                                                                      /                                                                             larger.io/
## 35             locatimefree.com                                                         /Web-Commerce/social-buttons-com-referrer-spam        locatimefree.com/google-analytics-realtime-rapid-increase-access-referrer-spam/
## 36                        NA.NA                                          /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam                                                          131.253.14.125/bvsandbox.aspx
## 37       rankings-analytics.com                                                                                                      /                                                         rankings-analytics.com/try.php
## 38              rankscanner.com                                                                                                      /                                       rankscanner.com/Domain/mooresoftwareservices.com
## 39                richpasco.org                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                             richpasco.org/virus/callerid_spoofing.html
## 40               saltpalace.com                                                                                                      /               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 41               saltpalace.com                                                                                                      /               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 42               saltpalace.com                                                                                          /All-Articles               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 43               saltpalace.com                              /Deposit-Pricing/using-time-variable-fees-to-solve-peak-workload-problems               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 44               saltpalace.com                                                                                /Table/Deposit-Pricing/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 45               saltpalace.com                                                                                /Table/Deposit-Pricing/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 46               saltpalace.com                                                                                   /Table/Loan-Pricing/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 47               saltpalace.com                                                                                   /Table/Loan-Pricing/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.4/success
## 48               saltpalace.com                                                          /Table/Operations-and-Information-Technology/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 49               saltpalace.com                                             /Table/Operations-and-Information-Technology/Web-Commerce/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 50               saltpalace.com                                                         /Table/Personal-and-Small-Business-Technology/               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 51               saltpalace.com                                                         /Web-Commerce/social-buttons-com-referrer-spam               spcc-meru-guestsvc.saltpalace.com/portal/RootsTech2015/10.52.0.3/success
## 52               searchlock.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                                        searchlock.com/
## 53               searchlock.com                  /Personal-and-Small-Business-Technology/windows-10-upgrade-experience-for-lenovo-w541                                                                        searchlock.com/
## 54               searchlock.com                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                                        searchlock.com/
## 55              securesearch.co                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                                securesearch.co/search/
## 56              semaltmedia.com                                                                                                      /                                                                       semaltmedia.com/
## 57             servicepunt71.nl                                                         /Web-Commerce/social-buttons-com-referrer-spam                                                webmail.servicepunt71.nl/owa/redir.aspx
## 58                shawcable.net                        /Personal-and-Small-Business-Technology/sales-and-lead-management-with-suitecrm                                                     wm-s.glb.shawcable.net/zimbra/mail
## 59           social-buttons.com                                                                                                      /                                                             site34.social-buttons.com/
## 60              sosodesktop.com                                          /Web-Commerce/traffic2cash-xyz-google-analytics-referral-spam                                                      search.sosodesktop.com/search/web
## 61                startjuno.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                         startjuno.com/search/index.php
## 62             startnetzero.net                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                      startnetzero.net/search/index.php
## 63              success-seo.com                                                                                                      /                                                                success-seo.com/try.php
## 64               suddenlink.net                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                   home.suddenlink.net/search/index.php
## 65            thegeekspeaks.net                                                         /Web-Commerce/social-buttons-com-referrer-spam                           thegeekspeaks.net/social-buttons-com-spams-google-analytics/
## 66                  toshiba.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                      home.toshiba.com/search/index.php
## 67                     twcc.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                                       search.twcc.com/
## 68        video--production.com                                                                                                      /                                                                 video--production.com/
## 69 videos-for-your-business.com                                                                                                      /                                                 53275950.videos-for-your-business.com/
## 70 videos-for-your-business.com                                                                                                      /                                                          videos-for-your-business.com/
## 71                  xfinity.com                                                    /Loan-Pricing/effective-yield-loan-fee-amortization                                                                    search.xfinity.com/
## 72                  xfinity.com /Personal-and-Small-Business-Technology/downloading-and-preparing-ftc-robocall-complaint-list-for-ncid                                                                    search.xfinity.com/
## 73                  xfinity.com                       /Personal-and-Small-Business-Technology/stopping-rachel-from-cardholder-services                                                                    search.xfinity.com/
## 74                  xfinity.com                                  /Personal-and-Small-Business-Technology/using-ncid-on-two-phone-lines                                                                    search.xfinity.com/
## 75                  xfinity.com                                         /Web-Commerce/why-and-how-to-set-up-ssl-https-on-your-web-site                                                                    search.xfinity.com/
## 76                    ygask.com                                                    /Loan-Pricing/effective-yield-loan-fee-amortization                             ygask.com/effective-interest-method-of-amortization-k.html

Many of the referral spam domains reference only the root page, which makes sense: it is the only page that is guaranteed to exist on every site. Next we keep only the referrals whose page path is /, so that any domain that never referred to the root page drops out of the list.

#
# Keep only referrals to the root page (/) and summarize them by domain
#
if (!file.exists("./gaRefSpam7Df")) {
  gaRefSpam7Df <- gaRefSpam6Df %>% filter(pagePath == "/" ) %>%
    #group_by(domain,bitdefender,dr_web,alexa,google,websense,trendmicro,dmozCat,shallaCat,attacks.x) %>% 
    group_by(domain,dr_web,alexa,trendmicro) %>%
    summarize(numRefPages=n_distinct(pagePath))
  #gaRefSpam7Df <- gaRefSpam7Df %>% filter(numRefPages <= 1) %>%
  #  group_by(domain,bitdefender,dr_web,alexa,google,websense,trendmicro,dmozCat,shallaCat,attacks.x)
  save(gaRefSpam7Df,file="./gaRefSpam7Df")
} else {
  load("./gaRefSpam7Df")
}
gaRefSpam7Df
## Source: local data frame [17 x 5]
## Groups: domain, dr_web, alexa [?]
## 
##                          domain               dr_web alexa trendmicro numRefPages
##                           <chr>                <chr> <chr>      <chr>       <int>
## 1            100dollars-seo.com                 <NA>  <NA>       <NA>           1
## 2            best-seo-offer.com                 <NA>  <NA>       <NA>           1
## 3         best-seo-solution.com                 <NA>  <NA>       <NA>           1
## 4       buttons-for-website.com                 <NA>  <NA>       <NA>           1
## 5  buttons-for-your-website.com                 <NA>  <NA>       <NA>           1
## 6                   darodar.com                 <NA>  <NA>       <NA>           1
## 7              delta-search.com not recommended site  <NA>       <NA>           1
## 8                justprofit.xyz                 <NA>  <NA>       <NA>           1
## 9                     larger.io                 <NA>  <NA>       <NA>           1
## 10       rankings-analytics.com                 <NA>  <NA>       <NA>           1
## 11              rankscanner.com                 <NA>  <NA>       <NA>           1
## 12               saltpalace.com                 <NA>  <NA>       <NA>           1
## 13              semaltmedia.com                 <NA>  <NA>       <NA>           1
## 14           social-buttons.com not recommended site  <NA>       <NA>           1
## 15              success-seo.com                 <NA>  <NA>       <NA>           1
## 16        video--production.com                 <NA>  <NA>       <NA>           1
## 17 videos-for-your-business.com                 <NA>  <NA>       <NA>           1
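
Note that the filter above keeps any domain with at least one referral to /, so saltpalace.com and delta-search.com remain in the list even though they also referred to other pages. A stricter variant, sketched here as an assumption rather than as the method used in this article, keeps only the domains whose every referral targets the root page. The refDetailDf and rootOnlyDf names are hypothetical, and the toy rows are a small subset of the domain and pagePath columns listed earlier.

```r
library(dplyr)

# Toy subset of the domain / pagePath pairs from the detail listing
refDetailDf <- data.frame(
  domain = c("darodar.com", "delta-search.com", "delta-search.com",
             "saltpalace.com", "saltpalace.com", "larger.io"),
  pagePath = c("/", "/", "/about", "/", "/All-Articles", "/"),
  stringsAsFactors = FALSE
)

# Keep a domain only if all of its referrals point at the root page
rootOnlyDf <- refDetailDf %>%
  group_by(domain) %>%
  summarize(numRefPages = n_distinct(pagePath),
            rootOnly = all(pagePath == "/")) %>%
  filter(rootOnly)

rootOnlyDf$domain
```

On the full data this stricter rule would remove the saltpalace.com false positive, but it would also remove delta-search.com, so whether it is appropriate depends on how you want to treat that domain.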

Conclusions

Before doing any work in R using Google Analytics data, you must remove all of the referral spam web sites from your data; this can be done easily using the rdomains package. Because the domain classification data changes over time, it will be necessary to revisit this script on a regular basis.
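
Once a spam domain list exists, applying it to other Google Analytics data can be as simple as an anti-join. The following is a minimal sketch, assuming the data to be cleaned carries a referrerDomain column like the detail data used earlier; gaDataDf, spamDomainsDf, and cleanDf are hypothetical names, and the rows are toy values.

```r
library(dplyr)

# Toy stand-in for Google Analytics referral data to be cleaned
gaDataDf <- data.frame(
  referrerDomain = c("xfinity.com", "semaltmedia.com", "larger.io", "asana.com"),
  pagePath = c("/Loan-Pricing/effective-yield-loan-fee-amortization", "/", "/",
               "/Web-Commerce/100dollars-seo-com-referral-spam"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the final spam domain list
spamDomainsDf <- data.frame(
  domain = c("semaltmedia.com", "larger.io"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose referrer domain is NOT in the spam list
cleanDf <- gaDataDf %>%
  anti_join(spamDomainsDf, by = c("referrerDomain" = "domain"))

cleanDf$referrerDomain
```

Because anti_join only drops matching rows, rerunning it with a refreshed spam list is enough to keep later analyses clean.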