
Download and Format the FTC Robocall Complaint List for NCID
While researching how to set up multiple modems on a single NCID server, I stumbled across a forum entry describing the new robocall complaint list that the FTC began posting in October 2015. I immediately investigated and downloaded the complaint list to load it onto my Raspberry Pi NCID server. In no time, I wrote an R script to automate downloading the file and updating my ncidd.alias file with the new definitions; the FTC updates the file on a monthly basis.
Since I will now know which numbers have been reported to the FTC, it will be easy to identify and report any new robocall phone numbers to the FTC.
R Code to Download the FTC Robocall Complaint List
To automate downloading the complaint list and creating new files to append to the NCID configuration, I wrote an R script. I chose to add the complaint phone numbers to the ncidd.alias file, replacing the NAME with “FTC Complaint List” as suggested by a discussion on the NCID forum. The first segment sets up some filenames and downloads the FTC complaint file:
#
# Set up file name and URL arguments
#
outFn <- "../../data/ftc_weekly_complaintList.csv"
ftccomplaintListURL <- "https://consumercomplaints.fcc.gov/hc/theme_assets/513073/200051444/Telemarketing_RoboCall_Weekly_Data.csv"
aliasFn <- "../../data/ncidd.alias.ftc"
complaintListFn <- "../../data/ncidd.blacklist.ftc"
#
# The FTC website uses a self-signed certificate for some reason, thus --no-check-certificate
#
wgetCmdStr <- paste("wget --no-check-certificate -O ", outFn, ftccomplaintListURL)
system(wgetCmdStr)
ftccomplaintListDf <- read.csv(outFn, header = TRUE)
summary(ftccomplaintListDf)
##                                            Phone.Issues
##  Robocalls                                          :18909
##  Telemarketing (including do not call and spoofing) :38930
##
##
##
##
##
##
##  Time.of.Issue    Caller.ID.Number     Advertiser.Business.Phone.Number
##  -       : 2573   -           :13782   -           :44466
##  1:00 PM :  841   248-215-0437:  187   844-279-7423:   72
##  10:00 AM:  768   267-233-6724:  131   844-226-8637:   66
##  11:00 AM:  688   213-217-9999:  110   281-702-8593:   53
##  9:00 AM :  583   863-777-2311:  107   202-864-1122:   37
##  2:00 PM :  537   609-270-0113:  103   844-279-7431:   32
##  (Other) :51849   (Other)     :43419   (Other)     :13113
##  Type.of.Call.or.Message..Robocalls.
##                            :38893
##  Abandoned Call            : 3746
##  Autodialed Live Voice Call:  997
##  Prerecorded Voice         :14034
##  Text Message              :  169
##
##
##  Type.of.Call.or.Message..Telemarketing.          State
##                   :19213                 California  : 7692
##  Abandoned Calls  : 7749                 Texas       : 4133
##  Email            :   64                 Florida     : 4118
##  Live Voice       :16771                 New York    : 3369
##  Prerecorded Voice:13060                 Pennsylvania: 2413
##  Text Messaage    :  880                 Virginia    : 2111
##  Text Message     :  102                 (Other)     :34003
##  Date..Ticket.Date.of.Issue.  Date..Ticket.Created.
##  10/14/2015:  783             10/14/2015:  788
##  10/1/2015 :  738             10/15/2015:  778
##  10/7/2015 :  738             10/7/2015 :  770
##  10/6/2015 :  733             10/6/2015 :  745
##  10/27/2015:  722             10/27/2015:  741
##  10/15/2015:  710             10/28/2015:  726
##  (Other)   :53415             (Other)   :53291
The next segment uses the dplyr and magrittr packages to select the two columns with phone numbers, merge them into a single column, and then de-dup the numbers:
#
# Load required packages
#
require(dplyr)
require(magrittr)
#
# Combine the phone numbers in the Caller.ID.Number column and the
# Advertiser.Business.Phone.Number column into a single column to handle
# nulls in either of the columns
#
callerIDDf <- ftccomplaintListDf %>% select(Caller.ID.Number)
phoneNbrDf <- ftccomplaintListDf %>%
  select(Advertiser.Business.Phone.Number) %>%
  rename(Caller.ID.Number = Advertiser.Business.Phone.Number)
complaintListDf <- rbind(callerIDDf, phoneNbrDf)
#
# De-dup the list and generate the text that will go in the ncidd.alias file
#
complaintListDf <- complaintListDf %>%
  group_by(Caller.ID.Number) %>%
  summarise(count = n()) %>%
  mutate(ncidFormatPhone = gsub("-", "", Caller.ID.Number)) %>%
  mutate(aliasLine = paste("alias NAME * = \"FTC Complaint List\" if ", ncidFormatPhone))
summary(complaintListDf)
##     Caller.ID.Number      count           ncidFormatPhone
##  -           :    1   Min.   :    1.00   Length:25747
##  000-000-0000:    1   1st Qu.:    1.00   Class :character
##  000-000-0001:    1   Median :    1.00   Mode  :character
##  000-000-0106:    1   Mean   :    4.49
##  000-000-0911:    1   3rd Qu.:    2.00
##  000-000-2019:    1   Max.   :58248.00
##  (Other)     :25741
##   aliasLine
##  Length:25747
##  Class :character
##  Mode  :character
##
##
##
The final segment writes the two files in the format that NCID can read:
#
# Write the files to be appended to the ncidd.alias and ncidd.blacklist files
#
write(complaintListDf$aliasLine, file = aliasFn)
write("\"FTC Complaint List\"", file = complaintListFn)
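For reference, each line written to the alias file follows the NCID alias form; for example, using one of the numbers from the summary above:

alias NAME * = "FTC Complaint List" if  8442797423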
To append the files to the existing lists of alias and blacklist definitions, a slightly different form of sudo is required:
cd /etc/ncid
sudo cp ncidd.alias ncidd.alias.personal
sudo cp ncidd.blacklist ncidd.blacklist.personal
sudo sh -c "cat ncidd.alias.ftc >> ncidd.alias"
sudo sh -c "cat ncidd.blacklist.ftc >> ncidd.blacklist"
In future months, I will concatenate the .personal and .ftc files.
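A sketch of that refresh, assuming the .personal copies made above are kept current:

cd /etc/ncid
sudo sh -c "cat ncidd.alias.personal ncidd.alias.ftc > ncidd.alias"
sudo sh -c "cat ncidd.blacklist.personal ncidd.blacklist.ftc > ncidd.blacklist"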
Analysis of the FTC Complaint List
It is useful to look at the complaint data to learn more about how spammers operate. Figure 1 shows the proportion of robocalls vs. spoofing, do-not-call list violations and other hostile telemarketing practices. Robocalls are only about one-third of the complaints (18,909 of the 57,839 shown in the summary above).

Print the T-shirt
You can download and print the T-shirt design from the FTC’s Zap Rachel contest at DEF CON in 2014.
Related Articles
For additional information, you may be interested in other articles on NCID and stopping phone spam:
- Stopping Rachel from Cardholder Services covers multiple ways to address phone spam, including setting up an NCID server.
- Download and Format the FTC Robocall Complaint List for NCID shows how to download and format the FTC complaint list to give you a list of spammers before they call you.
- Using NCID on Two Phone Lines shows how to add a second modem to your NCID configuration.
Visualizing Web Analytics in R Part 6: Interactive Networks
This article is the sixth in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on using interactive network visualizations to show where search-related website traffic originates. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
- Which articles need work to improve search engine ranking
- Which articles are well ranked but do not get clicked, and need work on titles or meta data
- Where to focus efforts for new content
- How to use passive web search data to focus new product development
The other articles in the series are:
- Visualizing Web Analytics Data in R Part 1: the Problem
- Visualizing Web Analytics Data in R Part 2: Interactive Outliers
- Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
- Visualizing Web Analytics Data in R Part 4: Interactive Globe
- Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
- Visualizing Web Analytics Data in R Part 7: Interactive Complex
Network Diagram Using networkD3 and htmlwidgets
Network diagrams are useful for understanding complex datasets. The networkD3 and htmlwidgets packages provide a way to generate interactive network diagrams that can be manipulated in a web browser. Figure 1 shows the Google Analytics session data by page and country displayed using the simpleNetwork call, while Figure 2 shows the same data displayed using the forceNetwork call. In Figure 2, the large nodes represent countries, while the smaller nodes in a ring around each country represent the pages that are referenced from that country. The size of each node is log(sessions + 1) to differentiate nodes based upon the number of sessions, which varies from 1 to about 20,000.
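That sizing step is a one-liner; a minimal sketch, assuming the nodeDf data frame with the Sessions column shown in Figure 4:

# Compress session counts ranging from 1 to about 20,000 into a readable node size
nodeDf$size <- log(nodeDf$Sessions + 1)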
For this particular data set, neither of these visualizations is as useful as the heatmaps discussed previously, but they demonstrate what can be done with network visualizations. For the behavioral flow data in Google Analytics, these networks would be the best possible visualization; unfortunately, I have not gotten that data out of Google Analytics yet.
In preparing the forceNetwork visualization, it is important to remember that the indexing begins at 0 as in C rather than 1 as used in R. As you develop a forceNetwork visualization, start with an empty link dataframe and then add links. The JavaScript code used for the on-click box is shown in Figure 3, while the full forceNetwork call is shown in Figure 4.
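To make the 0-based indexing concrete, here is a hypothetical two-node example; the node names and values are made up:

require(networkD3)
# Nodes are referenced by row position minus 1: row 1 of nodes is index 0
nodes <- data.frame(name  = c("United States", "social-buttons"),
                    group = c(1, 2),
                    size  = c(10, 5))
links <- data.frame(source = 0, target = 1, value = 1)
forceNetwork(Links = links, Nodes = nodes,
             Source = "source", Target = "target", Value = "value",
             NodeID = "name", Group = "group", Nodesize = "size",
             opacity = 0.8)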
The htmlwidgets package provides the functions to save the visualizations with the saveWidget(chartWid, saveName, selfcontained = TRUE) call. The selfcontained = TRUE parameter puts everything into a single HTML file, which is easier to manage on some web sites. The files are minified, but are not compressed.
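A minimal sketch of the save step, assuming chartWid is the widget built by the Figure 4 call and saveName is whatever output path you choose:

require(htmlwidgets)
saveName <- "forceNetwork.html"   # hypothetical output file name
saveWidget(chartWid, saveName, selfcontained = TRUE)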
Figure 3: JavaScript code for the on-click box in the forceNetwork visualization.

MyClickScript <- 'alert("You clicked node " + d.name + " which is in row " + (d.index + 1) + " of the nodeDf R data frame. It has " + (Math.floor(Math.exp(d.nodesize) - 1)) + " sessions.");'
Figure 4: Data frame classes and the full forceNetwork call for the visualization.

sapply(nodeDf, class)
##        name       group        size    Sessions    destNode      nodeID
##    "factor"    "factor"   "numeric"   "numeric" "character"   "numeric"
##  destNodeID
##   "numeric"
sapply(arcFinDf,class)
##    source    target     value
## "integer" "integer" "integer"
chartWid <- forceNetwork(Links = arcFinDf, Nodes = nodeDf,
                         Source = "source", Target = "target",
                         Group = "group", NodeID = "name",
                         Value = "value", Nodesize = "size",
                         width = 800, height = 600,
                         charge = -(1e1),
                         linkDistance = JS("function(d){return d.value * 10}"),
                         radiusCalculation = JS("d.nodesize+6"),
                         zoom = TRUE, legend = TRUE,
                         opacity = 0.8,
                         clickAction = MyClickScript,
                         bounded = FALSE)
Notes
This article was written in RStudio and uses the networkD3 package for rendering and the htmlwidgets package for saving HTML files.
Visualizing Web Analytics in R Part 5: Interactive Heatmap
This article is the fifth in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on using heatmaps to determine which articles are important in which geographical areas. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
- Which articles need work to improve search engine ranking
- Which articles are well ranked but do not get clicked, and need work on titles or meta data
- Where to focus efforts for new content
- How to use passive web search data to focus new product development
The other articles in the series are:
- Visualizing Web Analytics Data in R Part 1: the Problem
- Visualizing Web Analytics Data in R Part 2: Interactive Outliers
- Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
- Visualizing Web Analytics Data in R Part 4: Interactive Globe
- Visualizing Web Analytics Data in R Part 6: Interactive Networks
- Visualizing Web Analytics Data in R Part 7: Interactive Complex
Heatmap of Region and Page
When looking at large amounts of data with only three dimensions, heatmaps can frequently convey the general trends faster than scatter plots or geographic maps, though they are not as good at identifying outliers as the scatter plots shown in articles 1 through 3, nor at displaying geographic relations as the maps shown in article 4. Heatmaps are best used to get an overall sense of a dataset. Figures 1, 2 and 3 show heatmaps of ISO region vs. URL with scaling by column (ISO region), by row (URL) and with no scaling. It is important to know how scaling is done when reading a heatmap, as these three interactive graphs show.
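A minimal sketch of how the three figures differ, assuming sessionMat is a matrix of session counts with page URLs as rows and ISO regions as columns (a hypothetical object name):

require(d3heatmap)
d3heatmap(sessionMat, scale = "column", dendrogram = "none")  # Figure 1: scaled by ISO region
d3heatmap(sessionMat, scale = "row",    dendrogram = "none")  # Figure 2: scaled by URL
d3heatmap(sessionMat, scale = "none",   dendrogram = "none")  # Figure 3: scaled across all cells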
Figure 1 is scaled by column, and shows that the social-buttons.com article is the most important article in all geographies, followed by effective-yield-loan-amortization in Saudi Arabia, Nigeria, the Philippines and Egypt, and sales-and-lead-management-with-suite-crm in Lithuania, Slovakia and other smaller countries.
Figure 2 is scaled by row and is visually very different; it shows that the overwhelming sources of traffic for all pages are the US, the UK and other English-speaking countries. This isn’t a surprise since the site is English-only.
Figure 3 is scaled across all cells, and shows that from the perspective of all traffic, only three articles and three countries are important: social-buttons.com, effective-yield-loan-amortization and stopping-rachel-from-cardholder-services are the main articles and are largely accessed in the US, the UK and by users whose country is not set.
Heatmaps are easier to read than scatterplots, but they don’t necessarily yield as much information and can be misleading if the reader does not know how the data was scaled for the plot.
Notes
This article was written in RStudio and uses the d3heatmap package for all graphics.
Visualizing Web Analytics in R Part 4: Interactive Globe
This article is the fourth in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on using interactive globe visualizations to show where search-related website traffic originates. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
- Which articles need work to improve search engine ranking
- Which articles are well ranked but do not get clicked, and need work on titles or meta data
- Where to focus efforts for new content
- How to use passive web search data to focus new product development
The other articles in the series are:
- Visualizing Web Analytics Data in R Part 1: the Problem
- Visualizing Web Analytics Data in R Part 2: Interactive Outliers
- Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
- Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
- Visualizing Web Analytics Data in R Part 6: Interactive Networks
- Visualizing Web Analytics Data in R Part 7: Interactive Complex
Globe Showing Relative Page Use
Geographic visualizations are increasingly important in understanding many types of data. The section that follows shows a way to visualize Google Search Console and Google Analytics data on an interactive globe using the globejs call in the threejs package. This type of visualization is especially suited to origin-destination pair data like airline or telecommunications data, but it is still very useful for point data like web analytics.
The first step in this process was to find geocoded values for the ISO Region Codes provided by Google Analytics. Country data is readily available, but state or province level data is more difficult to obtain. The analysis in this article aggregates data at the country level.
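A hedged sketch of that aggregation step, assuming a sessionDf and a country centroid lookup table with hypothetical column names:

require(dplyr)
countryDf <- sessionDf %>%
  group_by(countryCode) %>%                         # collapse region-level rows to countries
  summarise(sessions = sum(sessions)) %>%
  left_join(countryLatLongDf, by = "countryCode")   # attach geocoded country centroids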
Figure 1 shows the country of origin for sessions for each of four categories of article:
- General articles are blue
- Web commerce articles are red
- Consumer articles are yellow
- Banking articles are green
Because the globejs command does not allow different arc heights, the latitude/longitude of the origin is jittered so that the different arcs do not overlap and are all visible. The session volume is scaled and applied to the line width, but this is not particularly easy to read in this visualization. This visualization makes it clear that the vast majority of traffic comes from the United States and Europe, and that “general” articles are only used in the US and Europe.
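A minimal sketch of that workaround, assuming an arcDf with one arc per page/origin-country pair and hypothetical column names:

require(threejs)
# Jitter the origin coordinates so overlapping arcs stay visible, since
# globejs() draws every arc at the same height
arcDf$oLat  <- jitter(arcDf$oLat,  amount = 0.5)
arcDf$oLong <- jitter(arcDf$oLong, amount = 0.5)
globejs(arcs      = arcDf[, c("oLat", "oLong", "dLat", "dLong")],
        arcsColor = arcDf$categoryColor,                    # the four category colors above
        arcsLwd   = as.numeric(scale(arcDf$sessions)) + 2)  # scaled session volume as width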
Figure 2 shows the same session data, but in a geographic bar chart format. For this particular dataset, this visualization is easier to understand, and easier to generate, as it does not require artificially generating the origin-destination pairs. In Figure 2, it is clear that the US and the developed world generate the vast majority of the traffic.
These figures are useful for understanding the regional patterns in the data and make it very easy for users to combine the visualization with their understanding of the underlying geographic and demographic information. Unfortunately, the globe visualizations can’t really show article-level detail. For article-level detail, a heatmap is a better visualization; interactive heatmaps using the d3heatmap package are demonstrated in the next article in the series, Visualizing Web Analytics in R Part 5: Interactive Heatmap.
Notes
This article was written in RStudio and uses the threejs package for all graphics.
Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
This article is the third in a series about visualizing Google Analytics and other web analytics data using R. This article focuses on determining which articles have click-through rates higher than would be expected for their average search position and which articles need work. The series hopes to show how R and interactive visualizations can help to answer the following business questions:
- Which articles need work to improve search engine ranking
- Which articles are well ranked but do not get clicked, and need work on titles or meta data
- Where to focus efforts for new content
- How to use passive web search data to focus new product development
The other articles in the series are:
- Visualizing Web Analytics Data in R Part 1: the Problem
- Visualizing Web Analytics Data in R Part 2: Interactive Outliers
- Visualizing Web Analytics Data in R Part 4: Interactive Globe
- Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
- Visualizing Web Analytics Data in R Part 6: Interactive Networks
- Visualizing Web Analytics Data in R Part 7: Interactive Complex
Interactive 3D Scatterplot showing Five Dimensional Data
Showing more than three dimensions of data gets to be a problem in any visualization; parallel axes work very well if you are presenting the data to scientists and engineers who are familiar with that type of plot, but they do not go over well with typical business professionals. For users who have not been trained to read parallel axes and other advanced chart types, a 3D scatter plot is a good way to represent up to five dimensions: three on the x, y and z axes, a fourth dimension with dot size and a fifth dimension with color. If you can use additional dot types, you can represent a sixth data dimension.
The R threejs package makes it easy to generate a 3D scatterplot with five variables, though it has some limitations that will hopefully be removed as the package is enhanced.
Figure 1 shows a 3D scatter plot of Google Analytics click-through rate (CTR), search position and article size generated with the scatterplot3js call in the threejs package. The call allows two different rendering tools, “canvas” and “webgl”. The “webgl” option is faster in most browsers because it uses OpenGL, but it does not allow changing the size of the dots. It also may not allow users on some mobile devices to manipulate the image, as some devices do not support OpenGL. The “canvas” rendering option is slower, but it allows changes to dot size and works better on mobile devices.
Figure 2 shows the same plot but uses “canvas” so that the dot size represents the number of times the page was presented in a search (impressions) and the color represents the classification of the page as general (blue), web (red), banking (green) and consumer (yellow).
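A minimal sketch of the Figure 2 call, assuming a scatterDf with hypothetical column names; the renderer argument applies to the version of threejs available when this was written:

require(threejs)
scatterplot3js(x = scatterDf$CTR,        # click-through rate
               y = scatterDf$Position,   # average search position
               z = scatterDf$pageSize,   # article size
               size  = log(scatterDf$Impressions + 1) / 2,  # 4th dimension: impressions
               color = scatterDf$areaColor,                 # 5th dimension: query area
               renderer = "canvas",      # slower, but supports per-point dot sizes
               labels = scatterDf$page)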
Rotating the plot so that the Y axis (position) is closest to the foreground, Figure 2 shows that the CTR (X axis) increases strongly with article size (Z axis) for general, banking and consumer articles, but not as strongly for web articles. With the X axis (CTR) in the foreground, it is clear that search position improves (a low number to the left is good) with article size.
The most prominent article, stopping-rachel-from-cardholder-services, is pretty clearly in the top half of article size (Z axis). The single largest article, statistical-example-of-fair-lending-disparate-treatment-problem, has one of the highest click-through rates and is generally positioned well in searches. The small articles that do well on both CTR and position tend to be very, very specific pages that people are trying to find directly: bruce-moore, the author’s contact page, and various articles with links to presentation slides.
Linear Regression of Article Size
The linear regression shown in Figure 3 below indicates that increasing page size and decreasing position improve the CTR, but the adjusted R-squared value of only 0.088 indicates that this is not a good regression model, and that some other variable is probably much more important to the CTR. Including the query area (general, banking, web and consumer) does not shed much additional light on the situation, as shown in Figure 4.
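As the Call: lines in the output show, the two fits are ordinary least squares models of this form:

fit1 <- lm(CTR ~ Position + pageSize, data = scatterDf)               # Figure 3
fit2 <- lm(CTR ~ Position + pageSize + queryArea, data = scatterDf)   # Figure 4
summary(fit1)
summary(fit2)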
##
## Call:
## lm(formula = CTR ~ Position + pageSize, data = scatterDf)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.031509 -0.016552 -0.004023  0.012603  0.101812
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.400e-02  9.695e-03   1.444   0.1529
## Position    -2.296e-04  1.407e-04  -1.632   0.1069
## pageSize     4.225e-07  1.743e-07   2.423   0.0178 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0252 on 75 degrees of freedom
## Multiple R-squared:  0.1117, Adjusted R-squared:  0.08799
## F-statistic: 4.714 on 2 and 75 DF,  p-value: 0.01179
##
## Call:
## lm(formula = CTR ~ Position + pageSize + queryArea, data = scatterDf)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.031381 -0.018114 -0.003975  0.013424  0.105836
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        2.690e-02  1.272e-02   2.115   0.0379 *
## Position          -1.837e-04  1.435e-04  -1.280   0.2045
## pageSize           3.469e-07  1.852e-07   1.873   0.0652 .
## queryAreaConsumer -1.510e-02  1.077e-02  -1.403   0.1650
## queryAreaGeneral  -1.436e-02  8.816e-03  -1.629   0.1078
## queryAreaWeb      -5.934e-03  1.014e-02  -0.585   0.5604
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02513 on 72 degrees of freedom
## Multiple R-squared:  0.1516, Adjusted R-squared:  0.09272
## F-statistic: 2.574 on 5 and 72 DF,  p-value: 0.03369
With no additional data elements in the Google Search Console data set, there isn’t much more insight that can be gained from this dataset without doing text mining on both the queries and the article content. Since this series is primarily about visualization, the next article in the series will look at using the globejs call to visualize geographic data on an interactive globe.
Notes
This article was written in RStudio and uses the threejs package for all graphics except for the linear regression residual plot. The formula display uses MathJax.