Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)

This article is the third in a series about visualizing Google Analytics and other web analytics data using R. It focuses on determining which articles have click-through rates higher than would be expected for their average search position, and which articles need work. The series aims to show how R and interactive visualizations can help to answer the following business questions:

  • Which articles need work to improve search engine ranking
  • Which articles are well ranked but do not get clicked, and need work on titles or meta data
  • Where to focus efforts for new content
  • How to use passive web search data to focus new product development

Interactive 3D Scatterplot showing Five Dimensional Data

Showing data with more than three dimensions is a problem in any visualization. Parallel axes work very well if you are presenting the data to scientists and engineers who are familiar with this type of plot, but they do not go over well with typical business professionals. For users who have not been trained to read parallel axes and other types of advanced charts, a 3D scatter plot is a good way to represent up to five dimensions: three on the x, y and z axes, a fourth dimension with dot size and a fifth dimension with color. If you can use additional dot types, you can represent a sixth data dimension.

The R threejs package makes it easy to generate a 3D scatterplot with five variables, though it has some limitations that will hopefully be removed as the package is enhanced.

Figure 1 shows a 3D scatter plot of Google Analytics click-through rate (CTR), search position, and article size generated with the scatterplot3js call in the threejs package. The call allows two different rendering options: “canvas” and “webgl”. The “webgl” option is faster in most browsers because it uses WebGL, the browser's OpenGL-based graphics interface, but it does not allow changing the size of the dots. It also may not allow users on some mobile devices to manipulate the image, as some devices do not support WebGL. The “canvas” rendering option is slower, but it allows changes to dot size and works better on mobile devices.

Figure 2 shows the same plot but uses “canvas” so that the dot size represents the number of times the page was presented in a search (impressions) and the color represents the classification of the page as general (blue), web (red), banking (green) and consumer (yellow).
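
A plot like Figure 2 could be set up roughly as follows. This is only a sketch under stated assumptions: the data frame is filled with made-up values, the Impressions column name and the 0.5 to 3 dot-size range are illustrative choices rather than the ones used for the figures, and the threejs call is skipped if the package is not installed.

```r
# Hypothetical stand-in for the article's data; every value here is made up.
set.seed(42)
n <- 40
scatterDf <- data.frame(
  CTR         = runif(n, 0, 0.12),
  Position    = runif(n, 1, 100),
  pageSize    = runif(n, 5000, 80000),
  Impressions = round(runif(n, 10, 5000)),  # assumed column name
  queryArea   = rep(c("General", "Web", "Banking", "Consumer"), length.out = n),
  stringsAsFactors = FALSE
)

# Fifth dimension: one color per subject area, matching the Figure 2 legend.
areaColors <- c(General = "blue", Web = "red",
                Banking = "green", Consumer = "yellow")
dotColor <- unname(areaColors[scatterDf$queryArea])

# Fourth dimension: rescale impressions to a dot size between 0.5 and 3
# (an arbitrary range chosen for this sketch).
rng <- range(scatterDf$Impressions)
dotSize <- 0.5 + 2.5 * (scatterDf$Impressions - rng[1]) / (rng[2] - rng[1])

# The "canvas" renderer is used so that per-point dot sizes are honored.
if (requireNamespace("threejs", quietly = TRUE)) {
  widget <- threejs::scatterplot3js(
    x = scatterDf$CTR,
    y = scatterDf$Position,
    z = scatterDf$pageSize,
    color = dotColor,
    size = dotSize,
    renderer = "canvas"
  )
}
```

Printing widget in an interactive session (or in an R Markdown chunk) displays the rotatable plot.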

Rotating the plot so that the Y axis (position) is closest to the foreground, Figure 2 shows that the CTR (X axis) increases strongly with article size (Z axis) for general, banking and consumer articles, but not as strongly for web articles. With the X axis (CTR) in the foreground, it is clear that search position improves with article size (a lower position number, toward the left, is better).
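
A visual impression like this one can be cross-checked numerically by fitting a separate CTR versus page size slope for each subject area. The sketch below uses made-up data in which every area shares the same underlying slope, so it only illustrates the mechanics; the column names follow the scatterDf data frame used in the regressions later in this article.

```r
# Hypothetical data: all values are synthetic and for illustration only.
set.seed(1)
n <- 80
scatterDf <- data.frame(
  pageSize  = runif(n, 5000, 80000),
  queryArea = rep(c("General", "Web", "Banking", "Consumer"), each = n / 4)
)
scatterDf$CTR <- 0.01 + 4e-07 * scatterDf$pageSize + rnorm(n, sd = 0.02)

# Fit CTR ~ pageSize separately within each subject area and keep the slope.
slopeByArea <- sapply(
  split(scatterDf, scatterDf$queryArea),
  function(d) coef(lm(CTR ~ pageSize, data = d))[["pageSize"]]
)

print(signif(slopeByArea, 3))
```

A noticeably smaller slope for one area would support what the rotated plot suggests, though with roughly twenty points per area the estimates are noisy.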

The most prominent article, stopping-rachel-from-cardholder-services, is pretty clearly in the top half of article size (Z axis). The single largest article, statistical-example-of-fair-lending-disparate-treatment-problem, has one of the highest click-through rates and is generally positioned well in searches. The small articles that do well on both CTR and position tend to be very specific articles that people are deliberately searching for: bruce-moore, the author’s contact page, and various articles with links to presentation slides.

Figure 1. 3D scatterplot showing click-through-rate, position, and article size. Larger articles generally rank more highly and have better click-through rates.
Figure 2. 3D scatterplot showing click-through-rate, position, article size, and subject area (as a color). Larger articles generally rank more highly and have better click-through rates regardless of subject area. Blue represents general articles, red–web, green–banking, yellow–consumer.

Linear Regression of Article Size

The linear regression shown in Figure 3 below indicates that increasing page size and decreasing position improve the CTR, although only the page size coefficient is significant at the 0.05 level. The adjusted R-squared value of only 0.08799 indicates that this is not a good regression model, and that some other variable is probably much more important to the CTR. Including the query area (general, banking, web and consumer) does not shed much additional light on the situation: as shown in Figure 4, the adjusted R-squared rises only to 0.09272.

Figure 3. Regression model of CTR vs Position and Page Size.
## 
## Call:
## lm(formula = CTR ~ Position + pageSize, data = scatterDf)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.031509 -0.016552 -0.004023  0.012603  0.101812 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.400e-02  9.695e-03   1.444   0.1529  
## Position    -2.296e-04  1.407e-04  -1.632   0.1069  
## pageSize     4.225e-07  1.743e-07   2.423   0.0178 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0252 on 75 degrees of freedom
## Multiple R-squared:  0.1117,	Adjusted R-squared:  0.08799 
## F-statistic: 4.714 on 2 and 75 DF,  p-value: 0.01179
Figure 4. Regression model of CTR vs Position, Page Size, and Query Area.
## 
## Call:
## lm(formula = CTR ~ Position + pageSize + queryArea, data = scatterDf)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.031381 -0.018114 -0.003975  0.013424  0.105836 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        2.690e-02  1.272e-02   2.115   0.0379 *
## Position          -1.837e-04  1.435e-04  -1.280   0.2045  
## pageSize           3.469e-07  1.852e-07   1.873   0.0652 .
## queryAreaConsumer -1.510e-02  1.077e-02  -1.403   0.1650  
## queryAreaGeneral  -1.436e-02  8.816e-03  -1.629   0.1078  
## queryAreaWeb      -5.934e-03  1.014e-02  -0.585   0.5604  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02513 on 72 degrees of freedom
## Multiple R-squared:  0.1516,	Adjusted R-squared:  0.09272 
## F-statistic: 2.574 on 5 and 72 DF,  p-value: 0.03369
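
The two models above can be fitted and compared along the following lines. The formulas and column names are taken from the Call lines in Figures 3 and 4, and the row count of 78 follows from the 75 residual degrees of freedom in Figure 3; the data values themselves are synthetic, so the numbers this sketch prints will not match the figures.

```r
# Synthetic stand-in for scatterDf: 78 rows, made-up values.
set.seed(7)
n <- 78
scatterDf <- data.frame(
  Position  = runif(n, 1, 100),
  pageSize  = runif(n, 5000, 80000),
  queryArea = factor(rep(c("Banking", "Consumer", "General", "Web"),
                         length.out = n))
)
scatterDf$CTR <- 0.014 - 2e-04 * scatterDf$Position +
  4e-07 * scatterDf$pageSize + rnorm(n, sd = 0.025)

# Same formulas as the Call lines in Figures 3 and 4.
fitBase <- lm(CTR ~ Position + pageSize, data = scatterDf)
fitArea <- lm(CTR ~ Position + pageSize + queryArea, data = scatterDf)

# Adjusted R-squared penalizes the three extra queryArea coefficients,
# so it is the fairer number to compare across the two models.
adjR2 <- c(base     = summary(fitBase)$adj.r.squared,
           withArea = summary(fitArea)$adj.r.squared)
print(round(adjR2, 4))

# F-test for whether adding queryArea improves the nested model.
print(anova(fitBase, fitArea))
```

Calling summary(fitBase) and summary(fitArea) prints output in the same layout as Figures 3 and 4.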

With no additional data elements in the Google Search Console data set, there isn’t much more insight that can be gained from this dataset without doing text mining on both the queries and the article content. Since this series is primarily about visualization, the next article in the series will look at using the globejs call to visualize geographic data on an interactive globe.

Notes

This article was written in RStudio and uses the threejs package for all graphics except for the linear regression residual plot. The formula display uses MathJax.