R, Statistics and Visualization

Visualizing Web Analytics Data in R Part 1: The Problem

The analysis of Google Analytics and other web analytics data is a big part of many marketing efforts today, and can provide useful information for market analysis and web site design as well as the more obvious topic of search engine optimization (SEO). This is the first in a series of articles that demonstrate the use of R to analyze and visualize web analytics data from Google Webmaster Tools (now Google Search Console) and Google Analytics. The series is based on a presentation given to the Dallas Infographics and Data Visualization Meetup on September 24, 2015.

The fundamental business problem for this case study is deciding where to focus redesign and search engine optimization efforts on a web site, specifically this one. The articles that follow use traditional 2-dimensional plots and new interactive visualization tools to help answer the following business questions:

  • Which articles need work to improve search engine ranking
  • Which articles are well ranked but do not get clicked, and need work on titles or meta data
  • Where to focus efforts for new content
  • How to use passive web search data to focus new product development

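The other articles in the series include:

  • Visualizing Web Analytics Data in R Part 2: Interactive Outliers
  • Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)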

Search Console Data Give Seasonality

The first step in analyzing web analytics data is to plot the basic data from Google Search Console:

  • Impressions–how many times did Google present your site in a search result?
  • Position–where in the search results did Google rank your web site?
  • Click-through Rate (CTR)–when presented in a search result, what portion of users clicked on your web site?
  • Clicks–how many users ultimately clicked on your site? This is the product of impressions and CTR.

Figure 1 shows the four quantities on a single plot, but it isn’t very useful because the scales differ so much: Impressions is at least in the thousands, while Position is generally less than 50. To compare them, we scale each of the four variables by dividing it by its maximum value, as shown in Figure 2. Finally, we reverse the Position variable so that improvements in position (a smaller number) move in the same direction as improvements in clicks (a bigger number), as shown in Figure 3. The calculation for this is

\[ \begin{aligned} \text{Reversed Position}&=1 - \frac{\text{position}}{\text{max(position)}} \end{aligned} \]
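
The scaling and reversal described above can be sketched in R roughly as follows. This is a minimal illustration rather than the article's original code: the data values are synthetic, and the column names Date, Impressions, and Clicks, along with the helper scale_by_max, are assumptions made for the example (searchDf, CTR, and Position are the names that appear in the regression output later in the article).

```r
# Synthetic stand-in for a Search Console export; all values are made up.
searchDf <- data.frame(
  Date        = as.Date("2015-06-01") + 0:4,
  Impressions = c(1200, 1500, 900, 3000, 1800),
  Position    = c(22.5, 20.1, 25.0, 19.8, 21.3),
  CTR         = c(0.030, 0.035, 0.025, 0.050, 0.032)
)

# Clicks is the product of impressions and CTR.
searchDf$Clicks <- searchDf$Impressions * searchDf$CTR

# Hypothetical helper: divide a variable by its maximum so it tops out at 1.
scale_by_max <- function(x) x / max(x, na.rm = TRUE)

metrics <- c("Impressions", "Position", "CTR", "Clicks")
scaledDf <- searchDf
scaledDf[metrics] <- lapply(searchDf[metrics], scale_by_max)

# Reverse the scaled position so that a better (smaller) rank plots higher:
# Reversed Position = 1 - position / max(position)
scaledDf$ReversedPosition <- 1 - scaledDf$Position

print(scaledDf)
```

After this step every metric lies between 0 and 1, so all four series can share one axis, and the day with the worst (largest) position gets a reversed value of 0.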

Figure 3 shows that the site gets a lot of traffic during the business week, but that isn’t surprising since the products and services are aimed at businesses rather than consumers. Generally, the search ranking is slowly improving over time, and something interesting happened during the first week of July: although position did not change significantly, the number of impressions, CTR and clicks increased dramatically for a short period of time. Understanding this increase in search traffic and the important search terms for content development is the topic for the second and third articles in this series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers and Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D).

These plots give us a clear understanding of seasonality, but they don’t give a complete picture of the importance of position in a search: if we buy ads to show up as paid top spot ads in search results, will that improve the click-through, and how much will it improve click-through? The section that follows after Figure 3 looks at these problems.

Figure 1. Plot of data without scaling or normalizing values
Figure 2. Plot of data after scaling values
Figure 3. Plot of data after scaling values and changing Position to 1 - scaled_value

How are Clicks and Click-through Rate Related to Search Position?

To figure out what to advertise and how to allocate article revision and SEO efforts, it is important to understand how clicks and click-through rate are related to search position. Figure 4 shows a plot of clicks vs. position along with a fitted curve. This is helpful, but because clicks are the product of impressions and CTR, the high values reflect the event in July and do not apply to the web site in general. To get a more general understanding, we need to plot CTR vs. position, as shown in Figure 5. In this plot, we can clearly see that some days get a higher click-through rate than would be expected from the search rank, while other days underperform. The adjusted R-squared value of the regression model shown in Figure 6 (0.51) indicates that about 50% of the variation in CTR is explained by search position; the remaining 50% is related to other factors, most plausibly whether or not the user thought the article was relevant. Clearly, the clarity of the title and snippet matters a lot in whether a user will click through to read an article; getting a good search ranking by advertising is only 50% of the battle. The residual plot shows some non-linearity as the fitted value increases; this probably reflects the big difference between showing up on the first page of search results and the second page.
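
As a rough sketch of the regression step, the code below fits CTR against Position and splits the variation in CTR into the share explained by position and the share left over. It does not use the article's data: the 90 rows are simulated, with generating parameters chosen only to resemble the coefficients in the Figure 6 output, and the Residual column and the overPerformers name are assumptions introduced for this example.

```r
# Simulated data only; these are not the site's actual Search Console values.
set.seed(42)
n <- 90
Position <- runif(n, min = 10, max = 25)
CTR <- 0.075 - 0.0019 * Position + rnorm(n, sd = 0.007)
searchDf <- data.frame(Position = Position, CTR = CTR)

# Simple linear regression of click-through rate on search position.
fit <- lm(CTR ~ Position, data = searchDf)
fitSummary <- summary(fit)

# Share of the variation in CTR explained by position, and the remainder.
explained <- fitSummary$adj.r.squared
unexplained <- 1 - explained

# Residuals: a positive value means the day did better than its rank predicts.
searchDf$Residual <- residuals(fit)
overPerformers <- searchDf[order(-searchDf$Residual), ][1:5, ]

print(coef(fit))
print(round(c(explained = explained, unexplained = unexplained), 3))
print(overPerformers)
```

Sorting by residual is one simple way to list the days (or, with page-level data, the pages) that beat or miss the click-through rate their position would predict.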

The next article in the series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers, looks at how to build plots with interactive graphics and fly-overs to help understand which pages are over-performing or under-performing relative to their position in search results.

Figure 4. Plot of clicks vs. search position. The search position clearly influences the number of clicks, but so does the number of impressions, which is not shown in this plot.
Figure 5. Plot of click-through rate (CTR) vs. search position. It is clear that search position is important, but that other factors cause some pages to over- or under-perform relative to the search position.
Figure 6. Linear regression of click-through rate (CTR) vs. position and regression residual plot. The adjusted R-squared value indicates that almost 50% of the variation in CTR is unrelated to search position.
## 
## Call:
## lm(formula = "CTR~Position", data = searchDf)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.015291 -0.005032 -0.001089  0.004313  0.022848 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.075490   0.004091  18.451  < 2e-16 ***
## Position    -0.001883   0.000194  -9.704 1.46e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007425 on 88 degrees of freedom
## Multiple R-squared:  0.5169,	Adjusted R-squared:  0.5114 
## F-statistic: 94.17 on 1 and 88 DF,  p-value: 1.459e-15

Notes

This article was written in RStudio and uses the ggplot2 package for all graphics except the linear regression residual plot. The formula display uses MathJax.