Visualizing Web Analytics Data in R Part 1: The Problem
The analysis of Google Analytics and other web analytics data is a big part of many marketing efforts today, and can provide useful information for market analysis and web site design as well as the more obvious topic of search engine optimization (SEO). This is the first of a series of articles that demonstrate the use of R to analyze and visualize web analytics data from Google Webmaster Tools (now Google Search Console) and Google Analytics. The series is based upon a presentation given to the Dallas Infographics and Data Visualization Meetup on September 24, 2015.
The fundamental business problem for this case study is how to decide where to focus redesign and search engine optimization efforts on a web site: in this case, this web site. The articles that follow use traditional 2-dimensional plots and new interactive visualization tools to help answer the following business questions:
- Which articles need work to improve search engine ranking
- Which articles are well ranked but do not get clicked, and need work on titles or meta data
- Where to focus efforts for new content
- How to use passive web search data to focus new product development
The other articles in the series are:
- Visualizing Web Analytics Data in R Part 2: Interactive Outliers
- Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D)
- Visualizing Web Analytics Data in R Part 4: Interactive Globe
- Visualizing Web Analytics Data in R Part 5: Interactive Heatmap
- Visualizing Web Analytics Data in R Part 6: Interactive Networks
- Visualizing Web Analytics Data in R Part 7: Interactive Complex
Search Console Data Give Seasonality
The first thing to do in analyzing web analytics data is to plot the basic data from Google Search Console:
- Impressions–how many times did Google present your site in a search result?
- Position–where in the search results did Google rank your web site?
- Click-through Rate (CTR)–when presented in a search result, what portion of users clicked on your web site?
- Clicks–how many users ultimately clicked on your site? This is the product of impressions and CTR.
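As a small worked example of the last relationship, the sketch below multiplies impressions by CTR to recover clicks. The numbers are invented for illustration and are not taken from this site's data.

```r
# Hypothetical daily values, invented for illustration only.
impressions <- c(2000, 3500, 1200)
ctr <- c(0.05, 0.04, 0.10)  # click-through rate expressed as a fraction

# Clicks is the product of impressions and CTR.
clicks <- impressions * ctr
print(clicks)
```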
Figure 1 is a simple plot of the four quantities on one plot, but it isn’t very useful because the four variables have very different scales: Impressions is in at least the thousands, while Position is generally less than 50. We need to scale the four variables by dividing each by its maximum value, as shown in Figure 2. Finally, we need to reverse the Position variable so that improvements in position (smaller number) move in the same direction as improvements in clicks (bigger number), as shown in Figure 3. The calculation for this is
\[ \text{Reversed Position} = 1 - \frac{\text{position}}{\max(\text{position})} \]
Figure 3 shows that the site gets a lot of traffic during the business week, but that isn’t surprising since the products and services are aimed at businesses rather than consumers. Generally, the search ranking is slowly improving over time, and something interesting happened during the first week of July: although position did not change significantly, the number of impressions, CTR and clicks increased dramatically for a short period of time. Understanding this increase in search traffic and the important search terms for content development is the topic for the second and third articles in this series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers and Visualizing Web Analytics Data in R Part 3: Interactive 5D (3D).
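The scaling and reversal steps described above can be sketched in base R as follows. This is a minimal illustration, not the code used to produce the figures: the data values are invented, the `searchDf` name and its `CTR` and `Position` columns follow the regression output later in this article, and the other column names and the `scaleByMax` helper are assumptions.

```r
# Hypothetical sample of daily Search Console data (values invented).
searchDf <- data.frame(
  Date        = as.Date("2015-07-01") + 0:4,
  Impressions = c(2400, 3100, 5200, 1800, 900),
  Position    = c(18.2, 17.5, 17.9, 19.1, 21.4),
  CTR         = c(0.041, 0.044, 0.062, 0.038, 0.030)
)
searchDf$Clicks <- searchDf$Impressions * searchDf$CTR

# Hypothetical helper: divide a variable by its maximum so its largest value is 1.
scaleByMax <- function(x) x / max(x, na.rm = TRUE)

scaledDf <- data.frame(
  Date             = searchDf$Date,
  Impressions      = scaleByMax(searchDf$Impressions),
  Clicks           = scaleByMax(searchDf$Clicks),
  CTR              = scaleByMax(searchDf$CTR),
  # Reverse position so that a better rank (smaller number) plots higher.
  ReversedPosition = 1 - scaleByMax(searchDf$Position)
)
print(scaledDf)
```

With this definition, the day with the worst (largest) position maps to a reversed position of 0, and better-ranked days move toward 1.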
These plots give us a clear understanding of seasonality, but they don’t give a complete picture of the importance of position in a search: if we buy ads to appear in the paid top spots in search results, will that improve click-through, and by how much? The section that follows after Figure 3 looks at these problems.



How are Clicks and Click-through-rate Related to Search Position?
To figure out what to advertise and how to allocate article revision and SEO efforts, it is important to understand how clicks and click-through rate are related to search position. Figure 4 shows a plot of clicks vs. position along with a fitted curve; this is very helpful, but since clicks are the product of impressions and CTR, we know that the high values are related to an event in July that does not apply to the web site in general. To get a more general understanding, we need to plot CTR vs. position, as shown in Figure 5. In this plot, we can clearly see that some days get higher click-through than would be expected from the search rank while other days underperform. The R-squared value of about 0.52 in the regression model shown in Figure 6 indicates that about half of the variation in CTR is explained by search position; the other half is likely related to whether or not the user thought the article was relevant. The title and snippet therefore matter a lot in whether a user will click through to read an article; getting a good search ranking by advertising is only about half the battle. The residual plot shows some non-linearity as the fitted value increases; this probably indicates the big difference between showing up on the first page of search results and the second page of search results.
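A regression of this kind can be sketched in base R as shown below. The data here are synthetic: the sample size, coefficients, and noise level are assumptions loosely modeled on the regression output in this article, so the printed R-squared will not match the published value. The sketch passes the formula as a formula object rather than as a string.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic stand-in for the daily data: 90 days, invented for illustration.
n <- 90
position <- runif(n, min = 10, max = 30)
ctr <- 0.075 - 0.0019 * position + rnorm(n, sd = 0.007)
searchDf <- data.frame(Position = position, CTR = ctr)

# Fit CTR as a linear function of search position.
fit <- lm(CTR ~ Position, data = searchDf)
fitSummary <- summary(fit)

# Share of the variance in CTR explained by position.
rSquared <- fitSummary$r.squared
print(rSquared)

# Residuals vs. fitted values, to look for non-linearity.
plot(fitted(fit), resid(fit), xlab = "Fitted CTR", ylab = "Residual")
abline(h = 0, lty = 2)
```

A negative slope for `Position` means CTR falls as the ranking gets worse, and curvature in the residual plot would suggest that a straight line does not fully describe the relationship.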
The next article in the series, Visualizing Web Analytics Data in R Part 2: Interactive Outliers, looks at how to build plots with interactive graphics and fly-overs to help in understanding which pages are over-performing or under-performing relative to their position in search results.


## 
## Call:
## lm(formula = "CTR~Position", data = searchDf)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.015291 -0.005032 -0.001089  0.004313  0.022848 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.075490   0.004091  18.451  < 2e-16 ***
## Position    -0.001883   0.000194  -9.704 1.46e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007425 on 88 degrees of freedom
## Multiple R-squared:  0.5169, Adjusted R-squared:  0.5114 
## F-statistic: 94.17 on 1 and 88 DF,  p-value: 1.459e-15

Notes
This article was written in RStudio and uses the ggplot2 package for all graphics except for the linear regression residual plot. The formula display uses MathJax.