Web Hosting and SEO

Duplicate Canonical URLs in Joomla--Configuring sh404SEF

Many content management systems have multiple URLs that map to a single article, which is a serious problem from the perspective of search engine optimization. In my case, resolving a duplicate URL problem approximately doubled my site traffic in a little over a month. Canonical URL markup resolves this problem if it is correctly implemented. Unfortunately, in Joomla 3.x, the canonical markup is applied to every URL, as described in The Problem With Joomla's Canonical URL Links, so several URLs for the same article are each identified as canonical. This wreaks havoc with search engine optimization, because Google and other search engines treat these URLs as duplicated content.
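To make the canonical markup problem concrete, here is a minimal Python sketch that pulls the canonical link out of a page's HTML and counts how many different canonical URLs a set of pages declares. The page contents, URLs, and function names are hypothetical and only for illustration; they are not taken from Joomla or sh404SEF. If every URL for one article names itself as canonical, the count is greater than one.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonical_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so split before comparing.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical_urls.append(attributes["href"])


def canonical_urls(html_text):
    """Return the canonical URLs declared in html_text, in document order."""
    parser = CanonicalLinkParser()
    parser.feed(html_text)
    return parser.canonical_urls


def distinct_canonicals(pages):
    """Given {url: html}, return the set of canonical URLs the pages declare."""
    declared = set()
    for html_text in pages.values():
        declared.update(canonical_urls(html_text))
    return declared


# Hypothetical example: two URLs for one article, each naming itself canonical.
URL_A = "https://example.com/index.php/blog/12-my-article"
URL_B = "https://example.com/index.php/component/content/article/12-my-article"
pages = {
    URL_A: f'<html><head><link rel="canonical" href="{URL_A}"></head></html>',
    URL_B: f'<html><head><link rel="canonical" href="{URL_B}"></head></html>',
}
print(len(distinct_canonicals(pages)), "canonical URLs declared for one article")
```

A correctly configured site would declare a single canonical URL no matter which of the article's URLs is requested.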

The basic steps for diagnosing and fixing this problem are covered in the following sections:

Determining Whether or Not You Have a Duplicate URL Problem

The first step is to determine whether you have a problem in the first place: your Content Management System (CMS) may not have this problem, or your web site may already have been fixed. To tell if this is a problem on your web site, go to Google Webmaster Tools, and go to Search Appearance->HTML Improvements. If you use a unique description and title for each article, you should see few if any duplicates. When I first worked on this aspect of my web site, there were about 11 articles with duplicate URLs, as shown in Figure 1.

Figure 1. Google Webmaster Tools List of Duplicate URLs at Start.
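The same kind of check can be approximated offline. The sketch below assumes you already have a list of (URL, title) pairs, for example from a crawl or an analytics export; the sample data and the function name are hypothetical. It groups URLs by title and reports any title that appears under more than one URL, which is roughly the signal the report above is based on.

```python
from collections import defaultdict


def find_duplicate_titles(pages):
    """Group URLs by page title and keep only titles used by more than one URL.

    pages is an iterable of (url, title) pairs.
    """
    urls_by_title = defaultdict(list)
    for url, title in pages:
        # Normalise whitespace and case so trivially different titles still match.
        key = " ".join(title.split()).lower()
        urls_by_title[key].append(url)
    return {title: urls for title, urls in urls_by_title.items() if len(urls) > 1}


# Hypothetical crawl results: one article reachable at two URLs.
pages = [
    ("https://example.com/index.php/blog/12-my-article", "My Article"),
    ("https://example.com/index.php/component/content/article/12-my-article", "My Article"),
    ("https://example.com/index.php/about", "About"),
]
duplicates = find_duplicate_titles(pages)
for title, urls in duplicates.items():
    print(title, "->", len(urls), "URLs")
```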

Evaluating Alternative Search Engine Friendly (SEF) URL Extensions

I started the process of fixing the duplicate URLs by researching add-ins that address the problem, and settled on a paid add-in, sh404SEF from Weeblr. It was distributed by Anything Digital until very recently; the reasons for this change are not clear. At this writing, sh404SEF shows options for the following extensions on the configuration page:

  • Contacts
  • Weblinks
  • Virtuemart
  • Community Builder
  • Jomsocial
  • Kunena
  • MyBlog
  • Mosets tree

It shows support for Social Networking, Twitter Cards, Google Authorship and Google Publisher tags. It has numerous other features that I plan to implement, but which were not part of this project. In evaluating extensions for Joomla, one of the key differentiating factors was whether or not an extension handles tags; if you do not use tags, you will have many more choices, and many more free choices.

Planning Will Avoid Misadventures

I installed the add-in; ultimately it eliminated my problem with duplicate URLs, and almost doubled my website traffic in six weeks. Unfortunately, I didn’t really plan my implementation, and that led to a number of problems that I had to fix. Google indexed my site twice while I was figuring things out and ended up indexing even more URLs for each article. The whole process would have been much shorter if I had spent more time reading the documentation at the beginning. The rest of this article describes how to set this up and some of my misadventures in the process. Figure 2 shows the increase in duplicate URLs due to not planning my conversion. The planning section that follows describes the decisions that I should have made in advance of the cut-over and before robots had a chance to crawl my web site.

Setting up sh404SEF on a new system would be a straightforward process. Doing so on an existing system adds a number of complications. You should do this during a low-traffic time and take your site off-line or turn off robot access in robots.txt while you are making all of the changes. The basic planning steps are as follows:

Figure 2. Google Webmaster Tools List of Duplicate URLs after Mis-configuring sh404SEF.
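Before starting the changes, it can be worth confirming that the temporary robots.txt really does turn crawlers away. The following is a minimal sketch using Python's standard urllib.robotparser module; the robots.txt text, the example URL, and the function name are assumptions for illustration. Keep in mind that robots.txt is only a request that well-behaved crawlers honor, and remember to restore the original file once the configuration is finished.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks all well-behaved crawlers to stay away from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def crawlers_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical article URL on the site being reconfigured.
print(crawlers_blocked(BLOCK_ALL, "https://example.com/blog/my-article"))
```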

Decide on Changes to Menu Structure for Web Site

sh404SEF sets up the canonical URL based upon the menu structure used to get to an article. If you plan any changes to your menu structure, now would be the time to make them so that you only have to set up your redirects once. During the month after setting up sh404SEF, Google Analytics and other analytics tools will show multiple URLs for each article and will thus be difficult to interpret. If possible, finish any web site redesigns before making the switch to sh404SEF.

Decide on Whether or Not to Implement .htaccess

The second major decision is whether or not you want your URLs to include index.php immediately after the domain. If you don’t want this, you will need to set up .htaccess rewriting. The help for sh404SEF does not have much information on this, but the .htaccess file that is generated by Akeeba Admin Tools apparently has everything that is necessary. Changing my decision on this was one of the reasons my duplicate URL problem ballooned before I got everything configured correctly.

Decide on Whether or Not to Include .html as an Extension on URLs

The third decision in the set-up process is whether or not to use a .html extension by default. If you try to make this change after Google has re-indexed the site (which will probably happen on the first day), you will have to set up redirects for the pre-sh404SEF URLs and for the post-sh404SEF .html URLs. This is one of the mistakes that I made that caused the number of duplicate URLs to balloon.
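A short sketch shows why changing your mind on these two choices multiplies the URLs that search engines may index. The two yes/no decisions (index.php in the URL, .html at the end) give up to four forms of the same article URL. The exact URL shapes, the domain, and the function name below are assumptions for illustration only.

```python
from itertools import product


def url_variants(domain, path):
    """List the URLs one article can have across the two set-up choices.

    The choices are whether index.php appears after the domain and whether
    the URL ends in .html, giving 2 x 2 = 4 combinations.
    """
    variants = []
    for with_index_php, with_html_suffix in product((True, False), repeat=2):
        prefix = "/index.php" if with_index_php else ""
        suffix = ".html" if with_html_suffix else ""
        variants.append(f"https://{domain}{prefix}/{path}{suffix}")
    return variants


# Hypothetical article path; every variant printed here would need a redirect
# to the one form you finally settle on.
for url in url_variants("example.com", "blog/my-article"):
    print(url)
```

Settling both choices before the first crawl means only one of these forms is ever exposed, in addition to the original pre-sh404SEF URLs.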

Collect Historical URLs

The most important step in the planning process is to collect your historical URLs by making a copy of your sitemaps and a copy of all of the landing pages in Google Analytics or whatever analytics tool you use. You will use these lists to set up redirects from your old URLs to the new ones so that search engine users will be able to find your content while the search engines are in the process of indexing your new URLs.
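If your saved sitemap is in the standard XML format, the list of historical URLs can be extracted with a few lines of Python. This is a minimal sketch that assumes a sitemaps.org-style urlset document; the sample content and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> URL in a sitemap XML string, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Hypothetical saved sitemap with two pre-sh404SEF URLs.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/index.php/blog/my-article</loc></url>
  <url><loc>https://example.com/index.php/about</loc></url>
</urlset>
"""
old_urls = urls_from_sitemap(SAMPLE)
for url in old_urls:
    print(url)
```

Saving this list to a file before the cut-over gives you a fixed record of the old URLs to work from later.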

Configure sh404SEF

Once you have made these decisions, it is time to install and configure sh404SEF. Make sure to do this during a low-traffic time and temporarily turn off robot access in robots.txt. The steps are generally:

Turn off Search Engine Friendly (SEF) URLs in Base Joomla

The first step in the configuration process is to turn off the SEF URLs and rewriting in base Joomla as shown in Figure 3.

Figure 3. Turn off SEF URLs and rewriting in Base Joomla.

Enable sh404SEF and Turn on .htaccess Rewriting

The next step after installing sh404SEF is to enable it and turn on rewriting (if you have chosen not to have index.php as part of each URL) in the component administration page, as shown in Figure 4. To remove index.php from your URLs, go to the first panel of Components->sh404SEF->Control Panel and set the rewriting mode to .htaccess.

Figure 4. Enable sh404SEF and turn on .htaccess rewriting.

Remove .html from URL File Suffix

If you have decided not to use a file suffix, clear the file suffix setting in the configuration pages. To remove the .html at the end of URLs, go to Components->sh404SEF->Configuration->General->Main and change the value in the File Extension box from .html so that it is empty, as shown in Figure 5.

Figure 5. Remove .html from URL File Suffix.

Set up 404 Redirects for All URLs

Using the sitemap that you saved, visit each of the URLs that it lists. Each visit to an old URL will cause a 404 not found error to be logged on your web site. If you forgot to save a sitemap, you can log in to Google Analytics, go to the Acquisition->Landing Page section, and visit each of the URLs listed. Next, log in to Google Webmaster Tools, go to the query section, and do the same thing. This will generate a relatively complete list of all of the old URLs in the redirect module of sh404SEF.
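Visiting every old URL by hand gets tedious on a larger site. The sketch below requests each URL in a list and records the status code, using only Python's standard library. Whether requests made by a script are logged by sh404SEF in the same way as visits from a browser is an assumption you would need to verify on your own site; the function names are hypothetical.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Request url and return the final HTTP status code (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is what we want here.
        return err.code


def visit_all(urls, fetch=fetch_status):
    """Request every URL once and return {url: status code}."""
    return {url: fetch(url) for url in urls}


def not_found(statuses):
    """Return the URLs whose request came back as 404."""
    return [url for url, status in statuses.items() if status == 404]


# Example use with your own list of old URLs (not run here):
# statuses = visit_all(old_urls)
# print(not_found(statuses))
```

Running the same script again after the redirects are in place gives a quick check that none of the old URLs still returns a 404.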

After generating 404 errors for all of the old URLs, it is time to go back to sh404SEF and create redirects to the new URLs. To do this in the Joomla Admin section, go to Components->sh404SEF->404 Requests Manager as shown in Figure 6. For each of the URLs listed, select the “Redirect to an SEF URL” option and select the correct URL for the article. If the URL does not appear in the prompt list, you may need to use the “Enter a redirect URL” option. This will get most of the important URLs on your site redirected, but you will need to repeat this step every day for a couple of weeks to get all of the URLs redirected.

If you have a large number of old URLs to redirect, the URL manager in sh404SEF has an import capability that may help in this process.
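For a long list, it may be easier to prepare the old-to-new URL pairs in a spreadsheet-friendly file first. The sketch below writes such pairs as CSV. The column names and layout are hypothetical and are not the format that the sh404SEF import expects; check the sh404SEF documentation (or export a few existing URLs) and adjust the columns to match before importing anything.

```python
import csv
import io


def write_redirects_csv(redirects, out):
    """Write (old_url, new_url) pairs to out as CSV with a header row."""
    writer = csv.writer(out)
    # Hypothetical column names; match them to whatever the import expects.
    writer.writerow(["old_url", "new_url"])
    for old_url, new_url in redirects:
        writer.writerow([old_url, new_url])


# Hypothetical old and new URLs for two pages.
redirects = [
    ("/index.php/blog/my-article.html", "/blog/my-article"),
    ("/index.php/about.html", "/about"),
]
buffer = io.StringIO()
write_redirects_csv(redirects, buffer)
print(buffer.getvalue())
```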

Figure 6. Set up Redirects from Old URLs.

Set up Name for Lists of Articles

The final major step in setting up sh404SEF is setting the name to be used in the URLs that show a list of the articles in each category. “Table” is used in the example shown in Figure 7.

Figure 7. Set Name (Table in this case) for List of Articles in a Category.

Final Results

After installing sh404SEF and mis-configuring it, the number of duplicate URLs actually grew from 11 to 23 as shown in Figure 2, and then went higher. After correctly configuring sh404SEF and waiting several weeks for Google to reindex the site, the number shrank to 3 as shown in Figure 8, and will likely soon be one or two.

Figure 8. Google Webmaster Tools List of Duplicate URLs after Correct sh404SEF Configuration.

Conclusion

Fixing the duplicate URL problems on your web site can substantially increase your site traffic due to better indexing within Google and other search engines; in my case, traffic almost doubled in six weeks. Thinking through some of the problems in advance will make this a less labor-intensive process.