In my experience, SEOs are split into two camps. The first camp is excited about finding new tools and staying on the cutting edge by learning about each one. The second camp sticks with a core set of time-honored tools, only looking for new ones when their current set can no longer accomplish what they need, or a new tool clearly makes theirs obsolete.
As a staunch member of that second camp, I’ve become very adept at using Google Search as a go-to tool for evaluating websites. In my eight years in the industry, I’ve never come across any tools that could replace the data it provides (including crawling tools like Screaming Frog).
Are you making the most of Google search while evaluating your sites? Here are a few tricks that may not have occurred to you:
Examining How Google Understands Your Site & Architecture
I’ve heard and read that Google does not place the “strongest” pages first when using a site:website.com search, but in practice, it seems an awful lot like they do. I find that the better I communicate the site architecture to Google through sitemaps, link structure, menus, schema, breadcrumbs, URL optimization and other signals, the more likely Google is to order the core content first and put the junk on the later pages. I found a great example of this a few weeks ago: a site: search for Lowe’s returned a barren-looking “recipes” page as their #10 result:
Another common issue I’ve seen involves 301-redirects. While building these redirects is often part of standard best practices, I was interested to learn recently that when a redirect is built from a page that’s hosted on one server, and pointed to one hosted on a second server, Google tends to not deindex the page with the redirect (Note: while most redirected pages seem to be deindexed quickly, Google states a 301 redirected page can remain indexed – see the yellow box in the link for details).
Most commonly, this indexing issue turns up in the case of site migrations across servers. However, this also stands true when a redirect is built from one website to an externally owned site. Shaw’s has this issue with their indexation; the result is pretty interesting, with a long string of redirects that assigns /PlayWithOreo the title text for “Albertsons – Boise, Idaho” and places the result as #2 with a site: search. However, the link itself redirects to a Facebook page! (It’s also worth noting that a page redirecting to one of their Pinterest boards is their #1 result.)
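To see where a redirected URL like this actually ends up, it can help to trace the chain hop by hop. The following is a minimal sketch, not a finished tool: `trace_redirects`, `leaves_host`, and the stand-in `fake_fetch` (with its made-up URLs) are all hypothetical names for illustration. In real use you would replace `fake_fetch` with a function that requests the URL without following redirects and returns the status code and `Location` header.

```python
from urllib.parse import urljoin, urlparse

REDIRECT_CODES = (301, 302, 303, 307, 308)

def trace_redirects(url, fetch, max_hops=10):
    """Follow redirects using fetch(url) -> (status, location); return every URL visited."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in REDIRECT_CODES or not location:
            break
        # Location headers may be relative, so resolve against the current URL.
        chain.append(urljoin(chain[-1], location))
    return chain

def leaves_host(chain):
    """True when the chain ends on a different host than it started on."""
    return urlparse(chain[0]).netloc != urlparse(chain[-1]).netloc

# Hypothetical responses standing in for real HTTP requests.
FAKE_RESPONSES = {
    "http://website.com/promo": (301, "/promo/"),
    "http://website.com/promo/": (301, "https://social.example/page"),
}

def fake_fetch(url):
    return FAKE_RESPONSES.get(url, (200, None))

chain = trace_redirects("http://website.com/promo", fake_fetch)
print(chain, leaves_host(chain))
```

Chains that leave the original host are the ones worth checking against a site: search, since those are the redirected pages described above as most likely to stay indexed.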
As you help Google understand your website, be prepared for fluctuation in traffic. In some cases, Google organic traffic may even drop after these improvements, while the conversion rate from that traffic rises by 30% or more, and user behavior metrics rise across the board.
One Extra Tip: When doing site: searches, I also like to check out what Google shows at the very end of the search – there’s almost always “broken” content appearing somewhere in the last few pages. To speed things up, I toss &start=900&filter=0 at the end of Google’s search string URL. That brings you straight to the end of the search results, with nothing omitted.
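If you do this often, the URL can be assembled in a few lines. This is a small sketch that assumes Google’s `q`, `start`, and `filter` query parameters behave as described in the tip; the function name `site_search_url` is just an illustration.

```python
from urllib.parse import urlencode

def site_search_url(domain, start=900, unfiltered=True):
    """Build a Google URL for a site: search that jumps deep into the results."""
    params = {"q": f"site:{domain}", "start": start}
    if unfiltered:
        # filter=0 asks Google to include results it would otherwise omit as similar.
        params["filter"] = 0
    return "https://www.google.com/search?" + urlencode(params)

print(site_search_url("website.com"))
```

Paste the printed URL into a browser; Google may stop short of the requested offset if it has fewer results to show.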
Finding “Broken” Things
Using Google Site Search is a great way to troubleshoot for indexed error and “not found” pages. These are pages that add little or no value to your users and waste Google spider time that could otherwise be used to crawl and discover your strongest content. Additionally, they can potentially lead to Google penalties due to low-quality content and duplication. If these pages are behind a form or site search functionality, or were machine-created by adding ?tags to URLs, you won’t find them with site crawlers.
At the very least, these pages should be removed from the index and/or assigned 301-redirects. However, in many cases, you can also find potential issues that your IT/Development team can fix. I tend to do a site: search for all of the following:
- site:website.com “not found” Example: The 2,450 pages found on the Costco website here.
- site:website.com “error” Example: The 285 error pages found on the Coach website here.
- site:website.com “sorry” Example: Hundreds of thousands of product dead-end pages mixed in the results here.
One good find from these searches was from the Sears website. I found 2,020 pages with the content “an error has occurred” here, by starting with the “error” search and clicking through the results. In many cases, a simpler search can be refined to something that better targets the pages you want removed, providing IT with better data for troubleshooting. I also like to explore the website itself and try to break things, so I can use the error and “not found” messages generated to scour Google search for pages with the same content.
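The searches above are easy to generate for any domain, including refined phrases you discover along the way. A minimal sketch (the helper name `error_queries` is hypothetical):

```python
# Phrases from the list above, plus the refined phrase found on the Sears site.
ERROR_PHRASES = ["not found", "error", "sorry", "an error has occurred"]

def error_queries(domain, phrases=ERROR_PHRASES):
    """Return one quoted site: query per phrase, ready to paste into Google."""
    return [f'site:{domain} "{phrase}"' for phrase in phrases]

for query in error_queries("website.com"):
    print(query)
```

Add the exact wording of any error message you manage to trigger on the site itself to the phrase list.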
Finding Orphaned/Abandoned Pages
With large, enterprise-level companies, the number of pages on a website can be tremendous, and in many cases, a large number are orphaned with each site update. Similarly, medium-sized companies can have a high turnover on their IT team, with each generation leaving a trail of “stuff” behind them. The result of this can be an accumulation of a lot of leftovers, which you’ll want to identify to either remove the content or migrate it to the new template.
To find these dregs, a favorite trick of mine is to tap into the power of the Wayback Machine. Here’s how I made it work for cnn.com:
- Go back in time to a very early archived version of the website. In this case, I picked a template they used in 2001.
- Older templates often had boilerplate content tucked away somewhere. I picked out “2001 Cable News Network” from the footer.
- Add the content, in quotes, in a site: search. In this case, my search turned up 19,100 results, the vast majority of which seem to be content in the old template.
- You haven’t found all the results yet – Google will only show you a sampling of their index at one time, and research will be required if you want to discover as much as possible in the first round. As I checked out CNN.com, I used the first round of search results as a starting point. For example, I noticed that a large portion of the content is in the /2001/ folder. I followed up by typing [site:www.cnn.com/2001/] into Google, and found 20,900 more pages, which seem to be largely in the old template. Were this my account, I’d bring this data to the dev team to get everything in that folder migrated into the new template, rather than trying to piece together a URL list that’s 100k entries long.
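Spotting a folder like /2001/ in the first round of results can be done by eye, but a quick tally helps once you have copied a few hundred result URLs into a list. Here is a rough sketch; the function name and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

def top_level_folders(urls):
    """Count how many URLs fall under each top-level folder, most common first."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # A path with a single segment is a file at the site root, not a folder.
        folder = f"/{segments[0]}/" if len(segments) > 1 else "/"
        counts[folder] += 1
    return counts.most_common()

# Hypothetical URLs standing in for copied search results.
sample = [
    "http://www.cnn.com/2001/US/story-a.html",
    "http://www.cnn.com/2001/WORLD/story-b.html",
    "http://www.cnn.com/index.html",
]
for folder, count in top_level_folders(sample):
    print(f"site:www.cnn.com{folder}  ({count} sampled URLs)")
```

Each folder that dominates the tally is a candidate for its own follow-up site: search.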
In the case of migrated URLs, it’s also worth your time to do a site: search for all domains redirecting to yours. For example, cadbury.com has recently been redirected to mondelezinternational.com. But a site search for cadbury.com reveals dozens of examples of abandoned subdomains that still need proper 301 redirects put in place. There may be abandoned link equity there that could be passed along via 301-redirect to the new corporate website domain.
Identifying Duplicate Content and Indexation Issues
In my experience, duplicate content and indexation issues happen with a website for any number of the following reasons:
Poor 301 Redirects
Improperly applied 301 redirects for query-based results can add a query tag to the home page and core content, resulting in duplication. If you have content that routinely expires (for example, a job board, auction site, or real estate listing page), be sure that any redirects in place are complete, and do not include any residual strings from the back end.
Blog posts are displaying their full text in the main blog archive, while tag, author and category pages are being indexed. They should be under a cut, or unique teaser text should be used to direct users into the full blog text. I personally add noindex, follow tags to /tag/ and /category/ pages, and leave /author/ pages and the main blog archive pages indexed with pagination markup. I like to refrain from deleting these pages if they’re already created, as they can help search spiders to circulate throughout the blog, and also pass along any established page authority they have.
If these pages are all being indexed, it’s a good practice to use a site: search to see how many tag and category pages are being indexed, and what percentage of your total indexation “real estate” is being taken up with this low-value content. You can’t find this out by simply looking at the number of tags in the back end, because many tags will be more than one page deep (when 11 or more blog posts have that tag).
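To get a feel for why the tag count alone understates the problem, here is a small back-of-the-envelope sketch. It assumes 10 posts per archive page (inferred from the 11-post threshold above), and the function names and numbers are hypothetical.

```python
import math

def indexed_tag_pages(posts_per_tag, posts_per_page=10):
    """Estimate how many paginated archive pages a set of tags generates."""
    return sum(
        math.ceil(count / posts_per_page)
        for count in posts_per_tag.values()
        if count > 0
    )

def share_of_index(low_value_pages, total_indexed):
    """Percentage of the indexed pages taken up by low-value pages."""
    return 100 * low_value_pages / total_indexed

# Made-up tag counts: three tags produce six archive pages, not three.
tags = {"seo": 3, "analytics": 11, "news": 25}
pages = indexed_tag_pages(tags)
print(pages, share_of_index(pages, total_indexed=120))
```

The real figures should still come from the site: search; the estimate is only useful as a sanity check against what Google reports.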
Blog posts and articles have been intentionally syndicated across the Internet. This is a hairy issue that’s not easily fixed and could be paired with a linkspam issue. If the problem is extreme, a link disavow may be the answer. The best way to find out about this is to spot check a handful of 5-6 year-old articles, as well as (heavens forbid) a few of the current ones, by copying a several-word segment of each article and pasting it, in quotes, into Google search.
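The spot check can be turned into a repeatable query builder. This is only a sketch: `excerpt_query` is a hypothetical helper, and the optional `-site:` exclusion is my own addition (it relies on Google’s standard exclusion operator to hide your own copy of the text).

```python
def excerpt_query(text, start_word=0, length=8, exclude_domain=None):
    """Quote a several-word segment of an article for an exact-match search."""
    words = text.split()
    phrase = " ".join(words[start_word:start_word + length])
    query = f'"{phrase}"'
    if exclude_domain:
        # Assumption: hide results from the original site so only copies remain.
        query += f" -site:{exclude_domain}"
    return query

article = "Are you making the most of Google search while evaluating your sites?"
print(excerpt_query(article, start_word=2, length=6, exclude_domain="website.com"))
```

Picking a segment from the middle of the article, rather than the opening line, tends to avoid matching legitimate teasers and excerpts.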
High-quality content has been plagiarized and stolen by low-quality competitors. This is done by professional companies (or the lazy third-party companies that built their sites) far more often than I would have guessed. I’ve also noticed that for some industries (like recruiting companies), plagiarism is much more common than with others.
Indexed Search Query Pages
Query results are being indexed via site search features. I tackle this problem by using canonical links on search query pages that point the results back to the original search page. Other SEOs may have alternate solutions they prefer.
One of the scariest things I’ve ever done in this situation was to add a canonical link for all search result pages, where 44,000 pages were indexed, pointing back to the original search page. Google spent five months consolidating the index down to a single page, and we were worried it was too severe a shock to the system. However, traffic was on the rise the whole time, and we never saw any negative repercussions from the strategy.
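Before relying on this approach, it’s worth confirming that the canonical link is actually present on the search result pages. Below is a minimal sketch using Python’s standard-library HTML parser; the class and function names are hypothetical, and the HTML string stands in for a page you would fetch yourself.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")

def find_canonical(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# Hypothetical search result page pointing back to the main search page.
page = '<html><head><link rel="canonical" href="http://website.com/search"></head></html>'
print(find_canonical(page))
```

A return value of `None` means no canonical link was found in the markup.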
One mistake I see too often is when SEOs or dev teams address the problem by simply blocking the search query result pages from search engines via robots.txt. In my experience, rather than removing items from the index, it prevents search engines like Google from knowing when the content has been deleted. So in the end, it can remain indexed indefinitely, even long after the original page has 404’d.
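To check whether a site has fallen into this trap, you can test indexed URLs against its robots.txt rules. The sketch below uses Python’s built-in `urllib.robotparser`; the robots.txt content and URLs are invented examples, and `blocked_urls` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt rules forbid crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Made-up rules that block every search query page.
robots_txt = """\
User-agent: *
Disallow: /search
"""
urls = [
    "http://website.com/search?q=shoes",
    "http://website.com/products/shoes",
]
print(blocked_urls(robots_txt, urls))
```

Any URL that is both blocked here and still showing in a site: search is a page Google cannot recrawl to discover that it has changed or disappeared.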
One good example of indexation issues caused by robots.txt can be found with Macy’s search feature, which blocks all search query links:
Identifying Linkspam And Hacks
It’s easy and super-valuable to find a pharma hack using Google, and every website evaluation should have this phase included early on in its strategy. There’s a good chance that your client does not know that the hack is in place, and you’ll look like a hero when you quickly identify why their site’s traffic has been inexplicably plummeting.
In some cases in the past few years, I’ve also come across scenarios where it seems like a developer may have identified the hack and quietly dealt with it, without “the boss” knowing. In situations like this, Google’s (and especially Bing’s) cache may be the only way to find evidence of a recent hack. In the example below, American Cinema Editors has cleaned the Viagra hack off their site, but the previous issue is still evident on Google search. This can be a valuable screenshot to collect and provide to your client, if they’re getting conflicting messages from IT/Dev and you find yourself in the middle:
As far as Google is concerned, the indexed version still has the Viagra hack in place (see image below), and may be ranking the site accordingly. If your client has Google Webmaster Tools installed, this is a good time to use its Fetch and Remove URL features to signal to Google that significant changes have taken place on the previously-hacked pages, before moving on to the linkspam cleanup.
Bonus Tip: Locating Back-End/Admin SharePoint Content
Google is also great for finding back-end content generated by your CMS. The typical flaws of each CMS, and the level of competency applied by IT and dev with that CMS, vary greatly. If you’re new to SEO, my recommendation would be to create a Google Doc that keeps track of the different CMSs you’ve worked in and the issues that characterize each. For my part, I’ve built up a lot of experience, in particular, in working with SharePoint.
Since the SharePoint environment was not intended for front-end web development, it doesn’t tend to align with SEO best practices. In particular, I find that unless IT is particularly skilled in streamlining the site to remove all unnecessary code, several paths to back-end content are embedded on the pages. These enable search engines to find and index back-end content that’s extremely difficult to find without Google Search (and would likely show up as Direct Traffic in GA). One great example can be found with the SharePoint-run website for Dimension Data. Below, a search for a single type of back-end content reveals 8,330 pages in the index:
Along with this, you may find a boatload of content with the following searches:
- site:website.com inurl:default.aspx
- site:website.com inurl:viewlsts.aspx
- site:website.com inurl:DispForm.aspx
- site:website.com inurl:_vti_bin
- site:website.com inurl:forms.asmx
- site:website.com inurl:permissions.asmx
- site:website.com inurl:imaging.asmx
- site:website.com inurl:UserProfileService.asmx
- site:website.com inurl:ExcelService.asmx
- site:website.com inurl:meetings.asmx
- site:website.com inurl:lists.asmx
- site:website.com inurl:webs.asmx
- site:website.com inurl:WebPartPages.asmx
- site:website.com inurl:spscrawl.asmx
- site:website.com Reports
- site:website.com SharePoint
- site:website.com “style library”
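Typing these out for every new SharePoint site gets tedious, so a short script can generate them for you. This is a sketch with a hypothetical helper name, and it covers only a handful of the inurl: patterns from the list above; extend the list as needed.

```python
# A subset of the URL patterns listed above.
SHAREPOINT_URL_PATTERNS = [
    "default.aspx",
    "viewlsts.aspx",
    "DispForm.aspx",
    "_vti_bin",
    "lists.asmx",
    "webs.asmx",
]

def sharepoint_queries(domain, patterns=SHAREPOINT_URL_PATTERNS):
    """Return one site: + inurl: query per unique pattern, keeping list order."""
    unique_patterns = dict.fromkeys(patterns)  # drops accidental duplicates
    return [f"site:{domain} inurl:{pattern}" for pattern in unique_patterns]

for query in sharepoint_queries("website.com"):
    print(query)
```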
One especially frustrating issue is when back-end SharePoint content can be found directly off the domain URL, such as http://www.atidan.com/default.aspx, when the core URL is http://www.atidan.com/SitePages/default.aspx. I’ve seen those URLs rank very highly in a site: search, and they may interfere with Google’s understanding of which is the true home page.
Note: If you have any tricks to finding back-end content that are specific to a different CMS back end, I’d be ecstatic if you shared them with me in the comments.