So, you’ve either done an oopsie or are jumping into a new site/client, and there are a few too many URLs.
What do you do next?
How do you fix it?
Do you even need to fix anything?
- 1 The URL over-indexation issue
- 2 Determining whether you have an over-indexation issue
- 2.1 Discovered – currently not indexed
- 2.2 Crawled – currently not indexed
- 2.3 Soft 404
- 2.4 Duplicate without user-selected canonical, Duplicate, Google chose different canonical than user & Duplicate, submitted URL not selected as canonical
- 2.5 Alternate page with proper canonical tag
- 2.6 Page with redirect
- 2.7 Not found (404)
- 3 Manually verifying over indexation issues
- 4 Cleaning up URL over-indexation
- 5 Give. Google. Time.
The URL over-indexation issue
When working with programmatic SEO, there’s a fine line between enough URLs to target your keywords effectively, and wayyyy too many URLs.
Particularly when it comes to filtered search result pages.
Basically, too many URLs and it becomes an over-indexation issue.
It’s over-indexation when you’re creating significant quantities of crawlable low-value pages.
Patching an over-indexing issue can seriously improve the tech SEO element of a larger website.
I’ve seen serious growth for sites from doing this alone.
It takes time, but patching over-indexation can lead to some epic wins for clients, or yourself, and really help Google home in on your key pages.
The causes of URL over-indexation
These could be pages that:
- Resurface essentially the same content, over and over.
- Rearrange the order of the exact same content (sort/ordering selectors)
- Have old/broken query parameters leading to duplicate content
- Use query parameters for filters that are a pretty URL too
- Use query parameters that don’t offer significant value to be separate URLs
- Have no results available
- Are filtered product URLs, linked from filtered URLs, where the product doesn’t need the filters
- Have capitalised letters in the URLs due to internal linking bugs
- Sit deep in extreme pagination counts
- Are crawlable result ordering URLs
There are so many different ways to create too many URLs, that it’s really site-dependent.
They can change from client to client, site to site.
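To make a few of those causes concrete, here’s a rough sketch of how you might flag them across a crawl export. The parameter names (`sort`, `order`, `page`) and the pagination cut-off are assumptions for illustration, so swap in whatever your own site actually uses.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names and threshold; adjust for your own site.
SORT_PARAMS = {"sort", "order"}
PAGE_PARAM = "page"
MAX_USEFUL_PAGE = 20  # assumed cut-off for "extreme" pagination


def flag_url(url: str) -> list[str]:
    """Return the over-indexation causes a URL appears to match."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    flags = []
    # Capitalised letters in the path usually point at an internal linking bug.
    if parts.path != parts.path.lower():
        flags.append("capitalised path")
    # Sort/order selectors only rearrange the same content.
    if SORT_PARAMS.intersection(params):
        flags.append("sort/order parameter")
    # Very deep pagination rarely deserves to be crawled.
    page = params.get(PAGE_PARAM, ["1"])[0]
    if page.isdigit() and int(page) > MAX_USEFUL_PAGE:
        flags.append("extreme pagination")
    return flags
```

Running something like this over every URL in a crawl gives you a quick count per cause, which helps decide what to look at first.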
Why it could be an issue
The large majority aren’t extremely useful for a good portion of your users.
Pretty much none of them are worth having Googlebot go over them.
These pages not only dilute the value of other links on your site, they also cause crawling issues by sucking up large portions of your crawl budget.
Many of these lower-value URLs can actually also outrank your primary URL, due to how Google will see the links pointing in.
They can be a real hassle!
Determining whether you have an over-indexation issue
The largest indication of an over-indexation issue is a significantly larger % of ‘excluded’ rather than ‘valid’ URLs in the coverage report.
This report is basically saying that Google has found a total of 16.5 million URLs on the site, but only believes 350K of them are worth indexing.
Some of the main warnings you’ll see with regards to over-indexation are:
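To put those two numbers in perspective, here’s the quick arithmetic as a tiny sketch. The function name is just something made up for the example.

```python
def indexed_share(valid: int, total_found: int) -> float:
    """Share of discovered URLs that Google considers worth indexing."""
    return valid / total_found


# Figures from the coverage report described above.
share = indexed_share(350_000, 16_500_000)
print(f"{share:.1%} of discovered URLs are indexed")  # prints "2.1% of discovered URLs are indexed"
```

So roughly 1 in every 47 URLs Google has found is making it into the index.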
Discovered – currently not indexed
Google has found the URL linked to, but hasn’t crawled it. It could either be queued for crawl, or Google pre-determined that it wouldn’t be worth crawling.
URLs here could sit here for a day, or could sit here for an eternity.
Might be bad, might not. Really depends on the type of URL and why.
Improving internal linking, and prioritising key pages, can often help URLs here.
Another talked-about factor here is content quality. However, ‘technically’ these URLs haven’t been crawled yet, so ‘technically’ Google wouldn’t know the content quality of that page. Content quality could instead come down to the content that is linking through to this page.
URLs can be indexed, and then move back to this stage. When that happens, it would lead more toward the content issue.
Crawled – currently not indexed
URLs here have been found, and crawled, by Google. They’re now either in a bit of a holding pattern waiting for rendering or some more ‘thinking’ from Google, or they’ve been deemed unfit for indexation.
It’s a bit hard to tell, but most of the time once URLs hit this stage they’ll slowly be processed and either indexed, or be moved back to Discovered.
Improved content can help a URL move out of this stage faster, but so can prioritised internal links so it’s a bit of a game here like the discovered stage.
Soft 404
A clear indicator it relates to your content this time.
Google thinks this page should be a 404, when in fact it’s not.
Most common with 0-result SRPs, where you’re actively trying to index pages that will say something like “0 results found” or “we’ve got no results that match your search”.
Google will pick these up and flag the page as a soft 404.
Duplicate without user-selected canonical, Duplicate, Google chose different canonical than user & Duplicate, submitted URL not selected as canonical
Essentially the same issue, yet the “without user-selected canonical” just means you haven’t implemented a canonical tag on the pages in question.
Google is choosing a different URL to the one you’re specifying in your canonical tag, and is deeming this version as a duplicate.
You’ve got too many URLs that are either extremely similar, or exactly the same, as each other.
This could be due to certain filters being crawled and indexed that return the exact same set of results.
This could be from location data sets where two locations are essentially the same, or you have some bad filter data where both ‘apartment’ and ‘apartments’ exist, and return the same results.
Google’s choice is sometimes wrong, but sometimes right too, so it definitely needs to be investigated to understand how they’re duplicates.
Alternate page with proper canonical tag
One some might see as a content flag, this is actually usually a big tech flag.
This means you’re actively linking to pages that are then canonicalling somewhere else.
You should instead be directly linking to the canonical version of the pages.
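A rough way to check this is to compare the URL you’re linking to against the canonical that page declares. The sketch below uses Python’s built-in HTML parser. It assumes you already have the page HTML, and it compares the two URLs as exact strings, which is a simplification.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def links_to_canonical(linked_url: str, html: str) -> bool:
    """False when the linked page canonicalises to a different URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    # No canonical at all is treated as "fine" here; that's an assumption.
    return finder.canonical in (None, linked_url)
```

Any internal link where this returns `False` is a candidate for being pointed straight at the canonical version instead.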
Page with redirect
Unless you’ve done a large-scale migration, a significant portion of URLs sitting under this flag indicates that you could have internal links pointing at redirects.
Internal links pointing to redirects could be wasting your crawl budget by forcing Google to crawl multiple URLs to end up at a single URL.
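If you have a crawl export of your redirects, a small sketch like this can work out where each internal link should really point. The redirect map is assumed to be a plain dictionary of source to destination, and the hop limit is an arbitrary safety net.

```python
def resolve(url: str, redirects: dict[str, str], max_hops: int = 10) -> tuple[str, int]:
    """Follow a redirect map and return (final_url, hops)."""
    hops = 0
    seen = {url}
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
        if url in seen:  # redirect loop, so stop following
            break
        seen.add(url)
    return url, hops


def links_to_update(links, redirects):
    """Map each internal link that hits a redirect to its final destination."""
    return {link: resolve(link, redirects)[0] for link in links if link in redirects}
```

For example, with `{"/old": "/interim", "/interim": "/new"}` as the redirect map, a link to `/old` costs Google two extra hops and should be updated to `/new`.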
Not found (404)
You’re linking through to URLs that don’t exist.
Manually verifying over indexation issues
Another method of checking, but more importantly validating, the size of an indexation issue, is using a couple of advanced search operators in a Google search.
A combination of these two will significantly help you identify and isolate specific over indexation issues.
If we use realestate.com.au as an example, we can look at two of the query parameters they use.
Their sort parameter is “activeSort” so we can do a quick indexation check by searching;
Only 21 results, it’s not an issue at all.
But if we then check out one of their filter parameters, the parking spaces filter, we can see a different story;
5,600 URLs are being flagged that have that query parameter in them.
In the scheme of things, it’s not major though. 5,000 is nothing for a site that size.
Is it worth it for them to fix? Probably not.
It’s certainly worth monitoring though.
As with most SEO stuff, there’s a big caveat here.
These advanced operators don’t return exact result counts.
They’re generally not even close, but it’s what we have, so we gotta use it as best possible.
So take the numbers with a grain of salt, and use them to weigh any issues up against each other.
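If you’re checking a lot of parameters, it can help to generate the searches rather than type them out. This sketch assumes the two operators being combined are `site:` and `inurl:`. The second parameter name in the loop is a made-up placeholder, not one realestate.com.au actually uses.

```python
def index_check_query(domain: str, parameter: str) -> str:
    """Build a Google query that surfaces indexed URLs containing a parameter."""
    return f"site:{domain} inurl:{parameter}"


# "someFilterParam" is a hypothetical placeholder for a filter parameter.
for param in ("activeSort", "someFilterParam"):
    print(index_check_query("realestate.com.au", param))
```

Paste each line into Google and note the rough result count, keeping in mind the counts are only good for comparing issues against each other.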
Cleaning up URL over-indexation
There are a few things you can look at for patching over-indexation issues.
Improving your internal linking is the first step to patching over-indexation issues.
- Remove crawlable links to heavily filtered results, like query parameter results
- Remove links to sorted search results
- Remove links to 0-result SRPs
- Remove links to redirects
Redirect sets of URLs
There are cases where query parameters that don’t actually do anything get indexed. There might be a mistake in the code leading to these being crawled, or they might have previously been indexed.
These are easily patched, by straight-up stripping them with a 301 redirect.
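As a rough sketch of that stripping logic, here’s how you might work out the 301 target for a URL. The parameter names in `DEAD_PARAMS` are made up for the example, and how you actually serve the redirect will depend on your stack.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that no longer change the page output.
DEAD_PARAMS = {"ref", "oldfilter"}


def redirect_target(url: str):
    """Return the URL to 301 to, or None if no dead parameter is present."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in DEAD_PARAMS]
    if len(kept) == len(pairs):
        return None  # nothing to strip, so no redirect needed
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Here `https://example.com/buy?ref=nav&beds=2` would redirect to `https://example.com/buy?beds=2`, while a URL with only live parameters is left alone.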
For over-indexation of active parameters, you might need a different solution though.
It’s completely case-dependent, however, there might even be a case for changing the parameter name and then redirecting the old versions.
Whether the actual parameter name is changed, or the values, would be up to you.
You can update every location where that parameter is used to use the new name or values, and then strip the parameter whenever the old name, or values, appear.
Blocking URLs in robots.txt is a good way for a new site to avoid a couple of the ways over-indexation happens, however, it’s usually something I use as an extreme last resort for an existing site.
It completely cuts Google off and doesn’t give them a chance to clean up the mess that was created.
Any URLs that had any value within the blocks will now essentially have that value just disappear, rather than be redirected (whether actual redirect, or canonical) to a more appropriate spot.
Let me just leave this one here…
Don't use robots.txt to block indexing of URLs with parameters. If you do that, we can't canonicalize the URLs, and you lose all of the value from links to those pages. Use rel-canonical, link cleanly internally, etc.
— 🐝 johnmu.csv (personal) weighs more than 16MB 🐝 (@JohnMu) November 8, 2019
A follow-up tweet directly states that Google might actually continue to index the pages blocked by robots.txt. You’re linking to them. Google thinks they’re important, so they’re going to guess and rank them anyway.
We wouldn't see the rel-canonical if it's blocked by robots.txt, so I'd pick either one or the other. If you do use robots.txt, we'll treat them like other robotted pages (and we won't know what's on the page, so we might index the URL without content).
— 🐝 johnmu.csv (personal) weighs more than 16MB 🐝 (@JohnMu) November 8, 2019
On top of this, any URL that was actually ranking and driving traffic will be completely cut off and will deindex. You may have a more appropriate URL to rank, but if Google can’t appropriately see exactly which, you could completely lose any traffic that was being driven from these blocked URLs.
If, however, you decide to go ahead with this please give it 3-6 months from when you’ve implemented fixes before implementing the block.
Please avoid unless absolutely dire.
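If you do end up considering a block, it’s worth checking which internally linked URLs it would actually cut off first. Here’s a rough sketch using Python’s built-in robots.txt parser. The rules and URLs are made up, and Python’s matching may not behave exactly the way Google’s does, so treat the output as a guide only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked_links(links):
    """Internal links a crawler obeying these rules would not be allowed to fetch."""
    return [url for url in links if not parser.can_fetch("Googlebot", url)]
```

Anything this returns is a URL where Google would no longer see your canonical tag, so make sure none of them are driving traffic before you block them.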
Give. Google. Time.
Google is slow.
It could take months before your patches start to take serious effect. Give them time to recrawl, to update their index, and to properly re-assess these over-indexed URLs, after you’ve made your changes.
It’ll be worth it.