Category: Programmatic SEO

Programmatic SEO: The Manual Layer


When working on programmatic SEO builds, you’re thinking about the data.

Where has that data come from? Has it been vetted and cleansed?

Chances are, it probably hasn’t.

Well, not properly or from an SEO point of view.

Even more than that, has it been extended?

Has anyone looked at holes in the data, and tried to work out if there are any pieces missing?

Sometimes there are some easy wins in a quick data update.

 

Updating your filters

Checking what filters you’re using, along with the actual filter values, is a great first step in the manual element of programmatic SEO.

 

Grouping your existing filters with a ‘translation’ layer

Some sites have significant numbers of values within some filters. These values could be extremely similar.

Rather than just using the data as it comes in, imagine a layer in between it, and the website.

Imagine modifying and/or grouping the data, to be more user/SEO friendly.

For real estate, you could have ‘units’, ‘apartments’, ‘condos’, ‘flats’, and ‘granny flats’.

Maybe it doesn’t make sense to have individually targeted pages for all of them, and instead, group apartments/condos and units/flats together.
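To make this concrete, here's a minimal sketch of what that translation layer could look like in Python. The groupings and value names below are purely illustrative, not a recommendation:

```python
# A 'translation' layer: raw feed values get mapped to grouped,
# SEO-friendly values before any pages are generated.
PROPERTY_TYPE_GROUPS = {
    "unit": "units-flats",
    "flat": "units-flats",
    "apartment": "apartments-condos",
    "condo": "apartments-condos",
    "granny flat": "granny-flats",
}

def translate(raw_value: str) -> str:
    """Return the grouped value, falling back to the raw value."""
    return PROPERTY_TYPE_GROUPS.get(raw_value.lower().strip(), raw_value)

print(translate("Condo"))      # apartments-condos
print(translate("townhouse"))  # townhouse (unmapped values pass through)
```

The website then only ever sees the grouped values, so the raw feed can stay as messy as it likes.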

Find out more about deciding what filters to optimise for, here.

 

Extending your categories & sub categories

Particularly with marketplaces & classifieds sites, are there any categories and subcategories, or other layers, that are missing from the data that you have content for?

Are consumers listing items that you currently don’t cover?

I had a client before with a classifieds site that had pages for plenty of gaming consoles.

Xbox, Xbox One, PlayStation, PlayStation 2, PlayStation 3, etc., but they didn’t have the latest consoles listed.

Whilst they might have had the overall product line covered under the ‘PlayStation’ page, they were missing the itemised pages for PlayStation 4 and PlayStation 5, missing out on some key targeting opportunities for the site.

They had all the category pages for when the site was launched, but nothing that had since popped up over the years the site had been around.

By adding these extra variables into the category set, they could instantly create these new pages and more efficiently target these terms.

All just by checking what the values are, and what else is being listed on the site that might be getting missed by the current data set.

You can really get some compounding growth with the added categories. You might not have a heap of listings related to the new ones, but once you’re more effectively targeting them, you can start ranking. Once you start ranking, you’ll get more traffic for the categories, and chances are, you’ll get more people wanting to sell those items / offer those services too.

The missing categories could start a nice growth trajectory for you.

 

Improving your location data

Location data can often be missing key locations, or specific location tiers.

Looking at Australian location data, one key city is the Gold Coast.

However, this isn’t a city in many data sets. It might be under a region, and that level might not be getting integrated into the system.

If that’s the case, the entire location could be getting skipped from being targeted, leaving a giant gap in the strategy.
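A hypothetical sketch of patching that gap, assuming a data set with city and region fields and a hand-picked promotion list (the field names and the list itself are assumptions for illustration):

```python
# Hand-picked regions that should be treated as city-level pages,
# e.g. the Gold Coast, which many data sets only hold as a region.
PROMOTE_TO_CITY = {"gold coast"}

def location_tier(record: dict) -> str:
    """Work out which tier a location record should be built at."""
    if record.get("city"):
        return "city"
    if record.get("region", "").lower() in PROMOTE_TO_CITY:
        return "city"  # promoted region, so it still gets targeted pages
    return "region"

print(location_tier({"region": "Gold Coast"}))  # city
```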

 

Upgrading your data

Your data may not be 100% complete. Some info might be missing.

A piece of content could suit a landing page perfectly, but maybe it’s missing a tag?

Maybe a product hasn’t been correctly categorised?

Or maybe there are some specs missing, and adding those specs could double the amount of content on product pages.

There are so many ways that you could be upgrading your data to improve your SEO.

See what gaps you can fill yourself, or build out a process and maybe even offshore the work.

You could have a team fill out the data for you, and as time goes on you’ll get more and more.

 

Creating custom pages

Just like my recommendation for tackling the lower value filters, you can create custom filter combo pages.

These pages involve the heavy manual element of mapping out all the rules, but there could be 10s, hundreds, or even thousands of pages you could be creating to fill in gaps in targeting, that a default filter couldn’t target alone.

Try and understand if there are gaps in your targeting for keywords that include multiple filters you’re already optimising for.

 

The manual element could be key

There could be a manual element in your programmatic SEO build that you’re missing.

This manual element could be the key to success, by giving your build a point of difference.

Using Data to Determine What Filters Should be Targeted


Bedrooms, and bathrooms, and price ranges, oh my!

There are so many filters that could be used in the pretty URL, but what should be used?

What should earn the “pretty” status, and what should be relegated to query parameter status?

Some may say everything goes in pretty.

Some may just pick a few, and leave it at that.

Well, let’s take a look at how you could use data to inform your decision.

I’m not going to go into why you shouldn’t be creating pretty URLs for everything, that’s a separate post.

What we will run through are ways to set up keyword data, for you to gain insights about search trends of the filters you’ve got, so that you know what you should be optimising for.

The data you’ll need

To run this analysis, you’ll ideally just need an extremely large keyword research piece specifically for your niche.

It will need to be rather large to ensure you can get solid data, and you’ll also need to have it pretty clean, or at least understand the caveats of your data.

If you’ve also got a tonne of GSC data to mix in, then that would be great. That will help extend the data to cover a good portion of the 0-volume keywords that might not show up in keyword research tools.

For my examples below, I just spent 20 minutes pulling together some real estate keywords for 3 Australian cities, and ~15 suburbs from a “Top Suburbs in Melbourne” type list.

I wanted to limit it to location-specific keywords, as without a giant list of seed locations it can be hard to compare location vs non-location keywords without significant time spent. It makes things messy.

 

Setting up your data

Whether you’re using Excel or Google Sheets, you’ll need to create a data table to power the analysis.

The data table should contain all the keywords, their volumes, and associated categorisation.

 

Create categories for each filter

You’ll need to create categories for each of the filters that you are trying to analyse and work out whether they should be optimised for.

Go through and create columns, and set up their categorisation with rules for each possible value so that you can capture as many as possible.

For my example, I am using real estate keywords. A column has been created for each filter I’d like to analyse, along with categorisation for each of them.

Each filter has its seed values assigned, along with what the value is for that variable.

If a keyword contains either ‘buy’ or ‘sale’, it gets flagged as “Buy”.

If a keyword contains ‘1 bed’ or ‘1br’ it gets flagged as 1 Bedroom.
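Those flagging rules could be sketched in Python along these lines, assuming simple substring matching (the seed lists here are examples only, not a full rule set):

```python
# Each filter maps its possible values to the seed phrases that flag them.
FILTER_RULES = {
    "channel": {"Buy": ["buy", "sale"], "Rent": ["rent", "rental"]},
    "bedrooms": {"1 Bedroom": ["1 bed", "1br"], "2 Bedroom": ["2 bed", "2br"]},
}

def categorise(keyword: str) -> dict:
    """Return {filter: value} for every filter a keyword matches."""
    flags = {}
    for filter_name, values in FILTER_RULES.items():
        for value, seeds in values.items():
            if any(seed in keyword for seed in seeds):
                flags[filter_name] = value
                break  # first matching value wins for this filter
    return flags

print(categorise("1 bed apartments for sale melbourne"))
# {'channel': 'Buy', 'bedrooms': '1 Bedroom'}
```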

You can read more about how this works here.

You’ll want to be as thorough here as possible, and include as many variations as possible.

A couple of missed ones could really sway a decision.

Try and also create a catchall category at the end of filters with only a variable or two.

I created one for ‘features’ based on what I was seeing in the keyword data.

 

Prefiltering / cleansing the data

Depending on how clean your keyword data is, it might be better to just look at a portion of it.

A portion you know is 90% cleaner than the rest of the data.

For my real estate keywords, I know that if the keyword includes a channel, so ‘sale’, ‘buy’, ‘rent’, or ‘rental’, there is a higher chance of it being a keyword of enough quality for the study.

To include keywords that don’t include a channel (like ‘real estate’ or ‘properties’), I also include keywords including a bedroom or bathroom filter.

This is done via a YES/NO filter, that just flags it as YES if any of the filter cells have something inside them.

All my data analysis will have this filter applied, and it brings the keywords down from 9,000 to just 2,000.

I know those 2,000 are trustworthy to tell a story.
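The YES/NO prefilter described above might look something like this, assuming each keyword is a row with one cell per filter (the column names are illustrative):

```python
def include_row(row: dict) -> str:
    """Flag YES if any of the filter cells have something inside them."""
    filter_cols = ("channel", "bedrooms", "bathrooms")
    return "YES" if any(row.get(col) for col in filter_cols) else "NO"

rows = [
    {"keyword": "apartments for sale melbourne", "channel": "Buy"},
    {"keyword": "melbourne real estate"},  # no filters flagged, gets dropped
]
kept = [r for r in rows if include_row(r) == "YES"]
print(len(kept))  # 1
```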

 

Creating your pivot tables

You’ll now need to create pivot tables for each of them so that you have a way to read the data.

Go through and create a pivot table for each of your filters with the below data;

  • Filter as the row
  • Search volume SUM as value
  • Search volume COUNT as value
  • Search volume COUNT shown as % of grand total as value

The SUM should be obvious, being that it will be the total amount of search volume for each of the filter values.

The COUNT will be how many times that filter value is used among the keyword set.

The COUNT & % of Total will show us the actual % of keywords that use this filter value. A little quicker to analyse than the overall count alone.
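If you’d rather script it than pivot it, the same three metrics can be computed in plain Python (the sample rows below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical keyword rows: (keyword, search volume, property type value).
rows = [
    ("houses for sale melbourne", 1000, "House"),
    ("apartments for rent melbourne", 700, "Apartment"),
    ("1 bed apartments melbourne", 90, "Apartment"),
    ("melbourne real estate", 5000, None),  # no property type flagged
]

# SUM and COUNT of volume per filter value, like the pivot table.
pivot = defaultdict(lambda: {"sum": 0, "count": 0})
for _keyword, volume, prop_type in rows:
    pivot[prop_type]["sum"] += volume
    pivot[prop_type]["count"] += 1

# COUNT shown as % of grand total, the quick-read column.
for prop_type, stats in pivot.items():
    pct = stats["count"] / len(rows) * 100
    print(prop_type, stats["sum"], stats["count"], f"{pct:.0f}%")
```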

 

Analysing and selecting your filters

Now we’ll get to read our data and see what we can make of it.

Let’s take a look at my property-type keywords.

We can see that of the 2,000 keywords included, 85% mention a property type. So only 15% are more generic keywords like ‘properties’ or ‘real estate’.

Even if you consider ‘homes’ as generic, that’s still less than a quarter of the keywords without a property type.

So yes, property type 100% needs to be optimised for.

 

Looking at the features keywords.

Only 2 keywords include pets, 2 with pools, and then 1 mentioning granny flat. If these were the only filter values available, I would not be optimising for them.

Similar story with the bathrooms keywords.

Only 2 keywords contain a bathroom-related phrase. Probably wouldn’t recommend targeting that in bulk.

Now onto the 2 that are a bit contentious when it comes to real estate sites.

The first one being bedrooms.

Bedrooms is one I personally recommend against optimising for directly under normal circumstances. At least at first anyway.

I feel it creates too many variations of URLs, with not enough reward/value in return for doing so. Can be worth targeting once all indexation/crawling boxes are ticked, especially with some rules in place, but maybe not directly out the gate.

In saying that, looking at the data, 10% of the keywords (7% of total volume) include a bedroom value.

Is that enough to warrant targeting of it? Maybe.

But if we break that data down a bit further, and split out the city (Melbourne) from the ~15 suburbs, we see something a bit different.

16% of the city keywords (14% of volume) contain a bedroom term, versus only 5% (1% of volume) of the suburb keywords.

So that’s one location with a significantly larger share of keywords including it than the 15 other locations combined.

So if you create pages equally amongst cities & suburbs, you’re going to be creating significant volumes of pages when only a small portion of them will be useful.

Yeah, long-tail value this and that. I’m not saying definitely don’t, I’m just advising against it without restrictions in place.

A similar situation is with the prices.

Pretty low volume for the majority of the keywords that include a price (normally ‘under xxx’ type keywords).

And if we break it into city vs suburb, we get;

None of the suburb keywords in this data include a price. It’s only at the city level.

 

Why some filters may not be worth targeting

I’m a big believer in limiting crawlable URLs where possible.

Minimising re-use of listing content, and avoiding the possibility of confusing Google too much.

Keeping the site as small as possible, whilst still targeting as much as possible.

So why would I recommend avoiding creating bedrooms or pricing optimised URLs in bulk?

Well, it comes down to page count.

Crawlable page count to be specific.

Let’s say you have a location set of 5,000 locations.

10 property types.

and 2 channels.

You’ve already got 100,000 crawlable URLs right there.

If you then have 7 bedroom options, you’re looking at 700,000 URLs in addition to the 100,000 that exist, that Googlebot will have to constantly trawl through.
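The maths above, as a quick sanity check:

```python
# Crawlable URL count from the base filter combinations.
locations, property_types, channels = 5000, 10, 2
base_urls = locations * property_types * channels
print(base_urls)  # 100000

# Adding a 7-value bedrooms filter multiplies the base set again.
bedroom_options = 7
bedroom_urls = base_urls * bedroom_options
print(bedroom_urls)  # 700000 extra URLs, a 700% increase on the base
```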

Is it worth enlarging your site by 700% to target an extra 7% in search volume?

If you think so, then go for it.

That’s also if you do it cleanly. If you have other filters with crawlable links on the site, that overall crawlable URL count will only increase.

So if you’re creating significant page volumes off of smaller % filters like this bedrooms count, you must ensure you have your crawling well in check before you launch.

That way you can avoid exacerbating any existing issues.

There are other ways of efficiently targeting these types of keywords though.

In particular, I recommend a targeting strategy here for targeting filters that may have value at key locations, and not others, by having a couple of tiers of locations.

 

Picking your filter values

To try and keep some filters in check too, you can also optimise the system so that only certain values of a filter get optimised.

Using the bedrooms as an example, you might choose to just create pretty URLs for Studios, 1 bed, 2 bed, and 3 bedroom apartments. 4+ bedrooms would then be relegated to the query parameter, and not receive the internal links pointing into it.
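As a sketch, that pretty-vs-parameter decision could be a simple whitelist check (the URL patterns and the value list here are assumptions, not a prescription):

```python
# Only whitelisted bedroom values earn a pretty URL; everything else
# stays relegated to the query parameter with no internal links.
PRETTY_BEDROOMS = {"studio", "1", "2", "3"}

def bedroom_url(location: str, bedrooms: str) -> str:
    if bedrooms.lower() in PRETTY_BEDROOMS:
        return f"/{location}/{bedrooms}-bedroom-apartments/"
    return f"/{location}/apartments/?bedrooms={bedrooms}"

print(bedroom_url("melbourne", "2"))  # /melbourne/2-bedroom-apartments/
print(bedroom_url("melbourne", "5"))  # /melbourne/apartments/?bedrooms=5
```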

 

Let the data guide your optimisation

By leveraging this keyword data you can really gain an insight into what filters, and values, you should be optimising for.

Plenty of caveats, particularly around longer tail keywords that tools won’t give you, but there should be more than enough data to at least guide an initial decision.

It’s also easier to expose a filter later on, than to clean up the over-indexation caused by one if it needs to be reverted.

There’s also the other question here: is it even worth putting in the work to have separate ‘pretty’ and ‘parametered’ filters?

I’ll leave it to you to decide.

Programmatic SEO: Cleaning Up Over-Indexation of URLs


So, you’ve either done an oopsie or are jumping into a new site/client, and there are a few too many URLs.

What do you do next?

How do you fix it?

Do you even need to fix anything?

The URL over-indexation issue

When working with programmatic SEO, there’s a fine line between enough URLs to target your keywords effectively, and wayyyy too many URLs.

Particularly when it comes to filtered search result pages.

Basically, too many URLs and it becomes an over-indexation issue.

It’s over-indexation when you’re creating significant quantities of crawlable low-value pages.

Patching an over-indexing issue can seriously improve the tech SEO element of a larger website.

I’ve seen serious growth for sites from doing this alone.

It takes time, but patching over-indexation can lead to some epic wins for clients, or yourself, and really help Google hone in on your key pages.

 

The causes of URL over-indexation

These could be pages that;

  • Resurface essentially the same content, over and over.
  • Rearrange the order of the exact same content (sort/ordering selectors)
  • Have old/broken query parameters leading to duplicate content
  • Use query parameters for filters that are a pretty URL too
  • Use query parameters that don’t offer significant value to be separate URLs
  • Have no results available
  • Filtered URLs linking to a filtered product URL where that product doesn’t need the filters
  • Have capitalised letters in the URLs due to internal linking bugs
  • Extreme pagination counts
  • Crawlable result ordering URLs

There are so many different ways to create too many URLs, that it’s really site-dependent.

They can change from client to client, site to site.

 

Why it could be an issue

The large majority aren’t extremely useful for a good portion of your users.

Pretty much none of them are worth having Googlebot go over them.

These pages will dilute the value of other links on your site, but also cause crawling issues by sucking up large portions of your crawl budget.

Many of these lower-value URLs can actually also outrank your primary URL, due to how Google will see the links pointing in.

They can be a real hassle!

 

Determining whether you have an over-indexation issue

The largest indication of an over-indexation issue is a significantly larger % of ‘excluded’ rather than ‘valid’ URLs in the coverage report.

This report is basically saying that Google has found a total of 16.5 million URLs on the site, but only believes 350K of them are worth indexing.

Some of the main warnings you’ll see with regards to over-indexation are;

 

Discovered – currently not indexed

Google has found the URL linked to, but hasn’t crawled it. It could either be queued for crawl, or Google pre-determined that it wouldn’t be worth crawling.

URLs here could sit here for a day, or could sit here for an eternity.

Might be bad, might not. Really depends on the type of URL and why.

Improving internal linking, and prioritising key pages, can often help URLs here.

Another talked-about factor here is content quality; however, ‘technically’ these URLs haven’t been crawled yet, so ‘technically’ Google wouldn’t know the content quality of that page. So content quality could come down to the content that is linking through to this page.

URLs can be indexed, and then move back to this stage. When that happens, it would lead more toward the content issue.

 

Crawled – currently not indexed

URLs here have been found, and crawled, by Google. They’re now either in a bit of a holding pattern waiting for rendering or some more ‘thinking’ from Google, or they’ve been deemed unfit for indexation.

It’s a bit hard to tell, but most of the time once URLs hit this stage they’ll slowly be processed and either indexed, or be moved back to Discovered.

Improved content can help a URL move out of this stage faster, but so can prioritised internal links so it’s a bit of a game here like the discovered stage.

 

Soft 404

A clear indicator it relates to your content this time.

Google thinks this page should be a 404, when in fact it’s not.

Most common with 0-result SRPs, where you’re actively trying to index pages that will say something like “0 results found” or “we’ve got no results that match your search”.

Google will pick these up and flag the page as a soft 404.

 

Duplicate without user-selected canonical, Duplicate, Google chose different canonical than user & Duplicate, submitted URL not selected as canonical

Essentially the same issue, yet the “without user-selected canonical” just means you haven’t implemented a canonical tag on the pages in question.

Google is choosing a different URL to the one you’re specifying in your canonical tag, and is deeming this version as a duplicate.

You’ve got too many URLs that are either extremely similar, or exactly the same, as each other.

This could be due to certain filters being crawled and indexed that return the exact same set of results.

This could be from location data sets where two locations are essentially the same, or you have some bad filter data where both ‘apartment’ and ‘apartments’ exist, and return the same results.

Google’s choice is sometimes wrong, but sometimes right too, so definitely needs to be investigated to be understood further as to how they’re duplicates.

 

Alternate page with proper canonical tag

One that some might see as a content flag, this is actually usually a big tech flag.

This means you’re actively linking to pages that are canonicalised to somewhere else.

You should instead be directly linking to the canonical version of the pages.

 

Page with redirect

Unless you’ve done a large-scale migration, a significant portion of URLs sitting under this flag indicates that you could have internal links pointing at redirects.

Internal links pointing to redirects could be wasting your crawl budget by forcing Google to crawl multiple URLs to end up at a single URL.

 

Not found (404)

You’re linking through to URLs that don’t exist.

 

Manually verifying over-indexation issues

Another method of checking, but more importantly validating, the size of an indexation issue is using a couple of advanced search operators in a Google search.

site:
inurl:

A combination of these two will significantly help you identify and isolate specific over-indexation issues.

If we use realestate.com.au as an example, we can look at two of the query parameters they use.

Their sort parameter is “activeSort” so we can do a quick indexation check by searching;

site:realestate.com.au inurl:activeSort

Only 21 results, it’s not an issue at all.

But if we then check out one of their filter parameters, the parking spaces filter, we can see a different story;

site:realestate.com.au inurl:numParkingSpaces

5,600 URLs are being flagged that have that query parameter in them.

In the scheme of things, it’s not major though. 5,000 is nothing for a site that size.

Is it worth it for them to fix? Probably not.

It’s certainly worth monitoring though.

 

The caveat

As with most SEO stuff, there’s a big caveat here.

These advanced operators don’t return exact result counts.

They’re generally not even close, but it’s what we have, so we gotta use it as best possible.

So take the numbers with a grain of salt, and use them to weigh any issues up against each other.

 

Cleaning up URL over-indexation

There are a few things you can look at for patching over-indexation issues.

 

Improve your internal linking

Improving your internal linking is the first step to patching over-indexation issues.

  • Remove crawlable links to heavily filtered results, like query parameter results
  • Remove links to sorted search results
  • Remove links to 0-result SRPs
  • Remove links to redirects

 

Redirect sets of URLs

There are cases where query parameters that don’t actually do anything get indexed. There might be a mistake in the code leading to these being crawled, or they might have previously been indexed.

These are easily patched, by straight-up stripping them with a 301 redirect.

For over-indexation of active parameters, you might need a different solution though.

Completely case-dependent, however, there might even be a case for changing the parameter name and then redirecting the old versions.

Whether the actual parameter name is changed, or the values, would be up to you.

You can update all locations that parameter is used to use the new values, and then be able to strip that parameter when the old name, or values, exist.
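A rough sketch of that stripping logic, assuming the retired parameter is simply dropped and the cleaned URL becomes the 301 target (the parameter name below is made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical retired parameter names that should be stripped via 301.
RETIRED_PARAMS = {"numParkingSpacesOld"}

def redirect_target(url: str):
    """Return the cleaned URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k not in RETIRED_PARAMS]
    if len(kept) == len(pairs):
        return None  # nothing retired in this URL
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(redirect_target("/buy/melbourne?numParkingSpacesOld=2&page=2"))
# /buy/melbourne?page=2
```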

 

Robots.txt blocks

This is a good way for a new site to avoid a couple of the ways over-indexation happens; however, it’s usually something I use as an extreme last resort for an existing site.

It completely cuts Google off and doesn’t give them a chance to clean up the mess that was created.

Any URLs that had any value within the blocks will now essentially have that value just disappear, rather than be redirected (whether actual redirect, or canonical) to a more appropriate spot.

Let me just leave this one here…

A follow-up tweet directly states that Google might actually continue to index the robots.txt-blocked pages. You’re linking to them, so Google thinks they’re important, and they’re going to guess and rank them anyway.

On top of this, any URL that was actually ranking and driving traffic will be completely cut off and will deindex. You may have a more appropriate URL to rank, but if Google can’t appropriately see exactly which, you could completely lose any traffic that was being driven from these blocked URLs.

If, however, you decide to go ahead with this please give it 3-6 months from when you’ve implemented fixes before implementing the block.

Please avoid unless absolutely dire.

 

Give. Google. Time.

Google is slow.

It could take months before your patches start to take serious effect. Give them time to recrawl, to update their index, and to properly re-assess these overindex URLs, after you’ve made your changes.

It’ll be worth it.

Just Another WordPress Programmatic SEO Build


Join me on a 30-day journey to a 1,000+ page programmatic SEO build using a combination of WordPress and Google Sheets (with plenty of formulas!).

I’ve been writing a heap about programmatic SEO lately… if you haven’t noticed.

It’s mostly been about enterprise-level sites though. Portals, classifieds & marketplaces type stuff.

How do we apply this to more niche-type sites?

How can the ‘little guy’ leverage programmatic, without big budgets and customised dev builds?

Well, let’s find out.

Let’s build out a no-code (unless Google Sheets formulas count?) WordPress programmatic build.

Caveat: This may or may not actually rank for anything significant. I’d be expecting to pick up some long tail, but probably not too much traction due to significantly thin content. It’s a lot of work to do this more thoroughly, and since it’s just a proof of concept I’m really not spending that extra time, effort & money… especially when one of you will copy it straight away haha. This build is more to talk through some concepts, and share some techniques I use. If you wanted, you could then go off and build your own programmatic system in WordPress using this setup. Even copy it! Let’s see what you can do.

The idea

This tweet got shared with me the other day;

It’s not a niche I’d normally build something in, but, this is the perfect scenario for a niche programmatic build.

Probably also one of the largest non-location-centric keyword sets I have seen.

I also came across a post by Charles & Mushfiq from The Website Flip here, discussing programmatic SEO and showing a quick WordPress example that leverages an automatic integration with Zapier.

This got me thinking.

Not so much on the Zapier side of things – that gets expensive!

But more so on the simplicity of the templating system they’ve used, along with the idea of a bulk WP import.

I want to build this system with a few rules in place, rather than just pure dynamic variable inserts.

A way to be able to pick different content templates based on different values, like yes/no for instance.

Probably going to end up turning into a behemoth of a mess in Google Sheets, but everything will hopefully have a purpose.

Let’s see what we can do.

 

The data

Before we do anything, we need to build out the data set.

 

Initial keyword research

I’m gonna start by just jumping into a keyword tool and grabbing a heap of keywords using the seed of ‘can eat’ as the tool will fill in the other sides.

I exclude a tonne of keywords, with a handful of phrases to get rid of some unrelated junk.

Exporting all these keywords, I throw them into Google Sheets, and now need to extract the animal and the food.

 

Variable extraction

The majority of the keywords follow the same template, so we can strip the ‘can’ and anything after ‘eat’.

To do this I am using the strip before/after text google sheets formulas from here.

I’ll first strip off everything after and including the “eat” by using this formula;

=LEFT(A2,FIND("eat",A2)-1)

After this, I can then substitute “can” out of the keyword, by modifying the formula to;

=substitute(LEFT(A2,FIND("eat",A2)-1),"can ",)

To then get the food, we will strip off everything including the eat to use what’s left, with the following formula;

=RIGHT(A2,LEN(A2)-FIND("eat",A2)-3)
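For reference, the same extraction can be sketched outside of Sheets in Python, assuming keywords follow the ‘can <animal> eat <food>’ template:

```python
def split_keyword(keyword: str):
    """Pull the animal and food out of a 'can <animal> eat <food>' keyword."""
    before, _, after = keyword.partition(" eat ")
    animal = before.removeprefix("can ").strip()
    food = after.strip()
    return animal, food

print(split_keyword("can dogs eat grapes"))  # ('dogs', 'grapes')
```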

Throwing these in a pivot table we get the top used animals and foods in the top 1,000 keywords from the keyword tool;

You’ll see some junk in there like “when babies” and “catholics” so I’ll just quickly copy the list and spend 2 minutes isolating the main animals.

After cleaning, I’m left with a list of 20 animals, and 72 foods;

If we created a page for each combo, we’d have 1,440 pages!

Pretty good start.

But that’s 1,440 combinations I need to work out whether the animal would eat the food.

Quite a bit of work to get the site started, especially when I don’t know what combos are worth it.

 

Prioritising data

Rather than just working through all 1,440 combinations to build out the data set, I’m going to try and prioritise the top combos.

Since we’re working off the same keyword template for all the keywords here, we can bulk generate the combos.

Following the template of ‘can <animal> eat <food>’ let’s throw the animals and foods in mergewords with “eat” in the middle.

We throw these in google sheets, and prepend “can ” by just doing the formula of;

="can "&A2
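The mergewords step can also be reproduced with a quick script (the lists below are trimmed samples of the cleaned sets):

```python
from itertools import product

# Build every 'can <animal> eat <food>' combination, mergewords-style.
animals = ["dogs", "cats", "rabbits"]
foods = ["grapes", "bananas"]

keywords = [f"can {animal} eat {food}" for animal, food in product(animals, foods)]
print(len(keywords))  # 6
print(keywords[0])    # can dogs eat grapes
```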

Now I’ve got the keywords for all the combos sorted, I’m going to go grab some search volumes.

Throwing the keywords in the volume tool I use, we get;

A prioritised list of combos based on the search volume for the combination keyword.

We can now sort the combos based on this, and know which ones to focus on first.

 

Creating the base data set

Time to build out the actual dataset for the page.

I’ve thrown the keywords in a new sheet, along with their animal & food values;

Now for the part that will actually take some time, building out the actual values for each combo.

I’ve also included a few key elements that I think are valuable to have on the pages.

So we now need to fill out, at minimum, the ‘canEat’ variable value to ensure we can answer the key questions of “can <animal> eat <food>”.

This is going to be super tedious, but obviously outsourceable.

I’ve filled out the first few combos so that there is some sample data to build the rest of the site off.

Just googling each keyword and filling out the data based on info in the SERPs. The featured snippet and PAA are pretty useful here!

Don’t know the exact formats we want the data in yet, so best not to fill out much more, or outsource, until we know the formats the data will be in.

 

Basic SEO elements

I’ll work through the main SEO elements now, and bulk generate them for each combo.

Slug – Need to substitute the spaces for hyphens. No other funky characters so that’s it.

=substitute(LOWER(A2)," ","-")

h1 – Need to capitalise the words and throw a question mark on the end.

=PROPER(A2)&"?"

pageTitle – Just the H1 with the site name tacked on the end for now.

=M2&" | Something Something"

metaDescription – The h1 at the front, with some text along with the animal & food variables mixed in again.

=M2&" Find out whether "&D2&" should be eating "&E2&" right here. We've got the answer."

 

After that, we get the following;

 

Building the core page content

Now that the basic pages are mapped out, and we have some sample data for it all, we need to look through the core content.

We can’t just straight insert the variables to answer the core page question, unless each variable included both the “Yes” and the part of “can” or “can not”.

I’d rather not have to have another data point, so instead, we will use content templates and dynamically generate the page content.

So if the value of <canEat> is “YES”, then use the “can eat” and if the value of <canEat> is “NO”, then use the text “can not eat”.
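That rule is essentially a lookup keyed on the variable name and its value. A minimal sketch (the template copy is placeholder text, not the final content):

```python
# Content templates keyed by (variable name, variable value).
TEMPLATES = {
    ("canEat", "YES"): "Yes, <animal> can eat <food>.",
    ("canEat", "NO"): "No, <animal> can not eat <food>.",
}

def pick_template(variable: str, value: str) -> str:
    """Return the content template for a variable's value, else empty."""
    return TEMPLATES.get((variable, value), "")

print(pick_template("canEat", "NO"))  # No, <animal> can not eat <food>.
```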

It’ll just start off like this:

But then we’ve got some other variables to include, so let’s build out a quick template for each of them.

(second canEat rule is meant to say “NO”, not “YES”)

It’s a bit boring, but it’ll do for a start and get the system built out.

I can get a writer to err… fix my handy work later.

Time to put it all together.

I’ll try to keep this whole system as simple as possible, by building the template out for each sentence, and then just sticking each outputted sentence together.

I’ve created a new sheet to map this all out, with each variable being a new column. Probably don’t need to do all of this, and could reference the main table for the rules, but I figure this will be easier to troubleshoot.

All values are just vlookups so that there is one data source.

Using an IF formula, I’ve just worked through the 2 possible variations of this data, being YES or NO, and then outputted the associated content template.

=if(C2='Content Templates'!$B$2,'Content Templates'!$C$2,IF(C2='Content Templates'!$B$3,'Content Templates'!$C$3,))

It correctly outputs the YES or NO content template.
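In code terms, that rule is just a little lookup. A quick JavaScript sketch, with made-up template wording;

```javascript
// The YES/NO rule as a simple lookup (template wording here is invented).
const canEatTemplates = {
  YES: "Yes! <animal> can eat <food>.",
  NO: "No, <animal> can not eat <food>.",
};

function canEatContent(value) {
  // Unknown values fall through to an empty string, like the trailing IF comma.
  return canEatTemplates[String(value).toUpperCase()] ?? "";
}
```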

Moving onto the howOftenContent, I ran into an issue with that formula.

I’ll want to add other options later and don’t want to tweak the formula each time.

A vlookup would work perfectly, but it only works on a single value and can’t match both the variable name and the variable content.

Adding a new column to the content template allows us to combine the variable name (column header) and the variable value, and then match it against the combined value in the content templates.

=iferror(VLOOKUP($E$1&E2,'Content Templates'!C:D,2,0),)
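The combined-key trick looks like this in a quick JavaScript sketch. The template rows here are invented for illustration;

```javascript
// Combined-key lookup: variable name (column header) + variable value
// becomes a single key into the content-template table.
const templateTable = new Map([
  ["howOftendaily", "<animal> can eat <food> daily."],
  ["howOftenrarely", "<food> should only be a rare treat for <animal>."],
]);

function lookupTemplate(header, value) {
  // Mirrors VLOOKUP($E$1&E2,...) wrapped in IFERROR -> empty string on no match.
  return templateTable.get(header + value) ?? "";
}
```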

The other ones are a little easier, for now, as we’re just using a single content template that inserts the variable if it exists.

=IF(istext(G2),'Content Templates'!$D$8,)

So now we’ve got all these templates, we need to replace the variables in them.

To do this, we use a nested substitute formula that swaps out the variable templates, for the actual variable value.

=substitute(SUBSTITUTE(F2,"<animal>",$C2),"<food>",$D2)

 

That formula can be pasted into all the contentFinal cells, and it will substitute out the <animal> and <food> giving you the base template.
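The nested substitutes can be generalised into a tiny helper, sketched here in JavaScript;

```javascript
// Swap every <placeholder> for its value, like the nested SUBSTITUTE calls.
function fillTemplate(template, vars) {
  let out = template;
  for (const [name, value] of Object.entries(vars)) {
    out = out.split("<" + name + ">").join(value); // replaces every occurrence
  }
  return out;
}
```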

We now get something like this for some of the other templates;

There are two issues with this one.

The first is that we need to add another substitution layer to swap out <quantity>.

=substitute(substitute(SUBSTITUTE(L2,"<animal>",$C2),"<food>",$D2),"<quantity>",K2)

That would mean each variable needs to be manually replaced though, which can get annoying and doesn’t help to scale.

Let’s use the column heading instead, since that will match the variable.

=substitute(substitute(SUBSTITUTE(L2,"<animal>",$C2),"<food>",$D2),"<"&K$1&">",K2)

It will output the exact same, except give us a copy & paste formula so that we can copy the cell, and paste it into the other contentFinal cells.

The other issue with this quantity content is that the ‘dogs’ is lower case. It’s at the front of the sentence though, so let’s make it capitalised.

=substitute(substitute(SUBSTITUTE(L2,"<animal>",proper($C2)),"<food>",$D2),"<"&K$1&">",K2)

Note the proper($C2) within it this time.

The new output is;

Now we need to start piecing it all together.

Created a new column for it, and now reference a cell to include it in the text.

CHAR(10) will insert a new line in Google Sheets, and lets us format the cell so the line breaks are actually visible in the sheet.

We then have to extend it so that the text, and its line break, are only included when a sentence exists, ensuring there aren’t blank entries & extra line breaks.

Whilst you’d normally use ISTEXT() for this in Google Sheets, because we’re referencing cells that contain formulas, Google Sheets will always see text.

We have to instead use if(LEN()>1).

=if(len(H2)>1,H2,)&if(len(K2)>1,CHAR(10)&CHAR(10)&K2,)&if(len(N2)>1,CHAR(10)&CHAR(10)&N2,)&if(len(Q2)>1,CHAR(10)&CHAR(10)&Q2,)&if(len(T2)>1,CHAR(10)&CHAR(10)&T2,)&if(len(W2)>1,CHAR(10)&CHAR(10)&W2,)

It looks like a heap of junk, but it’s actually just going cell by cell, with the same formula that uses different cell references.

If this cell has content, include it and add two line breaks before it.
If this cell has content, include it and add two line breaks before it.
If this cell has content, include it and add two line breaks before it.
If this cell has content, include it and add two line breaks before it.
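In other words, it’s just a “join the non-empty sentences with blank lines” step. A quick JavaScript sketch;

```javascript
// Join the generated sentences with double line breaks, skipping blanks,
// like the chained if(len(...)>1, CHAR(10)&CHAR(10)&..., ) formula.
function joinSections(sections) {
  return sections
    .filter(s => typeof s === "string" && s.trim().length > 0)
    .join("\n\n");
}
```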

That’s it.

The output is now;

Pretty cool!

If you try and read it though, you’ll notice that any part that includes one of the variables might seem a bit… off.

This is because we didn’t know how the variables were being used.

Some variables contain a full stop, leading to double-ups.

Others contain a dash at the front, or a line break, leading to a weird format.

Let’s fix that.

 

Tweaking the variable values

After the mess that gets created by the initial content generation, we need to come in and patch the variables and better understand how we want to use them.

This will involve tweaking the variables, but also tweaking the templates a bit to make the new formats work.

To do this, I am bringing the content we put together into the main sheet with a vlookup.

=VLOOKUP(A2,'Content Builder'!A:C,3,0)

Now that it’s on the main datasheet, we can start to go through and edit the variable values to set a format.

I’m standardising the formats a bit by making sure whatever text is here will cleanly fit inside a sentence. Non-capitalised to begin, no full stop at the end, that sort of stuff.

It will make the variable data fit more naturally inside the content.

After cleaning up the variables a bit, and extending the templates to add some little subheaders, we now get;

 

The build

We’ve got the post title, and the post content that we want to use, so now we “just” need to get this into WordPress.

 

Creating our CSV

The blog post I mentioned earlier leveraged Zapier to do this. That’s expensive… and we’re hacking this.

We’re gonna use a CSV import plugin called ‘WP Import Export’ to get this cranking.

I’ve done a quick WP install on a nice fresh domain, so will get a plugin installed to import and work out what we need.

A quick demo export to test the formats gives us;

We can see the sample post includes some HTML code, yet my test post (with classic editor) doesn’t include any of that.

I’m going to roll with the format without HTML code and see what happens.

Replicated the export format, and created a new export tab to paste in the new data.

Gotta paste the values here, and not formulas/formats etc to keep it as a nice clean CSV export.

Download this sheet as CSV by going File > Download.

 

Importing the CSV into WordPress

Importing the CSV by following the steps in the plugin, we get to set all the variables and assign them to each field.

So you assign the title to the post title, content to the post content, category to the category field, etc.

After a few steps, you get a nice little confirmation, and the posts go live.

Magical! We’ve got our first small set of… posts.

Yes. They need some love.

Time to format the post a little, and re-upload the test data.

Have now formatted the subtitles in the content to export with <h2> tags, which looks like

When imported, these now give us;

Much better.

 

Tweaking the design

It definitely needed some clean-up, so I threw up a quick design.

Obviously not too fussed here, and might clean it up later. Might not, we’ll see.

Looks a bit better haha, can upgrade it later.

 

What’s been built so far

So, we’ve got a functioning website, with 5 pages of ‘content’… if that’s what you can call it.

More than that though, we’ve got an entire data structure and process in place, where once the data points are added for a new combination, an additional page gets created.

In under a minute, all the data required for a new page could be collected! Pretty neat.

Yes, I will reveal the site shortly. The Google Sheet will also be shared once the build is complete.

What’s next?

Find out in part 2 of the build here.

Current Page Count: 6 pages

Internationalisation & Hreflang Management

Internationalisation & Hreflang Management

Implementing hreflang for a programmatic build isn’t necessarily hard, once you understand what content you have.

If a piece of content is available in multiple languages and/or target markets, throw up a tag for that version.

If the piece of content isn’t available, don’t have a tag.

There seem to always be bugs in implementations though, mostly silly little things that are easily patched.

Is hreflang important for programmatic builds?

If you’re an international website, in either multiple markets or multiple languages, then yes, hreflang is extremely important for your programmatic build.

Google isn’t as smart as you think, and requires a lot of direction, or ‘hints’ as they call them.

Whilst it will do its best to understand the relationship between two pieces of content across your site, you’re better off telling it bluntly and clearly that these two pieces relate, by either being different languages or targeting different markets for the same piece of content.

 

Pre-launching into a market with programmatic SEO

You could be a classified site, portal, or marketplace looking to expand your reach into new markets.

You can deploy a programmatic build, and start to gain a foothold in a new market before actually pushing the brand there.

Start to rank and drive traffic in a fresh country, and test the waters, without spending $ on any local advertising.

Essentially known as a ‘dark launch’, you can start to build rankings & value in a market, well before you actively want to push there.

The pages will get indexed, and you can start to push your content out in the market, and even acquire new content/listings slowly.

 

Tips for an hreflang implementation

Some of my top tips for an hreflang implementation are;

  • Ensure the current page is included in the tags
  • Ensure x-default is implemented correctly
  • Don’t flag content in a language/region it doesn’t actually exist in
  • Ideally implement meta tags or XML sitemap, not both
  • Ensure each page in the hreflang correctly references all other pages
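To make that tag set concrete, here’s a small sketch that generates a full, self-referencing set. The URLs are placeholders, and every page in the set would output these exact same tags;

```javascript
// Generate the full hreflang tag set for one piece of content.
// The current page is always included, plus an x-default fallback.
function hreflangTags(alternates, xDefaultUrl) {
  const tags = Object.entries(alternates).map(
    ([lang, url]) => `<link rel="alternate" hreflang="${lang}" href="${url}" />`
  );
  tags.push(`<link rel="alternate" hreflang="x-default" href="${xDefaultUrl}" />`);
  return tags;
}

const tags = hreflangTags(
  {
    "en-au": "https://example.com/en-au/page/",
    "en-us": "https://example.com/en-us/page/",
    "de-de": "https://example.com/de-de/page/",
  },
  "https://example.com/en-us/page/"
);
```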

 

Hreflang meta tags vs hreflang XML sitemap

I personally try to stick with hreflang meta tags in the <head>, as it’s easier to check, which makes my life easier.

When sites start to get to 50+ available languages/locations, then things change though.

Aleyda Solis pretty much sums it up in this tweet.

So small sites get meta tags. Big sites get the XML sitemap.

It’s recommended you don’t use both though, as you can run into validation/update issues depending on which version was last cached.

 

Hreflang testing tools

 

Testing via generation

Whilst not a ‘testing’ tool, you can test the output of your build by generating hreflang tags on Aleyda’s site here. Check whether the output of the tool matches the output you’re expecting.

 

Favourite direct testing tool

My favourite testing tool for validating an hreflang implementation is this one.

You throw in your URL, along with selecting a user agent, and you’ll get a nice little report back of all the URLs included in the hreflang. The tool will then quickly scan them all, to determine the tag’s validity and whether return references are available.

 

The top issues with hreflang tag implementations

 

No self-referencing hreflang

Sometimes the canonical is believed to be the required self-referencing tag. Whilst it is required, it is separate from the hreflang setup.

You must include an hreflang tag for the current page, just as it would be referenced on any other page.

Basically, every page gets the exact same set of hreflang tags.

 

Incorrect language/region content being shown

A page should not have any content other than its flagged content. Well, excluding any terms/legal callouts that are required that is.

If a language is flagged as German, then it should only include German content.

If a page is flagged as English, it should only include English content. No mixing and matching.

You should hide any content that isn’t in the flagged language.

 

Incorrectly flagging a URL that doesn’t exist

Sometimes, content is mismatched between languages & regions, and that is perfectly fine.

But don’t tell Google that the content is available in other languages & regions if it’s not.

Don’t include hreflang references to pages that don’t exist in the flagged language. These pages should instead be removed from the hreflang, and should not have hreflang tags on the page itself.

So for a blog post, if that post is only available in its current language, don’t have hreflang tags (you could have a tag to itself, and only itself, if that simplified the implementation though).

If a page is available in 3 of 4 languages, then only mark up those 3 languages, and have the 4th as a 404.

 

Make your hreflang simple where possible

Try and keep it simple, and as your build grows, it will be easier to maintain.

Programmatic SEO: Optimising for the Highest Value, Low Tier Filters

Programmatic SEO: Optimising for the Highest Value, Low Tier Filters

In my post on how search filter pages should work, I talked about not linking through to query parameter URLs.

The bulk of these pages would not have value, particularly when you have millions upon millions of combinations.

So how do you optimise for the combinations that do in fact have value?

The filters you should be creating pages for

Chances are, you already have a rough idea of some of the keywords you want to target.

If you don’t, then look through your keyword data.

What filters are mentioned in the keywords that don’t currently have pretty URLs?

I’ll go into a bit more detail and provide a template for this a little later, for now though, have a good look through your keyword data.

Do people search for bedrooms? bathrooms? prices? features?

Come up with a list of the filters used, along with some sample keywords.

 

Why you should be creating these pages

To target the longtail.

These are the keywords you’ve determined don’t deserve a pretty URL, and don’t deserve to be actively linked to, in bulk.

You don’t want thousands of these pages receiving SEO value and clogging up your crawl rates, as only a handful of them will have targetable volume.

The bulk won’t have volume, however, a small portion of them will.

You want to be able to still target that small percentage of the pages that has the value.

 

How do you create pages for these filters?

Unfortunately, it’s not just as simple as politely asking your dev to “create these pages”.

Even cake won’t solve this one without some forethought!

What categories/subcategories are used?

What locations?

What filters get applied?

Where are these pages linked from?

This is something I have put some thought into over the years, and have come up with a basic strategy that can be applied, and expanded upon, depending on the build.

You want to create “Custom Filter” pages.

 

Creating custom filter pages

To create these custom pages, you need to be as specific as possible about what you’re trying to do, as the developers will have the challenge to solve.

The devs will need to add a layer to an already complex setup, that enables these pages to seamlessly fit into the existing structure.

You want to let them know things from the page name to the filters being used.

I’d recommend flagging a set of top locations, that become the only locations these are used for. Yeah, there are probably more, and yeah you might create too many pages for some of the keywords, but a list of ~100 locations is still better than a page for all 5,000 locations.

There are two ways I see you can request these pages, and it really comes down to what would be easier for your tech teams, and provide the most value in return for you.

 

Creating pages for each variable value

The simplest way to request pages. You essentially just request a page for each value of a variable, and this gets combined with a top locations list.

These pages act exactly like a normal ‘pretty URL’ filter, except they have the top locations filter applied.

ie you could say you want a “<bedrooms> bedroom properties in <location>” page.

This would then replace <bedrooms> with every variation in your data, and <location> with a top set of defined locations that you want the page for.
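A quick sketch of that expansion. The <bedrooms> template, values, and locations are just the example from above;

```javascript
// Expand one requested template across every variable value and top location.
function expandCustomPages(template, variable, values, locations) {
  const pages = [];
  for (const value of values) {
    for (const location of locations) {
      pages.push(
        template
          .split("<" + variable + ">").join(value)
          .split("<location>").join(location)
      );
    }
  }
  return pages;
}

const pages = expandCustomPages(
  "<bedrooms> bedroom properties in <location>",
  "bedrooms",
  ["1", "2", "3"],
  ["Sydney", "Melbourne"]
);
```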

 

Creating pages for specific variable values

The other method here of creating pages is to specify the actual pages you’d like to create along with the inclusion of <location>, which will be the only dynamic variable here.

ie you’d request “1 bedroom properties in <location>” and “2 bedroom properties in <location>”.

This gives you more control, as you could just use the selected values from certain variables. Great when you have many values for a variable, but only want pages for some.

You will, however, need to create hundreds of these rows for each specific rule you want, so it’s not quite as neat to manage it all, but you get more control.

 

Structuring your custom pages

Ideally, but not 100% required if completely unavoidable, these pages should sit within the existing site structure, and not within a separate site section (like /custom/ or something more appropriate).

You should be trying to fit them in neatly.

So something like;

/<channel>/<custom>/

/<channel>/<location>/<custom>/

/<channel>/<location>/<propertyType>/<custom>/

You might even be asked to pick one structure, and stick with it, in which case try and pick the one that would make the absolute most sense.

Sometimes that means just nestling it within the channel structure, but ideally for this, sometimes it would be within the location structure.

Get the URL in as deep as makes sense, whilst still keeping it clean.

 

Custom filter pages fill the gap

These custom filter pages you can create will help fill the gaps between programmatic SEO & content creation.

You get more control with less chance of over-indexation, creating the perfect system in between.

HTML Sitemaps – It’s 2022, do you REALLY Need One?

HTML Sitemaps – It’s 2022, do you REALLY Need One?

The old HTML sitemaps hey.

Are you still actively trying to create one for large-scale programmatic builds?

Crawling? Indexing? Deeplinking? Passing SEO value?

Yeah, they can help you shortcut passing values and reduce click depth to deep links.

So can a good, and more natural, internal linking structure though, especially once some prioritised links are included.

I am not talking about a few links to the top sections of your site. They might still be valid, particularly for smaller sites.

I am talking about the endless page(s) of links, that basically replicate an XML sitemap.

 

What is an HTML sitemap?

An HTML sitemap is one of those pages you’ll normally see linked in site footers, under an anchor of “Sitemap”.

Basically just another way of being able to tell Google about the different links on your site, particularly for larger sites.

Google recommended a “user viewable site map” back in 2010 and this is really where HTML sitemaps stem from.

In their latest SEO start guide here, they recommend the following;

Create a navigational page for users, a sitemap for search engines

Include a simple navigational page for your entire site (or the most important pages, if you have hundreds or thousands) for users. Create an XML sitemap file to ensure that search engines discover the new and updated pages on your site, listing all relevant URLs together with their primary content’s last modified dates.

Avoid:

  • Letting your navigational page become out of date with broken links.
  • Creating a navigational page that simply lists pages without organizing them, for example by subject.

 

Why you should avoid using an HTML sitemap

HTML sitemaps are just a shortcut, to a proper internal linking implementation.

A friendly header menu, and friendly overall site navigation, should be covering what an HTML sitemap would.

Do you think Google is really going to care what links you’ve got on a page when it’s just full of links?

It’s like those automated link swaps from 2008 where you’d add a page on your site, that would automatically update with links to every other site in the network. They were killed off a while ago.

Sometimes these sitemaps are broken up into pages and could have an H1 detailing what the links are about.

ie, “real estate in Sydney” could be a page full of links related to Sydney, and the suburbs of Sydney.

This can cause duplicate keyword targeting, for a page that’s purely made of links.

Yes, I have seen these types of pages indexed.

They’re also another area on the website where you need to be mindful of what pages you’re linking to.

Particularly when it comes to 0-result SRP pages.

Ensuring that links are always updated, with no error pages etc.

If the main purpose is deep linking from the HTML sitemap, then one of Google’s ‘avoids’ from the SEO starter guide comes into play;

Creating a navigational page that simply lists pages without organizing them, for example by subject.

So it kind of writes the deep linking off, as many programmatic sites will try and jam as many links in here as possible.

A natural linking setup should offer a much better solution to this.

 

Why you would still create an HTML sitemap

There is only one real reason that I see a full HTML sitemap as valid for larger builds.

Tech limitations.

No matter what ideas you have, the strategies you want to implement, the optimisations you want to make, tech limitations will find you.

And they will cut your dreams off.

Sometimes you can’t do what you want, either due to actual tech talent available, or you’re up against a product manager that doesn’t want to hear the magical 3 letters – S E O.

Apart from fighting harder, and/or going around the product manager, you sometimes have to do hacky SEO.

In this case, an HTML sitemap should be seen as hacky SEO.

Not the best, but it’ll get the job done at the start until a more viable solution can be put in place.

You’re just filling in the gaps where the tech can’t properly do a good link structure.

If you can do a proper link structure, however, there should be no need for a built-out HTML sitemap.

Google should be naturally crawling the site, going from page to page, discovering the links with context.

Not just discovering everything on a page with 500 other links.

 

How to deprecate your HTML sitemap

The first step is ensuring you’ve properly implemented internal links – this is crucial.

Links to parent pages, child pages, and some cross-links need to be across the site.

Ideally, along with some “priority” links, ensure your top pages / most competitive pages have links from some pages further up in the hierarchy.

After that, you’ll need to redirect the sitemap URLs. You could 404 / 410 if you want, but I prefer to redirect to related URLs where possible to help retain any value.

If you had a paginated HTML sitemap, particularly one that included keywords like my earlier example, then try and redirect these as best possible to an appropriate page.

Treat this deprecation like a migration, where you want to redirect to the most similar page possible.

Do you need to deprecate it though? I’ve had clients where I won’t bother recommending it.

Particularly when there are things that are more valuable that could be done.

I will, however, recommend killing it off as soon as I see it become an actual issue.

Just remember, by removing something like this you need to make sure you’ve properly covered internal links. Even though this sort of setup isn’t ideal, you could be creating a tonne of orphan pages if you’re not set up properly before removal.

 

Some may disagree

I know there are SEOs that will disagree with this, but I haven’t seen an HTML sitemap do anything useful in the last 5+ years.

It’s time to cut them from site builds.

Death of Internal Nofollow: Use JS to Stop Google Crawling

Death of Internal Nofollow: Use JS to Stop Google Crawling

Over time, I have noticed nofollow internal links being crawled more and more.

It doesn’t seem as easy as it used to be to slap a nofollow tag on a link and have it keep the URL from being crawled & indexed.

Google’s even made a few comments, and updated guidelines on this in recent times, with the biggest comment being here;

When nofollow was introduced, Google would not count any link marked this way as a signal to use within our search algorithms. This has now changed. All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.

Nofollow is now a hint.

Yeap, just a little hint. It does not stop Google like it used to.

So yeah, it might be worth still slapping the tag on a link, but if it’s not 100% full enforcement and you’re linking to a million variations of a page… you’re in for some over-indexation fun.

Why do you want to block Google anyway?

Filter links.

Filter pages can create millions of crawlable URLs out of sites that should only have tens, or hundreds of thousands of pages.

The canonical tag just doesn’t work like it used to, and even with it you’re also blasting through your crawl budget on these wasted pages.

 

Alternatives to using nofollow on internal links

The alternative to using nofollow, is actually doing what most SEOs would never normally want to do.

Implement client-side JS to drive these links.

You want to actually use onclick JS for these links.

Words you will rarely hear from an SEO, because it goes against everything that’s good to help SEO, unless you want to unSEO.

Specifically, onclick JS links that don’t have the URL available in the rendered HTML source code. You gotta use a separate file to include it.

Small but significant detail.

These onclick JS links then won’t render the link in the HTML code, leading the link to be hidden from Google due to the way Googlebot crawls.

It grabs the links, and then crawls. It doesn’t “click” links.

Every link that currently exposes the URL in the HTML source, that you won’t want to be crawled, should be changed to an onclick link that only renders client-side once it’s clicked.

These onclick links are essentially invisible to Google, but have no impact and appear as normal, to users.
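Here’s a minimal sketch of the pattern. The IDs, map file, and markup are all assumptions for illustration; the key detail is that the URLs live in a separate JS file, not in the rendered HTML;

```javascript
// urls.js - loaded as a separate file, so these URLs never appear
// in the rendered HTML source that Googlebot parses for links.
const filterUrls = {
  "bedrooms-3": "/buy/sydney/apartments/?bedrooms=3",
  "price-500k": "/buy/sydney/apartments/?priceBetween=0-500000",
};

function resolveFilterUrl(linkId) {
  return filterUrls[linkId] ?? null;
}

// In the page, the "link" is just an element with a data attribute, no href:
//   <span class="filter-link" data-link-id="bedrooms-3">3 bedrooms</span>
// and a click handler navigates client-side:
//   document.querySelectorAll(".filter-link").forEach(el =>
//     el.addEventListener("click", () =>
//       window.location.assign(resolveFilterUrl(el.dataset.linkId))));
```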

If however, you do an onclick link where the URL is still available in the HTML source code, Google will be able to crawl that.

 

Why a robots.txt block or a meta robots noindex/nofollow tag might not be the solution

Blocking the pages in the robots.txt file, or using a noindex/nofollow tag might be the ultimate solution for some.

However, there are a couple of things still in play here that need to be considered.

Actively linking to a page that is blocked in the robots.txt is like telling Google something cool is behind this wall, but they’re not allowed to look.

Actively linking to a page with a meta robots noindex/nofollow tag is telling Google about your cool URL, letting them access it, and then telling them it’s not that important, even though you’re linking to it. You’re wasting Google’s time, crawl budget, and your server resources in allowing Google to crawl it.

In both cases, you’re diluting a page’s value by linking through to these pages you don’t actually want to be ranked or crawled.

John Mueller has also made a comment regarding robots.txt blocks of parametered pages, so it’s worth keeping this in mind:

 

Should you still use nofollow on internal links?

Yeah, there is no reason to remove it if you actually can’t implement an alternate method.

But if you’re implementing an alternate linking method to ensure Google doesn’t crawl it, the nofollow tag kind of gets nulled because Google won’t see the link anyway.

 

The internal nofollow is dead

The internal no-follow is dead.

Now is the time to switch to JS links to prevent the crawling of pages you don’t want to be crawled.

Programmatic SEO: Search Filters & Faceted Search Optimisation

Programmatic SEO: Search Filters & Faceted Search Optimisation

You can make or break a site, just based on how you handle your search filters.

The difference between a good & bad search filter setup can mean the difference between a good amount of quality results that are indexable, and millions upon millions of crawlable & indexable URLs that Google will ignore the majority of.

Let’s take a look through filtered search optimisation for programmatic SEO builds.

The two types of search filters

One thing I always try to define for a client is the two types of filters.

Those you want a pretty URL for, and those you don’t.

You shouldn’t be creating a pretty, completely indexable, URL for every filter combination ever.

You need to define a set of filters that will allow you to target 90% of searches, with a small portion of the URLs.

I was going to use the 80/20 rule here, but 20% of the URLs would still be way too many.

You would be creating millions upon millions of crawlable URLs to target the remainder.

I go further into this during the post.

 

The best URL structure for search filters

When creating search filter URLs, you have to keep in mind the structure of the website.

You should be parenting the filtered content, below its parent page.

So for real estate, if you have an ‘apartments for sale in Sydney’ page, the filters would be;

Property Type: Apartments

Channel: For Sale

Location: Sydney

How you’re structuring the website will control the order in which you use each of the different filters, along with how you use them.

Some filters may be a part of the pretty URL whereas some other filters will be query parameters that are just tacked on at the end.

For my clients with the above filters, I would be recommending the following structure;

/<channel>/<location>/<propertyType>/

The reasons behind this choice will be documented separately, however, with that structure in place we know how we should handle this type of URL.

/buy/sydney/apartments/

If we then add a pricing filter, or a bedrooms filter, the URL would change to something similar to the below;

/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3

So there is a clear separation between the ‘pretty’ portion of the URL and the ugly query parameters.

 

How to handle URLs for multi-select filters

Multi-select filters can lead to issues, particularly when the multi-select filter is a part of a pretty URL.

Let’s say you have a multi-select filter of property type, with apartments & houses as an option.

If a consumer selects both of them, you want to make sure that both apartments & houses don’t end up in the URL.

You don’t want to end up with /buy/sydney/apartments/houses/ or /buy/sydney/apartments-houses/.

Whilst you could handle this where you prioritise one, and then query parameter the other, I prefer a simpler solution.

When 2 or more options of a ‘pretty URL’ filter are selected, use them both in a query parameter rather than pretty URL.

ie domain.com/<channel>/<location>/?propertyType=type1-type2

This gets the parameters stripped in the canonical tag, and just ensures you don’t get any issues arising from duplicate targeting or the creation of additional pretty URLs.
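As a sketch, the selection logic could look something like this. The channel/location/type names are just the earlier examples;

```javascript
// Pretty URL for one property type, query parameter for two or more.
function buildSearchUrl(channel, location, propertyTypes) {
  const base = `/${channel}/${location}/`;
  if (propertyTypes.length === 1) {
    return base + propertyTypes[0] + "/";                     // pretty, indexable URL
  }
  if (propertyTypes.length > 1) {
    return base + "?propertyType=" + propertyTypes.join("-"); // parameter only
  }
  return base;
}
```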

 

How to handle internal links to filtered search results

This is one of the main causes of indexation issues, due to the fact that internal links hold such weight with Google.

It can also lead to one of the easiest ways for a larger-scale website to make improvements, that can actually move the needle in the rankings.

 

The URLs you should be linking to

The quick summary here is if it has a pretty URL, at least 1 result, and relates to the current page, then the page should definitely be linked to.

So you link to all the filters of the current page, that have a pretty URL & and at least 1 result.

I cover the two different types of internal links for programmatic sites in a little more detail here.

You should avoid actively linking to any query parameter filter.

 

Why crawlable links to non-pretty URLs, or 0-result SRPs, should be avoided

Each crawlable link acts as a vote for a URL in Google’s eyes. If you’re constantly ‘voting’ on poorly filtered URLs, or 0-result SRPs, Google is going to place more weight on these URLs than what they’re worth.

This will lead to crawling and indexation issues, particularly when it comes to prioritising specific URLs above each other.

So on top of your pretty URL filters of Channel, Property Type, and Location, you could have a significant quantity of other filters available, including but not limited to;

Bedrooms

Bathrooms

Car Spaces

Price

Floor Area

Land Area

Nearby Amenities

New / Established

Property Features

The list quickly builds up.

As an example, let’s say that each of the 9 filters had 10 options available to filter by.

You’re going to have 90 filterable versions of a URL, all with query parameters, that are just filtered views of the primary page.

Each of these is significantly re-using the primary results, and Google will crawl and see this on each page.

How many of those filters are actually going to have search volume?

A small handful might, for some top-level locations.

On top of this, these 90 filtered versions of the URL, would then link to the other 89 versions of that URL, with that additional filter applied.

You’d create 8,010 versions of a single URL, with just 9 filters and 10 options for each.

But then if you have 2 channels, 5,000 locations, and 5 property types, you’d have 50,000 pages, with 8,010 versions each, giving you a lovely 400,500,000 URL combinations that will be crawlable.
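The arithmetic behind those numbers, for the sceptics:

```python
filters, options = 9, 10
filtered_views = filters * options                 # 90 filtered versions of one page
# Each filtered view links to the other 89 views with one more filter applied:
crawlable_versions = filtered_views * (filtered_views - 1)  # 8,010 per page
pages = 2 * 5000 * 5                               # channels x locations x property types
total = pages * crawlable_versions

print(filtered_views, crawlable_versions, pages, total)
# 90 8010 50000 400500000
```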

Google would have a field day.

Do you think “new 3 bedroom home with 2 bathrooms and 2 car spaces under $450000 with swimming pool and balcony” gets enough search volume to warrant the page being linked to?

There are ways you can optimise the pretty URLs to capture a large portion of these super long-tail queries.

Yeah, you won’t capture it all. But you also won’t need to create 400 million pages to ensure that you do.

This is obviously a worst-case scenario, should you have no limits in place. But yes, I have seen it multiple times.

 

How you should link to filtered URLs

It’s not just what pages you link to, but how you link to them.

You need to keep SEO and the user experience in mind when linking, as you obviously still want users to filter by price, bedrooms, and an array of other filters that will help them find the content they’re after.

Links will be broken down into your ‘SEO links’ that should be server-side and in the source HTML, and then other links/interactions that should not be crawlable, and should only be available via client-side JS / onclick event links.

You’ll probably have some sort of filtering widget, normally in the sidebar, that contains every filter available for the set of results.

Provided these filters don’t expose any links in the HTML source of the page, this widget can be left completely untouched, free for the designers to toy with as required. Don’t fight design on this, we can get more value from some slightly separate SEO links :)

However, if these filters expose links in the HTML (dropdown filters sometimes do this), then you’ll ideally need to get these switched to onclick event / client-side JavaScript links, to ensure they’re not crawlable/as easily crawled. You can read more about how these JS links work here.

The other links, the pages you actively want to link to, should be added to a separate little widget under the filters under a title like “popular locations”, or “popular property types”.

You can also throw them in a nice footer widget, but a sidebar link might have more value, so could be preferred.

They just need to be exposed in the HTML source, unlike the other non-pretty, parametered URL links.

 

Blocking parameter filters in the robots.txt

This is something many SEOs will do to avoid crawling & indexation of query parameters.

It’s a viable strategy for newly launched sites to prevent issues, however, existing sites need to keep a few things in mind.

Personally, I prefer to try alternative methods of patching crawling/indexation issues that can be attributed to parametered pages.

I will try and lower their value, by removing links pointing in, and then hope that the canonical tag takes over and does what it’s supposed to.

 

1. Are the primary, clean URLs indexed?

If you’ve actively linked to parameter-filtered URLs of SRPs, then you may be blocking the only indexed version of a page.

Google might not have the primary page for those search results, indexed.

While yeah, Google will eventually index the new URL, you might be temporarily killing a chunk of your traffic.

 

2. Do the parametered URLs have links coming in?

If the URLs filtered with parameters have links pointing in, you could be culling any value they have. Google will no longer look at the canonical tag and assign any weight to the parent, non-parametered, URL.

Double-check this, and make sure you’re not about to potentially remove the value these links would have passed.

 

Google says no

All in all, there’s this.

So keep that in mind.

 

Handling canonical tags of search filters

Pretty URL paths get included in the canonical, along with the pagination parameter.

All non-page query parameters should get stripped.

That will help pass your SEO value around, ensure minimal over-indexation, and try to keep Googlebot in check if it discovers these filter URLs.

Let’s say you have a URL of example.com/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3

The canonical tag I would be recommending is;

<link rel="canonical" href="https://example.com/buy/sydney/apartments/" />

Canonicals are seeming more and more like a suggestion though, rather than a rule, so just keep in mind that they, unfortunately, don’t work as well as they used to for controlling indexation.
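A minimal sketch of that canonical logic, assuming pagination lives in a `page` parameter — swap in your own parameter names:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

def canonical_url(url):
    """Strip every query parameter except 'page' for the canonical tag."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k == "page"]
    query = f"?{urlencode(kept)}" if kept else ""
    return f"{parts.scheme}://{parts.netloc}{parts.path}{query}"

print(canonical_url(
    "https://example.com/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3"
))
# https://example.com/buy/sydney/apartments/
print(canonical_url("https://example.com/buy/sydney/?page=2&bedrooms=3"))
# https://example.com/buy/sydney/?page=2
```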

 

Common mistakes in search filter handling

Here are the common mistakes I see with clients when it comes to handling search filters.

 

Ordering of filters not handled leading to a duplicate page for each possible combination

Having both ?priceBetween=500000-1000000&bedrooms=3 and ?bedrooms=3&priceBetween=500000-1000000 crawlable will lead to crawling & indexation issues. Ensure these alternate orders can never be crawled, with all internal filters & links containing the correct order, or have a way to detect them and 301 redirect to the primary order.

Even though you’ll have canonical tags attempting to clean this up, they don’t always work, so you should at least attempt to patch this before it becomes an issue.
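One way to patch it is a small normalisation step that 301s any out-of-order query string to a single site-wide order. A sketch — the filter names and their order are illustrative:

```python
from urllib.parse import parse_qsl, urlencode

# Assumed site-wide filter order; anything deviating gets a 301.
FILTER_ORDER = ["propertyType", "priceBetween", "bedrooms"]

def normalised_query(query):
    params = dict(parse_qsl(query))
    ordered = [(k, params[k]) for k in FILTER_ORDER if k in params]
    return urlencode(ordered)

def redirect_target(path, query):
    """Return a 301 target if the parameter order is wrong, else None."""
    clean = normalised_query(query)
    if clean != query:
        return f"{path}?{clean}" if clean else path
    return None

print(redirect_target("/buy/sydney/", "bedrooms=3&priceBetween=500000-1000000"))
# /buy/sydney/?priceBetween=500000-1000000&bedrooms=3
```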

If you’re 100% definitely not actively linking to any of these query parameter pages, and there is absolutely no chance of links coming through, then you’ll be fine.

But, well… bugs happen. Just keep that in mind.

 

Search filters are available as both a pretty URL and a query parameter

When you have both /buy/sydney/apartments/ and /buy/sydney/?propertyType=apartments crawlable and indexable, you’re duplicating and devaluing some of this content. You need to make sure the parameter redirects to the pretty URL version (unless multiple values exist).

 

Multi-select filters adding both selections to a pretty URL

When URLs are handled like /buy/sydney/apartments/houses/ or /buy/sydney/apartments-houses/, my recommended way of handling this is moving both to a query parameter version, like /<channel>/<location>/?propertyType=type1-type2.

 

Search filters included on listing/product links

This is one I have now seen a few times, and it caused massive indexation issues. The query parameters were included on the links to the listings. Google then not only crawled them all, but indexed many of them too, and plenty were ranking. Ranking with search filter query parameter URLs, on pages that didn’t do anything with them.

I’ve also recently learnt that this is a Shopify default… which is weird.

You can read here how to fix it, but the collections part of the URL is included in the product URL they link to. This is stripped with a canonical, but what a mess this can cause for Google!

And yes, I am looking at a site right now, that does this, and the collection’s version of a product URL is ranking.

 

Non ‘pretty’ search filters are included in the canonical

If you don’t want it to rank, don’t include it in the canonical tag. Many sites still include these parameters, which generate thin & heavily re-used content, in the canonical.

To avoid issues, these thin parameter filters should be stripped, passing their value/indexing back up the chain to their primary results URL.

 

Results not loaded server-side

There are plenty of site builds where the majority of the site is server-side rendered, but then their entire search is client-side. This dramatically impacts indexation & crawling of the search, which is critical. The entire page, including the search results, should be loaded server-side, not just the overall template.

 

Cleaning up over-indexation of search filters

If you’ve exposed too many search filters, and need to clean them up, then you will need to undertake over-indexation clean-up… which is a whole separate topic.

You can read more about over-indexation clean-up for programmatic builds here.

 

Extra: How can you target the higher-value low-tier non-pretty URL filters

Well, that’s a mouthful.

Something I will cover in a bit more detail later, but yes. There are super long tail filter variations that have value, and are worth targeting.

But you need to be careful with this, as it is extremely easy to create 100s of thousands of pages, to just target a handful of keywords.

You want to target the top 80% of each of these types of keywords.

The ideal scenario to handle these is a set of custom rules.

These rules should allow you to map out what filter combinations could be used.

A way you can select 100 key locations, out of the 5,000 you have in your data set.

A way to select just 2 property types, from the 10 in your data set.

A way for you to set the price filter to under 300,000.

The system could then spit out the combinations of those 100 key locations and 2 property types, to create pages for “<property type> for sale in <location> under 300,000”.

You’ll get 200 combo pages, which will have 80% of the total volume, rather than 50,000.
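The rules engine itself can be tiny. A sketch of the combination step, with placeholder location names and a hypothetical URL pattern:

```python
# Hypothetical curated lists: 100 key locations (out of 5,000 in the data
# set) and 2 property types (out of 10), with price fixed at under 300,000.
top_locations = [f"location-{i}" for i in range(100)]
property_types = ["houses", "apartments"]

combo_pages = [
    f"/buy/{location}/{ptype}/under-300000/"
    for location in top_locations
    for ptype in property_types
]

print(len(combo_pages))  # 200 pages, instead of 5,000 x 10 = 50,000
```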

 

Top tier strat: Automatically targeting these higher value but low tier filters

Recently, I have started recommending a new approach.

Three tiers of filters (though this could be expanded), deciding which get pretty URLs.

Top tier – Every value gets a pretty URL, at all levels

Mid-tier – Every value gets a pretty URL, but only for top-tier locations

Low tier – Always in a query parameter.

This middle tier is new to the way I am pushing clients to handle URLs, and offers a simpler approach to landing page automation.

Let’s go back to the bedrooms and pricing examples.

Rather than creating “Properties under <price> in <location>”, and “Properties with <bedrooms> bedrooms in <location>” for every single location in your database (5,000+), you could filter this to just a set of top locations.

For price, you might have 15 values. Instead of creating 15 x 5,000 pages, you could create 15 x 100 pages.

For bedrooms, you might have 6 values. Instead of creating 6 x 5,000 pages, you could create 6 x 100 pages.

You might want more pages than just the ‘top tier’ locations for some filters though. Bedrooms are a good example of this. It might not be worth creating 5,000 locations worth of bedrooms pages, but more than 100 locations could be warranted.

To solve this you could add an extra tier into what I discussed above, where you create a set of ‘mid-tier’ locations, to complement top-tier locations, and use this mid-tier list for some filters.

This mid-tier location list could be 500 top locations as an example.

That way, bedrooms could be 6 variations x 500 locations, rather than just the 100 top locations.
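The tier rules could be sketched as a simple lookup. The tier assignments and location lists here are hypothetical — yours would come from your own data:

```python
# Hypothetical tier lists (~100 and ~500 locations in practice).
TOP_LOCATIONS = {"sydney", "melbourne"}
MID_LOCATIONS = TOP_LOCATIONS | {"newcastle", "wollongong"}

# Which tier each filter sits in; unknown filters default to low.
FILTER_TIERS = {
    "propertyType": "top",   # pretty URL for every location
    "bedrooms": "mid",       # pretty URL for top/mid-tier locations only
    "floorArea": "low",      # always a query parameter
}

def use_pretty_url(filter_name, location):
    tier = FILTER_TIERS.get(filter_name, "low")
    if tier == "top":
        return True
    if tier == "mid":
        return location in MID_LOCATIONS
    return False

print(use_pretty_url("propertyType", "tiny-town"))  # True
print(use_pretty_url("bedrooms", "newcastle"))      # True
print(use_pretty_url("bedrooms", "tiny-town"))      # False
```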

 

Faceted search vs search filters

This is where people lose me a bit.

UX people/designers etc seem to say filters are basically a single selection to refine the data, and “Faceted search” is about multiple selections. So multi-select filters are “faceted search”?

Filters do the same though? They filter the results.

Seems weird, but oh well. They’re the same to me, just ‘multi-select’ filters…

So yeah, I’m only talking about faceted because that’s what others call it, and what you could be searching for.

This is still what you’re after, I swear.

It’s all the same for this.

 

Keep your filters in check

Google is dumb. You might think they’re smart, but in reality, they’re just a system that needs direction. The more you give them, the better they’ll be able to crawl, index, and pass value around your site.

Know what indexable and crawlable URLs you’re creating, keep them tamed, and you will be rewarded with that lovely SEO traffic in the long term.

If you haven’t already, make sure you check out my programmatic SEO checklist to help you tick off the build.

Programmatic SEO: Handling Search Result Pagination

Programmatic SEO: Handling Search Result Pagination

Pagination is something that should be easy, but when it goes wrong, it can really go wrong, and impact crawling & indexation.

It’s one of the first things I go to clean up, as it’s a great way to help reduce the overall crawlable pages of a site, particularly one that’s wasting its crawl budget.

Folders or query parameters for pagination?

Provided there are no tech issues that make it not possible, I recommend that query parameters are used for pagination.

?page=x is just such a clear signal to Google that it’s page-related, and I’m all about making things clear for a robot.

 

Are Rel next & Rel Prev still required tags?

No, you do not need to use the rel=next & rel=prev tags anymore for Google.

One of the Google engineers did a presentation in Australia, and mentioned how this conversation about the tags went.

Mueller walked in one day and essentially said “You know we’re not using these tags anymore?”, they looked at each other, and then the tags were deprecated.

Someone had accidentally/unknowingly removed them from being checked, no one noticed, so they dropped their use entirely.

They are still getting used by other search engines, so if Google isn’t the primary in your market, you should definitely still include the tags.

 

SEO Pagination best practices

Pagination best practices are pretty simple at the core.

  • Always link to the first page
  • Link to the next couple of pages, and a couple of previous pages
  • Use a clean parameter where possible, with no parameter for the first page
  • Ensure the pagination query parameter is in the canonical tag
  • Use rel next/prev tags if you want / if they’re easy enough
  • Don’t link to the final page in a series from the early pages
  • Limit page counts where possible
  • 301 redirect any page counts outside of available range back to the first page
  • Include a decent number of results per page, 20-30, rather than many pages of fewer results
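A few of those boxes can be ticked in one small helper. A sketch, assuming a `?page=` parameter with no parameter on the first page:

```python
def pagination_links(current, last, window=2):
    """Pages to link from the current page: always page 1, a couple back,
    a couple forward - and never a shortcut to the final page."""
    pages = {1}
    pages.update(
        p for p in range(current - window, current + window + 1)
        if 1 <= p <= last
    )
    return sorted(pages)

def page_url(base, page):
    # No ?page=1 on links back to the first page.
    return base if page == 1 else f"{base}?page={page}"

print(pagination_links(10, 345))    # [1, 8, 9, 10, 11, 12]
print(page_url("/buy/sydney/", 1))  # /buy/sydney/
print(page_url("/buy/sydney/", 3))  # /buy/sydney/?page=3
```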

 

Common SEO issues with pagination

There are a handful of issues I tend to look at when auditing a pagination setup.

 

1. Page 1 (the default URL) has a query parameter on internal links

Some default e-commerce setups will include ?page=1 parameter on the end of links back to the first page.

By default though, this first page has no query parameter.

So you then have the following created as duplicate pages;

https://www.site.com/category/

https://www.site.com/category/?page=1

Exactly the same page, with exactly the same content, yet two different URLs.

Even with ?page=1 being stripped in the canonical, you’re actively linking to an alternate version of a page, confusing Google.

John Mueller made a comment re: UTM tags on internal links, but the comment directly applies to this scenario too.

What you’re asking Google to rank, and what you’re linking to, are separate pieces of content.

Why confuse Google?

Just ensure that any links back to page 1 exactly match what you’re expecting, which would be without the query parameter.

 

2. Canonical tag doesn’t include the query parameter

Canonical tags must include the pagination query parameter.

Google even mentions this on their ‘common canonical mistakes’ blog post here.

 

3. The final page is linked to from the first page

Linking to the final page is what every single pagination linkset seems to do. Ever.

Why would you link to the end, when the most important results are on the first set of pages?

I always recommend removing the link to the final page in a set, and instead ensuring the first few links are available.

If Google wants to crawl all the way to page 345 it can, naturally, going page by page in the order of priority you have set.

 

4. Not including enough results per page

Some websites won’t use their content to its full potential. Instead, they’ll limit each page to 10 results, and just have more pages.

Bring some of that hidden content forward, to the first page.

Include more results, at least 20, maybe even 30 nowadays, and make your primary page for the search result stronger.

 

5. No limits on pagination

The final one here is allowing as many pages of results to be created as possible.

Will a user ever click through to page 345?

I highly doubt it.

But a scraper will, and they’ll take all your data.

You’ll also be re-using the same set of results, over and over across the site.

By limiting the amount of pagination, you severely limit the reuse of the same listings.

IMO, this then gives the listings more value for when they’re used in the top couple pages.

Limit the pagination to what you feel is reasonable, maybe page 20, or page 50 tops, and you can start to craft crawl behavior a little more.

With pagination limited, you should 301 redirect any page outside this range back to the first page.

That will help with the initial URL cull, and will also help clean up URLs when page counts drop due to fewer results.
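A sketch of that redirect rule, with an assumed hard cap and page size:

```python
MAX_PAGE = 50   # assumed hard pagination cap
PER_PAGE = 25   # assumed results per page

def resolve_page(requested_page, result_count):
    """301 any out-of-range page number back to page 1."""
    last = min(MAX_PAGE, max(1, -(-result_count // PER_PAGE)))  # ceil division
    if requested_page < 1 or requested_page > last:
        return {"status": 301, "page": 1}
    return {"status": 200, "page": requested_page}

print(resolve_page(345, result_count=800))  # {'status': 301, 'page': 1}
print(resolve_page(3, result_count=800))    # {'status': 200, 'page': 3}
```

When listings drop off and the page count shrinks, stale deep-page URLs automatically fall into the 301 branch rather than serving empty results.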

 

 

Is infinite scroll or pagination better for SEO?

Well, there are ways to make infinite scroll work. If you really want it.

However, pagination is so much easier to not only implement but to monitor and patch issues with, so it is my preferred go-to option should I be given the choice.

 

Handling pagination is easy

Handling pagination cleanly and efficiently is actually pretty easy, once you tick a few boxes and set some limits.