Category: Technical SEO

Subdomain to Subfolder: The Simple Cloudflare Reverse Proxy

For years, there has been a subdomain versus subfolder debate.

One side says this, another side says that.

People have run their own tests and shown that moving a blog from a subdomain to a subfolder can improve its rankings, yet so many have been in denial because Google has said the opposite.

Well, Aleyda Solis came out with direct test results, and a lot more people have finally jumped on the subfolder bandwagon.

Most of the time though, running a blog off a subfolder instead of a subdomain isn’t technically feasible.

It’s a pain to set up.

It was a shocking process the last time I tried to do it with a larger site 4/5 years ago.

However, lucky for us things have changed over the last couple of years.

If you use Cloudflare, you can now have a blog installed on a subdomain, yet force it to load, and make it look like it exists, under a subfolder.

 

Aleyda’s test

Kicking it off here, this is the tweet that stirred things up again.

Clearly highlighting the tech challenges she went through to get it going, Aleyda shows there was definitive growth following the migration to a subfolder.

An almost instant improvement.

There are plenty of caveats that could exist here, as it’s SEO after all… but once the migration was complete you can see a nice upward trend.

In some projects like this, I have seen what I dub a ‘new shiny’ effect. New URLs get a little boost when discovered/migrated to, and then sometimes drop off a bit afterwards.

I reached out for an update to see if there was such a dip, and Aleyda was kind enough to provide a new graph;

No post-launch dip! Nice.

 

Reverse proxies

Before we get into the different methods of setting this up, you need to understand reverse proxies.

Well, the super basics of them anyway… which is about where my knowledge of them stops, lol.

A reverse proxy allows you to essentially mask your website’s true file location, and load it somewhere else.

Cloudflare, in general, acts as a reverse proxy by being a CDN. It masks your server’s true location by forwarding requests through a Cloudflare ‘middle-man’.

But this runs as URL in, URL out, where the URL is the same before and after the request and only the IP of the content is modified.

We can tweak and override this a bit, so that the URL of the request differs from the URL of the content location.

So you could have a blog installed on the subdomain, and leave it there, but make it act and look like it’s actually installed on a subfolder.

A request will come through to https://domain.com/blog/ and then the reverse proxy will grab the content available at https://blog.domain.com, and then load it as if it actually exists at https://domain.com/blog/.

It allows you to bypass the biggest tech issue of running a blog on a subfolder, which is managing multiple different technologies installed in the same location.

Well, something along those lines anyway.

 

Subdomain to Subfolder with Nginx

Two previous blog moves from subdomain to subfolder that I have helped with involved Nginx reverse proxying.

Nginx is a server technology that does server things, but one thing it does is route traffic. You can give it filters or rules, and tell it to send specific requests one way, or another.

It’s like a little middleman that can move your site’s traffic around.

Using this, you tell it to reverse proxy your subfolder requests, to your subdomain, and have it look like everything loads under the subfolder.

Nginx has mostly been a more enterprise-level setup, so there’s a good chance you might not be using it.

If you have Nginx installed, here is a detailed guide on using it as a reverse proxy.
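To give a rough feel for what’s involved, the core of it is a location block that proxies the subfolder through to the subdomain. A minimal sketch only, using placeholder hostnames, and your real setup will likely want extra header and caching tweaks on top;

# Inside the server block for domain.com
location /blog/ {
    # Pass /blog/ requests through to the blog installed on the subdomain
    proxy_pass https://blog.domain.com/;
    proxy_set_header Host blog.domain.com;
    proxy_set_header X-Forwarded-Proto $scheme;
}

The trailing slash on the proxy_pass URL is what strips the /blog/ prefix before the request hits the subdomain.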

 

Subdomain to subfolder with Apache/.htaccess

Similar to Nginx, Apache is another server technology that can route requests in much the same way.

In particular, its .htaccess file allows you to set this sort of thing up.

If you’re on a typical web hosting setup, this is most likely what you’ve got going.

To jump into a subfolder migration with the .htaccess, you can find a detailed guide here.
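For a rough idea of the shape of it, the key piece is a proxying rewrite rule. A quick sketch only, with placeholder hostnames, and assuming mod_rewrite and mod_proxy are available on your host (plenty of shared hosts don’t allow the proxy flag, which is part of why I now lean on Cloudflare);

# In the .htaccess for domain.com
RewriteEngine On
# Proxy /blog/ requests through to the blog installed on the subdomain
RewriteRule ^blog/(.*)$ https://blog.domain.com/$1 [P,L]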

 

Subdomain to subfolder with Cloudflare

Now, this is my new favourite.

Why?

Because I can do it without any tech involvement, and it is independent of any other server tech.

And in under 2 minutes! Pretty sweet.

No messing around, it’s magical.

Many will also run Cloudflare before Nginx / Apache is hit, so it will work across both and be a bit more flexible.

Today, I will show you how you can do it too.

 

How to set up a reverse proxy with Cloudflare

It first started with a guide I found from 403.ie here.

There were a few others floating around, but this was the best one I could find to match the specific requirement of reverse proxying content from a subdomain to a subfolder leveraging Cloudflare.

Unfortunately, it didn’t work for me. It was close, but the WordPress side of things kept failing.

It took a few goes to work out whether it was the server (Siteground has some fun caching :/) or whether it was the Cloudflare setup.

I tried modifying the DB directly, tweaking WordPress config scripts, and many other changes.

Missing CSS files, bad redirects, and constant server errors. Every time I patched something, something else would break.

I gave up, and called in some dev support.

A dev named Dat came through, and sorted me out.

 

Steps to set up the reverse proxy

The following instructions will help you get your reverse proxy set up across both Cloudflare and WordPress.

1. Create the Cloudflare workers
  • Log into your Cloudflare account, but don’t load up any of the sites, and you’ll see the ‘Workers & Pages’ setting option;

  • Jump in here, and click ‘create application’

  • Then find ‘create worker’

  • Name the first one;

sitename-reverse-proxy

Modify the sitename to be your actual site name. The name can be anything, but including the sitename helps in case you want to do this multiple times, as each worker could be loaded under any domain.

  • Click deploy, and it will load in a default script, which we will replace.
  • Click on ‘edit code’;

  • Delete the default code, and then paste in the first set of code from below, for the ‘reverse proxy worker’.
  • Modify any mention of blog.domain.com or domain.com/blog to be the settings you require.

Be careful not to modify any existing, or add any new, trailing slashes or https mentions as they will break everything.

  • Click on ‘Save and deploy’ in the top right corner, and then ‘Save and deploy’ again on the little popup modal

  • Repeat the above steps for the redirect worker, named “sitename-redirect-worker”, and get that one deployed too.

 

2. Setup the Cloudflare routes
  • Open the website you wish to add the routes for, and then find ‘Workers Routes’

  • Click on ‘add route’

 

  • Create routes for the following two URLs (modifying them to match what you need), by selecting the redirect service worker you created, and ‘production’ environment

*blog.domain.com
*blog.domain.com/*

 

  • Create routes for the following two URLs (modifying them to match what you need), by selecting the proxy service worker you created, and ‘production’ environment

*domain.com/blog
*domain.com/blog*

 

  • After both sets of 2 routes have been created, you will see something similar to this on your Workers Routes page;

 

3. Modify the WordPress Site URL

The easiest step of them all.

Load up the WordPress admin area, and jump into Settings > General.

Modify the Site Address, and not the WordPress address, as per the below settings;
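As a guide, assuming the blog is installed at blog.domain.com and you want it to load under domain.com/blog, the two fields end up looking something like;

WordPress Address (URL): https://blog.domain.com
Site Address (URL): https://domain.com/blog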

 

4. Add a trailing slash redirect for the sub-folder

After all this, there was one final issue we couldn’t get solved, unfortunately. The blog homepage was available with both the trailing slash and no trailing slash. Just the homepage. Everything else works beautifully.

  • To get this patched up, we load up Cloudflare and head to the Rules > Redirect Rules

  • Click ‘Create rule’ under the single redirects section

  • Create a rule that uses the non-trailing slash blog URL version as the incoming requests rule, and the same URL but with a trailing slash as the URL to redirect these requests to
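As a sketch, with domain.com/blog as the placeholder, the rule ends up along these lines;

When incoming requests match (custom filter expression):
(http.host eq "domain.com" and http.request.uri.path eq "/blog")

Then: static redirect to https://domain.com/blog/ with a 301 status code, preserving the query string.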

5. Implement a redirect strategy

If this is an existing build you’re modifying, make sure you implement a full 301 redirect strategy! It should just be a simple 301 rule that forwards from the sub-domain to a sub-folder, but triple check it all.

There’s no point moving to a sub-folder if you break everything along the way.

 

The scripts

 

Reverse proxy worker

addEventListener('fetch', event => {
  // Skip redirects for WordPress preview posts
  if (event.request.url.includes('&preview=true')) { return; }

  event.respondWith(handleRequest(event.request))
})

// Rewrites a single attribute (href/src) when it starts with the old URL
class AttributeRewriter {
  constructor(rewriteParams) {
    this.attributeName = rewriteParams.attributeName
    this.old_url = rewriteParams.old_url
    this.new_url = rewriteParams.new_url
  }

  element(element) {
    const attribute = element.getAttribute(this.attributeName)
    if (attribute && attribute.startsWith(this.old_url)) {
      element.setAttribute(
        this.attributeName,
        attribute.replace(this.old_url, this.new_url),
      )
    }
  }
}

// The subfolder-to-subdomain mappings - modify these for your own domains
const rules = [
  {
    from: 'domain.com/blog',
    to: 'blog.domain.com'
  },
  // more rules here
]

const handleRequest = async req => {
  // Redirect WordPress login to the subdomain
  let baseUrl = req.url;
  if (baseUrl.includes('domain.com/blog/wp-login.php')) {
    return new Response('', { status: 302, headers: { 'Location': baseUrl.replace('domain.com/blog', 'blog.domain.com') } });
  }

  const url = new URL(req.url);

  // Work out which rule matches the incoming request, and build the subdomain URL to fetch
  let fullurl = url.host + url.pathname;
  var newurl = req.url;
  var active_rule = { from: '', to: '' }
  rules.map(rule => {
    if (fullurl.startsWith(rule.from)) {
      let url = req.url;
      newurl = url.replace(rule.from, rule.to);
      active_rule = rule;
      console.log(rule);
    }
  })

  // Fetch the content from the subdomain
  const newRequest = new Request(newurl, new Request(req));
  const res = await fetch(newRequest);

  // Rewrite link and asset attributes in the returned HTML using the active rule
  const rewriter = new HTMLRewriter()
    .on('a', new AttributeRewriter({ attributeName: 'href', old_url: active_rule.from, new_url: active_rule.to }))
    .on('img', new AttributeRewriter({ attributeName: 'src', old_url: active_rule.from, new_url: active_rule.to }))
    .on('link', new AttributeRewriter({ attributeName: 'href', old_url: active_rule.from, new_url: active_rule.to }))
    .on('script', new AttributeRewriter({ attributeName: 'src', old_url: active_rule.from, new_url: active_rule.to }))
    // .on('*', new AttributeRewriter({ attributeName: 'anytext', old_url: active_rule.from, new_url: active_rule.to }))

  // Leave JS and XML responses untouched; only transform HTML
  if (newurl.indexOf('.js') !== -1 || newurl.indexOf('.xml') !== -1) {
    return res;
  } else {
    return rewriter.transform(res);
  }
}

 

 

Redirect worker

// The subfolder the blog now lives under - modify for your own domain
const base = "https://domain.com/blog"
const statusCode = 301

async function handleRequest(request) {
  // Keep the WordPress admin/login accessible on the subdomain
  const excludedPaths = ['/wp-login.php', '/wp-admin', '/wp-admin/']
  const url = new URL(request.url)
  const { pathname, search, hash } = url
  const destinationURL = base + pathname + search + hash

  if (excludedPaths.some(path => pathname.startsWith(path))) {
    return fetch(request)
  } else {
    // Everything else on the subdomain gets 301'd to the subfolder version
    return Response.redirect(destinationURL, statusCode)
  }
}

addEventListener("fetch", async event => {
  event.respondWith(handleRequest(event.request))
})

 

Your subdomain to subfolder setup should be live

Following the steps above, your blog should now be publicly accessible under the sub-folder, and the admin panel will be accessible under the original sub-domain.

You should be able to directly load the new subfolder, and it’ll work as if you’re accessing the subdomain where it is actually installed.

I’d love to hear how it goes if you went ahead with the change! Both whether the instructions above completely worked for you, and how performance went if it was an existing build you modified.

Prioritising SEO Tasks with Development Teams

Prioritising SEO tasks is key to getting work completed by a development team efficiently.

Rather than just going “we need them all done”, or “they’re all a priority”, breaking your tasks into individual priority scores can help them gauge what’s most important.

A lot of guessing is involved, but I like to think of it as educated guessing.

Let’s break down task prioritisation, and how you can better request SEO work with development teams.

The ICE score

If you haven’t heard of it before, ICE scoring is a simple way of scoring a task across 3 variables to give it a priority.

Impact – How impactful do you think a task will be?

Confidence – How confident are you of that impact?

Ease – How easy a task will be. Essentially the effort, but a reverse score.

Similar to the RICE scoring method, but one less variable!

The ICE score allows you to essentially create a sortable list of tasks.

 

How to prioritise SEO tasks

You probably have a list of SEO tasks required already, and just need to prioritise them.

Give each task a score out of 10, for each of the 3 variables – Impact, Confidence, and Ease.

Then add the score up, and that is your ICE score, out of 30.
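If it helps to see the maths in action, here’s a quick sketch with some made-up tasks and scores, just to show how the sorting falls out. In a spreadsheet it’s simply a sum column that you sort by;

// Hypothetical tasks, each scored out of 10 for Impact, Confidence and Ease
const tasks = [
  { name: 'Fix canonical tags on SRPs', impact: 8, confidence: 7, ease: 9 },
  { name: 'Rebuild the site structure', impact: 9, confidence: 6, ease: 2 },
  { name: 'Add breadcrumb schema', impact: 5, confidence: 6, ease: 8 },
];

// ICE score = Impact + Confidence + Ease, out of 30
const prioritised = tasks
  .map(task => ({ ...task, ice: task.impact + task.confidence + task.ease }))
  .sort((a, b) => b.ice - a.ice);

prioritised.forEach(task => console.log(`${task.ice}/30 - ${task.name}`));
// 24/30 - Fix canonical tags on SRPs
// 19/30 - Add breadcrumb schema
// 17/30 - Rebuild the site structure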

Knowing what score each item variable requires comes down to experience.

Experience in implementing tasks.

Experience in working with developers and knowing how they respond to different types of tickets.

The big thing here is that it’s just a guess!

You don’t need to be accurate.

Since all the scores are based on what you know, you’re essentially weighing up each task against every other task, which is exactly what we want.

If you really have no clue about the ease of implementation for a task, then you could sit down with the devs and ask them for a hand.

Let them give you the ease scores themselves after a high-level chat about each task.

It can always be edited later, which will reprioritise the tasks.

 

Breaking down complex tasks into simple ones

One of the fastest ways I found to get SEO work done was to make the complex, simple.

Take the large, complex tasks, and break them down into a few tasks.

The initial task should be seen as a ‘foot in the door’ task. What is the absolute minimum amount of work required to achieve the result, with minimal downside?

Any other “sub-task” related to that task can be seen as upgrades.

There are so many instances where you just need to get a little piece of the overall task complete and live. Then you can either reap some rewards, or make further related tasks more attractive for a development team to pick up by simplifying them.

Lots of “simple” tasks can sometimes achieve more than a single larger task.

 

Working with development teams

Depending on whether you’re in-house and working with an in-house dev team, or whether you’re agency-side working with an external team, it’s not a straight task list hand-off, unfortunately.

You’ll need to work with the development team, probably a product manager, and walk them through the tasks.

You might need to “sell” them on each individual task, if they’re that picky.

It really comes down to what information you’re giving them.

Are you just giving them a straight-up list of tasks and making them go to an effort to work out specifics?

Are you flagging everything as a high priority?

Are all the tasks complex, and will take a lot of dev effort to complete?

One of the easiest ways to get SEO tasks implemented with a development team is to become gap-fill work.

Development teams may run a 2-week sprint and finish their planned work by the Wednesday or Thursday. They will be looking for something quick and easy to pick up that doesn’t require a full sprint brief.

Have gap-fill tasks ready for them, so that they can squeeze them in when needed.

Make a product/development team’s life easier, and you stand a higher chance of having your SEO work pushed through.

 

If all else fails – cake.

Legit.

Cake is an awesome way to sweet-talk your way to the top of a task list.

Who doesn’t love cake? And who doesn’t love people that bring them cake?

 

SEO task prioritisation template

Want a plug-and-play ICE scoring & task prioritisation template?

Grab my SEO task prioritisation sheet with the link below;

Just copy the sheet into your own Google Drive, and you will then be able to list out your tasks.

I work in a separate doc for an audit, and then link through to that from this prioritisation list.

You could just add a description column, and provide full details here, but managing images/comments is a little bit harder in sheets.

 

Simplify and get the job done

Getting your SEO tasks implemented can be time-consuming.

Simplify them, and not only get more done, but get more done faster.

Programmatic SEO: Building a Strong Site Structure

A strong site structure should be at the core of any programmatic SEO build.

How value flows around your site is crucial to ensuring both that key pages rank, and that longer-tail pages are seen as valuable enough to rank.

Let’s dig into building a strong site structure.

The importance of URL structure in SEO

The URL plays such a key part in defining how pages relate to each other.

Whilst a breadcrumb can also help do this, the URL structure does it more strongly and clearly.

The structure helps Google quickly understand what a page’s parent is, and thus it can understand the relationship between the two.

By default, the fact a page is a child of another page would help it know that the content will probably be extremely related, with a lot of topical cross-over.

The structure also helps pass SEO value around the site effectively.

Ensuring that child pages pass their value up the hierarchy, to their parent page.

Ensuring that a parent can give its child pages a little boost from its value too.

Topical relation + SEO value passing = a win.

 

Site structuring with breadcrumbs

When building a strong site structure, URL structure is number 1. However, site structuring via breadcrumbs can also come into play.

Not just as a backup, but also as something that can work hand-in-hand with the URL structure.

 

Breadcrumbing as a backup

If all else fails with being able to implement URL structures, breadcrumb structures are your best backup.

They’re pretty much the only other way, to effectively set the parent-child relationship.

The downside here is that they only really work up the chain, rather than down, due to how a breadcrumb works.

 

Breadcrumbing alongside URL structures

Leveraging breadcrumbs on top of a good URL structure is where they stand out.

Not only can you back up the URL structure, by following it up with a breadcrumb structure that imitates the URLs, you can also drop in extra levels where applicable.

Sometimes a piece of content sits across different levels in the URL structure.

You obviously can’t include both in the URL, as that wouldn’t be right.

You can, however, drop the extra level into the breadcrumb structure and let Google know that relation with no issues.

 

How to build a strong site structure

Let’s look at how I’d be recommending you structure your site.

Just remember, there isn’t a ‘right’ or ‘wrong’ answer here, but it pays to put in the effort at the start as it’s not exactly something you want to change later.

 

Organising your content around a core structure

My primary aim with the structure is to try and put the items with the highest differentiation between them higher in the hierarchy.

This way, you put related content together, and separate unrelated content as best as possible.

For real estate, at the heart of all listings are two channels – Buy & Rent.

So I’d recommend starting here with;

/buy/

/rent/

For the second tier, I would bring in location.

Something like;

/buy/melbourne/

/buy/sydney/

The third tier is when I would then bring in the property type;

/buy/melbourne/apartments/

/buy/melbourne/houses/

But you’d still have the no-location search URLs available, like;

/buy/apartments/

/buy/houses/

The reasoning behind structuring it this way is that buy versus rent are two completely separate journeys.

It’s a rare occurrence for someone to be looking at both renting & buying a home to live in.

The next thing they’re looking at is the location. There’s more of a chance they’ll want to look at both apartments & houses in a location, than to look at just apartments in multiple areas at once.

On top of this, you’ll also get property types like ‘apartments’, ‘units’, ‘condos’ and ‘flats’ being seen as extremely related in Google’s eyes. Sometimes even direct synonyms.

Google won’t ever see nearby suburbs in a city the same way.

Well, maybe not never. But very very very rarely.

 

Leverage your blog content

You can dramatically improve the value of your core structure, by leveraging your blog content.

Rather than throwing your blog content under /blog/, /advice/ or something completely random, nestle it under your core structure.

Sometimes easier said than done, but even just using the parent folder can dramatically improve the value passed through to that folder, helping it and its sub-pages rank.

This would involve post URLs like;

/buy/how-to-buy-a-property/

/rent/the-ultimate-guide-to-renting/

and similar.

If a post relates to the location too, it could be;

/buy/melbourne/how-to-buy-a-property-in-melbourne/

And an even more extreme could be about a location + property type article;

/buy/melbourne/apartment/how-to-buy-an-apartment-in-melbourne/

But that’s getting a bit deep, and would probably be a bit more on the difficult side of an integration.

You’ll always get a bit of tech pushback due to the way the URLs resolve from the server.

It can be done on pretty much any tech, with the work.

There’s also a bit of a shortcut, that would still work wonders.

An old workplace got around to restructuring their blog.

Whether they found an old doc I was working off, or whether one of the newer team got it done, they restructured their blog content to add value to their core structure.

airtasker.com/cleaning/guides/how-to-clean-home-after-flood/

They used the /guides/ subfolder to hack the content in there. The server knows where to grab the content from, purely from that subfolder.

An alternate way of doing this could be to just use ‘/guide-‘ with a dash, rather than creating a whole new directory, but the directory does make things a bit simpler.

They’ve used only their top-tier folder from the structure, and yet have added significant value to all content within the /cleaning/ folder now because of this.

 

Flat versus foldered

A debate that’s raged for years.

Is a flat site architecture better than a built-out, folder structure?

The way I see it, each folder level is similar to a 301 redirect. Probably doesn’t pass all the value through.

Let’s say it’s 80% (completely arbitrary, just to give it a number for context).

Go 3 levels deep, and you’d have that page sitting at 80% of 80% of 80%, which is 51.2%.

So wouldn’t the 80% of a single level be better than the 51.2% of a 3-level structure?

No, because that 3 tier is so much more targeted. Channel, location, & property type, just to name 3, could be included in a real estate portal.

The value gained from the 3-tier, nestled structure is more than what is lost through the levels.

I will always recommend a solid structure for any large-scale programmatic builds, and then a flatter, maybe one level, structure for smaller content sites where there’s less value gained from the tiering.

The structure is so important when it comes to content filtering, so giving Google the best hint to understand how each child relates to the parent is the easiest way to set that relationship.

Setting that relationship in a way Google can easily understand will help them rank the correct content, for the correct term.

 

There is no ‘best’ site structure

There might be ‘best for you’, but there is no ‘best’ site structure for you to build.

There are definitely wrong site structures though, I’ll tell you that.

Just build out the one you feel is best, based on what’s discussed above.

 

Should you be siloing your site?

To properly ‘silo’ your site, every single link on the page needs to be tamed and within the silo’s rules.

Every link.

It’s a lot of work.

Almost impossible at the enterprise level.

You can, however, keep note of what’s linking to what, and limit it where possible, to ensure the most value is passed to each individual link.

 

Is it worth doing a migration to update the structure?

It takes a lot for me to recommend an existing site migrate over to a new structure.

One of the core decision elements here is the amount of opportunity they stand to gain from doing so.

If they’re newer, planning a massive expansion, or have an extremely large amount of market opportunity and are prepared for the risk, then I’d say the migration to a more-solidified structure is worth it.

For a #1 in the market, a structure change could be more a solidification of their position, rather than a growth opportunity.

It’s a dangerous change to make, as you’re essentially redirecting over the core of your SEO performance, and risking it, during and after the migration.

 

Site structure is a key element

Just remember that site structure is a key element when it comes to the optimisation of your site.

Many other factors are at play, but the site structure is the core of your website.

Get it right, and the work done will be repaid.

Using Data to Determine What Filters Should be Targeted

Bedrooms, and bathrooms, and price ranges, oh my!

There are so many filters that could be used in the pretty URL, but what should be used?

What should earn the “pretty” status, and what should be relegated to query parameter status?

Some may say everything goes in pretty.

Some may just pick a few, and leave it at that.

Well, let’s take a look at how you could use data to inform your decision.

I’m not going to go into why you shouldn’t be creating pretty URLs for everything, that’s a separate post.

What we will run through are ways to set up keyword data, for you to gain insights about search trends of the filters you’ve got, so that you know what you should be optimising for.

The data you’ll need

To run this analysis, you’ll ideally just need an extremely large keyword research piece specifically for your niche.

It will need to be rather large to ensure you can get solid data, and you’ll also need to have it pretty clean, or at least understand the caveats of your data.

If you’ve also got a tonne of GSC data to mix in, then that would be great. That will help extend the data to cover a good portion of the 0-volume keywords that might not show up in keyword research tools.

For my examples below, I just spent 20 minutes pulling together some real estate keywords for 3 Australian cities, and ~15 suburbs from a “Top Suburbs in Melbourne” type list.

I wanted to limit it to location-specific keywords, as it can be hard to compare location vs non-location keywords without a giant list of seed locations and significant time spent. It makes things messy.

 

Setting up your data

Whether you’re using Excel or Google Sheets, you’ll need to create a data table to power the analysis.

The data table should contain all the keywords, their volumes, and associated categorisation.

 

Create categories for each filter

You’ll need to create categories for each of the filters that you are trying to analyse and work out whether they should be optimised for.

Go through and create columns, and set up their categorisation with rules for each possible value, so that you can capture as many keywords as possible.

For my example, I am using real estate keywords. A column has been created for each filter I’d like to analyse, along with categorisation for each of them.

Each filter has its seed values assigned, along with what the value is for that variable.

If a keyword contains either ‘buy’ or ‘sale’, it gets flagged as “Buy”.

If a keyword contains ‘1 bed’ or ‘1br’ it gets flagged as 1 Bedroom.

You can read more about how this works here.
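To make that concrete, here’s a rough sketch of one way you could write those rules, assuming the keywords sit in column A of a Google Sheet. The seed values are just examples, and you’ll want far more of them than this;

=IFS(REGEXMATCH(A2, "buy|sale"), "Buy", REGEXMATCH(A2, "rent|rental"), "Rent", TRUE, "")
=IFS(REGEXMATCH(A2, "1 bed|1br"), "1 Bedroom", REGEXMATCH(A2, "2 bed|2br"), "2 Bedroom", TRUE, "")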

You’ll want to be as thorough here as possible, and include as many variations as possible.

A couple of missed ones could really sway a decision.

Try and also create a catchall category at the end of filters with only a variable or two.

I created one for ‘features’ based on what I was seeing in the keyword data.

 

Prefiltering / cleansing the data

Depending on how clean your keyword data is, it might be better to just look at a portion of it.

A portion you know is 90% cleaner than the rest of the data.

For my real estate keywords, I know that if the keyword includes a channel, so ‘sale’, ‘buy’, ‘rent’, or ‘rental’, there is a higher chance of it being a keyword of enough quality for the study.

To include keywords that don’t include a channel (like ‘real estate’ or ‘properties’), I also include keywords including a bedroom or bathroom filter.

This is done via a YES/NO filter, that just flags it as YES if any of the filter cells have something inside them.
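For example, something along these lines, assuming the categorisation columns sit in C through E (adjust the range to wherever yours live);

=IF(C2&D2&E2="", "NO", "YES")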

All my data analysis will have this filter applied, and it brings the keywords down from 9,000 to just 2,000.

I know those 2,000 are trustworthy to tell a story.

 

Creating your pivot tables

You’ll now need to create pivot tables for each of your filters so that you have a way to read the data.

Go through and create a pivot table for each of your filters with the below data;

  • Filter as the row
  • Search volume SUM as value
  • Search volume COUNT as value
  • Search volume COUNT shown as % of grand total as value

The SUM should be obvious, being that it will be the total amount of search volume for each of the filter values.

The COUNT will be how many times that filter value is used among the keyword set.

The COUNT & % of Total will show us the actual % of keywords that use this filter value. A little quicker to analyse than the overall count alone.

 

Analysing and selecting your filters

Now we’ll get to read our data and see what we can make of it.

Let’s take a look at my property-type keywords.

We can see that of the 2,000 keywords included, 85% mention a property type. So only 15% are more generic keywords like ‘properties’ or ‘real estate’.

Even if you consider ‘homes’ as generic, that’s still less than a quarter of the keywords without a property type.

So yes, property type 100% needs to be optimised for.

 

Looking at the features keywords.

Only 2 keywords include pets, 2 with pools, and then 1 mentioning granny flat. If these were the only filter values available, I would not be optimising for them.

Similar story with the bathrooms keywords.

Only 2 keywords contain a bathroom-related phrase. Probably wouldn’t recommend targeting that in bulk.

Now onto the 2 that are a bit contentious when it comes to real estate sites.

The first one being bedrooms.

Bedrooms is one I personally recommend against optimising for directly under normal circumstances. At least at first anyway.

I feel it creates too many variations of URLs, with not enough reward/value in return for doing so. Can be worth targeting once all indexation/crawling boxes are ticked, especially with some rules in place, but maybe not directly out the gate.

In saying that, looking at the data 10% of the keywords (7% of total volume) include a bedroom value.

Is that enough to warrant targeting of it? Maybe.

But if we break that data down a bit further, and split out the city (Melbourne) from the ~15 suburbs, we see something a bit different.

16% of the city keywords (14% of volume) contain a bedroom term, versus only 5% (1% of volume) of the suburb keywords.

So that’s 1 location having a significantly larger amount of keywords including it than the 15 other locations combined.

So if you create pages equally amongst cities & suburbs, you’re going to be creating significant volumes of pages when only a small portion of them will be useful.

Yeah, long-tail value this and that. I’m not saying definitely don’t, I’m just advising against it without restrictions in place.

A similar situation is with the prices.

Pretty low volume for the majority of the keywords that include a price (normally ‘under xxx’ type keywords).

And if we break it into city vs suburb, we get;

None of the suburb keywords in this data include a price. It’s only at the city level.

 

Why some filters may not be worth targeting

I’m a big believer in limiting crawlable URLs where possible.

Minimising re-use of listing content, and avoiding the possibility of confusing Google too much.

Keeping the site as small as possible, whilst still targeting as much as possible.

So why would I recommend avoiding creating bedrooms or pricing optimised URLs in bulk?

Well, it comes down to page count.

Crawlable page count to be specific.

Let’s say you have a location set of 5,000 locations.

10 property types.

and 2 channels.

You’ve already got 100,000 crawlable URLs right there.

If you then have 7 bedroom options, you’re looking at 700,000 additional URLs, on top of the 100,000 that already exist, that Googlebot will have to constantly trawl through.

Is it worth enlarging your site by 700% to target an extra 7% in search volume?

If you think so, then go for it.

That’s also if you do it cleanly. If you have other filters with crawlable links on the site, that overall crawlable URL count will only increase.

So if you’re creating significant page volumes off of smaller % filters like this bedrooms count, you must ensure you have your crawling well in check before you launch.

That way you can avoid exacerbating any existing issues.

There are other ways of efficiently targeting these types of keywords though.

In particular, I cover a strategy here on how to target these filters that may have value at key locations, and not others, by having a couple of tiers of locations.

 

Picking your filter values

To try and keep some filters in check too, you can also set up the system so that only certain values of a filter get optimised.

Using the bedrooms as an example, you might choose to just create pretty URLs for Studios, 1 bed, 2 bed, and 3 bedroom apartments. 4+ bedrooms would then be relegated to the query parameter, and not receive the internal links pointing into it.

 

Let the data guide your optimisation

By leveraging this keyword data you can really gain an insight into what filters, and values, you should be optimising for.

Plenty of caveats, particularly around longer tail keywords that tools won’t give you, but there should be more than enough data to at least guide an initial decision.

It’s also easier to expose a filter later on, than to clean up the over-indexation caused by one if it needs to be reverted.

There’s also the other question here: is it even worth putting in the work to have separate ‘pretty’ and ‘parametered’ filters?

I’ll leave it to you to decide.

Programmatic SEO: Cleaning Up Over-Indexation of URLs

So, you’ve either done an oopsie or are jumping into a new site/client, and there are a few too many URLs.

What do you do next?

How do you fix it?

Do you even need to fix anything?

The URL over-indexation issue

When working with programmatic SEO, there’s a fine line between enough URLs to target your keywords effectively, and wayyyy too many URLs.

Particularly when it comes to filtered search result pages.

Basically, too many URLs and it becomes an over-indexation issue.

It’s over-indexation when you’re creating significant quantities of crawlable low-value pages.

Patching an over-indexing issue can seriously improve the tech SEO element of a larger website.

I’ve seen serious growth for sites from doing this alone.

It takes time, but patching over-indexation can lead to some epic wins for clients, or yourself, and really help Google hone in on your key pages.

 

The causes of URL over-indexation

These could be pages that;

  • Resurface essentially the same content, over and over.
  • Rearrange the order of the exact same content (sort/ordering selectors)
  • Have old/broken query parameters leading to duplicate content
  • Use query parameters for filters that are a pretty URL too
  • Use query parameters that don’t offer significant value to be separate URLs
  • Have no results available
  • Filtered URLs linking to a filtered product URL where that product doesn’t need the filters
  • Have capitalised letters in the URLs due to internal linking bugs
  • Extreme pagination counts
  • Crawlable result ordering URLs

There are so many different ways to create too many URLs, that it’s really site-dependent.

They can change from client to client, site to site.

 

Why it could be an issue

The large majority aren’t extremely useful for a good portion of your users.

Pretty much none of them are worth having Googlebot go over them.

These pages will dilute the value of other links on your site, but also, cause crawling issues by sucking up large portions of your crawl budget.

Many of these lower-value URLs can actually also outrank your primary URL, due to how Google will see the links pointing in.

They can be a real hassle!

 

Determining whether you have an over-indexation issue

The largest indication of an over-indexation issue is a significantly larger % of ‘excluded’ rather than ‘valid’ URLs in the coverage report.

This report is basically saying that Google has found a total of 16.5 million URLs on the site, but only believes 350K of them are worth indexing.

Some of the main warnings you’ll see with regards to over-indexation are;

 

Discovered – currently not indexed

Google has found the URL linked to, but hasn’t crawled it. It could either be queued for crawl, or Google pre-determined that it wouldn’t be worth crawling.

URLs could sit here for a day, or for an eternity.

Might be bad, might not. Really depends on the type of URL and why.

Improving internal linking, and prioritising key pages, can often help URLs here.

Another talked about factor here is content quality, however ‘technically’ these URLs haven’t been crawled yet so ‘technically’ Google wouldn’t know the content quality of that page. So content quality could come down to the content that is linking through to this page.

URLs can be indexed, and then move back to this stage. When that happens, it would lead more toward the content issue.

 

Crawled – currently not indexed

URLs here have been found, and crawled, by Google. They’re now either in a bit of a holding pattern waiting for rendering or some more ‘thinking’ from Google, or they’ve been deemed unfit for indexation.

It’s a bit hard to tell, but most of the time once URLs hit this stage they’ll slowly be processed and either indexed, or be moved back to Discovered.

Improved content can help a URL move out of this stage faster, but so can prioritised internal links so it’s a bit of a game here like the discovered stage.

 

Soft 404

A clear indicator it relates to your content this time.

Google thinks this page should be a 404, when in fact it’s not.

Most common with 0-result SRPs, where you’re actively trying to index pages that will say something like “0 results found” or “we’ve got no results that match your search”.

Google will pick these up and flag the page as a soft 404.

 

Duplicate without user-selected canonical, Duplicate, Google chose different canonical than user & Duplicate, submitted URL not selected as canonical

Essentially the same issue, yet the “without user-selected canonical” just means you haven’t implemented a canonical tag on the pages in question.

Google is choosing a different URL to the one you’re specifying in your canonical tag, and is deeming this version as a duplicate.

You’ve got too many URLs that are either extremely similar, or exactly the same, as each other.

This could be due to certain filters being crawled and indexed that return the exact same set of results.

This could be from location data sets where two locations are essentially the same, or you have some bad filter data where both ‘apartment’ and ‘apartments’ exist, and return the same results.

Google’s choice is sometimes wrong, but sometimes right too, so definitely needs to be investigated to be understood further as to how they’re duplicates.

 

Alternate page with proper canonical tag

One some might see as a content flag, this is actually usually a big tech flag.

This means you’re actively linking to pages that are then canonicalising somewhere else.

You should instead be directly linking to the canonical version of the pages.

 

Page with redirect

Unless you’ve done a large-scale migration, a significant portion of URLs sitting under this flag indicates that you could have internal links pointing at redirects.

Internal links pointing to redirects could be wasting your crawl budget by forcing Google to crawl multiple URLs to end up at a single URL.

 

Not found (404)

You’re linking through to URLs that don’t exist.

 

Manually verifying over-indexation issues

Another method of checking, but more importantly validating, the size of an indexation issue is to use a couple of advanced search operators in a Google search.

site:
inurl:

A combination of these two will significantly help you identify and isolate specific over-indexation issues.

If we use realestate.com.au as an example, we can look at two of the query parameters they use.

Their sort parameter is “activeSort” so we can do a quick indexation check by searching;

site:realestate.com.au inurl:activeSort

Only 21 results, it’s not an issue at all.

But if we then check out one of their filter parameters, the parking spaces filter, we can see a different story;

site:realestate.com.au inurl:numParkingSpaces

5,600 URLs are being flagged that have that query parameter in them.

In the scheme of things, it’s not major though. 5,000 is nothing for a site that size.

Is it worth it for them to fix? Probably not.

It’s certainly worth monitoring though.

 

The caveat

As with most SEO stuff, there’s a big caveat here.

These advanced operators don’t return exact result counts.

They’re generally not even close, but it’s what we have, so we gotta use it as best possible.

So take the numbers with a grain of salt, and use them to weigh any issues up against each other.

 

Cleaning up URL over-indexation

There are a few things you can look at for patching over-indexation issues.

 

Improve your internal linking

Improving your internal linking is the first step to patching over-indexation issues.

  • Remove crawlable links to heavily filtered results, like query parameter results
  • Remove links to sorted search results
  • Remove links to 0-result SRPs
  • Remove links to redirects

 

Redirect sets of URLs

There are cases when query parameters are indexed, that don’t actually do anything. There might be a mistake in the code leading to these being crawled, or they might have previously been indexed.

These are easily patched, by straight-up stripping them with a 301 redirect.

For over-indexation of active parameters, you might need a different solution though.

Completely case-dependent, however, there might even be a case for changing the parameter name and then redirecting the old versions.

Whether the actual parameter name is changed, or just the values, would be up to you.

You can update every location where that parameter is used to use the new values, and then strip the parameter whenever the old name, or old values, appear.
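If you’re on Cloudflare, a Worker along the lines of the reverse proxy scripts earlier on this page can handle the stripping. A rough sketch only, with a made-up parameter name;

// Strip a defunct 'oldSort' parameter and 301 to the clean URL
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)

  if (url.searchParams.has('oldSort')) {
    url.searchParams.delete('oldSort')
    return Response.redirect(url.toString(), 301)
  }

  // Anything without the parameter passes through untouched
  return fetch(request)
}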

 

Robots.txt blocks

This is a good way for a new site to avoid a couple of the ways over-indexation happens; however, it’s usually something I use as an extreme last resort on an existing site.

It completely cuts Google off and doesn’t give them a chance to clean up the mess that was created.

Any URLs that had any value within the blocks will now essentially have that value just disappear, rather than be redirected (whether actual redirect, or canonical) to a more appropriate spot.

Let me just leave this one here…

A follow-up tweet directly states that Google might actually continue to index pages blocked by robots.txt. You’re linking to them, Google thinks they’re important, so they’re going to guess and rank them anyway.

On top of this, any URL that was actually ranking and driving traffic will be completely cut off and will deindex. You may have a more appropriate URL to rank, but if Google can’t appropriately see exactly which, you could completely lose any traffic that was being driven from these blocked URLs.

If, however, you decide to go ahead with this, please give it 3-6 months from when you’ve implemented your fixes before implementing the block.
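For reference, the block itself is only a couple of lines in robots.txt; something like the below, with the parameter name swapped for whatever you’re trying to cut off;

User-agent: *
Disallow: /*numParkingSpaces=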

Please avoid unless absolutely dire.

 

Give. Google. Time.

Google is slow.

It could take months before your patches start to take serious effect. Give them time to recrawl, to update their index, and to properly re-assess these overindex URLs, after you’ve made your changes.

It’ll be worth it.

Internationalisation & Hreflang Management

Implementing hreflang for a programmatic build isn’t necessarily hard, once you understand what content you have.

If a piece of content is available in multiple languages and/or target markets, throw up a tag for that version.

If the piece of content isn’t available, don’t have a tag.

There seem to always be bugs in implementations though, mostly silly little things that are easily patched.

Is hreflang important for programmatic builds?

If you’re an international website, in either multiple markets or multiple languages, then yes, hreflang is extremely important for your programmatic build.

Google isn’t as smart as you think, and requires a lot of direction, or ‘hints’ as they call them.

Whilst Google will do its best to understand the relationship between two pieces of content across your site, hreflang lets you tell it bluntly and clearly that the two relate, either as different language versions or as versions targeting different markets of the same piece of content.

 

Pre-launching into a market with programmatic SEO

You could be a classified site, portal, or marketplace looking to expand your reach into new markets.

You can deploy a programmatic build, and start to gain a foothold in a new market before actually pushing the brand there.

Start to rank and drive traffic in a fresh country, and test the waters, without spending $ on any local advertising.

Essentially known as a ‘dark launch’, you can start to build rankings & value in a market, well before you actively want to push there.

The pages will get indexed, and you can start to push your content out in the market, and even acquire new content/listings slowly.

 

Tips for an hreflang implementation

Some of my top tips for an hreflang implementation are;

  • Ensure the current page is included in the tags
  • Ensure x-default is implemented correctly
  • Don’t flag content as available in a language/region where it doesn’t actually exist
  • Ideally implement meta tags or XML sitemap, not both
  • Ensure each page in the hreflang correctly references all other pages

 

Hreflang meta tags vs hreflang XML sitemap

I personally try to stick with hreflang meta tags in the <head>, as it’s easier to check, which makes my life easier.
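For reference, the meta tag version is just a set of link tags in the <head>, repeated identically on every page in the set. A quick sketch with placeholder URLs and markets;

<link rel="alternate" hreflang="en-au" href="https://domain.com/au/guides/buying-a-home/" />
<link rel="alternate" hreflang="en-nz" href="https://domain.com/nz/guides/buying-a-home/" />
<link rel="alternate" hreflang="x-default" href="https://domain.com/guides/buying-a-home/" />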

When sites start to get to 50+ available languages/locations, then things change though.

Aleyda Solis pretty much sums it up in this tweet.

So small sites get a meta tag. Big sites, get the XML sitemap.

It’s recommended you don’t use both though, as you can run into validation/update issues depending on which version was last cached.

 

Hreflang testing tools

 

Testing via generation

Whilst not a ‘testing’ tool, you can test the output of your build by generating hreflang tags on Aleyda’s site here. Check to see if the output of the tool is the same as what your build produces.

 

Favourite direct testing tool

My favourite testing tool for validating an hreflang implementation is this one.

You throw in your URL, along with selecting a user agent, and you’ll get a nice little report back of all the URLs included in the hreflang. The tool will then quickly scan them all, to determine the tag’s validity and whether return references are available.

 

The top issues with hreflang tag implementations

 

No self-referencing hreflang

Sometimes the canonical is believed to be the required self-referencing tag. Whilst it is required, it is separate from the hreflang setup.

You must include an hreflang tag for the current page, just as it would be referenced on any other page.

Basically, every page gets the exact same set of hreflang tags.

 

Incorrect language/region content being shown

A page should not have any content other than its flagged content. Well, excluding any terms/legal callouts that are required that is.

If a language is flagged as German, then it should only include German content.

If a page is flagged as English, it should only include English content. No mixing and matching.

You should hide any content that isn’t in the flagged language.

 

Incorrectly flagging a URL that doesn’t exist

Sometimes, content is mismatched between languages & regions, and that is perfectly fine.

But don’t tell Google that the content is available in other languages & regions if it’s not.

Don’t include hreflang references to pages that don’t exist in the flagged language. These pages should instead be removed from the hreflang, and should not have hreflang tags on the page itself.

So for a blog post, if that post is only available in its current language, don’t have hreflang tags (you could have a tag to itself, and only itself, if that simplifies the implementation though).

If a page is available in 3 of 4 languages, then only mark up those 3 languages, and have the 4th as a 404.

 

Make your hreflang simple where possible

Try and keep it simple, and as your build grows, it will be easier to maintain.

Programmatic SEO: Optimising for the Highest Value, Low Tier Filters

In my post on how search filter pages should work, I talked about not linking through to query parameter URLs.

The bulk of these pages would not have value, particularly when you have millions upon millions of combinations.

So how do you optimise for the combinations that do in fact have value?

The filters you should be creating pages for

Chances are, you already have a rough idea of some of the keywords you want to target.

If you don’t, then look through your keyword data.

What filters are mentioned in the keywords, that don’t currently have pretty URLs for them?

I’ll go into a bit more detail and provide a template for this a little later, for now though, have a good look through your keyword data.

Do people search for bedrooms? bathrooms? prices? features?

Come up with a list of the filters used, along with some sample keywords.

 

Why you should be creating these pages

To target the longtail.

These are the keywords you’ve determined don’t deserve a pretty URL, and don’t deserve to be actively linked to, in bulk.

You don’t want thousands of these pages receiving SEO value and clogging up your crawl rates, as only a handful of them will have targetable volume.

The bulk won’t have volume, however, a small portion of them will.

You want to be able to still target the small percentage of pages that do have value.

 

How do you create pages for these filters?

Unfortunately, it’s not just as simple as politely asking “create these pages” to your dev.

Even cake won’t solve this one without some forethought!

What categories/subcategories are used?

What locations?

What filters get applied?

Where are these pages linked from?

This is something I have put some thought into over the years, and have come up with a basic strategy that can be applied, and expanded upon, depending on the build.

You want to create “Custom Filter” pages.

 

Creating custom filter pages

To create these custom pages, you need to be as specific as possible about what you’re trying to do, as the developers will have the challenge to solve.

The devs will need to add a layer to an already complex setup, that enables these pages to seamlessly fit into the existing structure.

You want to let them know things from the page name to the filters being used.

I’d recommend flagging a set of top locations, that become the only locations these are used for. Yeah, there are probably more, and yeah you might create too many pages for some of the keywords, but a list of ~100 locations is still better than a page for all 5,000 locations.

There are two ways I see you can request these pages, and it really comes down to what would be easier for your tech teams, and provide the most value in return for you.

 

Creating pages for each variable value

The simplest way to request pages. You essentially just request a page for each value of a variable, and this gets combined with a top locations list.

These pages act exactly like a normal ‘pretty URL’ filter, except they have the top locations filter applied.

ie you could say you want a “<bedrooms> bedroom properties in <location>” page.

This would then replace <bedrooms> with every variation in your data, and <location> with a top set of defined locations that you want the page for.

 

Creating pages for specific variable values

The other method here of creating pages is to specify the actual pages you’d like to create along with the inclusion of <location>, which will be the only dynamic variable here.

ie you’d request “1 bedroom properties in <location>” and “2 bedroom properties in <location>”.

This gives you more control, as you could just use the selected values from certain variables. Great when you have many values for a variable, but only want pages for some.

You will, however, need to create hundreds of these rows for each specific rule you want, so it’s not quite as neat to manage it all, but you get more control.

 

Structuring your custom pages

Ideally, but not 100% required if completely unavoidable, these pages should sit within the existing site structure, and not within a separate site section (like /custom/ or something more appropriate).

You should be trying to fit them in neatly.

So something like;

/<channel>/<custom>/

/<channel>/<location>/<custom>/

/<channel>/<location>/<propertyType>/<custom>/

You might even be asked to pick one structure, and stick with it, in which case try and pick the one that would make the absolute most sense.

Sometimes that means just nestling it within the channel structure, but ideally for this, sometimes it would be within the location structure.

Get the URL in as deep as makes sense, whilst still keeping it clean.

 

Custom filter pages fill the gap

These custom filter pages you can create will help fill the gaps between programmatic SEO & content creation.

You get more control with less chance of over-indexation, creating the perfect system in between.

HTML Sitemaps – It’s 2022, do you REALLY Need One?

The old HTML sitemaps hey.

Are you still actively trying to create one for large-scale programmatic builds?

Crawling? Indexing? Deeplinking? Passing SEO value?

Yeah, they can help you shortcut passing values and reduce click depth to deep links.

So can a good, and more natural, internal linking structure though, especially once some prioritised links are included.

I am not talking about a few links to the top sections of your site. They might still be valid, particularly for smaller sites.

I am talking about the endless page(s) of links, that basically replicate an XML sitemap.

 

What is an HTML sitemap?

An HTML sitemap is one of those pages you’ll normally see linked in site footers, under an anchor of “Sitemap”.

Basically just another way of being able to tell Google about the different links on your site, particularly for larger sites.

Google recommended a “user viewable site map” back in 2010 and this is really where HTML sitemaps stem from.

In their latest SEO start guide here, they recommend the following;

Create a navigational page for users, a sitemap for search engines

Include a simple navigational page for your entire site (or the most important pages, if you have hundreds or thousands) for users. Create an XML sitemap file to ensure that search engines discover the new and updated pages on your site, listing all relevant URLs together with their primary content’s last modified dates.

Avoid:

  • Letting your navigational page become out of date with broken links.
  • Creating a navigational page that simply lists pages without organizing them, for example by subject.

 

Why you should avoid using an HTML sitemap

HTML sitemaps are just a shortcut to a proper internal linking implementation.

A friendly header menu, and friendly overall site navigation, should be covering what an HTML sitemap would.

Do you think Google is really going to care what links you’ve got on a page when it’s just full of links?

It’s like those automated link swaps from 2008, where you’d add a page to your site that would automatically update with links to every other site in the network. They were killed off a while ago.

Sometimes these sitemaps are broken up into pages and could have an H1 detailing what the links are about.

ie, “real estate in Sydney” could be a page full of links related to Sydney, and the suburbs of Sydney.

This can cause duplicate keyword targeting, for a page that’s purely made of links.

Yes, I have seen these types of pages indexed.

They’re also another area on the website where you need to be mindful of what pages you’re linking to.

Particularly when it comes to 0-result SRP pages.

Ensuring that links are always updated, with no error pages etc.

If the main purpose is deep linking from the HTML sitemap, then one of Google’s ‘avoids’ from the SEO starter guide comes into play;

Creating a navigational page that simply lists pages without organizing them, for example by subject.

So it kind of writes the deep linking off, as many programmatic sites will try and jam as many links in here as possible.

A natural linking setup should offer a much better solution to this.

 

Why you would still create an HTML sitemap

There is only one real reason that I see a full HTML sitemap as valid for larger builds.

Tech limitations.

No matter what ideas you have, the strategies you want to implement, the optimisations you want to make, tech limitations will find you.

And they will cut your dreams off.

Sometimes you can’t do what you want, either due to actual tech talent available, or you’re up against a product manager that doesn’t want to hear the magical 3 letters – S E O.

Apart from fighting harder, and/or going around the product manager, you sometimes have to do hacky SEO.

In this case, an HTML sitemap should be seen as hacky SEO.

Not the best, but it’ll get the job done at the start until a more viable solution can be put in place.

You’re just filling in the gaps where the tech can’t properly do a good link structure.

If you can do a proper link structure, however, there should be no need for a built-out HTML sitemap.

Google should be naturally crawling the site, going from page to page, discovering the links with context.

Not just discovering everything on a page with 500 other links.

 

How to deprecate your HTML sitemap

The first step is ensuring you’ve properly implemented internal links – this is crucial.

Links to parent pages, child pages, and some cross-links need to be across the site.

Ideally, along with some “priority” links, ensure your top pages / most competitive pages have links from some pages further up in the hierarchy.

After that, you’ll need to redirect the sitemap URLs. You could 404 / 410 if you want, but I prefer to redirect to related URLs where possible to help retain any value.

If you had a paginated HTML sitemap, particularly one that included keywords like my earlier example, then try and redirect these as best possible to an appropriate page.

Treat this deprecation like a migration, where you want to redirect to the most similar page possible.

Do you need to deprecate it though? I’ve had clients where I won’t bother recommending it.

Particularly when there are things that are more valuable that could be done.

I will, however, recommend killing it off as soon as I see it become an actual issue.

Just remember, by removing something like this you need to make sure you’ve properly covered internal links. Even though this sort of setup isn’t ideal, you could be creating a tonne of orphan pages if you’re not set up properly before removal.

 

Some may disagree

I know there are SEOs that will disagree with this, but I haven’t seen an HTML sitemap do anything useful in the last 5+ years.

It’s time to cut them from site builds.

Death of Internal Nofollow: Use JS to Stop Google Crawling

Death of Internal Nofollow: Use JS to Stop Google Crawling

Over time, I have noticed nofollow internal links being crawled more and more.

It doesn’t seem as easy as it used to be to slap a nofollow tag on a link and have it keep the URL from being crawled & indexed.

Google’s even made a few comments, and updated guidelines on this in recent times, with the biggest comment being here;

When nofollow was introduced, Google would not count any link marked this way as a signal to use within our search algorithms. This has now changed. All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.

Nofollow is now a hint.

Yep, just a little hint. It does not stop Google like it used to.

So yeah, it might be worth still slapping the tag on a link, but if it’s not 100% full enforcement and you’re linking to a million variations of a page… you’re in for some over-indexation fun.

Why do you want to block Google anyway?

Filter links.

Filter pages can create millions of crawlable URLs out of sites that should only have tens, or hundreds of thousands of pages.

The canonical tag just doesn’t work like it used to, and even with it you’re also blasting through your crawl budget on these wasted pages.

 

Alternatives to using nofollow on internal links

The alternative to using nofollow is actually doing what most SEOs would never normally want to do.

Implement client-side JS to drive these links.

You want to actually use onclick JS for these links.

Words you will rarely hear from an SEO, because it goes against everything that’s good to help SEO, unless you want to unSEO.

Specifically, onclick JS links that don’t have the URL available in the rendered HTML source code. You’ve gotta use a separate JS file to hold the URLs.

Small but significant detail.

These onclick JS links won’t expose the URL in the HTML code, leaving the link hidden from Google due to the way Googlebot crawls.

It grabs the links, and then crawls. It doesn’t “click” links.

Every link that currently exposes the URL in the HTML source, that you won’t want to be crawled, should be changed to an onclick link that only renders client-side once it’s clicked.

These onclick links are essentially invisible to Google, yet they appear and behave as normal for users.

If however, you do an onclick link where the URL is still available in the HTML source code, Google will be able to crawl that.
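
To make that concrete, here’s a minimal sketch of what such a link could look like. The class name, data attribute, file name and URL map are all hypothetical; the key point is that the URL only lives in a separate JS file, never in the rendered HTML.

// links.ts: shipped as a separate file, so no URL appears in the rendered HTML source.
// The markup is just something like: <span class="js-link" data-link-key="bedrooms-3">3 bedrooms</span>
const linkMap: Record<string, string> = {
  "bedrooms-3": "/buy/sydney/apartments/?bedrooms=3",
  "price-500k-1m": "/buy/sydney/apartments/?priceBetween=500000-1000000",
};
document.querySelectorAll<HTMLElement>(".js-link").forEach((el) => {
  el.style.cursor = "pointer"; // behaves like a link for users
  el.addEventListener("click", () => {
    const key = el.dataset.linkKey;
    if (key && linkMap[key]) {
      window.location.href = linkMap[key]; // the URL is only resolved client-side, on click
    }
  });
});

Users get the same click-through behaviour, but Googlebot parsing the HTML source finds no href to follow.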

 

Why a robots.txt block or a meta robots noindex/nofollow tag might not be the solution

Blocking the pages in the robots.txt file, or using a noindex/nofollow tag might be the ultimate solution for some.

However, there are a couple of things still in play here that need to be considered.

Actively linking to a page that is blocked in the robots.txt is like telling Google something cool is behind this wall, but they’re not allowed to look.

Actively linking to a page with a meta robots noindex/nofollow tag is telling Google about your cool URL, letting them access it, and then telling them it’s not that important, even though you’re linking to it. You’re wasting Google’s time, crawl budget, and your server resources in allowing Google to crawl it.

In both cases, you’re diluting a page’s value by linking through to these pages you don’t actually want to be ranked or crawled.

John Mueller has also made a comment regarding robots.txt blocks of parametered pages, so it’s worth keeping this in mind:

 

Should you still use nofollow on internal links?

Yeah, there is no reason to remove it if you actually can’t implement an alternate method.

But if you’re implementing an alternate linking method to ensure Google doesn’t crawl it, the nofollow tag kind of gets nulled because Google won’t see the link anyway.

 

The internal nofollow is dead

The internal nofollow is dead.

Now is the time to switch to JS links to prevent the crawling of pages you don’t want to be crawled.

Programmatic SEO: Search Filters & Faceted Search Optimisation

Programmatic SEO: Search Filters & Faceted Search Optimisation

You can make or break a site, just based on how you handle your search filters.

The difference between a good & bad search filter setup can mean the difference between a good amount of quality results that are indexable, and millions upon millions of crawlable & indexable URLs that Google will ignore the majority of.

Let’s take a look through filtered search optimisation for programmatic SEO builds.

The two types of search filters

One thing I always try to define for a client is the two types of filters.

Those you want a pretty URL for, and those you don’t.

You shouldn’t be creating a pretty, completely indexable, URL for every filter combination ever.

You need to define a set of filters that will allow you to target 90% of searches, with a small portion of the URLs.

I was going to use the 80/20 rule here, but 20% of the URLs would be so wrong.

You would be creating millions upon millions of crawlable URLs to target the remainder.

I go further into this later in the post.

 

The best URL structure for search filters

When creating search filter URLs, you have to keep in mind the structure of the website.

You should be parenting the filtered content, below its parent page.

So for real estate, if you have an ‘apartments for sale in Sydney’ page, the filters would be;

Property Type: Apartments

Channel: For Sale

Location: Sydney

How you’re structuring the website will control the order in which you use, and how you use, each of the different filters.

Some filters may be a part of the pretty URL whereas some other filters will be query parameters that are just tacked on at the end.

For my clients with the above filters, I would be recommending the following structure;

/<channel>/<location>/<propertyType>/

The reasons behind this choice will be documented separately, however, with that structure in place we know how we should handle this type of URL.

/buy/sydney/apartments/

If we then add a pricing filter, or a bedrooms filter, the URL would change to something similar to the below;

/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3

So there is a clear separation between the ‘pretty’ portion of the URL and the ugly query parameters.
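
As a quick sketch of that split (the function and parameter names are just illustrative, assuming the /<channel>/<location>/<propertyType>/ structure above):

// Pretty filters build the path; anything else gets tacked on as query parameters.
function buildSrpUrl(
  channel: string,
  location: string,
  propertyType: string,
  extraFilters: Record<string, string> = {}
): string {
  const path = `/${channel}/${location}/${propertyType}/`;
  const params = new URLSearchParams(extraFilters).toString();
  return params ? `${path}?${params}` : path;
}
// buildSrpUrl("buy", "sydney", "apartments", { priceBetween: "500000-1000000", bedrooms: "3" })
// returns "/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3"

Because anything outside the pretty structure can only ever land in the query string, the clean separation is enforced by construction.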

 

How to handle URLs for multi-select filters

Multi-select filters can lead to issues, particularly when the multi-select filter is a part of a pretty URL.

Let’s say you have a multi-select filter of property type, with apartments & houses as an option.

If a consumer selects both of them, you want to make sure that both apartments & houses don’t end up in the URL.

You don’t want to end up with /buy/sydney/apartments/houses/ or /buy/sydney/apartments-houses/.

Whilst you could handle this where you prioritise one, and then query parameter the other, I prefer a simpler solution.

When 2 or more options of a ‘pretty URL’ filter are selected, use them both in a query parameter rather than pretty URL.

ie domain.com/<channel>/<location>/?propertyType=type1-type2

This gets the parameters stripped in the canonical tag, and just ensures you don’t get any issues arising from duplicate targeting or the creation of additional pretty URLs.
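
A small sketch of that fallback (again, the names are illustrative): one selected value stays in the pretty path, two or more drop back to a query parameter.

function propertyTypePath(channel: string, location: string, types: string[]): string {
  const base = `/${channel}/${location}/`;
  if (types.length === 1) {
    return `${base}${types[0]}/`; // /buy/sydney/apartments/
  }
  // 2+ selections: keep them out of the pretty URL entirely
  const param = new URLSearchParams({ propertyType: [...types].sort().join("-") });
  return `${base}?${param.toString()}`; // /buy/sydney/?propertyType=apartments-houses
}

Sorting the selected values also gives you one stable parameter order, so apartments+houses and houses+apartments resolve to the same URL.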

 

How to handle internal links to filtered search results

This is one of the main causes of indexation issues, due to the fact that internal links hold such weight with Google.

It’s also one of the easiest ways for a larger-scale website to make improvements that can actually move the needle in the rankings.

 

The URLs you should be linking to

The quick summary here: if a page has a pretty URL, at least 1 result, and relates to the current page, it should definitely be linked to.

So you link to all the filters of the current page that have a pretty URL & at least 1 result.

I cover the two different types of internal links for programmatic sites in a little more detail here.

You should avoid actively linking to any query parameter filter.

 

Why crawlable links to non-pretty URLs, or 0-result SRPs, should be avoided

Each crawlable link acts as a vote for a URL in Google’s eyes. If you’re constantly ‘voting’ on poorly filtered URLs, or 0-result SRPs, Google is going to place more weight on these URLs than what they’re worth.

This will lead to crawling and indexation issues, particularly when it comes to prioritising specific URLs above each other.

So on top of your pretty URL filters of Channel, Property Type, and Location, you could have a significant quantity of other filters available, including but not limited to;

Bedrooms

Bathrooms

Car Spaces

Price

Floor Area

Land Area

Nearby Amenities

New / Established

Property Features

The list quickly builds up.

As an example, let’s say that each of the 9 filters had 10 options available to filter by.

You’re going to have 90 filterable versions of a URL, all with query parameters, that are just filtered views of the primary page.

Each of these is significantly re-using the primary results, and Google will crawl and see this on each page.

How many of those filters are actually going to have search volume?

A small handful might, for some top-level locations.

On top of this, these 90 filtered versions of the URL, would then link to the other 89 versions of that URL, with that additional filter applied.

You’d create 8,010 versions of a single URL, with just 9 filters and 10 options for each.

But then if you have 2 channels, 5,000 locations, and 5 property types, you’d have 50,000 pages, with 8,010 versions each, giving you a lovely 400,500,000 URL combinations that will be crawlable.
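
If it helps, here’s the same back-of-the-envelope maths laid out as a quick calculation, using only the example figures above:

const filters = 9, optionsPerFilter = 10;
const singleFilterViews = filters * optionsPerFilter; // 90 filtered views of one page
const versionsPerPage = singleFilterViews * (singleFilterViews - 1); // 90 x 89 = 8,010
const primaryPages = 2 * 5000 * 5; // channels x locations x property types = 50,000
const crawlableCombos = primaryPages * versionsPerPage; // 400,500,000 crawlable URLs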

Google would have a field day.

Do you think “new 3 bedroom home with 2 bathrooms and 2 car spaces under $450000 with swimming pool and balcony” gets enough search volume to warrant the page being linked to?

There are ways you can optimise the pretty URLs to capture a large portion of these super long-tail queries.

Yeah, you won’t capture it all. But you also won’t need to create 400 million pages to ensure that you do.

This is obviously a worst-case scenario, should you have no limits in place. And yes, I have seen this multiple times.

 

How you should link to filtered URLs

It’s not just what pages you link to, but how you link to them.

You need to keep SEO and the user experience in mind when linking, as you obviously still want users to filter by price, bedrooms, and an array of other filters that will help them find the content they’re after.

Links will be broken down into your ‘SEO links’ that should be server-side and in the source HTML, and then other links/interactions that should not be crawlable, and should only be available via client-side JS / onclick event links.

You’ll probably have some sort of filtering widget, normally in the sidebar, that contains every filter available for the set of results.

Provided these filters don’t expose any links in the HTML source of the page, this widget can be left completely untouched, free for the designers to toy with as required. Don’t fight design on this, we can get more value from some slightly separate SEO links :)

However, if these filters expose links in the HTML (dropdown filters sometimes do this), then you’ll ideally need to get these switched to onclick event / client-side javascript links, to ensure they’re not crawlable/as easily crawled. You can read more about how these JS links work here.

The other links, the pages you actively want to link to, should be added to a separate little widget below the filters, under a title like “popular locations” or “popular property types”.

You can also throw them in a nice footer widget, but a sidebar link might carry more value, so it could be preferred.

They just need to be exposed in the HTML source, unlike the other non-pretty, parametered URL links.

 

Blocking parameter filters in the robots.txt

This is something many SEOs will do to avoid crawling & indexation of query parameters.

It’s a viable strategy for newly launched sites to prevent issues, however, existing sites need to keep a few things in mind.
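
For reference, that kind of block typically looks something like the below in the robots.txt. This is a sketch only, using the example filter parameters from this post rather than anything site-specific (Googlebot supports the * wildcard pattern here; not every crawler does).

User-agent: *
# Illustrative only: block the example filter parameters used in this post
Disallow: /*?*priceBetween=
Disallow: /*?*bedrooms=
Disallow: /*?*propertyType=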

Personally, I prefer to try alternative methods of patching crawling/indexation issues that can be attributed to parametered pages.

I will try and lower their value, by removing links pointing in, and then hope that the canonical tag takes over and does what it’s supposed to.

 

1. Are the primary, clean URLs indexed?

If you’ve actively linked to parameter-filtered URLs of SRPs, then you may be blocking the only indexed version of a page.

Google might not have the primary page for those search results, indexed.

While yeah, Google will eventually index the new URL, you might be temporarily killing a chunk of your traffic.

 

2. Do the parametered URLs have links coming in?

If the parameter-filtered URLs have links pointing in, you could be culling any value they pass. Once blocked, Google will no longer see the canonical tag and won’t assign that weight to the parent, non-parametered URL.

Double-check this, and make sure you’re not about to potentially remove the value these links would have passed.

 

Google says no

All in all, there’s this.

So keep that in mind.

 

Handling canonical tags of search filters

Pretty URLs get included, along with pagination.

All non-page query parameters should get stripped.

That will help pass your SEO value around, ensure minimal over-indexation, and try to keep Googlebot in check if it discovers these filter URLs.

Let’s say you have a URL of example.com/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3

The canonical tag I would be recommending is;

<link rel="canonical" href="https://example.com/buy/sydney/apartments/" />

Canonicals are seeming more and more like a suggestion rather than a rule though, so just keep in mind that, unfortunately, they don’t control indexation as well as they used to.
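
A minimal sketch of that stripping logic, assuming pagination uses a ?page= parameter (adjust for whatever your pagination actually uses):

// Keep the pretty path, keep pagination, strip every other query parameter.
function buildCanonical(rawUrl: string): string {
  const url = new URL(rawUrl);
  const page = url.searchParams.get("page"); // assumed pagination parameter
  url.search = ""; // drop all query parameters
  if (page && page !== "1") {
    url.searchParams.set("page", page); // re-add pagination only
  }
  return url.toString();
}
// buildCanonical("https://example.com/buy/sydney/apartments/?priceBetween=500000-1000000&bedrooms=3")
// returns "https://example.com/buy/sydney/apartments/", matching the canonical recommended above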

 

Common mistakes in search filter handling

Here are the common mistakes I see with clients when it comes to handling search filters.

 

Ordering of filters not handled leading to a duplicate page for each possible combination

Having both ?priceBetween=500000-1000000&bedrooms=3 and ?bedrooms=3&priceBetween=500000-1000000 crawlable will lead to crawling & indexation issues. Ensure these alternate orders can never be crawled, with all internal filters & links using a single correct order, or that you have a way to detect them and 301 redirect to the primary order.

Even though you’ll have canonical tags attempting to clean this up, they don’t always work, so you should at least attempt to patch this before it becomes an issue.

If you’re 100% definitely not actively linking to any of these query parameter pages, and there is absolutely no chance of links coming through, then you’ll be fine.

But, well… bugs happen. Just keep that in mind.
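
Here’s a minimal sketch of the detect-and-redirect side of that, assuming you can define one fixed parameter order (the order list is just an example):

const PARAM_ORDER = ["priceBetween", "bedrooms", "bathrooms", "carSpaces"]; // assumed primary order
const rank = (key: string) => {
  const i = PARAM_ORDER.indexOf(key);
  return i === -1 ? PARAM_ORDER.length : i; // unknown parameters sort last
};
function normaliseFilterUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  const entries = [...url.searchParams.entries()].sort(([a], [b]) => rank(a) - rank(b));
  url.search = new URLSearchParams(entries).toString();
  return url.toString();
}
// If a requested URL differs from normaliseFilterUrl(requestedUrl), 301 redirect to the normalised version.

The same normaliser can also be used when generating internal links, so the “correct order” only ever exists in one place.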

 

Search filters are available as both a pretty URL and a query parameter

When you have both /buy/sydney/apartments/ and /buy/sydney/?propertyType=apartments crawlable and indexable, you’re duplicating and devaluing some of this content. You need to make sure the parameter redirects to the pretty URL version (unless multiple values exist).

 

Multi-select filters adding both selections to a pretty URL

When URLs are handled like /buy/sydney/apartments/houses/ or /buy/sydney/apartments-houses/. My recommended way of handling this is moving both to a query parameter version, like /<channel>/<location>/?propertyType=type1-type2.

 

Search filters included on listing/product links

This is one I have now seen a few times, and it caused massive indexation issues. The query parameters were included on the links to the listings. Google then not only crawled them all, but indexed many of them too, and plenty were ranking. Ranking with search filter query parameters on URLs for pages that didn’t do anything with them.

I’ve also recently learnt that this is a Shopify default… which is weird.

You can read here how to fix it, but the collections part of the URL is included in the product URL they link to. This is stripped with a canonical, but what a mess this can cause for Google!

And yes, I am looking at a site right now, that does this, and the collection’s version of a product URL is ranking.

 

Non ‘pretty’ search filters are included in the canonical

If you don’t want it to rank, don’t include it in the canonical tag. Many sites still include these parameters in the canonical, even though they generate thin & heavily re-used content.

To avoid issues, these thin parameter filters should be stripped, passing their value/indexing back up the chain to their primary results URL.

 

Results not loaded server-side

There are plenty of site builds where the majority of the site is server-side rendered, but then their entire search is client-side. This dramatically impacts indexation & crawling of the search, which is critical. The entire page, including the search results, should be loaded server-side, not just the overall template.

 

Cleaning up over-indexation of search filters

If you’ve exposed too many search filters, and need to clean them up, then you will need to undertake over-indexation clean-up… which is a whole separate topic.

You can read more about over-indexation clean-up for programmatic builds here.

 

Extra: How can you target the higher-value low-tier non-pretty URL filters

Well, that’s a mouthful.

Something I will cover in a bit more detail later, but yes. There are super long tail filter variations that have value, and are worth targeting.

But you need to be careful with this, as it is extremely easy to create 100s of thousands of pages, to just target a handful of keywords.

You want to target the top 80% of each of these types of keywords.

The ideal scenario to handle these is a set of custom rules.

These rules should allow you to map out what filter combinations could be used.

A way you can select 100 key locations, out of the 5,000 you have in your data set.

A way to select just 2 property types, from the 10 in your data set.

A way for you to set the price filter to under 300,000.

The system could then spit out the combinations of those 100 key locations and 2 property types, to create pages for “<property type> for sale in <location> under 300,000”.

You’ll get 200 combo pages that capture 80% of the total volume, rather than 50,000 pages.
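
One way such a rule could be shaped, purely as an illustration (the field names and the expansion helper are assumptions, not a real system):

interface CustomPageRule {
  template: string; // e.g. "<propertyType> for sale in <location> under <price>"
  locations: string[]; // the 100 hand-picked key locations
  propertyTypes: string[]; // the 2 selected property types
  price: string; // e.g. "300,000"
}
function expandRule(rule: CustomPageRule): string[] {
  const pages: string[] = [];
  for (const location of rule.locations) {
    for (const propertyType of rule.propertyTypes) {
      pages.push(
        rule.template
          .replace("<propertyType>", propertyType)
          .replace("<location>", location)
          .replace("<price>", rule.price)
      );
    }
  }
  return pages; // 100 locations x 2 property types = 200 pages
}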

 

Top tier strat: Automatically targeting these higher value but low tier filters

Recently, I have started recommending a new approach.

Three tiers of filters with pretty URLs (though this could be expanded).

Top tier – Every value gets a pretty URL, at all levels

Mid-tier – Every value gets a pretty URL, but only for top-tier locations

Low tier – Always in a query parameter.

This middle tier is new to the way I am pushing clients to handle URLs, and offers a more “simplistic” approach to landing page automation.

Let’s go back to the bedrooms and pricing examples.

Rather than creating “Properties under <price> in <location>”, and “Properties with <bedrooms> bedrooms in <location>” for every single location in your database (5,000+), you could filter this to just a set of top locations.

For price, you might have 15 values. Instead of creating 15 x 5,000 pages, you could create 15 x 100 pages.

For bedrooms, you might have 6 values. Instead of creating 6 x 5,000 pages, you could create 6 x 100 pages.

You might want more pages than just the ‘top tier’ locations for some filters though. Bedrooms are a good example of this. It might not be worth creating 5,000 locations worth of bedrooms pages, but more than 100 locations could be warranted.

To solve this you could add an extra tier into what I discussed above, where you create a set of ‘mid-tier’ locations to complement the top-tier locations, and use this mid-tier list for some filters.

This mid-tier location list could be 500 top locations as an example.

That way, bedrooms could be 6 variations x 500 locations, rather than just the 100 top locations.
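
A rough sketch of how that tiering could be expressed as configuration (the lists and the filter-to-tier mapping are assumptions for illustration, based on the examples above):

const topLocations: string[] = [/* your ~100 key locations */];
const midLocations: string[] = [/* your ~500 locations, including the top 100 */];
const allLocations: string[] = [/* all 5,000+ locations */];
// Which location list each filter's pretty URLs get generated for.
const filterLocationList: Record<string, string[]> = {
  propertyType: allLocations, // top tier: pretty URLs at all levels
  bedrooms: midLocations,     // mid tier: pretty URLs for ~500 locations
  priceBetween: topLocations, // pretty URLs for the ~100 key locations only
  // everything else stays as a query parameter (low tier)
};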

 

Faceted search vs search filters

This is where people lose me a bit.

UX people/designers etc seem to say filters are basically a single selection to refine the data, and “Faceted search” is about multiple selections. So basically multi-select filters are “faceted search”?

Filters do the same though? They filter the results.

Seems weird, but oh well. They’re the same to me, just ‘multi-select’ filters…

So yeah, I’m only talking about faceted because that’s what others call it, and what you could be searching for.

This is still what you’re after, I swear.

It’s all the same for this.

 

Keep your filters in check

Google is dumb. You might think they’re smart, but in reality, they’re just a system that needs direction. The more you give them, the better they’ll be able to crawl, index, and pass value around your site.

Know what indexable and crawlable URLs you’re creating, keep them tamed, and you will be rewarded with that lovely SEO traffic in the long term.

If you haven’t already, make sure you check out my programmatic SEO checklist to help you tick off the build.