Over time, I have noticed nofollow internal links being crawled more and more.
It’s no longer as simple as it used to be: slapping a nofollow tag on a URL won’t reliably keep it from being crawled and indexed.
Google has even made a few comments, and updated its guidelines on this in recent times, with the biggest comment being this:
When nofollow was introduced, Google would not count any link marked this way as a signal to use within our search algorithms. This has now changed. All the link attributes—sponsored, ugc, and nofollow—are treated as hints about which links to consider or exclude within Search.
Nofollow is now a hint.
Yep, just a little hint. It no longer stops Google like it used to.
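For reference, the three link attributes Google now treats as hints look like this in markup (the URLs here are just illustrative):

```html
<!-- All three attributes are now hints, not directives -->
<a href="/partner-deal" rel="sponsored">Partner deal</a>      <!-- paid or sponsored links -->
<a href="/forum-post" rel="ugc">User comment</a>              <!-- user-generated content -->
<a href="/shoes?colour=red" rel="nofollow">Red shoes</a>      <!-- links you'd rather Google ignored -->
```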
So yeah, it might still be worth slapping the tag on a link, but if it’s not 100% enforced and you’re linking to a million variations of a page… you’re in for some over-indexation fun.
Why do you want to block Google anyway?
Filter pages can create millions of crawlable URLs out of sites that should only have tens, or hundreds of thousands of pages.
The canonical tag just doesn’t work like it used to, and even with it in place you’re still blasting through your crawl budget on these wasted pages.
The alternative to using nofollow is doing what most SEOs would normally never want to do.
Implement client-side JS to drive these links.
You want to actually use onclick JS for these links.
Words you will rarely hear from an SEO, because it goes against everything that normally helps SEO, unless you want to unSEO.
Specifically, onclick JS links that don’t have the URL available anywhere in the rendered HTML source code. You gotta keep the click handler in a separate JS file.
Small but significant detail.
These onclick JS links never expose the URL in the HTML code, so the link stays hidden from Google because of the way Googlebot crawls.
It grabs the links, and then crawls. It doesn’t “click” links.
Every link that currently exposes a URL in the HTML source that you don’t want crawled should be changed to an onclick link whose URL is only resolved client-side once it’s clicked.
These onclick links are essentially invisible to Google, but to users they behave and appear completely normal.
If, however, you use an onclick link where the URL is still present in the HTML source code, Google will be able to crawl it.
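Here’s a minimal sketch of the pattern. Every name in it (`FACET_ROUTES`, `resolveFacetUrl`, the `js-facet` class, the `data-facet` attribute, the URLs) is hypothetical, but it shows the key point: the HTML carries no URL, and an external script resolves the destination only when the element is clicked.

```javascript
// facet-links.js — served as a separate file, so the URLs never
// appear anywhere in the rendered HTML source.

// Map each facet key to its real URL. Googlebot parses HTML for
// href attributes; it doesn't "click", so these are never discovered.
const FACET_ROUTES = {
  'colour-red': '/shoes?colour=red',
  'size-9': '/shoes?size=9',
};

// Pure lookup, kept separate from the DOM wiring so it's testable.
function resolveFacetUrl(routes, key) {
  return routes[key] || null;
}

// Browser-only wiring: elements like
//   <span class="js-facet" data-facet="colour-red">Red</span>
// navigate client-side on click, with no href in the markup.
if (typeof document !== 'undefined') {
  document.querySelectorAll('.js-facet').forEach((el) => {
    el.addEventListener('click', () => {
      const url = resolveFacetUrl(FACET_ROUTES, el.dataset.facet);
      if (url) window.location.assign(url);
    });
  });
}
```

By contrast, attaching an onclick handler to a normal `<a href="/shoes?colour=red">` changes nothing: the URL is still sitting in the HTML source, and Google will crawl it regardless of the JS.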
Why a robots.txt block or a meta robots noindex/nofollow tag might not be the solution
Blocking the pages in the robots.txt file, or using a noindex/nofollow tag might be the ultimate solution for some.
However, there are a couple of things still in play here that need to be considered.
Actively linking to a page that is blocked in the robots.txt is like telling Google something cool is behind this wall, but they’re not allowed to look.
Actively linking to a page with a meta robots noindex/nofollow tag is telling Google about your cool URL, letting them access it, and then telling them it’s not that important, even though you’re linking to it. You’re wasting Google’s time, crawl budget, and your server resources by allowing Google to crawl it.
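For clarity, this is the meta robots version being described, and the catch is visible in where it lives (the URL is illustrative): it sits inside the page itself, so Google has to fetch the page before it can even read the directive.

```html
<!-- Inside the <head> of /shoes?colour=red — crawl budget and
     server resources are spent before Google ever sees this -->
<meta name="robots" content="noindex, nofollow">
```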
In both cases, you’re diluting a page’s value by linking through to these pages you don’t actually want to be ranked or crawled.
John Mueller has also made a comment regarding robots.txt blocks of parametered pages, so it’s worth keeping this in mind:
Don't use robots.txt to block indexing of URLs with parameters. If you do that, we can't canonicalize the URLs, and you lose all of the value from links to those pages. Use rel-canonical, link cleanly internally, etc.
— 🐝 johnmu.csv (personal) weighs more than 16MB 🐝 (@JohnMu) November 8, 2019
Yeah, there is no reason to remove it if you actually can’t implement an alternate method.
But if you’re implementing an alternate linking method to ensure Google doesn’t crawl it, the nofollow tag is effectively nulled anyway, because Google won’t see the link at all.
The internal nofollow is dead
The internal nofollow is dead.
Now is the time to switch to JS links to prevent the crawling of pages you don’t want to be crawled.