17 Oct 2022
What is crawl control?
Search engines discover new content to index by using bots to crawl websites for linked URLs.
Crawl control (or crawl management) is the process of restricting certain URLs on your site from being crawled by these bots. Typical candidates are duplicate pages, such as print-only versions or staging sites, and low-value pages, such as internal search results.
By editing your robots.txt file, you can restrict bot access to individual URLs or entire directories/sub-directories. You can also, however, accidentally restrict your entire site – so be very careful!
Preventing URLs from being crawled will typically keep them from being indexed in search engines, although a URL that is blocked in robots.txt can still be indexed if linked to from other sites.
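To make the difference between a targeted block and an accidental site-wide block concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The domain www.example.com and the /search/ and /print/ directories are hypothetical, chosen only for illustration. Note that Python's parser is not Googlebot: it does not support wildcard rules and resolves conflicting rules differently, so treat it as a rough check rather than a definitive one.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (no network request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules: block two low-value directories only.
TARGETED = """\
User-agent: *
Disallow: /search/
Disallow: /print/
"""

# The classic mistake: a bare "/" blocks every URL on the site.
ACCIDENTAL = """\
User-agent: *
Disallow: /
"""

targeted = parse_rules(TARGETED)
accidental = parse_rules(ACCIDENTAL)

search_url = "https://www.example.com/search/?q=shoes"    # hypothetical URL
product_url = "https://www.example.com/products/shoes"    # hypothetical URL

print(targeted.can_fetch("Googlebot", search_url))     # False: inside /search/
print(targeted.can_fetch("Googlebot", product_url))    # True: no rule matches
print(accidental.can_fetch("Googlebot", product_url))  # False: whole site blocked
```

A single stray character is the only difference between blocking one directory and blocking everything, which is why changes to this file deserve a careful review.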
Why is it important for driving organic traffic?
Some sites have thousands or millions of URLs (e.g. ecommerce sites, forums, and those with faceted navigation), but crawlers such as Googlebot have only a finite capacity to crawl them.
Each site has a crawl budget, determined by crawl capacity limit (can your server handle it?) and crawl demand (site size, update frequency, page quality).
Wasting crawl budget on low-value URLs (e.g. those with no search demand, duplicate or low-quality content, soft 404s, or faceted navigation) could mean that higher-value URLs aren’t crawled. URLs that aren’t crawled can’t be indexed, and so can’t drive organic traffic.
Google’s recent helpful content update is focused on penalising poor content, so blocking low quality content can protect your website’s performance.
What to do next?
URLs have SEO value if they are driving organic traffic and/or they are vital to the discovery of other URLs and to the flow of link equity.
To control the crawling process, examine your Page Indexing reports in Google Search Console (there are several) and ask yourself:
- Are those reported as “Blocked by robots.txt” correctly blocked?
If not, they can be made accessible by amending the robots.txt file in the site root.
- Do other reports such as “Crawled - currently not indexed” contain valuable URLs?
If not, they can be blocked in robots.txt.
Identifying patterns (e.g. a shared sub-folder or another defining feature such as a query parameter) will enable URLs to be blocked more efficiently, with one rule covering many URLs.
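One rough way to spot such patterns is to group an exported list of URLs by their top-level sub-folder and see which folders dominate. The sketch below assumes you have already exported the URLs from a report into a Python list; the URLs shown and the function name count_by_top_folder are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_top_folder(urls):
    """Count URLs per top-level sub-folder, e.g. '/search/'.

    Pages that sit directly in the site root are grouped under '/'.
    """
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        segments = [s for s in path.split("/") if s]
        # Treat the first segment as a folder only if something follows it
        # or the path ends with a slash.
        if segments and (len(segments) > 1 or path.endswith("/")):
            counts[f"/{segments[0]}/"] += 1
        else:
            counts["/"] += 1
    return counts


# Hypothetical export from a Search Console report.
exported_urls = [
    "https://www.example.com/search/?q=red+shoes",
    "https://www.example.com/search/?q=blue+shoes",
    "https://www.example.com/print/product-123",
    "https://www.example.com/products/product-123",
    "https://www.example.com/about",
]

folder_counts = count_by_top_folder(exported_urls)
for folder, count in folder_counts.most_common():
    print(folder, count)
```

A folder that accounts for a large share of low-value URLs is a candidate for a single Disallow rule, rather than listing each URL individually.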
Always test any changes to your robots.txt file in Google’s tool to ensure they behave as expected before rolling them out live.
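As a complement to Google’s tool (not a replacement for it), a proposed file can also be checked locally against a list of URLs that must stay crawlable and a list that must be blocked. This is a minimal sketch using Python's standard urllib.robotparser; the function name check_robots, the example rules and the URLs are all hypothetical, and Python's parser does not handle wildcards or rule precedence the way Googlebot does, so it is only reliable for simple path-prefix rules.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, must_allow, must_block, user_agent="Googlebot"):
    """Return a list of problems found in a proposed robots.txt.

    An empty list means every URL behaved as expected.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(user_agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(user_agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems


# Hypothetical draft: "/p" was meant to block /print/, but it is a prefix
# match, so it also blocks /products/.
TOO_BROAD = "User-agent: *\nDisallow: /p\n"
CORRECTED = "User-agent: *\nDisallow: /print/\n"

must_allow = ["https://www.example.com/products/shoes"]
must_block = ["https://www.example.com/print/shoes"]

broad_problems = check_robots(TOO_BROAD, must_allow, must_block)
fixed_problems = check_robots(CORRECTED, must_allow, must_block)

print(broad_problems)  # one problem: the product page is blocked
print(fixed_problems)  # []
```

Keeping the two URL lists alongside the robots.txt file means the same check can be rerun every time the file changes.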
If you need support to ensure your website is being crawled (and indexed) correctly, get in contact for help from our SEO team.