Technical SEO: Log File Analysis and Crawl Budgets
Analyzing a website’s log files involves downloading them from the site’s server and examining the raw data for insight into specific activities. For SEO, this mainly means bot crawls, although the server log records both human and bot requests. This file is also referred to as an “access log.”
Whenever a website is visited (by a human or a bot), the server it’s hosted on records data about that visitor. The IP address, user agent, requested URL, HTTP status code (e.g. 404), device, timestamp, etc. are all collected and stored within the access log.
Nearly every interaction with your website is recorded in the access log, including requests for images and other media. As a result, the file can easily contain thousands of entries per day.
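To make this concrete, here is a minimal sketch of parsing one access log entry with Python’s standard library. It assumes the common “combined” log format; the sample line and its values are hypothetical.

```python
import re

# Regex for the combined log format: IP, timestamp, request line,
# status code, response size, referrer, and user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# A hypothetical Googlebot request pulled from an access log.
sample = ('66.249.66.1 - - [10/Mar/2024:13:55:36 +0000] '
          '"GET /blog/post-1 HTTP/1.1" 200 5120 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; '
          '+http://www.google.com/bot.html)"')

match = LOG_PATTERN.match(sample)
entry = match.groupdict()
print(entry["ip"], entry["path"], entry["status"])
# 66.249.66.1 /blog/post-1 200
```

Run over a full log file line by line, this yields structured records (IP, path, status, user agent) that the analyses below can be built on.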
Why Log File Analysis Is Important
Thousands of pieces of data per day? And we need to analyze that? Why? These are common questions, and completely understandable if you’re not familiar with technical SEO.
Access logs are important because they provide a clear snapshot of how search engine bots are crawling your website. This matters for SEO because it can give you a deep understanding of:
- What assets are being crawled the most
- Which pages aren’t being crawled (and why)
- Exactly which URL responses are being returned (e.g. 302, 404, etc.)
- Internal linking errors that can cause crawl issues
- Pages that bots place the most emphasis on
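The first few points above amount to simple aggregation over parsed log entries. The sketch below (names and sample data are illustrative, not from any real log) tallies which paths a given bot requests most and which status codes it receives.

```python
from collections import Counter

def summarize_bot_crawls(entries, bot_token="Googlebot"):
    """Count paths and status codes for entries whose user agent
    contains bot_token. Each entry is a dict with 'path', 'status',
    and 'user_agent' keys (e.g. produced by a log parser)."""
    paths, statuses = Counter(), Counter()
    for e in entries:
        if bot_token in e["user_agent"]:
            paths[e["path"]] += 1
            statuses[e["status"]] += 1
    return paths, statuses

# Hypothetical parsed log entries.
entries = [
    {"path": "/", "status": "200", "user_agent": "Googlebot/2.1"},
    {"path": "/", "status": "200", "user_agent": "Googlebot/2.1"},
    {"path": "/old-page", "status": "404", "user_agent": "Googlebot/2.1"},
    {"path": "/", "status": "200", "user_agent": "Mozilla/5.0 (Windows NT 10.0)"},
]
paths, statuses = summarize_bot_crawls(entries)
print(paths.most_common(1))  # [('/', 2)]
print(statuses["404"])       # 1
```

Sorting the path counter also surfaces the inverse question: pages you expected to see that never appear in the log were not crawled at all.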
Crawl Budget Optimization via Log File Analysis
Your website’s crawl budget is, in essence, how much time and effort Google’s bots will invest in crawling your site. With a low crawl budget, the bots won’t crawl for very long; with a large one, they’ll spend more time on your site, which gives more of your pages a chance to rank in Google and other search engines.
Crawl budget is closely tied to your site’s authority (a primary factor in SEO), which is why analyzing your log files is important. Reviewing where crawl budget is being wasted (e.g. on pages or assets that don’t need to be crawled) can lead to a more streamlined site structure, which bots love and will want to crawl.
For example, you might analyze your access log and discover that bots are crawling outdated pages (or pages that shouldn’t be crawled at all) at the expense of your brand-new pages and content. If bots don’t crawl a page, it has little chance of ranking in Google. Below are some things that can seriously bite into your site’s crawl budget:
- Dynamic URLs
- Unique user IDs (i.e. session IDs)
- Duplicate content-related issues (very prevalent among eCommerce sites)
- Soft 404s (which are terrible for SEO to begin with)
- Content that’s very low quality
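Several of these wasters (dynamic URLs, session IDs) are detectable directly from the crawled URLs in your log. Here is a rough sketch; the parameter names are illustrative assumptions, not a definitive list, so adapt them to your own URL structure.

```python
from urllib.parse import urlparse, parse_qs

# Query parameters that typically signal a session ID rather than
# distinct content. Adjust for your site (assumed names).
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}

def flag_wasteful(url):
    """Return a label if the URL likely wastes crawl budget,
    otherwise None."""
    parsed = urlparse(url)
    params = {k.lower() for k in parse_qs(parsed.query)}
    if params & SESSION_PARAMS:
        return "session id"
    if parsed.query:
        return "dynamic URL"
    return None

print(flag_wasteful("/products?sessionid=abc123"))  # session id
print(flag_wasteful("/category?sort=price"))        # dynamic URL
print(flag_wasteful("/blog/new-post"))              # None
```

Running this over the bot-crawled paths from your log quickly shows what share of the crawl budget is going to parameterized URLs instead of real content.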
The easiest way to free up server-side resources (and make your site more crawlable) is to restrict crawling of certain URL patterns via your site’s robots.txt file. Websites that struggle to rank their content (or even get it indexed) often overlook this simple step, which can sometimes have an immediate effect on the number of keywords and pages ranking in search.
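As an illustration, a robots.txt file might block the wasteful URL patterns identified above. The patterns here are hypothetical; blocking should always be based on what your own log analysis reveals.

```
User-agent: *
# Illustrative examples only -- adjust to your own URL structure.
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /search
```

Note that disallowing a pattern stops bots from crawling those URLs but does not by itself remove already-indexed pages from search results.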