Google: Gary Illyes and Lizzi Sassman on Web Crawlers
In the latest episode of Search Off the Record, Gary Illyes and Lizzi Sassman took a deep dive into crawling the web: what a web crawler is, and how it really works.
Here are the key takeaways from the conversation:
- A crawler, also known as a spider or web spider, is a software program that downloads information from websites. Search engines use crawlers to index the web.
- Crawlers schedule, fetch, and process information retrieved from the web. Scheduling decides which URLs to fetch and when, based on signals such as links from other sites and sitemaps; fetching downloads the content of the webpages; processing analyzes the fetched content, for example to extract new links to schedule (see the sketch after this list).
- Crawl budget describes the amount of resources a search engine is willing to spend crawling a particular website. It can be influenced by factors such as the site's size, how frequently it is updated, and how important it is to search users.
- There are myths about crawl budget; for example, some people believe you can pay Google to increase it. This is not the case.
- Crawlers can overload web servers if they are not careful about how fast and how often they fetch. A well-behaved crawler respects robots.txt, which site owners can use to tell crawlers which parts of a website should not be crawled (see the robots.txt examples after this list).
- Some webmasters mistakenly believe that they need to get all of their pages crawled by search engines. That is not necessarily the case; in fact, it can be beneficial to block search engines from crawling pages that are not useful in search results, such as login pages or pages with duplicate content (a sample robots.txt appears below).
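To make the schedule, fetch, and process stages concrete, here is a minimal single-threaded crawler sketch in Python. It is illustrative only: the seed URL, page limit, and politeness delay are assumptions, and a real crawler adds far more (robots.txt handling, retries, URL canonicalization, and so on).

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Processing step: collect href attributes from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=10, delay=1.0):
    frontier = deque([seed])  # scheduling: URLs waiting to be fetched
    seen = {seed}
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            # Fetching: download the page content.
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # fetch failed; skip this URL
        max_pages -= 1
        # Processing: parse the HTML and extract new links.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)  # schedule newly discovered URLs
        time.sleep(delay)  # politeness: don't overload the server
        yield url


if __name__ == "__main__":
    for page in crawl("https://example.com/"):  # hypothetical seed URL
        print("crawled:", page)
```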
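Before fetching, a polite crawler consults robots.txt. A small sketch using Python's standard urllib.robotparser follows; the user-agent string and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (hypothetical site).
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

for url in ("https://example.com/", "https://example.com/login"):
    if robots.can_fetch("ExampleBot", url):
        print("allowed:", url)
    else:
        print("blocked:", url)  # a polite crawler skips this URL
```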
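Finally, a hypothetical robots.txt that illustrates the last point, keeping crawlers away from a login page and printer-friendly duplicates; the paths are made up for the example.

```
# Hypothetical robots.txt: keep crawlers out of pages that are
# not useful in search results.
User-agent: *
Disallow: /login
Disallow: /print/
```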