Google has continued its “How Search Works” mini-series this month, uploading the second episode, “How Google Search crawls pages”.
Episode 2, released on February 22nd, follows the first episode with Google Search engineer Gary Illyes offering valuable insights into how crawling works: the process by which Google travels the internet, discovers new URLs, and explores the vast amount of content online. In this video, Illyes explains Google’s crawling process, the role of Googlebot (the search engine’s web-crawling bot), and why sitemaps are so important to your website. With this behind-the-scenes access, the web series helps digital marketers, web designers, SEO specialists, and curious viewers gain a better understanding of Google’s web crawling process and its important role in shaping the landscape of online search.
Illyes explains the role of crawlers and breaks down the steps in the crawling process:
- Finding New or Updated Web Pages – The main goal of a crawler. Once pages are discovered and deemed eligible, they are downloaded and made searchable.
- URL Discovery – The first step of crawling. When a crawler encounters a new link, it explores the page and adds its URL to the search engine’s list of known pages to crawl.
- Following Links – Crawlers navigate the web by following hyperlinks from one webpage to another. When other websites link to a particular page, it increases the likelihood of search engines discovering that page.
- Fetching and Downloading – The next step is requesting the content of specific URLs so it can be analyzed and indexed. That content includes the HTML, CSS, JavaScript, images, and other resources associated with the page.
- Processing and Rendering – After collecting the content, the crawler processes and renders the page so that it appears the way a user would see it in a browser. This lets the crawler understand the context of the site, allowing it to be indexed properly and ensuring that search results accurately reflect the user experience. (A simplified sketch of these steps follows below.)
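To make these steps concrete, here is a minimal, illustrative crawl loop written in Python. It is only a sketch of the general technique, not Googlebot’s actual implementation: it keeps a frontier of discovered URLs, fetches each page, and follows the links it finds to discover new URLs. The seed URL is a hypothetical example.

```python
# A simplified crawl loop: discover URLs, fetch pages, follow links.
# This is an illustrative sketch only, not how Googlebot actually works.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])   # URL discovery: known but not yet visited
    seen = {seed_url}
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            # Fetching: download the page content for later processing.
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                # unreachable pages are simply skipped
        fetched += 1
        print("fetched:", url)
        extractor = LinkExtractor()
        extractor.feed(html)        # Following links found in the content.
        for href in extractor.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)


# Hypothetical seed URL:
# crawl("https://example.com/")
```

A production crawler like Googlebot layers scheduling, politeness rules, rendering, and deduplication on top of a loop like this, but the basic cycle of discover, fetch, and follow is the same.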
Googlebot, Google’s main crawler, plays a huge role by continuously crawling and analyzing web content to keep Google’s search index up-to-date. Illyes notes a few important things about Googlebot:
- Googlebot operates on a set of algorithms that determine which pages to crawl, how frequently to crawl them, and how to prioritize the indexing of new content.
- Googlebot crawls each website at a different rate. The crawl rate depends on a variety of factors, including how quickly the site responds to Googlebot’s requests, the overall quality of the content on the page, server errors, and other signals.
- Website owners can control how Googlebot interacts with their site through techniques such as robots.txt files and meta tags (illustrated below). Some pages won’t be indexed because of technical issues such as server errors, or because of poor-quality content. Likewise, placing content behind a paywall or login page prevents Googlebot from crawling it.
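As a hedged illustration of the robots.txt mechanism, the sketch below uses Python’s standard `urllib.robotparser` to show how a crawler decides whether it may fetch a URL. The rules and URLs are hypothetical examples, not Google’s own configuration; a similar per-page effect can be achieved with a robots meta tag such as `<meta name="robots" content="noindex">`.

```python
# A sketch of how a crawler honors robots.txt rules, using Python's
# standard urllib.robotparser. The rules and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks permission before fetching a URL.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```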
Sitemaps are files that provide information about the structure and organization of a website’s content to search engines. A typical sitemap is an XML file that lists URLs along with additional data such as the last-modified date and priority. Illyes not only recommends using sitemaps to ensure that all relevant pages are included in search engine indexes, but also suggests generating them automatically to avoid human error and make the work much easier.
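In the spirit of Illyes’s suggestion to generate sitemaps automatically, here is a minimal sketch using Python’s standard `xml.etree.ElementTree` to build a sitemap in the sitemaps.org XML format. The URLs and dates below are hypothetical; a real generator would pull them from a CMS or the file system.

```python
# A sketch of automatic sitemap generation with Python's standard library,
# using the sitemaps.org XML format. The page list below is hypothetical.
import xml.etree.ElementTree as ET


def build_sitemap(pages):
    """pages: iterable of (url, lastmod) pairs -> sitemap XML as a string."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod   # e.g. "2024-02-22"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True).decode("utf-8")


# In practice the URL list would come from a CMS or the file system.
print(build_sitemap([
    ("https://example.com/", "2024-02-22"),
    ("https://example.com/blog/how-search-works-recap", "2024-02-23"),
]))
```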
Conclusion
Google’s “How Search Works” series has continued to provide invaluable insights into the intricate process of crawling, unveiling the inner workings of Google’s web-crawling mechanism and shedding light on the roles of Googlebot and sitemaps. Stay tuned for our next blog recap when Google posts the next video in the series, explaining the very important process of Indexing.