What is a crawler? Everything You Need to Know

Team TeachWiki

What is a crawler?

A crawler, also known as a robot, bot, spider, search bot or web crawler, is a program that independently searches the World Wide Web and reads and indexes content and information. The name derives from "WebCrawler", which launched in 1994 as the first public search engine with full-text index search.

How does a crawler work?

Starting from a hyperlink on a website, the crawler scours the Internet and thus moves from website to website; the data it collects is stored in a database. Algorithms determine how often a website is crawled: the better known the site, the more frequently it is visited.
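This follow-links-and-store loop can be sketched in a few lines of Python. This is only an illustrative sketch: the page limit is arbitrary, and a real crawler would also respect robots.txt, rate limits and duplicate content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=10):
    """Breadth-first crawl: fetch a page, store it, queue its links."""
    queue = deque([start_url])
    seen = set()
    pages = {}  # stands in for the database mentioned above
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page; move on
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            queue.append(urljoin(url, link))  # resolve relative links
    return pages
```

The breadth-first queue is what makes the crawler "get from website to website": every link found on one page becomes a candidate for the next fetch.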

Which information a crawler records depends on its task. Google, for example, uses different bots for different purposes, such as AdSense, mobile sites, image search and news.

How can crawlers be blocked or controlled?

You can prevent crawling using robots.txt. Example:

  • User-agent: Googlebot
  • Disallow: /

    In this example, Googlebot is not allowed to visit any page on the site

  • User-agent: Googlebot
  • Disallow: /reports

    This example blocks Googlebot from crawling the /reports directory
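Such rules can also be checked programmatically; Python's standard library ships a robots.txt parser. A small sketch using the second example above (the domain example.com is a placeholder):

```python
from urllib import robotparser

# robots.txt rules matching the second example above
rules = """\
User-agent: Googlebot
Disallow: /reports
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/reports/2023"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))    # True
print(parser.can_fetch("Bingbot", "https://example.com/reports/2023"))    # True: the rule only targets Googlebot
```

Note that the Disallow rule only applies to the user agent it is declared for; other crawlers fall back to the default, which allows everything.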

With the help of the meta tags "nofollow" or "noindex" it is also possible to tell the crawler which pages it should not follow or index. You can use the canonical tag to point the crawler to the original page, or show the site structure using a sitemap.xml.
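These directives live in the page's HTML head rather than in robots.txt, so a crawler reads them while parsing the page. A minimal sketch of how they might be extracted (the page content and URL are made up for illustration):

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <meta name="robots"> and <link rel="canonical"> from a page."""

    def __init__(self):
        super().__init__()
        self.robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.robots = attrs.get("content")
        elif tag == "link" and attrs.get("rel", "").lower() == "canonical":
            self.canonical = attrs.get("href")


# Hypothetical page head combining both directives
page = """<head>
<meta name="robots" content="noindex, nofollow">
<link rel="canonical" href="https://example.com/original-page">
</head>"""

parser = RobotsMetaParser()
parser.feed(page)
print(parser.robots)     # noindex, nofollow
print(parser.canonical)  # https://example.com/original-page
```

A crawler that honors these tags would drop the page from its index on "noindex" and skip the page's outgoing links on "nofollow".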

Crawlers and search engine optimization

In SEO, it is worthwhile to deliberately steer crawlers on your own website. Every website has a crawl budget, which should be spent as effectively as possible through targeted steering or blocking. Pay attention to fast loading times, small file sizes and a lean website architecture.
