What is a crawler? Everything You Need to Know
What is a crawler?
A crawler, also known as a robot, bot, spider, search bot, or web crawler, is a program that independently searches the World Wide Web and reads and indexes content and information. The name derives from the search engine "WebCrawler", which launched in 1994 as the first public search engine with full-text search.
How does a crawler work?
Starting from a hyperlink on a website, the crawler scours the Internet and thus moves from website to website; the data it collects along the way is stored in a database. Algorithms determine how often a website is crawled: the better known the site, the more frequently it is visited.
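The core mechanism described above, reading a page and collecting its hyperlinks so they can be visited next, can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: it parses a hard-coded sample page instead of downloading one over HTTP, and the URLs in it are placeholders.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags: the hyperlinks a crawler would follow next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A sample page; a real crawler would fetch this over HTTP and then
# repeat the process for every link it discovers.
html = '<p><a href="/about">About</a> and <a href="https://example.com/news">News</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # the URLs queued for the next crawl step
```

A real crawler would add the extracted links to a queue, deduplicate already-visited URLs, and store the page content in a database, as described above.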
What information a crawler records depends on its task:
- Price comparison portals search for products, their availability, and prices
- In data mining, crawlers are used to collect addresses
- News aggregators crawl articles from news portals
- Plagiarism checkers search for copyrighted material on the net
Google uses various bots for these tasks, for example for AdSense, mobile sites, image search, and news.
How can crawlers be blocked or controlled?
You can prevent crawling using a robots.txt file. Example:

User-agent: Googlebot
Disallow: /

In this example, Googlebot is not allowed to visit any page of the site.

User-agent: Googlebot
Disallow: /reports

This example disallows Googlebot from crawling the /reports directory. Note that robots.txt controls crawling, not indexing: a blocked URL can still appear in the index if other pages link to it.
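How a well-behaved crawler interprets such rules can be checked with Python's standard library. The sketch below feeds the second example's rules directly into `urllib.robotparser` (its `parse()` method accepts the file's lines, so no network request is needed); the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Rules matching the /reports example above.
rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot",
    "Disallow: /reports",
])

# A path under /reports is blocked for Googlebot; other paths are allowed.
print(rp.can_fetch("Googlebot", "https://example.com/reports/q1.html"))
print(rp.can_fetch("Googlebot", "https://example.com/index.html"))
```

This is the same check a polite crawler performs before every request: fetch the site's robots.txt once, then ask `can_fetch()` for each URL in its queue.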
With the help of the robots meta tag values "nofollow" and "noindex" it is also possible to tell the crawler which pages it should not follow or index. You can use the canonical tag to point the crawler to the original version of a page, or describe the site structure using a sitemap.xml.
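As a sketch, both the robots meta tag and the canonical link belong in a page's `<head>`; the URL here is a placeholder:

```html
<head>
  <!-- Tell crawlers not to index this page and not to follow its links -->
  <meta name="robots" content="noindex, nofollow">
  <!-- Point crawlers to the original version of this page (example URL) -->
  <link rel="canonical" href="https://example.com/original-page">
</head>
```

Unlike robots.txt, the noindex directive only works if the crawler is allowed to fetch the page and see the tag.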
Crawlers and search engine optimization
In SEO, it pays to deliberately steer crawlers on your own website. Every website has a crawl budget, and it should be spent as effectively as possible through targeted control, locking crawlers out of unimportant areas so the important pages get visited. Pay attention to fast loading times, small file sizes, and a lean website architecture.