
WebPerf Top 10 Web Crawlers and User-Agents

Discussion in 'All Internet & Web Performance News' started by eva2000, Apr 28, 2016.

    When it comes to the world wide web, there are both bad bots and good bots. Bad bots are ones you definitely want to avoid, as they consume your CDN bandwidth, take up server resources, and steal your content. Good bots (also known as web crawlers), on the other hand, should be handled with care as they are a vital part of getting your content indexed by search engines such as Google, Bing, and Yahoo. Read on for a look at the top 10 web crawlers and user-agents to ensure you are handling them correctly.


    Web Crawlers



    Web crawlers, also known as web spiders or internet bots, are programs that browse the web in an automated manner for the purpose of indexing content. Crawlers can look at all sorts of data such as content, links on a page, broken links, sitemaps, and HTML code validation.

    Search engines like Google, Bing, and Yahoo use crawlers to properly index downloaded pages so that users can find them faster and more efficiently when searching. Without web crawlers, there would be nothing to tell search engines that your website has new and fresh content. Sitemaps can also play a part in that process. So web crawlers are, for the most part, a good thing. However, there can also be issues with scheduling and load, as a crawler might be constantly polling your site. And this is where a robots.txt file comes into play. This file can help control crawl traffic and ensure that it doesn’t overwhelm your server.


    Web crawlers identify themselves to a web server using the User-agent field in an HTTP request, and each crawler has its own unique identifier. Most of the time you will need to examine your web server access logs to view web crawler traffic.
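    For example, here is a minimal Python sketch that tallies how often each of the crawlers covered in this post shows up in your logs. It assumes the common combined log format and a hypothetical log path, so adjust both for your own server:

    import re
    from collections import Counter

    # Crawler names covered in this post; extend the list as needed.
    CRAWLERS = ["Googlebot", "bingbot", "Slurp", "DuckDuckBot", "Baiduspider",
                "YandexBot", "Sogou", "Exabot", "facebookexternalhit", "ia_archiver"]

    counts = Counter()
    # In the combined log format the user-agent is the last quoted field.
    ua_pattern = re.compile(r'"([^"]*)"$')

    with open("/var/log/nginx/access.log") as log:  # hypothetical path
        for line in log:
            match = ua_pattern.search(line.strip())
            if match:
                ua = match.group(1).lower()
                for crawler in CRAWLERS:
                    if crawler.lower() in ua:
                        counts[crawler] += 1

    for crawler, hits in counts.most_common():
        print(f"{crawler}: {hits}")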

    Robots.txt



    By placing a robots.txt file at the root of your web server, you can define rules, such as allow and disallow, that web crawlers must follow. You can apply generic rules to all bots or get more granular and target a specific User-agent string.

    Example 1

    This example instructs all search engine robots not to index any of the website’s content. This is defined by disallowing the root “/” of your website.

    User-agent: *
    Disallow: /

    Example 2

    This example achieves the opposite of the previous one. In this case, the instructions still apply to all user agents; however, nothing is defined within the Disallow instruction, meaning that everything can be indexed.

    User-agent: *
    Disallow:
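    If you want to sanity check rules like these before deploying them, Python's standard library includes a robots.txt parser. Here is a minimal sketch using urllib.robotparser against the rules from Example 2 (the test URL is only illustrative):

    from urllib import robotparser

    # The rules from Example 2: nothing is disallowed.
    rules = ["User-agent: *", "Disallow:"]

    rp = robotparser.RobotFileParser()
    rp.parse(rules)

    # An empty Disallow means the whole site may be crawled.
    print(rp.can_fetch("Googlebot", "https://example.com/any-page.html"))  # True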

    To see more examples, make sure to check out our in-depth post on how to use a robots.txt file. KeyCDN also has an easy way to modify the robots.txt file on your CDN account right from within the dashboard. See an example of a CDN’s robots.txt settings on a WordPress site below.


    Top 10 Web Crawlers and Bots


    There are hundreds of web crawlers and bots scouring the internet, but below is a list of 10 popular web crawlers and bots that we have collected based on the ones we see on a regular basis within our web server logs.

    1. GoogleBot



    Googlebot is obviously one of the most popular web crawlers on the internet today, as it is used to index content for Google’s search engine. Patrick Sexton wrote a great article about what Googlebot is and how it pertains to your website indexing. One great thing about Google’s web crawler is that they give us a lot of tools and control over the process, so you can ensure your site is crawled and indexed the way you intend.

    User-Agent


    User-agent: Googlebot
    Full User-Agent String


    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Googlebot Example in Robots.txt


    This example displays a little more granularity pertaining to the instructions defined. Here, the instructions are only relevant to Googlebot. More specifically, it is telling Google not to index a specific page: your-page.html.

    User-agent: Googlebot
    Disallow: /no-index/your-page.html

    Besides its main web search crawler, Google actually has 9 additional web crawlers:

    Googlebot News: Googlebot-News
    Googlebot Images: Googlebot-Image/1.0
    Googlebot Video: Googlebot-Video/1.0
    Google Mobile (feature phone): SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
    Google Smartphone: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Google Mobile Adsense: (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)
    Google Adsense: Mediapartners-Google
    Google AdsBot (PPC landing page quality): AdsBot-Google (+http://www.google.com/adsbot.html)
    Google app crawler (fetch resources for mobile apps): AdsBot-Google-Mobile-Apps

    You can use the Fetch tool in Google Search Console to test how Google crawls or renders a URL on your site. See whether Googlebot can access a page on your site, how it renders the page, and whether any page resources (such as images or scripts) are blocked to Googlebot.


    See Googlebot robots.txt documentation.
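    Keep in mind that anyone can put “Googlebot” in a user-agent string, which is why Google recommends verifying the bot with a reverse DNS lookup followed by a forward lookup. Below is a minimal Python sketch of that check; the sample IP is only illustrative:

    import socket

    def is_real_googlebot(ip: str) -> bool:
        # Reverse lookup: genuine Googlebot hosts resolve to
        # googlebot.com or google.com domains.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        try:
            return socket.gethostbyname(host) == ip
        except socket.gaierror:
            return False

    print(is_real_googlebot("66.249.66.1"))  # an address in Googlebot's range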

    Google+


    Another one you might see popup is Google+. When a user shares a URL on Google+ or an app writes an app activity, Google+ attempts to fetch the content and create a snippet to provide a summary of the linked content. This service is different than the Googlebot that crawls and indexes your site. These requests do not honor robots.txt or other crawl mechanisms because this is a user-initiated request.

    User-Agent


    Google (+https://developers.google.com/+/web/snippet/)

    See more Google+ web snippet documentation.

    2. Bingbot



    Bingbot is a web crawler deployed by Microsoft in 2010 to supply information to their Bing search engine. It replaced what used to be the MSN bot.

    User-Agent


    Bingbot
    Full User-Agent String


    Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

    Bing also has a tool very similar to Google’s, called Fetch as Bingbot, within Bing Webmaster Tools. Fetch as Bingbot allows you to request that a page be crawled and shown to you as Bing’s crawler would see it. You will see the page code as Bingbot would see it, helping you understand whether it is seeing your page as you intended.


    See Bingbot robots.txt documentation.

    3. Slurp



    Yahoo Search results come from the Yahoo web crawler Slurp and Bing’s web crawler, as a lot of Yahoo is now powered by Bing. Sites should allow Yahoo Slurp access in order to appear in Yahoo Mobile Search results. Additionally, Slurp does the following:

    • Collects content from partner sites for inclusion within sites like Yahoo News, Yahoo Finance, and Yahoo Sports.
    • Accesses pages from sites across the web to confirm accuracy and improve Yahoo’s personalized content for its users.
    User-Agent


    Slurp
    Full User-Agent String


    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

    See Slurp robots.txt documentation.

    4. DuckDuckBot



    DuckDuckBot is the web crawler for DuckDuckGo, a search engine that has become quite popular lately, as it is known for privacy and not tracking you. It now handles over 12 million queries per day. DuckDuckGo gets its results from over four hundred sources. These include hundreds of vertical sources delivering niche Instant Answers, DuckDuckBot (their crawler), and crowd-sourced sites such as Wikipedia. They also have more traditional links in the search results, which they source from Yahoo!, Yandex, and Bing.

    User-Agent


    DuckDuckBot
    Full User-Agent String


    DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

    It respects WWW::RobotRules and originates from these IP addresses:


    72.94.249.34
    72.94.249.35
    72.94.249.36
    72.94.249.37
    72.94.249.38
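    Since the addresses are published, a quick allowlist check can confirm that a request claiming to be DuckDuckBot is genuine. A minimal Python sketch based on the IP list above:

    # DuckDuckBot's published source addresses (from the list above).
    DUCKDUCKBOT_IPS = {
        "72.94.249.34", "72.94.249.35", "72.94.249.36",
        "72.94.249.37", "72.94.249.38",
    }

    def is_real_duckduckbot(ip: str, user_agent: str) -> bool:
        return "DuckDuckBot" in user_agent and ip in DUCKDUCKBOT_IPS

    print(is_real_duckduckbot("72.94.249.34", "DuckDuckBot/1.0"))  # True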

    5. Baiduspider



    Baiduspider is the official name of the Chinese Baidu search engine’s web crawling spider. It crawls web pages and returns updates to the Baidu index. Baidu is the leading Chinese search engine that takes an 80% share of the overall search engine market of China Mainland.

    User-Agent


    Baiduspider
    Full User-Agent String


    Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

    Besides its main web search crawler, Baidu actually has 6 additional web crawlers:

    Image Search: Baiduspider-image
    Video Search: Baiduspider-video
    News Search: Baiduspider-news
    Baidu wishlists: Baiduspider-favo
    Baidu Union: Baiduspider-cpro
    Business Search: Baiduspider-ads
    Other search pages: Baiduspider

    See Baidu robots.txt documentation.

    6. Yandex Bot



    YandexBot is the web crawler of Yandex, one of the largest Russian search engines. According to LiveInternet, for the three months ended December 31, 2015, it generated 57.3% of all search traffic in Russia.

    User-Agent


    YandexBot
    Full User-Agent String


    Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

    There are many different User-Agent strings that the YandexBot can show up as in your server logs. See the full list of Yandex robots and Yandex robots.txt documentation.

    7. Sogou Spider



    Sogou Spider is the web crawler for Sogou.com, a leading Chinese search engine that was launched in 2004. As of April 2016, it has a rank of 103 in Alexa’s internet rankings. Note: The Sogou web spider does not respect the robots.txt internet standard and is therefore banned from many websites because of excessive crawling.

    User-Agents


    Sogou Pic Spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
    Sogou head spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
    Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
    Sogou Orion spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
    Sogou-Test-Spider/4.0 (compatible; MSIE 5.5; Windows 98)
    8. Exabot



    Exabot is the web crawler for Exalead, a search engine based out of France. Exalead was founded in 2000 and currently has more than 16 billion pages indexed.

    User-Agents


    Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Exabot-Thumbnails)
    Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)

    See Exabot robots.txt documentation.

    9. Facebook External Hit



    Facebook allows its users to send links to interesting web content to other Facebook users. Part of how this works on the Facebook system involves the temporary display of certain images or details related to the web content, such as the title of the webpage or the embed tag of a video. The Facebook system retrieves this information only after a user provides a link.

    One of their main crawling bots is Facebot, which is designed to help improve advertising performance.

    User-Agents


    facebot
    facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
    facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

    See Facebot robots.txt documentation.

    10. Alexa Crawler



    ia_archiver is the web crawler for Amazon’s Alexa internet rankings. As you probably know, Alexa collects information to show rankings for both local and international sites.

    User-Agent


    ia_archiver
    Full User-Agent String


    ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)

    See Ia_archiver robots.txt documentation.

    Bad Bots



    As mentioned above, most of these are actually good web crawlers. You generally don’t want to block Google or Bing from indexing your site unless you have a good reason. But what about the thousands of bad bots? KeyCDN released a new feature back in February 2016, which you can enable in your dashboard, called “Block Bad Bots.” KeyCDN uses a comprehensive list of known bad bots and blocks them based on their User-Agent string.

    This is enabled by default on new zones. You can enable it on your existing zones by following the steps below.

    1. Log in to the KeyCDN dashboard and click into zones.
    2. Click “Edit” on the zone you want to enable this new feature on.
    3. Select “Show Advanced Features.”
    4. Scroll down to “Block Bad Bots” and select “enabled.” Then make sure to save your changes.

    Read more about how to block bad bots on your origin server.
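    The same idea works at the origin: compare each request’s user-agent against a blocklist of known bad bot strings and reject matches with a 403. A minimal Python sketch; the blocklist entries are illustrative examples, not KeyCDN’s actual list:

    # Illustrative blocklist entries; not KeyCDN's actual list.
    BAD_BOT_SUBSTRINGS = ["mj12bot", "ahrefsbot", "semrushbot"]

    def is_bad_bot(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(bot in ua for bot in BAD_BOT_SUBSTRINGS)

    # In your web framework's request hook, return HTTP 403 when this is True.
    print(is_bad_bot("Mozilla/5.0 (compatible; MJ12bot/v1.4.5)"))  # True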

    Bot Resources


    Perhaps you are seeing some user-agent strings in your logs that have you concerned. Here are a couple of good resources where you can look up popular bad bots, crawlers, and scrapers.


    Caio Almeida also has a pretty good list on his crawler-user-agents GitHub project.

    Summary


    There are hundreds of different web crawlers out there, but hopefully you are now familiar with a couple of the more popular ones. Again, you want to be careful when blocking any of these, as doing so could cause indexing issues. It is always good to check your web server logs to see how often they are actually crawling your site.

    Did we miss any important ones? If so please let us know below and we will add them.

    The post Top 10 Web Crawlers and User-Agents appeared first on KeyCDN Blog.
