Security Blocking bad or aggressive bots

Discussion in 'System Administration' started by eva2000, Feb 28, 2016.

  1. eva2000

    eva2000 Administrator Staff Member

    53,153
    12,110
    113
    May 24, 2014
    Brisbane, Australia
    Ratings:
    +18,645
    Local Time:
    2:09 PM
    Nginx 1.27.x
    MariaDB 10.x/11.4+
    Centmin Mod nginx vhosts have a commented-out include line for /usr/local/nginx/conf/block.conf, which lists bad or aggressive bots that can be blocked. I plan to rework the /usr/local/nginx/conf/block.conf include file and add a few more bots. You can also use ngxtop to generate custom bot report stats against your Nginx access.log logs :)

    Update: April 3rd, 2018 - Folks might be interested in an alternate version developed by Mitchell Krog at Security - Nginx Ultimate Bad Block Blocker

    For me, I am going to add the following:
    • Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/) <-- nasty crawler for backlinks. One of my sites was hit with 243,000 bot requests per day and up to 13,000 per hour!
    • Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) <-- another common one, though Centmin Mod has a lot of Asian users, so it's probably not a good idea to block it totally.
    So list any other bots you would like an option to block by user agent.
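
    To pull per-hour figures like the ones quoted above out of your own logs, something along these lines may help. It is only a rough sketch: it assumes the combined log format (timestamp in field 4), the function name requests_per_hour is made up for illustration, and the match is a plain case-insensitive substring search on the whole log line rather than on the user agent field alone.

```shell
# Hypothetical helper: count requests per hour for log lines containing a
# given string (for example a bot name), busiest hour first.
# Assumes the combined log format, where field 4 looks like [29/Feb/2016:02:09:27
requests_per_hour() {
  agent=$1
  logfile=$2
  grep -iF -- "$agent" "$logfile" |
    awk '{ split($4, t, ":"); print substr(t[1], 2) ":" t[2] }' |
    sort | uniq -c | sort -rn
}
```

    Usage would be something like requests_per_hour AhrefsBot /path/to/access.log, where each output line is a count followed by day:hour.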

    First version of the new work-in-progress bot rate limiting & blocking

    In nginx.conf, within the http{} context, add the include file
    Code (Text):
    include /usr/local/nginx/conf/botlimit.conf;

    Within the include file /usr/local/nginx/conf/botlimit.conf add
    Code (Text):
    # map user agents to $bot_agent
    # 0 = not rate limited
    # 1 = not rate limited, or rate limited less restrictively
    # 2 = rate limited more
    # 3 = block completely
    # http://www.botopedia.org/
    # http://www.botreports.com/badbots/
    map $http_user_agent $bot_agent {
      default                    0;
      # general protections
      "~Mozilla/4.0"             2;
      "~MSIE\ 6.0"               3;
      "~MSIE\ 7.0"               2;
      "~*archive.org"            3;
      "~*Brandprotect"           3;
      "~*Brandwatch"             3;
      "~*MarkMonitor"            3;
      "~*Name\ Intelligence"     3;
      "~*Nameprotect"            3;
      "~*Picscout"               3;
      "~*Picsearch"              3;
      "~*Pixray"                 3;
      # bots whitelisted
      "~*Googlebot"              1;
      "~*bingbot"                1;
      "~*yahoo"                  1;
      # other bots rate limited or blocked
      "~*80legs"                 3;
      "~*Acunetix"               3;
      "~*AhrefsBot"              2;
      "~*BackDoorBot"            3;
      "~*Baiduspider"            2;
      "~*boardreader"            3;
      "~*calculon\ spider"       3;
      "~*CCBot"                  3;
      "~*Claritybot"             3;
      "~*Cliqzbot"               3;
      "~*dirbuster"              3;
      "~*Download\ Demon"        3;
      "~*DTS\ Agent"             3;
      "~*EMail\ Exractor"        3;
      "~*Exabot"                 3;
      "~*Express\ WebPictures"   3;
      "~*ExtractorPro"           3;
      "~*ezooms"                 3;
      "~*facebookexternalhit"    2;
      "~*fimap"                  3;
      "~*FlipboardProxy"         2;
      "~*Genieo"                 3;
      "~*GetRight"               3;
      "~*GetWeb"                 3;
      "~*Go!Zilla"               3;
      "~*GrabNet"                3;
      "~*grapeFX"                3;
      "~*GrapeshotCrawler"       3;
      "~*Havij"                  3;
      "~*HTTrack"                3;
      "~*Huaweisymantecspider"   3;
      "~*ia_archiver"            2;
      "~*Image\ Sucker"          3;
      "~*Java"                   2;
      "~*jbrofuzz"               3;
      "~*Joomla"                 3;
      "~*Kraken"                 3;
      "~*libwhisker"             3;
      "~*libwww-perl"            2;
      "~*linkdexbot"             3;
      "~*LinkpadBot"             3;
      "~*lmspider"               3;
      "~*magpie-crawler"         3;
      "~*mail.ru"                3;
      "~*Mail.RU_Bot"            3;
      "~*majestic12"             3;
      "~*MarkWatch"              3;
      "~*MegaIndex.ru"           3;
      "~*Metauri"                2;
      "~*MJ12bot"                3;
      "~*msnbot"                 2;
      "~*musobot"                3;
      "~*nessus"                 3;
      "~*nikto"                  3;
      "~*Nmap"                   3;
      "~*Nutch"                  3;
      "~*omgilibot"              3;
      "~*Openvas"                3;
      "~*OrangeBot"              3;
      "~*proximic"               3;
      "~*Qwantify"               3;
      "~*R6_CommentReader"       2;
      "~*R6_FeedFetcher"         2;
      "~*ScanAlert"              3;
      "~*Scrapy"                 3;
      "~*ScreenerBot"            2;
      "~*SemrushBot"             3;
      "~*seomoz"                 3;
      "~*SISTRIX"                3;
      "~*SiteLockSpider"         3;
      "~*SiteSnagger"            3;
      "~*Slackbot-LinkExpanding" 3;
      "~*Sogou\ web\ spider"     3;
      "~*Sosospider"             2;
      "~*Spaidu"                 2;
      "~*spbot"                  3;
      "~*Spinn3r"                3;
      "~*sqlmap"                 3;
      "~*Sucuri"                 3;
      "~*Swiftbot"               3;
      "~*TeleportPro"            3;
      "~*trendictionbot"         3;
      "~*TurnitinBot"            3;
      "~*WASALive-Bot"           3;
      "~*WBSearchBot"            3;
      "~*Web\ Image\ Collector"  3;
      "~*Web\ Sucker"            3;
      "~*WebCopier"              3;
      "~*WebLeacher"             3;
      "~*WebReaper"              3;
      "~*webshag"                3;
      "~*WebStripper"            3;
      "~*WebZIP"                 3;
      "~*WeSEE"                  3;
      "~*whatweb"                3;
      "~*wonderbot"              3;
      "~*WordPress"              3;
      "~*Xaldon_WebSpider"       3;
      "~*Y!J-ASR"                3;
      "~*YandexBot"              2;
      "~*YandexImages"           2;
      "~*zitebot"                3;
      "~*ZumBot"                 3;
    }
    
    map $bot_agent $bot_iplimit {
        0    "";
        1    "";
        2    $binary_remote_addr;
    }
    
    # limits for googlebot and $bot_agent = 1
    #limit_conn_zone $bot_iplimit zone=bota_connlimit:16m;
    #limit_req_zone  $bot_iplimit zone=bota_reqlimitip:16m  rate=50r/s;
    # limits for $bot_agent = 2
    limit_conn_zone $bot_iplimit zone=botb_connlimit:16m;
    limit_req_zone  $bot_iplimit zone=botb_reqlimitip:16m  rate=2r/s;
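
    Before reloading Nginx it can be handy to sanity check which value a given user agent would get. The snippet below is only a rough shell emulation of a few entries from the map above, not Nginx itself; bot_agent_for and the rules variable are names invented for this sketch. It assumes Nginx uses the first matching regular expression in order of appearance, with ~ being case sensitive and ~* case insensitive.

```shell
# Rough emulation of a few map entries, one "flag|regex|value" per line.
# Only a small subset of the real list, for illustration (dot escaped here).
rules='~|Mozilla/4\.0|2
~*|Googlebot|1
~*|AhrefsBot|2
~*|GetWeb|3'

# Hypothetical helper: print the $bot_agent value a user agent would map to.
bot_agent_for() {
  ua=$1
  match=$(printf '%s\n' "$rules" | while IFS='|' read -r flag regex value; do
    if [ "$flag" = '~*' ]; then
      opts=-Eiq            # case insensitive match
    else
      opts=-Eq             # case sensitive match
    fi
    if printf '%s\n' "$ua" | grep $opts -- "$regex"; then
      echo "$value"
      break                # first matching regex wins
    fi
  done)
  echo "${match:-0}"       # default 0 = not rate limited
}
```

    For example, bot_agent_for 'GetWeb' prints 3, while an agent that matches none of the rules falls through to the default of 0.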
    


    In your nginx vhost's location context, add an include file
    Code (Text):
    include /usr/local/nginx/conf/blockbots.conf;

    In the include file /usr/local/nginx/conf/blockbots.conf add
    Code (Text):
    #######################################################################
    # add this to your nginx vhost domain's config file within
    # the location contexts you want to rate limit and/or bot
    # block. return 444 drops the connection without sending a
    # response; change it to return 403 for a permission denied error
    #limit_conn bota_connlimit 100;
    limit_conn botb_connlimit 10;
    #limit_req  zone=bota_reqlimitip burst=50;
    limit_req  zone=botb_reqlimitip burst=10;
    if ($bot_agent = '3') {
      return 444;
    }
    #######################################################################

    • With the above, if $bot_agent has a value of 2, the client is limited to 10 connections per IP and a request rate of 2 requests per second.
    • If $bot_agent has a value of 3, its connection is dropped completely with return 444.
    • If $bot_agent has a value of 1, it bypasses the connection and request rate limits, e.g. Googlebot.
    Examples

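    In this test Bingbot was mapped to a $bot_agent value of 2 (the map above now lists it as 1), so it was subject to the connection and request rate limits. Siege with 10 concurrent connections of 1 request each ends up at roughly 2.25 requests per second, with an average response time of 2.21s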
    Code (Text):
    siege -b -c10 -r1 -A "Bingbot" http://localhost
    Transactions:                     10 hits
    Availability:                 100.00 %
    Elapsed time:                   4.45 secs
    Data transferred:               0.02 MB
    Response time:                  2.21 secs
    Transaction rate:               2.25 trans/sec
    Throughput:                     0.00 MB/sec
    Concurrency:                    4.96
    Successful transactions:          10
    Failed transactions:               0
    Longest transaction:            4.45
    Shortest transaction:           0.01

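    Bingbot was again mapped to a $bot_agent value of 2 here (the map above now lists it as 1), so it was subject to the connection and request rate limits. Siege with 20 concurrent connections of 1 request each ends up being limited: 9 requests are rejected with 503 errors and the remaining 11 are paced at roughly 2 requests per second, with an average response time of 2.44s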
    Code (Text):
    siege -b -c20 -r1 -A "Bingbot" http://localhost
    ** SIEGE 3.1.3
    ** Preparing 20 concurrent users for battle.
    The server is now under siege...
    HTTP/1.1 200   0.01 secs:    1580 bytes ==> GET  /
    HTTP/1.1 503   0.00 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.00 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.00 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.01 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.00 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.01 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.01 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.00 secs:     206 bytes ==> GET  /
    HTTP/1.1 503   0.01 secs:     206 bytes ==> GET  /
    HTTP/1.1 200   0.43 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   0.93 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   1.44 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   1.93 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   2.44 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   2.93 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   3.44 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   3.93 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   4.43 secs:    1580 bytes ==> GET  /
    HTTP/1.1 200   4.93 secs:    1580 bytes ==> GET  /
    done.
    
    Transactions:                     11 hits
    Availability:                  55.00 %
    Elapsed time:                   4.94 secs
    Data transferred:               0.02 MB
    Response time:                  2.44 secs
    Transaction rate:               2.23 trans/sec
    Throughput:                     0.00 MB/sec
    Concurrency:                    5.44
    Successful transactions:          11
    Failed transactions:               9
    Longest transaction:            4.93
    Shortest transaction:           0.00
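
    The two siege runs above fit a simple back-of-envelope model of limit_req with a burst and no nodelay: of N requests arriving at once, 1 is served immediately, up to burst more are queued and released at the configured rate, and the rest are rejected. The sketch below just does that arithmetic; limit_req_estimate is a made-up name, and it assumes all requests arrive at the same instant and that the 503s come from limit_req rather than limit_conn.

```shell
# Hypothetical helper: estimate the outcome of N simultaneous requests
# against "limit_req ... burst=B" with a zone rate of R requests/second.
# Prints: accepted rejected seconds_until_last_accepted_request_is_released
limit_req_estimate() {
  awk -v n="$1" -v rate="$2" -v burst="$3" 'BEGIN {
    accepted = (n < burst + 1) ? n : burst + 1   # 1 immediate + up to burst queued
    rejected = n - accepted                      # everything beyond the burst
    last_delay = (accepted - 1) / rate           # queue drains at the zone rate
    printf "%d %d %.1f\n", accepted, rejected, last_delay
  }'
}
```

    limit_req_estimate 20 2 10 prints 11 9 5.0, which is close to the 11 hits, 9 failures and 4.93s longest transaction siege reported, and limit_req_estimate 10 2 10 prints 10 0 4.5 against the 4.45s seen in the first run.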


    The GetWeb user agent has a $bot_agent value of 3, so it is blocked and its connection is dropped with return 444. If you're behind Cloudflare, the visitor will get a 520 HTTP status code instead
    Code (Text):
    siege -b -c10 -r1 -A "GetWeb" http://localhost
    Transactions:                      0 hits
    Availability:                   0.00 %
    Elapsed time:                   0.00 secs
    Data transferred:               0.00 MB
    Response time:                  0.00 secs
    Transaction rate:               0.00 trans/sec
    Throughput:                     0.00 MB/sec
    Concurrency:                    0.00
    Successful transactions:           0
    Failed transactions:              10
    Longest transaction:            0.00
    Shortest transaction:           0.00


    In your site's access.log it will turn up as
    Code (Text):
    IPADDR - - [29/Feb/2016:02:09:27 +0000] "GET / HTTP/1.1" 444 0 "-" "GetWeb"


    Filter access.log to get a per-IP count for a given status code, e.g. 444
    Code (Text):
    read -ep "Filter which status code ? i.e. 404 : " var ; awk -v errno="${var}" '$9 == errno { print $1 }' access.log | sort | uniq -c | sort -n
    

    Code (Text):
    read -ep "Filter which status code ? i.e. 404 : " var ; awk -v errno="${var}" '$9 == errno { print $1 }' access.log | sort | uniq -c | sort -n
    Filter which status code ? i.e. 404 : 444
         32 IPADDR
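
    The same idea can be turned around to count by user agent instead of by IP. A minimal sketch, assuming the combined log format where the user agent is the last double-quoted field; count_status_by_agent is a hypothetical helper name, not part of Centmin Mod.

```shell
# Hypothetical helper: count requests with a given status code per user agent.
# Splitting each line on double quotes gives:
#   field 3 = " status bytes ", field 6 = user agent (combined log format).
count_status_by_agent() {
  status=$1
  logfile=$2
  awk -F'"' -v want="$status" '{
    split($3, parts, " ")
    if (parts[1] == want) print $6
  }' "$logfile" | sort | uniq -c | sort -rn
}
```

    For example, count_status_by_agent 444 access.log lists the blocked user agents, most frequent first.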


    Further Notes




    1. Static files are processed by the /usr/local/nginx/conf/staticfiles.conf include file in your nginx vhost, so if you want to rate limit static files, you need to add the include file /usr/local/nginx/conf/blockbots.conf created above to those location matches
      Code (Text):
      include /usr/local/nginx/conf/blockbots.conf;
      However, static files shouldn't be a problem for most folks, as Nginx eats static file requests for breakfast and handles them with ease :)
    2. If you used Centmin Mod 123.09beta01 centmin.sh menu option 22 to auto install WordPress, you will also have whitelisted location matches for common WP plugins in the auto created file at /usr/local/nginx/conf/wpincludes/${vhostname}/wpsecure_${vhostname}.conf, where ${vhostname} is your domain name, e.g. yourdomain.com. As with staticfiles.conf, you would need to edit each whitelisted WP plugin location match in wpsecure_${vhostname}.conf and add
      Code (Text):
      include /usr/local/nginx/conf/blockbots.conf;
      The latest 123.09beta01 update also adds a common include file, /usr/local/nginx/conf/wpincludes/${vhostname}/wpwhitelist_common.conf, to the generated /usr/local/nginx/conf/wpincludes/${vhostname}/wpsecure_${vhostname}.conf. You can add the /usr/local/nginx/conf/blockbots.conf include to it so that it is shared by all whitelisted location matches.
      Code (Text):
        # below include file needs to be manually created at that path and to be uncommented
        # by removing the hash # in front of below line to take effect. This wpwhitelist_common.conf
        # allows you to add commonly shared settings to all wp plugin location matches which
        # whitelist php processing access at /usr/local/nginx/conf/wpincludes/${vhostname}/wpsecure_${vhostname}.conf
        #include /usr/local/nginx/conf/wpincludes/${vhostname}/wpwhitelist_common.conf;
      
     
    Last edited: Sep 19, 2016
  2. pamamolf

    pamamolf Premium Member Premium Member

    4,068
    427
    83
    May 31, 2014
    Ratings:
    +832
    Local Time:
    7:09 AM
    Nginx-1.25.x
    MariaDB 10.3.x
    Baiduspider is a server killer :(
    I recommend adding very aggressive bots like Baidu to the config file. If you think one should not be blocked by default, just keep it disabled so the user can enable it easily by uncommenting it, instead of having to search for the exact agent and then create an entry for it.

    Code:
    Sogou web spider
    MJ12bot
    lmspider
    omgilibot
    Spinn3r
    WeSEE
    WASALive-Bot
    Scrapy
    Genieo
    Kraken
    Mail.RU_Bot
    Exabot
    trendictionbot
    Claritybot
    musobot
    linkdexbot
    proximic
    Slackbot-LinkExpanding
    calculon spider
    Swiftbot
    zitebot
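
    A plain list like the one above can be turned into map entries in the format used in the first post with a few lines of shell. This is just a sketch: names_to_map_entries is an invented helper, it only escapes spaces and dots, and the default level of 3 (block) is an assumption you would want to review per bot.

```shell
# Hypothetical helper: read bot names on stdin, one per line, and print
# case-insensitive map entries with the given level (default 3 = block).
names_to_map_entries() {
  level=${1:-3}
  while IFS= read -r name; do
    [ -n "$name" ] || continue          # skip empty lines
    # escape dots and spaces so they match literally in the regex
    escaped=$(printf '%s\n' "$name" | sed -e 's/\./\\./g' -e 's/ /\\ /g')
    printf '  "~*%s" %s;\n' "$escaped" "$level"
  done
}
```

    Feed it the names on stdin, e.g. printf 'Sogou web spider\n' | names_to_map_entries 3, and review the output before adding it to botlimit.conf.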
     
    Last edited: Feb 28, 2016
  3. ModeltogTossen

    ModeltogTossen I wish I could??

    313
    97
    28
    Dec 20, 2015
    Denmark
    Ratings:
    +143
    Local Time:
    6:09 AM
    1.9.12
    10.0.23
    Yeah - these frakkers are visible every time on my site when I'm looking at iftop. Nasty, so an update of the block file would be very lovely.
     
  4. eva2000

    eva2000 Administrator Staff Member

    moved to 1st post :)
     
    Last edited: Feb 29, 2016
  5. pamamolf

    pamamolf Premium Member Premium Member

    Should be added to latest Centminmod :)
    So with a simple uncomment we will be able to enable it :)

    Cool !
     
  6. eva2000

    eva2000 Administrator Staff Member

    That's the plan; I just need folks to test the above steps manually or suggest further bots to block and/or rate limit :) ;)
     
  7. pamamolf

    pamamolf Premium Member Premium Member

    A few more that may help :)

    Code:
    http://www.botreports.com/badbots/
    http://wpsecure.net/bad-bot-list/
    https://perishablepress.com/ultimate-htaccess-blacklist-2-compressed-version/
     
  8. eva2000

    eva2000 Administrator Staff Member

    Cheers, I've already updated the above list with some of those, so more is nice.

    Note that once you get to a certain number of mappings, you need to edit nginx.conf to add or raise some map hash sizes. 123.09beta01's new nginx.conf defaults are set within the http{} context:
    Code:
    http {
    map_hash_bucket_size 128;
    map_hash_max_size 4096;
    server_names_hash_bucket_size 128;
    server_names_hash_max_size 2048;
     
    Last edited: Jun 2, 2017
  9. pamamolf

    pamamolf Premium Member Premium Member

    I have the exact default sizes... do I have to increase them more?
     
  10. eva2000

    eva2000 Administrator Staff Member

    No need. The centminmod.com site has 100s of mappings for various GeoIP-related Nginx variables, from targeting ad banners by geographic region to serving specific sidebars to different countries, and it uses those default values for now. Of course, if you run into the limit, then increase them :)

    You have them, as you installed the latest 123.09beta01 :)
     
  11. pamamolf

    pamamolf Premium Member Premium Member

    Do you mean below:
    Code:
    location / {
    ?
     
  12. eva2000

    eva2000 Administrator Staff Member

    yeah any location context you want it to apply to i.e. /, /wp, /forum, /blog, /directory etc
     
  13. pamamolf

    pamamolf Premium Member Premium Member

    Will blockbots.conf be merged with block.conf?

    If not, then move the agents part to it, as I think it does the same thing?
     
  14. eva2000

    eva2000 Administrator Staff Member

    Probably moved, or kept separate; I haven't decided yet
    Implemented this for testing on this forum too :)
    Code:
    siege -b -c10 -r1 -A "GetWeb" https://community.centminmod.com/
    Transactions:                      0 hits
    Availability:                   0.00 %
    Elapsed time:                   0.04 secs
    Data transferred:               0.00 MB
    Response time:                  0.00 secs
    Transaction rate:               0.00 trans/sec
    Throughput:                     0.00 MB/sec
    Concurrency:                    0.00
    Successful transactions:           0
    Failed transactions:              10
    Longest transaction:            0.00
    Shortest transaction:           0.00
     
  15. pamamolf

    pamamolf Premium Member Premium Member

    If there are no disadvantages (mostly performance), then it would be better to have only one file :)
    But if it is not easy to do, not very flexible, or there is any other reason you can think of, then make it separate :)
     
  16. negative

    negative Active Member

    415
    50
    28
    Apr 11, 2015
    Ratings:
    +98
    Local Time:
    7:09 AM
    1.9.10
    10.1.11
    Firstly, how can I analyze which bots requested too many pages in a day or in the last week, so I can decide which bots to block? @eva2000

    Thank you for your help.
     
  17. eva2000

    eva2000 Administrator Staff Member

    yeah will decide later
    You'd need to analyse your site's access.log using some scripting or grep/awk/sort commands. You can probably also do that via ngxtop, which has fields to count by user agent etc. See the Centmin Mod Community thread "Nginx - ngxtop real time metrics for Nginx" and the GitHub repo lebinh/ngxtop ("Real-time metrics for nginx server").
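
    For the grep/awk/sort route, a minimal sketch could look like this. It assumes the combined log format (user agent as the last double-quoted field, timestamp like [29/Feb/2016:02:09:27); top_agents_for_day is a hypothetical helper name.

```shell
# Hypothetical helper: list the busiest user agents for one day.
# Arguments: day as it appears in the log (e.g. 29/Feb/2016), log file,
# and an optional number of rows to show (default 10).
top_agents_for_day() {
  day=$1
  logfile=$2
  limit=${3:-10}
  # Split on double quotes: field 1 holds the timestamp, field 6 the user agent.
  awk -F'"' -v day="$day" 'index($1, "[" day ":") { print $6 }' "$logfile" |
    sort | uniq -c | sort -rn | head -n "$limit"
}
```

    Running top_agents_for_day 29/Feb/2016 access.log 20 for each day of the week gives a quick view of which bots are worth rate limiting or blocking.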

    Example using my previous siege 'GetWeb' user agent to test my own forum

    444 count was 30 entries
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep 'GetWeb' | ngxtop --no-follow
    running for 0 seconds, 30 records processed: 26318.58 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |      30 |            0.000 |     0 |     0 |    30 |     0 |
    
    Detailed:
    | request_path   |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |----------------+---------+------------------+-------+-------+-------+-------|
    | /              |      30 |            0.000 |     0 |     0 |    30 |     0 |


    group by remote_addr count
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep 'GetWeb' | ngxtop --no-follow --group-by remote_addr
    running for 0 seconds, 30 records processed: 33078.11 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |      30 |            0.000 |     0 |     0 |    30 |     0 |
    
    Detailed:
    | remote_addr     |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |-----------------+---------+------------------+-------+-------+-------+-------|
    | IPADDR          |      30 |            0.000 |     0 |     0 |    30 |     0 |


    group by user agent for all user agents, without filtering on GetWeb
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | ngxtop --no-follow --group-by http_user_agent
    running for 41 seconds, 557981 records processed: 13740.10 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |    2xx |   3xx |    4xx |   5xx |
    |---------+------------------+--------+-------+--------+-------|
    |  557981 |        19773.468 | 275904 | 30752 | 251250 |    68 |
    
    Detailed:
    | http_user_agent                                                                                                                |   count |   avg_bytes_sent |   2xx |   3xx |    4xx |   5xx |
    |--------------------------------------------------------------------------------------------------------------------------------+---------+------------------+-------+-------+--------+-------|
    | Amazon Route 53 Health Check Service; ref:2a00cab0-xxxx-4e4d-bb32-30d1d58xxxxx; report http://amzn.to/1vsZADi                  |  243810 |          176.998 |     0 |     0 | 243807 |     0 |
    | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Q312461; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)                         |   99117 |        77823.952 | 98846 |     0 |    264 |     7 |
    | NewRelicPinger/1.0 (652248)                                                                                                    |   16034 |         9245.900 | 16032 |     0 |      2 |     0 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                       |   15758 |        11262.793 | 10130 |  5388 |    239 |     1 |


    Filter for just Feb 29th access.log via grep for all http_user_agents
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | ngxtop --no-follow --group-by http_user_agent
    running for 8 seconds, 105938 records processed: 13751.72 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |  105938 |        22324.371 | 52923 |  5451 | 47563 |     0 |
    
    Detailed:
    | http_user_agent                                                                                                                |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |--------------------------------------------------------------------------------------------------------------------------------+---------+------------------+-------+-------+-------+-------|
    | Amazon Route 53 Health Check Service; ref:2a00cab0-xxxx-4e4d-bb32-30d1d58xxxxx; report http://amzn.to/1vsZADi                  |   45075 |          176.996 |     0 |     0 | 45074 |     0 |
    | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Q312461; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)                         |   22568 |        78140.910 | 22568 |     0 |     0 |     0 |
    | NewRelicPinger/1.0 (652248)                                                                                                    |    2942 |         7230.017 |  2942 |     0 |     0 |     0 |


    Filter the Feb 29th access.log entries via grep for 444 status codes and print the request, status code and http_user_agent logged. You can see how well the above bot rate limiting and blocking is doing
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | ngxtop --no-follow -i 'status == 444' print request status http_user_agent
    running for 8 seconds, 670 records processed: 81.26 req/sec
    
    request, status, http_user_agent:
    | request                                                                                                            |   status | http_user_agent                                                                         |
    |--------------------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------------------------|
    | GET / HTTP/1.1                                                                                                     |      444 | GetWeb                                                                                  |
    | GET / HTTP/1.1                                                                                                     |      444 | Mozilla/5.0 (compatible; GrapeshotCrawler/2.0; +http://www.grapeshot.co.uk/crawler.php) |
    | GET / HTTP/1.1                                                                                                     |      444 | Mozilla/5.0 (compatible; linkdexbot/2.0; +http://www.linkdex.com/bots/)                 |
    | GET / HTTP/1.1                                                                                                     |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET / HTTP/1.1                                                                                                     |      444 | libwww-perl/5.833                                                                       |
    | GET /account/alerts HTTP/1.1                                                                                       |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /categories/centmin-mod.7/ HTTP/1.1                                                                            |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /find-new/300035/posts HTTP/1.1                                                                                |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /find-new/300043/posts HTTP/1.1                                                                                |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /find-new/300145/posts HTTP/1.1                                                                                |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /forums/beta-release-code.9 HTTP/1.1                                                                           |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /forums/beta-release-code.9/ HTTP/1.1                                                                          |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET /forums/bug-reports.12/?prefix_id=29&order=post_date HTTP/1.0                                                  |      444 | Mozilla/5.0 (compatible; Qwantify/2.2w; +https://www.qwant.com/)/*                      |


    For March 1 check out 444 status codes
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '01/Mar' | ngxtop --no-follow -i 'status == 444' print request status http_user_agent
    running for 0 seconds, 17 records processed: 46.12 req/sec
    
    request, status, http_user_agent:
    | request                                                     |   status | http_user_agent                                                              |
    |-------------------------------------------------------------+----------+------------------------------------------------------------------------------|
    | GET / HTTP/1.1                                              |      444 | libwww-perl/5.833                                                            |
    | GET /threads/nginx-failed-to-install-upgrade.1798/ HTTP/1.1 |      444 | Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)      |
    | GET /threads/php-7-0-3-is-available.6001/ HTTP/1.1          |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php) |


    Check only Feb 29 for Googlebot via grep of access.log and pipe through to ngxtop and print http_user_agent and status codes
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | grep 'Googlebot' | ngxtop --no-follow print http_user_agent status   
    running for 0 seconds, 2598 records processed: 15727.07 req/sec
    
    http_user_agent, status:
    | http_user_agent                                                                                                                                                                                    |   status |
    |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------|
    | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                                                                                               |      200 |
    | Googlebot-Image/1.0                                                                                                                                                                                |      200 |
    | Googlebot-Image/1.0                                                                                                                                                                                |      304 |
    | Googlebot-Image/1.0                                                                                                                                                                                |      404 |
    | Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                                                    |      200 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      200 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      301 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      303 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      304 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      307 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      403 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |      404 |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |      200 |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |      301 |
    | SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                          |      200 |


    Same filter as before, but swap status for remote_addr to see which IP addresses each Googlebot user agent came from on Feb 29
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | grep 'Googlebot' | ngxtop --no-follow print http_user_agent remote_addr
    running for 0 seconds, 2598 records processed: 17187.30 req/sec
    
    http_user_agent, remote_addr:
    | http_user_agent                                                                                                                                                                                    | remote_addr    |
    |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------|
    | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                                                                                               | 66.249.79.61   |
    | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                                                                                               | 66.249.79.63   |
    | DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                                                                                               | 66.249.79.65   |
    | Googlebot-Image/1.0                                                                                                                                                                                | 192.240.110.42 |
    | Googlebot-Image/1.0                                                                                                                                                                                | 209.58.130.199 |
    | Googlebot-Image/1.0                                                                                                                                                                                | 66.249.79.61   |
    | Googlebot-Image/1.0                                                                                                                                                                                | 66.249.79.63   |
    | Googlebot-Image/1.0                                                                                                                                                                                | 66.249.79.65   |
    | Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                                                    | 162.216.19.183 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 127.0.0.1      |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 192.240.110.42 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 209.58.130.199 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.64.125  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.64.2    |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.66.179  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.66.182  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.66.185  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.66.44   |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.223  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.230  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.237  |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.61   |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.63   |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           | 66.249.79.65   |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.66.179  |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.66.185  |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.66.47   |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.79.61   |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.79.63   |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 66.249.79.65   |
    | SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)                          | 66.249.79.61   |


    Building on the previous command, group the filtered Feb 29 Googlebot requests by remote_addr
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | grep 'Googlebot' | ngxtop --no-follow --group-by remote_addr
    running for 0 seconds, 2598 records processed: 16824.80 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |    2598 |        12314.505 |  1670 |   881 |    47 |     0 |
    
    Detailed:
    | remote_addr    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |----------------+---------+------------------+-------+-------+-------+-------|
    | 66.249.79.61   |    1231 |        14078.539 |   852 |   363 |    16 |     0 |
    | 66.249.79.223  |     525 |         8911.514 |   292 |   232 |     1 |     0 |
    | 66.249.79.63   |     311 |        13264.614 |   208 |    95 |     8 |     0 |
    | 66.249.66.179  |     230 |         8986.509 |   119 |   102 |     9 |     0 |
    | 66.249.79.65   |     191 |        14356.131 |   139 |    50 |     2 |     0 |
    | 66.249.79.230  |      34 |         8199.324 |    18 |    16 |     0 |     0 |
    | 66.249.66.182  |      29 |         9298.276 |    16 |     9 |     4 |     0 |
    | 66.249.66.185  |      19 |         5710.632 |     8 |     6 |     5 |     0 |
    | 66.249.79.237  |      16 |         7718.500 |     9 |     7 |     0 |     0 |
    | 209.58.130.199 |       3 |          795.000 |     1 |     0 |     2 |     0 |


    Filter Feb 29 only for several bots at once via case-insensitive egrep (Googlebot, Baidu, bingbot, AhrefsBot, Yandex and msnbot) and group the counts by http_user_agent
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '29/Feb' | egrep -i 'Googlebot|Baidu|bingbot|Ahrefsbot|yandex|msnbot' | ngxtop --no-follow --group-by http_user_agent
    running for 1 seconds, 8609 records processed: 16183.12 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |    8609 |        11377.074 |  5210 |  3150 |   249 |     0 |
    
    Detailed:
    | http_user_agent                                                                                                                                                                                    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+------------------+-------+-------+-------+-------|
    | Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)                                                                                                                            |    2496 |        10524.571 |  1668 |   770 |    58 |     0 |
    | Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)                                                                                                                           |    2430 |        11587.994 |  1524 |   860 |    46 |     0 |
    | Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)                                                                                                                                 |    1180 |         5371.688 |   339 |   743 |    98 |     0 |
    | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)                                                                                                                |    1117 |        15648.950 |   695 |   379 |    43 |     0 |
    | Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)                                                                                                                                   |    1053 |        11357.065 |   721 |   329 |     3 |     0 |
    | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) |     122 |        27010.811 |   121 |     1 |     0 |     0 |
    | Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)                                                                                                                                |      91 |        27978.187 |    47 |    44 |     0 |     0 |
    | Googlebot-Image/1.0                                                                                                                                                                                |      33 |         5185.970 |    12 |    20 |     1 |     0 |
    | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36                                                                          |      14 |         4820.214 |    14 |     0 |     0 |     0 |
    | Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36                                                                                     |       9 |         7903.889 |     9 |     0 |     0 |     0 |
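
    Side note: the two Chrome user agents at the bottom of that table are presumably there because egrep matches anywhere in the log line (a referrer or URL containing one of the names, for instance), not just the user agent. A minimal sketch of a stricter pre-filter, assuming the stock nginx combined log format, where splitting a line on double quotes puts the user agent in field 6. ua_filter is a made-up helper name, not part of ngxtop or Centmin Mod:

```shell
# ua_filter DAY PATTERN < access.log
# Assumes nginx "combined" log format: splitting a line on double quotes
# leaves the timestamp in field 1 and the user agent in field 6.
ua_filter() {
  awk -F'"' -v day="$1" -v pat="$2" '
    BEGIN { pat = tolower(pat) }                       # case-insensitive match
    index($1, "[" day) && tolower($6) ~ pat { print }  # date AND user agent
  '
}

# Usage (same log path and bot list as above):
# ua_filter '29/Feb' 'googlebot|baidu|bingbot|ahrefsbot|yandex|msnbot' \
#   < /home/nginx/domains/community.centminmod.com/log/access.log \
#   | ngxtop --no-follow --group-by http_user_agent
```

    Since it only passes matching lines through unchanged, the output can be piped into ngxtop the same way as the grep version. If your log_format has extra quoted fields, the field number would need adjusting.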


    For more, check out the "Nginx - ngxtop real time metrics for Nginx" thread on the Centmin Mod Community :)
     
    Last edited: Mar 1, 2016
  18. eva2000

    eva2000 Administrator Staff Member

    Looks like blocking libwww-perl broke my munin stats logging (see User Stats Extended | Centmin Mod Community), heh.

    So I switched it from the block value 3 to the rate limited value 2:
    Code:
      "~*libwww-perl"            2;
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '01/Mar' | ngxtop --no-follow -i 'status == 444' print request status http_user_agent
    running for 1 seconds, 90 records processed: 84.74 req/sec
    
    request, status, http_user_agent:
    | request                                                                                                            |   status | http_user_agent                                                                         |
    |--------------------------------------------------------------------------------------------------------------------+----------+-----------------------------------------------------------------------------------------|
    | GET / HTTP/1.1                                                                                                     |      444 | Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)            |
    | GET / HTTP/1.1                                                                                                     |      444 | libwww-perl/5.833                                                                       |


    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '01/Mar' | egrep -i 'libwww-perl' | ngxtop --no-follow --group-by http_user_agent
    running for 0 seconds, 40 records processed: 28693.72 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |      40 |         3262.750 |     1 |     0 |    39 |     0 |
    
    Detailed:
    | http_user_agent   |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |-------------------+---------+------------------+-------+-------+-------+-------|
    | libwww-perl/5.833 |      40 |         3262.750 |     1 |     0 |    39 |     0 |


    After switching the libwww-perl user agent from block to rate limit:
    Code (Text):
    cat /home/nginx/domains/community.centminmod.com/log/access.log | grep '01/Mar' | egrep -i 'libwww-perl' | ngxtop --no-follow --group-by status
    running for 0 seconds, 41 records processed: 11771.27 req/sec
    
    Summary:
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |---------+------------------+-------+-------+-------+-------|
    |      41 |         6366.341 |     2 |     0 |    39 |     0 |
    
    Detailed:
    |   status |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |----------+---------+------------------+-------+-------+-------+-------|
    |      444 |      39 |            0.000 |     0 |     0 |    39 |     0 |
    |      200 |       2 |       130510.000 |     2 |     0 |     0 |     0 |


    Great, munin is resuming proper logging now that I unblocked libwww-perl :)

    [Attached image: upload_2016-3-1_13-28-8.png]
     
    Last edited: Mar 1, 2016
  19. deltahf

    deltahf Premium Member Premium Member

    581
    264
    63
    Jun 8, 2014
    Ratings:
    +482
    Local Time:
    12:09 AM
    Thanks very much for bringing this up in the newsletter, @eva2000. Little things like these can make a surprising difference in a server's performance, but it seems to be an area many sys admins overlook or don't think about much. (y)

    I was using the following directives to block spiders, although I know it's a bit primitive:
    Code:
    if ($http_user_agent ~* (baidu|yandex|ahref|seomoz|exabot|majestic12|ezooms|boardreader|mail.ru))
    {
            return 403;
    }
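
    Before reloading nginx with a pattern like this, it can help to dry-run it against a few user agents from your logs. A rough sketch using grep -Ei as a stand-in for nginx's ~* match. That is an approximation (nginx uses PCRE, grep -E uses POSIX ERE), though for a plain alternation like this the two should agree. ua_blocked and the last sample user agent are invented for illustration:

```shell
# Pattern copied from the if block above, exactly as posted.
pattern='baidu|yandex|ahref|seomoz|exabot|majestic12|ezooms|boardreader|mail.ru'

# ua_blocked UA: succeeds if UA matches the pattern, ignoring case.
ua_blocked() {
  printf '%s\n' "$1" | grep -Eiq -- "$pattern"
}

# The first two user agents appear in the logs earlier in this thread.
# The last one is made up: it matches only because the unescaped "." in
# mail.ru is a regex wildcard.
for ua in \
  'Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)' \
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
  'ExampleMail-Ru-Client/1.0'
do
  if ua_blocked "$ua"; then echo "403  $ua"; else echo "pass $ua"; fi
done
```

    The first and third user agents should come back as 403 and Googlebot as pass. Writing the last alternative as mail\.ru would stop it matching anything other than a literal dot.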
    
    I built this list simply by looking through my access logs occasionally over the years. I have added them to the Centminmod bot list above with the following values:
    Code:
      "~*seomoz"                    3;
      "~*boardreader"               3;
      "~*exabot"                    3;
      "~*majestic12"                3;
      "~*ezooms"                    3;
      "~*mail.ru"                   3;
    I would strongly recommend blocking "boardreader" and "seomoz". Boardreader will scrape and steal your content, while SEOmoz - although a legit company - is being used by SEO folks to find backlinks and websites which might be good to put links on (in other words, spam). :poop: "exabot" and "majestic12" do appear to be from somewhat legitimate search engines, but I really don't want them consuming any of my valuable resources.

    Also, something I wanted to point out with Bing: if you add your site to the Bing Webmaster Tools, you can control how frequently and when their bot accesses your site.


    This might be better than limiting Bing server-side, since server-side limiting may harm your site's ranking. That might not be important now, but Bing is growing. I am whitelisting Bing in my Centminmod config.
     
  20. eva2000

    eva2000 Administrator Staff Member

    @deltahf thanks for contributing to the bad bot list :D I've updated the 1st post with the new listing and whitelisted bingbot too. Exabot is already on the list, so there's no need to duplicate that entry.
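
    Duplicates like that are easy to miss as the list grows. A quick sketch for spotting them, assuming each entry in botlimit.conf sits on its own line as a double-quoted pattern followed by a numeric value (as in the snippets in this thread). dup_map_keys is a hypothetical helper name:

```shell
# dup_map_keys < botlimit.conf
# Prints each double-quoted map key that appears more than once, ignoring
# case. Lines that are not of the form  "pattern"  N;  are skipped.
dup_map_keys() {
  sed -n 's/^[[:space:]]*"\([^"]*\)"[[:space:]]*[0-9][0-9]*;.*$/\1/p' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -d
}

# Usage:
# dup_map_keys < /usr/local/nginx/conf/botlimit.conf
```

    No output means no repeated keys were found. It only compares the patterns as text, so two different regexes that happen to match the same bot will not be flagged.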

    Yeah, Bing Webmaster Tools is handy for that. Just remember that rogue bots pretending to be bingbot won't observe the limits you set there :)
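
    One way to catch pretenders is the reverse-then-forward DNS check that both Google and Bing document for their crawlers: the IP's PTR record should sit under googlebot.com, google.com or search.msn.com, and that hostname should resolve back to the same IP. A rough sketch, assuming the host utility (bind-utils) is installed. The function names are made up for illustration:

```shell
# crawler_domain_ok HOSTNAME: succeeds if HOSTNAME sits under one of the
# domains the search engines document for their crawlers.
crawler_domain_ok() {
  case "${1%.}" in                      # drop the trailing dot from PTR output
    *.googlebot.com|*.google.com|*.search.msn.com) return 0 ;;
    *) return 1 ;;
  esac
}

# verify_crawler IP: reverse lookup, domain check, then a forward lookup
# that must lead back to the same IP. Needs working DNS, so results depend
# on the resolver at the time you run it.
verify_crawler() {
  ptr=$(host "$1" 2>/dev/null | awk '/domain name pointer/ { print $NF; exit }')
  [ -n "$ptr" ] || return 1
  crawler_domain_ok "$ptr" || return 1
  host "$ptr" 2>/dev/null | awk -v ip="$1" '$NF == ip { ok = 1 } END { exit !ok }'
}

# Usage, with an address from the Googlebot table earlier in this thread:
# verify_crawler 66.249.79.61 && echo verified || echo "not verified"
```

    The domain check alone is not enough, since anyone can set a PTR record on their own IP range. The forward lookup is what ties the hostname back to the search engine.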