Want to subscribe to topics you're interested in?
Become a Member

What was going on..? High 15 minute load average alert and lot's of ?

Discussion in 'System Administration' started by Meirami, Apr 19, 2018.

  1. Meirami

    Meirami Member

    128
    15
    18
    Dec 21, 2017
    Ratings:
    +41
    Local Time:
    3:33 PM
    I received automated mail:
    Code:
    High 15 minute load average alert - 25.29
    
    [-- Attachment #1 --]
    [-- Type: text/plain, Encoding: 7bit, Size: 0.2K --]
    
    Time:                    Sun Apr 15 07:16:35 2018 +0000
    1 Min Load Avg:          22.01
    5 Min Load Avg:          29.03
    15 Min Load Avg:         25.29
    Running/Total Processes: 1/69
    
    [-- Attachment #2: ps.txt --]
    [-- Type: text/plain, Encoding: 7bit, Size: 0.1K --]
    
    Output from ps:
    Unable to obtain process output within 15 seconds - Timed out
    
    [-- Attachment #3: vmstat.txt --]
    [-- Type: text/plain, Encoding: 7bit, Size: 0.1K --]
    
    Output from vmstat:
    Unable to obtain vmstat output within 10 seconds - Timed out
    
    [-- Attachment #4: netstat.txt --]
    [-- Type: text/plain, Encoding: 7bit, Size: 2.3K --]
    
    Output from netstat:
    Active Internet connections (servers and established)
     Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
     tcp        0      0 127.0.0.1:8004          0.0.0.0:*               LISTEN      2250/php-fpm: maste
     tcp        0      0 127.0.0.1:8005          0.0.0.0:*               LISTEN      2250/php-fpm: maste
     tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      1/init
     tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      10803/nginx: master
     tcp        0      0 0.0.0.0:21              0.0.0.0:*               LISTEN      6253/pure-ftpd (SER
     tcp        0      0 0.0.0.0:712             0.0.0.0:*               LISTEN      430/sshd
     tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      775/master
     tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN      10803/nginx: master
     tcp        0      0 127.0.0.1:8000          0.0.0.0:*               LISTEN      2250/php-fpm: maste
     tcp        0      0 127.0.0.1:8002          0.0.0.0:*               LISTEN      2250/php-fpm: maste
     tcp        0      0 127.0.0.1:3303          0.0.0.0:*               LISTEN      17533/mysqld
     tcp        0      0 127.0.0.1:8003          0.0.0.0:*               LISTEN      2250/php-fpm: maste
     tcp6       0      0 :::80                   :::*                    LISTEN      10803/nginx: master
     tcp6       0      0 :::21                   :::*                    LISTEN      6253/pure-ftpd (SER
     tcp6       0      0 :::712                  :::*                    LISTEN      430/sshd
     tcp6       0      0 ::1:25                  :::*                    LISTEN      775/master
     tcp6       0      0 :::443                  :::*                    LISTEN      10803/nginx: master
     udp        0      0 111.22.333.44:63249     8.8.4.4:53              ESTABLISHED 10805/nginx: worker
     udp        0      0 111.22.333.44:44679     8.8.8.8:53              ESTABLISHED 10805/nginx: worker
     udp        0      0 111.22.333.44:64408     8.8.8.8:53              ESTABLISHED 10804/nginx: worker
     udp        0      0 111.22.333.44:32236     8.8.4.4:53              ESTABLISHED 10804/nginx: worker
    
    [-- Attachment #5: apachestatus.html --]
    [-- Type: text/html, Encoding: 7bit, Size: 0.1K --]
    
    Unable to retrieve Apache Server Status [http://127.0.0.1/server-status] - Unable to download: Server closed connection without sending any data back
    
    Why it's trying to get Apache Server Status?


    I checked the logs but I don't understand what happened.

    This was few hours later:
    Code:
    error.log
    2018/04/15 15:04:03 [crit] 10804#10804: *236 SSL_do_handshake() failed (SSL: error:1417D102:SSL routines:tls_process_client_hello:unsupported protocol) while SSL handshaking, client: 172.104.154.126, server: 0.0.0.0:443
    (lot's of these)
    Code:
    access.log
    54.183.165.74 - - [15/Apr/2018:05:59:48 +0000] "GET / HTTP/1.1" 400 7442 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36"
    This is me    - - [15/Apr/2018:07:44:27 +0000] "GET /index.php/204 HTTP/1.1" 204 0 "-" "Mozilla/5.0 (Android) ownCloud-android/20180412"
    ...
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x02\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    172.104.154.126 - - [15/Apr/2018:15:04:03 +0000] "\x15\x03\x01\x00\x02\x01\x00" 400 166 "-" "-"
    (this continues...)
    Code:
    cron
    Apr 15 06:15:03 info CROND[13371]: (root) CMD (/usr/lib64/sa/sa1 1 1)
    Apr 15 06:15:03 info CROND[13372]: (nginx) CMD (/usr/local/bin/php -f /home/nginx/domains/domain.name/public/cron.php)
    Apr 15 06:20:02 info CROND[13401]: (root) CMD (/usr/lib64/sa/sa1 1 1)
    Apr 15 06:30:37 info crond[13422]: (root) INFO (Job execution of per-minute job scheduled for 06:25 delayed into subsequent minute 06:30. Skipping job run.)
    Apr 15 06:32:09 info crond[13436]: (nginx) INFO (Job execution of per-minute job scheduled for 06:30 delayed into subsequent minute 06:31. Skipping job run.)
    Apr 15 06:32:11 info crond[13435]: (root) INFO (Job execution of per-minute job scheduled for 06:30 delayed into subsequent minute 06:31. Skipping job run.)
    Apr 15 06:32:11 info crond[13435]: (root) INFO (Job execution of per-minute job scheduled for 06:30 delayed into subsequent minute 06:31. Skipping job run.)
    Apr 15 07:16:36 info crond[13460]: (root) INFO (Job execution of per-minute job scheduled for 07:13 delayed into subsequent minute 07:14. Skipping job run.)
    Apr 15 07:16:36 info crond[13454]: (root) INFO (Job execution of per-minute job scheduled for 06:48 delayed into subsequent minute 07:14. Skipping job run.)
    Apr 15 07:16:36 info crond[13453]: (root) INFO (Job execution of per-minute job scheduled for 06:36 delayed into subsequent minute 07:14. Skipping job run.)
    Apr 15 07:16:36 info crond[13457]: (root) INFO (Job execution of per-minute job scheduled for 06:50 delayed into subsequent minute 07:14. Skipping job run.)
    Apr 15 07:16:36 info CROND[13485]: (root) CMD (/usr/lib64/sa/sa1 1 1)
    Code:
    maillog
    Apr 15 06:15:09 info postfix/local[13382]: F22CD40D2E: to=<[email protected]>, orig_to=<nginx>, relay=local, delay=7.8, delays=5.7/0.01/0/2.1, dsn=2.0.0, status=sent (delivered to mailbox)
    Apr 15 06:15:09 info postfix/qmgr[783]: F22CD40D2E: removed
    Apr 15 06:30:34 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:30:59 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:32:16 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:48:55 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:32:16 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:48:55 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/qmgr
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:18:08 info postfix/pickup[13483]: 8D8F440DF0: uid=0 from=<root>
    Apr 15 07:18:08 info postfix/cleanup[13494]: 8D8F440DF0: message-id=<[email protected]>
    Apr 15 07:18:08 info postfix/qmgr[783]: 8D8F440DF0: from=<[email protected]>, size=4085, nrcpt=1 (queue active)
    Apr 15 07:18:08 info postfix/local[13502]: 8D8F440DF0: to=<[email protected]>, orig_to=<root>, relay=local, delay=86, delays=56/3.2/0/27, dsn=2.0.0, status=sent (delivered to mailbox)
    
    Code:
    messages
    Apr 15 06:15:02 info systemd: Started Session 2626430 of user root.
    Apr 15 06:15:03 info systemd: Starting Session 2626430 of user root.
    Apr 15 06:15:03 info systemd: Created slice User Slice of nginx.
    Apr 15 06:15:03 info systemd: Starting User Slice of nginx.
    Apr 15 06:15:03 info systemd: Started Session 2626431 of user nginx.
    Apr 15 06:15:03 info systemd: Starting Session 2626431 of user nginx.
    Apr 15 06:15:04 info systemd: Removed slice User Slice of nginx.
    Apr 15 06:15:04 info systemd: Stopping User Slice of nginx.
    Apr 15 06:20:02 info systemd: Started Session 2626661 of user root.
    Apr 15 06:20:02 info systemd: Starting Session 2626661 of user root.
    Apr 15 06:29:00 info systemd-logind: Failed to start session scope session-2626800.scope: Connection timed out
    Apr 15 06:30:27 info systemd: Failed to start Session 2626800 of user root.
    Apr 15 06:31:38 info systemd: Started Session 2626985 of user root.
    Apr 15 06:32:04 info systemd: Starting Session 2626985 of user root.
    Apr 15 06:32:09 info systemd: Created slice User Slice of nginx.
    Apr 15 06:32:11 info systemd: Starting User Slice of nginx.
    Apr 15 06:32:11 info systemd: Failed to start Session 2626986 of user nginx.
    Apr 15 06:32:23 info systemd: Removed slice User Slice of nginx.
    Apr 15 06:32:24 info systemd: Stopping User Slice of nginx.
    Apr 15 06:32:11 info systemd: Starting User Slice of nginx.
    Apr 15 06:32:11 info systemd: Failed to start Session 2626986 of user nginx.
    Apr 15 06:32:23 info systemd: Removed slice User Slice of nginx.
    Apr 15 06:32:24 info systemd: Stopping User Slice of nginx.
    Apr 15 07:16:36 info journal: Permanent journal is using 96.0M (max allowed 4.0G, trying to leave 4.0G free of 63.1G available → current limit 4.0G).
    Apr 15 07:16:36 info journal: Journal started
    Apr 15 07:16:36 info systemd: systemd-journald.service watchdog timeout (limit 3min)!
    Apr 15 07:16:36 info rsyslogd: imjournal: journal reloaded... [v8.24.0 try http://www.rsyslog.com/e/0 ]
    Apr 15 07:16:36 info journal: Journal stopped
    Apr 15 07:18:08 info journal: Permanent journal is using 96.0M (max allowed 4.0G, trying to leave 4.0G free of 63.1G available → current limit 4.0G).
    Apr 15 07:18:08 info journal: Journal started
    Apr 15 07:19:55 info systemd: Removed slice User Slice of nginx.
    Apr 15 07:19:55 info systemd: Stopping User Slice of nginx.
    Apr 15 07:20:02 info systemd: Started Session 2628008 of user root.
    Apr 15 07:20:02 info systemd: Starting Session 2628008 of user root.
    Apr 15 07:25:01 info systemd: Started Session 2628202 of user root.
    Apr 15 07:25:02 info systemd: Starting Session 2628202 of user root.
    Code:
    journalctl
    Apr 15 06:20:02 info systemd[1]: Started Session 2626661 of user root.
    Apr 15 06:20:02 info CROND[13401]: (root) CMD (/usr/lib64/sa/sa1 1 1)
    Apr 15 06:20:02 info systemd[1]: Starting Session 2626661 of user root.
    Apr 15 06:29:00 info systemd-logind[121]: Failed to start session scope session-2626800.scope: Connection timed out
    Apr 15 06:30:21 info crond[13422]: pam_systemd(crond:session): Failed to create session: Connection timed out
    Apr 15 06:30:27 info systemd[1]: Failed to start Session 2626800 of user root.
    Apr 15 06:30:34 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:30:37 info crond[13422]: (root) INFO (Job execution of per-minute job scheduled for 06:25 delayed into subsequent minute 06:30. Skipping job r
    Apr 15 06:30:59 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:31:38 info systemd[1]: Started Session 2626985 of user root.
    Apr 15 06:32:02 info crond[13435]: pam_systemd(crond:session): Failed to create session: Connection timed out
    Apr 15 06:32:04 info systemd[1]: Starting Session 2626985 of user root.
    Apr 15 06:32:09 info crond[13436]: pam_systemd(crond:session): Failed to create session: Connection timed out
    Apr 15 06:32:09 info systemd[1]: Created slice User Slice of nginx.
    Apr 15 06:32:09 info crond[13436]: (nginx) INFO (Job execution of per-minute job scheduled for 06:30 delayed into subsequent minute 06:31. Skipping job
    Apr 15 06:32:11 info systemd[1]: Starting User Slice of nginx.
    Apr 15 06:32:11 info crond[13435]: (root) INFO (Job execution of per-minute job scheduled for 06:30 delayed into subsequent minute 06:31. Skipping job r
    Apr 15 06:32:11 info systemd[1]: Failed to start Session 2626986 of user nginx.
    Apr 15 06:32:16 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 06:32:23 info systemd[1]: Removed slice User Slice of nginx.
    Apr 15 06:32:24 info systemd[1]: Stopping User Slice of nginx.
    Apr 15 06:48:55 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info systemd-journal[13462]: Permanent journal is using 96.0M (max allowed 4.0G, trying to leave 4.0G free of 63.1G available <E2><86>
    Apr 15 07:16:36 info systemd-journal[13462]: Journal started
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info systemd[1]: systemd-journald.service watchdog timeout (limit 3min)!
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/qmgr
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info crond[13460]: (root) INFO (Job execution of per-minute job scheduled for 07:13 delayed into subsequent minute 07:14. Skipping job r
    Apr 15 07:16:36 info crond[13454]: (root) INFO (Job execution of per-minute job scheduled for 06:48 delayed into subsequent minute 07:14. Skipping job r
    Apr 15 07:16:36 info crond[13453]: (root) INFO (Job execution of per-minute job scheduled for 06:36 delayed into subsequent minute 07:14. Skipping job r
    Apr 15 07:16:36 info crond[13457]: (root) INFO (Job execution of per-minute job scheduled for 06:50 delayed into subsequent minute 07:14. Skipping job r
    Apr 15 07:16:36 info CROND[13485]: (root) CMD (/usr/lib64/sa/sa1 1 1)
    Apr 15 07:16:36 info CROND[13486]: (nginx) CMD (/usr/local/bin/php -f /home/nginx/domains/domain.name/public/cron.php)
    Apr 15 07:16:36 info postfix/master[775]: warning: unix_trigger_event: read timeout for service public/pickup
    Apr 15 07:16:36 info rsyslogd[117]: imjournal: journal reloaded... [v8.24.0 try http://www.rsyslog.com/e/0 ]
    Apr 15 07:16:36 info systemd-journal[13462]: Journal stopped
    Apr 15 07:18:08 info systemd-journal[13501]: Permanent journal is using 96.0M (max allowed 4.0G, trying to leave 4.0G free of 63.1G available <E2><86>
    Apr 15 07:18:08 info systemd-journal[13501]: Journal started
    To me it looks like there is empty space in the logs about the high load time. Is it just because system was in high use or did someone get in and swiped his/her steps..?
    This is the first time when I got automated warning, so I'm bit confused with this system.

    I didn't know ngingx receives mail. How do I read them?
    Code:
    su nginx -c mutt
    This account is currently not available.
    Or how do I change mailbox?
     
  2. eva2000

    eva2000 Administrator Staff Member

    40,190
    8,888
    113
    May 24, 2014
    Brisbane, Australia
    Ratings:
    +13,698
    Local Time:
    10:33 PM
    Nginx 1.15.x
    MariaDB 5.5/10.x
    CSF Firewall was originally made for cpanel/WHM servers which had apache server status and is defined in /etc/csf/csf.conf so just ignore that part for now
    Code (Text):
    fgrep -n -C15 '/server-status' /etc/csf/csf.conf
    1696-
    1697-# Check the PT_LOAD_AVG minute Load Average (can be set to 1 5 or 15 and
    1698-# defaults to 5 if set otherwise) on the server every PT_LOAD seconds. If the
    1699-# load average is greater than or equal to PT_LOAD_LEVEL then an email alert is
    1700-# sent. lfd then does not report subsequent high load until PT_LOAD_SKIP
    1701-# seconds has passed to prevent email floods.
    1702-#
    1703-# Set PT_LOAD to "0" to disable this feature
    1704-PT_LOAD = "600"
    1705-PT_LOAD_AVG = "15"
    1706-PT_LOAD_LEVEL = "8"
    1707-PT_LOAD_SKIP = "3600"
    1708-
    1709-# This is the Apache Server Status URL used in the email alert. Requires the
    1710-# Apache mod_status module to be installed and configured correctly
    1711:PT_APACHESTATUS = "http://127.0.0.1/server-status"
    


    CPU load optimal is 1.0 per cpu thread so 8 cpu thread normal = 8x1 = 8.0. 16 cpu server normal is 16x1 = 16.0

    so cpu load of 8 for
    • 2 cpu server = 8/2 = 4.0 load average
    • 4 cpu server = 8/4 = 2.0 load average
    • 8 cpu server = 8/8 = 1.0 load average
    When you type top command, hit number 1 to break down per cpu stats

    Centmin Mod is provided as is, so troubleshooting load issues is left to end user to do. However, there's many linux tools and scripts that can help you figure out what was causing the load issues and when.

    If you're providing info on this forum, more info might be helpful
    1. What version of Centmin Mod ? .08 stable or .09 beta ? If .09 beta when was it installed and when was last time you updated ?
    2. What's your VPS/Server hardware specifications ? cpu type ? memory available ? disk space ?
    3. Who's your web host ? Different hosts have different limits for server resource usage and some are more restrictive that others so it could just be their resource usage policy you tripped which may or may not be restrictive. If restrictive, then real solution would be finding a better web host. If not restrictive, then it's finding out what caused high cpu load.
    Tools and commands you will want to read up on and learn for basic system admin tasks and troubleshooting.
    Notes:
    For example what's output for these commands
    Code (Text):
    top -bn1
    

    Code (Text):
    ps aufxw
    

    Code (Text):
    sar -q
    

    Code (Text):
    sar -r
    

    Code (Text):
    sar -q -f /var/log/sa/sa$(date +%d -d yesterday)
    

    Code (Text):
    sar -r -f /var/log/sa/sa$(date +%d -d yesterday)
    

    If you want to look at per process usage stats use pidstat run pidstat every 1 sec for 5 runs and sed is just to replace your main hostname with word hostname so mask your domain for posting on public places
    Code (Text):
    pidstat -durh 1 5 | sed -e "s|$(hostname)|hostname|g"
    

    for posting code you might want to use CODE tags for code How to use forum BBCODE code tags :)
     
  3. Meirami

    Meirami Member

    128
    15
    18
    Dec 21, 2017
    Ratings:
    +41
    Local Time:
    3:33 PM

    1. .09 beta updated (a day before this happened) on the 14th of April. Installed 15th of March. This was first high usage alert. I'm running Nextcloud for 2 person. So it's mainly 100% idle according to top. Just tested small video upload. 1 thread went shortly ~85 id while other 3 was more than 90 id. When the high usage happened, we didn't used it. I think this was openvz problem or something from outisde caused it.
    2. HostUS
      • 2GB RAM
      • 2GB vSwap
      • 4 vCPU Cores (Fair Share)
      • 75GB Disk Space
      • 2TB Bandwidth
      OpenVZ VPS (Special)
      Amsterdam Install - Post your Centmin Mod 123.09beta01 install time stats
    3. HosUS
    Thank You @eva2000 for your great information package! I'll study it as soon as I have some spare time. :)
     
    Last edited: Apr 19, 2018
  4. eva2000

    eva2000 Administrator Staff Member

    40,190
    8,888
    113
    May 24, 2014
    Brisbane, Australia
    Ratings:
    +13,698
    Local Time:
    10:33 PM
    Nginx 1.15.x
    MariaDB 5.5/10.x
    Yes OpenVZ and HostUS host issues could be a factor. I just updated 123.09beta01's cminfo command with a cminfo top command you can also use to get a better overview of cpu, memory and disk usage stats. Still up to you to troubleshoot but cminfo top command gives you some stats to look over.
     
    • Informative Informative x 1
..