
Troubleshooting Random MySQL Problems

Discussion in 'System Administration' started by deltahf, Aug 7, 2017.

  1. deltahf

    Over the past two weeks, I've had three incidents (the third is currently ongoing) with suddenly slow database performance.

    The issue is not consistent, but when it does appear, it always begins shortly after 1:00am local (server) time. There is nothing scheduled at that time of day (backups, cron jobs, etc.) that I'm aware of which could cause it.
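    For reference, the usual places scheduled jobs can hide on a stock CentOS box, in case anyone wants to double-check the same thing:

    Code (Text):
    crontab -l        # root's own crontab
    cat /etc/crontab  # system-wide crontab
    ls /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/ /etc/cron.weekly/ /etc/cron.monthly/
    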

    Here is what the incident looks like in New Relic. As you can see, there is a drastic increase in MySQL's contribution to the overall web transaction time:

    [screenshot: New Relic web transaction overview]

    Taking a closer look at the database activity report in New Relic, unfortunately, does not offer any more clues. All of the tracked queries seem to have become "slow", with no particular outliers:

    [screenshot: New Relic database activity report]

    Restarting MySQL or any of the related services (Nginx, PHP) has no impact on the problem. The only way I can resolve the issue is by rebooting the entire server. However, rebooting now takes an unusually long time: 15 minutes, compared to the previous 3 minutes (this is a dedicated box at ReliableSite). Of course, this is not a permanent fix, as the problem randomly resurfaces a few days later.

    I am running the latest versions of Centmin Mod, Nginx (1.13.3), PHP-FPM (5.6.31), and MariaDB (5.5.57). I checked mysqld.log, but nothing stood out as unusual.

    I have no idea how to troubleshoot this further. Anyone have some ideas?
     
  2. eva2000

    Plenty of tools can diagnose this, including my mysqlmymonlite.sh script mentioned below.
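    For the MySQL side specifically, here's a minimal sketch of enabling the slow query log in /etc/my.cnf so slow statements get captured for the next incident (the log path and the 2-second threshold are just example values, adjust to taste):

    Code (Text):
    [mysqld]
    # capture statements slower than long_query_time seconds
    slow_query_log      = 1
    slow_query_log_file = /var/log/mysql-slow.log
    long_query_time     = 2
    

    Restart MariaDB afterwards, and make sure the log file location is writable by the mysql user.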

    Centmin Mod is provided as is, so troubleshooting load issues is left to the end user. However, there are many Linux tools and scripts that can help you figure out what was causing the load issues and when.

    If you're posting on this forum for help, more info would be useful:
    1. What version of Centmin Mod? .08 stable or .09 beta? If .09 beta, when was it installed and when did you last update?
    2. What are your VPS/server hardware specifications? CPU type? Memory available? Disk space?
    There are also tools and commands you will want to read up on and learn for basic system admin tasks and troubleshooting; a rough first-pass checklist is sketched below.
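    These are generic first-pass checks to run while the load spike is actually happening (iostat comes from the sysstat package on CentOS):

    Code (Text):
    uptime                     # 1/5/15 minute load averages
    top -c -b -n1 | head -30   # cpu vs iowait summary plus top processes
    iostat -x 1 5              # per-disk utilisation and await times
    dmesg | tail -50           # recent kernel/disk errors
    cat /proc/mdstat           # software RAID check/resync activity
    

    High iowait with mostly idle CPUs usually points at the disks rather than MySQL itself.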
     
  3. deltahf

    Thanks, eva. Unfortunately, I'm not able to get any more information about the problem from other tools. There are no runaway processes, no unusual memory usage, nothing in the logs that I can find.

    It's incredibly frustrating, and now, it's getting worse.

    I finally had to just reboot the server again. The full reboot took 17 minutes, sort of. I say "sort of" because the server came back online after two minutes, but MySQL did not start functioning until 15 minutes later. I know Nginx and PHP were both working, because my WordPress and XenForo pages returned "Error establishing connection to the database..." messages until MySQL came back.

    The strangest thing, however, was that I was not able to SSH into the server until after the full 17 minutes. My client just kept saying the connection was refused. Sure enough, my SSH port (2222) was not open, even though the server was obviously online and serving pages.

    Even if MySQL took 15 minutes to boot up, how could this possibly have an impact on SSH? This does not make any sense.
     
  4. eva2000

    Using CentOS 7? You can analyse system boot time info via these commands:
    Code (Text):
    systemd-analyze critical-chain --no-pager
    

    Code (Text):
    systemd-analyze blame --no-pager
    
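    You can also chase a single unit through the boot chain and pull its logs for the current boot. Substitute the actual MariaDB unit name on your install (it may be mysql.service or mysqld.service rather than mariadb.service):

    Code (Text):
    systemd-analyze critical-chain mariadb.service --no-pager
    journalctl -b -u mariadb.service --no-pager | tail -50
    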
     
  5. deltahf

    I eventually contracted @Matt to help me out with this, and he discovered the source of the problem was not MySQL at all, but a faulty SSD in my RAID array. :eek: The problem was cropping up whenever the weekly RAID health check ran:

    Code (Text):
    top - 03:20:45 up 6 days, 4:04, 2 users, load average: 4.21, 4.18, 4.10
    Tasks: 286 total, 2 running, 284 sleeping, 0 stopped, 0 zombie
    Cpu0 : 15.0%us, 0.7%sy, 0.0%ni, 84.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
    Cpu1 : 4.0%us, 1.0%sy, 0.0%ni, 95.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu2 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 54.6%id, 44.7%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu4 : 6.6%us, 0.3%sy, 0.0%ni, 93.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu5 : 0.7%us, 0.0%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu6 : 0.3%us, 0.3%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 10.7%id, 89.3%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 32836560k total, 32484108k used, 352452k free, 158024k buffers
    
    root @ rS ~ $ cat /proc/mdstat
    Personalities : [raid1]
    md4 : active raid1 sdc2[0] sdd2[1]
    960245760 blocks super 1.1 [2/2] [UU]
    bitmap: 0/8 pages [0KB], 65536KB chunk
    
    md0 : active raid1 sdb2[1] sda2[0]
    511936 blocks super 1.0 [2/2] [UU]
    
    md3 : active raid1 sdd1[1] sdc1[0]
    16375808 blocks super 1.1 [2/2] [UU]
    
    md2 : active raid1 sdb3[1] sda3[0]
    233030656 blocks super 1.1 [2/2] [UU]
    [=======>.............] check = 38.7% (90359296/233030656) finish=2572.5min speed=923K/sec
    bitmap: 2/2 pages [8KB], 65536KB chunk
    
    md1 : active raid1 sdb1[1] sda1[0]
    16375808 blocks super 1.1 [2/2] [UU]
    
    unused devices: <none>

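    The giveaways in the output above are the two cores stuck in heavy iowait (44.7% and 89.3%) and the md2 check crawling along at 923K/sec with an estimated finish over 2,500 minutes away. For anyone who runs into the same symptom: on CentOS the weekly check is typically kicked off from /etc/cron.d/raid-check, and a running check can be inspected or cancelled through the md sysfs interface (md2 below matches the affected array above; substitute your own device):

    Code (Text):
    # is a check/resync running, and how fast is the kernel letting it go?
    cat /proc/mdstat
    cat /proc/sys/dev/raid/speed_limit_min /proc/sys/dev/raid/speed_limit_max
    
    # cancel the in-progress check on md2 (cron will schedule a new one next week)
    echo idle > /sys/block/md2/md/sync_action
    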

    My host, ReliableSite, took the server offline for more analysis and confirmed the drive was faulty. They had it replaced within two hours, and now the server is running better than ever, and my site is faster than ever. :D

    For anyone curious, here's how the New Relic graphs changed...

    [screenshot: New Relic graphs after the drive replacement]

    Thanks again, @Matt! I would strongly recommend his services to anyone who needs an extra bit of help. :)
     
  6. eva2000

    Very nice work, @Matt. Yup, faulty disks are a thing, which is why this forum is still on Linode, for disk redundancy and health. :)
     