Email Sysadmin SMTP errors solved by creating a new server from a snapshot, Hetzner shared vCPU cloud

Discussion in 'System Administration' started by MaximilianKohler, Aug 15, 2024.

  1. MaximilianKohler

    MaximilianKohler Member

    179
    5
    18
    Jun 23, 2023
    Ratings:
    +23
    Local Time:
    8:43 PM
    I'm sending emails via listmonk.app + Amazon SES. I started getting various errors out of the blue. It was solved after I made a snapshot of the server, and created a new server in a new location from that snapshot.

    This is the second time that a new-server-from-snapshot solved various issues. My server host is Hetzner and their support tells me each time that they see no problems on their end and I should check my server logs.

    I checked all the error logs listed in the Centmin Mod FAQ (FAQ - CentminMod.com LEMP Nginx web stack for CentOS, AlmaLinux, Rocky Linux), and there was nothing notable. Just a few timeouts that don't necessarily line up with the SMTP errors.

    This has to be an issue on Hetzner's end, no? Has anyone else needed to regularly create new servers from snapshots to solve various problems?

    I know that server hosts have some influence on SMTP functionality since they can block the ports by default, but this set of SMTP-exclusive errors seems odd since everything else seemed to be working normally.

    I have another website on the same server running Xenforo and sending emails with another Amazon SES account, but I didn't think to try sending any test emails with it. Activity there is infrequent enough that it probably wasn't trying to send any emails during the same period.

     
  2. eva2000

    eva2000 Administrator Staff Member

    53,223
    12,116
    113
    May 24, 2014
    Brisbane, Australia
    Ratings:
    +18,654
    Local Time:
    1:43 PM
    Nginx 1.27.x
    MariaDB 10.x/11.4+
    What are your AWS SES account's daily quota and per-second sending rate limits right now? AWS SES sending rate limits could be contributing to the issues you're experiencing. AWS SES limits both the number of emails you can send per day and the rate at which emails can be sent per second. This could explain why the issues are intermittent or resolve themselves after a while. See the listmonk issue: enhancement: Max message rate per rolling window · Issue #119 · knadh/listmonk.

    You should send AWS SES emails in line with your current allowed daily and per-second sending limits and let AWS SES automatically determine when it's safe to increase your limits based on your sending practices, i.e. low bounce/complaint rates. See Managing your Amazon SES sending limits - Amazon Simple Email Service (which includes message size limits) and Increasing your Amazon SES sending quotas - Amazon Simple Email Service.

    AWS SES Sending Limits
    1. Daily Sending Quota: The total number of emails you can send in a 24-hour period.
    2. Sending Rate Limit: The maximum number of emails you can send per second.
    To determine if rate limits are the issue, you can:
    • Check SES Sending Statistics: In the AWS SES console, you can monitor your sending limits, including your sending quota and rate.
    • Reduce Batch Size and Rate: Lower your batch size and sending rate temporarily to see if this resolves the issue. This will reduce the likelihood of hitting rate limits.
    • Request a Quota Increase: If you find that your sending needs exceed your current limits, you can request an increase in your SES quota through the AWS console; see Increasing your Amazon SES sending quotas - Amazon Simple Email Service (only do this if you have bounce/complaint rates under control).
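
    The checks above can be sketched in Python. This is a minimal sketch, not a definitive tool: it assumes you have boto3 and AWS credentials set up, the region name is a placeholder, and the `safety` headroom factor is a made-up parameter rather than anything SES defines. `get_send_quota` returns `Max24HourSend`, `MaxSendRate` and `SentLast24Hours`.

```python
def quota_usage(quota, safety=0.5):
    """Summarise an SES GetSendQuota-style response.

    quota:  dict with Max24HourSend, MaxSendRate and SentLast24Hours.
    safety: hypothetical headroom factor (not an SES setting); 0.5 means
            "aim for half of the allowed per-second rate".
    """
    daily_max = quota["Max24HourSend"]
    sent = quota["SentLast24Hours"]
    target_rate = quota["MaxSendRate"] * safety
    return {
        # SES reports -1 for an unlimited daily quota, so guard the division.
        "daily_used_pct": 100.0 * sent / daily_max if daily_max > 0 else 0.0,
        "remaining_today": max(daily_max - sent, 0.0),
        "target_rate": target_rate,
        # Pause between messages that keeps you at the target rate.
        "min_interval_s": 1.0 / target_rate if target_rate > 0 else float("inf"),
    }


def fetch_quota(region="us-east-1"):
    """Fetch live numbers. The region is an assumption; use your own."""
    import boto3  # imported lazily so quota_usage works without boto3

    return boto3.client("ses", region_name=region).get_send_quota()
```

    With limits of 250,000 emails/day and 50 emails/second and the default `safety` of 0.5, `quota_usage` gives a target of 25 emails/second, i.e. a pause of 0.04 seconds between messages.
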
    I've seen SES accounts start out with limits as low as 25,000 emails/day and 14 emails/second. I'm at 250,000 emails/day and 50 emails/second right now.
     
  3. MaximilianKohler

    MaximilianKohler Member

    I'm using 0.1% of my limits and I was getting the errors when attempting to send a single test email. And after creating the new Hetzner server the errors are gone completely and I sent out thousands of emails. I have low bounce/complaint rates.

    Regarding the issue being intermittent, I started getting errors, then one email went through, then I was back to getting errors. All in the space of ~30 minutes and sending out single test messages.

    My current limits are something like 1mil/day, 250/s. I typically send a few thousand emails per day.
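
    One way to tell whether intermittent failures like these happen at the network level, before SES or listmonk are even involved, is to time a bare TCP connection to the SMTP endpoint from the server itself. A minimal sketch, assuming a STARTTLS port such as 587 where the server sends a plain-text greeting (on an implicit-TLS port like 465 no banner arrives before the TLS handshake). `probe_tcp` is a made-up helper name.

```python
import socket
import time


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and report how long it took.

    Returns (ok, seconds, detail). On success, detail is whatever greeting
    the server sent before the timeout; on failure, it is the error text.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                banner = sock.recv(512).decode("ascii", "replace").strip()
            except socket.timeout:
                banner = "(connected, no banner before timeout)"
            return True, time.monotonic() - start, banner
    except OSError as exc:
        # Covers refused connections, DNS failures and connect timeouts.
        return False, time.monotonic() - start, str(exc)
```

    Calling something like `probe_tcp("email-smtp.us-east-1.amazonaws.com", 587)` a few times from the old and the new server would show whether connects fail or stall on one and not the other. The region in that hostname is an assumption; use the SES SMTP endpoint for your own region.
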
     
  4. eva2000

    eva2000 Administrator Staff Member

    If rate limits aren't the issue, then I would work backwards from PostgreSQL database, PHP-FPM and Nginx proxy related load and timeout settings, as your "context deadline exceeded" timeouts (Golang related) and SMTP timeouts do point to some form of timeout in play, I suspect. You probably need to further tune your PostgreSQL database, PHP-FPM and Nginx settings.

    I don't have any practical experience with Listmonk for mailings. I only use Sendy.co for that on Centmin Mod, and I know quite a few Centmin Mod users who also use Sendy.co without issues.
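
    To work backwards from the timeouts, it can help to count timeout-like log lines per minute and compare the busy minutes across the application, Nginx and database logs. A rough sketch: the "YYYY/MM/DD HH:MM:SS" timestamp prefix is an assumption about the log format, so adjust `STAMP` to match your actual logs, and extend the phrase list as needed.

```python
import re
from collections import Counter

# Phrases taken from the errors discussed in this thread; extend as needed.
TIMEOUT_PATTERNS = ("context deadline exceeded", "timeout", "timed out")

# Assumed Go-style "YYYY/MM/DD HH:MM:SS" prefix; adjust for your logs.
STAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}")


def timeouts_per_minute(lines):
    """Count timeout-like log lines, grouped by minute."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        if not any(pattern in lowered for pattern in TIMEOUT_PATTERNS):
            continue
        match = STAMP.match(line)
        # Lines without a recognisable timestamp are still counted.
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

    Feeding it an open log file (`timeouts_per_minute(open(path))`) and printing `counts.most_common(10)` shows which minutes had the most timeouts.
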
     
  5. MaximilianKohler

    MaximilianKohler Member

    How can it be an issue with those settings if it was solved by creating a new server from a snapshot? Everything should be identical.

    And I've sent hundreds of thousands of emails per day before without issue, and this issue arose during a low-volume period when single emails were being sent with significant delay between them.

    I've been doing the same thing every day (besides one period ~4 months ago with a much higher send volume) on this server for over a year now I think.
     
  6. eva2000

    eva2000 Administrator Staff Member

    One common cause on MySQL, at least, is database/table bloat and fragmentation over time that reduces performance and introduces slow queries. This causes PHP-FPM processes to queue up, which in turn causes Nginx to time out while waiting for PHP-FPM. I have seen this many times with clients who hire me to optimize their servers.

    If you back up your database and reimport it into a new server, you effectively defragment and optimize your database, which could improve the issue, at least for MySQL (and probably for PostgreSQL, too). But if you're doing snapshot restores, then it could be something else, or something still related, e.g. database processes or features that cause table- or row-level locking. That has the same effect: PHP-FPM processes queue up, which in turn causes Nginx to time out while waiting for PHP-FPM.
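
    For PostgreSQL, one rough proxy for this kind of bloat is the share of dead tuples per table, which the standard `pg_stat_user_tables` view exposes. A minimal sketch, assuming you run the query with whatever client you prefer and pass the rows in. The 0.2 threshold is an arbitrary illustration, not a PostgreSQL default, and a high ratio only suggests that vacuuming is falling behind; it does not measure fragmentation directly.

```python
# Standard PostgreSQL statistics view; run this with any client (e.g. psql).
BLOAT_QUERY = """
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
"""


def dead_tuple_report(rows, threshold=0.2):
    """Return (table, dead_ratio) pairs at or above threshold, worst first.

    rows:      iterable of (relname, n_live_tup, n_dead_tup) tuples.
    threshold: hypothetical cut-off chosen for illustration only.
    """
    report = []
    for name, live, dead in rows:
        total = live + dead
        ratio = dead / total if total else 0.0
        if ratio >= threshold:
            report.append((name, round(ratio, 3)))
    return sorted(report, key=lambda item: item[1], reverse=True)
```

    An empty report on the old server would be one more sign that the database isn't the cause.
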
     
  7. MaximilianKohler

    MaximilianKohler Member

    I'm not sure if rebooting the server is the solution for that, but I rebooted before trying the snapshot method, and the reboot didn't solve it.

    My server/database usage is pretty low, the databases themselves are not large, and I recently backed them up and restored them to a new server when upgrading from CentOS 7 to AlmaLinux, so I'm doubtful that they're responsible for this.

    EDIT:
    Oh, I mentioned this on the GitHub issue, but the only other unique thing I did was upgrade nginx from 1.27.0 to 1.27.1 just prior to getting those errors. Reverting to 1.27.0 didn't help, and I did spot a few earlier errors in the logs from a week or two before.
     
    Last edited: Aug 17, 2024