
WebPerf 6-Part Guide to NGINX Application Performance Optimization

Discussion in 'All Internet & Web Performance News' started by eva2000, Mar 10, 2016.


    This guide was written by Stéphane Brunet. He is the Director of Development at SuperTCP, a performance optimizer for applications that serve rich media content.

    When I think about web app performance, I think about NGINX, CDNs, HTTP/2, and a few other hot technologies. And the undertone in all of these is the idea that we can use scale and multiplicity to our advantage. Whether via specialization, bulk efficiency, deferral, or old fashioned common sense, less becomes more.

    In this guide, I’ll focus specifically on NGINX reverse proxy, sharing a ton of simple yet powerful ways to optimize it for rich content applications. These tips will help you improve the responsiveness of your mobile/web app and maximize download throughput for larger assets such as streaming videos and shared files.

    Time Spent Analysis


    Before diving into details, let’s consider where and why optimizations are likely to make a meaningful difference for a web app that delivers rich content.

    Round-Trip Time (RTT)


    The RTT is the single most important variable as it affects both latency and throughput. Its effects are expressed in all three phases of a client request: initial connection setup time, request-response latency, and throughput in the delivery of the response body.

    Establishing a new HTTPS connection involves the DNS query (typically < 10ms), the TCP handshake, and the TLS tunnel negotiation. Once the HTTPS connection is established, each request/response will require (at least) one more RTT. If you’re keeping score, a single HTTPS request over a new connection requires a minimum of four RTTs:

    [Diagram: the round trips required for a single HTTPS request over a new connection]

    NOTE: Additional round trips may be required if the size of data to send doesn’t yet fit in the TCP congestion window. In fact, the impact of a large initial amount of data to send may be felt in both directions of a TCP transaction: in terms of the client request size (e.g. a large POST), and in terms of the server’s TLS certificate chain and the server response body.

    Finally, in terms of TCP throughput, don’t forget that it’s inversely proportional to the RTT. All things being equal, doubling the RTT will halve the TCP throughput. As such, we need to consider the location of the origin server(s) relative to the clients.


    The RTT from US East Coast to US West Coast is on the order of 80 ms, which is quite substantial. US East Coast to Asia can be a mind-numbing 150-400 ms depending on the route taken. Moreover, the RTT between endpoints may increase (bufferbloat) and fluctuate wildly (jitter) as network congestion occurs.

    As stated in O’Reilly’s High Performance Browser Networking, if an application is designed for human consumption, it must consider the human perception of time. Thus we should endeavor to produce visual feedback in under 250 ms to keep a user engaged – whether it be presenting static content, starting video playback, or acknowledging a request for action.

    On the other hand, if you’re using a CDN and it has a Point of Presence (POP) near your end users, the RTT can be very low, on the order of 5-15 ms. The cacheable content will be fetched from the CDN edge node where browsers and mobile apps can start to render while dynamic content is fetched upstream, improving perceived responsiveness.


    Related Tutorial: What Is Content Caching?


    To save the connection setup time to origin servers, CDN edge nodes typically maintain warm connections – the same kind of trick that browsers use to keep connections warm with the Connection: keep-alive HTTP/1.1 header. In fact, some CDNs go as far as to provide a tiered hierarchy of intermediate nodes for increased keep-alive scalability and connection collapsing. MaxCDN does this with Origin Shield.

    If you’re not yet using a CDN, you’re at the mercy of the RTT to your origin server(s). But you can use the same kinds of tricks that browsers and CDNs use to improve RTT. Some of these tricks include:

    • Using static content caching built into the mobile app
    • Having the app maintain ready-to-use connections to your origin
    • Deploying new origin servers around the world
    • Using an NGINX reverse proxy with content caching, content relay, and load balancing at your origin

    A really great read is Ilya Grigorik’s post on Eliminating Roundtrips with Preconnect. While this guide mostly discusses browsers and web apps, the same concept can be leveraged by mobile app developers.

    Beyond that, NGINX supports HTTP/2 as of version 1.9.5. This is the best way to avoid connection setup time as HTTP/2 performs all requests within a single tunnel. (Bonus: it doesn’t require changes to your HTTP/1.1 application semantics.) Of course, to make full use of HTTP/2, you’ll have to abandon domain sharding, traditionally used as a trick to bypass a browser’s maximum endpoint connections limit.
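
    For reference, here's a minimal sketch of an HTTP/2 listener on NGINX 1.9.5+ (the server name and certificate paths are placeholder assumptions, not values from this guide):

    server {
        listen 443 ssl http2;
        server_name example.com;
        ssl_certificate     /etc/ssl/example.com.crt;
        ssl_certificate_key /etc/ssl/example.com.key;
    }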

    Request processing time (RPT)


    Server side processing may be quick for simple apps, but it’s typical to see a few hundred milliseconds or more for requests to turn around. It can vary wildly depending on the nature of the request.

    Scripting and database query optimizations are low hanging fruit, but application design optimizations will likely take you further. The trick is finding ways to turn blocking waits into parallel work.

    In the end, a single application server isn’t enough for busy apps. As requests start to queue up, processing latency becomes a major part of your measured response time. That’s why deploying a reverse proxy to front your application is a total game changer: it unlocks “superpowers” such as load balancing, content caching, content relay, and micro-caching.

    In general, your time is very well spent if you can find ways to free up your app servers. Have them focus on business logic and dynamic content generation – nothing else.

    Response delivery time (RDT)


    I’ll loosely define the response delivery time as the elapsed time between the generation of the app server’s response, and its delivery to the user (or to the CDN edge node).

    We can optimize RDT a few different ways based on the goals we set and the type of app we’re building. For example, a file sharing app would likely measure the download turnaround time as the key metric. A real-time video streaming app, on the other hand, might measure how long it takes to deliver the first 10 seconds of video to the user so the media player can start playback.

    To improve on these metrics, we should consider: 1) the disposition and size of the response body, 2) the encoding and buffering of the response body, and 3) the throughput efficiency and effects of scale when serving multiple users.

    Reverse Proxy Buffering


    Proxy buffering is of interest when NGINX is receiving a response from the backend. This can either happen on first fetch of a cacheable asset, or when dynamic/uncacheable content is requested.

    By design, NGINX sets up for buffering of reasonably-sized response bodies. But if responses from the backend app server don’t fit into these buffers, the response is written to a temporary file.

    For cacheable content, this is less of an issue because you’ve probably configured your cache to live on the reverse proxy’s filesystem. However, you’d want to analyze your app’s non-cacheable responses, as well as the inter-chunk gaps for chunk-encoded responses, in order to rightsize your proxy buffers.

    The proxy_buffering directive determines whether NGINX is relaying the response asynchronously (enabled by default) or synchronously (disabled).


    With proxy_buffering disabled, data received from the server is immediately relayed by NGINX, allowing for minimum Time To First Byte (TTFB).

    The amount of data that is read from the response is controlled by proxy_buffer_size – the only relevant proxy buffering directive when proxy buffering is disabled. So if TTFB is your goal, make sure that tcp_nodelay is enabled (default) and that tcp_nopush is disabled (default).
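
    As a rough sketch, a location tuned for minimum TTFB might look like the following; the /live path, the upstream name, and the buffer size are assumptions for illustration:

    location /live {
        proxy_buffering off;        # relay data to the client as soon as it arrives
        proxy_buffer_size 8k;       # the only buffer still used when buffering is off
        tcp_nodelay on;             # default: push small packets out immediately
        tcp_nopush off;             # default: don't hold packets back to fill frames
        proxy_pass http://backend;
    }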

    Warning: Disabling proxy buffering is actually quite risky, so I wouldn’t recommend it unless you know exactly what you’re doing. Typically, the reverse-proxy and the backend app servers are colocated on a very fast LAN. But the client-side connection quality can vary quite a bit and sometimes stalls.

    If the proxy’s client-side connection causes back pressure on the proxy’s upstream connection (large assets, or HTTP/2), it can hold an app server hostage as it’s forced to drain the tail end of a response at the client’s slower speed. This is especially problematic for those who prefer deploying many, less capable backend servers that are not able to support more than a few hundred simultaneous connections.

    On the flipside, with proxy_buffering enabled, be mindful of using very large proxy buffers. This may start to eat into your memory and limit the maximum number of simultaneous connections your proxy can support.

    While most folks likely configure proxy buffering and buffer sizes globally, it’s interesting to note that this set of directives can be configured per server block and even per location block, giving you ultimate flexibility to customize content delivery.


    Related Tutorial: List of NGINX Proxy Directives



    The HTTP Archive reveals the average individual response size to be less than 32KB for HTML or Javascript, so you may not need to adjust the default value for proxy_buffers.

    Take a look at your application’s response body sizes before making an uneducated guess, and try to limit proxy buffer size increases to dynamic responses since those cannot be cached. Cacheable responses will need to go to disk anyway, so there may not be much point in trying to buffer them entirely.
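
    As a sketch, you might raise buffer sizes only for a dynamic location (the /api path and the sizes below are illustrative guesses; measure your own response bodies first):

    location /api {
        proxy_buffering on;
        proxy_buffer_size 16k;      # holds the headers and the start of the body
        proxy_buffers 8 32k;        # up to 256 KB buffered in memory per request
        proxy_pass http://backend;
    }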

    NGINX can also let the application server determine proxy buffering behavior on a per-response basis with the X-Accel-Buffering HTTP response header field (set to yes or no). However, it doesn’t allow the app server to influence buffer sizes for that response, so inherited configuration values will be used. Alternatively, that header can be ignored just like any other HTTP header with the proxy_ignore_headers directive.
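
    If you’d rather not let the app server toggle buffering per response, a one-line sketch using that directive:

    proxy_ignore_headers X-Accel-Buffering;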

    Content Caching, Relay, and Micro-Caching


    Your NGINX reverse proxy is ideally suited for brute force I/O and makes a great content cache, moving the data closer to the client or the edge node. This allows you to completely free up your application servers and have them focus on business logic and dynamic content generation.

    In a perfect world, static files are served from fast local storage (SSD) on your origin’s reverse-proxy and further cached by the CDN. There are several, often complementary, ways to set up an NGINX reverse proxy for content caching and heavy lifting. They include:

    1. Micro-caching of dynamic content
    2. Caching of static content
    3. Content relay via local storage and/or app server redirect
    4. Backend storage array relay
    5. Storage service relay with response caching


    Micro-caching is the idea that dynamic, non-personalized responses can be cached for a very short amount of time (e.g. 1 sec). In fact, one could argue that personalized responses can also be cached for small amounts of time depending on the anticipated workflow.

    While it may not make intuitive sense, micro-caching allows your service to survive longer when faced with overwhelming demand or attacks. It can (somewhat artificially) boost your benchmarks.
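
    Here’s a minimal micro-caching sketch; the cache zone name, the one-second validity, and the lock/stale settings are illustrative assumptions rather than recommendations:

    # in the http block
    proxy_cache_path /var/cache/nginx/micro keys_zone=microCache:10m max_size=100m;

    # in the server block
    location / {
        proxy_cache microCache;
        proxy_cache_valid 200 1s;          # cache successful responses for one second
        proxy_cache_lock on;               # collapse concurrent misses into a single upstream fetch
        proxy_cache_use_stale updating;    # serve the stale copy while the cache refreshes
        proxy_pass http://backend;
    }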


    Related Blog Post: The Benefits of Micro-caching with NGINX


    When dealing with a manageable catalogue of static content, the simplest approach might be to have the reverse proxy host large public assets on its filesystem as a sparse webroot and serve them directly. Public assets can be served using a trivial location block with try_files (and possibly alias). Cache misses can be sent to a backend server as usual and the response can be cached:

    location / {
        alias /home/nginx/www-sparse;
        try_files $uri @backend;
    }
    location @backend {
        proxy_cache myCache;
        proxy_cache_valid 2h;
        proxy_pass http://backend;
    }

    When authentication or other business logic is required for access to assets, the app server can produce a redirect response with the X-Accel-Redirect HTTP header, asking the reverse proxy to serve the resource to the client.

    The internal directive can be used on the reverse proxy to limit access to these internally generated requests. NGINX ensures that client-side requests will never match locations marked as internal:

    location /secret {
        internal;
        alias /home/nginx/group/data;
        try_files $uri =404;
    }

    A backend storage array can also be addressed using the proxy_pass directive. If you’re using a storage service instead, you may want to cache the response as well in an effort to move the data closer to the user or the CDN edge node.

    location /external {
        proxy_cache MY_CACHE;
        proxy_cache_valid 1h;
        proxy_pass http://192.168.10.201;
    }

    Don’t forget to update the required HTTP headers and add the proxy’s IP address to the XFF header (or the new RFC 7239 Forwarded header):

    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    When proxying HTTPS client connections to an HTTP backend, the app server must generate content URLs for the proper scheme. You can communicate this scheme using the X-Forwarded-Proto header. Some Microsoft applications look for Front-End-Https instead.

    map $scheme $front_end_https {
        https   on;
        default off;
    }
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Front-End-Https   $front_end_https;

    For example, WordPress uses PHP’s $_SERVER global variable to control the HTTP(S) scheme when generating links. You can add the following snippet to your WordPress backend webroot (e.g. at the bottom of wp-config.php) to make use of the X-Forwarded-Proto header.

    <?php
    if (isset($_SERVER['HTTP_X_FORWARDED_PROTO']) && $_SERVER['HTTP_X_FORWARDED_PROTO'] === 'https') {
        $_SERVER['HTTPS'] = 'on';
    }
    ?>

    The proxy_cache_key directive determines how NGINX uniquely identifies a response body. You can explicitly reference query parameters in the cache key with NGINX variables by prepending the parameter name with “$arg_”. For example, given a request such as http://example.com/?abc=1&xyz=2, NGINX will make $arg_abc and $arg_xyz available to your configuration.
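
    A hedged example of such a key, assuming abc and xyz are the only query parameters that should differentiate cached responses:

    proxy_cache_key "$scheme$host$uri$arg_abc$arg_xyz";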


    Related Tutorial: Using Query String Parameters to Apply Custom Rules



    Load Balancing


    NGINX allows us to configure the backend with the upstream directive. Of special interest are session persistence and the load balancing scheme.


    In terms of session persistence, there are three interesting variables to consider:

    1. The round-trip time
    2. The proxy’s TCP CWND for persistent sessions
    3. The number of persistent sessions

    Performance Considerations


    A very low RTT allows for very quick connection setup between the proxy and the app server, and very fast ramp of the proxy’s throughput. So for a (typical) collocated backend, a warm connection simply reduces the work required to process uncached requests.

    However, if you’ve deployed a reverse proxy that vectors uncached requests towards a remote origin (e.g. coast to coast 80ms), keepalives can save a ton of setup time – especially if you’re obliged to provide encryption throughout. (Recall that a new TLS tunnel takes 3 RTTs to negotiate.)

    In the remote backend case, you may also want to play with tcp_slow_start_after_idle (the net.ipv4.tcp_slow_start_after_idle sysctl). This determines whether or not the CWND will fall back to its initial value after the connection becomes idle (one RTO). Typically that behavior is enabled, and it’s the desirable behavior, all things being equal. But if you’re using a dedicated point-to-point connection, you’ll want to disable it, since such a link is unlikely to suffer from congestion. It’s also unlikely that the BDP will change.

    Now, I know what you’re thinking: I don’t have a dedicated connection, but let’s be greedy and try it anyway! But when wagering, consider the risk and reward.

    Analytics would really come in handy in comparing the current congestion window size to your guess of the future BDP (e.g. median throughput rate multiplied by median RTT). Also worth consideration: the impact on your bandwidth costs, assuming the data doesn’t all make it to the far end.

    Alternatively, if your BDP is quite large compared to the initial congestion window, reconfigure your initial CWND. Again, there’s likely no need to go through any of this for a collocated backend on a fast LAN (like AWS EC2).

    Persistent Backend Connections


    The keepalive (session persistence) directive sets the maximum number of idle connections to maintain to upstream servers on a per worker process basis. In other words, when a worker process’ connections exceed the number set by keepalive, it will start to close the least-recently used idle connections until that number is reached. Think of it as a per-worker process connection pool.

    HTTP/1.1 is required to support backend keepalive, so we need to set the proxy_http_version directive and clear the Connection header, as in the following example:

    upstream backend {
        keepalive 100;
        server 192.168.100.250 weight=1 max_fails=2 fail_timeout=10;
        server 192.168.100.251 weight=1 max_fails=2 fail_timeout=10;
        server 192.168.100.252 weight=1 max_fails=2 fail_timeout=10;
    }
    server {
        location /http {
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            proxy_pass http://backend;
        }
    }

    In setting the value of keepalive, keep the following in mind:

    • That number of connections will be allocated from both the proxy’s and the app server’s connection limits (see worker_connections)
    • That value is a per-worker process setting
    • You may have configured multiple upstream blocks with the keepalive directive


    Unless you’ve configured round-robin load balancing, it’s difficult to predict the distribution of connections on a per app server basis. Rather, that depends on your load balancing scheme, the disposition of each request, and the request timing.

    The (unlikely) worst-case scenario occurs when every worker process establishes all its connections to the same backend server. However, that would be a strong indication that the load balancing scheme needs to be reconsidered.

    NOTE: If you’re running a default NGINX configuration on your application server, its worker_connections limit is set to 512.
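
    If you do raise that limit on the app server, it’s a one-line change in the events block (4096 below is just an illustrative value; pair it with a matching open-files limit):

    events {
        worker_connections 4096;    # per worker process
    }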


    Related Tutorial: Increasing NGINX Open Files Limit on CentOS 7

    Load Balancing Scheme


    NGINX offers the following load balancing schemes:

    1. Weighted round robin
    2. ip_hash: hashing based on client IPv4 or IPv6 address
    3. hash: hashing based on a user-defined key
    4. least_conn: least number of active connections
    5. least_time: NGINX Plus offers the least average response time scheme

    In terms of performance, least_time is likely preferred but may not be relevant if your backend is composed of identical, well-behaved app servers. Otherwise, hash and ip_hash offer interesting optionality. For example, if your app server takes a while to load a user’s profile, sending the same user to the same backend server may benefit from cache hits.
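
    As a sketch of the hash option, the following pins each user to the same backend via a consistent hash of a session cookie (the cookie name userid is an assumption; substitute your own identifier):

    upstream backend {
        hash $cookie_userid consistent;
        server 192.168.100.250;
        server 192.168.100.251;
    }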

    A client’s IP address is available via the $remote_addr variable. But beware of client IP hashing since a single IP address may represent several users coming in from behind the same NAT (e.g. corporate office or school).

    Throughput


    The NGINX reverse proxy configuration sets up two network legs: client-to-proxy and proxy-to-server. Not only are these two legs distinct HTTP spans, but they’re also distinct TCP network transport domains.


    Especially when serving larger assets, our goal is to ensure that TCP is making the most of the end-to-end connection. In theory, if the TCP stream is packed tightly with HTTP data, and that data is sent as quickly as possible, we’ll have achieved maximum throughput.

    But I often wonder why I’m not seeing faster downloads. Know the feeling? So let’s go downstairs into the basement and inspect the foundation.

    Network Transport Primer


    TCP employs two basic principles in deciding when and how much data to send:

    1. Flow control to ensure the receiver can accept the data
    2. Congestion control to manage network bandwidth

    Flow control is implemented by the receiver’s advertised receive window, which dictates the maximum amount of data that the receiver is willing to accept and store at one time. This window may grow – from a few KB to several MB – depending on the measured bandwidth-delay product (BDP) of the connection (more on the BDP below).

    Congestion control is implemented by the sender, which maintains its own congestion window (CWND) and limits the amount of data it transmits to the minimum of CWND and RWND. Think of this as a self-imposed constraint in the name of “network fairness.” The CWND grows with time, as the sender receives acknowledgements for previously transmitted data, and shrinks as network congestion is detected.


    Together, the sender and the receiver each play a key role in determining the maximum achievable TCP throughput. If the receiver’s RWND is too small, or if the sender is overly sensitive to network congestion or too slow to react to network congestion subsiding, then the TCP throughput will be suboptimal.

    Filling the Pipe


    Network connections are often modeled as pipes. The sender pumps data into one end and the receiver drains it at the other end.

    The BDP (expressed in KB or MB) is a product of the bitrate and the RTT, a measure of how much data is required to fill the pipe. For example, if you’re dealing with a 100 Mbps end-to-end connection and an RTT of 80 ms, the BDP is evaluated at 1 MB (100 Mbps * 0.080 sec = 1 MB).


    TCP tries to fill the pipe without spillage or rupture of the pipe, so the BDP becomes the ideal value for the RWND: the maximum amount of data that TCP can place in-flight (not yet acknowledged by the receiver).

    Assuming that there’s enough data to send (larger files), and that nothing is preventing the sending application (NGINX) from pumping that data into the pipe as fast as the pipe can accept it, the RWND and CWND can be the limiting variables in terms of achieving maximum throughput.

    Most modern TCP stacks auto-tune these parameters using the TCP timestamps and window scaling options. But older systems do not and some applications misbehave. So the two obvious questions are:

    1. How can I check that?
    2. How can I fix that?

    We’ll address question #1 below, but fixing TCP involves learning how to tune your TCP stack – a full-time job in itself. The more viable option is TCP acceleration software or hardware. And there are quite a few vendors out there, including the product I work on every day – SuperTCP.

    Checking RWND and CWND


    Trying to determine whether the RWND or the CWND are limiting factors involves comparing them to the BDP.

    To do this, we’ll packet-sniff the HTTP(S) transfer of a large asset using the tcpdump tool on the (headless) NGINX proxy, and load the capture file into Wireshark on a machine with a UI. We can then plot a few meaningful graphs to get some insight into whether these foundational variables are being set properly.

    # tcpdump -w http_get.pcap -n -s 100 -i eth0 port <80|443>

    If you use a different capture filter, just make sure that it captures both directions of the TCP HTTP conversation. Also make sure that you capture on the sending device because we need Wireshark to properly calculate the amount of in-flight data. (Performing the capture on the receiver leads Wireshark to believe that the RTT is near zero, since the receiver’s ACK may immediately follow incoming data).

    Load the http_get.pcap file into Wireshark, find the HTTP stream of interest and take note of its tcp.stream index:

    [Screenshot: Wireshark packet list showing the tcp.stream index of the HTTP transfer]

    Open the Statistics->IO Graph and configure it as follows:

    • Y-axis -> Unit: Advanced
    • Scale: Auto
    • Graph 5 (pink)
      • Filter: tcp.dstport==<80|443> && tcp.stream==<index>
      • Calc: MAX and tcp.window_size
      • Style: Impulse
    • Graph 4 (blue)
      • Filter: tcp.srcport==<80|443> && tcp.stream==<index>
      • Calc: MAX and tcp.analysis.bytes_in_flight
      • Style: FBar

    Next, make sure that the Graph 4 and Graph 5 buttons are depressed (enabled) to plot those results. Here’s an example of what you might expect:

    [Screenshot: Wireshark IO Graph plotting the RWND and the bytes in flight]

    I was using a 100 Mbps connection to GET a 128 MB file from an NGINX proxy that’s 80 ms away (AWS Oregon to/from our office in Ottawa, ON). This corresponds to a BDP of 1MB.

    Notice how the RWND (in pink) starts off small and grows to just over 1 MB after a few round-trips. This confirms that the receiver is capable of adapting the RWND and is BDP-aware (excellent). Alternatively, if we saw the RWND being reduced (aka closing), it would be an indication that the receiving application is unable to read the data fast enough – perhaps not getting enough CPU time.

    In terms of the sender’s capability – CWND (in blue) – we want an indication that the amount of in-flight data can run up against the RWND limit. We can see in the interval of 3s to 6s that the NGINX proxy was able to place the maximum amount of data allowed by the RWND in flight. This confirms that the sender is able to push enough data to satisfy the BDP.

    However, something seems to have gone terribly wrong near the 6s mark. The sender significantly reduces the amount of data placed in flight. This self-imposed behavior is typically due to congestion detection by the sender. Recall that the GET response was travelling from West Coast to East Coast, so encountering network congestion is likely.

    Identifying Network Congestion


    When the sender detects congestion, it will reduce its CWND in an effort to reduce its contribution to network congestion. But how can we tell?

    In general, TCP stacks can use two types of indicators to detect or measure network congestion: packet loss and latency variation (bufferbloat).

    • Packet loss can occur on any network, predominantly in Wi-Fi networks, or when network elements actively manage their queues (e.g. random early discard) or when they don’t manage their queues at all (tail drop). TCP will see packet loss as missing ACKs, or non-incrementing ACKs from the receiver (also known as duplicate ACKs).
    • Bufferbloat is the increase in latency (RTT) between endpoints caused by an increasing backlog of packets.

    TIP: Using the simple ping tool or the more verbose mtr tool, you can sometimes detect meaningful bufferbloat – especially if your sender can push higher data rates. It’s quite impressive and gives you a real sense of what may be happening deeper into the network.

    On the same Wireshark IO Graph window, let’s add the following:

    • Graph 2 (red)
      • Filter: tcp.dstport==<80|443> && tcp.stream==<index>
      • Calc: COUNT FIELDS and tcp.analysis.duplicate_ack
      • Style: FBar
    • Let’s also change the Scale to Logarithmic

    [Screenshot: IO Graph showing duplicate ACK counts on a logarithmic scale]

    Bingo! Evidence that duplicate ACKs were sent by the receiver, indicating that packets were in fact lost. This explains why the sender would reduce its CWND, the amount of in-flight data.

    You may also want to look for evidence of tcp.analysis.retransmission that occurs when either packets are reported as lost and must be resent, or when the sender times out waiting for an ACK from the receiver (and assumes the packets were lost or that the ACK was lost). In the latter case, look for tcp.analysis.rto.

    For both of the above, make sure to set your filter to tcp.srcport==<80|443> since retransmissions originate from the sender.

    Connection Optimizations


    RFC 7413 defines TCP Fast Open (TFO) – an extension to the TCP protocol that allows data to be carried in the TCP-SYN and TCP-SYN/ACK packets during the TCP handshake, thus eliminating one RTT.


    Upon the initial TCP connection, the client generates a TFO cookie which it places in the TCP-SYN packet as a TCP option. Later, when the client reconnects, it resends the same cookie in the TCP-SYN packet, along with data – presumably an HTTP request or a TLS ClientHello. If the server recognizes the cookie, it immediately transitions to the connected state and receives the data found in the TCP-SYN packet.

    While the server can reply with data in the TCP-SYN/ACK packet, it’s more likely (from a timing perspective) that the TCP stack will send out the TCP-SYN/ACK packet before the application data follows. Either way, the data-less TCP handshaking round trip is now used to send and receive data, saving a round trip.

    This can meaningfully accelerate HTTP requests and improve Time to First Byte (TTFB) for streaming media. The kernel implements TFO, and all that’s required by NGINX is the fastopen parameter for the listen directive.
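
    As a sketch, that’s a single extra parameter on the listener; the value 256 (the maximum queue of not-yet-completed TFO handshakes) is an arbitrary illustrative choice:

    listen 443 ssl http2 fastopen=256;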

    Support for TFO over IPv4 was merged into the Linux kernel mainline as of 3.7 (you can check your kernel version with uname -r.) If you’re running kernel 3.13 or better, chances are that TFO may be enabled by default. You can check whether TFO is enabled with the following command (Linux):

    $ cat /proc/sys/net/ipv4/tcp_fastopen

    A zero value indicates it’s disabled; bit 0 corresponds to client operations while bit 1 corresponds to server operations. Setting tcp_fastopen to 3 enables both.

    Most NGINX downloadable packages don’t include TFO support since TCP_FASTOPEN isn’t always defined in tcp.h – even if kernel support is available. You can build NGINX from source by adding the -DTCP_FASTOPEN=23 compile-time definition to NGINX’s configure script (--with-cc-opt).


    Related Tutorial: Enabling TFO for NGINX on CentOS 7

    Certificate Validation


    During the TLS negotiation, the client must validate the server’s certificate to establish that the server is in fact who it claims to be.

    Validation normally involves the client generating an OCSP request to a Certificate Authority (CA), which means generating a DNS query and a new TCP connection. All of these steps prolong the TLS negotiation with the server.

    The concept of OCSP Stapling moves the burden of the OCSP query from the client to the server. This results in the server “stapling” the CA’s timebound answer to the server’s certificate, thus streamlining the certificate validation step for the client.

    NGINX supports OCSP Stapling via ssl_stapling and its related directives.
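
    A minimal stapling sketch might look like this; the trusted certificate path and the resolver address are assumptions:

    ssl_stapling on;
    ssl_stapling_verify on;
    ssl_trusted_certificate /etc/ssl/ca-chain.pem;    # CA and intermediates used to verify the stapled OCSP response
    resolver 8.8.8.8 valid=300s;                      # NGINX needs a resolver to reach the OCSP responder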

    TLS Session Resumption


    There are two mechanisms that can be used to eliminate a round trip for subsequent TLS connections (discussed below):

    1. TLS session IDs
    2. TLS session tickets


    It should be noted that TLS session resumption handicaps perfect forward secrecy by leaking TLS session information on the server side. Therefore it should not be used if PFS is required. On the other hand, HTTP/2 provides the same kind of RTT optimization as TLS Session Resumption by multiplexing multiple requests into one TLS tunnel – without sacrificing PFS.

    TLS Session IDs


    During the initial TLS handshake, the server generates a TLS session ID and sends it to the client via its ServerHello message – this can be viewed in a Wireshark trace:

    [Screenshot: Wireshark trace showing the TLS session ID in the ServerHello message]

    For a subsequent TLS handshake, the client may send the session ID in its ClientHello message. This allows the server to restore the cached TLS context and avoids the 2nd round trip of a TLS handshake.

    NGINX supports TLS session IDs via the ssl_session_cache and ssl_session_timeout directives.
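
    A typical sketch (the 10 MB shared zone and 10-minute timeout are illustrative values):

    ssl_session_cache shared:SSL:10m;    # shared across worker processes
    ssl_session_timeout 10m;             # how long a cached session remains resumable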

    TLS Session Tickets


    TLS session tickets are similar in concept to TLS session IDs; however, the burden of storing session information is moved from the server to the client.

    During a normal TLS handshake, the server will encrypt its TLS session information and send the resulting ticket to the client. While the client is unable to decode the ticket, it can send it back to the server the next time it wants to establish a TLS connection.

    As with TLS session IDs, the number of round trips is reduced by one – but without the server having to maintain an SSL cache. However, the server must maintain and secure a private key (ssl_session_ticket_key) that is used to encrypt the ticket.
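
    A short sketch, with an assumed key path (the file must contain random data of the expected length, and should be rotated and shared across servers that answer for the same name):

    ssl_session_tickets on;
    ssl_session_ticket_key /etc/nginx/ssl/ticket.key;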

    Find Your Perfect NGINX Setup


    NGINX was designed to be totally unleashed. It’s a track car with racing slicks and a crazy driver. So have some fun, try some things out, and let us know how fast you can take it. Also, if you think anything should be added to this guide, let us know. We will make additions and amendments in Version 2.

    I would like to give a huge shoutout to Aaron Kaufman, Esteban Borges, Guy Podjarny, Heather Weaver, Josh Mervine, Justin Dorfman, and Robert Gibb for helping put this guide together.

    The post 6-Part Guide to NGINX Application Performance Optimization appeared first on MaxCDN Blog.
