How a Slow Carrier API Quietly Turned a WooCommerce Checkout Into a 504 Factory

15 min read

A client running a UK homewares store pinged me on a Wednesday afternoon. Their checkout was "sort of working" — most customers got through, but every few minutes somebody would hit a 504 Gateway Timeout on the final step. The site was up. The cart page was fine. Adding products was fine. Only the checkout was flaky, and only during the afternoon.

By the time I finished, the real problem was not WordPress, not the database, not the plugin stack. It was a single blocking HTTP call to a third-party shipping rate API that was quietly pinning PHP-FPM workers and dragging the whole site down every time the carrier had a bad five minutes.

Here is exactly how I traced it and what I changed to stop it happening again.

The symptoms did not match the obvious causes

I get called into "checkout is slow" jobs fairly often. Usually it is one of a short list of things: wp_options autoload bloat, a runaway Action Scheduler queue, a bloated cart-fragments request, a misconfigured object cache, or a DDoS hitting /?add-to-cart=. This one did not look like any of those.

Nginx was returning 504 Gateway Time-out after exactly 60 seconds — the fastcgi_read_timeout on the server. The WooCommerce status log showed no errors from the store itself. Database load was normal. CPU was fine. Redis was happily serving object cache hits. From the outside, nothing was broken.

The only real clue was timing: the 504s only happened between roughly 13:00 and 17:00 UK time. Outside of that window, checkout was perfectly fast. That told me something external was misbehaving during business hours, and whatever it was, it was blocking long enough to chew through all my PHP-FPM workers.

Step 1: Find what the workers are actually doing

First job was to catch a request in the act. I enabled the PHP-FPM slow log on the pool:

; /etc/php/8.3/fpm/pool.d/www.conf
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/slow.log

After a reload and about twenty minutes of waiting, the slow log started showing entries like this:

[12-Apr-2026 14:23:41]  [pool www] pid 18422
script_filename = /home/store/htdocs/index.php
[0x00007fe3c1234abc] curl_exec() /home/store/htdocs/wp-includes/class-wp-http-curl.php:175
[0x00007fe3c1234cde] request() /home/store/htdocs/wp-includes/class-requests.php:388
[0x00007fe3c1234f12] request() /home/store/htdocs/wp-includes/class-wp-http.php:402
[0x00007fe3c12350ab] post() /home/store/htdocs/wp-content/plugins/acme-shipping/includes/class-acme-api.php:212
[0x00007fe3c1235234] get_rates() /home/store/htdocs/wp-content/plugins/acme-shipping/includes/class-acme-shipping-method.php:186
[0x00007fe3c1235456] calculate_shipping() /home/store/htdocs/wp-includes/class-wc-shipping-zone.php:341

I have anonymised the plugin name, but the pattern was unmistakable. Every slow request was a customer triggering calculate_shipping() — either through the shipping calculator on the cart, or at checkout — and every one of those calls was stuck inside curl_exec() talking to a single external shipping rate endpoint.

To confirm, I grabbed the PID of a stuck worker from ps and ran strace against it:

strace -p 18422 -e trace=network,read,write 2>&1 | head -40

The output was a single recvfrom call that never returned:

recvfrom(14, 0x7ffd1e2a1b30, 16384, 0, NULL, NULL) = ?

That is a worker sitting on a TCP socket waiting for bytes that are not coming. It was not doing anything else. It was just waiting for a carrier API to respond.

Step 2: Measure the damage from the outside

I wanted to know how bad the carrier was really being. A quick curl against their rates endpoint showed wild variability:

for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{time_total}\n" \
    -X POST https://api.example-carrier.com/v2/rates \
    -H 'Authorization: Bearer REDACTED' \
    -d @payload.json
  sleep 1
done

Most requests came back in 200-400ms. But roughly one in six took more than 25 seconds, and two of the twenty never returned at all before I gave up. The plugin was using WordPress's default HTTP timeout, which for wp_remote_post() is 5 seconds — but the plugin had explicitly raised it to 30 seconds via the http_request_args filter "to avoid false failures during peak hours". Well meant. Catastrophic in practice.

Step 3: The worker math explains the outage

The server had 20 PHP-FPM workers configured (pm.max_children = 20) for a checkout-heavy store on a 4GB VPS. That number was sensible for normal traffic: WooCommerce pages take under 300ms on this site when things are healthy, so 20 workers can sustain roughly 65 requests per second — far more than the store ever sees.

But when every checkout request spent 25-30 seconds waiting on the carrier API, the math flipped. If the carrier was slow for a five minute stretch and around eight customers hit checkout in that window, eight of my twenty workers would be blocked on curl_exec() doing absolutely nothing. Add a couple of WordPress admin requests and a handful of cart fragment polls, and the remaining workers were saturated. New requests queued up behind them. Once the queue exceeded 60 seconds, Nginx started returning 504s.
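The back-of-envelope version of that math, as a language-neutral Python sketch (the worker count and timings are the incident's numbers, not a simulation of the real traffic profile):

```python
# Worker-pool capacity before and after the carrier API went slow.
# Little's law: a pool of N workers, each busy for S seconds per request,
# sustains at most N / S requests per second.

WORKERS = 20       # pm.max_children
HEALTHY_S = 0.3    # normal WooCommerce page time, seconds
BLOCKED_S = 30.0   # worst-case wait on the carrier API, seconds

def max_throughput(workers, service_time_s):
    """Sustainable requests per second for a blocking worker pool."""
    return workers / service_time_s

healthy = max_throughput(WORKERS, HEALTHY_S)    # ~66 req/s
degraded = max_throughput(WORKERS, BLOCKED_S)   # under 1 req/s

# With eight checkouts parked on curl_exec(), only twelve workers
# remain for every other request on the site.
stuck_checkouts = 8
remaining = WORKERS - stuck_checkouts

print(f"healthy capacity:  {healthy:.0f} req/s")
print(f"degraded capacity: {degraded:.2f} req/s")
print(f"workers left for the rest of the site: {remaining}")
```

The striking part is not the degraded number itself but the ratio: a 100x increase in service time is a 100x cut in capacity, with no CPU or memory pressure to warn you.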

In other words: a slow third party was turning my PHP-FPM pool into a waiting room for its API. This is exactly the failure mode that the classic "bulkhead" pattern is designed to prevent. WordPress does not give you bulkheads out of the box — you have to build them yourself.

Step 4: The fix, in three layers

I did not want to rip out the shipping plugin. The rates it returned were accurate and the client genuinely needs live quotes for oversized items. So I built three layers of defence around it instead.

Layer 1: Cap the cURL timeout hard

The first thing was to stop any single request from hogging a worker for more than a few seconds. I dropped a small must-use plugin into wp-content/mu-plugins/:

<?php
/*
 * Plugin Name: Acme Shipping Timeout Guard
 * Description: Cap outbound HTTP calls to the carrier API so they can never
 * block a PHP-FPM worker for more than 4 seconds.
 */

add_filter( 'http_request_args', function ( $args, $url ) {
    if ( false === strpos( $url, 'api.example-carrier.com' ) ) {
        return $args;
    }
    $args['timeout']     = 4;
    $args['redirection'] = 1;
    $args['blocking']    = true;
    return $args;
}, 999, 2 );

Four seconds is long enough to catch the 95th percentile when the carrier is healthy, and short enough that a bad run only costs me four worker-seconds per customer instead of thirty.
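To make that timeout choice reproducible rather than a gut call, you can derive it from measured latencies. A small Python sketch: the sample values below are hypothetical, shaped like the curl loop's healthy output, and the nearest-rank percentile function is mine, not from any library:

```python
# Derive a timeout cap from a sample of healthy-hour latencies (seconds).
# These numbers are illustrative, shaped like the curl measurements above.
samples_s = [0.21, 0.25, 0.31, 0.28, 0.22, 0.35, 0.40, 0.27, 0.30, 0.26,
             0.24, 0.33, 0.29, 0.38, 0.23, 0.31, 0.27, 0.36, 0.42, 0.25]

def percentile(values, p):
    """Nearest-rank percentile: tiny and dependency-free."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

p95 = percentile(samples_s, 95)          # ~0.40s when the carrier is healthy
suggested_timeout = 10 * p95             # generous 10x margin over healthy p95

print(f"healthy p95: {p95:.2f}s")
print(f"suggested cap: {suggested_timeout:.1f}s")
```

A 10x margin over the healthy 95th percentile lands at the same 4 seconds I picked by hand: loose enough that a healthy carrier never trips it, tight enough that a sick one cannot hold a worker hostage.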

Layer 2: Cache the rates per address + cart hash

Most stores ask the same carrier the same question over and over. A customer on the cart page, a customer on checkout, and a customer refreshing after fixing a postcode all hit the API with nearly identical payloads. I wrapped the plugin's rate lookup in a short-lived transient keyed on the thing that actually matters: the destination postcode plus a hash of the cart contents.

add_filter( 'acme_shipping_get_rates', function ( $rates, $package ) {
    $key = 'acme_rates_' . md5(
        $package['destination']['postcode'] .
        $package['destination']['country'] .
        wp_json_encode( wp_list_pluck( $package['contents'], 'product_id' ) ) .
        wp_json_encode( wp_list_pluck( $package['contents'], 'quantity' ) )
    );

    $cached = get_transient( $key );
    if ( false !== $cached ) {
        return $cached;
    }

    if ( ! empty( $rates ) ) {
        set_transient( $key, $rates, 10 * MINUTE_IN_SECONDS );
    }

    return $rates;
}, 10, 2 );

I deliberately kept the TTL short — 10 minutes — because carriers genuinely do update rates and I did not want to quote stale prices. With Redis object cache in front, reads are sub-millisecond and never touch wp_options. On this store, cache hit rate on that transient settled around 78% within an hour, which by itself removed four out of every five calls to the carrier.
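For clarity, here is the cache-key idea as a standalone Python sketch. The helper is hypothetical, not the plugin's code; the point it adds is that sorting line items before hashing makes the key order-independent, so the same cart always hits the same cache entry:

```python
# Illustrative rates cache key: destination plus normalised cart contents.
import hashlib
import json

def rates_cache_key(postcode, country, contents):
    """contents: list of (product_id, quantity) pairs."""
    items = sorted(contents)  # normalise item order so the key is stable
    payload = postcode + country + json.dumps(items)
    return "acme_rates_" + hashlib.md5(payload.encode()).hexdigest()

a = rates_cache_key("SW1A 1AA", "GB", [(11, 2), (42, 1)])
b = rates_cache_key("SW1A 1AA", "GB", [(42, 1), (11, 2)])  # same cart, reordered
c = rates_cache_key("SW1A 1AA", "GB", [(11, 1), (42, 1)])  # different quantity

assert a == b      # order does not change the key
assert a != c      # quantity does
```

The quantity check is exactly the bug class the transient key has to avoid: two carts with the same products but different quantities must never share a cached quote.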

Layer 3: A circuit breaker when the carrier is clearly broken

The last layer was the one I was most nervous about, because it changes customer-facing behaviour. If the carrier API returns a timeout or a 5xx, the plugin records the failure against a short-window counter. Three failures in 60 seconds and the breaker opens: for the next two minutes, every rate lookup returns a fallback flat-rate method immediately, without calling the API at all.

const BREAKER_KEY  = 'acme_breaker_state';
const BREAKER_FAIL = 3;
const BREAKER_WIN  = 60;
const BREAKER_COOL = 120;

function acme_breaker_is_open() {
    $state = get_transient( BREAKER_KEY );
    return is_array( $state ) && ! empty( $state['open_until'] )
        && $state['open_until'] > time();
}

function acme_breaker_record_failure() {
    $state = get_transient( BREAKER_KEY ) ?: array( 'failures' => array() );
    $now   = time();
    $state['failures'] = array_filter(
        $state['failures'],
        function ( $t ) use ( $now ) { return $t > $now - BREAKER_WIN; }
    );
    $state['failures'][] = $now;

    if ( count( $state['failures'] ) >= BREAKER_FAIL ) {
        $state['open_until'] = $now + BREAKER_COOL;
        error_log( 'Acme shipping breaker opened for ' . BREAKER_COOL . 's' );
    }

    set_transient( BREAKER_KEY, $state, BREAKER_WIN + BREAKER_COOL );
}

When the breaker is open, the fallback rate is a flat GBP 7.95 "Standard delivery (estimated)" that the client is happy to absorb for a couple of minutes rather than lose the sale. The breaker log line also feeds into the uptime monitor, so I know retroactively how often the carrier had a bad day.
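The same sliding-window logic, restated as a self-contained Python sketch so the state transitions are easy to follow outside WordPress. The thresholds mirror the PHP constants above; the class is an illustration, not the plugin's implementation:

```python
# Sliding-window circuit breaker: N failures inside WIN seconds opens it
# for COOL seconds, during which callers skip the API entirely.
import time

BREAKER_FAIL = 3    # failures needed to trip the breaker
BREAKER_WIN = 60    # sliding window, seconds
BREAKER_COOL = 120  # how long the breaker stays open, seconds

class Breaker:
    def __init__(self):
        self.failures = []      # timestamps of recent failures
        self.open_until = 0.0   # breaker is open while now < open_until

    def is_open(self, now=None):
        now = time.time() if now is None else now
        return now < self.open_until

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        # Drop failures that have aged out of the window, then add this one.
        self.failures = [t for t in self.failures if t > now - BREAKER_WIN]
        self.failures.append(now)
        if len(self.failures) >= BREAKER_FAIL:
            self.open_until = now + BREAKER_COOL

b = Breaker()
for t in (0, 10, 20):   # three failures inside one 60s window
    b.record_failure(now=t)
assert b.is_open(now=21)                          # tripped at t=20
assert not b.is_open(now=20 + BREAKER_COOL + 1)   # cooled down again
```

Note that failures spaced wider than the window never trip it: the pruning step is what distinguishes "the carrier is down" from "the carrier dropped one request an hour ago".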

Step 5: Give Nginx and PHP-FPM some headroom too

Layers one to three fix the root cause, but I still wanted the server to survive a thundering herd. Two small changes:

# /etc/nginx/conf.d/store.conf
fastcgi_connect_timeout 5s;
fastcgi_send_timeout    30s;
fastcgi_read_timeout    30s;

I shortened the read timeout from 60 to 30 seconds. If a request cannot complete in 30 seconds at the PHP-FPM layer, I would rather fail fast and return a 504 to one customer than have five more customers queueing up behind it.

And in PHP-FPM:

pm                   = dynamic
pm.max_children      = 25
pm.start_servers     = 6
pm.min_spare_servers = 4
pm.max_spare_servers = 10
pm.max_requests      = 500
request_terminate_timeout = 30s

request_terminate_timeout is the seat belt here. If somehow a worker gets stuck beyond 30 seconds despite the HTTP filter, FPM will kill it rather than letting it pin a slot forever. Pair that with pm.max_requests = 500 to recycle workers periodically and you get a pool that is genuinely self-healing.

The result

I deployed the mu-plugin, the caching, and the breaker in that order, and watched the php-fpm slow.log for the next 48 hours. The 504s stopped completely. The slow log went from ten to fifteen entries per afternoon to zero. The carrier still has bad runs — I see the breaker opening once or twice a week — but customers do not notice. They get a flat rate estimate, they check out, and the store keeps taking money.

The wider lesson I take from jobs like this is that WordPress performance work has two phases. The first is the obvious stuff — autoload bloat, object caching, PHP-FPM worker counts, database indexes. The second is harder: finding every place where your site has taken a blocking dependency on something you do not control. Payment gateways. Shipping APIs. Licensing servers. Email providers. Any of them can quietly turn into a worker-pool killer the moment they have a bad afternoon.

If you are running a WooCommerce store that integrates with live carrier rates, go and audit your timeouts today. Look for any http_request_args filter that sets a timeout above 5 seconds. Look at the PHP-FPM slow log during peak hours. And ask yourself what would happen to your checkout if that API stopped responding for ten minutes. If the answer is "my site falls over", you need a cap, a cache, and a breaker.


If you want someone to do this kind of deep WooCommerce performance investigation on your store — catching the problems that only show up at 3pm on a Wednesday — take a look at my WooCommerce maintenance plans. I also handle one-off WordPress emergency fixes if you are already on fire.
