Script to individually terminate “stuck” web server threads without restarting Apache

At work, a misbehaving web app was causing Apache worker threads to get "stuck" indefinitely in the "W" state (Sending Reply): they returned nothing to the client but took up a chunk of CPU, often maxing out a single thread at 99%. As additional requests for the problematic pages came in, they would eventually tie up all available processor capacity, leading to high load averages (the worst was a load average of 90 while Googlebot was crawling) and a very sluggish server. Restarting Apache "resolved" the problem, but of course that dropped every connected client, so I needed a more targeted solution.

While the developer is investigating, I've put together this script as a stopgap: it identifies and individually terminates the stuck Apache workers and logs the request URL for later analysis. At first I had monit run it whenever Apache CPU usage exceeded a threshold, but that wasn't often enough, so now cron runs it at one-minute intervals. If no Apache worker threads match the pattern, the script exits silently. So far it has worked really well, with zero false positives, though of course this is a temporary measure and not something I'd run normally.

Example notification sent and log entry made by script when Apache workers are stopped:

--------------------------------
 Stopped on 2012-08-28 22:38:01
--------------------------------
Srv   PID    Acc         M  CPU    SS  Req Conn Child  Slot Client        VHost             Request                   
13-4  22979  4/87/618    W  66.29  81   0  0.0  0.06  1.79  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
0-4   25369  75/86/2054  W  42.95  79   0  0.0  0.00  9.28  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
15-3  2133   86/346/449  W  67.24  72   0  0.0  0.49  1.15  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1     
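
Because the report is whitespace-delimited, each column can be picked out by position with awk. The sketch below is only an illustration and not part of the script: it feeds the three sample rows above to awk and prints the PID (field 2), the SS value, i.e. seconds since the request began (field 6), and the requested path (field 14). The function name `summarize_workers` is made up for this example, and the column positions are taken from the sample above, so they may differ in other Apache versions.

```shell
#!/bin/bash
# Illustration only: summarize_workers is a hypothetical helper name.
summarize_workers() {
  # $2 = PID, $6 = SS, $14 = request path ($13 is the HTTP method)
  awk '{ print $2, $6, $14 }'
}

# The three rows from the example notification above.
sample='13-4  22979  4/87/618    W  66.29  81   0  0.0  0.06  1.79  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
0-4   25369  75/86/2054  W  42.95  79   0  0.0  0.00  9.28  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
15-3  2133   86/346/449  W  67.24  72   0  0.0  0.49  1.15  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1'

echo "$sample" | summarize_workers
```

Each output line pairs a PID with how long it has been busy and the URL it is serving, which is the information worth keeping for later analysis.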

And here's the script:

#!/bin/bash

GetAllWorkers()
{
 AllWorkers=$(apache2ctl fullstatus | awk '/^Srv /,/^$/ {print}')
}

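GetStuckWorkers()
{
 # Stuck = mode "W" ($4), more than 60 seconds since the request began ($6),
 # with Req ($7) and Conn ($8) both still zero.
 StuckWorkers=$(echo "$AllWorkers" | awk '$4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0" {print}')
 header=$(echo "$AllWorkers" | head -n 1)
}

GetStuckPIDs()
{
 # Same test as GetStuckWorkers, but print only the PID column ($2).
 StuckPIDs=$(echo "$AllWorkers" | awk '$4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0" {print $2}')
}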

Show()
{
 echo "--------------------------------"
 echo " Stopped on $(date +%F\ %T)"
 echo "--------------------------------"
 echo "$header"
 echo "$StuckWorkers"
}

GetAllWorkers && GetStuckPIDs
if [ -n "$StuckPIDs" ]; then
  for PID in $StuckPIDs; do
    echo "stopping $PID with SIGTERM"
    kill "$PID"
  done
  GetStuckWorkers
  Show | mail -s "$(basename "$0") executed on $(hostname -s)" root
  Show >> /path/to/ksaw.log
fi
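
The filter can be tried without touching a live server by feeding it captured status text. The sketch below is a test harness under stated assumptions rather than part of the script: `stuck_pids` is a hypothetical name, the fourth worker row (PID 3001) is invented to show a healthy worker, and the header pattern is loosened to `^ *Srv ` so that it also accepts a header line indented with spaces.

```shell
#!/bin/bash
# Hypothetical helper: read fullstatus text on stdin, print PIDs of stuck workers.
stuck_pids() {
  awk '/^ *Srv /,/^$/ {
         # Inside the worker table only: apply the same test as the script.
         if ($4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0") print $2
       }'
}

# Captured-style input: text before and after the table is ignored.
# The row for PID 3001 is invented (SS of 2 seconds, so not stuck).
status='Some status text before the table

Srv   PID    Acc         M  CPU    SS  Req Conn Child  Slot Client        VHost             Request
13-4  22979  4/87/618    W  66.29  81   0  0.0  0.06  1.79  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
0-4   25369  75/86/2054  W  42.95  79   0  0.0  0.00  9.28  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
15-3  2133   86/346/449  W  67.24  72   0  0.0  0.49  1.15  66.249.68.16  host.example.com  GET  /path/to/page  HTTP/1.1
1-2   3001   1/2/3       W  0.10   2    0  0.0  0.00  0.10  10.0.0.1      host.example.com  GET  /  HTTP/1.1

Some status text after the table'

echo "$status" | stuck_pids
```

Run against this input, the function prints the three long-running PIDs and skips the healthy one; with no matching rows it prints nothing, which mirrors the script exiting silently.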

4 Comments

  • 1. Benoit Georgelin replies at 5th November 2012, 6:40 pm :

    Thanks for sharing your work. I appreciate it!
    Great explanation, great script.

    I made just one small change to be able to use it.
    From
    AllWorkers=$(apache2ctl fullstatus | awk '/^Srv /,/^$/ {print}')

    to
    AllWorkers=$(apache2ctl fullstatus | awk '/^   Srv /,/^$/ {print}')

    With 3 blank spaces before Srv 🙂

    Regards,
    Benoît

  • 2. Detlev replies at 3rd February 2014, 12:08 am :

    Thanks a lot. I have an odd problem and can't find the bug in my config.

    Hope I can get rid of it now!

  • 3. Raymond replies at 31st October 2014, 10:00 pm :

    Did you ever figure out what was causing the Apache worker threads to get indefinitely stuck?

  • 4. Alain Kelder replies at 1st November 2014, 12:10 am :

    That was over two years ago, so I had to look it up. 😉 But yes, it was an application level bug. The developer found some bad code that was causing an endless loop when some pages were requested with certain query strings.
