Script to individually terminate “stuck” web server threads without restarting Apache
At work, a misbehaving web app was causing Apache worker threads to get "stuck" indefinitely in the "W" state (Sending Reply): they returned nothing to the client but took up a chunk of CPU, often maxing a single thread at 99%. As additional requests for the problematic pages came in, the stuck workers would eventually tie up all available processor capacity, leading to high load averages (the worst was a load average of 90 while Googlebot was crawling) and a very sluggish server. Restarting Apache "resolved" the problem, but of course that also killed every client connection, so I needed a more targeted solution.
While the developer is investigating, I came up with this script as a stopgap measure to identify and individually terminate these stuck Apache workers and to log the request URL for later analysis. At first, I had monit run it whenever Apache CPU usage exceeded a threshold, but that wasn't often enough, so now cron runs it at one-minute intervals. If no Apache worker threads match the pattern, the script exits silently. So far it's worked really well, with zero false positives, though of course this is a temporary measure and not something I'd run normally.
Example notification sent and log entry made by script when Apache workers are stopped:
--------------------------------
Stopped on 2012-08-28 22:38:01
--------------------------------
Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
13-4 22979 4/87/618 W 66.29 81 0 0.0 0.06 1.79 66.249.68.16 host.example.com GET /path/to/page HTTP/1.1
0-4 25369 75/86/2054 W 42.95 79 0 0.0 0.00 9.28 66.249.68.16 host.example.com GET /path/to/page HTTP/1.1
15-3 2133 86/346/449 W 67.24 72 0 0.0 0.49 1.15 66.249.68.16 host.example.com GET /path/to/page HTTP/1.1
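Because each log entry keeps the full status line, the logged requests can be tallied later to see which pages keep getting stuck. Here is a minimal sketch, assuming the log format shown above; the function name `tally_requests` is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: count how often each vhost/path pair appears in the log.
# Assumes worker lines have "W" in field 4 and end with:
#   VHost Method Path Protocol
tally_requests() {
    awk '$4 == "W" { print $(NF-3), $(NF-1) }' "$1" | sort | uniq -c | sort -rn
}

# Usage (path is a placeholder, as in the post):
#   tally_requests /path/to/ksaw.log
```

Run against a log containing only the entry above, this would print one line with a count of 3 for host.example.com /path/to/page.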
And here's the script:
#!/bin/bash
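# Capture the worker table from the status report: header line through the first blank line.
GetAllWorkers()
{
    AllWorkers=$(apache2ctl fullstatus | awk '/^Srv /,/^$/ {print}')
}

# A worker counts as stuck when it is in "W" state, more than 60 seconds have
# passed since the request began (SS), and Req is 0 with 0.0 KB transferred (Conn).
GetStuckWorkers()
{
    StuckWorkers=$(echo "$AllWorkers" | awk '$4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0" {print}')
    header=$(echo "$AllWorkers" | head -n 1)
}

# Same filter as above, but print only the PID column.
GetStuckPIDs()
{
    StuckPIDs=$(echo "$AllWorkers" | awk '$4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0" {print $2}')
}

# Format the report used for both the mail notification and the log entry.
Show()
{
    echo "--------------------------------"
    echo " Stopped on $(date +%F\ %T)"
    echo "--------------------------------"
    echo "$header"
    echo "$StuckWorkers"
}

GetAllWorkers && GetStuckPIDs
if [ -n "$StuckPIDs" ]; then
    # StuckPIDs is left unquoted on purpose so it splits into one PID per iteration.
    for PID in $StuckPIDs; do
        echo "stopping $PID with SIGTERM"
        kill "$PID"
    done
    GetStuckWorkers
    Show | mail -s "$(basename "$0") executed on $(hostname -s)" root
    Show >> /path/to/ksaw.log
fi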
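To try the filter without touching a live server, the same awk conditions can be run against canned status text. This is a sketch, not the script itself: the function name `stuck_pids` is hypothetical, the last sample line is invented to show a non-match, and the range pattern allows optional leading spaces before "Srv" on the assumption that some setups indent the header.

```shell
#!/bin/bash
# Hypothetical helper: read status text on stdin, print the PIDs of stuck workers.
# The thresholds mirror the script above; "^ *Srv " tolerates an indented header.
stuck_pids() {
    awk '/^ *Srv /,/^$/' |
        awk '$4 == "W" && $6 > 60 && $7 == 0 && $8 == "0.0" { print $2 }'
}

# Sample input: header plus two worker lines in the format shown earlier.
# The final line is invented: a "W" worker only 5 seconds into its request.
sample='Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Request
13-4 22979 4/87/618 W 66.29 81 0 0.0 0.06 1.79 66.249.68.16 host.example.com GET /path/to/page HTTP/1.1
0-4 25369 75/86/2054 W 42.95 79 0 0.0 0.00 9.28 66.249.68.16 host.example.com GET /path/to/page HTTP/1.1
1-4 1234 1/2/3 W 0.10 5 0 0.0 0.00 0.10 192.0.2.1 host.example.com GET /other HTTP/1.1'

printf '%s\n' "$sample" | stuck_pids
# prints:
# 22979
# 25369
```

Tuning the 60-second threshold here first is a cheap way to check for false positives before letting the real script send any signals.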
4 Comments
1. Benoit Georgelin replies at 5th November 2012, 6:40 pm :
Thanks for sharing your work. I appreciate !
Great explication, great script.
I did just a little change to be able to use it.
From
AllWorkers=$(apache2ctl fullstatus | awk '/^Srv /,/^$/ {print}')
To
AllWorkers=$(apache2ctl fullstatus | awk '/^   Srv /,/^$/ {print}')
With 3 blank spaces before Srv 🙂
Regards,
Benoît
2. Detlev replies at 3rd February 2014, 12:08 am :
Thx a lot. I have an odd problem and can't find the bug in my conf.
hope now I can get rid of it!
3. Raymond replies at 31st October 2014, 10:00 pm :
Did you ever figure out what was causing the Apache worker threads to get indefinitely stuck?
4. Alain Kelder replies at 1st November 2014, 12:10 am :
That was over two years ago, so I had to look it up. 😉 But yes, it was an application level bug. The developer found some bad code that was causing an endless loop when some pages were requested with certain query strings.