Problem: A Windows 2003 Server that starts refusing connections after approximately 14 days of uptime, and if left alone will completely lock up and stop displaying a local console login window. The server is monitored by Nagios, and Munin. Nagios gives a couple of hours advance warning that the server is going to go south, which gives us plenty of time to reboot it. Of the server traits Munin monitors, there doesn’t appear to be anything unusual. During the melt-down phase the following error is plentiful in the event logs: “The server was unable to allocate from the system nonpaged pool because the pool was empty.”
Task Manager doesn’t display any unusual memory usage. Process Explorer didn’t yield any immediately useful clues. Quite a bit of Google searching and about the only tool I could find that was recommended for tracking down resource leaks that affect the nonpaged memory pool was a Microsoft Utility called poolmon.exe. Poolmon has to be the most primitive wacked-up tool anyone at Microsoft has developed, released, and then blatantly ignored. It’s pure ASCII, runs in a DOS-style command window, can’t log its results over time to a log file; and I kid you not, the official way to use this utility to track down leaks is to run it and copy and paste the entire screen output to a text file every fifteen minutes so you can diff the results.
Poolmon needs an overhaul: the ability to track growth of values over time, log output to a file, and a GUI would be nice since this is a Windows utility. I know, it’s a lot for which to ask. I have no interest in picking up where I left off with Visual C++ ten years ago, so I made do with a little bit of perl and Excel. Using ActiveState’s ActivePerl, I set up a scheduled task that ran every fifteen minutes and executed this bit of braindead code that just dumps the output of poolmon.exe to a text file that has the timestamp in the file name: poolmonlog.pl
I let that run for two days, and then I wacked this bit of perl together to analyze the results: poolmon_analysis.pl
I copy and paste the output into Excel, and I turn it into a line graph. As you can see, there is obviously a problem and it is quite visible with only a short period of logging:
According to the poolmon.txt file that comes with poolmon,exe, the pool label Thre is “nt!ps – Thread objects”, which is the pool name for Windows thread and handle objects. Now I know I’m not looking for a memory issue, I’m looking for thread or handle usage. Task Manager can be used to display these resources, but it is not configured to do so by default. I open up Task Manager, click the Processes tab, and then go to View…Select Columns… and I turn on Handle Count and Thread Count. Sorting by these two columns, I quickly track down the culprit as HPBPRO.EXE which is part of the HP drivers for a Color LaserJet 3500.
Since it took me this much effort to track down this problem, and because I’ve read lots and lots of message board posts written by people having an equally difficult time tracking down similar issues, I thought I’d share. If there is an easier way that I could have diagnosed this issue, I would love to hear it. If this post helps you solve a similar issue, I’d love to hear that too.