How many tools does it take to find a resource leak in Windows Server 2003?

Problem: A Windows 2003 Server that starts refusing connections after approximately 14 days of uptime, and if left alone will completely lock up and stop displaying a local console login window. The server is monitored by Nagios, and Munin. Nagios gives a couple of hours advance warning that the server is going to go south, which gives us plenty of time to reboot it. Of the server traits Munin monitors, there doesn’t appear to be anything unusual. During the melt-down phase the following error is plentiful in the event logs: “The server was unable to allocate from the system nonpaged pool because the pool was empty.”

Task Manager doesn’t display any unusual memory usage. Process Explorer didn’t yield any immediately useful clues. Quite a bit of Google searching and about the only tool I could find that was recommended for tracking down resource leaks that affect the nonpaged memory pool was a Microsoft Utility called poolmon.exe. Poolmon has to be the most primitive wacked-up tool anyone at Microsoft has developed, released, and then blatantly ignored. It’s pure ASCII, runs in a DOS-style command window, can’t log its results over time to a log file; and I kid you not, the official way to use this utility to track down leaks is to run it and copy and paste the entire screen output to a text file every fifteen minutes so you can diff the results.

Poolmon needs an overhaul: the ability to track growth of values over time, log output to a file, and a GUI would be nice since this is a Windows utility. I know, it’s a lot for which to ask. I have no interest in picking up where I left off with Visual C++ ten years ago, so I made do with a little bit of perl and Excel. Using ActiveState’s ActivePerl, I set up a scheduled task that ran every fifteen minutes and executed this bit of braindead code that just dumps the output of poolmon.exe to a text file that has the timestamp in the file name:

I let that run for two days, and then I wacked this bit of perl together to analyze the results:

I copy and paste the output into Excel, and I turn it into a line graph. As you can see, there is obviously a problem and it is quite visible with only a short period of logging:

poolmon output graphed in excel

According to the poolmon.txt file that comes with poolmon,exe, the pool label Thre is “nt!ps – Thread objects”, which is the pool name for Windows thread and handle objects. Now I know I’m not looking for a memory issue, I’m looking for thread or handle usage. Task Manager can be used to display these resources, but it is not configured to do so by default. I open up Task Manager, click the Processes tab, and then go to View…Select Columns… and I turn on Handle Count and Thread Count. Sorting by these two columns, I quickly track down the culprit as HPBPRO.EXE which is part of the HP drivers for a Color LaserJet 3500.

Since it took me this much effort to track down this problem, and because I’ve read lots and lots of message board posts written by people having an equally difficult time tracking down similar issues, I thought I’d share. If there is an easier way that I could have diagnosed this issue, I would love to hear it. If this post helps you solve a similar issue, I’d love to hear that too.




3 Replies to “How many tools does it take to find a resource leak in Windows Server 2003?”

  1. Hey Chris,

    Thanks for sharing. I’m fighting with a memory leak issue too, so far it’s caused the server to bomb out 3-4 times. Usually takes 10-14 days to fail, so we can try scheduling a reboot every weekend, but that’s not a solution. I’m leaning toward a printer as the cause, but this is a TermServ machine, so we can’t exactly control the printers. I haven’t grabbed your code bits yet, but that’ll be my next step if the last step I took didn’t help. You mentioned the handles/threads counters in Task Manager, that was helpful also. Copying and pasting from poolmon is really a half-solution, so it’s great to see that there’s a better method out there.


  2. Very interesting indeed – I have exactly the same problem. I know HPBPRO.EXE is present on the server but I have yet to prove that as the cause. I’m just trying out your perl scripts to see what I can find. Thanks very much.


  3. Thanks for this, Though im a little bit late to the party,i managed to get into poolmon.exe and find the tag “Thre” but couldn’t figure out how to associate that with a process/service.

    Your Post helped me get over the last hurdle, found a process with 82,000 handles, and find the service related to this and restarting it saw a drop of around 100,000,000 Bytes in Poolmon instantly.

    Hopefuly no more crashes on this server 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *