Win7 x64 unresponsive for a minute or so. HD failing?

QUESTION :

On a fully updated Win7 x64, every so often the system stalls for a minute or so. This has been going on for a couple months now. By stalling I mean the mouse responds and I can move windows around, but any window, any program, that is open becomes whiteish when I select it AND any new programs will not open. It doesn’t matter what kind of program it is. When the stall stops all clicks I made (open new programs for example) take effect.

Nothing shows up consistently (as in every time this happens) in the event log. Today, though, I was able to find something, but it doesn’t reveal much other than that the “system was unresponsive”. It’s a 7009: “A timeout was reached (30000 milliseconds) while waiting for the Windows Error Reporting Service service to connect.”

It doesn’t matter whether I have any USB devices plugged in or not. I’ve run Microsoft Security Essentials and Malwarebytes.

While the machine is unresponsive, I’ve noticed that Drive D (the other partition on the single internal HD in this laptop) is displayed like this in Explorer. This never occurs with Drive C or any other drive on the machine. (Screenshot: how drive D shows up in Explorer.)

SMART report for the physical drive: SMART report

Read benchmark by HD Tune 5 Pro, probably the most telling piece of the puzzle. Isn’t this alone enough to see there is a problem with the drive, regardless of whether the unresponsiveness is caused by such purported problem? read benchmark by HD Tune 5 Pro

Here is a short hardware report:

Computer:      LENOVO ThinkPad T520
CPU:           Intel Core i5-2520M (Sandy Bridge-MB SV, J1)
               2500 MHz (25.00x100.0) @ 797 MHz (8.00x99.7)
Motherboard:   LENOVO 423946U
Chipset:       Intel QM67 (Cougar Point) [B3]
Memory:        8192 MBytes @ 664 MHz, 9.0-9-9-24
               - 4096 MB PC10600 DDR3 SDRAM - Samsung M471B5273CH0-CH9
               - 4096 MB PC10600 DDR3 SDRAM - Patriot Memory (PDP Systems) PSD34G13332S
Graphics:      Intel Sandy Bridge-MB GT2+ - Integrated Graphics Controller [D2/J1/Q0] [Lenovo]
               Intel HD Graphics 3000 (Sandy Bridge GT2+), 3937912 KB 
Drive:         ST320LT007, 312.6 GB, Serial ATA 3Gb/s
Sound:         Intel Cougar Point PCH - High Definition Audio Controller [B2]
Network:       Intel 82579LM (Lewisville) Gigabit Ethernet Controller
Network:       Intel Centrino Advanced-N 6205 AGN 2x2 HMC
OS:            Microsoft Windows 7 Professional (x64) Build 7601

The drive is less than 1 year old. Do I have a defective drive? Seagate Tools diag says there is nothing wrong with the drive…

UPDATE: I noticed that the Windows Error Reporting service entered the running state and then the stopped state, and the gap between the two events was exactly 2 minutes. Which error it was trying to report, I don’t know. I checked the “Reliability Monitor” and it shows no errors to be reported. I’ve disabled the Windows Error Reporting service to see if the problem stops.

ANSWER :

Based on the new information you have provided, I can say that there is in fact no problem at all. Then why does it “go offline” for anywhere from a few seconds to three minutes after suspending the guest OS? Because, as you said, the HDD LED stays lit while the drive remains unresponsive: the drive is being heavily used.

What is happening is that when you finish using VMWare and want to sleep the guest OS, you use the standby or hibernation feature instead of shutting down. This causes VMWare to copy the contents of the VM’s RAM to disk so that it can resume where it left off without having to boot up all over again. Depending on how much memory you have assigned to the VM and how much was being used, this can mean that VMWare has to write quite a lot of data (gigabytes) to disk.
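To get a feel for the scale involved, here is a rough back-of-the-envelope sketch in Python. The 60 MB/s sustained write speed is only an assumed figure for a 2.5" laptop hard drive, not a measurement of this drive, and the function name is made up for illustration.

```python
def suspend_write_seconds(ram_in_use_gib: float, write_mb_per_s: float) -> float:
    """Rough time to stream a VM's RAM image to disk in one sequential write.

    Ignores seeks, caching and competing I/O, so real suspends take longer.
    """
    bytes_to_write = ram_in_use_gib * 1024 ** 3
    return bytes_to_write / (write_mb_per_s * 1_000_000)


if __name__ == "__main__":
    ASSUMED_WRITE_MB_PER_S = 60.0  # assumption, not measured on this drive
    for ram_gib in (1, 2, 4):
        seconds = suspend_write_seconds(ram_gib, ASSUMED_WRITE_MB_PER_S)
        print(f"{ram_gib} GiB of guest RAM: roughly {seconds:.0f} s of pure writing")
```

Even in this best case, a guest with a few gigabytes of RAM keeps the disk busy for tens of seconds, which is in the same range as the stalls described.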

When VMWare copies the memory to disk, the drive becomes more or less unresponsive to new disk operations until the current disk operations (writing the RAM to a file) have finished. As a result, when you open My Computer, Windows tries to refresh the data but cannot read the drive to fetch it, because all those write commands are already queued ahead of it. Therefore it leaves the entry empty and looking like it’s offline until it can manage to slip in those read requests (between VMWare’s write operations).

If you open the drive in Explorer, you will see that either it will not open it at all for a while, or it will open it and flash the address bar with a green progress bar like it does whenever there is a lengthy file operation (like searching for thousands of files).

In summary, there is nothing surprising or mysterious about this situation. If instead of putting a VMWare guest OS into standby, you had just manually copied a giant file to the drive, the results would be exactly the same.

So what can you do to fix it? Aside from changing to a faster drive (or using an internal one if D: is external), your best bet is to defragment the drive. If D: is very fragmented, then when VMWare tries to flush the RAM to disk, the drive will thrash around a lot while writing chunks of the giant file to different areas (of course, this assumes it’s not an SSD, and if D: is still a partition on the same ST320LT007 drive as C:, then it’s not).

If you defragment the drive (assuming that there is sufficient free space), then the system can write the RAM file with only a few file operations in large swaths (e.g., write 1GB of data at cluster X) instead of many, many little operations (write 1MB here, write 245.18MB there, 4KB here, another 18.1MB somewhere else…) Then sleeping the VM will finish much faster and the drive will be more responsive.
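The effect of fragmentation can be illustrated with a very simple cost model: total time is the transfer time plus one head movement per fragment. This is only a sketch. The 60 MB/s throughput and 15 ms per seek are assumed ballpark values for a laptop hard drive, and the fragment counts are invented for illustration.

```python
def write_seconds(total_mb: float, fragments: int,
                  throughput_mb_s: float = 60.0, seek_ms: float = 15.0) -> float:
    """Toy model: sequential transfer time plus one seek per fragment."""
    transfer = total_mb / throughput_mb_s
    seeking = fragments * seek_ms / 1000.0
    return transfer + seeking


if __name__ == "__main__":
    size_mb = 2048  # a 2 GB RAM image, as an example
    for label, fragments in (("defragmented", 4), ("badly fragmented", 20000)):
        print(f"{label}: about {write_seconds(size_mb, fragments):.0f} s")
```

In this model the data volume is the same in both cases; only the number of head movements changes, and that alone can dominate the total time.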

To find out exactly what the access is that is causing the drive to be active and busy, you can use a tool like Process Monitor. Run it and click the class-filters to select only the file-class filter as seen below.

Now you can see what files and folders are being accessed. Make sure to memorize the hotkey to start and stop activity capturing (Ctrl+E) so that you can stop it once it starts flooding with what is likely to be the disk operations from VMWare.

Screenshot of Process Monitor with only file class filter active
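If the capture is large, it can be easier to save it as CSV (File > Save in Process Monitor) and count events per process. The sketch below assumes the export uses Process Monitor's default column headers, including "Process Name". The sample rows are made up purely to show the format and are not taken from the asker's machine.

```python
import csv
import io
from collections import Counter


def busiest_processes(csv_text: str, top: int = 5):
    """Count captured events per process in a Process Monitor CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["Process Name"] for row in reader)
    return counts.most_common(top)


# Hypothetical sample in the assumed export layout (illustration only).
SAMPLE = '''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:01","vmware-vmx.exe","1234","WriteFile","D:\\vm\\guest.vmss","SUCCESS",""
"10:00:01","vmware-vmx.exe","1234","WriteFile","D:\\vm\\guest.vmss","SUCCESS",""
"10:00:02","explorer.exe","888","QueryDirectory","D:\\","SUCCESS",""
"10:00:02","vmware-vmx.exe","1234","WriteFile","D:\\vm\\guest.vmss","SUCCESS",""
'''

if __name__ == "__main__":
    for name, count in busiest_processes(SAMPLE):
        print(f"{name}: {count} events")
```

For a real export, read the file with `open(path, encoding="utf-8-sig", newline="").read()` so that a leading byte-order mark, if present, does not end up in the first column name.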

The described symptoms are indeed indicative of a bad drive. When a disk is unresponsive, the system waits for a seemingly interminable amount of time before timing out and throwing an error.

That said, it is curious that it only seems to happen to the D: volume (which you implied was a partition on the same physical drive as C:). If it were a software issue (e.g., corrupt file-system on D:), then it should not be happening intermittently, while a hardware issue could indeed happen intermittently if for example there are only a couple of bad sectors towards the inside of the platter and the system only occasionally happens to touch them. Of course you already said that HD Tune reported none. However, as you thought, modern drives do indeed hide bad sectors. They usually have a bunch of spare sectors that they can remap bad sectors to and yes, they do this transparently so that the OS does not know about them (other than generic information via SMART).

If the Data column is reporting raw data, then yes, 2,465 reallocated sectors is a lot. If it only happens with D:, then the bad sectors are likely grouped towards the center of the platter where the head goes to park, so maybe the drive got jostled while it was shutting down or spinning up.

What is that volume being used for? If it is being used for things like storing the temp directory and such where the OS or programs make occasional access to it, then it could be a corrupt file-system (of course you said you ran chkdsk, so it should not be).

You can check/confirm if it is a physical problem with your drive by opening the Event Viewer (eventvwr.exe) and checking the System log for events with a Source of Disk. You can cross-reference the indicated disk number in the Disk Management MMC snap-in (diskmgmt.msc).

Bad Disk event in Event Viewer

Corresponding disk number in Disk Management snap-in
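The same check can be scripted with the built-in `wevtutil` command instead of clicking through Event Viewer. The sketch below is an assumption-laden example: it assumes the event provider is registered under the name `disk`, which you should confirm against the Source shown in your own log, and the helper names are made up.

```python
import shutil
import subprocess
from typing import List, Optional


def build_disk_event_query(max_events: int = 20) -> List[str]:
    """Build a wevtutil command listing the newest System events from the disk provider."""
    xpath = "*[System[Provider[@Name='disk']]]"  # provider name is an assumption
    return ["wevtutil", "qe", "System", f"/q:{xpath}",
            f"/c:{max_events}", "/rd:true", "/f:text"]


def recent_disk_events(max_events: int = 20) -> Optional[str]:
    """Run the query and return its text output, or None if wevtutil is unavailable."""
    if shutil.which("wevtutil") is None:
        return None
    result = subprocess.run(build_disk_event_query(max_events),
                            capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    output = recent_disk_events()
    print(output if output is not None else "wevtutil not found; run this on Windows.")
```

An empty result only means no matching events were logged; it does not by itself prove the drive is healthy.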

The problem has been traced down to VMWare Player. It happens immediately after, or some time after, the VMWare guest OS is shut down. More info here.

The solution in my case was disabling the VMware Authorization Service. This service is only needed when the virtual machine needs to be run by non administrators.

Update: Disabling the VMware Authorization Service AND re-enabling the
Application Experience Service (which I had disabled because I deemed it unnecessary) solved the problem.

The D: drive still goes “offline” for a few seconds, even after I have
replaced the HD. This doesn’t render the entire machine unresponsive,
only specific applications that depend on data stored on D: (like
Outlook, in my config). I’m going to consider the D: offline drive
issue as a separate issue.

This is a hard problem to diagnose from the information you provided (which was a lot of info, don’t get me wrong). One way to tell whether this is a hardware problem is to try to recreate it under an install of Linux, such as through Wubi.

I have seen similar things happen when there are bad sectors on the HD. But I have also seen similar problems due to faulty drivers.

Have you tried CHKDSK and scanned for bad sectors?
