How do I diagnose a hard Linux crash?

Problem :

I have a home-built Linux server (Ubuntu 12.04.5 LTS, Intel i5-3570K, 8GB RAM) acting primarily as a mail and web server. It operates in console mode only (no GUI). I will SSH to it now and then, and almost never operate it from the console. It tends to work fine for many days, even weeks, but then sometimes crashes hard without warning. And when I say “crashes hard”, I mean the PC suddenly becomes completely unresponsive:

  • It leaves no log entries
  • It doesn’t emit an “Oops”, kernel panic message or core dump
  • It doesn’t display any message on the screen.
  • It doesn’t respond to any keyboard or mouse input (The NumLock light is also unresponsive to that key)
  • It cannot be accessed by SSH
  • The case’s reset switch will not operate

The only solution is to hold the case power button in till it turns off, then restart it.

Of course this screams “hardware problem”, but which component is the most likely? Memtest86+ shows no errors, so that would seem to leave the Big Three: motherboard, CPU or power supply. (The PC is not overclocked, and the last sensor readings logged before the crash indicate no overheating or fan problems.)

  1. Is any one of these components statistically more likely to be the problem?

  2. I want to highlight the last criterion above (the reset switch not working) because it seemed unusual to me. Usually, even with a hard crash, a PC can still be rebooted with the case’s reset switch. Does this suggest a problem with the PSU or the motherboard? (Holding in the power switch for 4-5 seconds to turn off the PC does still work.)

  3. Is there a way to test them without simply ordering new parts one at a time until I’m confident (after several weeks of no crashing) that the problem is resolved?

Thanks to anyone who can help.

Solution :

I am a bit surprised no one has suggested the use of the SysRq magic key.

First of all, it should be used instead of the power switch to force a reboot, because this gives programs a chance to save unsaved data to the disk; failure to do so might cause considerable problems upon reboot (not to mention the crashing bore of having to wait for the usual fsck). This is done as follows: holding Alt and SysRq down together, press the keys r e i s u b in turn, waiting a few seconds between each (the famous English mnemonic is Raising Elephants Is So Utterly Boring; I prefer Running Errands Is So Utterly Boring; try to come up with a better one if you can).

Even apart from this, when the system freezes, Alt + SysRq + X (where X is a letter) lets you run some diagnostics: for instance, X=d displays all locks currently held, which may help diagnose a software problem; X=j thaws frozen filesystems; X=l (a lowercase L) shows a stack backtrace for all active CPUs; X=t outputs a list of current tasks to the console; X=w displays a list of blocked tasks.

You can find more codes on Wikipedia.
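One thing worth checking while the machine is still healthy: the kernel only honours the SysRq functions allowed by the value in /proc/sys/kernel/sysrq, and some distributions ship a restricted value. Below is a minimal Python sketch that reads that value and decodes it using the bitmask listed in the kernel's SysRq documentation. The function names are my own, and I have not run this on 12.04 specifically, so treat it as a starting point.

```python
from pathlib import Path

# Bit meanings as listed in the kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def decode_sysrq(value):
    """Return the SysRq function groups enabled by the given setting."""
    if value == 0:          # 0 disables SysRq completely
        return []
    if value == 1:          # 1 enables every function
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting; return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    current = read_sysrq()
    if current is None:
        print("could not read /proc/sys/kernel/sysrq")
    else:
        print(f"kernel.sysrq = {current}")
        for name in decode_sysrq(current):
            print("  enabled:", name)
```

If the debugging-dump group (bit 8) is missing from the output, keys such as l, t and w will do nothing during a freeze, and r, e and i likewise depend on the keyboard and signalling groups. Setting kernel.sysrq to 1 with sysctl enables everything.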

While I cannot say this will be a decisive step (there are situations where even this fails), it is the next step in the investigation: it will help point to either a software or a hardware problem and narrow the range of possible culprits.

1: Is your Ubuntu stable?

Did you download a stable version of Ubuntu? If not, try downgrading to the latest stable build.

2: Have you tried it on another Virtual/Physical Machine?

It could very well be a script error. If you haven’t tried it already, testing the setup in a VM such as VirtualBox will more than likely prevent any hard crash, and it would also give you an environment where you could debug and monitor the OS.

3: RAM failure?

It is very unlikely to be the local SSD/HDD/SSHD, because the Linux OS is loaded into RAM and would post a warning if the kernel lost contact with the drive, and only then crash. However, if the RAM were to lock up because it is faulty or defective, the operating system would freeze completely, unable to post (or even be aware of) any errors, which might explain why there are no logs. That said, it is VERY possible that it could be something else.

4: Have a look at the forums

I’m not the most proficient Linux user out there, and there is a lot that I don’t know, but I have had similar hardware and software issues. However, I don’t really know what your home-built server does, so it is hard to pinpoint the flaw. I’d browse the forums.

The best you can do is look at the logs near the time of the lockup and see if you can correlate it with a system event of any type. This is difficult to do, and you may not find anything that points to a direct cause this way.
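To narrow down where to look, it can help to find the silent periods in the log first: the last line before a long gap is the last thing the machine managed to write before it froze. Here is a rough Python sketch of that idea, assuming classic syslog lines that start with a "Mon DD HH:MM:SS" stamp. Those stamps carry no year, so the year is passed in. The sample lines and the 10-minute threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a classic syslog line."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # continuation line or a different log format


def find_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_line_before, first_line_after, gap) for each silent period."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is None:
            continue
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps


# Made-up sample lines; on a real system read /var/log/syslog instead.
SAMPLE = [
    "Mar  3 04:10:01 server CRON[2211]: (root) CMD (example job)",
    "Mar  3 04:12:47 server kernel: [86400.1] example message",
    "Mar  3 09:30:15 server kernel: [    0.000000] example boot message",
]

if __name__ == "__main__":
    for before, after, gap in find_gaps(SAMPLE, year=2015):
        print(f"silent for {gap}")
        print("  last line before:", before)
        print("  first line after:", after)
```

On a busy mail server the log is rarely quiet for long, so a gap of several minutes that ends in boot messages is a good candidate for a crash. Compare the last lines before each gap across several crashes and see whether the same job or device keeps showing up.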

Some hints for diagnosing hardware problems:

The easiest thing to eliminate is firmware issues/settings:

  • Make sure your system has the latest firmware/BIOS updates from the manufacturer.

  • Make sure any storage devices are also updated to latest firmware.

  • Try disabling any CPU or other power management options in the firmware/BIOS.

  • Try disabling virtualization in the firmware if you don’t use it.
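Before hunting for firmware updates, it is useful to record what the board is and which BIOS it currently runs. On Linux these details are usually exposed as small text files under /sys/class/dmi/id. The following Python sketch reads them; the function name and the choice of fields are mine, and some files may be missing or unreadable on a given machine, in which case the value is reported as None.

```python
from pathlib import Path

# DMI attributes commonly exposed by the kernel under /sys/class/dmi/id.
DMI_FIELDS = (
    "sys_vendor",
    "product_name",
    "board_vendor",
    "board_name",
    "bios_vendor",
    "bios_version",
    "bios_date",
)


def read_dmi(base="/sys/class/dmi/id"):
    """Return a dict of DMI fields; unreadable or missing fields map to None."""
    info = {}
    for field in DMI_FIELDS:
        try:
            info[field] = (Path(base) / field).read_text().strip()
        except OSError:
            info[field] = None
    return info


if __name__ == "__main__":
    for field, value in read_dmi().items():
        print(f"{field:13} {value if value is not None else '(unavailable)'}")
```

Compare bios_version and bios_date with the newest release listed on the board manufacturer's support page to see whether an update is actually available.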

Problems with RAM can cause hard lockups even if they don’t show up in a memory test, because the fault can be very intermittent. Real servers use ECC RAM, which keeps rare, transient memory errors from causing problems, but a non-server PC doesn’t have this. Try swapping out the RAM if you can.

A power issue from your wall power could cause problems like this. If you are serious about running a home server you should have a battery backup that also filters out transient power issues.

If problems persist thereafter, try replacing the power supply or using another one.

Afterward, assume the motherboard is flaky and look into replacing it.