Ubuntu Server: Cannot delete file “x”, Structure needs cleaning

Problem :

I have a game server hosted on my Ubuntu Server 16.04 and out of the blue I cannot start/restart it because of the following file:

-????????? ? ?     ?        ?            ? proceduralmap.3000.1499245715.149.sav

This seems to be the only file in the fs with this situation.
Now, the server is a dedicated server purchased from a hosting provider.
The drive on which the file resides is a SCSI mounted HDD (/dev/sdb1).

The df -hT output:

Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.7G     0  3.7G   0% /dev
tmpfs          tmpfs     744M   81M  663M  11% /run
/dev/sda4      ext4       21G   16G  4.7G  77% /
tmpfs          tmpfs     3.7G   24K  3.7G   1% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     3.7G     0  3.7G   0% /sys/fs/cgroup
/dev/sda3      ext4      946M  143M  739M  17% /boot
cgmfs          tmpfs     100K     0  100K   0% /run/cgmanager/fs
/dev/sdb1      ext2      985G  265G  670G  29% /storage
tmpfs          tmpfs     744M     0  744M   0% /run/user/1011

What would be the appropriate way of repairing/removing that file? I would prefer repairing it, but removing will do as well. I already ran:

debugfs -w /dev/sdb1

In which I typed:

clri home/steam/serverfiles/server/rustserver/proceduralmap.3000.1499245715.149.sav

I understand, from what I could find on the web, that I would need to run e2fsck, but I understand I would need to unmount the drive first. I wouldn’t want to do that just for this one file, if possible.


Solution :

What’s up with the “structure needs cleaning” error message

“Structure needs cleaning” is the error that file systems (in particular ext4 and xfs) return when they have detected file system corruption. Unfortunately, the only safe way to repair the corruption is to unmount the disk and run e2fsck on the file system. (Technically, you won’t need the -f option: the file system has already detected problems and marked itself as being in trouble, so e2fsck will do a full scan to fix those issues without being forced to.)
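A minimal sketch of that sequence, assuming the mount point /storage and the device /dev/sdb1 from the question, root privileges, an /etc/fstab entry so that the file system can be remounted by mount point alone, and the game server stopped beforehand. The function name repair_fs is made up for illustration.

```shell
# Sketch only: unmount, check, and remount one file system.
# Usage (as root): repair_fs /storage /dev/sdb1
repair_fs() {
    mountpoint=$1
    device=$2

    # Stop here if the file system cannot be unmounted (e.g. files still open).
    umount "$mountpoint" || return 1

    # No -f: the file system is already flagged as having errors,
    # so e2fsck performs a full check on its own.
    status=0
    e2fsck "$device" || status=$?

    # e2fsck exit codes below 4 mean "clean" or "errors corrected";
    # 4 and above mean something was left unfixed, so do not remount.
    if [ "$status" -lt 4 ]; then
        mount "$mountpoint"   # assumes an /etc/fstab entry for the mount point
    fi
    return "$status"
}
```

Run without extra options, e2fsck asks before each fix, which is usually what you want on a file system that holds data you care about.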

Reports of file system corruption

You should also be able to see the reports of file system corruption by looking at the kernel logs (e.g., by running dmesg, or looking at /var/log/kern.log or wherever your syslog or journald has been configured to log kernel messages). You should see messages that begin EXT4-fs error (device sdXX). For example:

EXT4-fs error (device sda3): ext4_lookup:1602: inode #37005: comm docker: deleted inode referenced: 31872136
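As a rough sketch, a small filter can pull just these lines out of whatever source holds your kernel messages. The function name fs_error_lines is hypothetical, and the pattern only covers the "EXT4-fs error (device ...)" prefix shown above.

```shell
# Keep only ext4 error reports from kernel log text read on stdin.
fs_error_lines() {
    grep 'EXT4-fs error (device '
}

# Possible uses (dmesg may need root on some systems):
#   dmesg | fs_error_lines
#   fs_error_lines < /var/log/kern.log
```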

You can also see indications of errors by looking at dumpe2fs -h on the file system. Fields of interest:

FS Error count:           25

This means that the kernel has found file system inconsistencies 25 times.

First error time:         Thu Jan  1 12:19:59 2015
First error function:     ext4_ext_find_extent
First error line #:       400
First error inode #:      95223833
First error block #:      0

The first error was found on January 1, 2015, at the specified time. The error function and line # allow you to identify exactly which part of the kernel code found the problem. The inode # tells you which inode was involved in the file system inconsistency.

Last error time:          Wed Feb  4 11:57:05 2015
Last error function:      ext4_ext_find_extent
Last error line #:        400
Last error inode #:       95223833
Last error block #:       0

This tells you the most recent time the kernel found a file system inconsistency. The large gap between the two times means that nobody has been reading the kernel messages, because every 24 hours ext4 logs warnings that there is a file system with corruption. Those kernel messages look like this:

EXT4-fs (dm-0): error count since last fsck: 12
EXT4-fs (dm-0): initial error at time 1441536566: ext4_dirty_inode:4655
EXT4-fs (dm-0): last error at time 1441537273: ext4_remount:4550

Note: the times in the kernel messages are the number of seconds since midnight UTC on January 1, 1970. You can convert them to a more human-readable form using the date command, for example:

% date -d @1441536566
Sun Sep  6 06:49:26 EDT 2015
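To convert the timestamps in place rather than one at a time, something like the following sketch could be used. It assumes GNU date (for -d @...), prints times in UTC rather than local time, and only handles the "at time <seconds>:" form shown in the messages above. The function name humanize_ext4_times is made up for illustration.

```shell
# Rewrite "at time <epoch>:" in ext4 summary lines as a readable UTC time.
# Lines without such a timestamp are passed through unchanged.
humanize_ext4_times() {
    while IFS= read -r line; do
        epoch=$(printf '%s\n' "$line" |
            sed -n 's/.* at time \([0-9][0-9]*\):.*/\1/p')
        if [ -n "$epoch" ]; then
            when=$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC')
            printf '%s\n' "$line" | sed "s/at time $epoch:/at $when:/"
        else
            printf '%s\n' "$line"
        fi
    done
}

# Possible use:
#   dmesg | humanize_ext4_times
```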

What to do when you become aware your file system is corrupt

You really don’t want to run with file system inconsistencies, since that can lead to more data loss. It’s really a good idea to jump on these reports, schedule downtime if necessary, and fix them ASAP.
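One way to keep an eye on this is to look at the error fields in the dumpe2fs -h output described earlier. The sketch below filters that output down to the error count and the first/last error fields. The function name fs_error_summary is hypothetical, and /dev/sdb1 is the device from the question.

```shell
# Keep only the error-related superblock fields from `dumpe2fs -h` output
# read on stdin. Prints nothing if no errors have been recorded.
fs_error_summary() {
    grep -E '^(FS Error count|(First|Last) error [^:]*):'
}

# Possible use (as root; the version banner goes to stderr):
#   dumpe2fs -h /dev/sdb1 2>/dev/null | fs_error_summary
```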

Why did e2fsck complain the device was in use after I unmounted it?

Finally, in answer to your question: “I ran fsck after unmounting and I get the following error: /dev/sdb1 is in use. Any ideas?” That’s probably because you have one or more processes in an alternate mount namespace, and those processes still have /dev/sdb1 mounted in that mount namespace. You might want to try:

grep /dev/sdb1 /proc/*/mounts

If you find processes running in an alternate mount namespace, the simplest thing to do is to kill and restart those processes. (They are probably daemon processes.) When the last process using a mount namespace exits, the mount namespace goes away. Once no mount namespace has /dev/sdb1 mounted any longer, the file system will actually be unmounted.
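To see which processes those are, rather than just which /proc files match, a sketch along these lines prints the PID and command name of every process whose mount table still lists the device. The function name holders_of is made up for illustration. Run it as root so that the mount tables of other users' processes are readable.

```shell
# Print "<pid><TAB><command>" for each process whose mount table
# (as seen in /proc/<pid>/mounts) has the given device mounted.
holders_of() {
    dev=$1
    for m in /proc/[0-9]*/mounts; do
        # Field 1 of each mounts line is the device; match it exactly.
        if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' \
                "$m" 2>/dev/null; then
            pid=${m#/proc/}
            pid=${pid%/mounts}
            printf '%s\t%s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
        fi
    done
}

# Possible use:
#   holders_of /dev/sdb1
```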

The way to think about this is that umount acts like unlink. If you have a file with multiple hard links, the space is only released when the last hard link is deleted. If you have multiple namespaces active, each namespace effectively acts as a “hard link” to the mount in question. So merely unmounting the file system in the default mount namespace won’t help if some process has forked the default mount namespace and is running itself, and possibly some child processes, in that copy-on-write copy of its parent mount namespace.
