Problem :
I have a game server hosted on my Ubuntu Server 16.04 and out of the blue I cannot start/restart it because of the following file:
-????????? ? ? ? ? ? proceduralmap.3000.1499245715.149.sav
This seems to be the only file on the filesystem in this state.
Now, the server is a dedicated server purchased from a hosting provider.
The drive on which the file resides is an HDD attached via SCSI (/dev/sdb1).
The df -hT output:
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.7G     0  3.7G   0% /dev
tmpfs          tmpfs     744M   81M  663M  11% /run
/dev/sda4      ext4       21G   16G  4.7G  77% /
tmpfs          tmpfs     3.7G   24K  3.7G   1% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     3.7G     0  3.7G   0% /sys/fs/cgroup
/dev/sda3      ext4      946M  143M  739M  17% /boot
cgmfs          tmpfs     100K     0  100K   0% /run/cgmanager/fs
/dev/sdb1      ext2      985G  265G  670G  29% /storage
tmpfs          tmpfs     744M     0  744M   0% /run/user/1011
What would be the appropriate way of repairing/removing that file? I would prefer repairing it, but removing will do as well. I already ran:
debugfs -w /dev/sdb1
In which I typed:
clri home/steam/serverfiles/server/rustserver/proceduralmap.3000.1499245715.149.sav
From what I could find on the web, I understand that I would need to run e2fsck, but that I would need to unmount the drive first. I wouldn’t want to do that just for this one file, if possible.
Thanks!
Solution :
What’s up with the “structure needs clearing” error message
The error “structure needs clearing” is what file systems (in particular ext4 and xfs) return when they have detected a file system corruption problem. Unfortunately, the only safe way to repair the corruption is to unmount the disk and run e2fsck on the file system. (Technically, you won’t need the -f option, because the file system has already detected problems and has marked itself as being in trouble. So when you run e2fsck it will do a full scan to fix those issues, and you don’t need the -f option to force a check.)
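In practice the downtime can be quite short if the game server is the only thing using that disk. A minimal sketch of the sequence, assuming /storage (the mount point of /dev/sdb1 from the df output above) can be freed up:
# stop the game server first, then:
umount /storage
e2fsck /dev/sdb1     # no -f needed; the error flag in the superblock already forces a full check
mount /storage       # remount (assuming an /etc/fstab entry for /storage) and restart the game server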
Reports of file system corruption
You should also be able to see the reports of file system corruption by looking at the kernel logs (e.g., by running dmesg, or by looking at /var/log/kern.log or wherever your syslog or journald has been configured to send kernel messages). You should see messages that begin with EXT4-fs error (device sdXX), for example:
EXT4-fs error (device sda3): ext4_lookup:1602: inode #37005: comm docker: deleted inode referenced: 31872136
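A quick way to pull just these lines out of the kernel ring buffer or the log file (a sketch; use whichever log destination applies to your setup):
dmesg | grep 'EXT4-fs error'
grep 'EXT4-fs error' /var/log/kern.log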
You can also see indications of errors in the output of dumpe2fs -h on the file system (a one-liner for extracting these fields is sketched at the end of this section). Fields of interest:
FS Error count: 25
This means that the kernel has found file system inconsistencies 25 times.
First error time: Thu Jan 1 12:19:59 2015
First error function: ext4_ext_find_extent
First error line #: 400
First error inode #: 95223833
First error block #: 0
The first error was found on January 1, 2015, at the specified time. The error function and line # allow you to identify exactly which part of the kernel code found the problem. The inode # tells you which inode was involved in the file system inconsistency.
Last error time: Wed Feb 4 11:57:05 2015
Last error function: ext4_ext_find_extent
Last error line #: 400
Last error inode #: 95223833
Last error block #: 0
This tells you the most recent time the kernel found a file system inconsistency. A large delta between the two times means that someone hasn’t been scanning their kernel messages, because every 24 hours ext4 will log a warning that there is a file system with corruption, and those kernel messages will look like this:
EXT4-fs (dm-0): error count since last fsck: 12
EXT4-fs (dm-0): initial error at time 1441536566: ext4_dirty_inode:4655
EXT4-fs (dm-0): last error at time 1441537273: ext4_remount:4550
Note: the times in these kernel messages are the number of seconds since midnight UTC, January 1, 1970. You can convert this to a more human-readable time using the date command, for example:
% date -d @1441536566
Sun Sep 6 06:49:26 EDT 2015
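And if you just want to pull all of the error bookkeeping fields out of the superblock in one go, something like this works (a sketch, using the device from the question; dumpe2fs writes its version banner to stderr, hence the redirect):
dumpe2fs -h /dev/sdb1 2>/dev/null | grep -i error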
What to do when you become aware your file system is corrupt
You really don’t want to keep running with file system inconsistencies, since that can lead to further data loss. It’s a good idea to jump on these reports, schedule downtime if necessary, and fix them ASAP.
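If you can’t take the downtime right away, one way to limit further damage in the meantime is to tighten the file system’s error behaviour so the kernel remounts it read-only as soon as it next detects a problem (a sketch; this is an optional extra precaution, not a substitute for the fsck):
tune2fs -e remount-ro /dev/sdb1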
Why did e2fsck complain the device was in use after I unmounted it?
Finally, in answer to your question: “I ran fsck after unmounting and I get the following error: /dev/sdb1 is in use. Any ideas?” That’s probably because you have one or more processes in an alternate mount namespace, and those processes still have /dev/sdb1 mounted in that mount namespace. You might want to try:
grep /dev/sdb1 /proc/*/mounts
If you find processes running in an alternate mount namespace, the simplest thing to do is to kill and restart those processes. (They are probably daemon processes.) When the last process using a mount namespace exits, that mount namespace goes away. And once there are no more mount namespaces that have /dev/sdb1 mounted, the device will finally be unmounted for real.
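A sketch of chasing one of these down, assuming the grep above matched (say) /proc/1234/mounts, where 1234 is a hypothetical PID:
ls -l /proc/1234/ns/mnt            # a different namespace ID than /proc/1/ns/mnt confirms an alternate mount namespace
ps -p 1234 -o pid,user,comm,args   # identify which daemon it is, so it can be restarted cleanly
kill 1234                          # or restart it via its systemd unit / init script
grep /dev/sdb1 /proc/*/mounts      # once nothing matches any more, e2fsck /dev/sdb1 can proceed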
The way to think about this is that umount acts like unlink. If you have a file with multiple hard links, the space is only released when the last hard link is deleted. If you have multiple namespaces active, each namespace effectively acts as a “hard link” to the mount in question. So merely unmounting the file system in the default mount namespace won’t help if some process has forked off the default mount namespace and is running itself, and possibly some child processes, in that copy-on-write copy of its parent’s mount namespace.
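If you want to see that “hard link” behaviour for yourself, you can reproduce it with unshare (a sketch, run as root, assuming a util-linux new enough that unshare defaults to private mount propagation in the new namespace):
unshare -m sleep 600 &           # holds a private copy of the current mount namespace open
umount /storage                  # appears to succeed in the default namespace...
grep /dev/sdb1 /proc/*/mounts    # ...but the sleep process still lists the mount in its namespace, so the device stays busy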