SMART Rewriting Bad Sectors
Want to know how to find what file is associated with a sector and what inode it has?
How do you get rid of those "pending sector" errors that smartd reports when they never seem to go away?
What the hell is a pending sector anyway?
There are two primary ways to discover the specific bad sector(s) on a drive using SMART: the SMART error log and a SMART self-test. Once you have the sector number, you can either force a write to that sector or work out which file occupies it and may be corrupt. If you can identify the file, you can attempt to move it and then nuke the sector from orbit.
This article is a summary of how to do this. It is based heavily on these two articles, but is a condensation of that information.
For more information on SMART in general, see the article SMART Drive Monitoring.
SMART Attributes: Pending Sectors
If a read error occurs on a sector, the error is recorded in the SMART log and the sector is marked pending in the attribute list. It remains pending until something writes to it: if the write succeeds the sector goes back into service, and if the write fails the drive reallocates it to a spare sector.
Pending sectors can generate error messages and the like. You can force a write to them to clear or reallocate them, but doing so will nuke whatever file data lives in that sector. Hummm . . . a conundrum.
The sector number of the last read errors will be given in the error section of the smart logs, for example:
Error 3 occurred at disk power-on lifetime: 14511 hours (604 days + 15 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 48 b8 e2 e0  Error: UNC 8 sectors at LBA = 0x00e2b848 = 14858312
(Some drives will only give the value in hex; you would then have to convert it to decimal.)
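The hex-to-decimal conversion can be done in the shell itself. This is a minimal sketch; the helper name hex_to_dec is made up for illustration, and it relies only on standard shell arithmetic.

```shell
# hex_to_dec HEX
# Convert a hexadecimal LBA (with or without a leading 0x) to decimal.
hex_to_dec() {
    # Strip an optional 0x prefix, then let shell arithmetic parse the hex constant.
    echo $(( 0x${1#0x} ))
}

# The LBA from the error log above:
hex_to_dec 0x00e2b848
```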
The error log only shows the last five errors, which may refer to the same or to different sector numbers. If it contains only one sector yet the attribute list shows more than one pending, you will most likely have to repair the first one, then run a SMART read test, rinse and repeat.
Errors other than read errors are recorded too, so if no sector is given here, examine the SMART read test results.
SMART Read Test Failure
Another method of locating the bad sector number is to run a smart read test, short or extended, with smartctl. If it fails it will report the sector number. You can run either of these tests while the server is up. Unless the server has a very high I/O it will not affect performance. The command to run the test is:
# smartctl -t short /dev/hda
Upon completion, check the test results section of the SMART log (smartctl -l selftest /dev/hda) and you will see entries giving the bad sector, such as:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%     14577         232962120
# 2  Short offline       Completed: read failure       10%     14504         232962120
# 3  Conveyance offline  Completed without error       00%      1501         -
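If the log is long, the failing LBAs can be pulled out with a small filter. This is only a sketch: first_error_lbas is a hypothetical helper name, and it assumes the log layout shown above, where each test line starts with "#" and the last column is either an LBA or "-".

```shell
# first_error_lbas
# Read a SMART self-test log on stdin and print each distinct
# LBA_of_first_error once, in numeric order.
first_error_lbas() {
    # Keep only test lines ("# 1 ...") whose last field is a number;
    # tests that completed without error end in "-" and are skipped.
    awk '/^#/ && $NF ~ /^[0-9]+$/ { print $NF }' | sort -un
}

# Possible usage (as root, on the drive from this article):
#   smartctl -l selftest /dev/hda | first_error_lbas
```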
Calculating the File System Sector
Because the sector numbers SMART reports are logical block addresses (LBA) counted from the start of the drive, you need to convert the LBA into a file system block number within the partition. This is pretty easy to do:
First, get a table of the start and end sectors of each partition:
# fdisk -lu /dev/hda

Disk /dev/hda: 120.0 GB, 120034123776 bytes
255 heads, 63 sectors/track, 14593 cylinders, total 234441648 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1   *          63      208844      104391   83  Linux
/dev/hda2          208845     4401809     2096482+  83  Linux
/dev/hda3         4401810     8482319     2040255   82  Linux swap
/dev/hda4         8482320   234436544   112977112+   5  Extended
/dev/hda5         8482383    29447144    10482381   83  Linux
/dev/hda6        29447208    50411969    10482381   83  Linux
/dev/hda7        50412033    52516484     1052226   83  Linux
/dev/hda8        52516548   234436544    90959998+  83  Linux
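Use this to determine which partition the bad sector is in. In this case 14858312 (the LBA from the error log) falls between the start and end values for /dev/hda5. (The LBA from the self-test log, 232962120, would fall in /dev/hda8.)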
NOTE: This is in partition 5 - ignore partition 4 as it is the extended partition. Any block from partitions 5 through 8 will also be in partition 4, but you want the real partition, not the extended partition.
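The lookup can also be scripted rather than done by eye. The following is a sketch under stated assumptions: find_partition is a hypothetical helper, and it expects the older fdisk column layout shown above (Device, optional Boot flag, Start, End, Blocks, Id, System). Newer fdisk versions print different columns and would need the field numbers adjusted.

```shell
# find_partition LBA
# Read `fdisk -lu` output on stdin and print the partition whose
# Start..End range contains LBA, skipping extended containers.
find_partition() {
    awk -v lba="$1" '
        $1 ~ /^\/dev\// {
            # The boot flag "*" shifts every later column by one.
            if ($2 == "*") { first = $3; last = $4; id = $6 }
            else           { first = $2; last = $3; id = $5 }
            # Ids 5, f and 85 are extended partitions: containers, not file systems.
            if (id == "5" || id == "f" || id == "85") next
            if (lba + 0 >= first + 0 && lba + 0 <= last + 0) print $1
        }'
}

# Possible usage (as root):
#   fdisk -lu /dev/hda | find_partition 14858312
```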
Next, calculate the file system block using the formula:
b = (int)((L-S)*512/B)

where:
b = file system block number
B = file system block size in bytes (almost always 4096)
L = LBA of the bad sector
S = starting sector of the partition as shown by fdisk -lu
and (int) denotes the integer part.
The LBA reported in the SMART error log above is 14858312, which lies in /dev/hda5 (starting sector 8482383), thus:
((14858312 - 8482383) * 512) / 4096 = 796991.125
  ^Bad Sec.  ^Start Sec.              ^Cha Ching! This is the file system block!
(When both are available, prefer the LBA from the SMART self-test log over the one in the error log. The worked example above uses the error-log LBA, 14858312.)
((BadBlock - StartPartition) * 512) / 4096
You can just paste this into Google as a template.
Any fraction left indicates the problem sector is in the mid or latter part of the block (which contains a number of sectors). Ignore the fraction and just use the integer.
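The same calculation works in plain shell arithmetic, which truncates to an integer on its own. This is a minimal sketch; the function names are made up for illustration, and the 4096-byte default is the article's "almost always" assumption, so check the real block size of your file system before relying on it.

```shell
# lba_to_fs_block LBA START [BLOCK_SIZE]
# b = (int)((L - S) * 512 / B); integer division drops the fraction.
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / ${3:-4096} ))
}

# sector_in_block LBA START [BLOCK_SIZE]
# Which 512-byte sector inside that block is the bad one (0 = first).
sector_in_block() {
    echo $(( ($1 - $2) % (${3:-4096} / 512) ))
}

# Values from the worked example above:
lba_to_fs_block 14858312 8482383 4096
sector_in_block 14858312 8482383 4096
```

For the example values the fraction .125 is one eighth of a block, so the bad sector is the second of the eight 512-byte sectors in block 796991.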
Next, use debugfs to locate the inode and then file associated with that sector:
# debugfs
debugfs 1.35 (28-Feb-2004)
debugfs: open /dev/hda5
debugfs: icheck 796991
Block   Inode number
796991  <block not found>
debugfs: quit
Ah! It didn't give the inode! If it did, you could have found the file with:
# debugfs
debugfs 1.35 (28-Feb-2004)
debugfs: open /dev/hda5
debugfs: icheck 796991
Block   Inode number
796991  41032
debugfs: ncheck 41032
Inode   Pathname
41032   /S1/R/H/714197568-714203359/H-R-714202192-16.gwf
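For scripting, the inode can be picked out of the icheck output with a small filter. This is a sketch only: inode_from_icheck is a hypothetical helper that assumes the two-column output shown above. debugfs accepts a single request on the command line with -R, so the interactive session is not strictly needed.

```shell
# inode_from_icheck
# Read `icheck` output on stdin and print the inode number, or nothing
# at all when debugfs answered "<block not found>".
inode_from_icheck() {
    # Only a line of the form "<block> <inode>" with two numbers counts;
    # the header and the not-found line fail the numeric test.
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Possible usage (as root, values from this article):
#   inode=$(debugfs -R "icheck 796991" /dev/hda5 2>/dev/null | inode_from_icheck)
#   [ -n "$inode" ] && debugfs -R "ncheck $inode" /dev/hda5
```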
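So what the heck? Why no inode? icheck answers <block not found> when no inode owns the block, which most often means the block is not allocated to any file (free space, for instance). In that case there is no file to rescue before dealing with the sector.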
Banishing Evil Sectors
If you want to check it with a read test try:
dd if=/dev/hda5 of=my.block skip=796991 bs=4096 count=1
If it gives an error, then you know for sure it is bad.
As far as I know, that sector is toast at this point.
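The read test can be wrapped so that it reports the result instead of leaving a my.block file behind. This is a sketch, not a definitive check: read_block is a hypothetical helper, it discards the data to /dev/null, and it only looks at the exit status of dd.

```shell
# read_block DEVICE BLOCK [BLOCK_SIZE]
# Try to read one file system block and report whether dd succeeded.
read_block() {
    if dd if="$1" of=/dev/null bs="${3:-4096}" skip="$2" count=1 2>/dev/null; then
        echo "block $2: readable"
    else
        echo "block $2: read error"
        return 1
    fi
}

# Possible usage (as root, values from this article):
#   read_block /dev/hda5 796991 4096
```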
If you want to force a write to it, destroying the data in it, use:
dd if=/dev/zero of=/dev/hda5 bs=4096 count=1 seek=796991
sync
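Because a wrong seek value destroys the wrong block, it may be worth rehearsing the write on a scratch file first. The helper below is a sketch with a made-up name, zero_block; conv=notrunc is added so that a regular file used as a target is not truncated after the block (it makes no difference on a block device).

```shell
# zero_block TARGET BLOCK [BLOCK_SIZE]
# Overwrite exactly one block of TARGET with zeros, then flush to disk.
# DESTRUCTIVE: whatever was in that block is gone.
zero_block() {
    dd if=/dev/zero of="$1" bs="${3:-4096}" seek="$2" count=1 conv=notrunc 2>/dev/null &&
    sync
}

# Rehearsal on a scratch file, then the real thing (as root):
#   zero_block /tmp/scratch.img 1 4096
#   zero_block /dev/hda5 796991 4096
```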
After a successful write, that sector should no longer be listed as pending in the SMART attributes. If the drive had to remap it, it will now show as reallocated instead.
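To confirm the result, compare the two relevant counters before and after the write. This is a sketch under assumptions: sector_counts is a hypothetical helper, it assumes the usual ten-column attribute table printed by smartctl -A with the raw value last, and it assumes the conventional attribute IDs 5 (reallocated sectors) and 197 (current pending sectors), which a given drive may not report.

```shell
# sector_counts
# Read `smartctl -A` output on stdin and print NAME=RAW_VALUE for the
# reallocated (ID 5) and pending (ID 197) sector attributes.
sector_counts() {
    awk '$1 == 5 || $1 == 197 { print $2 "=" $10 }'
}

# Possible usage (as root):
#   smartctl -A /dev/hda | sector_counts
```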