5/31/2023

Mdadm check disk health

Note that this assumes a disk with no data (or no data you care to preserve) on it; see my answer to “Can I fix bad blocks on my hard disk with a single command” for what to do if you have important data on the disk.

Disks made since at least the late 90s manage bad blocks themselves. In brief, a disk will handle a bad block by transparently replacing it with a spare sector. It will do so if (a) while reading, it discovers the block is "weak", but ECC is enough to recover the data; (b) while writing, it discovers the sector header is bad; or (c) while writing, a read previously detected the sector as bad, but the data was not recoverable.

The disk firmware typically lets you monitor this process (the counts at least) via SMART attributes. Typically there will be at least a count of reallocated sectors and two counts of pending sectors (discovered bad on read, ECC failed, not yet written to).

There are two ways to get the disk to notice bad sectors:

1. Use smartctl -t offline /dev/sdX to tell the disk firmware to do an offline surface scan. You then just leave the disk alone (completely idle will be fastest) until it's done (check the "Offline data collection status" in smartctl -c /dev/sdX). This will typically update the "offline uncorrectable" count in SMART. (Note: drives can be configured to automatically run an offline check routinely.)

2. Have Linux read the entire disk, e.g., badblocks -b 4096 -c 1024 -s /dev/sdX. This will typically update the "current pending sector" count in SMART.

Either of the above may also increase the reallocated sector count; this is case (a), where the ECC recovered the data.

Now, to recover the sectors you just need to write to them. Normally, that'd be a simple pv -pterba /dev/zero > /dev/sdX (or just plain cat, or dd), but you plan to make these part of a RAID array. The RAID init will write to the entire disk anyway, so that's pointless. The only exception is the beginning and end of the disk: it's possible a few tens of megabytes will be missed (due to alignment, headers, etc.). Blanking the first and last 128MiB of the disk takes care of that; when working out where the last 128MiB starts, watch out for the all-too-easy fencepost error [1]. It's harmless (except for trivial wear, and wasting hours of time) to zero the whole disk if you'd like to, though.

Another thing to do, if your disks support it: smartctl -l scterc,40,100 /dev/sdX (or whatever numbers) to tell the disk that you want it to give up on correcting errors quicker. The values are in tenths of a second, so 40 would be 4 seconds. The two numbers are for read errors and write errors. mdraid will easily correct read errors via parity (and write the failed sector back to the disk to let it reallocate). Write errors, though, will fail the disk out of the array.

PS: Make sure to keep an eye on the reallocated sectors count. That attribute going to failed is bad news. And if it's continuously increasing, that's bad news too.

PPS: Make sure your RAID arrays are scrubbed (every sector read and all the parity verified) routinely. Many distros already ship a script that does this monthly. This will detect and repair any new bad blocks, as otherwise seldom-read bad blocks can linger and ultimately cause rebuild failure.

[1] Fencepost error: a type of off-by-one error from failing to count one of the ends.
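
The post talks about blanking the first and last 128MiB of the disk but the commands themselves are not shown here, so what follows is only a rough sketch of one way to do it, not the original commands. It assumes GNU dd and util-linux blockdev are available; the function names tail_start_sector and wipe_ends are made up for illustration, and /dev/sdX is a placeholder. Working in 512-byte sectors (the unit blockdev --getsz reports) keeps the tail offset exact, which is where the fencepost mistake usually creeps in.

```shell
#!/bin/sh
# Sketch only: zero the first and last 128 MiB of a disk holding no data you need.
# Function names are hypothetical; nothing runs until wipe_ends is called.

SECTOR=512                                        # blockdev --getsz counts 512-byte sectors
WIPE_SECTORS=$(( 128 * 1024 * 1024 / SECTOR ))    # 128 MiB expressed in sectors

# Print the first sector of the tail region for a disk of $1 sectors.
# The tail covers sectors [total - WIPE_SECTORS, total - 1]; being off by
# one sector in either direction is the fencepost error to avoid.
tail_start_sector() {
    total=$1
    if [ "$total" -le "$WIPE_SECTORS" ]; then
        echo 0                                    # disk no larger than the wipe size
    else
        echo $(( total - WIPE_SECTORS ))
    fi
}

# Zero both ends of the device given as $1. Destructive; needs root.
wipe_ends() {
    disk=$1
    total=$(blockdev --getsz "$disk") || return 1
    start=$(tail_start_sector "$total")
    # Head: the first 128 MiB.
    dd if=/dev/zero of="$disk" bs=1M count=128 || return 1
    # Tail: from start through the very last sector of the device.
    dd if=/dev/zero of="$disk" bs="$SECTOR" seek="$start" \
        count=$(( total - start )) || return 1
}
```

To use it, source the file as root and run wipe_ends /dev/sdX, after double-checking the device name, since both dd calls overwrite data without asking.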
