I have a 4-disk RAIDZ pool running on FreeNAS 9.1. Earlier I noticed that one of the drives had been taken offline. I ran smartctl, and all the disks passed SMART. A few days ago the machine was powered off while a Windows client was writing to it (I don't have a UPS yet). I thought that might have corrupted the data, so I put the drive back online from the command line. The drive remained online without incident.

So I decided to run a scrub to check for errors. The scrub gets through ~50% of the pool before the system takes the drive offline on its own. No CKSUM errors are reported, and smartctl still shows all drives passing. zpool status reports that the drive was taken offline by the administrator. I obviously haven't taken the drive offline myself, so I assume FreeNAS is doing it for some reason.

The drive appears fine, with no checksum errors. Why would FreeNAS be taking it offline? Are there any other statistics I should check on the disk to determine whether this is a true failure?

  • A quick edit: I have run the 'fast' SMART test, and all disks in the machine pass. The long test is running now. None of the disks have any reallocated sectors. All disks are 8TB SMR drives; is it possible ZFS is freaking out about this?
    – James
    Feb 8, 2018 at 9:58
  • In theory SMR could cause such behaviour (see also my answer), but then you would experience it on any disk and much earlier, because the performance penalty/write delay is the same on all disks if you added them at the same time.
    – user121391
    Feb 8, 2018 at 10:57

1 Answer

It is possible that the disk tries to read a sector and does not get a result within its TLER (time-limited error recovery) window, which can be anything from a few seconds to several minutes. Instead of retrying for longer (and possibly stalling the whole array), it tells the controller to drop it, or the controller assumes it is dead and drops it, whichever comes first. This results in the OFFLINE status.
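To see which error recovery window a disk is currently using, smartctl can report the SCT Error Recovery Control setting with `smartctl -l scterc /dev/ada0` on drives that support it. Below is a minimal Python sketch for reading that report. It assumes the output layout shown in the sample, which may differ between smartmontools versions; `parse_scterc` is a hypothetical helper name, and the sample text is illustrative rather than taken from the disks in the question.

```python
import re

# Illustrative text in the layout smartctl typically prints for `-l scterc`.
# This is an assumed example, not output captured from the asker's disks.
SAMPLE = """\
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""

def parse_scterc(text):
    """Return the read/write recovery limits in seconds.

    The drive reports the values in units of 100 ms. A value of None
    means the limit is disabled or the line was not found.
    """
    result = {"read": None, "write": None}
    for line in text.splitlines():
        m = re.match(r"\s*(Read|Write):\s+(\d+|Disabled)", line)
        if m:
            key = m.group(1).lower()
            value = m.group(2)
            result[key] = None if value == "Disabled" else int(value) / 10.0
    return result

print(parse_scterc(SAMPLE))
```

If both values come back as `None`, the drive either has no time limit configured or does not support the command, and the raw smartctl output should be checked by hand.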

To solve it, you can adjust the timeout on both the disk and the controller and then run another scrub (which might take longer and stall the system temporarily) to detect and fix any unreadable sectors; or you can remove the disk and run the same procedure on another system, putting it back afterwards; or you can simply remove and replace it.
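The first option could look roughly like the following on FreeBSD/FreeNAS. This is only a sketch: it builds and prints the commands instead of running them. The pool name `tank`, the device `ada3`, and both timeout values are placeholders, and `kern.cam.ada.default_timeout` is assumed to be the relevant controller-side timeout for disks attached through the `ada` driver. The `scterc` values passed to smartctl are in units of 100 ms.

```python
def scrub_retry_plan(pool, device, disk_erc_ds, cam_timeout_s):
    """Build (but do not run) the commands for a timeout change and re-scrub.

    pool, device      -- placeholder names, substitute your own
    disk_erc_ds       -- disk error recovery limit, in units of 100 ms
    cam_timeout_s     -- ada driver command timeout, in seconds
    """
    return [
        # Set the disk-side read and write recovery limits (if supported).
        ["smartctl", "-l", f"scterc,{disk_erc_ds},{disk_erc_ds}", f"/dev/{device}"],
        # Set the driver-side timeout (assumed tunable, see lead-in).
        ["sysctl", f"kern.cam.ada.default_timeout={cam_timeout_s}"],
        # Bring the dropped disk back and start a new scrub.
        ["zpool", "online", pool, device],
        ["zpool", "scrub", pool],
        # Check progress and per-device error counters.
        ["zpool", "status", "-v", pool],
    ]

for cmd in scrub_retry_plan("tank", "ada3", disk_erc_ds=70, cam_timeout_s=60):
    print(" ".join(cmd))
```

Printing the plan first makes it easy to review each command before typing it on the NAS. A value set with `sysctl` on the command line does not survive a reboot, so it would have to be set again (or added as a tunable) if it turns out to help.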

  • I've increased the timeout through the variable kern.cam.ada.default_timeout (I think this is the right one). I'll post again to say whether this solves the problem. To be safe, I shut the machine down and added a spare drive. If the disk goes offline again, I'll replace it and perform a more thorough check on it with my main machine. Thanks for the knowledge.
    – James
    Feb 8, 2018 at 11:09
