Is SMART still a thing? Scrubs alone enough?

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
Hi everyone
I recently bought an empty TrueNAS Mini X+ and inserted two disks.
I noticed that SMART is disabled by default. Which begs the question for me, do we still need SMART?
Do we see checksum errors with scrubs anyway so there is no need for SMART?
If you should not use SMART during scrub or during resilver, do you disable SMART before replacing a disk and turn it back on after the resilver?
Appreciate your opinions.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I noticed that SMART is disabled by default. Which begs the question for me, do we still need SMART?
Do we see checksum errors with scrubs anyway so there is no need for SMART?
The two processes look at entirely different things.

A scrub only looks at the data and checksums on the disk, so no checking of free space.

A long SMART test looks at every block on the disk (used or not) and a bunch of other things too.

Scrubs are for data integrity, SMART tests are a proactive disk health check.

In my opinion, both are most certainly required for best-practice pool health.

If you should not use SMART during scrub or during resilver, do you disable SMART before replacing a disk and turn it back on after the resilver?
That's probably right, although it would not be a disaster if a SMART test did run during a resilver (the resilver would take longer though, so there's that increased risk).
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
I have read some horror stories about SMART messing around with backplanes and HDD vendor errors :wink:
What do you think about my co-worker's statements?

- SMART is broken and does not really work. Only a small percentage will be detected, and these will be detected shortly after by scrub anyway
- A scrub will detect an HDD error more often than SMART will. In the last 10 years he has only had checksum errors that were not detected by SMART
- It adds an additional and unnecessary risk (broken implementations).
- Because you use a form of RAID, it is not really necessary to proactively detect a bad drive. You can just replace a drive when it throws checksum errors. Pre-fail detection is only important in your single-SSD PC or laptop.

In my opinion, both are most certainly required for best-practice pool health.
That is what I think, but it not being enabled by default (compared to scrubs) bewilders me.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I have read some horror stories about SMART messing around with backplanes and HDD vendor errors :wink:
What do you think about my co-worker's statements?
I don't know how SMART would break a backplane... SMART tests are run internally by the disk's own onboard controller chip, so the backplane shouldn't even figure into the test.
I don't know what is meant by HDD vendor errors. There are indeed differences in SMART codes between vendors, but there's a database that helps to interpret them.

- SMART is broken and does not really work. Only a small percentage will be detected, and these will be detected shortly after by scrub anyway
Complete BS. Errors in un-used blocks won't be found by a scrub.

- A scrub will detect an HDD error more often than SMART will. In the last 10 years he has only had checksum errors that were not detected by SMART
Was he doing both long and short SMART tests properly? I am guessing not.

It adds an additional and unnecessary risk (broken implementations).
I don't agree. HDD vendors can be held accountable by RMA if it doesn't work.

Because you use a form of RAID, it is not really necessary to proactively detect a bad drive. You can just replace a drive when it throws checksum errors. Pre-fail detection is only important in your single-SSD PC or laptop.
If you're using RAIDZ2 and have easy/fast physical access to the system, maybe it's an OK approach given that you can keep a spare on hand and do a replacement more-or-less immediately on a failure without a lot of risk. Without those 2 conditions met, you're just asking for a difficult life of lost data and/or backup restoration.

it not being enabled by default (compared to scrubs) bewilders me.
I think this may be some kind of error, not an intentional thing... anyway you need to intervene to set the schedule, so perhaps it's more a reference to that.
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
Thank you very much for your answers. You convinced me to keep my SMART schedules active :cool:
 
Joined
Jul 2, 2019
Messages
648
There is one SMART "issue": it seems that SAS drives do not report SMART consistently between drive makes/models. There are a couple of threads on this; for example, "smartctl attributes not displayed for SAS drives?"

--Edit: Note that "SCSI/SAS and NVMe drives do not provide ATA/SATA-like SMART Attributes. "
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
Note that "SCSI/SAS and NVMe drives do not provide ATA/SATA-like SMART Attributes. "
Fair point.

It's very much the case for SATA though (which is what I was talking about, so reconsider my advice in that context).
 
Joined
Jul 2, 2019
Messages
648
@sretalla - definitely agree. I just wanted to add that info for others reading the thread.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
- SMART is broken and does not really work. Only a small percentage will be detected, and these will be detected shortly after by scrub anyway
Well, the scrub part is B.S., but there could be some truth to "only a small percentage will be detected". Let's understand this more:
SMART was designed as an attempt to warn the user up to 24 hours prior to a pending failure of the hard drive from the time it was tested. So if you run a SMART Long test right now, if failures show up then the hard drive is predicting a catastrophic failure coming very soon. What SMART can't do is tell you if the spindle motor is going to fail on the next power up or if you have an infant mortality with some of the hardware. SMART will do some basic circuit checks, make sure the rotational timing is good, and do a surface read test to ensure the mechanical aspects are good.

So SMART is not foolproof, nor is it perfect, but it is what the hard drive manufacturers provide to let end users diagnose failures and, ultimately, to RMA a hard drive.

All hard drives and, as far as I know, all SSDs and NVMe drives should support SMART, but my colleagues are correct: reporting of SMART attributes may not work so well on some drive interfaces.

My last comment is that everyone should run SMART testing. My recommended frequency is a SMART Short test once a day (it only takes a few minutes to run) and a SMART Long/Extended test once a week, if possible when your system is not heavily used. If it makes more sense, you could stagger the Long tests across one or two drives a day, but do run those Long tests. Scrubs do not make up for a SMART test.
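To make that schedule concrete, here is a minimal Python sketch of one way to plan it. It only builds a 7-day plan and does not start any tests. The drive names, the `plan_tests` function and the `per_day` limit are made up for illustration; nothing here comes from TrueNAS itself.

```python
from typing import Dict, List


def plan_tests(drives: List[str], per_day: int = 2) -> Dict[int, Dict[str, List[str]]]:
    """Build a 7-day plan: a Short test for every drive every day, and one
    Long test per drive per week, spread round-robin over the days."""
    if len(drives) > 7 * per_day:
        raise ValueError("too many drives for a weekly Long-test rotation")
    plan = {day: {"short": list(drives), "long": []} for day in range(7)}
    for index, drive in enumerate(drives):
        # Round-robin over the week keeps the Long tests spread out.
        plan[index % 7]["long"].append(drive)
    return plan


# Hypothetical five-drive pool, at most two Long tests on any one day.
plan = plan_tests(["ada0", "ada1", "ada2", "ada3", "ada4"], per_day=2)
for day, tests in plan.items():
    print(day, "short:", tests["short"], "long:", tests["long"])
```

With five drives this puts one Long test on each of the first five days and leaves two days with Short tests only. You would still enter the resulting days in the S.M.A.R.T. test schedule by hand.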
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
Thank you all for your input! Just this morning, I got two SMART errors in my home pod.

8 Currently unreadable (pending) sectors

1 Offline uncorrectable sectors.


Which seems strange to me, because that is a 3 month old Toshiba NAS drive. I have way worse disks in my pool, I have got some old Seagate Archive disks for free, they die like flies :wink:

Anyway, thanks a lot guys, I am off to change some disks.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I have got some old Seagate Archive disks for free, they die like flies
I think those might be SMR... ZFS won't play well with them.
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
You are correct, they are SMR. But because I started with them empty and did not use them for resilver, SMR was not that bad.
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,700
I started with them empty and did not use them for resilver, SMR was not that bad.
But scrubs won't have gone well either...
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Thank you all for your input! Just this morning, I got two SMART errors in my home pod.

8 Currently unreadable (pending) sectors

1 Offline uncorrectable sectors.
If you post the FULL output of smartctl -a /dev/ada0, where ada0 is the drive reporting the errors, we can give you some advice. And I say the full output because some folks like to give us what they feel is good enough, but often they leave out important data.
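For anyone who wants to check those two counters themselves, here is a rough Python sketch that pulls them out of the attribute table in `smartctl -a` text output. The sample rows follow the usual ATA attribute layout, with the values from the alert above. The watch list (IDs 5, 197, 198) is just my assumption of a sensible set, not an official one, and `nonzero_sector_counters` is a made-up helper name.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
"""

# Assumption: attribute IDs worth watching for bad-sector trouble.
WATCHED = {5, 197, 198}


def nonzero_sector_counters(text):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id, name, raw = int(fields[0]), fields[1], fields[9]
            if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
                found[name] = int(raw)
    return found


print(nonzero_sector_counters(SAMPLE))
```

This only reads the numbers; it does not replace posting the full output, because the self-test log and error log matter just as much.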
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
@sretalla Maybe I did not notice it during scrubs because they ran during the night.

@joeschmuck of course, I would love it if you could have a look at it.
Here you go:
Code:
smartctl 7.1 2019-12-30 r5022 [FreeBSD 12.2-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA HDWN180
Serial Number:    deleted
LU WWN Device Id: 5 000039 94ba80540
Firmware Version: GX2M
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb 11 16:17:53 2021 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 744) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       10580
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       11742
 10 Spin_Retry_Count        0x0033   101   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       90
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       45 (Min/Max 12/48)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   071   071   000    Old_age   Always       -       11721
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       553
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11739         -
# 2  Extended offline    Completed: read failure       00%     11718         2431815016

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
Which seems strange to me, because that is a 3 month old Toshiba NAS drive.
11742 hours = 489 days, a bit more than 3 months.
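The conversion is easy to redo for your own drive. A quick Python sketch, using the Power_On_Hours raw value from the output above; the 30.44-day average month is my own rounding assumption:

```python
power_on_hours = 11742          # raw value of attribute 9 (Power_On_Hours)

days = power_on_hours / 24      # 489.25 days
months = days / 30.44           # assumption: average month length in days

print(f"{days:.0f} days, about {months:.0f} months")
```

That works out to roughly 16 months of spinning.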

# 2 Extended offline Completed: read failure 00% 11718 2431815016
Run another SMART Long/Extended test. If it fails again, I highly recommend you replace the drive. If it's still under warranty, a failed Long test is grounds for an RMA.
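If you want to check the self-test log after the rerun without eyeballing it, something like the following Python sketch could work. It parses the self-test log rows as they appear in `smartctl -a` text output (the two sample rows are the ones quoted above). Treating any status containing "failure" as a failed test is my assumption, and `failed_self_tests` is a made-up helper name.

```python
import re

SAMPLE_LOG = """\
# 1  Short offline       Completed without error       00%     11739         -
# 2  Extended offline    Completed: read failure       00%     11718         2431815016
"""


def failed_self_tests(text):
    """Return (test_type, lifetime_hours, first_error_lba) for failed tests."""
    failures = []
    for line in text.splitlines():
        if not line.startswith("#"):
            continue
        # Columns are separated by runs of two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) != 6:
            continue
        _, description, status, _, hours, lba = cols
        if "failure" in status:  # assumption: failed runs say "... failure"
            first_lba = int(lba) if lba.isdigit() else None
            failures.append((description, int(hours), first_lba))
    return failures


for test_type, hours, lba in failed_self_tests(SAMPLE_LOG):
    print(f"{test_type} test failed at {hours} power-on hours, first bad LBA {lba}")
```

A second Extended row with a read failure after the rerun would be the signal to start the replacement.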

Good Luck!
 

Jamberry

Contributor
Joined
May 3, 2017
Messages
106
Haha, you are right. To be honest, when I opened up my case, I was totally surprised that I have two Toshiba drives in there and that all my Seagate drives are gone :smile: But it still has 1 year of warranty.
 

Matt_G

Explorer
Joined
Jan 24, 2016
Messages
65
Jamberry, you should check your cooling as well. That drive has been up to 48 degrees Celsius and was at 45 when the test was run.
Many would consider that too hot.
Ideally, you don't want them going above the mid thirties.
Temps above 40C may shorten the life of the drive.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,996
As for drive temps, that is a 7200 RPM drive, and those run hotter. My 7200 RPM drives run at 32C to 34C at idle, depending on the location within the case. My alarm setpoint is 43C and I have not hit that value during a scrub or extended test, but I have come close. Proper airflow is required to keep the drives cool. Of course the ambient air temperature should be cool as well. So as @Matt_G said, you should check into your cooling to see if you can reduce the drive temps. I have made simple modifications that used only some tape and cardboard to direct airflow better. But when you make changes like this, pay attention to the other computer parts to ensure you are not starving them of cooling air.
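As an illustration of checking a drive against a setpoint like that, here is a small Python sketch that parses the Temperature_Celsius raw value in the form this Toshiba reports it ("45 (Min/Max 12/48)"). Other drives format the raw value differently, so the pattern is an assumption tied to this output, and `parse_temperature` is a made-up helper name.

```python
import re

ALARM_C = 43  # the alarm setpoint mentioned above


def parse_temperature(raw):
    """Parse a raw value such as '45 (Min/Max 12/48)' into (current, low, peak)."""
    match = re.fullmatch(r"(\d+) \(Min/Max (\d+)/(\d+)\)", raw.strip())
    if match is None:
        raise ValueError(f"unexpected temperature format: {raw!r}")
    current, low, peak = (int(group) for group in match.groups())
    return current, low, peak


current, low, peak = parse_temperature("45 (Min/Max 12/48)")
print("over setpoint now:", current > ALARM_C)
print("has been over setpoint:", peak > ALARM_C)
```

For the drive in this thread both checks come out true, which matches the advice to look at the cooling.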
 

millst

Contributor
Joined
Feb 2, 2015
Messages
141
Jamberry, you should check your cooling as well. That drive has been up to 48 degrees Celsius and was at 45 when the test was run.
Many would consider that too hot.
Ideally, you don't want them going above the mid thirties.
Temps above 40C may shorten the life of the drive.

I have heard this before and it makes sense intuitively, but is there any real evidence to back this recommendation? The only studies I've seen show that heat isn't an issue with drive life.
 