Hey, ok I have followed through the steps up to the DIF Format command,
Data Integrity Field
DIF extends the disk sector from the traditional 512 bytes to 520 bytes by adding 8 protection bytes. You might also find disk sectors extended to 528 bytes by custom firmware. OEM-rebranded HDDs or SSDs from major storage vendors are plagued with this "enhancement". Linux does not support 520-byte sectors unless the drive is formatted with DIF and installed in a DIF-capable HBA.
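To put rough numbers on the 8 extra bytes per sector, here is a small sketch. It assumes 512 data bytes plus the 8 protection bytes described above; `raw_bytes` and `pi_overhead_bytes` are hypothetical helper names, not part of sg3_utils.

```shell
# Sketch: bytes occupied by N logical blocks of 512 data bytes
# when every block also carries 8 protection bytes (520 bytes total).
# raw_bytes is a hypothetical helper name.
raw_bytes() {
  echo $(( $1 * (512 + 8) ))
}

# Bytes spent on protection information alone for the same block count.
pi_overhead_bytes() {
  echo $(( $1 * 8 ))
}

raw_bytes 1000          # prints 520000
pi_overhead_bytes 1000  # prints 8000
```

The overhead works out to 8/520, a little over 1.5% of each extended sector.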
Example of branded disk:
Code:
# sg_scan -i /dev/sda
/dev/sda: scsi0 channel=0 id=0 lun=0
NETAPP X287_S15K5288A15 NA00 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
Example of unbranded disk:
Code:
# sg_scan -i /dev/sda
/dev/sda: scsi0 channel=0 id=0 lun=0
ATA HUH537060BKD702 0003 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
DIF format command:
Code:
# sg_format -vFs 512 /dev/sda
Since Linux cannot normally see disks with 520-byte sectors, it is safe to try formatting the disk with 512-byte sectors. See the detailed formatting process below.
The drives are running 512-byte sectors as shown by the output below.
Disk /dev/sdb: 5.46 TiB, 6001175126016 bytes, 11721045168 sectors
Disk model: MG04SCA60EE
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: FDE2DAA6-87D5-4C38-A3BD-CADE5652B16A

Device       Start         End     Sectors  Size Type
/dev/sdb1      128     4194304     4194177    2G Linux swap
/dev/sdb2  4194432 11721045134 11716850703  5.5T Solaris /usr & Apple ZFS
That means nothing. Linux will show 512-byte sectors on disks formatted with 520-byte sectors because it does not understand them.
If I run # sg_format -vFs 512 /dev/sda on the SAS drive I am having issues with, this will wipe that drive completely but will leave the other drives intact, correct?
If the command listed below fails, at least you will know where the issue is, but I'm pretty sure you will be successful. First, you need to make sure you format the correct hard drive. Do you know which one you need to fix?
# sg_format -v -F -s 512 /dev/sda
According to TrueNAS it is disk sda, as this is the only disk I am not able to use and it is not a part of the array.
Is it part of a raidz2 array?
Currently running a RAIDZ3 with one drive reporting as failed (the drive in question).
Great stuff, you care about your data. The easiest way to see all relations between your pool disks and actual system disks is by running:
# zpool status
# lsblk -o NAME,PARTUUID,PATH,FSTYPE
Match the pool members listed by zpool status against the lsblk command output. This way, you'll know for sure the correct /dev/sdX name. Obviously, you need to offline the disk, as explained in the OP, prior to formatting.
How do I offline the disk?
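For matching a pool member to its device name, as described above, a rough sketch could look like this. It assumes zpool status lists the member by its PARTUUID (it may show plain names like sdb3 instead, in which case no lookup is needed); `find_path_by_partuuid` is a hypothetical helper name.

```shell
# Sketch only: print the /dev path whose PARTUUID matches the one given.
# Reads text in the `lsblk -o NAME,PARTUUID,PATH,FSTYPE` layout on stdin.
# Whole-disk rows have no PARTUUID, so their second field never matches.
find_path_by_partuuid() {
  awk -v id="$1" '$2 == id { print $3 }'
}

# Typical use on a live system (replace the placeholder with a real PARTUUID):
# lsblk -o NAME,PARTUUID,PATH,FSTYPE | find_path_by_partuuid "<partuuid>"
```

The function only reads text, so it is safe to try before touching any disk.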
When I go into the pool, all I see for that disk is "Details for 3869088571791395513, disk is unavailable". Then it only gives me the option to replace, with no option to offline the drive.
Looks like you already ran the formatting command and your disk is okay now, see the instructions posted by @bonfire62. Ping him here in the thread if you have any questions related to his procedure.
Am I also going to need to run this for every disk?
Definitely not, I mentioned in an earlier post that you only format the affected disks. If Bluefin does not report errors, you are good.
Cool, thanks for the help.
/dev/sdb: scsi0 channel=0 id=12 lun=0 [em]
HGST HSCAC2DA4SUN400G A29A [rmb=0 cmdq=1 pqual=0 pdev=0x0]
Read Capacity results:
   Protection: prot_en=1, p_type=0, p_i_exponent=0 [type 1 protection]
   Logical block provisioning: lbpme=1, lbprz=1
   Last LBA=781422767 (0x2e9390af), Number of logical blocks=781422768
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned LBA=0
Hence:
   Device size: 400088457216 bytes, 381554.1 MiB, 400.09 GB
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:40 with 0 errors on Wed Dec 14 03:45:42 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0

errors: No known data errors
NAME   PARTUUID                             PATH      FSTYPE
sdb                                         /dev/sdb
├─sdb1 88647042-9740-46e7-b48e-ea65e4d4b372 /dev/sdb1
├─sdb2 61de13cc-8960-4a38-8622-d674fe442d99 /dev/sdb2 vfat
├─sdb3 ca179f5a-2cf6-47b1-a880-809f69def2a6 /dev/sdb3 zfs_member
└─sdb4 9f04c168-0750-9948-94f5-48f9feaa7119 /dev/sdb4 zfs_member
/dev/sg0
/dev/sg1
/dev/sg2 /dev/sdai
/dev/sg3 /dev/sds
/dev/sg4 /dev/sdak
/dev/sg5 /dev/sdr
/dev/sg6 /dev/sdv
/dev/sg7 /dev/sdab
/dev/sg8 /dev/sdad
/dev/sg9 /dev/sdy
/dev/sg10 /dev/sdw
/dev/sg11 /dev/sdaa
/dev/sg12 /dev/sdac
/dev/sg13 /dev/sdaj
/dev/sg14 /dev/sdb *
/dev/sg15 /dev/sdc *
/dev/sg16
/dev/sg17 /dev/sdah
/dev/sg18 /dev/sdx
/dev/sg19 /dev/sdh
/dev/sg20 /dev/sdz
/dev/sg21 /dev/sdal
/dev/sg22 /dev/sdk
/dev/sg23 /dev/sdu
/dev/sg24 /dev/sdae
/dev/sg25 /dev/sdag
/dev/sg26 /dev/sdaf
/dev/sg27 /dev/sda
/dev/sg28 /dev/sdd
/dev/sg29 /dev/sde
/dev/sg30 /dev/sdj
/dev/sg31 /dev/sdf
/dev/sg32 /dev/sdp
/dev/sg33 /dev/sdo
/dev/sg34 /dev/sdq
/dev/sg35 /dev/sdl
/dev/sg36 /dev/sdi
/dev/sg37 /dev/sdg
/dev/sg38 /dev/sdm
/dev/sg39 /dev/sdt
/dev/sg40 /dev/sdn
/dev/sg41
Thank you for the quick reply. I will give this a shot tomorrow.
Easy fix @mattyv316. First, take the disk offline in your pool:
View attachment 61244
Then run the formatting procedure for the affected /dev/sdb disk. I strongly recommend using tmux, as detailed in the OP. It takes several hours to format the disk, and if you lose SSH connectivity, the interrupted format process might affect your disk. The safest way is to run the command directly from the Scale console.
Code:
# time sg_format -vFf 0 -s 512 /dev/sdb
Let us know how it went and please post UI screenshots, before and after.
For other people, I know this looks a little scary, but you can confidently run the sg_format command as listed in the Formatting Procedure section, after you have done your troubleshooting. If the disk is not repairable, the format command will tell you where the issue is. Also, only the commands listed in the Formatting Procedure section are relevant.
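On the tmux recommendation above: one way to avoid starting a multi-hour format from a bare SSH shell is to check for the TMUX environment variable, which tmux exports to every shell it starts. This is only a sketch; `in_tmux` is a hypothetical helper name and the session name sgformat is an arbitrary choice.

```shell
# Returns success only when the current shell runs inside a tmux session.
# tmux sets the TMUX variable for shells it launches.
in_tmux() {
  [ -n "${TMUX:-}" ]
}

if in_tmux; then
  echo "inside tmux, a dropped SSH connection will not kill the format"
else
  echo "not inside tmux, start a session first: tmux new -s sgformat"
fi
```

If the connection drops, log back in and reattach with tmux attach -t sgformat. Running from the Scale console does not need any of this.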
# time sg_format -vFf 0 -s 512 /dev/sdb
    HGST HSCAC2DA4SUN400G A29A peripheral_type: disk [0x0]
      PROTECT=1
      << supports protection information>>
      Unit serial number: 001517JQWN8A 0QVBWN8A
      LU name: 5000cca04e159f90
    mode sense(10) cdb: [5a 00 01 00 00 00 00 00 fc 00]
Mode Sense (block descriptor) data, prior to changes:
  Number of blocks=781422768 [0x2e9390b0]
  Block size=512 [0x200]

A FORMAT UNIT will commence in 15 seconds
    ALL data on /dev/sdb will be DESTROYED
        Press control-C to abort

A FORMAT UNIT will commence in 10 seconds
    ALL data on /dev/sdb will be DESTROYED
        Press control-C to abort

A FORMAT UNIT will commence in 5 seconds
    ALL data on /dev/sdb will be DESTROYED
        Press control-C to abort
    Format unit cdb: [04 18 00 00 00 00]

Format unit has started

FORMAT UNIT Complete
sg_format -vFf 0 -s 512 /dev/sdb 0.00s user 0.00s system 0% cpu 1:15.00 total
# zpool status -v boot-pool
  pool: boot-pool
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:00:40 with 0 errors on Wed Dec 14 03:45:42 2022
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          mirror-0  DEGRADED     0     0     0
            sdb3    FAULTED      3   262     0  too many errors
            sdc3    ONLINE       0     0     0

errors: No known data errors
I attempted to format the one /dev/sdb drive
Did you check sg_readcap after the format? It is the only way to see if the disk is fixed. These steps are detailed in the OP, there is no need to wait for instructions on how to check your disk.
If the disk is not fine, please post the command output here. Maybe sg_format does not like to run both actions together. If that's the case, I'll update the guide.
If the disk is fine, reboot the server, then check whether the pool is resilvering with zpool status. From there, treat the issue as a degraded pool, you can start your own thread for that. The goal is to limit this thread to disk formatting issues only.
I apologize for forgetting that step. sg_readcap showed no protection or problems. I was able to replace the drive in my pool, and I thank you for all the help and patience.
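For anyone who wants to script the sg_readcap check mentioned above, here is a minimal sketch. It assumes the long output format shown earlier in the thread (a "Protection: prot_en=..." line and a "Logical block length=..." line); `check_readcap` is a hypothetical helper name, and the two messages are my own wording.

```shell
# check_readcap reads sg_readcap long-format text on stdin and prints
# "ok" when protection is disabled and the logical block length is 512,
# otherwise "needs attention". It never touches the disk itself.
check_readcap() {
  awk '
    /prot_en=1/                      { prot = 1 }
    /Logical block length=512 bytes/ { len = 1 }
    END { if (!prot && len) print "ok"; else print "needs attention" }
  '
}

# Typical use after the format has finished:
# sg_readcap -l /dev/sdb | check_readcap
```

The output posted before the format (prot_en=1, type 1 protection) would come back as "needs attention". Still post the full sg_readcap output here if anything looks off.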
No worries, glad everything is fixed!
Now, my ssd-storage pool is fixed, but boot-pool is still degraded and there is no option for drives when I select replace. Not sure how to do this with a partitioned boot pool.
I suggest starting your own thread for this issue. Ping me in that thread and I will help there.
Wondering if it would be better to just back up all the data, format all the drives, and start over with the pool.
If you want, you can destroy the pool and format all disks in one shot, with tmux. It is all about how long it takes to format a disk. It can be anywhere between a few minutes and a few hours. Honestly, I would start with one disk and see how long it takes. It is a pain to back up, destroy the pool, redo the pool, etc. Plus, during this time your Scale server is not usable.