Error when replacing boot-pool drive

angst911

Dabbler
Joined
Sep 11, 2015
Messages
12
I had a single NVMe SSD as my boot drive and it kept reporting errors, so I procured two new 256GB NVMe SSDs (a different brand, though). I installed one of the new SSDs, added it as a mirror, and waited for resilvering to finish; then I shut down, installed the second drive, and when I go to replace the failed drive I get the following error. Any thoughts?
Error: [EFAULT] sgdisk -n4:0:+16777216K -t4:8200 /dev/nvme2n1 failed: Could not create partition 4 from 34 to 33554465 Could not change partition 4's type code to 8200! Error encountered; not saving changes.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Could you perhaps add some information about your setup? "Me too" does not help anyone help you.

Any thoughts?
Well, sgdisk's error is supremely vague and useless. I assume you're doing this from the GUI?
Let's start by figuring out where we're at. Let's see the output of ls /dev/nvme* and smartctl -x /dev/nvme{n} for every NVMe device that shows up under /dev/.
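Something like this should do it (just a sketch; adjust the device names to whatever actually shows up on your system):

Code:
ls /dev/nvme*
# query every NVMe controller device that shows up (e.g. /dev/nvme0, /dev/nvme1)
for dev in /dev/nvme[0-9]; do smartctl -x "$dev"; done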
 

angst911

Dabbler
Joined
Sep 11, 2015
Messages
12
Could you perhaps add some information about your setup? "Me too" does not help anyone help you.


Well, sgdisk's error is supremely vague and useless. I assume you're doing this from the GUI?
Let's start by figuring out where we're at. Let's see the output of ls /dev/nvme* and smartctl -x /dev/nvme{n} for every NVMe device that shows up under /dev/.
I gave up after getting no response and just reinstalled and imported my configuration. I suggest you try replacing a disk in a mirrored boot volume and see the error for yourself. This sounds like it should have been a bug report, given the "me too" replies.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Possibly, but the error is too vague to make heads or tails of what the actual problem is.
@cloak, the requested info also applies to your case, though perhaps not with NVMe.
 

cloak

Cadet
Joined
Nov 11, 2022
Messages
4
Could you perhaps add some information about your setup? "Me too" does not help anyone help you.


Well, sgdisk's error is supremely vague and useless. I assume you're doing this from the GUI?
Let's start by figuring out where we're at. Let's see the output of ls /dev/nvme* and smartctl -x /dev/nvme{n} for every NVMe device that shows up under /dev/.

Apologies for the vague comment; I was in a rush that evening and had to get out of the house.

I am running TrueNAS-SCALE-22.12.0, and the SSDs in question are 2 x Kingston 120GB A400. This is a fresh install; both drives are brand new and were formatted and secure-erased before being installed into the system. The first SSD is running the OS. Navigating to System > Boot Pool and attaching the second SSD as a redundancy mirror, using the "Attach" feature with "Use all disk space" checked, throws this error:

Code:
 Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 426, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 461, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1284, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1152, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/boot.py", line 102, in attach
    await self.middleware.call('boot.format', dev, format_opts)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1306, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1255, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1284, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/boot_/format.py", line 99, in format
    raise CallError(
middlewared.service_exception.CallError: [EFAULT] sgdisk -n4:0:+16777216K -t4:8200 /dev/sdi failed:
Could not create partition 4 from 34 to 33554465
Could not change partition 4's type code to 8200!
Error encountered; not saving changes.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Alright, what's the output of ls -1 /dev/disk/by-id/?
 

cloak

Cadet
Joined
Nov 11, 2022
Messages
4
Alright, what's the output of ls -1 /dev/disk/by-id/?

Code:
ata-KINGSTON_SA400S37120G_500[XXXX]4
ata-KINGSTON_SA400S37120G_500[XXXX]4-part1
ata-KINGSTON_SA400S37120G_500[XXXX]4-part2
ata-KINGSTON_SA400S37120G_500[XXXX]4-part3
ata-KINGSTON_SA400S37120G_500[XXXX]4-part4
ata-KINGSTON_SA400S37120G_500[XXXX]5
ata-KINGSTON_SA400S37120G_500[XXXX]5-part1
ata-KINGSTON_SA400S37120G_500[XXXX]5-part2
ata-KINGSTON_SA400S37120G_500[XXXX]5-part3
ata-WDC_WD181KFGX-68AFPN0_[XXXX]RHH
ata-WDC_WD181KFGX-68AFPN0_[XXXX]14V
ata-WDC_WD181KFGX-68AFPN0_[XXXX]0NH
ata-WDC_WD181KFGX-68AFPN0_[XXXX]U3H
ata-WDC_WD181KFGX-68AFPN0_[XXXX]UZV
ata-WDC_WD181KFGX-68AFPN0_[XXXX]LVV
ata-WDC_WD181KFGX-68AFPN0_[XXXX]3WV
ata-WDC_WD181KFGX-68AFPN0_[XXXX]ADV
dm-name-sdj4
dm-uuid-CRYPT-PLAIN-sdj4
nvme-Corsair_MP600_PRO_XT_[XXXX]A1
nvme-eui.[XXXX]0f
wwn-0x[XXXX]d6ff
wwn-0x[XXXX]f42a
wwn-0x[XXXX]17e4
wwn-0x[XXXX]18ae
wwn-0x[XXXX]776c
wwn-0x[XXXX]2421
wwn-0x[XXXX]0f57
wwn-0x[XXXX]c96d
wwn-0x[XXXX]39a4
wwn-0x[XXXX]39a4-part1
wwn-0x[XXXX]39a4-part2
wwn-0x[XXXX]39a4-part3
wwn-0x[XXXX]39a4-part4
wwn-0x[XXXX]3a05
wwn-0x[XXXX]3a05-part1
wwn-0x[XXXX]3a05-part2
wwn-0x[XXXX]3a05-part3


[mod note: some data obscured -JG]
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Ok, so it did create partitions 1-3. Let's see the output of sgdisk -p /dev/disk/by-id/$DISK_ID for the two ata-KINGSTON disks. Use tab to autocomplete those and preserve your sanity, or run ls -l /dev/disk/by-id/ (lowercase letter L instead of the number 1) to see the corresponding /dev/sdX device names.
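For example (the serial here is a placeholder, not a real ID):

Code:
# see which /dev/sdX each by-id symlink points at
ls -l /dev/disk/by-id/ | grep KINGSTON
# then print the partition table for each of the two boot SSDs
sgdisk -p /dev/disk/by-id/ata-KINGSTON_SA400S37120G_<serial>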
 

cloak

Cadet
Joined
Nov 11, 2022
Messages
4
Just an update.

Since it is a fresh install, I went through the installation media again and selected both of the SSDs for the redundancy mirror.
Part of the log before extracting:

Code:
Warning: Partition table header claims that the size of partition table 
entries is 0 bytes, but this program supports only 128-byte entries. 
Adjusting accordingly, but partition table may be garbage.

I checked the partitions with fdisk -l, and both SSDs' partitions look identical, with nothing strange in place as far as I can tell:

Code:
Disk /dev/sdi: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SA400S3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: X-X-X-X-X

Device        Start       End   Sectors  Size Type
/dev/sdi1      4096      6143      2048    1M BIOS boot
/dev/sdi2      6144   1054719   1048576  512M EFI System
/dev/sdi3  34609152 234441614 199832463 95.3G Solaris /usr & Apple ZFS
/dev/sdi4   1054720  34609151  33554432   16G Linux swap

Partition table entries are not in disk order.


Disk /dev/sdj: 111.79 GiB, 120034123776 bytes, 234441648 sectors
Disk model: KINGSTON SA400S3
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: X-X-X-X-X

Device        Start       End   Sectors  Size Type
/dev/sdj1      4096      6143      2048    1M BIOS boot
/dev/sdj2      6144   1054719   1048576  512M EFI System
/dev/sdj3  34609152 234441614 199832463 95.3G Solaris /usr & Apple ZFS
/dev/sdj4   1054720  34609151  33554432   16G Linux swap

Partition table entries are not in disk order.
 

anaxis

Cadet
Joined
Jan 11, 2023
Messages
4
Hi,

I encountered the same issue: I had a boot drive failure and tried to replace the disk.

1. Removed the faulty disk
2. Replaced it with an identical one (same size/type/vendor, etc.)
3. Quick-wiped the new disk from the disk menu
4. Navigated to the boot pool status and chose the replace disk option. Picked the new disk.
Error:

Code:
Error: [EFAULT] sgdisk -n4:0:+16777216K -t4:8200 /dev/sdi failed: Could not create partition 4 from 34 to 33554465 Could not change partition 4's type code to 8200! Error encountered; not saving changes.


So far, here is what I tried.

Info after fresh wipe:

Code:
root@gs-truenas[~]# sgdisk -p /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries in memory.
Disk /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9: 234441648 sectors, 111.8 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): FF8D3C28-9C77-47E8-B46D-F6C9C2D995D5
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 234441581 sectors (111.8 GiB)

Number  Start (sector)    End (sector)  Size       Code  Name


After the error:

Code:
root@gs-truenas[~]# sgdisk -p /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9
Disk /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9: 234441648 sectors, 111.8 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 2510F735-0C8D-4E00-B8BA-3DACD0035AB8
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 8-sector boundaries
Total free space is 6 sectors (3.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              40            2087   1024.0 KiB  EF02 
   2            2088         1050663   512.0 MiB   EF00 
   3         1050664       234441614   111.3 GiB   BF01 
root@gs-truenas[~]# 


Interestingly, in that output partitions 1-3 already cover the whole disk (only 6 sectors free), so there is no room left for the 16 GiB swap partition, which is presumably why sgdisk fails. I did a quick wipe again and then executed, in the shell, the same command that gives the error:

Code:
root@gs-truenas[~]# sgdisk -n4:0:+16777216K -t4:8200 /dev/sdi             
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Warning: Partition table header claims that the size of partition table
entries is 0 bytes, but this program  supports only 128-byte entries.
Adjusting accordingly, but partition table may be garbage.
Creating new GPT entries in memory.
The operation has completed successfully.
root@gs-truenas[~]#


Code:
root@gs-truenas[~]# sgdisk -p /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9
Disk /dev/disk/by-id/ata-ADATA_SU650_4M0223CFABF9: 234441648 sectors, 111.8 GiB
Sector size (logical/physical): 512/512 bytes
Disk identifier (GUID): 04EF577B-CD62-481A-9B74-8CC9D681B20B
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 234441614
Partitions will be aligned on 2048-sector boundaries
Total free space is 200887149 sectors (95.8 GiB)


Number  Start (sector)    End (sector)  Size       Code  Name
   4            2048        33556479   16.0 GiB    8200 
root@gs-truenas[~]#


After this, when I try to replace the disk, the same error happens and the same partition table is created, etc.
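Next I'm thinking of completely zapping the GPT and MBR structures before retrying, in case the "entry size 0 bytes" warning means there is leftover metadata (untested so far; /dev/sdi is my new disk, adjust to yours):

Code:
# destroy the GPT and protective MBR structures entirely
sgdisk --zap-all /dev/sdi
# remove any remaining filesystem/partition signatures
wipefs -a /dev/sdi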

Any help is appreciated.

Version:
TrueNAS-SCALE-22.12.0
 

dashtesla

Explorer
Joined
Mar 8, 2019
Messages
75
Literally the same problem. I had one of my boot drives throwing lots of errors, so I decided to replace it as a precaution. I also had an unrelated pool drive fail on me, which I likewise had to replace, so that one is currently resilvering. I'm not sure whether that affects the boot drive's ability to resilver, but technically it shouldn't.

I'm not sure what to do here, as the boot-pool just shows degraded now and the option to replace the disk is missing, as if the disk is no longer suitable for replacement. If I go to Disks, it shows as a member of boot-pool, but the pool is still in a degraded state.
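From the shell the pool state can at least still be checked directly, even with the GUI option missing:

Code:
zpool status -v boot-pool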

Version:
TrueNAS-SCALE-22.12.0

Disk that's been 'replaced':

Code:
Disk /dev/sdb: 136.73 GiB, 146815737856 bytes, 286749488 sectors
Disk model: DG146BB976
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: C3C6B411-135E-4B1D-BB6B-13DD4B9A54A7

Device       Start       End   Sectors   Size Type
/dev/sdb1       40      2087      2048     1M BIOS boot
/dev/sdb2     2088   1050663   1048576   512M EFI System
/dev/sdb3  1050664 286749454 285698791 136.2G Solaris /usr & Apple ZFS
 

dashtesla

Explorer
Joined
Mar 8, 2019
Messages
75
After a reboot the disk went back to being unassigned; I wiped it, tried again, and I get this:

Replacing Boot Pool Disk

Error: [EFAULT] sgdisk -n4:0:+16777216K -t4:8200 /dev/sdb failed: Could not create partition 4 from 34 to 33554465 Could not change partition 4's type code to 8200! Error encountered; not saving changes.
 

dashtesla

Explorer
Joined
Mar 8, 2019
Messages
75
Another update: I had another disk throw thousands of errors, and the disk I had replaced was showing "healthy", but it seems ZFS hadn't actually finished the rebuild. Being a Z1, with nothing apparently resilvering, I just replaced the other drive, which caused data loss. I put the allegedly faulty drive back, and now I'm dumping all the data to an external hard drive and am going to reinstall the entire system again.

Oddly enough, the SAS drives that were getting errors were some old 146GB Seagate/HP units. Being old enough, I thought they could be faulty, but I now suspect none of the drives have any actual faults; I will check them all later on Windows with an HBA, one by one. I do have tape backups, and only a few files were newer than the latest backup, but this could be a sign of major instability with TrueNAS, since all the hardware used is as mainstream as it gets. The real oddity is that I have three pools plus the boot pool. The pool that was getting errors is the largest one, with a mix of Seagate Exos/IronWolf Pro SATA drives; I also have a Z1 with 3 x WD Red SATA and another with 3 x SAS drives, and both had no errors. That pretty much rules out a faulty SAS controller, since none of the other pools were having any issues, and the data can still be read just fine even from the faulty pool.

I'm going to replace the drives and set things up a little differently this time, but I do think this is a serious issue that could potentially cause data loss for some people, as the information being presented can be misleading.
 

anaxis

Cadet
Joined
Jan 11, 2023
Messages
4
Because the other boot drive started giving write errors, I could not wait any longer and risked doing it my way.

WARNING: ONLY DO THIS IF YOU KNOW WHAT YOU ARE DOING!

These pools are, in the end, just zpools, so I tried to replace the disk using the shell.

Step 0: SSH to the server.

Step 1: Find the faulty disk that you want to replace. The disk name and pool name are what you're looking for. In my case: 13433413234999191087, boot-pool.

Code:
root@gs-truenas[~]# zpool status
  pool: boot-pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:03:30 with 0 errors on Thu Jan 12 13:35:56 2023
config:


        NAME                      STATE     READ WRITE CKSUM
        boot-pool                 DEGRADED     0     0     0
          mirror-0                DEGRADED     0    20     0
            sdh3                  DEGRADED     0    23     0  too many errors
            13433413234999191087  UNAVAIL      0     0     0  was /dev/sdj3


errors: No known data errors


Step 2: Check the name of the disk that you want to use as the replacement: WebGUI -> Storage -> Disks.

In my case it was sdi. I did a quick wipe, just in case.
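You can also cross-check the device name from the shell:

Code:
# list block devices with size, model and serial to identify the new disk
lsblk -o NAME,SIZE,MODEL,SERIAL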

Step 3: Execute this command; it will perform the replacement. AGAIN I WARN YOU: ONLY DO THIS IF YOU HAVE NOTHING TO LOSE, AND MAKE A BACKUP!

Code:
zpool replace boot-pool 13433413234999191087 sdi


This should replace the disk and start a resilver process.
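You can watch the resilver progress from the shell if you like:

Code:
# re-run zpool status every 5 seconds until the resilver completes
watch -n 5 zpool status boot-pool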

Hope it helps.

Here is the result. Still degraded, because my other disk is also failing (cheap SSDs, man...).

Code:
root@gs-truenas[~]# zpool status
  pool: boot-pool
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 8.25G in 00:02:24 with 0 errors on Fri Jan 13 18:36:02 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   DEGRADED     0     0     0
          mirror-0  DEGRADED     0    20     0
            sdh3    DEGRADED     0    23     0  too many errors
            sdi     ONLINE       0     0     0

errors: No known data errors
 

dashtesla

Explorer
Joined
Mar 8, 2019
Messages
75
@anaxis Doing it manually is obviously an option for everything, but the reason we come here is to report issues and make sure they're fixed. I have done the same myself in the past when things weren't working as expected with CORE, and yes, you can't have a fault and wait until the next patch to rebuild an array. I've been using SCALE for a while now, so I'm hoping I won't have any more issues, though as I always say, keep some tape backups in case everything fails; it also saves time not having to move all the data out and then back.

I decided it would be quicker and cleaner to start over, which also gave me the chance to change all the drives and the layout. I replaced all the Seagate drives with 9 x 12TB WD Red and am moving all the data back in; it's now Z2 instead of Z1. I also imported the 3 x 10TB WD Red that had no issues from the other install. No issues of any kind, other than having to redo all the ACLs, which is expected.

For the boot drives I went with some newer 2 x 300GB 10k SAS disks, as I didn't see a reason to waste SSDs on boot; no errors of any kind now, and the data is moving back just fine. I also did something different this time: I previously used a Hyper-V gen 1 VM with drive passthrough just to get TrueNAS installed on the drives, and this time I used the Dell iDRAC IPMI to install directly. I don't think it makes any difference, but it helps, since I can't get a USB drive to load the installer on the Dell R510.

Oh, and I forgot to mention I also had the same experience as you: I was unable to get the boot-pool back to healthy no matter what I did. I didn't try to do it manually because, like I said, I wanted to just start over, and in my case that was easy enough; it took me 30 minutes. As for the "bad" drives/SSDs, I would try those on Windows and check the SMART status with HD Sentinel/CrystalDiskMark, as I suspect it might be a false fault. I personally had some mixed drive types, and now they're all the same type; I'm not sure if things being out of sync between one drive and another could have caused errors, even though the boot-pool had exactly the same ancient 146GB SAS drives, which are 15 years old at this point but still had no SMART faults (checked with HD Sentinel in Windows using an HBA).
 

anaxis

Cadet
Joined
Jan 11, 2023
Messages
4
@dashtesla
To update you: I tried to replace my other faulty drive, and it turns out my earlier solution was not working. Because this is a boot disk, TrueNAS makes other modifications to the disk that zpool replace does not. In the end, you can't successfully replace a boot drive with zpool replace alone. I replaced the faulty disk and the system could not boot (no GRUB on the resilvered disk, no partitions, etc.).
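For reference, the steps that are presumably missing look roughly like this. This is an untested sketch: it assumes sda is the surviving boot disk, sdb is the new disk, and the standard SCALE layout with the ZFS member on partition 3 and the EFI system partition on partition 2. Verify all device names before running anything.

Code:
# UNTESTED sketch -- verify device names first
# clone the partition layout from the healthy boot disk (sda) onto the new one (sdb)
sgdisk -R /dev/sdb /dev/sda
# randomize the GUIDs so the two disks don't clash
sgdisk -G /dev/sdb
# replace the failed member with the new disk's ZFS partition;
# <old-device-guid> is the numeric id shown by zpool status for the failed member
zpool replace boot-pool <old-device-guid> /dev/sdb3
# copy the EFI system partition so the new disk can actually boot
dd if=/dev/sda2 of=/dev/sdb2 bs=1M
# (a BIOS install may also need the GRUB core image on partition 1, e.g. via grub-install)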

So I raised a bug ticket; I hope it will be solved soon.

NAS-119880
 

dashtesla

Explorer
Joined
Mar 8, 2019
Messages
75
@dashtesla
To update you: I tried to replace my other faulty drive, and it turns out my earlier solution was not working. Because this is a boot disk, TrueNAS makes other modifications to the disk that zpool replace does not. In the end, you can't successfully replace a boot drive with zpool replace alone. I replaced the faulty disk and the system could not boot (no GRUB on the resilvered disk, no partitions, etc.).

So I raised a bug ticket; I hope it will be solved soon.

NAS-119880
I also don't see why the boot pool doesn't show up in the Storage menu; you literally have to go to System Settings and find it hidden away there, as if the system is trying to hide the fact that it has a boot pool.

I wish they would change that and give similar options to view/scrub/replace, and also let people use some of the spare space there if they want. I could easily have 2 x 5TB drives for boot plus extra mirrored storage and use them for something other than just boot, without compromising the ability to boot off the same drives, even if the filesystem has to be different (I would imagine the boot pool would just be an mdadm mirror if it used EXT4, like standard Debian distros do), but TrueNAS seems to want ZFS even for boot. One can always have mirrored partitions that won't interfere with the boot-pool.

I also feel like a hardware RAID controller would help keep the boot pool from going wrong. Despite not protecting against data rot, it could be an option for those who just want to quickly swap a faulty drive without any added hassle, while keeping ZFS for the storage pools.

But hopefully they'll sort all this out. I did get one error today on my machine, but it's just one ZFS error that was corrected by the filesystem, which is fine by me; the pool is now Online (Unhealthy) pending a scrub. It really just wants me to scrub for good measure, but I haven't even finished moving all the data back from tapes and other backups.
 