TrueNAS stopped working after upgrade because it does stuff without asking.

Joined
Oct 4, 2022
Messages
4
Hi,
I have a TrueNAS Core server with a Realtek 2.5G NIC that was happily running on 13.0-U1.

Last night I decided to upgrade TrueNAS to 13.0-U2, but thanks to the fantastic decision to disable NICs without telling anyone (other than a mention in the changelog), I can no longer reach my server.

So, basically, my server has been "killed" by the TrueNAS team, who deliberately decided to push an update that, as a "bug fix", disabled hardware that may be required for the system to operate.
Was it so hard to add a notification saying "consider disabling your 2.5G Realtek if you're using iSCSI shares, as it may cause issues", add a check to see if that NIC was the only NIC available, or, failing that, a check to see whether there's at least one iSCSI share running on the server? Doesn't seem hard to me.

I'm quite new to TrueNAS, so I would like to ask you two things:
- Is there any solution to this big-brain idea? My server is headless, has only that NIC, and is a few hundred kilometers from my current location.
- Does the TrueNAS team usually work like this? I was interested in Scale and TrueNAS products, but hell, if this is their usual workflow I'm definitely going to pack up soon.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
So, basically, my server has been "killed" by the TrueNAS team, who deliberately decided to push an update that, as a "bug fix", disabled hardware that may be required for the system to operate.

This is an unfair characterization. The 2.5G Realtek stuff has been dodgy since the start, and they actually did some "special" work to make it function in the first place. It's not that difficult to imagine that someone inadvertently upgraded the tree with current FreeBSD drivers that didn't support it, or some other mild source code calamity occurred.

Was it so hard to add a notification saying "consider disabling your 2.5G Realtek if you're using iSCSI shares, as it may cause issues", add a check to see if that NIC was the only NIC available, or, failing that, a check to see whether there's at least one iSCSI share running on the server? Doesn't seem hard to me.

By similar reasoning, why is it so hard for end users to check the recommended hardware guides, the excellent 10 Gig Networking Primer, and select from hardware that has been known to work rock solid well for years and years? Why would a user go out and buy a crappy dodgy 2.5G *REALTEK* adapter, of all things, when there's a strong history of problems with Realtek easily demonstrated by a search of the forum, and then get all bent out of shape when it turned into a misadventure? Doesn't seem hard to me.

- Is there any solution to this big-brain idea? My server is headless, has only that NIC, and is a few hundred kilometers from my current location.

Some of us manage systems that are about 2500 miles away (~4000 kilometers), so I feel your pain. It's inconvenient and expensive to go and visit a data center to remediate what is essentially a stupid configuration issue. I certainly wouldn't want to, but, on the bright side, yours is maybe a day trip (two or three hours' drive each way, an hour onsite).

I'm going on nearly six years since my last visit to our Virginia data center, so let me offer some "pro tips" for remote colo.

1) Be sure to have a private management network. Using a small device -- we prefer the Ubiquiti EdgeRouter-X due to its $50 cost and tiny "tape it to the side of the rack" footprint -- you can set up an OpenVPN server to connect into the private management network. You should hook your IPMI up to the management network, and ideally also one of the onboard 1Gbps ethernet ports common on most server mainboards. If your board doesn't have a spare onboard port, *add* an Intel Desktop CT card to your server and make sure it is configured on the management network. This gives you a great backdoor in when the primary network path fails. (A minimal OpenVPN config sketch follows after item 3.)

2) Consider having a Raspberry Pi or similar low-power compute device act as a serial console server. This gives you another path into the system console when something goes awry, and you can get right at the boot loader to blacklist the failed driver and/or choose a different boot environment. (A ser2net sketch for this also follows after item 3.)

3) If your stuff is important, you should be testing prior to deployment on your local lab system(s). This helps educate you about what the outcome of updates/upgrades is likely to be prior to pushing it out to the location where remediation becomes a PITA.
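
To illustrate tip 1: a minimal OpenVPN server configuration for a management VPN might look roughly like the sketch below. This is only a sketch; the subnets, certificate file names, and the pushed management-LAN route are made-up examples to adapt to your own setup (on an EdgeRouter-X the same parameters go through its own configuration system rather than a flat server.conf).

# server.conf (sketch) -- management VPN endpoint
port 1194
proto udp
dev tun
# PKI material generated beforehand (file names are examples)
ca ca.crt
cert server.crt
key server.key
dh dh.pem
# address pool handed to VPN clients (example subnet)
server 10.99.0.0 255.255.255.0
# route VPN clients to the hypothetical management LAN where IPMI lives
push "route 192.168.100.0 255.255.255.0"
keepalive 10 120
persist-key
persist-tun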
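
To illustrate tip 2: on a Raspberry Pi, one simple way to expose a USB serial adapter as a network console is ser2net. In the classic /etc/ser2net.conf format (newer ser2net releases use a YAML configuration instead), a single line along these lines would do it; the TCP port and device path are examples:

# TCP port 3001 -> /dev/ttyUSB0 at 115200 8N1, telnet protocol, 600 s idle timeout
3001:telnet:600:/dev/ttyUSB0:115200 8DATABITS NONE 1STOPBIT banner

From a machine on the management VPN you would then "telnet <pi-address> 3001" and land on the NAS serial console.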

- Does the TrueNAS team usually work like this? I was interested in Scale and TrueNAS products, but hell, if this is their usual workflow I'm definitely going to pack up soon.

I'm going to remind you that iXsystems is graciously allowing you to act as a beta-tester for their NAS product. iXsystems sells TrueNAS Enterprise to many customers. They have also chosen a strategy of allowing users access to a slightly feature-reduced free version of their software. "free" does not mean "equal". It means that you are acting as their beta-tester, looking to shake bugs out of the system. They are going to be mainly focused on the hardware platform that they sell, and not consumer-grade 2.5G Realtek-based systems. To insinuate that this is some sort of plot to sabotage you is unfair and probably insulting. They cannot possibly test every combination of hardware that people are using for TrueNAS Core/Scale. They're a small team supporting two very different underlying operating systems underneath a highly complicated product. Their paychecks depend on happy users discovering the free products and upgrading to the Enterprise product. There is no motivation for them to deliberately sabotage these things.
 
Joined
Oct 4, 2022
Messages
4
This is an unfair characterization. The 2.5G Realtek stuff has been dodgy since the start, and they actually did some "special" work to make it function in the first place. It's not that difficult to imagine that someone inadvertently upgraded the tree with current FreeBSD drivers that didn't support it, or some other mild source code calamity occurred.
Well, when such things happen, an alert is not an option, it's a must.
Especially considering that it's mentioned in the changelog, so the team was clearly aware of the issue.

By similar reasoning, why is it so hard for end users to check the recommended hardware guides, the excellent 10 Gig Networking Primer, and select from hardware that has been known to work rock solid well for years and years? Why would a user go out and buy a crappy dodgy 2.5G *REALTEK* adapter, of all things, when there's a strong history of problems with Realtek easily demonstrated by a search of the forum, and then get all bent out of shape when it turned into a misadventure? Doesn't seem hard to me.
Because maybe the NIC was reused and wasn't a new purchase.
Anyway, that doesn't seem like a good reason not to add an alert for this kind of issue.

stupid configuration issue
Exactly what pisses me off the most.

3) If your stuff is important, you should be testing prior to deployment on your local lab system(s). This helps educate you about what the outcome of updates/upgrades is likely to be prior to pushing it out to the location where remediation becomes a PITA.
I just put too much trust in TrueNAS Core's stability and in the fact that it was U2 (so meant for larger systems, according to the documentation, but correct me if I'm wrong).
I didn't expect such choices in a "production ready" product, and I didn't put much effort into testing before upgrading, as I have a hot-DR plan to kick in for cases like this one.
Surely next time I'll read the changelog instead of trusting whoever is behind the software, since questionable choices can be made, and not checking them was surely my fault; I violated the "expect nothing from others" rule, because I expected something.

To insinuate that this is some sort of plot to sabotage you is unfair and probably insulting
I said something different, btw.

They cannot possibly test every combination of hardware that people are using for TrueNAS Core/Scale
They knew perfectly well that they were blocking some NICs from their systems, and they didn't add any check for whether it was needed, or any alert about this choice; that's not really acceptable to me. I'm okay with blocking stuff that causes issues, but you tell me first, you don't kill my server at random.

"With the lack of time for a fix on a planned 13.0-U2 freeze day we decided to re-disable the vendor driver to avoid the data corruptions."
Sorry, but I read this as "since we can't fix this, we just disabled the whole thing without considering any side effects or issues it may cause, or how those issues may affect our customers", and again, that's not acceptable to me, because tomorrow it may happen with a CPU/RAM/disk vendor.

So, if that is the standard TrueNAS approach, I would just prefer to step out before it's too late.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, when such things happen, an alert is not an option, it's a must.
Especially considering that it's mentioned in the changelog, so the team was clearly aware of the issue.

It was listed in the changelog? That IS your alert. You are actually expected to read the changelog before applying an update. This has been standard UNIX (both BSD and Linux) practice for many years.

Because maybe the NIC was reused and wasn't a new purchase.

That's not really the fault of the developers, now, is it. You chose to use a poorly supported device. There is no guarantee that TrueNAS will work on whatever random hardware you happen to have available. There's a "guarantee" of sorts that it will work on the recommended hardware because the recommended hardware list closely echoes what iXsystems sells for TrueNAS hardware, which is typically well supported, because it has to be.

Anyway, that doesn't seem like a good reason not to add an alert for this kind of issue.

You just blew past the "alert" (in the form of reading the changelog) and installed the update anyways. The developers generally do not have the resources to be adding alerts and safety belts for every possible failure mode. At a certain point, you're expected to take responsibility for reading the changelog. In fairness, this is pretty clearly stated:

Due to a bug with an upstream networking driver causing data corruption issues with iSCSI sharing configurations, 2.5GigE Realtek NICs are unsupported in 13.0-U2 by default. Warning: at a risk of data corruption, especially if the system is used for iSCSI sharing, the offending driver can be manually loaded. See the Known Issues entry for NAS-117663 for more details and the workaround.

So it's not like there was no warning, and it seems like there was significant reason to disable the driver -- everyone in the ZFS community generally considers corruption to be a significant crisis issue. Removing a broken driver until upstream resolves the problem is not a particularly unusual solution in the industry.

Sorry, but I read this as "since we can't fix this, we just disabled the whole thing without considering any side effects or issues it may cause, or how those issues may affect our customers", and again, that's not acceptable to me, because tomorrow it may happen with a CPU/RAM/disk vendor.

And just what would you have suggested? Any pre-update analysis to identify this problem would need to be running in the context of the version of TrueNAS you are currently running. That's not entirely impossible, but in the eleven years that this project has been around, I don't recall another case like this. There isn't any infrastructure built into the system to do such checks. You could probably suggest it as a Jira feature request, but I suspect it would be a bit tricky to generalize so that it was actually meaningful.

I said something different, btw.

Well, yes. That's why I said "insinuate".

They knew perfectly well that they were blocking some NICs from their systems, and they didn't add any check for whether it was needed, or any alert about this choice; that's not really acceptable to me.

It's unclear how they could have done this, since they would have needed to have a code mechanism already present on your system in order to perform such a check. Instead, they relied on the mechanism that sits between chair and keyboard to read the changelog and the conspicuously posted warning. I'm fine if you don't want to accept responsibility for failing to do that, but let's not blame the lack of hypothetical solutions in code.
 
Joined
Oct 4, 2022
Messages
4
TL;DR: If my car has a fault, I would like the car to inform me, rather than disable the doors and lock me out.

It was listed in the changelog? That IS your alert. You are actually expected to read the changelog before applying an update. This has been standard UNIX (both BSD and Linux) practice for many years.

I'm used to putting alerts in the place where they are useful.
Think of emergency exits without any signs in the hallways: they're not really useful if you don't know the whole place by heart, right?
At this point, if this is the approach, I would recommend adding a reference/iframe to the changelog page in the update page.

There is no guarantee that TrueNAS will work on whatever random hardware you happen to have available.
It was working, and it had no reason to stop working in my case.

and it seems like there was significant reason to disable the driver
Not in my case, as I wasn't affected.
And considering it's a "pro" product, not an "average user" product, the system should ask first.

Any pre-update analysis to identify this problem would need to be running in the context of the version of TrueNAS you are currently running.
Asking before doing stuff is always the best thing.
After the upgrade, if there were iSCSI shares, the system could have sent a critical alert saying "Consider disabling your Realtek NIC because it causes issues with iSCSI shares".
I would like to note that Debian even asks you if you want to restart services after an update, because maybe you don't want to.

You don't want to ask because the answer may arrive too late? Well, then at least check whether what is going to be disabled is actually needed.

It's unclear how they could have done this, since they would have needed to have a code mechanism already present on your system in order to perform such a check.
A simple post-update script would have accomplished the task perfectly (a rough sketch follows the list below):
- Verify that the system has been upgraded
- Check if there's a Realtek 2.5G NIC on the system
- Check if there are iSCSI shares running
- Eventually disable the NIC, if asking was too bad an option.
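
Just to illustrate the idea, a rough and untested sketch (it assumes the RTL8125 shows up in FreeBSD's pciconf output and that "iSCSI shares running" means the ctld service is active):

#!/bin/sh
# Rough post-update check sketch -- illustrative only, not an official script

# 1) Is a Realtek 2.5G (RTL8125) NIC present?
pciconf -lv | grep -qi "RTL8125" && have_realtek=1

# 2) Is the iSCSI target daemon (ctld) running?
service ctld onestatus >/dev/null 2>&1 && have_iscsi=1

# 3) If both are true, raise a loud alert instead of silently disabling the NIC
if [ "${have_realtek:-0}" -eq 1 ] && [ "${have_iscsi:-0}" -eq 1 ]; then
    logger -t post-update-check \
        "WARNING: 2.5G Realtek NIC plus iSCSI detected after upgrade; see NAS-117663"
fi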

Instead, they relied on the mechanism that sits between chair and keyboard to read the changelog and the conspicuously posted warning.
Relying on the weakest link in the chain doesn't seem like a great idea.

I'm fine if you don't want to accept responsibility for failing to do that
I've already accepted my responsibility for that:
not checking them was surely my fault
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
At this point, if this is the approach, I would recommend adding a reference/iframe to the changelog page in the update page.

You're free to make a feature request in Jira. However, the developers tend to be a bit minimalist, and there is already a warning to RTFM:

[rtfm.png -- screenshot of the existing warning]

Not in my case, as I wasn't affected.
And considering it's a "pro" product, not an "average user" product, the system should ask first.

It can be complicated to predict who is going to be affected by changes.

As for "pro" product, I routinely work on high end networking gear and other pricey gear that can easily be ruined if you don't actually follow the instructions. "Pro" gear often skews towards the arcane (and occasionally incomplete) instructions for doing updates and firmware installs; it isn't unusual to have to read half a dozen documents to understand the eventual end result. I would expect "average user" products to have the sort of warnings you envision, because they are the ones unlikely to read the documentation before hitting "upgrade". Yet it still came as a bit of a shock to people some years back when iOS, a consumer/retail phone OS with a huge install base, dropped support for 32 bit apps and effectively broke a lot of older apps. No prompting before that happened, just lots of news coverage.

My experience is basically that "pro" products are unlikely to provide this sort of handholding. ESXi, for example, is well known for quietly allowing you to continue using outdated firmware on controller cards after an update, relying on you to manually revalidate your compatibility with the VMware HCL. This can lead to all sorts of fascinating problems and corruption. It would be nicer if it'd at least refuse to work with old firmware. This has been so tragic for so long that they finally introduced vSphere Lifecycle Manager to help out.

A simple post-update script would have accomplished the task perfectly.
- Verify that the system has been upgraded
- Check if there's a Realtek 2.5G NIC on the system
- Check if there are iSCSI shares running
- Eventually disable the NIC, if asking was too bad an option.

That's a lot of complexity, and the "eventually disable the NIC" bit is a total POLA violation (POLA being the Principle of Least Astonishment). Again, I understand why you're annoyed in your particular case, but the developers spending lots of time on edge case issues is simply not going to happen. They're fighting a battle against competing products in a hypercompetitive market with a small team and a lot of demands on their limited resources. The "free" in FreeNAS doesn't really mean that there's no cost in running the product. There's still an expectation that you're going to be sufficiently invested to buy the correct hardware and to put in the sysadmin time to manage it.

You are of course welcome to write such a post-update script and submit it for consideration.

Asking before doing stuff is always the best thing.
After the upgrade, if there were iSCSI shares, the system could have sent a critical alert saying "Consider disabling your Realtek NIC because it causes issues with iSCSI shares".
I would like to note that Debian even asks you if you want to restart services after an update, because maybe you don't want to.

I think it is clear that it was a last minute decision to address an unexpected and previously undetected problem, and they had no good path forward since there's really no mechanism to do the sort of scripted pre-update analysis that what you're suggesting would require. It's not that it's not a good idea -- it is, I'm sure we can agree -- but it is a thing in software engineering where you can just get painted into a corner and then address a problem through an errata notice (that's a bit historical) or a hotfix. I would personally prefer that software not go out the door with this sort of regression, but it happens. You seem to be able to discuss this from a user's perspective, so, once again, I don't think it would be a bad idea for you to submit a Jira ticket. It could lead to some sort of pre-update check system being designed, a feature that appears to be absent at the moment. Past issue resolutions suggest that they really do want to be able to "get it right" for these edge case issues, and a lot of time has been poured into stuff like making network configuration relatively bulletproof. But it's also a matter of developer time.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
And considering it's a "pro" product, not an "average user" product, the system should ask first.
Considering it's a "pro" product, not an "average user" product, you should be using suitable hardware. And if you're managing a device that's hundreds of km away, surely "suitable hardware" would include remote monitoring and administration. It's not like solid hardware recommendations haven't been thoroughly documented here for several years.

Or does this only go one way?
 
Joined
Oct 4, 2022
Messages
4
That's a lot of complexity, and the "eventually disable the NIC" bit is a total POLA violation
That's not a POLA violation, as it's a decision that is not up to the script; it's up to the team.
If they decide that disabling the NIC is the right choice when the conditions for problems exist, then the script will disable the NIC; otherwise, if they decide that an alert is enough, the script won't disable anything and will just send an alert.

You are of course welcome to write such a post-update script and submit it for consideration.
Since you're a bit more experienced with TrueNAS than me: is there something like a map of how the software is structured, so that I know where to put my hands?

I think it is clear that it was a last minute decision to address an unexpected and previously undetected problem
In my opinion it would be better to delay releases rather than do this kind of thing, especially on a U2 release that is supposed to be ready for "larger systems".

You seem to be able to discuss this from a user's perspective, so, once again, I don't think it would be a bad idea for you to submit a Jira ticket. It could lead to some sort of pre-update check system being designed, a feature that appears to be absent at the moment. Past issue resolutions suggest that they really do want to be able to "get it right" for these edge case issues, and a lot of time has been poured into stuff like making network configuration relatively bulletproof.
I'll surely do that.
 


jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's not a POLA violation, as it's a decision that is not up to the script; it's up to the team.
If they decide that disabling the NIC is the right choice when the conditions for problems exist, then the script will disable the NIC; otherwise, if they decide that an alert is enough, the script won't disable anything and will just send an alert.

This is just setting up for bigbraindecisions2 ("bbd2"), the next person in your shoes, to wander into that minefield. So let's say that bbd2 has a pair of filers, for redundancy, with one built with 10G Chelsio, and the other one built more cheaply with 2.5G Realtek. iSCSI in both cases. So bbd2 upgrades the Chelsio box and is loving it. Updates the Realtek box and it too comes up, but then shortly thereafter the "eventually disable the NIC" step causes loss of connectivity. I'd call that a POLA violation. The underlying issue here is that it actually introduces an artificial dependency between iSCSI and the Realtek; some people will have a test setup with the 2.5G Realtek but without iSCSI, and there we have another discontinuity when the upgrade passes muster on the local lab Realtek box but gets locked out in the field. I'd also call this a POLA violation. There are likely other situations as well.

The safer thing might be to do a controlled rollback to the previous boot environment, and I can picture that as having some advantages over "eventually disable the NIC". However, it still introduces an unexpected restart. I just don't see a particularly good way to handle the situation. It's probably easy to make it work for certain scenarios, like yours, but at the cost of complicating other scenarios.

Since you're a bit more experienced with TrueNAS than me: is there something like a map of how the software is structured, so that I know where to put my hands?

I'm not sure there's a map or block diagram, but the code is on GitHub.

The complex bit here seems to be that there could be a bit of creature feep (feature creep if you're unfamiliar). So on one hand, the system you want to design could have a hook prior to the tarball extraction phase of install. You're running on the old OS, and you just need a generic hook that can call a script that is included in the distribution tarball with the new OS. It can do a sanity check of the system and then return problem results, green, yellow, or red, such as "yellow - system only has 8GB RAM, 16GB RAM is now strongly recommended", or "red - system has 2.5G Realtek and iSCSI".
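
As a purely hypothetical sketch of that contract (the script name, the exit-code convention, and the checks are all invented for illustration), the new release could carry something like this, to be run on the old OS and mapped to green/yellow/red in the UI:

#!/bin/sh
# pre-update-check.sh (hypothetical): shipped inside the new release,
# executed on the currently running OS before extraction.
# Exit codes: 0 = green, 1 = yellow, 2 = red.

status=0

# Example "yellow" check: RAM below the currently recommended minimum
mem_mb=$(( $(sysctl -n hw.physmem) / 1048576 ))
if [ "$mem_mb" -lt 16384 ]; then
    echo "yellow - system only has ${mem_mb} MB RAM, 16GB RAM is now strongly recommended"
    status=1
fi

# Example "red" check: 2.5G Realtek plus iSCSI, the case from this thread
if pciconf -lv | grep -qi "RTL8125" && service ctld onestatus >/dev/null 2>&1; then
    echo "red - system has 2.5G Realtek and iSCSI"
    status=2
fi

exit "$status"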

The problem here is that no such hook exists (that I'm aware of), but it is absolutely possible for one to be added. Some UI work to add an interstitial warning prior to proceeding would be needed.

But there's a more subtle problem. This assumes that a fault within a given distribution is known in advance. It could be a very short window, as seems to be what happened in this case with the Realtek. I'll say "any idiot" can code up some shell or Python to do this kind of system analysis task pretty quickly, especially if it is limited to two variables (Realtek, iSCSI). I hope you agree with that.

The more subtle problem is that this would need to be included in the distribution tarball, which is signed and frozen at release, which makes the mechanism fairly weak. You would ideally want to be able to detect problems like this as errata were discovered; for example, if the latest LSI HBA driver suddenly had a TRIM issue that wasn't noticed until a week after release. You would then have the situation where the system check you've included in the distribution tarball greenlights an install that goes on to corrupt the data. In that case, you would really want to do an interactive download of the system check script over the Internet, if available, run it, and get all the latest errata checks included. Suddenly this gets more complicated, because there are ramifications to allowing/requiring Internet access for such things... and we've hated on iXsystems in the past when they made things like their over-the-Internet crash reporter mandatory, many years ago.

It would probably be worth doing if iXsystems were willing to put time and effort into detecting these kinds of errors, but I had quite a fight for years trying to get any sort of "memory too low" error inserted into the installer, to try to scare off people installing on sub-8GB RAM systems.

In my opinion it would be better to delay releases rather than do this kind of thing, especially on a U2 release that is supposed to be ready for "larger systems".

That's been done in the past, and has sometimes turned into a delay-after-delay(-after-delay... etc.) thing. There were probably still some factors that favored pushing out the update. I don't know, as I don't work for iXsystems, even if I'm very interested in this sort of thing.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
TL;DR: If my car has a fault, I would like the car to inform me, rather than disable the doors and lock me out.
In my view the better analogy would be "If the engine has a problem that could lead to the destruction of the car's payload, I want the engine to not start at all".

Overall this whole discussion looks to me like "I have failed on at least three levels, but still put the blame on someone else". Said levels are
  • Using hardware that has been known to be problematic for years and whose problems are well documented.
  • Applying an update without properly checking the changelog.
  • Not having a working fall-back mechanism with out-of-band access (IPMI or equivalent).
I consider myself the opposite of an iXsystems fanboy. In fact, I disagree with how they handle a number of things. But this, for a change, is certainly not one of them.
 


Eypsilon

Cadet
Joined
Oct 19, 2022
Messages
2
How can I enable the Realtek 8125 2.5G card again in 13.0-U2?
When the system is not used for iSCSI sharing and the NIC support is required, enabling the Realtek NIC driver is possible by going to System > Tunables and creating two new tunables.
Click ADD, enter these values:
  • Variable : if_re_load
  • Value : YES
  • Type : loader
and click SAVE.
Click ADD again, enter these values:
  • Variable : if_re_name
  • Value : /boot/modules/if_re.ko
  • Type : loader
and click SAVE.
To verify the Realtek driver is loaded, reboot the system, go to the Shell, and type kldstat -n if_re.ko. The command returns the module file name and details when it has been loaded.
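
For reference, those two loader-type tunables boil down to the standard FreeBSD loader.conf convention for loading a module from an explicit path (the TrueNAS UI stores and applies them for you, so this is only to make the mechanism clear), plus the verification step:

# equivalent loader.conf-style settings created by the two tunables above
if_re_load="YES"
if_re_name="/boot/modules/if_re.ko"

# after rebooting, confirm the vendor driver is loaded
kldstat -n if_re.ko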
 