SCALE becomes unreachable every few weeks

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
Hi everyone,

lately I've been plagued with my SCALE host becoming unreachable after a few days or weeks of uptime.

I'm on TrueNAS-SCALE-22.12.2.

I can't reach the Web UI, ssh is unreachable, but I can ping the machine.

I can reach the machine via IPMI, but all I get in the console is IPVS: rr: TCP 10.42.1.3:8080 - no destination available

Aside from a hard reset, I don't even know where else to look to get back in.

What do I do?

1690638817613.png
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
The console does in fact react. More than a minute after trying to enter something in the IPMI console, I just randomly saw this:

1690639220719.png


...and after trying to interact with the console (pressed 1 and enter, and waited for some time) the console came alive:

1690639657540.png



After this, I got in via SSH, but the web UI still refuses connection. How do I proceed?
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
It's hard to help without knowing what your system is. I've gathered you have an X11 board. How much RAM do you have in the system? The error message is literally saying you have run out of RAM. It also appears that you are using apps in Kubernetes. What are you doing with the system?
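One way to confirm the out-of-memory theory after the fact is to search the kernel log for OOM-killer entries. The following is only a sketch in Python: it assumes the kernel log was saved to a text file first (for example with journalctl -k, or journalctl -k -b -1 for the previous boot if the journal is kept across reboots), and it assumes the usual "Killed process <pid> (<name>) ... anon-rss:<n>kB" wording, which can differ between kernel versions. The function name and the sample line in the comment are made up for illustration.

```python
import re
import sys

# Assumed format of the kernel's OOM-killer line (illustrative, not from this thread):
# "Out of memory: Killed process 1234 (someapp) total-vm:...kB, anon-rss:...kB, ..."
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\).*?anon-rss:(\d+)kB")


def find_oom_kills(log_text):
    """Return (pid, process name, anonymous RSS in MiB) for each OOM kill found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_RE.search(line)
        if match:
            pid, name, rss_kb = match.groups()
            kills.append((int(pid), name, int(rss_kb) // 1024))
    return kills


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: journalctl -k > kernel.log; python3 oom_kills.py kernel.log
    with open(sys.argv[1], errors="replace") as fh:
        for pid, name, rss_mib in find_oom_kills(fh.read()):
            print(f"pid {pid:>7}  {name:<24} {rss_mib} MiB")
```

If this prints nothing for the boot that hung, the hang may have another cause, or the log may simply not have survived the hard reset.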
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
Sorry, you’re right. It’s an X11SSH-F with 32GB RAM, an Intel Xeon Quad-Core 1220 (I’m not sure which generation right now). Power Supply is a SeaSonic SS650KM, with a total of 5 WD Red HDDs, 2 WD Red SSDs and one old Intel SSD boot drive.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
What apps are you using? What are they doing? 32GB isn't a whole lot if you are using all of RAM in qBittorrent as an example :P
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
Quite a few, qBittorrent, Nextcloud, Home-Assistant, paperless-ngx, emby, sonarr, radarr, prowlarr, and a host of helper apps like tailscale, wg-easy, ddns-updater, traefik,... I don’t think I’ve really noticed the host being memory starved before, but I realise that’s a vague statement.

I’ll post a heavyscript screenshot of the apps in a few minutes.
Edit: No dice, heavyscript fails too, complaining that it "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" So without rebooting I don’t know how to get a complete list of apps right now.

Is there any way I can bring the UI back from the console? Find out which app uses excessive memory and kill it? Or is a reboot the only option at this point (and if so, how do I reboot properly from the console?)

I'd add some more memory, but I'd like to make sure this is actually the cause first. If this happens so infrequently though, how do I best verify that, considering the system crashes? I have Prometheus running as a TrueCharts app, can I monitor memory consumption in there somehow?
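On the Prometheus question: if that Prometheus instance scrapes a node_exporter on the host (an assumption, it may not be set up that way), available memory can be read through the standard HTTP query API. This is a rough Python sketch; PROMETHEUS_URL is a placeholder and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address and scrapes node_exporter.
PROMETHEUS_URL = "http://localhost:9090"
QUERY = "node_memory_MemAvailable_bytes"


def build_query_url(base_url, query):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})


def parse_instant_result(payload):
    """Map each instance label to its current value, converted to GiB."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    result = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        _timestamp, value = series["value"]  # the value arrives as a string
        result[instance] = float(value) / 2**30
    return result


def available_memory_gib(base_url=PROMETHEUS_URL):
    """Query Prometheus once and return available memory per instance."""
    url = build_query_url(base_url, QUERY)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_result(json.load(resp))
```

Since Prometheus itself runs as an app on the same box, its history stops when the host hangs, but the graph leading up to that point should still show whether available memory was trending toward zero.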

Another edit: funny, I just looked at your signature and saw Spencer... and in there a reference to multi-report... I think I'll put those on my shortlist of things to look into...
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
You need to reduce the number of applications or add additional RAM. There's really not anything more to be done. The system crashes when it runs out of memory. It runs out of memory the longer your apps run.

Take a look at htop, the graphs in your TN, or maybe netdata. You can watch the slow march to your crash.
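Because the box becomes unreachable before anyone can look at htop, it may help to write memory figures to a file so the trend survives a reset. Below is a minimal Python sketch that reads /proc/meminfo and appends one CSV row per call; the function names and the idea of running it from cron every minute are assumptions, not something built into TrueNAS.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> value (mostly kiB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def format_sample(fields, now=None):
    """Build one CSV row: timestamp, available memory (MiB), swap in use (MiB)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    available = fields["MemAvailable"] // 1024
    swap_used = (fields["SwapTotal"] - fields["SwapFree"]) // 1024
    return f"{stamp},{available},{swap_used}"


def log_once(log_path, meminfo_path="/proc/meminfo"):
    """Append a single sample to log_path and return the row that was written."""
    with open(meminfo_path) as src:
        row = format_sample(parse_meminfo(src.read()))
    with open(log_path, "a") as dst:
        dst.write(row + "\n")
    return row
```

Calling log_once with a path on a data pool once a minute would leave a record of the last readings before the machine stopped responding.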
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
About docker, are you using it in a VM? Afaik SCALE doesn't support it anymore.

32GB feels tight for the amount of apps/VMs you are using.

htop might give you an idea of your system resources. To reboot from the console, run reboot if you are logged in as root; the alternative is shutdown -r now.
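If htop is awkward over a sluggish IPMI console, roughly the same "who is using the memory" view can be pulled from /proc. This is a small Python sketch, assuming a Linux /proc layout; the function names are made up for illustration.

```python
import os


def parse_status(text):
    """Extract (name, resident memory in kiB) from /proc/<pid>/status text."""
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_processes(limit=10, proc_dir="/proc"):
    """Return the `limit` largest processes as (rss_kib, pid, name) tuples."""
    if not os.path.isdir(proc_dir):
        return []
    rows = []
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_dir, entry, "status")) as fh:
                name, rss_kb = parse_status(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        rows.append((rss_kb, int(entry), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss_kb, pid, name in top_processes():
        print(f"{rss_kb // 1024:>8} MiB  {pid:>7}  {name}")
```

Note that memory held by the ZFS ARC lives in the kernel and is not attributed to any process, so it will not show up in this list.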
 

NetSoerfer

Explorer
Joined
May 8, 2016
Messages
57
Okay, I didn’t think 32GB was so tight but fair enough. I’m not actually using docker at all, just TrueCharts apps on SCALE’s k3s.

Heavyscript just says docker, but it’s a tool to manage Truecharts apps.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Heavyscript just says docker, but it’s a tool to manage Truecharts apps.
Might be worth taking a dip into their Discord then, for the heavyscript issue.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
For now, until such time that you buy additional RAM, you should be able to increase the stability of your system by increasing the swap value from 2GB to 8GB. That is not a long term solution and ultimately your server is suffering a huge performance penalty right now.
1691262865221.png


1691262843491.png


The issue is that this change does not take effect until you add a new disk. Depending on your pool layout, you may want to format and re-add each disk one at a time, allowing a resilver in between, but again, it's simpler to just add RAM, and that also addresses the root cause of the problem.
1691262956874.png
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222

Not really? From my understanding TN doesn't use swap space, doing so would further reduce performance.

Consider doing as @NickF wrote only if you see heavy use of swap space in the reporting tab, but till now I haven't seen a single machine use it (even with lower RAM).
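Besides the reporting tab, swap usage can be checked directly from a shell by reading /proc/swaps. A minimal Python sketch follows; it assumes the usual Linux column order (Filename, Type, Size, Used, Priority, with sizes in kiB) and the function names are made up for illustration.

```python
def parse_swaps(text):
    """Parse /proc/swaps content into a list of (device, size_kib, used_kib)."""
    devices = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 4:
            devices.append((parts[0], int(parts[2]), int(parts[3])))
    return devices


def swap_summary(devices):
    """Return (total MiB, used MiB, used percent) across all swap devices."""
    total = sum(size for _, size, _ in devices)
    used = sum(in_use for _, _, in_use in devices)
    percent = 100.0 * used / total if total else 0.0
    return total // 1024, used // 1024, percent


if __name__ == "__main__":
    try:
        with open("/proc/swaps") as fh:
            total, used, percent = swap_summary(parse_swaps(fh.read()))
        print(f"swap: {used} of {total} MiB in use ({percent:.1f}%)")
    except OSError:
        print("could not read /proc/swaps (not a Linux host?)")
```

A used figure that sits near the total shortly before a hang would support the idea that a larger swap could buy some time; a figure near zero would suggest swap is not the limiting factor.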
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Not really? From my understanding TN doesn't use swap space, doing so would further reduce performance.
The Linux kernel will always use swap space to prevent crashes like this... in fact this is exactly why swap exists. You are generally right that TrueNAS is designed NOT to use swap space if it can help it, but in this particular case it may be a workaround until OP has the funds to upgrade his RAM.

This may be relevant and interesting: https://opensource.com/article/18/9/swap-space-linux-systems

Obviously the easier solution is to just turn some services/apps off :P
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Afaik TN decreases ARC size if it needs to increase "system" or "service" RAM.
I don't think this machine is using swap space.
There is no workaround to increasing RAM under 64GB (CORE, SCALE might need double).
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Afaik TN decreases ARC size if it needs to increase "system" or "service" RAM.
I don't think this machine is using swap space.
There is no workaround to increasing RAM under 64GB (CORE, SCALE might need double).
It does. This example proves that ARC is literally so barren it's not being used, while we are still seeing crashes. I would guarantee you that it is using swap space; this is an inherent Linux design philosophy and the very reason for swap's existence.

Swap's goal in life is to increase stability at the cost of performance. In this case, if the goal is to make sure the system doesn't crash, increasing swap size is a step in the right direction until a proper solution is implemented. It's a workaround by any definition.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
It does. This example proves that ARC is literally so barren it's not being used, while we are still seeing crashes. I would guarantee you that it is using swap space; this is an inherent Linux design philosophy and the very reason for swap's existence.
Nope, simply ARC is of lower priority than system/service RAM: whether it's heavily used or not, it will always "give way".
From my understanding, swap space in a TN machine will be used only after there is no more RAM to allocate for system/service processes, it will not be used for ARC.

And I have never seen a TN machine so messed up that doesn't have RAM for ARC... yet.
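Whether ARC has actually been squeezed down can be checked on the machine itself: OpenZFS on Linux exposes its ARC counters in /proc/spl/kstat/zfs/arcstats. A short Python sketch that pulls out the current size, the target size and the configured maximum (the function names are made up for illustration):

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"


def parse_arcstats(text):
    """Parse arcstats kstat text into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like "<name> <type> <value>"; the two header rows do not.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_summary(stats):
    """Return current (size), target (c) and maximum (c_max) ARC size in MiB."""
    mib = 2**20
    return {key: stats[key] // mib for key in ("size", "c", "c_max")}


if __name__ == "__main__":
    try:
        with open(ARCSTATS_PATH) as fh:
            summary = arc_summary(parse_arcstats(fh.read()))
        print("ARC MiB:", summary)
    except OSError:
        print("arcstats not found (ZFS module not loaded?)")
```

If size stays far below c_max while the apps keep growing, that would be consistent with ARC giving way to system and service memory as described above.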
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
And I have never seen a TN machine so messed up that doesn't have RAM for ARC... yet.
You are basing that on your experience with CORE. IIRC you're not exactly thrilled with SCALE and you are often cited as a proponent of CORE.
This example highlights the differences in memory management between Linux and FreeBSD. :wink:
Also the design philosophy of SCALE. The "do-it-all system" vs the storage box.
From my understanding, swap space in a TN machine will be used only after there is no more RAM to allocate for system/service processes, it will not be used for ARC.
This user has no RAM, thus we are seeing the message he posted which indicated things crashed when they ran out of memory. Thus we can deduce that we also ran out of swap...

EDIT/FWIW:
I was taught by multiple greybeard Linux sysadmins that Linux memory management sucks. When we were deploying Linux VMs for webapps or docker we always doubled our RAM to figure out how much SWAP to use. In the case of TrueNAS SCALE, the default 2GB of swap per disk might not be enough on smaller systems....as evidenced by this thread...
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
You are basing that on your experience with CORE. IIRC you're not exactly thrilled with SCALE and you are often cited as a proponent of CORE.
While it's true that I prefer CORE to (actual state of) SCALE, I have no reason to depict the latter in a different way than what it is (or in a more direct way, I do not twist things in order to put SCALE in bad light); regarding SCALE I am basing my posts on my experience here on the forum (I often read the SCALE section threads), and have always specified that what I wrote was not absolute and either from my understanding or as far as I know: I do not build barriers or see this as a pride contest, I learn anything I can.
Also, I don't think I am particularly cited in any way? The auctoritas here are others.

This user has no RAM, thus we are seeing the message he posted which indicated things crashed when they ran out of memory. Thus we can deduce that we also ran out of swap...
I had missed this and was awaiting info from the reporting tab. As I wrote, your solution is worth considering given the state of things.

Edited for spelling correction.
Edit2: I didn't like you implying I am not honest or objective in what I write, not a bit.
 

NickF

Guru
Joined
Jun 12, 2014
Messages
763
Edit2: I didn't like you implying I am not honest or objective in what I write, not a bit.
That wasn't my intention at all. I was merely trying to point out that my understanding of you as a TN user is that the vast majority of your experience is in CORE. If I misspoke there, or have mischaracterized you that's on me and I can apologize for that.
I had missed this and was awaiting info from the reporting tab. As I wrote, your solution is worth considering given the state of things.
To be clear - it is NOT a solution. It's a shitty walmart brand band-aid to mask the underlying problem.
 