TrueNAS Scale Replication Behaviour Changed?

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Hello all,

I've been testing TrueNAS Scale on a non-critical auxiliary data replication node and have noticed something odd that has me a little concerned.

I use Zabbix to monitor disk activity on our primary systems, and since moving to Scale I've noticed sustained disk activity on one of our nodes. That node is the replication source; the target is the system performing the replication as a "pull" over SSH. Before Scale, replication would run for a couple of hours overnight and be done in short order, but now it appears to take so long that it trips over into the next day's daily replication.

Here are the metrics:
Storage Screenshot.png

(it says TB/s but it's actually GB/s; something is up with Scale's reporting)

I can't see anything obvious and I have the same replication configuration as before, here are details and notes:
  • The two nodes are attached directly with a 1GbE link with MTU9000
  • Source node has 12x 4TB SAS3 drives in a RAID 10 configuration, E5-2620v3, 128GB DDR4 ECC, NVMe Read Cache + Optane SLOG. Zpool is encrypted
  • Target node has 4x SATA 8TB in RAIDz1
  • Example replication configuration for one of the five replication tasks (one per dataset; they have various retention lifetimes) on the auxiliary server:
Replication Screenshot.png


As far as I can tell from my notes, all settings are the same. However, I initially had to enable "synchronise destination snapshots with source" to get replication to a nested dataset started; with it on or off, replication still appeared to take an extended length of time.

The total size of the data on the source server is about 8TB, and it looks like each replication cycle is literally copying every byte over again: 8TB over a 1GbE link works out to roughly 17-19 hours (8 TB is about 64 Tbit, and at ~1 Gbit/s that is around 64,000 seconds), and the activity only dips for a couple of hours, which appears to be when it's "done".

My question is whether something has changed with the settings such that it's now doing a full replication every cycle rather than an incremental (changed-block) replication.
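
For reference, this is the sort of quick check I plan to run from the Scale box to see whether the two ends still share a snapshot that could act as an incremental base (just a rough sketch; the SSH host and dataset names are placeholders for my real ones):

# Rough sketch only: list the snapshots on each end and see which names they
# share. Assumes key-based SSH to the source as root; the host and dataset
# names are placeholders.
import subprocess

SOURCE_HOST = "root@source-nas"               # placeholder
SOURCE_DATASET = "tank/some_dataset"          # placeholder source dataset
TARGET_DATASET = "Backup_Pool/some_dataset"   # placeholder target dataset

def snapshot_names(cmd):
    """Return the set of snapshot names (the part after '@') for a dataset."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return {line.split("@", 1)[1] for line in out.splitlines() if "@" in line}

list_snaps = ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-d", "1"]
source = snapshot_names(["ssh", SOURCE_HOST] + list_snaps + [SOURCE_DATASET])
target = snapshot_names(list_snaps + [TARGET_DATASET])

common = sorted(source & target)
print("Shared snapshots:", common if common else "NONE - only a full send is possible")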

I hope I'm articulating the issue correctly but I'm worried I'm slamming the disks unnecessarily now!

Many thanks,
Emile
 

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Can you specify the two end systems and what versions they are running?
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Can you specify the two end systems and what versions they are running?

Oops, apologies!

The source is "TrueNAS-12.0-U8" and the target is "TrueNAS-SCALE-22.02.0.1". The target is the system that has the replication task in a PULL configuration so it's using TrueNAS SCALE replication methodology.

Source Spec:
  • CPU: 2x E5-2620v3
  • RAM: 128GB DDR4-2133 ECC
  • X10DRU-i+
  • SAS3 Backplane
  • 12x Seagate ES.3 SAS 4TB
  • Quad X540 10GbE
Target Spec:
  • CPU: 2x E5620
  • RAM: 24GB DDR3-1600 ECC
  • Direct Attached SATA
  • 4x Seagate 8TB SATA
  • Dual Intel 1GbE
Edit:
That's weird: I was just checking the storage pool and the consumed size is very, very wrong:
Poolshot.png


That should be over 8TB, the snapshots folder is empty, and I've just had a new alert that I must have missed:

Replication "SVN_Repo_Daily_Replication" failed: can't compare offset-naive and offset-aware datetimes..

Here are the logs:
[2022/04/06 05:07:37] INFO [Thread-109] [zettarepl.paramiko.replication_task__task_5] Connected (version 2.0, client OpenSSH_8.4-hpn14v15)
[2022/04/06 05:07:37] INFO [Thread-109] [zettarepl.paramiko.replication_task__task_5] Authentication (publickey) successful!
[2022/04/06 05:07:37] INFO [replication_task__task_5] [zettarepl.replication.pre_retention] Pre-retention destroying snapshots: []
[2022/04/06 05:07:37] WARNING [replication_task__task_5] [zettarepl.replication.run] No incremental base for replication task 'task_5' on dataset 'Warm_HDD_Pool/SVN_Repository', destroying destination dataset
[2022/04/06 05:07:39] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-01-27_00-00' incremental_base=None include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:34] INFO [replication_task__task_5.process] [zettarepl.transport.base_ssh] [ssh:root@IP] [replication_process:task_5] Warning: 'cannot receive sharesmb property on Backup_Pool/Warm_HDD_Pool: pool and or dataset must be upgraded to set this property or value'
[2022/04/06 08:09:35] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-01-28_00-00' incremental_base='SVN_Repo_Daily-2022-01-27_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:42] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-01-29_00-00' incremental_base='SVN_Repo_Daily-2022-01-28_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:45] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-01-30_00-00' incremental_base='SVN_Repo_Daily-2022-01-29_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:48] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-01-31_00-00' incremental_base='SVN_Repo_Daily-2022-01-30_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:52] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-02-01_00-00' incremental_base='SVN_Repo_Daily-2022-01-31_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:09:59] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-02-02_00-00' incremental_base='SVN_Repo_Daily-2022-02-01_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:10:08] INFO [replication_task__task_5] [zettarepl.replication.run] For replication task 'task_5': doing pull from 'Warm_HDD_Pool/SVN_Repository' to 'Backup_Pool/Warm_HDD_Pool' of snapshot='SVN_Repo_Daily-2022-03-29_21-00' incremental_base='SVN_Repo_Daily-2022-02-02_00-00' include_intermediate=False receive_resume_token=None encryption=False
[2022/04/06 08:10:09] WARNING [replication_task__task_5] [zettarepl.replication.run] For task 'task_5' at attempt 1 recoverable replication error RecoverableReplicationError("cannot open 'Warm_HDD_Pool/SVN_Repository@SVN_Repo_Daily-2022-03-29_21-00': dataset does not exist\ncannot receive: failed to read from stream")
[2022/04/06 08:10:09] INFO [replication_task__task_5] [zettarepl.replication.run] After recoverable error sleeping for 1 seconds
[2022/04/06 08:10:10] ERROR [replication_task__task_5] [zettarepl.replication.run] For task 'task_5' unhandled replication error TypeError("can't compare offset-naive and offset-aware datetimes")
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 166, in run_replication_tasks
retry_contains_partially_complete_state(
File "/usr/lib/python3/dist-packages/zettarepl/replication/partially_complete_state.py", line 16, in retry_contains_partially_complete_state
return func()
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 167, in <lambda>
lambda: run_replication_task_part(replication_task, source_dataset, src_context, dst_context,
File "/usr/lib/python3/dist-packages/zettarepl/replication/run.py", line 248, in run_replication_task_part
pre_retention(src_context.context.now, replication_task, src_context.datasets, dst_context.datasets, target_dataset,
File "/usr/lib/python3/dist-packages/zettarepl/replication/pre_retention.py", line 48, in pre_retention
RetentionBeforePushReplicationSnapshotOwner(target_dataset, now, replication_task, source_snapshots,
File "/usr/lib/python3/dist-packages/zettarepl/replication/pre_retention.py", line 24, in __init__
super().__init__(*args, **kwargs)
File "/usr/lib/python3/dist-packages/zettarepl/replication/task/snapshot_owner.py", line 112, in __init__
self.delete_snapshots = {
File "/usr/lib/python3/dist-packages/zettarepl/replication/task/snapshot_owner.py", line 113, in <dictcomp>
dst_dataset: self.replication_task.retention_policy.calculate_delete_snapshots(
File "/usr/lib/python3/dist-packages/zettarepl/replication/task/retention_policy.py", line 84, in calculate_delete_snapshots
if parsed_dst_snapshot.datetime < now - lifetime:
TypeError: can't compare offset-naive and offset-aware datetimes
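
The last frame of that traceback (retention_policy.py, the line "if parsed_dst_snapshot.datetime < now - lifetime:") is comparing a datetime parsed from a snapshot name against a "now minus lifetime" value, and the TypeError is standard Python behaviour when exactly one of the two carries timezone info. A tiny snippet that reproduces it outside TrueNAS (purely illustrative, not the actual zettarepl code):

# Purely illustrative, nothing TrueNAS-specific.
from datetime import datetime, timedelta, timezone

# A naive datetime, like one parsed out of a snapshot name such as
# "SVN_Repo_Daily-2022-01-27_00-00" (no timezone attached).
snapshot_time = datetime.strptime("2022-01-27_00-00", "%Y-%m-%d_%H-%M")

# A timezone-aware cutoff, i.e. "now minus the retention lifetime".
cutoff = datetime.now(timezone.utc) - timedelta(weeks=2)

try:
    snapshot_time < cutoff
except TypeError as exc:
    print(exc)  # can't compare offset-naive and offset-aware datetimes

# The comparison only works once both values agree about timezone info,
# for example by attaching a zone to the naive one:
print(snapshot_time.replace(tzinfo=timezone.utc) < cutoff)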
 
Last edited:

morganL

Captain Morgan
Administrator
Moderator
iXsystems
Joined
Mar 10, 2018
Messages
2,694
Are you seeing all the snapshots being used for the incremental base?

There is a line here that makes me think there is an issue:

"TypeError: can't compare offset-naive and offset-aware datetimes"

Are the snapshots at each end named differently?

I'm not sure whether it's a configuration issue or a bug. If you had a third system (e.g. a VM), we could create a dummy dataset and test.
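
Setting up the test data doesn't need much; something along these lines on the source side would do (the pool and dataset names are just examples, run as root):

# Example only: create a throwaway dataset with a little data and one snapshot
# on the source, so a test replication task has something to send.
# Pool/dataset names are placeholders; assumes the dataset gets its default mountpoint.
import pathlib
import subprocess

DATASET = "tank/repl_test"   # placeholder pool/dataset

subprocess.run(["zfs", "create", DATASET], check=True)

# Put a small file in the dataset so the send stream isn't empty.
mountpoint = subprocess.run(
    ["zfs", "get", "-H", "-o", "value", "mountpoint", DATASET],
    capture_output=True, text=True, check=True,
).stdout.strip()
pathlib.Path(mountpoint, "hello.txt").write_text("replication test\n")

# One manual snapshot to replicate; a periodic snapshot task can take over later.
subprocess.run(["zfs", "snapshot", f"{DATASET}@manual-test-1"], check=True)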
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Are you seeing all the snapshots being used for the incremental base?

There is a line here that makes me think there is an issue:

"TypeError: can't compare offset-naive and offset-aware datetimes"

Are the snapshots at each end named differently?

I'm not sure whether it's a configuration issue or a bug. If you had a third system (e.g. a VM), we could create a dummy dataset and test.

Actually, I do have a third system, which I was using as an offsite replication target for a while between cloud moves. I've picked a smaller dataset to test replication with first, one that used to replicate fine between the nodes, and copied the config across. This third node is a TrueNAS Core build with similar hardware to the source and the same TrueNAS Core version.

Below are the configurations to see if one works and one doesn't.

Scale Config:
TN_Scale IT Repl Settings.png


Core Config:
TN_Core_12.0 IT Repl Settings.png


I should hopefully have an answer for you soon as to whether it works or suffers the same issue. One thing I did notice before I purged the Scale node's dataset is that it had snapshots, but when I unlocked the dataset it had no children and it was set to read-only where it used to be set to "inherit (off)", which I found very odd.

For sanity's sake, I have set Read Only to "off" on the replication target dataset on the Scale node, as well as ACL to "off", since there shouldn't be any ACL information for this dataset for now.

Will update once the replication has "finished".

Many thanks,
Emile
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
So the first replication "finished" on Scale but it provided this error:

cannot receive sharesmb property on Backup/Warm_HDD_Pool: pool and or dataset must be upgraded to set this property or value

The base dataset & pool are not set to SMB type, but surely the replicated dataset would carry the property (rather than the base); maybe I'm misreading the error. So I decided to run the replication a second time and now I'm getting this error again:

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 423, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 459, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1261, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/replication.py", line 454, in run
    await self.middleware.call("zettarepl.run_replication_task", id, really_run, job)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1318, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1286, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1186, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zettarepl.py", line 380, in run_replication_task
    self._run_replication_task_job(f"task_{id}", job)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zettarepl.py", line 440, in _run_replication_task_job
    raise CallError(make_sentence(message.error))
middlewared.service_exception.CallError: [EFAULT] can't compare offset-naive and offset-aware datetimes.

While writing this out, I realise I had five separate replication configurations all pointed at the same target dataset, assuming that, as with a single replication config with five datasets selected, each task would create a child dataset named the same as its source. Well, you know what they say about "assuming"!

I've now been able to successfully replicate the same dataset from the Core node to the Scale node by pre-making a specific child dataset as the target. Something about having multiple replication task configs pointed at the same dataset caused an unusual failure mode that wasn't obvious from the error messages. This likely would have happened Core to Core as well, because I had set up my replication target incorrectly.

To summarise:
  • When making individualised replication tasks for multiple source datasets, you need a separate child dataset on the target node for each.
  • In contrast, when you have a single replication task covering multiple source datasets, targeting a single parent dataset will auto-magically create the child datasets for you, named the same as the sources.
In the end it was a mistake on my part. It would be nice if replication, whether for a single source dataset or for multiple source datasets, would create a new child dataset on the target that follows the naming convention of the source. For anyone who hits the same thing, a rough sketch of pre-making the target child datasets follows below.
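
Roughly what I mean (the parent path mirrors my layout, the child dataset names are just examples, and it assumes you run this as root on the target):

# Sketch: create one child dataset under the replication target per source
# dataset, so each individual replication task has its own destination.
import subprocess

TARGET_PARENT = "Backup_Pool/Warm_HDD_Pool"
SOURCE_DATASETS = ["SVN_Repository", "dataset_b", "dataset_c"]   # example names

for name in SOURCE_DATASETS:
    child = f"{TARGET_PARENT}/{name}"
    # -p creates any missing parents along the way.
    subprocess.run(["zfs", "create", "-p", child], check=True)
    print("created", child)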

Testing on another dataset but likely problem solved!
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
Unfortunately, the problem has returned for two of the replication tasks, i.e. for two of the datasets.

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 423, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 459, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1261, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/replication.py", line 454, in run
    await self.middleware.call("zettarepl.run_replication_task", id, really_run, job)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1318, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1286, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1186, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zettarepl.py", line 380, in run_replication_task
    self._run_replication_task_job(f"task_{id}", job)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/zettarepl.py", line 440, in _run_replication_task_job
    raise CallError(make_sentence(message.error))
middlewared.service_exception.CallError: [EFAULT] can't compare offset-naive and offset-aware datetimes.

I currently have 5 replication tasks, of which:

  • Two have succeeded: one a general dataset used for NFS access to an SVN data store, the other an SMB-targeted datastore.
  • One gave an amber alert with "cannot receive sharesmb property on Backup_Pool/Build: pool and or dataset must be upgraded to set this property or value", which is odd because this dataset is almost exactly the same as one of the successful ones, except that it is snapshotted once a week instead of once a day.
  • Two failed with "can't compare offset-naive and offset-aware datetimes". These are just large datasets (a few TB each) used for SMB share access, like one of the successful tasks and the amber-alert task. The only observable difference with these two is that I had configured a different snapshot retention lifetime rather than "Same as Source", but then I had also configured a different retention lifetime for one of the successful replication tasks.
I've just reconfigured the snapshot lifetime to "Same as Source" for both failed replication tasks. The only other thing I may have accidentally changed is the Destination Dataset Read-only Policy, which was configured to "SET" on the failed tasks while the successful ones are configured to "REQUIRE". I've set one of the failed tasks to REQUIRE and left the other as SET, and will see whether one fails.
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
I have found the cause of the issue: if I modify a snapshot lifetime on the target system (the TrueNAS Scale system running the snapshot task and where the replication ends up), it fails with "can't compare offset-naive and offset-aware datetimes".

As long as the retention policy is set to "None" or "Same as Source", it seems to be ok.

I also thought it might be because the replicated data was encrypted and therefore locked on the Scale system, but I unlocked the target dataset and the same errors occurred.

On the third system, which still runs Core, I can modify the Snapshot Retention Lifetime as part of the Replication Task and it is perfectly happy to do so. The only thing I haven't tried is modifying the retention lifetime on a Scale-to-Scale replication (this one is Core to Scale).
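
If anyone wants to check which of their tasks are set up this way without clicking through the UI, the middleware can be queried from the shell. A rough sketch; the field names are based on what "midclt call replication.query" returns on my box, so treat them as assumptions if your version differs:

# Sketch: list each replication task's retention policy via the TrueNAS
# middleware client, to spot tasks with a custom lifetime (the ones that hit
# the datetime error for me). Field names are assumptions based on the
# replication.query output on my system.
import json
import subprocess

out = subprocess.run(["midclt", "call", "replication.query"],
                     capture_output=True, text=True, check=True).stdout

for task in json.loads(out):
    print(task.get("name"), "->", task.get("retention_policy"))
# Anything not reporting SOURCE or NONE is a candidate to switch back to
# "Same as Source" until this is sorted out.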

Edit: Apologies for this obscene multi-post, will make sure not to do again in the future.
 

CHorse

Cadet
Joined
Jul 1, 2022
Messages
1
I have found the cause of the issue: if I modify a snapshot lifetime on the target system (the TrueNAS Scale system running the snapshot task and where the replication ends up), it fails with "can't compare offset-naive and offset-aware datetimes".

As long as the retention policy is set to "None" or "Same as Source", it seems to be ok.

I also thought it might be because the replicated data was encrypted and therefore locked on the Scale system, but I unlocked the target dataset and the same errors occurred.

On the third system, which still runs Core, I can modify the Snapshot Retention Lifetime as part of the Replication Task and it is perfectly happy to do so. The only thing I haven't tried is modifying the retention lifetime on a Scale-to-Scale replication (this one is Core to Scale).

Edit: Apologies for this obscene multi-post, will make sure not to do again in the future.

I came here looking to see if someone has successfully done ZFS level replication from Scale to Core because I’m thinking about doing this soon.

Is that working for you?

Also, this message is a bug:
“can't compare offset-naive and offset-aware datetimes”

I’m pretty sure I have a hunch on why it could be, but I don’t work on the TrueNAS code so I can’t fix it myself. Maybe let’s file a bug report. I will find some time to do this over the weekend.
 

Emile.Belcourt

Dabbler
Joined
Feb 16, 2021
Messages
23
I came here looking to see if someone has successfully done ZFS level replication from Scale to Core because I’m thinking about doing this soon.

Is that working for you?

Also, this message is a bug:
“can't compare offset-naive and offset-aware datetimes”

I’m pretty sure I have a hunch on why it could be, but I don’t work on the TrueNAS code so I can’t fix it myself. Maybe let’s file a bug report. I will find some time to do this over the weekend.
I have managed to do Scale to Core and Core to Scale, but in either direction you cannot set a different retention policy; it has to be "Same as Source".

You've reminded me that I should submit a bug ticket on this; I haven't thought about it for a while.
 