Hey fellow Selfhosters! I need some help, I think, and searching isn't yielding what I'm hoping for.
I recently built a new NAS for my network with 4x 18TB drives in a ZFS raidz1 pool. Previously, I had been using an external 12TB USB hard drive attached to a different machine.
I've been attempting to use rsync to copy the 12TB drive over to the new pool, and things go great for the first 30-45 minutes. At that point, the copy speed drops off and the 4 files currently in progress sit at 100% done. Eventually I have to reboot the machine, because the zpool no longer appears to be accessible. After the reboot, the pool appears fine, with no faults, and I can resume rsync for a while.
EDIT: Of note, the rsync process seems to stall and I can't get it to respond to SIGINT or Ctrl+C. I can SSH in separately, but running zpool status hangs with no output.
While the workaround is partially successful, the point of using rsync is to make the copy fairly hands-free, and it's been a week-long process to copy the 3TB that I have now. I don't think my zpool should be disappearing like that! Makes me nervous about the long-term viability. I don't think I'm ready to drop down on Unraid.
rsync is being initiated from the NAS to copy from the old server. Am I better off "pushing" than "pulling"? I can't imagine it'd make much difference.
Could my drives be bad? How could I tell? They're attached to a 10-port SATA card; could that be defective? How would I tell?
Thanks for any help! I've dabbled in Linux for a long time, but I'm far from proficient, so I don't really know the intricacies of dmesg et al.
Just to make sure: are you copying to your ZFS pool directory or a dataset? Check to make sure your paths are correct.
Push vs pull shouldn't matter, but I've always done push.
If your zpool is not accessible anymore after a transfer then there is a low-level problem here as it shouldn't just disappear.
I would install tmux on your ZFS system and keep windows open with htop, dmesg, and zpool status running so you can check your system while you copy files. Something that severe should become self-evident pretty quickly.
So next I'd be checking logs for SATA errors, PCIe errors, and ZFS kernel module errors: anything that could shed light on what's happening. If the system is locking up, could it be some other part of the server with a hardware problem, such as bad RAM, running out of memory, or a bad or full boot disk?
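If reading raw dmesg output feels overwhelming, here's a rough Python sketch of the kind of filtering I mean. The keyword patterns are my own guesses at what SATA, PCIe, ZFS, hung-task and out-of-memory messages tend to look like, not an exhaustive or official list, and the function names are made up for this example.

```python
import re
import subprocess

# Hypothetical keyword groups. These regexes are illustrative guesses,
# not a complete catalogue of what the kernel can log.
PATTERNS = {
    "sata": re.compile(r"\bata\d+(\.\d+)?:.*(error|failed|timeout|reset)", re.I),
    "pcie": re.compile(r"(pcieport|AER).*(error|corrected|uncorrected)", re.I),
    "zfs": re.compile(r"\b(zfs|zio|spl|txg_sync)\b", re.I),
    "hung_task": re.compile(r"blocked for more than \d+ seconds", re.I),
    "oom": re.compile(r"out of memory|oom-killer", re.I),
}


def classify(lines):
    """Return {category: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def scan_dmesg():
    """Run dmesg and classify its output (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return classify(result.stdout.splitlines())


def report(hits):
    """Format the classified lines as a short plain-text summary."""
    out = []
    for name, matched in hits.items():
        out.append(f"{name}: {len(matched)} line(s)")
        out.extend(f"    {line}" for line in matched)
    return "\n".join(out)
```

Calling `print(report(scan_dmesg()))` right after a stall (from a second SSH session, if that still works) should show which category lights up. Since you end up rebooting, the interesting messages may only be in the previous boot's log, so you might have to feed `classify` lines from a saved log file instead.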
If you're running TrueNAS, the replication feature was the smoothest and easiest way to move large amounts of data when I did it 18 months back. Once the destination location was accessible from the sending host, it was as simple as kicking off a snapshot, resulting in a fully usable replica on the receiving host. IIRC, iXsystems staff told me rsync can be problematic compared to the replication/snapshot system, as permissions and other metadata can be lost.
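If you do stick with rsync, the metadata concern can at least be reduced with the right flags. A minimal sketch, assuming a reasonably recent rsync on both ends; the hostname and paths in the usage note are placeholders I made up, and the function name is hypothetical.

```python
import shlex


def build_rsync_command(source, destination, io_timeout=300):
    """Build an rsync argv list that tries to keep metadata and allow resuming."""
    return [
        "rsync",
        "-a",                        # archive mode: recursion, perms, times, owner, group, symlinks
        "-H",                        # preserve hard links
        "-A",                        # preserve ACLs (needs support on both ends)
        "-X",                        # preserve extended attributes (same caveat)
        "--partial",                 # keep partially transferred files for the next run
        "--info=progress2",          # overall progress instead of per-file output
        f"--timeout={io_timeout}",   # give up if no I/O happens for this many seconds
        source,
        destination,
    ]


def as_shell(command):
    """Render the argv list as a copy-pasteable shell line."""
    return shlex.join(command)
```

For example, `print(as_shell(build_rsync_command("oldserver:/mnt/usb12tb/", "/tank/media/")))` prints a pull-style command you can paste into tmux. I wouldn't expect `--timeout` to rescue a process that is already stuck inside the kernel, so treat it as a convenience, not a fix for the hang.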
Thank you! I ended up connecting them directly to the motherboard and had the same result with rsync: eventually the zpool becomes inaccessible until reboot (ofc there may be other ways to recover it without a reboot).
When things lock up, will a kill -9 kill rsync or not? If it doesn't (and the zpool status lockup is already suspicious), it means things are stuck inside a system call. I've seen all sorts of horrible things with USB timeouts. Check your syslog.
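A process stuck like that normally shows up in state D (uninterruptible sleep), and SIGKILL isn't acted on until it leaves that state. Here's a small sketch, assuming a Linux /proc filesystem, that lists every process currently in state D. The function names are just ones I picked.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren].strip())
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were looking, or odd contents
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)
```

If `uninterruptible()` returns rsync, zpool, and some ZFS kernel threads all at once during a stall, that points at the storage path rather than at rsync itself. On kernels that expose it, reading `/proc/<pid>/stack` as root for one of those PIDs can show where in the kernel it is waiting.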
I don't have practical experience with ZFS, but my understanding is that it uses RAM a lot... if the RAM is new, it might be worth checking it by booting up memtest (for example) and just ruling that out.
Maybe also worth watching the system with nmon or htop (running in another tmux / screen pane) at the beginning of the next session, then when you think it's jammed up, see what looks different...
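To put numbers on the "what looks different" part for ZFS memory use, something like the sketch below could be run at the start of a session and again near a stall. It assumes OpenZFS on Linux, which exposes ARC counters as text in /proc/spl/kstat/zfs/arcstats (two header lines, then name/type/value columns). The function names and the summary fields are my own choices.

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"  # assumed: OpenZFS on Linux


def parse_arcstats(text):
    """Parse kstat text into {name: int}, skipping the two header lines."""
    stats = {}
    for line in text.splitlines()[2:]:
        parts = line.split()
        # Data rows look like: <name> <type> <value>
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_summary(stats):
    """Reduce the counters to ARC size, ARC limit and cache hit ratio."""
    gib = 1024 ** 3
    lookups = stats["hits"] + stats["misses"]
    return {
        "arc_size_gib": stats["size"] / gib,
        "arc_max_gib": stats["c_max"] / gib,
        "hit_ratio": stats["hits"] / lookups if lookups else 0.0,
    }


def read_arc_summary(path=ARCSTATS_PATH):
    """Read the live counters from the running system."""
    with open(path) as fh:
        return arc_summary(parse_arcstats(fh.read()))
```

Comparing two `read_arc_summary()` results alongside htop's memory bar should show whether the ARC is growing up to its limit while the rest of the system runs short, or whether memory looks unremarkable when the pool hangs.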
Awesome, thanks for giving some clues. It's a new build, but I didn't focus hugely on RAM, I think it's only 32GB. I'll try this out.
Edit: I did some reading about L2ARC, so pending some of these tests, I'm planning to get up to 64GB of RAM and then extend with an L2ARC SSD, assuming no other hardware errors.