(The motivation for this work came from wanting to show how easy it is to get something running at home on crappy hardware with a few tricks, plus some hints on how to tune filesystems and volume managers for better performance in a specific workload.)
Big disclaimer
The hardware I used here is quite old, and the performance of tests like these can fluctuate (because they are very long) due to other factors I am deliberately omitting to keep the comparison simple. This will be a series of 3 posts using one HDD and USB devices, with the objective of showing how to start playing with filesystems and volume managers. Some of these approaches might not be as noticeable on high-end hardware such as fast SSDs, NVMe drives, or RAID adapters in front of them. In those situations you might need different tuning approaches, which I am ignoring in this series given the sheer difference in test run times.
In this series, I will be comparing performance across several block-size tests I made to research how ZFS works on HDDs, and comparing it (in a later post) with other filesystems and volume managers. For the restore I always use the same Hive-Engine (Light mode) snapshot (a MongoDB dump) from @primersion (which you can download from here).
Software used
- MongoDB 5.0.9
- Kernel 5.15.0-41-generic #44~20.04.1-Ubuntu SMP
- ZFS (output of zfs version):
  zfs-0.8.3-1ubuntu12.14
  zfs-kmod-2.1.2-1ubuntu3
Hardware
Something that was headed for the trash and I just grabbed for this test. It is still going to the trash by the end of the year! The OS is running from a separate disk to avoid interfering with the tests.
- 2 cores (4 threads with Hyper-Threading) Intel @ 3.6GHz (6MB L3 cache)
- 8GB RAM (DDR3) 1600MHz
- 2TB 7200RPM disk via SATA II (very old stuff from 2017)
Initial steps for ZFS
If you need to edit (create or delete) partitions on the disk, you can use the fdisk tool or an equivalent before handing the disk to ZFS.
You will need to either start from an Ubuntu install that already has ZFS or install it afterwards, e.g. apt install zfsutils-linux
I am using the ZFS feature defaults here, so no compression and no deduplication.
# Pool for the 2TB HDD
zpool create dpool /dev/sda1
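Once the pool exists, it does not hurt to confirm that the pool is healthy and that compression and deduplication really are off:

# Sanity check: pool health and the properties we care about
zpool status dpool
zfs get compression,dedup dpool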
ZFS Test #1 (default blocksize, 128KB)
This is a restore using the ZFS default block size (128KB) for the MongoDB replica.
# Stop the MongoDB
systemctl stop mongod
# Ensure the mount point is not mounted
umount /var/lib/mongodb
# Create the new filesystem
zfs create -o mountpoint=/var/lib/mongodb dpool/mongodb
# Change permissions of the MongoDB directory
chown mongodb:mongodb /var/lib/mongodb
# Start MongoDB
systemctl start mongod
# Initialize the replica set
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T14:43:47.182+1200 37324269 document(s) restored successfully. 0 document(s) failed to restore.
real 120m5.672s
user 6m35.602s
sys 0m55.827s
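As a sanity check, you can confirm the record size the new dataset actually got; since none was set, it should report the 128K default:

zfs get recordsize dpool/mongodb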
ZFS Test #2 (1 MB block size)
This is a restore using a 1MB ZFS block size for the MongoDB replica.
systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=1M -o mountpoint=/var/lib/mongodb dpool/mongodb_1m
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T04:23:50.055+1200 37324269 document(s) restored successfully. 0 document(s) failed to restore.
real 251m49.991s
user 6m36.547s
sys 0m57.353s
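One detail worth noting: each test reuses the same mountpoint, so the dataset from the previous run is left unmounted but still configured to mount there. If you no longer need the old data, a cleanup along these lines (my assumption, not shown in the original runs) avoids mountpoint clashes later:

# Drop the dataset from the previous test (this destroys its data!)
zfs destroy dpool/mongodb
# Or keep the data but detach it from the mountpoint:
# zfs set mountpoint=none dpool/mongodb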
ZFS Test #3 (4KB block size)
This is a restore using a 4KB ZFS block size for the MongoDB replica.
systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=4K -o mountpoint=/var/lib/mongodb dpool/mongodb_4k
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T18:00:11.445+1200 37324269 document(s) restored successfully. 0 document(s) failed to restore.
real 190m29.359s
user 6m33.841s
sys 0m59.114s
Tuning considerations
It's clear that, for a single-HDD setup, larger block sizes are not beneficial for random IO. This is expected because the number of IOs a rotational disk can do is very limited (usually 50-200/s max), and random IO performance drops even further when each IO moves a large block that you are not using efficiently (explained ahead).
Another detail to consider is that if the database uses many small files, this can result in inefficient use of disk space, and you will also lose some performance due to the unaligned nature of the IO.
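A rough back-of-envelope to illustrate both points (assuming ~100 random IOPS, the middle of the 50-200 range above):

# 4KB records:  100 IOPS x 4KB  ~ 0.4 MB/s of useful random writes
# 32KB records: 100 IOPS x 32KB ~ 3.2 MB/s, and most records still hold a whole small file
# 1MB records:  a small update still rewrites a full 1MB record (read-modify-write),
#               so most of that bandwidth moves data you did not actually change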
Checking file size distribution (Number of files / file size in KB):
/var/lib/mongodb# du -k * | awk '{print $1}' | sort | uniq -c | sort -n | tail
9 60
10 43
10 52
12 47
26 35
29 31
30 39
53 27
374 5
727 23
Total files:
/var/lib/mongodb# du -k * | wc -l
1498
So here we can see that more than 85% of the files are well under 64KB, averaging maybe around 32KB. Hence the default ZFS block size is not the best choice to save space and squeeze performance out of this specific MongoDB scenario. The best compromise might be 32KB, because most of those IOs are below that block size and the "waste" in block efficiency is not much.
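To check that figure directly instead of eyeballing the tail output, a quick one-liner like this (just an illustration) counts the files at or below 64KB:

/var/lib/mongodb# du -k * | awk '$1 <= 64 {small++} {total++} END {printf "%d of %d files (%.0f%%) are <= 64KB\n", small, total, small*100/total}'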
Remember that every time you need 1 byte over the block size you will need at least 2x the IOs, so it's important to use a block size that covers ~99% of the file sizes but is not so big that it wastes space and IO latency. So let's give 32KB a go and see if it really makes any difference!
Final run with 32KB block size
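The dataset creation for this run is not shown; following the same pattern as the tests above, it would look roughly like this (the dataset name is my guess), and the 64KB run further down is analogous with recordsize=64K:

systemctl stop mongod
umount /var/lib/mongodb
zfs create -o recordsize=32K -o mountpoint=/var/lib/mongodb dpool/mongodb_32k
chown mongodb:mongodb /var/lib/mongodb
systemctl start mongod
mongo --eval "rs.initiate()"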
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-23T23:11:14.319+1200 37324269 document(s) restored successfully. 0 document(s) failed to restore.
real 99m4.594s
user 6m35.625s
sys 0m55.475s
This will likely be our best result. Going to higher block sizes increases throughput but also increases latency, so it becomes a game of averages and of how much space efficiency you are willing to lose, for very little performance gain, if any.
And if you keep trying sizes up to the 128KB default, you will see it is good enough for probably many situations, which is what the ZFS developers found... hence why it is the default. But in this specific situation, you would lose some performance.
To confirm... here are the results with 64KB:
time nice -20 ionice -c3 mongorestore --gzip --archive=./hsc_07-21-2022_b66333824.archive
...
2022-07-24T01:15:14.174+1200 37324269 document(s) restored successfully. 0 document(s) failed to restore.
real 110m53.942s
user 6m38.025s
sys 0m53.947s
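Putting the five runs side by side (block size and restore wall-clock time):
- 4K: 190m29.359s
- 32K: 99m4.594s
- 64K: 110m53.942s
- 128K (default): 120m5.672s
- 1M: 251m49.991s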
Conclusions
As you can see, ZFS is very flexible in how you can manage the backend storage to match your application's IO.
Note that in this case there is only one application running. With multiple applications, the results will obviously change, but you will still be able to optimize the block size in a much more granular way, since the same pool can host multiple filesystems (datasets), each with its own settings, as shown below.
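For example, a single pool can carry one dataset per application, each with its own record size (the names and sizes below are purely illustrative):

# One pool, several datasets, each tuned for its own workload
zfs create -o recordsize=32K -o mountpoint=/var/lib/mongodb    dpool/mongodb
zfs create -o recordsize=8K  -o mountpoint=/var/lib/postgresql dpool/postgres
zfs create -o recordsize=1M  -o mountpoint=/srv/backups        dpool/backups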
There are other volume managers that can do this (for example LVM), but in my view not as easily as ZFS. Performance-wise, it depends, and I will try to touch on that in the 3rd post of this series.