transferring (lots of) files to a remote server + saving space
I have recently been asked how to transfer lots of files from a backup server to a local disk. The context is that
- the backup is an archive to be put on an external hard drive and then in a closet for future use,
- the transfer link is not entirely reliable and transfers may stop (for instance if the remote folder is mounted as a Samba / Windows / Apple file share)
rsync
The solution is to use the beloved rsync tool. In its simplest form, you use this command in the terminal:
rsync -av $INPUT $OUTPUT
Where -av
stands for "keep file attributes and be (not too) verbose", $INPUT
is the folder containing the files to be backed up, $OUTPUT
is the destination folder. Be careful to avoid a trailing slash in $INPUT
, unless you want the files that reside within the folder to be transferred instead of the folder itself.
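To see the trailing-slash rule in action, here is a throwaway experiment with temporary local folders (all names here are hypothetical, and this assumes a local rsync is installed):

```shell
# Hypothetical throwaway folders to illustrate the trailing-slash rule.
SRC_PARENT=$(mktemp -d)
mkdir "$SRC_PARENT/my_folder"
touch "$SRC_PARENT/my_folder/file.txt"
DST1=$(mktemp -d)
DST2=$(mktemp -d)

# No trailing slash: the folder itself is copied into the destination.
rsync -a "$SRC_PARENT/my_folder" "$DST1"
# Trailing slash: only the *contents* of the folder are copied.
rsync -a "$SRC_PARENT/my_folder/" "$DST2"

ls "$DST1"   # shows: my_folder
ls "$DST2"   # shows: file.txt
```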
One huge advantage is that $INPUT or $OUTPUT can point to a remote server in the form myname@remoteserver:/path/to/my/folder.
An additional, useful option is --dry-run, which simulates the command without moving any files. Also, --progress gives a bit more information during the transfer.
Let's try it:
%%bash
rsync -av --progress --dry-run perrinet.l@frioul.int.univ-amu.fr:/riou/work/invibe/ARCHIVES/backups_perrinet/12_backups/alex /Volumes/3tera/backups/12_backups/
All good! We could connect to the server and the folder names were apparently correct. Let's use another option to exclude some files:
%%bash
INPUT=perrinet.l@frioul.int.univ-amu.fr:/riou/work/invibe/ARCHIVES/backups_perrinet/12_backups/alex
OUTPUT=/Volumes/3tera/backups/12_backups/
rsync -av --dry-run --progress --exclude .AppleDouble $INPUT $OUTPUT
Let's now do it for real:
%%bash
INPUT=perrinet.l@frioul.int.univ-amu.fr:/riou/work/invibe/ARCHIVES/backups_perrinet/12_backups/alex
OUTPUT=/Volumes/3tera/backups/12_backups/
rsync -av --progress --exclude .AppleDouble $INPUT $OUTPUT
Everything went fine, with a file transfer speed of ~58 Mb/s. The great thing with rsync is that if you run it once again (or if the transfer was interrupted) you only transfer the files you need and not the whole thing. So for instance, if we run the same command again, we get:
%%bash
INPUT=perrinet.l@frioul.int.univ-amu.fr:/riou/work/invibe/ARCHIVES/backups_perrinet/12_backups/alex
OUTPUT=/Volumes/3tera/backups/12_backups/
rsync -av --progress --exclude .AppleDouble $INPUT $OUTPUT
That looks like a speedup of more than $10^7$, but the comparison does not really mean anything: there is simply nothing left to transfer, so the run is very quick :-)
Hardlinking
And now, what about saving space? On a hard drive, files exist physically on the platters as streams of bits (zeros and ones), but a crucial point is that a file may be chunked into different pieces (placing a big file on a fragmented disk causes problems similar to when you have to re-order your desk...). The solution used by most filesystems is a file allocation table containing (1) all the filenames along with the name of the folder in which they sit and (2) a pointer to the chunks on the physical disk where the actual data lives.
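The fact that several names can point to the same data chunks can be checked directly with ln, which creates exactly such an extra directory entry. A quick sketch in a temporary folder (the stat invocation differs between GNU and BSD/macOS, hence the fallback):

```shell
# Create a file and then a second name for the very same data (a hard link).
D=$(mktemp -d)
echo "hello" > "$D/original.txt"
ln "$D/original.txt" "$D/alias.txt"

# Both entries share one inode number, i.e. one set of data chunks on disk:
ls -i "$D/original.txt" "$D/alias.txt"

# The link count of the file is now 2 (GNU stat, with a BSD/macOS fallback):
stat -c %h "$D/original.txt" 2>/dev/null || stat -f %l "$D/original.txt"
```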
We can take advantage of that for backups: a somewhat obscure habit is to take a folder my_work, copy it and then rename the copy my_work_new. A strange habit, but let's stick with it and exploit the previous remark. Such a technique is used in many backup mechanisms (such as Time Machine on Mac OS X) and allows one to copy a whole folder under a different name without taking more space on the disk (google incremental backups for more info on this).
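rsync itself supports this kind of incremental backup through its --link-dest option: files that did not change between two snapshots are hardlinked to the previous snapshot instead of being copied again. A minimal local sketch (all paths hypothetical):

```shell
# A tiny "work" folder and a snapshot area, both in temporary locations.
WORK=$(mktemp -d)
echo "some data" > "$WORK/notes.txt"
SNAPS=$(mktemp -d)

# First, full snapshot.
rsync -a "$WORK/" "$SNAPS/snap1/"
# Second snapshot: unchanged files become hard links into snap1,
# so the new snapshot looks complete but costs almost no extra space.
rsync -a --link-dest="$SNAPS/snap1" "$WORK/" "$SNAPS/snap2/"
```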
As a matter of fact, it is possible to have two files with different names pointing to the same data chunks: this is a hardlink. If a program detected files that are exactly identical, for instance those produced by the copy-and-paste procedure described above, and replaced them with hardlinks, this would greatly reduce the actual physical space used. Many such programs exist, and one easy piece of software is available at https://hardlinkpy.googlecode.com/. It consists of a single Python script, so the install is obvious:
%%bash
wget https://hardlinkpy.googlecode.com/hg-history/1be1ba7ea38917e6b52c189ef625b05b4e7e4d52/hardlink.py
Now imagine we had copied the folder twice:
%%bash
INPUT=perrinet.l@frioul.int.univ-amu.fr:/riou/work/invibe/ARCHIVES/backups_perrinet/12_backups/alex
OUTPUT=/Volumes/3tera/backups/12_backups/alex_new
rsync -av --progress --exclude .AppleDouble $INPUT $OUTPUT
The program is easy to use: just type python hardlink.py in the terminal to get basic usage:
!python hardlink.py
Such that in our case, we want to run python hardlink.py /Volumes/3tera/backups/12_backups/:
!python hardlink.py /Volumes/3tera/backups/12_backups/
Ok, we just saved over 2 GB :-)
bookkeeping
%%writefile 14-07-07-transferring-lots-of-files-to-a-remote-server-saving-space.meta
.. title: 14-07-07 transferring (lots of) files to a remote server + saving space
.. slug: 14-07-07-transferring-lots-of-files-to-a-remote-server-s
.. date: 2014-07-07 10:31:28 UTC+02:00
.. tags: int, hardlink, rsync
.. link:
.. description:
.. type: text
!nikola build
!nikola deploy