Data Transfer with GridFTP

From HLRS Platforms

Introduction

For transferring large amounts of data, plain FTP cannot fully exploit high-bandwidth connections, especially when they have high latencies, as national or international Wide Area Networks (WANs) do. GridFTP is an extension of FTP that has been defined for this task. It supports parallel TCP streams and multi-node transfers (also known as striping) to achieve high data rates on high-bandwidth connections, even with high latencies. Furthermore, transfers can be restarted, and third-party transfers are possible: a transfer between two GridFTP servers that is initiated and controlled by a third party (i.e. the user).
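To get a feeling for why a single TCP stream struggles on such links, one can estimate the bandwidth-delay product, i.e. the amount of data that has to be "in flight" to keep the link busy. The following is only a back-of-the-envelope sketch; the function name bdp_bytes and the link figures are illustrative assumptions, not measurements of any HLRS connection.

```shell
# Bandwidth-delay product (BDP): bytes that must be in flight to fill a link.
# A single TCP stream whose window is smaller than the BDP cannot fill the
# link; several parallel streams share the load.
# Usage: bdp_bytes <bandwidth in Mbit/s> <round-trip time in ms>
bdp_bytes() {
    # Mbit/s * 1000000 / 8 = bytes/s ; ms / 1000 = s  =>  overall factor 125
    echo $(( $1 * $2 * 125 ))
}

# Illustrative values only: 10 Gbit/s link, 100 ms round-trip time.
bdp_bytes 10000 100   # prints 125000000
```

In this example roughly 125 MB would have to be in flight at any time, which is one reason to spread a transfer over several TCP streams.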

GridFTP has a typical client/server architecture: the server stores the data or has access to it, and the client downloads or uploads data, or controls a server-to-server transfer in a third-party transfer as described above. The Globus Toolkit includes a simple GridFTP client, globus-url-copy, which is described in more detail below. On top of that there is gtransfer, a more user-friendly tool with additional features, which is also described in more detail below.

At HLRS, dedicated GridFTP servers are available that have access to the high-performance file systems of the Hazelhen and Laki supercomputers. These servers can be used with a GridFTP client. Usually they are used in third-party transfers, where users download/upload data from/to another GridFTP server, e.g. at their home institution. There are two ways to conduct third-party transfers with our GridFTP servers: either you use the pre-installed GridFTP clients on our Hazelhen frontend nodes, or you install GridFTP clients outside the HLRS network, for example at your home institution.


Prerequisites for using our GridFTP servers

  • A personal X.509 certificate. To access our GridFTP servers and perform data transfers with GridFTP you need a GSI proxy credential (GPC) signed by your personal X.509 certificate. Please see "Key concepts of GSI security" for more information about GSI proxy certificates. This means that you first need a personal X.509 certificate signed by your organization or institute. In addition, the source and destination GridFTP services must be able to verify your GPC to enable the data transfer. By default, this requires a GPC derived from a personal X.509 certificate issued by one of the grid certificate authorities (CAs) that are members of the IGTF, or by one of their affiliated registration authorities (RAs). Please contact your IT department on how to acquire such a personal X.509 certificate.
  • The distinguished name (DN) of your X.509 certificate. After receiving your personal X.509 certificate you need to forward the certificate's DN to the HLRS personnel in order to activate access to our GridFTP servers. To determine the DN you can use the following openssl command on your personal X.509 certificate:
$ openssl x509 -noout -subject -in <YOUR_PERSONAL_X509_CERTIFICATE_FILE>
  • A Linux System with a GridFTP client installed (e.g. one of the Hazelhen frontend nodes)
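
The openssl command above prints the DN with a leading "subject=" label. A small helper can strip that label before you forward the DN. This is a minimal sketch: the name dn_only is made up for illustration, and the use of -nameopt compat (which asks newer OpenSSL releases for the slash-separated form shown elsewhere on this page) is an assumption about which notation is wanted.

```shell
# Strip the "subject=" label that openssl prints in front of the DN.
# Reads the openssl output on stdin and writes the bare DN to stdout.
dn_only() {
    sed -e 's/^subject= *//'
}

# Usage (the file name is a placeholder):
#   openssl x509 -noout -subject -nameopt compat -in usercert.pem | dn_only
```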


Further information on X.509 certificates


Pre-installed GridFTP client on the Hazelhen frontend nodes

  • Create a GSI proxy credential (GPC) locally at your workstation with either grid-proxy-init (requires installation of Globus packages or manual compilation and installation of the Globus Toolkit, see below) or genproxy (just requires the Bash shell and OpenSSL). Afterwards copy the resulting GPC (usually named "x509up_u<UID>") to your home directory at HLRS with scp and configure the environment variable X509_USER_PROXY with the path to your GPC ($ denotes a user prompt, user and host names are symbolic!):
user@local:~$ genproxy
Your identity: /C=DE/O=GridGermany/OU=Universitaet Stuttgart/OU=[..]/CN=[...]
Enter pass phrase for /home/user/.globus/userkey.pem:
Your proxy `/tmp/x509up_p13706.fileQNqstU.1' is valid until: Fri May 19 11:16:36 CEST 2017

user@local:~$ scp /tmp/x509up_p13706.fileQNqstU.1 user@hazelhen.hww.de:X509_USER_PROXY

user@local:~$ ssh user@hazelhen.hww.de

user@hazelhen:~$ export X509_USER_PROXY="$HOME/X509_USER_PROXY"
  • To use gtransfer, load the tools/gtransfer module (which automatically loads all prerequisite modules) on the Hazelhen frontend node you are currently logged in to ($ denotes a user prompt, user and host names are symbolic!):
user@hazelhen:~$ module load tools/gtransfer
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)

To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.

Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):

```
$ export X509_USER_PROXY="/path/to/gpc"
```

Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:

```
$ module load tools/gtransfer
```

load tgftp 0.7.0 (PATH, MANPATH)
In addition to the manual pages (man {tgftp|tgftp_log}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/tgftp/0.7.0/share/doc/README).
load gtransfer 0.8.1 (PATH, MANPATH)
Bash completion loaded: press the TAB key for completion.
In addition to the manual pages (man {gtransfer|gt|dparam|dpath|halias|gcat|gls|gmkdir|gmv|grm}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/gtransfer/0.8.1/README.md).
  • To use globus-url-copy alone, load the module tools/globus-gridftp-client on the Hazelhen frontend node you are currently logged in to ($ denotes a user prompt, user and host names are symbolic!):
user@hazelhen:~$ module load tools/globus-gridftp-client
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)

To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.

Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):

```
$ export X509_USER_PROXY="/path/to/gpc"
```

Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:

```
$ module load tools/gtransfer
```
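
After copying the GPC to the frontend node it can be worth checking that the file is readable by its owner only, since the Globus tools refuse proxy files with more permissive modes, and that it has not expired yet. The helpers below are a sketch under these assumptions; the function names are hypothetical and not part of any HLRS module.

```shell
# Succeeds if the proxy file is accessible by its owner only (mode 600 or 400).
proxy_mode_ok() {
    perms=$(ls -ld -- "$1" 2>/dev/null | cut -c1-10)
    case $perms in
        -rw-------|-r--------) return 0 ;;
        *) echo "permissions '$perms' not accepted, try: chmod 600 $1" >&2
           return 1 ;;
    esac
}

# Succeeds if the proxy is still valid for at least <seconds> (default: 1 hour).
# openssl's -checkend exits non-zero if the certificate expires within that time.
proxy_time_ok() {
    openssl x509 -noout -checkend "${2:-3600}" -in "$1" >/dev/null
}

# Usage on the frontend node:
#   proxy_mode_ok "$X509_USER_PROXY" && proxy_time_ok "$X509_USER_PROXY"
```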

Installing the GridFTP client at your home institution

  • Since version 5.2 of the Globus Toolkit, the GridFTP client is also available as a pre-compiled RPM package (for Red Hat Enterprise Linux 6 and 7, CentOS 6 and 7, Scientific Linux 6 and 7 and possibly others) or DEB package (for Debian GNU/Linux 7, 8 and 9 and Ubuntu Linux 14.04 LTS, 16.04 LTS, 16.10 and 17.04). Install the GridFTP client by following the instructions in the Globus Toolkit 6.0 documentation. If a pre-compiled package is available, it is usually named globus-gass-copy-progs; for source installs, make gridftp will include it. Be sure to also install the grid-proxy-init tool, which is included in the globus-proxy-utils package and in an installation from source with make gridftp, or just use the genproxy tool mentioned above. Only one of these tools is required for the creation of GSI proxy credentials.
  • Create a directory .globus in your home directory and place both your personal X.509 certificate (as usercert.pem) and your private key file (as userkey.pem) there. To create these files from a PKCS#12 keystore follow these instructions but use the names from above for the destination files. When using grid-proxy-init to create a GSI proxy credential, you can also place a PKCS#12 keystore (as usercred.p12) there - the Firefox web browser for example exports user certificates and keys as PKCS#12 keystore.
  • Additionally create another directory named certificates in .globus and place all the trusted CA certificates there. A collection suitable for use with the Globus Toolkit is provided by SURFsara as a tarball - download and untar it into the above directory. The included files are needed to authenticate remote entities (i.e. GridFTP servers).
  • Run grid-proxy-init or genproxy to verify the validity of your personal X.509 certificate and to create a GSI proxy credential signed by your personal X.509 certificate, with a default lifetime of 12 hours (for grid-proxy-init) or 24 hours (for genproxy). This step has to be repeated after the created GSI proxy credential has expired.
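
The directory layout from the steps above can be sketched as follows. The function names are hypothetical, the openssl pkcs12 calls show one common way to split a PKCS#12 keystore and are not necessarily identical to the instructions referred to above, and the restrictive file modes are an assumption based on what the Globus tools usually accept for private keys.

```shell
# Create ~/.globus and ~/.globus/certificates below a base directory
# (default: $HOME). Certificates and keys still have to be put in place.
init_globus_dir() {
    base=${1:-$HOME}
    mkdir -p "$base/.globus/certificates" && chmod 700 "$base/.globus"
}

# Split a PKCS#12 keystore into usercert.pem and userkey.pem.
# openssl prompts for the import password and a pass phrase for the key.
p12_to_pem() {
    p12=$1
    dir=${2:-$HOME/.globus}
    openssl pkcs12 -in "$p12" -clcerts -nokeys -out "$dir/usercert.pem" &&
    openssl pkcs12 -in "$p12" -nocerts -out "$dir/userkey.pem" &&
    chmod 644 "$dir/usercert.pem" &&
    chmod 400 "$dir/userkey.pem"
}

# Usage (keystore name is a placeholder):
#   init_globus_dir && p12_to_pem usercred.p12
```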

Usage

Workspaces

The paths to your workspaces are identical on supercomputers and GridFTP servers. To get the path of a specific workspace, first log in to the respective supercomputer frontend(s), then determine the name of the workspace you want to use and enter ws_find <WORKSPACE_NAME> to get the actual path to this specific workspace. More information about workspaces at HLRS can be found in the platforms wiki.
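
Because the paths are identical, the output of ws_find can be appended directly to a server address to form a GridFTP URL. The helper below is a small sketch; the function name gsiftp_url and the workspace name are placeholders, and the server address and port are the ones listed for Hazelhen in the globus-url-copy section of this page.

```shell
# Build a gsiftp:// URL from server address, port and absolute path.
# Usage: gsiftp_url <server address> <port> <absolute path>
gsiftp_url() {
    case $3 in
        /*) echo "gsiftp://$1:$2$3" ;;
        *)  echo "path must be absolute: $3" >&2; return 1 ;;
    esac
}

# On a frontend node (workspace name is a placeholder):
#   gsiftp_url gridftp-fr1.hww.de 2812 "$(ws_find my_workspace)"
```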


gtransfer (gt)

  • Type gt and hit the ENTER/RETURN key to get a brief usage message. Use gt --help and man gt to get a description of all gt options.
  • To start a transfer, enter gt, hit the SPACE key and then hit the TAB key three times to make use of the gt bash completion. You'll get a listing of all available options. Start with -s to enter the source address. The - character was already provided by the gt bash completion. After entering s hit the SPACE key and enter your source address, e.g. gsiftp://gridftp.domain.tld:2811. You can also hit the TAB key two times to get the preconfigured GridFTP source server addresses or host aliases. Add the path to your desired workspace just like on the supercomputer frontends (e.g. /lustre/cray/ws8/ws/user-workspace/) and then hit the TAB key two to three times to get a listing of the files and directories in your workspace directory on the remote server. Depending on the latency and the number of files present there, it can take a few seconds until you see results and this will only work if your GSI proxy certificate is considered valid by the remote GridFTP server and you are trying to list a directory where you have rx (read and execute) permissions. Type in the beginning of your desired file or directory and hit the TAB key to complete the name. If you want to copy all files in a directory, add /* or just / to the end of the path. Now continue with the destination address. Add -d to the command line, hit the SPACE key and continue with the destination address just like you entered the source address. Enter a / at the end of the destination path.
  • To recursively copy all files and directories below a given directory, add the -r option to the gt command line.

Example:

$ gt <TAB>

$ gt -

$ gt -<TAB><TAB>

$ gt -
--                       --configfile             --gt-max-retries         -m                       -s                       --verbose
-a                       -d                       --gt-progress-indicator  --metric                 --source                 --version
--auto-clean             --destination            --guc-max-retries        --no-sync                --sync-level             
--auto-optimize          -e                       --help                   -o                       --transfer-list          
-c                       --encrypt-data-channel   -l                       -r                       -v                       
--checksum-data-channel  -f                       --logfile                --recursive              -V

$ gt -s <TAB><TAB>

$ gt -s
hazelhen:  laki:

$ gt -s h<TAB>

$ gt -s hazelhen:

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/<TAB>

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file<TAB><TAB>

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
hazelhen:/lustre/cray/ws8/ws/user-workspace/file1  hazelhen:/lustre/cray/ws8/ws/user-workspace/file2  hazelhen:/lustre/cray/ws8/ws/user-workspace/file3

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file* -d gsiftp://gridftp.domain.tld:2811/~/
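
For a recursive copy, the same pattern applies with the -r option added. The helper below only prints the resulting command line so that it can be inspected before running it; the function name and the directory name "results" are made up for illustration.

```shell
# Print (not run) a gt command line for copying a directory tree recursively.
# Usage: gt_recursive_cmd <source directory> <destination directory>
gt_recursive_cmd() {
    # Source and destination should end in "/"; add it if it is missing.
    src=${1%/}/
    dst=${2%/}/
    echo "gt -r -s $src -d $dst"
}

# Usage:
#   gt_recursive_cmd hazelhen:/lustre/cray/ws8/ws/user-workspace/results \
#       gsiftp://gridftp.domain.tld:2811/~/results
```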

Hints

I have multiple user accounts at a remote GridFTP server. How can I choose a specific account?

This can be done by inserting a <USER>@ portion into your GridFTP URLs or prefixing host aliases with <USER>@. Replace <USER> with your desired username on the remote site.

Examples:

  • GridFTP URL:

gsiftp://gridftp.domain.tld:2811/[...]/files/* => gsiftp://user1@gridftp.domain.tld:2811/[...]/files/*

  • Host alias:

my-gridftp:/[...]/files/ => user1@my-gridftp:/[...]/files/
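
If you switch accounts often, the rewriting shown in the examples can be done by a small helper. This is only a sketch; the function name with_user is hypothetical and the paths in the usage lines are placeholders.

```shell
# Insert <USER>@ into a gsiftp:// URL or put it in front of a host alias.
# Usage: with_user <user> <GridFTP URL or host alias with path>
with_user() {
    case $2 in
        gsiftp://*) echo "gsiftp://$1@${2#gsiftp://}" ;;
        *)          echo "$1@$2" ;;
    esac
}

# Usage (quote URLs that contain wildcards):
#   with_user user1 'gsiftp://gridftp.domain.tld:2811/data/files/*'
#   with_user user1 my-gridftp:/data/files/
```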


Can gtransfer automatically create non-existing directories on the destination side?

Yes, this is possible and activated by default. Just enter the desired name or path in your destination URL and gtransfer will automatically create non-existing directories on the destination side (with the help of globus-url-copy).


Use host aliases for your GridFTP servers

There are already two host aliases defined which point to the two GridFTP servers at HLRS:

  • hazelhen:
  • laki:

You can use them instead of the longer host part of a GridFTP URL in the source and destination URLs, e.g. you can use:

  • hazelhen:/lustre/cray/ws8/ws/user-workspace instead of
  • gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace

To create your own host aliases, please refer to the host aliases documentation linked below.

What if the gtransfer command fails during a data transfer?

globus-url-copy, the tool gtransfer actually uses (through tgftp) to transfer data, is configured by gtransfer to retry files that failed to transfer to the destination GridFTP server. If that fails, gtransfer retries the whole process three times before giving up on the transfer. Even then, you can later continue a failed or interrupted transfer by simply issuing the very same gtransfer command. gtransfer stores state information about a transfer in your home directory below .gtransfer, so this mechanism works as long as you use the same home directory and user account and the state files are not modified in between.

What if I need to interrupt a data transfer?

You can always interrupt a gtransfer data transfer by hitting CTRL+C during a data transfer, which effectively sends a SIGINT to the gtransfer process group and interrupts the data transfer. You can continue the transfer from where it was interrupted by issuing the very same gtransfer command - as with failed transfers described above. The same restrictions - same host, same user account, no fiddling with the state files in between - apply here.
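
Since re-issuing the very same command continues a transfer, this can also be scripted. The wrapper below is a generic sketch that is not part of gtransfer; the name rerun, the RERUN_DELAY variable and the default pause of 60 seconds are assumptions chosen for illustration.

```shell
# Re-issue the same command until it succeeds or the attempts are used up.
# Usage: rerun <max attempts> <command> [arguments...]
rerun() {
    max=$1
    shift
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$max" ]; then
            return 1
        fi
        n=$((n + 1))
        # Pause between attempts (seconds), adjustable via RERUN_DELAY.
        sleep "${RERUN_DELAY:-60}"
    done
}

# Usage:
#   rerun 5 gt -s 'hazelhen:/lustre/cray/ws8/ws/user-workspace/file*' \
#       -d gsiftp://gridftp.domain.tld:2811/~/
```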


Documentation

General


Man pages

Man(ual) pages are also available locally on the Hazelhen frontends. Simply enter man and the name of the man page (e.g. gtransfer or dpath) to read a specific page. If man pages with the same name exist in different sections, you also have to specify the section number after the man command but before the name of the man page. E.g. to read the dparam(5) man page, which contains the file format description for dparams, you would enter man 5 dparam.


Section 1


Section 5


Special functionality


globus-url-copy (aka Globus GridFTP client (GGC))

  • Type globus-url-copy and hit the ENTER/RETURN key to get a brief usage message. Use globus-url-copy -help and man globus-url-copy to get a description of all globus-url-copy options.
  • The basic syntax is:
globus-url-copy [optional command line switches] source destination
  • Source and destination can be further resolved to:

globus-url-copy [optional command line switches] {gsiftp://<server address>:<port> | file://}<absolute path> {gsiftp://<server address>:<port> | file://}<absolute path>

  • Files on remote systems can be referenced by gsiftp:// URLs whereas local files have to be referenced by file:// URLs. The usage of gtransfer host aliases is not supported by globus-url-copy, hence you need to enter the server addresses and ports manually. Use the following table for reference:
Host      Server address      Port
Hazelhen  gridftp-fr1.hww.de  2812
Laki      gridftp-fr2.hww.de  2812
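
If you are used to the gtransfer host aliases, a small helper can translate them into the URL form globus-url-copy expects. This is a sketch only: the function name alias_to_url is hypothetical, and it assumes that the hazelhen: and laki: aliases correspond to the table rows of the same name.

```shell
# Translate an HLRS host alias (with path) into a gsiftp:// URL.
# Usage: alias_to_url <alias>:<absolute path>
alias_to_url() {
    case $1 in
        hazelhen:*) echo "gsiftp://gridftp-fr1.hww.de:2812${1#hazelhen:}" ;;
        laki:*)     echo "gsiftp://gridftp-fr2.hww.de:2812${1#laki:}" ;;
        *)          echo "unknown host alias: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   alias_to_url hazelhen:/lustre/cray/ws8/ws/user-workspace
```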

Example:

$ globus-url-copy -cc 2 -tcp-bs 4M -p 2 -cd gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace/file* gsiftp://gridftp.domain.tld:2811/~/
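
In the example above, -cc sets the number of concurrent file transfers, -p the number of parallel TCP streams per transfer, -tcp-bs the TCP buffer size and -cd creates missing destination directories. The example is a third-party transfer; for uploading a local file the source has to be a file:// URL with an absolute path. The helper below is a sketch for building such a URL; the function name file_url and the local path are placeholders.

```shell
# Turn an absolute local path into a file:// URL for globus-url-copy.
# Usage: file_url <absolute path>
file_url() {
    case $1 in
        /*) echo "file://$1" ;;
        *)  echo "path must be absolute: $1" >&2; return 1 ;;
    esac
}

# Upload sketch (local path is a placeholder):
#   globus-url-copy -cd "$(file_url /home/user/data.tar)" \
#       gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace/
```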


Documentation

See the Globus Toolkit documentation on globus-url-copy for more details about this tool.

Further Information


Support