FAQ bwGRiD

From HLRS Dgrid

General

How can I get important news for cluster users?

  • Subscribe to the mailing list dgridsysnews. It carries announcements of maintenance schedules, information about new software versions, etc. You can subscribe via the list's web page or by sending an email to dgridsysnews-request@listserv.uni-stuttgart.de with the subject subscribe.

Where can I get help?

  • There are several possibilities. This wiki page is updated regularly with hints on things users commonly find difficult. On the cluster, there is a man page that displays information about the bwGRiD Stuttgart:
man bw-grid

Software modules often bring their own instructions which you can read by calling

module help category/softwarepackage/version

e.g.

module help compiler/intel/12.0

Last but not least, feel free to contact our support staff.

What is the minimal command to submit a job to the batch system?

  • Specify the number of nodes, the number of processor cores per node, the type of the nodes (probably 'bwgrid') and the desired walltime.
    Example: 4 nodes, 8 processor cores per node for two hours
    qsub -l nodes=4:ppn=8:bwgrid,walltime=2:00:00 ./myscript
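
As a minimal sketch, the resource request above can be assembled by a small shell helper, so that node count, cores per node, and walltime are set in one place. The function name bwgrid_resources and its argument order are assumptions made for illustration; only the -l syntax is taken from the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the cluster software): prints the
# resource string used with "qsub -l" in the example above.
# Arguments: <nodes> <cores per node> <walltime as h:mm:ss>
bwgrid_resources() {
    nodes=$1
    ppn=$2
    walltime=$3
    printf 'nodes=%s:ppn=%s:bwgrid,walltime=%s\n' "$nodes" "$ppn" "$walltime"
}

# Usage on a frontend (commented out because it needs the batch system):
# qsub -l "$(bwgrid_resources 4 8 2:00:00)" ./myscript
```

Called as bwgrid_resources 4 8 2:00:00, the helper reproduces the resource string of the example.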

How can I filter the queue to only show my jobs?

  • The commands showq and qstat display job information. If there are many jobs in the queue, it is more convenient to filter the output by user:
    showq -u <user-name>
    qstat -u <user-name>

How can I upload/download files to/from the bwgrid?

  • Inside gsissh-Term, you can open an SFTP Session in the Tools-menu which allows you to transfer files in both directions, while you are logged in at a grid frontend.
  • If you usually log in via gsissh, you probably also have the commands
    gsiscp
    and
    gsisftp
    installed. Note that you may have to specify the port gsiscp connects to with -P 2222 (a capital -P, in contrast to gsissh, which takes a lowercase -p). If you are logged in via gsissh, you can also use gsiscp to transfer data between different bwgrid sites.
  • If you are using Globus Toolkit, you can also use the command
    globus-url-copy
  • In our bwgrid portal there is a menu File Browser below Gatlet, which also allows you to transfer files from your local computer to one of the grid frontends.
  • If you want to use rsync to transfer data, you have to tell it to use gsissh to log in:
    rsync --rsh="gsissh -p 2222"
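
The rsync call above can be wrapped in a small function so that the remote shell option does not have to be retyped. This is only a sketch: the name bwgrid_rsync is hypothetical, and the user and frontend names in the usage comment are placeholders that you have to replace with your own.

```shell
#!/bin/sh
# Hypothetical wrapper: runs rsync with gsissh on port 2222 as remote shell.
# All arguments are passed through to rsync unchanged.
bwgrid_rsync() {
    rsync --rsh="gsissh -p 2222" "$@"
}

# Usage (placeholders, not real names):
# bwgrid_rsync -av ./results/ <user>@<frontend>:results/
```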

Errors

mpirun: spawn failed

    Using
    mpirun -np 2 -hostfile $PBS_NODEFILE ./test.out
    in combination with the openmpi module causes an error like
    [n110402:02618] pls:tm: failed to poll for a spawned proc, return status = 17002
    [n110402:02618] [0,0,0] ORTE_ERROR_LOG: In errno in file ../../../../../orte/mca/rmgr/urm/rmgr_urm.c at line 462
    [n110402:02618] mpirun: spawn failed with errno=-11
    
    You simply have to omit the -hostfile option.

InfiniBand retry count

I get an error message about timeouts, what can I do?
    If your parallel programs sometimes crash with an error message like this:
    --------------------------------------------------------------------------
    The InfiniBand retry count between two MPI processes has been
    exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
    (section 12.7.38):
    
        The total number of times that the sender wishes the receiver to
        retry timeout, packet sequence, etc. errors before posting a
        completion error.
    
    This error typically means that there is something awry within the
    InfiniBand fabric itself.  You should note the hosts on which this
    error has occurred; it has been observed that rebooting or removing a
    particular host from the job can sometimes resolve this issue.  
    
    Two MCA parameters can be used to control Open MPI's behavior with
    respect to the retry count:
    
    * btl_openib_ib_retry_count - The number of times the sender will
      attempt to retry (defaulted to 7, the maximum value).
    
    * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
      to 10).  The actual timeout value used is calculated as:
    
         4.096 microseconds * (2^btl_openib_ib_timeout)
    
      See the InfiniBand spec 1.2 (section 12.7.34) for more details.
    --------------------------------------------------------------------------
    

    This means that the MPI messages cannot pass through our InfiniBand switches before btl_openib_ib_timeout expires. How often this occurs also depends on the traffic on the network. We have adjusted the parameters so that it should normally work, but if you have compiled your own Open MPI, perhaps as part of another program package, this value may not be set appropriately. You can, however, specify it when calling mpirun:

    mpirun -mca btl_openib_ib_timeout 20 -np ... your-program ...
    

    You can check the preconfigured parameters of the currently loaded module with:

     ompi_info --param btl openib 
    

    and grep the output for the parameters mentioned above.
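
To get a feeling for the values, the formula quoted in the error message (4.096 microseconds * 2^btl_openib_ib_timeout) can be evaluated directly. The following sketch prints the resulting timeout in milliseconds; the function name ib_timeout_ms is an assumption made for illustration and not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: evaluates the formula from the error message,
#   4.096 microseconds * 2^btl_openib_ib_timeout
# and prints the result in milliseconds.
ib_timeout_ms() {
    LC_ALL=C awk -v n="$1" 'BEGIN { printf "%.3f\n", 4.096 * (2 ^ n) / 1000 }'
}

ib_timeout_ms 10   # default quoted in the message: prints 4.194
ib_timeout_ms 20   # value from the mpirun example: prints 4294.967
```

Raising the parameter from 10 to 20 therefore lengthens the timeout from about 4 milliseconds to about 4.3 seconds.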

text file busy

    text file busy
    

    This error can occur on our Lustre file system (i.e. in a workspace) when the file is currently being executed by a job. In that situation it is a bad idea to overwrite or remove the file. Sometimes, however, a job running on another node crashes and the file stays locked. In that case:

    • Verify that the file is not used by any other job.
    • Rename the file to something else (e.g. tmp.file).
    • Copy it back to the original location (the copy should no longer carry the file lock).
    • Remove the temporary file (tmp.file if you used that name above).
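
The rename, copy back, and remove steps can be sketched as a small shell function. The name release_busy_file and the temporary file name are assumptions made for illustration. The function does not check whether a job still uses the file, so that first step has to be done by hand beforehand.

```shell
#!/bin/sh
# Hypothetical helper that replays the manual steps from the list above.
# Only call it after verifying that no job is still executing the file.
release_busy_file() {
    file=$1
    tmp="$file.tmp.$$"                      # temporary name, like tmp.file above
    mv -- "$file" "$tmp" || return 1        # rename the locked file
    # Copy it back; -p keeps mode and timestamps (e.g. the executable bit).
    # If the copy fails, the original is still available under the temporary name.
    cp -p -- "$tmp" "$file" || return 1
    rm -f -- "$tmp"                         # remove the temporary file
}

# Usage (the path is a placeholder):
# release_busy_file ./my_program
```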

    If the above steps do not help, try to allocate a new workspace, copy your input data there, and start your compute jobs again in the new workspace. As far as we know, this problem has been fixed. If you still get this error, please let us know.

Software Development

Where can I find documentation for compilers?

Documentation for MPI libraries

Documentation for numerical libraries

Documentation for the batch system

  • The most important commands are qsub, qdel, qstat.
  • The command showstart is quite interesting for the impatient user.

Certificates/VO

Unable to join VO

In phase II of the VO registration process, clicking the "I have read the AUPs. Click to register." button may produce a popup window with the error message "You need to first read the AUPs, please click on the provided link.", even though you have read the document. This happens when the document was downloaded by a download plugin: the page checks whether you have actually opened the document, and it cannot detect this when a download plugin is used. Deactivate the plugin or simply open the document in the browser. With the pdfdownload extension, for example, you have to disable pdfdownload completely and enable the Adobe Reader plugin, so that the PDF opens in the browser instead of being downloaded and opened externally.

I can not log in anymore

If you have installed a valid certificate and you are still getting an error

Permission denied (publickey,gssapi-keyex,external-keyx,gssapi-with-mic,gssapi,password,keyboard-interactive).

when you try to log in to gridway, your bwGRiD membership has most probably expired. Go to the bwGRiD registration page at https://vomrs.zam.kfa-juelich.de:8443/vo/bwgrid/vomrs and re-sign the grid AUPs via the menu item

bwgrid Registration Home > Member Info > Re-sign Grid and VO AUPs   

Then click on the link to the AUPs, read them, and finally accept them with the button.

What do I have to do when my certificate expires?

You will receive a reminder email four weeks and two weeks before your certificate expires, so you have enough time to renew it. The reminder contains detailed information on what you have to do:

  • Visit the certification website of your registration authority, e.g. for HLRS [1].
  • As with your first certificate, request a certificate and print the request form. Be sure to use the same information as before (the DN, i.e. the Distinguished Name, must be identical in order to reuse your VO membership).

Your browser will usually generate a new key pair for the new certificate. Since you have to replace the certificate in your middleware in any case, this is not a problem. If you are familiar with creating certificates on the command line, you can avoid it and reuse your old key.

  • Visit your registration authority and sign your request form, as you did for your first certificate.
  • Your certificate will be emailed to you.

VO membership reset after certificate renewal?

In 2008 the certification authority DFN changed the DN of their certification server from

old: /C=DE/O=DFN-Verein/OU=DFN-PKI/CN=DFN-Verein User CA Grid - G01
new: /C=DE/O=DFN-Verein/OU=DFN-PKI/CN=DFN-Verein PCA Grid - G01

Due to this change, your new certificate carries a modified DN for the CA, and the VO management server and any middleware gateway in front of the compute resources will identify you as a new, unknown user. You will therefore have no access to any resources. Since the VOMRS server (used for requesting VO membership etc.) was not designed for this situation, you have to write an email to the VO admins (vomrs-admin@fz-juelich.de). They need the following information to modify your entries:

  • Required: a note that your new certificate has a changed DN for the DFN CA, if possible together with the DN of the CA
  • Required: your DN, e.g. C=DE,O=GridGermany,OU=Universitaet Stuttgart,OU=HLRS,CN = Jochen Buchholz
  • Helpful: your VO(s)

Normally they will answer within one working day.

Untrusted self signed certificate?

If you installed your new certificate and receive the following error

...
GSS Minor Status Error Chain:
globus_gsi_gssapi: SSLv3 handshake problems OpenSSL Error: 
s3_clnt.c:951: in library: SSL routines, function
SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Can't get the local trusted CA certificate: 
Untrusted self-signed certificate in chain with hash 1149214e

gss_init_context failed

you need to update your CA certificate files. Formerly the issuer "/C=DE/O=DFN-Verein/OU=DFN-PKI/CN=DFN-Verein User CA Grid - G01" signed the certificates with subject "/C=DE/O=GridGermany/.*". To prevent a CA from signing for arbitrary subjects, every system keeps a list of files describing which CA is allowed to sign for which subject. Since the DFN changed their CA server names (switching from several issuing servers to one), the new server "/C=DE/O=DFN-Verein/OU=DFN-PKI/CN=DFN-Verein PCA Grid - G01" needs to issue for the subjects "/C=DE/O=GridGermany/.*" as well. The files on your local system do not permit this yet, and therefore the new certificate is treated as invalid.

You have to install the new CA certificate files either in the global "/etc/grid-security/certificates" directory (full Globus installation on Linux) or in the ".globus/certificates" directory below your local home directory (which overrides the system defaults, and is the place to use on Windows).

If the hash value from the error message above is "1149214e" (the file names in the directory are "1149214e.*" or, in newer versions, "DFN-GridGermany-Root.*"), you can either download the new version for the DFN Root CA (ca_GermanGrid-X.XX.tar.gz; the name changes with each revision of the certificate bundles) from http://dist.eugridpma.info/distribution/igtf/current/accredited/tgz/ and extract it to the corresponding directory, or you can use the package for all CAs in the International Grid Trust Federation (IGTF) (http://dist.eugridpma.info/distribution/igtf/current/accredited/igtf-preinstalled-bundle-classic.tar.gz).

If the hash value differs, you can use the IGTF package, or look in the corresponding .info file and try to fetch the corresponding file for the specific CA. After extracting the files, the new certificate should be accepted and you should be able to log in again.

IMPORTANT: If the error persists, you have to rename the corresponding .pem file to .0 (e.g. DFN-GridGermany-Root.pem -> DFN-GridGermany-Root.0)
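
The rename can be done by hand or with a small helper like the following sketch. The function name pem_to_dot0 is hypothetical, and the path in the usage comment assumes the per-user certificates directory mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: renames <name>.pem to <name>.0 in place.
# Files that do not end in .pem are left untouched.
pem_to_dot0() {
    pem=$1
    case $pem in
        *.pem) mv -- "$pem" "${pem%.pem}.0" ;;
        *)     echo "not a .pem file: $pem" >&2; return 1 ;;
    esac
}

# Usage (assumes the per-user certificates directory):
# pem_to_dot0 "$HOME/.globus/certificates/DFN-GridGermany-Root.pem"
```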

If error messages like the ones above occur, but are truncated before the certificate hash is printed:

GSS Minor Status Error Chain:
globus_gsi_gssapi: SSLv3 handshake problems
OpenSSL Error: s3_srvr.c:2010: in library: SSL routines, function 
SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Can't get the local trusted CA certificate: Cannot 
find trusted CA certificate wit

then the cause is probably an update of libssl. In that case, you have to recompile/reinstall Globus.

There was a problem with the connection or with authenticating: Error from GSS layer

..
com.sshtools.j2ssh.transport.kex.KeyExchangeException: Error from GSS layer 
..
Caused by: GSSException: Failure unspecified at GSS-API level [Caused by:    Unknown CA]
..

If you receive a Java exception like the one above, the probable cause is that your client does not accept the server's certificate. This is the case if your global /etc/grid-security/certificates/ directory or your personal ~/.globus/certificates/ directory (which overrides the global directory) contains no information about the CA of the server's certificate, such as the namespaces the CA is allowed to sign for and the corresponding public key. To solve this problem, please follow the solution for the previous topic, Untrusted self signed certificate.

Could not load your certificate: [JGLOBUS-22] Algorithm not supported

This error appears at least when Firefox is used in combination with OpenSSL on Windows 7. It seems that openssl creates the wrong format by default. To solve this issue, you have to use openssl again to rewrite userkey.pem:

openssl rsa -des3 -in userkey.pem -out userkey2.pem

Please do not use the same file name for input and output, since this seems to fail. Rename the output file afterwards and copy it to the .globus directory in your home directory. Do not omit the "-des3" option, since otherwise your key will be saved without any password protection.
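
The two cautions above can be built into a small wrapper, sketched below. The function name rewrite_userkey is hypothetical, and restricting the permissions of the result with chmod 600 is an assumption about what your middleware expects, not something stated above.

```shell
#!/bin/sh
# Hypothetical wrapper around the openssl call shown above.
# It refuses identical input and output names and always passes -des3,
# so the rewritten key stays protected by a passphrase.
rewrite_userkey() {
    in=$1
    out=$2
    if [ "$in" = "$out" ]; then
        echo "input and output file must differ" >&2
        return 1
    fi
    # openssl asks for the old and the new passphrase interactively.
    # chmod 600 is an assumption: private keys are usually owner-only.
    openssl rsa -des3 -in "$in" -out "$out" && chmod 600 "$out"
}

# Usage:
# rewrite_userkey userkey.pem userkey2.pem
```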

Support

Development Support Question

I cannot build my application, it is not running, or I did nothing and it does not work anymore. Please help me.

To get useful help with your application, you need to provide some information.

If you have problems building your application, please provide the following information:

  • Used software modules (module list),
  • The calls to the tools like compiler or linker,
  • The output from these tools.

If you have problems during the execution of your application, please provide the following information:

  • The command which you have used to submit your job to the execution queue (qsub ...),
  • Used software modules (module list),
  • The path where you executed your program, the command or script used to execute it, and information about the input files.

If you have not used your application for some time, recompiled it without changing anything else, and now run into problems, please recompile it with a command like the following, which also captures the warnings that compilers write to standard error:

 make ... 2>&1 | tee make.log

Please check the output carefully for warnings which could potentially be hints for problems. If this does not help, bundle this log together with the information mentioned above in your support question.
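
As a rough first pass over a long log, the lines that mention warnings or errors can be listed with grep. This is only a sketch: the function name build_log_hints is hypothetical, and a plain text match like this can miss messages or report harmless lines, so it does not replace reading the log.

```shell
#!/bin/sh
# Hypothetical helper: prints the numbered lines of a build log that
# contain "warning" or "error" (case-insensitive).
build_log_hints() {
    grep -n -i -E 'warning|error' "$1"
}

# Usage:
# build_log_hints make.log
```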