News


This is a shortened listing of the news sent via the mailing list dgridsysnews@listserv.uni-stuttgart.de

solved - certificate status change (2013/09/10)
the administrators of the VOMRS server have corrected the validation of the user certificates and changed the status back from Expired to Approved. You should have received another email about this status change. If not, this indicates that the email address used for your VO registration is no longer valid, or that filtering rules in your mail account dropped that message.
Login is possible again to all bwGRiD sites which currently have clusters in operation.
If you still cannot log in, try:
gsissh -X -p 2222 -vv login-frontend.bw-grid-site.de  # put in a correct host name here
and check the last lines of the output. If it says
GSS Minor Status Error Chain:
globus_gsi_gssapi: SSLv3 handshake problems
OpenSSL Error: s3_clnt.c:842: in library: SSL routines, function
SSL3_GET_SERVER_CERTIFICATE: certificate verify failed
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: The certificate has expired: 
Credential with subject: /C=DE/O=DFN-Verein/OU=DFN-PKI/CN=DFN-Verein PCA Grid - G01 has expired.

gss_init_context failed
then you also have to update the CA certificates on your client. The CA certificates are located in the folder $HOME/.globus/certificates.
The current bundle of certificates from the EUGridPMA can be obtained here:
http://www.eugridpma.info/distribution/igtf/current/accredited/igtf-preinstalled-bundle-classic.tar.gz
Detailed instructions on how to install them can be found on our Wiki page:
https://wickie.hlrs.de/dgrid/index.php/Manual_Accessing_with_Globus#Integrating_the_CA_certificates
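As a minimal sketch (assuming the bundle unpacks its certificate files directly into the current directory; the wiki page above has the authoritative procedure):
cd $HOME/.globus/certificates
wget http://www.eugridpma.info/distribution/igtf/current/accredited/igtf-preinstalled-bundle-classic.tar.gz
tar xzf igtf-preinstalled-bundle-classic.tar.gz   # afterwards, retry the gsissh command above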
final shutdown of bwGRiD Stuttgart at the end of April 2013 (announced 2013/02/15)
after five years of production the bwGRiD cluster Stuttgart (HLRS) will go out of service on April 30, 2013. Access to the Home and Workspace file systems will be possible until May 15, 2013. After this deadline, these resources will be turned off as well.
Please be aware of these deadlines: later access to your data will NOT be possible.
In this context we want to draw your attention to the high performance computing resources offered by HLRS, which will be further extended in the near future. Within the scope of federal projects it is possible to use the Cray XE6 as well as a cluster with Nehalem and Sandy Bridge nodes. You can submit a proposal via the online proposal system. For more details see: http://www.hlrs.de/systems/access-and-usage-models/
The sites Ulm/Konstanz, Mannheim/Heidelberg, Esslingen, Karlsruhe, Tübingen and Freiburg continue operation of their bwGRiD resources.
Please note that the out-of-service date given in the Rundschreiben 8/2013 of the University of Stuttgart was a misunderstanding; the resources actually remain available until April 30, 2013.


Maintenance on bwGRiD Stuttgart
as announced after our last maintenance, we did not receive all hardware parts in time and need another maintenance to integrate this equipment into the cluster. We are planning to do this on
Wednesday, December 19, 2012
Therefore, no jobs will be run on this date, and the cluster frontends may be unreachable for some time while we are working on the network.
UPDATE (Dec 20, 2012): unfortunately, due to unforeseen technical problems, not all work in the maintenance of our grid cluster could be completed yesterday. We had to extend the maintenance to December 20, 2012. We apologise for any inconvenience this may cause.
software installation (2012-11-07)
  • bio/blast_plus/2.2.26
  • math/R/2.15.2
  • vis/blender/2.63
  • vis/visit/2.5.2
software installation and updates (2012-10-11)
New software was installed:
  • bio/blast/2.2.26 (rc1)
  • bio/bowtie/0.12.8
  • bio/cdhit/4.6
  • bio/rsem/1.1.21
  • bio/seqclean/Feb22.2011
  • cae/ansys/14.0
  • chem/dirac/11.0.1
  • chem/jmol/12.2.34
  • chem/moldyn/884-mkl-10.3.5-gnu-4.5
  • chem/pymol/1.5.0.3.rev4001-python-2.7.2-gnu-4.5 (rc1)
  • chem/tpp/4.6 (rc1)
  • devel/cmake/2.8.8 (rc2)
  • devel/ruby/1.9.3
  • math/R/2.15.1-mkl-10.3.5
our bwGRiD cluster is back in production with an updated software stack (2012-07-11)
We have updated the operating system (Scientific Linux 5.5, including latest patches), the batch system and scheduler, and we have installed / updated the following software packages:
  • bio/autodockvina/1.1.2
  • bio/mrbayes/3.2.1
  • chem/dalton/2011
  • chem/desmond/2012
  • chem/gromacs/4.5.5_single
  • chem/orca/2.9.1
  • chem/pymol/1.4.1 (rc)
  • chem/pymol/1.5.0.1 (rc)
  • chem/wxmacmolplt/7.4.3
  • devel/wxwidgets/2.8.12
  • math/octave/3.6.2
  • numlib/fftw/3.3.2-openmpi-1.4.3-gnu-4.1
  • numlib/fftw/3.3.2-openmpi-1.4.3-gnu-4.5
  • numlib/fftw/3.3.2-openmpi-1.4.3-intel-12.0
note: the packages marked as release candidate (rc) in this list may change in the near future.
wxmacmolplt/7.4.3 has been moved to the category "chem". In Stuttgart it was in the category "vis" until now. We will keep the module in the vis category until the end of July; then vis/wxmacmolplt/7.4.3 will be deprecated.


scheduled maintenance for Wednesday, July 11, 2012 (2012-07-03)
we have scheduled a maintenance on bwGRiD Stuttgart for Wednesday, July 11, 2012.
We are planning to update the operating system and install new software packages.
bwGRiD Stuttgart back in production with updated software stack (2011-11-02)
we have finished the maintenance on our grid cluster at HLRS. Apart from the changes in the power supply and the network cabling we have also updated the software stack. The following packages were installed / updated / recompiled with the latest compilers:
  • chem/amber/10
  • chem/amber/8
  • chem/amber/9
  • chem/desmond/2011
  • cae/ansys/13.0.sp2
  • chem/gromacs/4.5.5
  • chem/modeller/9v9
  • chem/namd/2.8
  • chem/nwchem/5.1.1
  • chem/nwchem/6.0
  • chem/orca/2.8.0.2
  • chem/vmd/1.9
  • chem/xplor_nih/2.29
  • bio/cadds/0.9
  • math/octave/3.4.3
  • phys/geant4/9.4
  • vis/gnuplot/4.4.3
  • phys/meep/1.1.1
if you use any of these modules, please check whether your jobs run properly with these latest versions, and modify your job scripts where necessary to take advantage of the latest improvements of the software.
scheduled maintenance on bwGRiD Stuttgart (2011-10-24)
on Wednesday, November 2, 2011 we have a scheduled maintenance on our grid cluster at HLRS.
Due to changes in the power supply unit, the cluster has to be shut down for this maintenance.
The frontends may remain available, but since we also plan to change the network cabling, longer interruptions are possible on the frontends as well.
maintenance finished (2011-09-26)
our bwGRiD cluster at HLRS is back in production.
We have updated the operating system and installed/updated the following software packages:
  • intel compiler 12.0 (is now default)
  • intel math kernel library 10.3.5
  • impi 4.0.2
  • java jdk 1.5.0 / 1.7.0
  • math R 2.8.1
  • mpjexpress
  • acml 4.4.0
  • gsl 1.15
  • paraview 3.10.1
short maintenance on bwGRiD Stuttgart (2011-09-23)
On Monday, September 26th we plan a short maintenance on our grid cluster. We are going to update software packages and install the latest security patches. We will reboot the frontends once, and all nodes will be rebooted between two jobs. The default version of some modules will be updated to the latest version we are going to install; however, the currently installed module versions will still be available.
scheduled maintenance on bwGRiD Stuttgart (2011-06-16)
On July 28, 2011 a downtime of all HLRS services will take place. The bwGRiD cluster will be shut down on July 27 and will become available again on July 29. This website will also be down on the day of the maintenance.
scheduled maintenance on bwGRiD Stuttgart (2011-05-20)
  • We have scheduled a maintenance on our bwGRiD Cluster at HLRS for
June 14, 2011, 19:00 until June 16, 2011, 19:00
The power supply for several central components of the cluster has to be reconnected differently. Therefore, we have to shut down the complete cluster.
  • We will shut down the filesystem /lustre2 as we have announced already. Please make sure to save all data from /lustre2 which you still need.
    /lustre will then be the only file system providing workspaces.
  • Today we have installed a new version of gromacs (4.5.4), compiled with Intel MPI and MKL; it is now available for use.
new software installations on bwGRiD Stuttgart (2011-04-28)
we have updated / newly installed the following software packages on the bwGRiD cluster at HLRS, Stuttgart:
  • Category Computer Aided Engineering (CAE):
    • OpenFoam 1.7.1 and 1.5
  • Category Chemistry:
    • Moldyn 835
  • Category Compiler:
    • GNU 4.3.5
    • GNU 4.5.2
    • Intel Cluster Suite 2011.0.013, providing
      icc, ifort, icpc 12.0.0.084
      MKL 10.3.0.084
      IMPI 4.0.1.007
    • Intel Cluster Tools 4.0, providing
      icc, ifort, icpc 11.1.072
      MKL 10.2.5.035
      IMPI 4.0.0.028
  • Category Mathematics
    • Math package "R" 2.13.0
  • Category MPI
    • mpiexec 0.84
    • mvapich2 1.5.1p1 with different compilers
    • openmpi-1.4.3 with different compilers
  • Category Numerical libraries
    • fgsl 0.9.3-gsl-1.14
    • gsl 1.14
  • Category Physics
    • Geant4 9.4
    • Root 5.28
  • Category Visualization
    • Paraview 3.10.0 and 3.8.1
If you encounter any problems with your programs due to this change, please check the modules you have loaded. We have also adapted the default versions of the modules as far as there is a consensus among the bwGRiD sites.
Therefore, if you load default modules and do not specify the exact version, you may need to recompile your code against the newly installed software packages. If you prefer to continue using the older packages, specify the exact version when loading the modules.
Example:
the command
module load compiler/intel
loads the module compiler/intel/11.1
The default intel compiler was set to compiler/intel/10.1 before, and if you want to use that one, you now have to call
module load compiler/intel/10.1
You might have to adapt your batch scripts if these changes cause any trouble for you. On the other hand, a uniform standard of default versions at all bwGRiD sites makes it easier to move between the different sites and make use of spare capacities.
default workspaces switched on bwGRiD Stuttgart (2011-04-19)
we have now switched the default for the workspaces to the larger and faster file system /lustre.
From now on, if you call ws_allocate, ws_list, ws_release etc. without the option -F it is automatically assumed that you use the 60TB file system /lustre.
The smaller and slower /lustre2 is still available, but we are going to shut that file system down, presumably in mid-May 2011. It was meant as a temporary solution to provide some scratch space until the file system /lustre was fully operational again.
Therefore, please copy the data you still need to another place during the next weeks.
Please note also that tomorrow and the day after tomorrow (April 20th/21st) there is a maintenance on the central storage in Karlsruhe. Therefore, we recommend waiting until this work is finished before you transfer files to Karlsruhe.
On the /lustre file system we have set up quotas of 1.8 / 2 TB (soft/hard) per user.
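You can check your usage against these quotas in the same way as described for the intermediate file system /lustre2 further down this page, e.g. within an interactive job:
lfs quota -u $(whoami) /lustre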
System upgrade on bwGRiD Stuttgart (2011-04-16)
in the evening of April 15 we finished our system upgrade on the bwGRiD cluster at HLRS. We also updated the operating system (now SL5.5, including the latest patches) and the batch system (Torque).
Our next step is to make the file system /lustre the default workspace filesystem, as announced in the previous mailing. We are also preparing to install new software (including new builds of MPI) in /opt/bwgrid and to provide the appropriate modules, so that we hope soon to reach a new common standard software environment across all sites of the bwGRiD.
bwGRiD Cluster Maintenance finished (2011-03-29)
we have finished the maintenance on bwGRiD Stuttgart. As announced, we can again provide our main storage system, which is mounted on /lustre, as disk space for workspaces. The workspaces there are allocated with
ws_allocate -F lustre workspacename duration
with duration being the number of days the workspace is valid. On our smaller (and slower) system the workspaces are accessed via
ws_list -F lustre2
You should use the file system lustre now; it is considerably faster than the intermediate solution we have been working with during the last months. So, please change your scripts such that you specify -F lustre when you allocate new workspaces, and -F lustre2 to access the data of the last months.
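For example (the workspace name "myrun" is only an illustration), a workspace on the fast file system valid for 30 days is created with:
ws_allocate -F lustre myrun 30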
In the medium term we are planning to switch the default to lustre instead of lustre2, so that later on you can omit -F lustre again. Furthermore, we are planning to make /lustre2 readonly first and to switch it off in the long term.
Therefore, we strongly encourage all users to move over to the faster and larger file system /lustre in the next weeks.
I would also like to remind you that the deprecated modules are not loaded automatically anymore. If you need one of them, first load the module "system/deprecated" and then the one you need from there.
bwGRiD Cluster Maintenance on Tuesday, March 29, 2011 (2011-03-25)
we are planning a maintenance on bwGRiD Stuttgart for
Tuesday, March 29, 2011
As you might have noticed, some nodes are unavailable due to hardware failures. We are going to fix this, and we are going to bring our main storage back into production. The workspaces there will be allocated with
ws_allocate -F lustre workspacename duration
with duration being the number of days the workspace persists. The workspaces on our smaller (and slower) system are allocated with
ws_allocate -F lustre2 workspacename duration
The default setting is lustre2 at the moment, and we are going to switch the default to lustre in the near future. To make this transition smooth, you should specify the filesystem name explicitly, to ensure that your jobs look for the workspace on the correct filesystem.
Those who use the deprecated modules, which are available by default at the moment, will have to load the module "system/deprecated" in the future. This change was announced more than half a year ago, and probably nobody is affected, since the deprecated modules are old or broken versions anyway.
bwGRiD Cluster available again (2011-02-16)
Our grid cluster is back in production. We had some minor damage to the file system of the workspaces due to the power interruption. Probably only those files were affected which were open in jobs running at the moment the power failed. The jobs in the queue survived. If your jobs have interdependencies, please check whether the transition from before to after the interruption went well.
power interruption on the campus of the University of Stuttgart (2011-02-15)
Due to a power interruption at parts of the University of Stuttgart our computers at HLRS are currently not available.
old workspaces on /lustre will be deleted on bwGRiD Stuttgart (2011-01-17)
As I have announced in my email on December 21, 2010, we are planning to set up a new file system on /lustre.
The workspaces located there (these are the ones created until September 2010) will be deleted. We have repaired the file system and given you the opportunity to save the data stored there.
However, in case someone has missed the announcement or could not yet save the data, please let us know by January 21, 2011 at the latest.
filesystem /lustre (2010-12-21)
we have now finished the repair of the filesystem /lustre on bwGRiD Stuttgart. There are some workspace directories for which the database entry was lost, so if you miss a workspace which existed before, have a look at the directory /lustre/ws2/ws - maybe the content of the workspace is still there and only the database entry is missing.
Please save your data - as far as you still need it - until January 17, 2011.
We are planning a system upgrade on the lustre servers and a completely new setup of the filesystem. We hope that this upgrade will make the servers more stable in the future. Some optimizations we are planning require a rebuild of the filesystem, so the data will be erased. Therefore, we are giving you some weeks to save the data. We know that some of you might already be on vacation; we hope that a time frame of about four weeks is enough.
From the log files created during the file system repair we can see that we were able to restore 94.5% of the objects (files and directories) stored in the file system. However, since it was a lot of work and some data was lost, we consider it worthwhile to do the upgrade before using the file system again for production runs. Therefore, /lustre is still mounted in readonly mode, and only on the frontends.
We wish you a Merry Christmas and a Happy New Year.
old workspaces available again (2010-12-13)
After a long and complicated recovery procedure we were able to make the old workspaces on bwGRiD Stuttgart available again. The filesystem "/lustre" is now mounted on the frontend nodes in readonly mode to let you copy your data.
The command
ws_list -F lustre
shows them. If you omit the file system name (-F lustre), or if you explicitly specify -F lustre2, the command will list the workspaces on /lustre2, which is currently in use.
Important note: The procedure of repairing the file system is not yet completely finished, so you might see files whose content is not available anymore. We are going to continue repairing this, but since we know that some users urgently need their data, we are making the file system available already, now that we can finally access the files.
Unless you still need to postprocess the data, please do not simply copy huge amounts of data from /lustre over to /lustre2. If you want to make a backup and do not have the possibility to copy the data to your local computer, please log in to gridway via gsissh and copy the data to the central storage of the bwGRiD (see the man page: man bw-grid).
bwGRiD Cluster in production (2010-11-16)
Since we don't see any problems at the moment, I consider the testing stage of the past days successful and announce production state for bwGRiD now.
The file system currently used for the workspaces is much smaller than the one we normally had in production. Therefore, we have put quotas on it (250GB soft and 500GB hard quota) so that all users have the chance to store some results. On the nodes (e.g. within an interactive job with "qsub -I") you can check your quota usage on /lustre2 with the following command line:
lfs quota -u $(whoami) /lustre2
Note that this does not work on gridway. The lustre quotas are different from the nfs quotas applied to the home directories.
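To check the NFS quota on your home directory, the standard quota tool can be used instead (a sketch, assuming the usual Linux quota client is available where your home directory is mounted):
quota -s   # -s prints the sizes in human-readable units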
bwGRiD cluster working again (2010-11-15)
Since Friday (2010-11-12) the bwGRiD cluster has been working again. Not all of the errors we have seen are completely investigated yet, so we classify the current state as "Testing Mode"; but as we see now, the cluster has been running stably over the weekend.
As an intermediate storage solution, we have newly set up a lustre file system for the workspaces, with quotas (250GB soft and 500GB hard), until we have finished the recovery of the main workspace file system.
emergency shutdown of all systems in the computer room Nobelstr (2010-11-08)
Due to an incident in the computer room, Nobelstr. 19, our computers (SX-8, SX-9, Nehalem cluster, bw-Grid) are currently unavailable.
We expect the machines to be available again from Wednesday, November 10, 2010.
Additional information: bwGRiD is not yet operating. The frontends are up, but since central services are still missing, login is not possible at the moment. We are waiting for support to help us.
maintenance on bwGRiD Stuttgart extended (2010-10-20)
we are having some trouble with our intermediate storage system. Job scheduling is stopped at the moment. We hope that we can bring up one of the lustre file systems soon and have to ask for your patience.
bwGRiD Stuttgart back in testing operation (2010-10-15)
Our bwGRiD cluster is up again for urgent compute jobs.
We are still not able to access the old data of the workspaces. We are working on this issue and will give you access to the data as soon as it is technically possible; we will inform you as soon as we see any progress. At the moment we are waiting for an exact action plan from the lustre development team on how to proceed.
Meanwhile we have set up a small storage system (also based on the parallel lustre file system) which now provides workspaces. This storage system has 7TB in total, which is about a tenth of the size of our main lustre system. We have less experience with this storage system, and it has less built-in redundancy.
Therefore: the data you now store in newly created workspaces is even less safe - so please make backups on your own!
With this workaround we have just restarted production, so that you can use the cluster for urgent computations. We hope that the intermediate storage system can cope with the load of your compute jobs.
Login via gsissh to gridway is working again, as is copying data via gsiscp. However, submitting Globus jobs is not yet possible; we still have to work on this. Fixing the file system issues had, and still has, higher priority so that the cluster can be used at all.
emergency maintenance - bwGRiD unavailable until the problems are fixed (2010-09-22)
After installing the new firmware recommended by the manufacturer of the storage system, we have seen problems with the file system. We are now working in close collaboration with the manufacturer, who is assisting us in getting this system back into production. We hope that we can preserve the data in the workspaces. During the maintenance the workspaces do not expire, and users will be informed via the mailing list dgridsysnews when the cluster is back in production. We apologize for the inconvenience caused.
system upgrade - cluster back to production (2010-09-21)
The bwGRiD cluster at HLRS is back to production.
Batch jobs are being executed, and login to the local frontends frbw.dgrid.hlrs.de and frwab.dgrid.hlrs.de is granted to our local users with their HLRS account. However, we could not yet finish the upgrade of the grid frontend gridway.dgrid.hlrs.de; it will be available as soon as possible. Until you have the possibility to log in again, workspaces do not expire.
The operating system has been upgraded to Scientific Linux 5.5 and security patches have been applied. We also upgraded the firmware on the Infiniband network cards, the ethernet switches and the RAID controllers of the lustre disk system. Since the cluster had to be taken offline for security reasons anyway, we did all these updates on this occasion to avoid further downtimes.
The upgrade to Scientific Linux 5.5 is another step towards providing a mostly standardized production environment at all sites of the bwGRiD.
new software and modified configuration (2010-06-28)
We had some problems with individual nodes which were left in an ambiguous state after jobs had died (probably because they were using too much memory). We have changed the configuration so that this problem will not occur anymore. Unfortunately, we made some mistakes in last week's changes, so that from Thursday until Friday noon no X-window jobs could be executed, and from Friday afternoon until today (Monday) noon no jobs could be submitted from gridway. These problems have been fixed. Note that logging in on the cluster nodes is now only possible if one of your jobs is running on that particular node. For testing, use interactive jobs, which you can start with the command 'qsub -I'; this will log you in on the head node of the interactive job (see the example below). We still have to adapt the scripts vis_via_vgl.sh and vis_via_vnc.sh to the modified situation, which we will do in the near future, so please be patient on this point.
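For example, an interactive test job on one full node (bwGRiD nodes have 8 cores) could be requested like this; the resource list is only an illustration, see the QueuePolicies page for the exact limits:
qsub -I -l nodes=1:ppn=8,walltime=00:20:00   # one node, 8 cores, 20 minutes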
New software has been installed during the last weeks. For a list of available programs use the command 'module avail'. Note that some module files have been renamed to obtain a clearer naming convention: the programs are sorted according to the scheme 'category/program/version', e.g. 'compiler/gnu/4.3'. Some modules which did not follow this convention in the past were moved to /opt/system/deprecated, which is in the default module path at the moment. We are going to remove this path in the future. If you do not want to adapt your scripts to the new convention, you can add it to your module search path again by calling 'module load system/deprecated', as shown below.
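Put together, the transition might look like this (using the compiler module from the example above):
module avail                    # list the available programs
module load compiler/gnu/4.3    # new scheme: category/program/version
module load system/deprecated   # re-add the old module names to the search path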
The strict naming convention is another attempt to provide a uniform environment in the whole bwGRiD. Not all software is installed everywhere, but we aim to ensure that if a program is installed at a site, the commands to load its modules are the same everywhere. A list of which software is installed at which site can be found on the bwGRiD website at http://www.bw-grid.de/benutzerinformation/software/
man bw-grid (2010-05-06)
A man-page which provides important information about bwGRiD has been created. It is available on all nodes (including the frontend nodes frbw and gridway) by calling
man bw-grid
Extending Workspaces (2010-04-30)
A new feature has been added to the workspace tools: it is now possible to extend a workspace five times. After five extensions have been used up, it will no longer be possible to extend the lifetime of a workspace. For a description of the workspace tools see this documentation, as well as the sketch below.
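Presumably the extension is done with the ws_extend tool; the command name and argument order are an assumption here, following the ws_allocate syntax shown elsewhere on this page:
ws_extend myworkspace 30   # assumed syntax: extend the workspace "myworkspace" by 30 days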
Home file system (2010-01-29)
Due to timeouts on our file server which is responsible for the home file system, we have noticed that home sometimes becomes readonly. We are in contact with the manufacturer of the RAID system to investigate this problem. To write out your simulation data ***PLEASE*** use the workspace mechanism, which uses a high-performance parallel file system designed to handle heavy I/O loads from parallel jobs.
Lustre file system and infiniband (2009-12-17)
As some of you might have noticed, we have moved the scratch directory to the lustre file system by default. We had several hardware replacements in October, and since then the storage array has been working fine. In this week's maintenance we upgraded the firmware on our infiniband switches, which may also be noticeable as an increase in speed for parallel jobs or when writing to the scratch file system. We hope that with these activities during the last weeks we can provide a more performant system to our users.
New Globus frontend available (2009-10-30)
we have finished setting up our new grid frontend for the bwGRiD cluster. The former frontend gt4bw.dgrid.hlrs.de will remain operational only for a short time, until we know that the new one works fine. Some operations, e.g. creating new workspaces, are only possible on the new frontend. Please switch to gridway.dgrid.hlrs.de if you use Globus or gsissh. We have tried to provide an identical environment, so everything should work as before. If you notice any differences, please contact us, so that we can either fix the problem or update the documentation.
(Remark: gt4bw is switched off now)
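A login to the new frontend then looks, for example, like this (assuming the same gsissh port 2222 that is used for the cluster frontends in the entry at the top of this page):
gsissh -X -p 2222 gridway.dgrid.hlrs.de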
Optimization of the batch system (2009-06-01)
to increase the throughput on the bwGRiD cluster and to improve your access options, an optimization of the batch system parameters is required.
To ease program development, a test queue (max. 4 nodes, up to 20 minutes) has been offered so far. Effective immediately, jobs with at most 4 nodes/32 cores and a runtime of up to 4 hours are also started preferentially.
To improve throughput, a reduction of the previous runtime limit of 24 hours suggests itself. Moreover, we observe that most jobs request a runtime of 24 hours but are actually active for only about 8 hours. This discrepancy, which results from always requesting the maximum resources regardless of actual need, degrades the planning options of the batch system and thereby those of all users.
For these reasons, the requested (and where possible the actual) runtime of jobs should be reduced; on the other hand, the maximum runtime of 24 hours should remain available. To achieve this goal, all jobs with a runtime of up to 12 hours now receive a bonus on their start priority when they are handed to the batch system, and are therefore served preferentially over jobs with a runtime of 12 to 24 hours.
A complete overview of the batch system parameters can be read online:
https://wickie.hlrs.de/dgrid/index.php/QueuePolicies
Finally, a tip: it pays off for every user to monitor their program runs and to request runtimes according to the actual need instead of working with the maximum possible runtime for all jobs (see the examples below). Often nodes are available for a few hours which could be used for such shorter jobs. This gives you the opportunity to use considerably more compute power, and to get it faster.
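As an illustration (the script name job.sh is hypothetical), submissions that fall into the preferred classes could look like this:
qsub -l nodes=4:ppn=8,walltime=4:00:00 job.sh    # at most 4 nodes/32 cores and 4 hours: started preferentially
qsub -l nodes=8:ppn=8,walltime=12:00:00 job.sh   # runtime of at most 12 hours: receives the start-priority bonus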
New server certificate on gt4bw (2009-04-14)
Last week we renewed the server certificate for gt4bw, since the old one would have expired on April 11, 2009.
The new certificate is no longer issued by "C=DE, O=DFN-Verein, OU=DFN-PKI, CN=DFN-Verein Server CA Grid - G01" but by "C=DE, O=DFN-Verein, OU=DFN-PKI, CN=DFN-Verein PCA Grid - G01", because the DFN has set up a separate CA for the grid certificates. As a consequence, the chain of trust can probably no longer be found on your side.
You can find the corresponding certificate chain at https://pki.pca.dfn.de/grid-root-ca/pub/cacert/chain.txt
Please save this ASCII file under the name "1149214e.0" in your ".globus/certificates" folder. By default, this folder is located in your home directory.
Update: The procedure described above seems to be sufficient for gsi-sshTERM. Globus, however, additionally needs the signing_policy file.
It is contained, together with the new certificate, in the following archive: http://www.eugridpma.info/distribution/igtf/current/accredited/igtf-preinstalled-bundle-classic.tar.gz
Globus users should therefore unpack the entire archive into their ".globus/certificates" folder.
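A minimal sketch of the first step (for Globus users, the full bundle is unpacked exactly as in the entry at the top of this page):
cd $HOME/.globus/certificates
wget -O 1149214e.0 https://pki.pca.dfn.de/grid-root-ca/pub/cacert/chain.txt   # save the chain under the required name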