General instructions on use of the GPU accelerator cluster (QP)

Temporary accounts have been generated for all on-site attendees and for remote participants who requested access by July 28. On-site attendees will receive their account information on Aug. 18; account information will be emailed to eligible remote participants the week before the summer school. All accounts will be activated Monday, Aug. 18 and will expire at the conclusion of the summer school.

Summer school students will have access to NCSA's 16-node GPU cluster. The cluster is composed of 16 compute hosts on a private network, with a separate host (qp.ncsa.uiuc.edu) acting as the public access point and compiling location (CUDA SDK 1.1 is available there).
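
For example, CUDA programs are compiled on the login node with nvcc. This is an illustrative sketch only; the source and binary names are placeholders:

[gpuXYZ@qp ~]$ nvcc -o my_binary my_kernel.cu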

To access qp.ncsa.uiuc.edu, an SSH (secure shell) client is required. On-site students will also need laptops with wireless network capability. Each of the 16 compute hosts, qp01-qp16, is a dual-socket, dual-core 2.4 GHz Opteron with 8 GB of memory and 4 NVIDIA Quadro 5600 GPUs, each with 1.5 GB of memory. The login node (qp.ncsa.uiuc.edu) does not have any Quadro 5600s attached. CUDA documentation is available at http://www.nvidia.com/object/cuda_documentation.html.
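
A typical connection from a Unix-like machine looks like the following, where gpuXYZ stands for the account name you were given:

ssh gpuXYZ@qp.ncsa.uiuc.edu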

First Login

The first time you sign in to qp.ncsa.uiuc.edu via SSH, you will see an expired password prompt:

WARNING: Your password has expired.
You must change your password now and login again!
Changing password for user gpuXYZ.
Changing password for gpuXYZ
(current) UNIX password: ***** (Initial password provided to you)
New UNIX password: ***** (A new password that you select)
Retype new UNIX password: ***** (Repeat the new password you select)
passwd: all authentication tokens updated successfully.

You will see some SSH key pair generation output on your first login only. This is normal.

Note to remote users: If you try to log in before Aug. 18, you will be allowed to change/set your new password, but you will still be denied permission to log in. Please remember the password you set!

Node Access

Access to compute nodes is managed by the Torque batch system and the Moab scheduler. To submit a job to a node, use:

submitjob ./my_binary my_args

Note: The above submission has a maximum walltime of 3 hours. Walltime limits are needed to prevent stuck jobs from blocking other users from the limited GPU resources. For the same reason, users are limited to a single running job at a time.

Use qstat to check job status and qpeek -f [ Job Number ] to monitor a running job's stdout.

After a job completes, its stdout and stderr are delivered as files in the directory from which the job was submitted, named my_binary.oXYZ and my_binary.eXYZ respectively, where XYZ is the job number.
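
For instance, if a hypothetical job 123 ran ./my_binary, its output could be read like this (the names and job number are placeholders):

[gpuXYZ@qp ~]$ cat my_binary.o123 (job stdout)
[gpuXYZ@qp ~]$ cat my_binary.e123 (job stderr)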

This is what a sample submission would look like:

[xiaolong@qp release]$ submitjob lab2-matrixmul
4678.qp
Job submitted.
Use 'qstat' to view the job queue and state.
Use 'qstat -n [ Job Number ]' to view node job is running on.
Use 'qpeek -f [ Job Number ]' to monitor job stdout.
Use 'qdel [ Job Number ]' to cancel/delete job.
[xiaolong@qp release]$
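
Continuing this example, monitoring job 4678 might look roughly like the following; the exact qstat output format (including the queue name shown) may differ:

[xiaolong@qp release]$ qstat 4678
Job id           Name            User      Time Use S Queue
---------------- --------------- --------- -------- - -----
4678.qp          lab2-matrixmul  xiaolong  00:00:02 R batch

The S column shows the job state: Q for queued, R for running.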

Troubleshooting

If you are accessing the cluster over an overseas connection, you may experience connection timeouts that drop your session. If this happens, it is recommended that you edit your source files on your local workstation and use scp to move files between your workstation and QP.
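
As a sketch (file names are placeholders), copying a source file to QP and retrieving a result file back might look like:

scp my_kernel.cu gpuXYZ@qp.ncsa.uiuc.edu:~/
scp gpuXYZ@qp.ncsa.uiuc.edu:~/my_binary.oXYZ .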

If you do not have an SSH client, free clients are available. A popular client for Windows is PuTTY. A Java-based SSH client is also available.

More information on using QP will be available to participants during the summer school. Questions can be posted on the summer school discussion board.