High Performance Computing (HPC)

High performance research computing at NJIT is implemented on compute clusters integrated with other computing infrastructure.

Clusters consist of racks of computers, called "nodes", linked to each other by a high-speed network internal to the cluster. The operation of the nodes is controlled by a master node, usually called the "head node". User data storage may be provided in several ways:

  • Disk on the head node

  • NFS (Network File System) - mounted disk on the SAN (Storage Area Network)

  • AFS disk on the SAN

  • Local disk on each node, used for temporary files ("scratch space") and AFS cache

  • Network disk, accessible from each node, used for temporary files

Generally, users' storage needs can be met by one of, or a combination of, these methods; a brief sketch of how a job might use node-local scratch space follows below.
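
As a purely hypothetical sketch of how a job might use node-local scratch space for temporary files, the C fragment below writes and then removes a file under a scratch directory. The TMPDIR environment variable and the fallback path /scratch are assumptions for illustration; the actual scratch location on any given cluster is site-specific.

    /* scratch_demo.c -- hypothetical sketch of using node-local scratch space */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Many schedulers export a per-job temporary directory via TMPDIR;
           fall back to an assumed /scratch path if it is not set. */
        const char *scratch = getenv("TMPDIR");
        if (scratch == NULL)
            scratch = "/scratch";

        char path[4096];
        snprintf(path, sizeof(path), "%s/myjob_tmp.dat", scratch);

        FILE *fp = fopen(path, "w");
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }
        fprintf(fp, "intermediate results go here\n");  /* temporary working data */
        fclose(fp);

        remove(path);  /* clean up scratch space before the job ends */
        return 0;
    }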

Clusters are designed to support parallel processing, using Message Passing Interface (MPI) software for rapid communication between nodes. However, clusters can also be used to run serial jobs on a number of nodes at once, with no communication between the nodes.
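
As a rough illustration of the message-passing model, the following is a minimal MPI program written in C. It is a sketch only: the compiler wrapper (commonly mpicc) and the way the executable is launched depend on the MPI implementation and job script conventions on a given cluster.

    /* hello_mpi.c -- minimal MPI example (sketch);
       compile with, e.g., mpicc -o hello_mpi hello_mpi.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank (ID) of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        printf("Hello from rank %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down the MPI runtime */
        return 0;
    }

On an SGE-managed cluster such as those described here, a program like this would normally be compiled on the head node and launched on the compute nodes through a job script submitted with SGE's qsub command; the parallel environment and number of slots to request are site-specific.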

The scheduling and running of jobs on the compute nodes is handled by a scheduler (Son of Grid Engine, or SGE) running on the head node. All jobs, whether parallel or serial, must be submitted by the user via SGE, which then runs them on the compute nodes. This means that users are not permitted to run compute-intensive jobs on the head node - i.e., the node that the user logs into to submit a job via SGE.

There are currently two clusters in operation:

IST also manages several non-cluster, shared-memory computation servers:

  • CRNP.arcs.njit.edu -- Restricted access

  • Gorgon.njit.edu -- Restricted access

  • Phi.njit.edu -- General access

Detailed specifications of all HPC systems are available here.

The operating system running on all clusters is Linux (Red Hat). Users who are unfamiliar with Linux/Unix can contact Academic and Research Computing Systems at ARCS@njit.edu for guidance.

Faculty and researchers who wish either to obtain access to the cluster Kong.njit.edu for themselves or their students, or to purchase nodes on Kong for their own dedicated use, should send requests to ARCS@njit.edu. Faculty/staff accounts are automatically removed without notification at termination of employment. All other types of accounts are automatically removed without notification upon separation or graduation from NJIT, or after one year of inactivity. Removed accounts can be reinstated by request of faculty/staff to ARCS@njit.edu. Files belonging to expired accounts are kept on backup for twelve months and can also be restored by request.

All NJIT clusters are managed by staff of IST Academic and Research Computing Systems. Access to any of these resources requires an AFS account.

High Performance Computing & Big Data Wiki

The HPC & BD Wiki contains detailed technical information for HPC and big data (BD) users.

Computational Grids vs. Computer Clusters

Computational grids enable the sharing, selection, and aggregation of a wide variety of geographically distributed computational resources (such as supercomputers, computer clusters, storage systems, data sources, instruments, and people) and present them as a single, unified resource for solving large-scale compute- and data-intensive applications.

NJIT is in the process of standardizing on groups ("clusters") of commodity computers as the vehicle for providing high performance computing services to researchers. A cluster consists of a large number of processors connected by very high-speed interconnects and controlled by specialized scheduling and resource management software. Clusters are designed for parallel processing. The compute nodes can act independently, or in parallel, to handle large-scale, computationally demanding tasks. The user interacts with the cluster via the scheduling and resource management software, which can perform the same functions for groups of clusters (or other computational elements). Scheduling and allocation of resources are determined by policies that depend on various factors, including individual ownership of nodes in a cluster.

The key difference between clusters and grids is in the way resources are managed. In a cluster, all resources are managed by a centralized resource manager, and the nodes work cooperatively as a single unified resource. In a grid, each node has its own resource manager, and the grid software locates, shares, and aggregates distributed computational resources and delivers them as a service. Grids are useful in allowing small computational resources to contribute to the solution of a task that is far too large for any one of those resources to handle. The idea is analogous to the electric power grid, where generators are distributed but users can draw electric power without needing to know the source of the energy or its location.

For further information on grids, visit the Grid Computing Information Centre.

See also: Tartan High Performance Computing Initiative.

 

Last Updated: March 18, 2016