Compute Canada Technical Glossary
Cluster: A group of interconnected compute nodes, managed by a resource manager, that acts like a single system.
Compute node: A computational unit of the cluster, one or more of which can be allocated to a job. A node has its own operating system image, one or more CPU cores and some memory (RAM). Jobs can use nodes in either an exclusive or a shared manner, depending on the system.
Core year: The equivalent of using 1 CPU core continuously for a full year. Using 12 cores for a month and using 365 cores for a single day are each equivalent to 1 core year. Compute Canada compute allocations are based on core year allocations.
Head or Login node: Typically, when you access a cluster system you are accessing a head node, also called a gateway or login node. A head node is set up to be the launching point for jobs running on the cluster. When you are told or asked to log in to or access a cluster system, you are invariably being directed to log in to the head node, which is often nothing more than a node configured to act as a middle point between the actual cluster and the outside network.
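The arithmetic behind the two examples can be checked with a short sketch. The function name `core_years` and the 365-day year are assumptions made for illustration; they are not part of any Compute Canada tool.

```python
DAYS_PER_YEAR = 365  # assumption: a non-leap year, matching the 365-core example


def core_years(cores: int, days: float) -> float:
    """Return the core years consumed by `cores` CPU cores running for `days` days."""
    return cores * days / DAYS_PER_YEAR


# 365 cores for a single day
one_day = core_years(365, 1)

# 12 cores for one month, taking a month as one twelfth of a year
one_month = core_years(12, DAYS_PER_YEAR / 12)

print(one_day, one_month)
```

Both values come out at one core year, up to floating-point rounding in the monthly case.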
Fair share allocation: Generally speaking, Compute Canada allocates its batch processing priority based on a fair-share algorithm. Each user is allocated a share of the total system resources, which effectively translates into priority access to the system. If you have used a large fraction of the system recently (i.e., larger than your fair share), your priority drops. However, the scheduling system calculates priority over a limited time window. After some time (e.g., weeks) of reduced usage, it gradually “forgets” that you overused in the past. This is designed to ensure full system usage and not to penalize users who take advantage of idle compute resources. A consequence is that your total allocation is not a limit on how many compute resources you can consume. Rather, your total allocation represents what you should be able to get over the course of the year if you submit a constant workload to the system and it is fully busy. In other words, once your “total allocation” is used, just keep working.
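The “forgetting” behaviour can be illustrated with a toy model. This is a sketch under stated assumptions, not the scheduler's actual algorithm: it assumes past usage is discounted with an exponential half-life and that priority falls as the discounted usage grows relative to the share. The names `decayed_usage` and `priority_factor`, the 14-day half-life, the share of 100 core-days and the formula 2^(-usage/share) are all illustrative choices.

```python
def decayed_usage(daily_usage, half_life_days=14.0):
    """Sum past usage, discounting each day by its age.

    daily_usage[0] is today, daily_usage[1] is yesterday, and so on.
    The half-life is a hypothetical value chosen for illustration.
    """
    return sum(u * 0.5 ** (age / half_life_days) for age, u in enumerate(daily_usage))


def priority_factor(usage, share):
    """Map usage relative to share onto (0, 1]; 1.0 means no remembered usage."""
    return 2.0 ** (-usage / share)


share = 100.0  # hypothetical fair share, in core-days

# Heavy use (50 core-days per day) during the last 10 days.
recent_burst = [50.0] * 10

# The same burst, but followed by 60 idle days (the idle days are the most recent).
old_burst = [0.0] * 60 + [50.0] * 10

p_recent = priority_factor(decayed_usage(recent_burst), share)
p_old = priority_factor(decayed_usage(old_burst), share)

print(f"priority after recent burst: {p_recent:.3f}")
print(f"priority after old burst:    {p_old:.3f}")
```

In this model the user whose burst lies two months in the past ends up with a much higher priority factor than the user who has just finished the same burst, which is the qualitative effect the definition describes.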
Job: A job is the basic execution object managed by the batch system. It is a collection of one or more related computing processes that is managed as a whole. Users define resource requirements for the job when they submit it to the batch system. A job description includes a resource request, such as the amount of required memory, the duration of the job, and how many compute cores this job will require. Jobs can be either serial (running on one compute core) or parallel (running on multiple compute cores).
Parallel job: A job that runs on multiple CPU cores. Parallel jobs can be roughly classified as threaded/SMP jobs running on a single compute node and sharing the same memory space, and distributed memory jobs that can run across several compute nodes.
Serial job: A job that requires one compute CPU core to run.
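As a rough illustration of the job, parallel job and serial job definitions, the sketch below models a resource request and applies the classification described above. The `JobRequest` class and its field names are hypothetical and do not correspond to any real batch system's syntax. The rule that a multi-core job on one node is a shared-memory job is a simplification.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Hypothetical resource request; field names are illustrative only."""

    cores: int              # total CPU cores requested
    nodes: int              # number of compute nodes requested
    memory_gb: float        # amount of required memory
    walltime_hours: float   # requested duration of the job

    def kind(self) -> str:
        """Classify the job using the rough categories from the glossary."""
        if self.cores == 1:
            return "serial"
        if self.nodes == 1:
            return "parallel (single node, shared memory)"
        return "parallel (multiple nodes, distributed memory)"


serial = JobRequest(cores=1, nodes=1, memory_gb=4, walltime_hours=2)
threaded = JobRequest(cores=8, nodes=1, memory_gb=16, walltime_hours=12)
distributed = JobRequest(cores=64, nodes=4, memory_gb=256, walltime_hours=24)

for job in (serial, threaded, distributed):
    print(job.cores, job.nodes, job.kind())
```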
Uneven usage: Most batch systems are tuned to deliver a certain number of core years over a fixed period of time, assuming relatively consistent usage of the system. Users may have very inconsistent workloads, with significant peaks and valleys in their usage. They therefore may need a “burst” of compute resources in order to use their RAC allocation effectively. Normally we expect allocations to be used in a relatively even way throughout the award period. If you anticipate having bursty workloads or variable usage, please indicate that in your RAC application form so that we can contact you and find ways to accommodate your requirements.
Memory per core: The amount of memory (RAM) per CPU core. If a compute node has 2 CPUs with 6 cores each (12 cores in total) and 24 GB (gigabytes) of installed RAM, then this compute node has 2 GB of memory per core.
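The memory-per-core example works out as follows. The function name `memory_per_core_gb` and its parameter names are assumptions made for this sketch.

```python
def memory_per_core_gb(cpus: int, cores_per_cpu: int, node_ram_gb: float) -> float:
    """Divide a node's installed RAM evenly across all of its CPU cores."""
    total_cores = cpus * cores_per_cpu
    return node_ram_gb / total_cores


# The node from the definition: 2 CPUs, 6 cores each, 24 GB of installed RAM.
example = memory_per_core_gb(cpus=2, cores_per_cpu=6, node_ram_gb=24)
print(example)
```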
Memory per node: The total amount of installed RAM in a compute node.
Deep storage: A catch-all term for persistent storage: tape-based backup and nearline, /project, and specialty storage such as Ceph, dCache and CVMFS. Basically, all non-temporary storage.
Disk: A disk, hard drive or solid-state drive is permanent storage (compared to a computer’s main memory or RAM) that holds programs, input files, output results, etc.
Filesystems: A directory structure made available for use by systems in a cluster. Each filesystem may have different performance characteristics, space available, and intended use. Some filesystems may be available to only head nodes in a cluster, while others may be shared with compute nodes for working storage during job execution. Filesystems typically available on clustered systems include:
Home: The home filesystem is commonly used for storage of users’ personal files, executable programs, job execution scripts, and input data. Each user has a folder in the home filesystem called a “home directory”. The home directory is persistent, smaller than scratch and, in most systems, backed up regularly. The home directory is visible to all nodes in a given cluster.
Local storage: This refers to the hard drive or solid-state drive in a compute node that can be used to temporarily store programs, input files, or their results. Files in local storage on one node cannot be accessed from any other node. Local storage may not be persistent, so files created there should be moved to non-local storage to avoid data loss.
Nearline: The nearline filesystem is made up of medium- to low-performance storage with very high capacity. This filesystem should be used for infrequently accessed data that needs to be kept for long periods of time. This is not true archival storage in that the datasets are still considered “active.” It is allocated through the RAC process.
Project: The project filesystem consists of medium-performance disk and is generally available to compute nodes on a clustered system. This filesystem offers more storage than a home directory and, in most systems, is backed up regularly. It is generally used to store frequently used project data and is allocated through the RAC process.
Scratch: This filesystem, available on compute nodes, is composed of high-performance storage used during computational jobs. Data should be copied to scratch, then removed from scratch once job execution is complete. Scratch storage is usually subject to periodic “cleaning” (or purging) according to local system policies and is not allocated.
Site: A member of one of Compute Canada’s regional consortia providing advanced research computing (ARC) resources (such as high-performance computing clusters, clouds, storage, and/or technical support).
Tape: Tape is a storage technology used to store long-term data that are infrequently accessed. It is considerably lower in cost than disk and is a viable option for many use cases.
Terabytes (TB): Terabytes are most often used to measure the storage capacity of large storage devices. One terabyte (abbreviated “TB”) is equal to 1,000 gigabytes and precedes the petabyte unit of measurement.
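The unit relationships can be written out as a small conversion sketch. It uses the decimal convention from the definition (1 TB = 1,000 GB) and assumes the same factor of 1,000 between terabytes and petabytes. The constant and function names are illustrative.

```python
GB_PER_TB = 1000  # from the definition: one terabyte is 1,000 gigabytes
TB_PER_PB = 1000  # assumption: decimal units, so one petabyte is 1,000 terabytes


def tb_to_gb(terabytes: float) -> float:
    """Convert terabytes to gigabytes."""
    return terabytes * GB_PER_TB


def tb_to_pb(terabytes: float) -> float:
    """Convert terabytes to petabytes."""
    return terabytes / TB_PER_PB


print(tb_to_gb(1))     # gigabytes in one terabyte
print(tb_to_pb(1000))  # petabytes in one thousand terabytes
```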
Compute Canada Cloud: A pool of hardware supporting virtualization. This can be thought of as Infrastructure as a Service (IaaS). There are currently 2 geographically separate clouds, West and East, with more coming on-line with GP2 and GP3.
Compute Cloud: These are instances that have a limited lifetime and typically have constant high-CPU requirements for the instance’s lifetime. They have also been referred to as ‘batch’ instances. These will be granted higher vCPU/Memory quotas since they are time-limited instances.
Cloud storage: Persistent cloud storage provides virtual disk functionality to virtual machines running in the cloud. Persistent cloud storage is very reliable and scalable, made possible by specialized software (Ceph) running on a highly-redundant physical disk array.
Floating IP: A public IP address that a project can associate with a VM so that the instance has the same public IP address each time that it boots. You create a pool of floating IP addresses and assign them to instances as they are launched, which keeps each instance’s IP address consistent so that its DNS assignment remains valid.
Instance: A running Virtual Machine (VM), or a VM in a known state such as suspended, that can be used like a hardware server.
Memory per core: See definition in the Memory section above.
Persistent Cloud: These are instances that are meant to run indefinitely (e.g., based on the cloud’s availability) and would include web servers, database servers, etc. In general, these are thought to be lower CPU or bursty CPU instances. These will have lower vCPU/Memory quotas since they are meant to consume the resources for long periods of time.
Testing Cloud: A small, time-limited resource quota that is automatically approved, does not require scientific or technical review and is meant to be used for testing and development purposes.
Service Portal: Compute Canada hosts many research web portals which serve datasets or tools to a broad research community. These portals generally do not require large computing or storage resources, but may require support effort by the Compute Canada technical team. Groups applying for a service portal often use the Compute Canada cloud, generally require a public IP address, and may (or may not) have more stringent up-time requirements than most research projects. This option is shown as “Portal” in the online form.
Total size of Volumes and Snapshots: The maximum amount of storage (GB) that can be used by your persistent, reliable block devices.
Virtual Machine (VM): See Instance above.
Volume: A persistent virtual disk that can be attached to a running VM. Backed by resilient hardware.
Volume Snapshot: A point-in-time copy of an OpenStack storage volume. Used for backups or as a base to instantiate (launch) other VMs.
Computational Resource Categories
CPU (pronounced as separate letters): The abbreviation for central processing unit. Sometimes referred to simply as the central processor, but more commonly called the processor, the CPU is the brains of the computer, where most calculations take place.
GPU: GPU computing is the use of a graphics processing unit (GPU) to accelerate applications such as deep learning, analytics, and engineering. GPU accelerators now power energy-efficient data centers in government labs, universities, enterprises, and small-and-medium businesses around the world. They play a huge role in accelerating applications in platforms ranging from artificial intelligence to cars, drones, and robots.
VCPU: Stands for virtual central processing unit. One or more VCPUs are assigned to every Virtual Machine (VM) within a cloud environment. Each VCPU is seen as a single physical CPU core by the VM’s operating system.