In any distributed system, there are peaks and valleys in use. The ability to scale up when demand exceeds current capacity is critical. This is particularly true for cloud environments where changes to current resource acquisition immediately impact billing.
Auto-scaling in CycleCloud is designed to make it trivial to enable automatic scale up when load increases and, just as importantly, automatic scale down to conserve cost when load returns to lower levels (or even to zero).
For compute clusters (e.g. HPC clusters like GridEngine or HTCondor), auto-scaling is generally based on a combination of current queue depth and expected job attributes (runtime, memory usage, etc.). However, auto-scaling can be applied equally well to other cluster types. For example, a file system cluster such as Lustre might be auto-scaled to increase storage capacity, and a web-service cluster might be auto-scaled as the number of concurrent users increases.
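As a rough illustration of the queue-depth approach, the sketch below estimates how many additional nodes a compute cluster might request. It is not CycleCloud's algorithm: the `Job` type, the function name `nodes_to_add`, and the assumption that running nodes are fully busy are all introduced here for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Job:
    """A queued job. Both fields are hypothetical attributes for this sketch."""
    cores: int
    memory_gb: float


def nodes_to_add(queued_jobs, node_cores, node_memory_gb,
                 running_nodes, max_nodes):
    """Estimate the extra nodes needed to hold every queued job at once.

    Assumes identical nodes and that the nodes already running are fully
    busy, so all queued work needs new capacity.
    """
    if node_cores <= 0 or node_memory_gb <= 0:
        raise ValueError("node capacity must be positive")
    total_cores = sum(job.cores for job in queued_jobs)
    total_memory = sum(job.memory_gb for job in queued_jobs)
    # Size for whichever resource is the tighter constraint.
    wanted = max(math.ceil(total_cores / node_cores),
                 math.ceil(total_memory / node_memory_gb))
    # Never exceed the cluster's configured ceiling.
    headroom = max(max_nodes - running_nodes, 0)
    return min(wanted, headroom)
```

An empty queue yields zero, which is the case that lets a cluster scale all the way down.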
CycleCloud provides generic and pluggable interfaces to handle all of these cases using a common platform. For many of the cluster types provided with CycleCloud, a default auto-scaling plugin is provided as well. For user-defined clusters (or to override the default auto-scaling plugin), the auto-scale APIs may be used to define custom auto-scaling rules.
CycleCloud can auto-scale any or all of the NodeArrays defined for your cluster.
To enable auto-scale for your cluster, add “Autoscale=true” to your cluster definition.
[cluster htcondor]
...
    # Enable/disable autoscaling
    Autoscale = $Autoscale
By default, this will enable both auto-start and auto-stop for all auto-scale capable NodeArrays in the cluster. All standard CycleCloud execute nodearray roles (e.g. role[sge_execute_role], role[condor_execute_role], etc.) are capable of auto-scaling.
For autoscaling to work, the nodes in the cloud must have a route back to the CycleCloud machine. The easiest way to accomplish this is to install your CycleCloud instance in the cloud alongside the nodes it will be starting. Alternatively, if you are in a VPC environment, you can set up a route back to your machine, forward the CycleCloud port on your router to your machine, or use the IsReturnProxy feature.
Selectively Disabling Auto-Stop
For all CycleCloud autoscale plugins, auto-stop may be enabled or disabled for each NodeArray using the following configuration attribute:
[[nodearray execute]]
...
    [[[configuration]]]
    # Disable auto-stop for this nodearray
    cyclecloud.cluster.autoscale.stop_enabled = false
It is best practice for user-created autoscale plugins to honor this attribute as well.
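A user-created plugin might check the attribute along these lines. This is only a sketch: the nested-dictionary view of a nodearray's configuration and the helper name `auto_stop_enabled` are assumptions made for illustration, not part of the CycleCloud plugin API. Treating a missing attribute as enabled is also an assumption, chosen to match the default in which auto-stop is on for all capable NodeArrays.

```python
def auto_stop_enabled(configuration):
    """Return True unless auto-stop is explicitly disabled.

    `configuration` is a hypothetical nested dict mirroring the
    [[[configuration]]] section, for example:
    {"cyclecloud": {"cluster": {"autoscale": {"stop_enabled": False}}}}
    """
    node = configuration
    for key in ("cyclecloud", "cluster", "autoscale", "stop_enabled"):
        if not isinstance(node, dict) or key not in node:
            return True  # attribute absent: leave auto-stop enabled
        node = node[key]
    if isinstance(node, str):
        # Accept the textual form of the setting as well as a real boolean.
        return node.strip().lower() != "false"
    return bool(node)
```

A plugin would call this once per NodeArray and skip any scale-down decision for arrays where it returns `False`.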
Auto-Stop Idle Nodes
CycleCloud automatically terminates idle nodes, that is, nodes with no jobs running. Idle nodes fall into one of two categories: nodes that have never run a job, and nodes that have. The termination timer for each case is set with the following configuration attributes:
[[nodearray execute]]
...
    [[[configuration]]]
    # Set idle node termination timers for this nodearray
    cyclecloud.cluster.autoscale.idle_time_after_jobs = 60
    cyclecloud.cluster.autoscale.idle_time_before_jobs = 3600
This example sets the termination timer to one minute (60 seconds) for nodes that have run jobs, and to one hour (3600 seconds) for nodes that have been idle since they started. The default values are 300 for idle_time_after_jobs and 1800 for idle_time_before_jobs.
In AWS, instances are billed on an hourly cycle, so these timers are honored only if they have been exceeded and the next billing cycle is approaching.
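The interaction between the idle timers and hourly billing can be sketched as follows. This illustrates the rule described above and is not CycleCloud's implementation: the function name, its parameters, and the five-minute window used to decide that a billing cycle is "approaching" are all assumptions.

```python
DEFAULT_IDLE_AFTER_JOBS = 300    # seconds; default stated in the text
DEFAULT_IDLE_BEFORE_JOBS = 1800  # seconds; default stated in the text
BILLING_CYCLE = 3600             # hourly billing, in seconds


def should_terminate(idle_seconds, uptime_seconds, has_run_jobs,
                     idle_after_jobs=DEFAULT_IDLE_AFTER_JOBS,
                     idle_before_jobs=DEFAULT_IDLE_BEFORE_JOBS,
                     hourly_billing=False,
                     billing_window=300):
    """Decide whether an idle node should be shut down.

    `billing_window` is a hypothetical threshold: the node counts as
    "approaching" a billing cycle when fewer than this many seconds
    remain in its current billed hour.
    """
    # Pick the timer that matches the node's history.
    limit = idle_after_jobs if has_run_jobs else idle_before_jobs
    if idle_seconds <= limit:
        return False
    if not hourly_billing:
        return True
    # With hourly billing, keep the already-paid-for node until the
    # current hour is nearly used up.
    seconds_to_next_cycle = BILLING_CYCLE - (uptime_seconds % BILLING_CYCLE)
    return seconds_to_next_cycle <= billing_window
```

Holding an idle node until late in its billed hour costs nothing extra and leaves it available if new jobs arrive in the meantime.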
CycleCloud has built-in support for autoscaling Grid Engine and HTCondor clusters. The CycleCloud software automatically monitors the queues for the clusters it launches and starts and stops nodes as needed to complete the work in an optimal amount of time/cost. API and plugin information for auto-scaling can be found in the CycleCloud Administrator’s Guide.
The GridEngine and HTCondor scheduler integrations provide a simplified API for customizing the auto-scaling decisions using data collected by the built-in monitoring. Users should not need to use the full Auto-Scale Plugin API for any cluster type that provides built-in auto-scaling.