Spark Clusters

Spark clusters are used for running distributed Spark jobs over multiple machines.

Because Spark is a distributed compute framework, it requires configuration for both a driver instance type and an executor instance type. The overall cluster parameters are the following (a brief sketch follows the list):

  • Name: User-friendly name of the cluster
  • Description: A longer text field for describing the purpose of the cluster
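
As a rough sketch, assuming a JSON representation with hypothetical field names (Kaspian's actual schema may differ), the cluster-level parameters might look like:

    {
      "name": "nightly-etl",
      "description": "Autoscaling Spark cluster for the nightly ETL pipeline"
    }

The driver and executor templates described below are configured alongside these fields.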

The driver template defines the resources for a single instance (an example template follows the list):

  • Cores: The number of CPU cores to allocate to the instance. Kaspian uses the same representation as Kubernetes: a positive integer, a fractional value, or a string with units included.
  • Memory: The amount of system RAM to allocate to the instance. Kaspian uses the same representation as Kubernetes: a positive integer, a fractional value, or a string with units included. Note that raw integers indicate bytes.
  • Disk: The amount of persistent data storage attached to the instance (e.g., an EBS volume). This uses the same units as memory configuration in Kubernetes: a positive integer, a fractional value, or a string with units included. Note that raw integers indicate bytes.
  • Use Spot Instances: Enabling this toggle ensures that the job is only scheduled on Spot Instances, which come at a lower cost but may be interrupted. Note that if the driver instance is interrupted, the entire job will terminate. There may also be a delay in starting the job if no Spot Instance capacity is available from the cloud provider.
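
For illustration, a driver template might look like the following; the field names are hypothetical, but the values use the Kubernetes resource notation described above:

    {
      "cores": 2,
      "memory": "8Gi",
      "disk": "100Gi",
      "useSpotInstances": false
    }

Cores also accepts fractional values such as 1.5 or unit strings such as "500m" (half a core), while memory and disk accept raw integers such as 8589934592 (8 GiB in bytes).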

The executor template defines the resources for multiple copies of the same instance type. Spark clusters support autoscaling executors. An example template follows the list.

  • Minimum Number of Executors: The minimum number of executors that Spark will provision. Setting this equal to the maximum number of executors will disable autoscaling.
  • Maximum Number of Executors: The maximum number of executors that Spark will provision. Setting this equal to the minimum number of executors will disable autoscaling.
  • Cores: The number of CPU cores to allocate to each instance. Kaspian uses the same representation as Kubernetes: a positive integer, a fractional value, or a string with units included.
  • Memory: The amount of system RAM to allocate to each instance. Kaspian uses the same representation as Kubernetes: a positive integer, a fractional value, or a string with units included. Note that raw integers indicate bytes.
  • Disk: The amount of persistent data storage attached to each instance (e.g., an EBS volume). This uses the same units as memory configuration in Kubernetes: a positive integer, a fractional value, or a string with units included. Note that raw integers indicate bytes.
  • Use Spot Instances: Enabling this toggle ensures that the job is only scheduled on Spot Instances, which come at a lower cost but may be interrupted. When executor instances running on Spot Instances are interrupted, new ones will be initialized to replace them; the job will continue to run, albeit potentially more slowly.
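
Similarly, a hypothetical executor template that autoscales between 2 and 10 executors might look like:

    {
      "minExecutors": 2,
      "maxExecutors": 10,
      "cores": 4,
      "memory": "16Gi",
      "disk": "200Gi",
      "useSpotInstances": true
    }

Setting minExecutors and maxExecutors to the same value pins the cluster at a fixed size, which disables autoscaling.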

Spark clusters also support configuring SparkConf options, since many of the exposed options are relevant to cluster-level configuration. The conf parameters must be provided as a JSON object.
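
For example, the following is a minimal conf object using standard SparkConf keys; which options take effect at the cluster level depends on the deployment:

    {
      "spark.sql.shuffle.partitions": "400",
      "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
      "spark.speculation": "true"
    }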