Overview of SolrCloud Autoscaling

Autoscaling in Solr aims to provide good defaults so a SolrCloud cluster remains balanced and stable in the face of various cluster change events. This balance is achieved by satisfying a set of rules and sorting preferences to select the target of cluster management operations automatically on cluster events.

A simple example is automatically adding a replica for a SolrCloud collection when a node containing an existing replica goes down.

The goal of autoscaling in SolrCloud is to make cluster management easier, more automatic, and more intelligent. It aims to provide good defaults such that the cluster remains balanced and stable in the face of various events such as a node joining the cluster or leaving the cluster. This is achieved by satisfying a set of rules and sorting preferences that help Solr select the target of cluster management operations.

There are three distinct problems that this feature solves:

  • When to run cluster management tasks? For example, we might want to add a replica when an existing replica is no longer alive.

  • Which cluster management task to run? For example, do we add a new replica or should we move an existing one to a new node?

  • How do we run the cluster management tasks so the cluster remains balanced and stable?

Before we get into the details of how each of these problems are solved, let’s take a quick look at the easiest way to setup autoscaling for your cluster.

Quick Start: Automatically Adding Replicas

Say that we want to create a collection which always requires us to have three replicas available for each shard all the time. We can set the replicationFactor=3 while creating the collection, but what happens if a node containing one or more of the replicas either crashed or was shutdown for maintenance? In such a case, we’d like to create additional replicas to replace the ones that are no longer available to preserve the original number of replicas.

We have an easy way to enable this behavior without needing to understand the autoscaling features in depth. We can create a collection with such behavior by adding an additional parameter autoAddReplicas=true with the CREATE command of the Collection API. For example:

/admin/collections?action=CREATE&name=_name_of_collection_&numShards=1&replicationFactor=3&autoAddReplicas=true

A collection created with autoAddReplicas=true will be monitored by Solr such that if a node containing a replica of this collection goes down, Solr will add new replicas on other nodes after waiting for up to thirty seconds for the node to come back.

You can see the section Autoscaling Automatically Adding Replicas to learn more about how to enable or disable this feature as well as other details.

The selection of the node that will host the new replica is made according to the default cluster preferences that we will learn more about in the next sections.

Cluster Preferences

Cluster preferences, as the name suggests, apply to all cluster management operations regardless of which collection they affect.

A preference is a set of conditions that help Solr select nodes that either maximize or minimize given metrics. For example, a preference such as {minimize:cores} will help Solr select nodes such that the number of cores on each node is minimized. We write cluster preferences in a way that reduces the overall load on the system. You can add more than one preferences to break ties.

The default cluster preferences consist of the above example ({minimize:cores}) which is to minimize the number of cores on all nodes.

You can learn more about preferences in the Autoscaling Cluster Preferences section.

Cluster Policy

A cluster policy is a set of conditions that a node, shard, or collection must satisfy before it can be chosen as the target of a cluster management operation. These conditions are applied across the cluster regardless of the collection being managed. For example, the condition {"cores":"<10", "node":"#ANY"} means that any node must have less than 10 Solr cores in total, regardless of which collection they belong to.

There are many metrics on which the condition can be based, e.g., system load average, heap usage, free disk space, etc. The full list of supported metrics can be found in the section describing Autoscaling Policy Attributes.

When a node, shard, or collection does not satisfy the policy, we call it a violation. Solr ensures that cluster management operations minimize the number of violations. Cluster management operations are currently invoked manually. In the future, these cluster management operations may be invoked automatically in response to cluster events such as a node being added or lost.

Collection-Specific Policies

A collection may need conditions in addition to those specified in the cluster policy. In such cases, we can create named policies that can be used for specific collections. Firstly, we can use the set-policy API to create a new policy and then specify the policy=<policy_name> parameter to the CREATE command of the Collection API:

/admin/collections?action=CREATE&name=coll1&numShards=1&replicationFactor=2&policy=policy1

The above CREATE collection command will associate a policy named policy1 with the collection named coll1. Only a single policy may be associated with a collection.

Note that the collection-specific policy is applied in addition to the cluster policy, i.e., it is not an override but an augmentation. Therefore the collection will follow all conditions laid out in the cluster preferences, cluster policy, and the policy named policy1.

You can learn more about collection-specific policies in the section Defining Collection-Specific Policies.

Triggers

Now that we have an idea about how cluster management operations use policies and preferences help Solr keep the cluster balanced and stable, we can talk about when to invoke such operations.

Triggers are used to watch for events such as a node joining or leaving the cluster. When the event happens, the trigger executes a set of actions that compute and execute a plan, i.e., a set of operations to change the cluster so that the policy and preferences are respected.

The autoAddReplicas parameter passed with the CREATE Collection API command in the Quick Start section above automatically creates a trigger that watches for a node going away. When the trigger fires, it executes a set of actions that compute and execute a plan to move all replicas hosted by the lost node to new nodes in the cluster. The target nodes are chosen based on the policy and preferences.

You can learn more about Triggers in the section Autoscaling Triggers.

Trigger Actions

A trigger executes actions that tell Solr what to do in response to the trigger. Solr ships with two actions that are added to every trigger by default. The first is called the ComputePlanAction and the other is ExecutePlanAction. The former computes the cluster management operations necessary to stabilize the cluster and the latter executes them on the cluster.

You can learn more about Trigger Actions in the section Autoscaling Trigger Actions.

Listeners

An Autoscaling listener can be attached to a trigger. Solr calls the listener each time the trigger fires as well as before and after the actions performed by the trigger. Listeners are useful as a call back mechanism to perform tasks such as logging or informing external systems about events. For example, a listener is automatically added by Solr to each trigger to log details of the trigger fire and actions to the .system collection.

You can learn more about Listeners in the section Autoscaling Listeners.

Autoscaling APIs

The autoscaling APIs available at /admin/autoscaling can be used to read and modify each of the components discussed above.

You can learn more about these APIs in the section Autoscaling API.