318x Filetype PDF File size 0.98 MB Source: www.usenix.org
Building Scalable and Flexible Cluster Managers
Using Declarative Programming
Lalith Suresh, VMware; João Loff, IST (ULisboa) / INESC-ID; Faria Kalim, UIUC;
Sangeetha Abdu Jyothi, UC Irvine and VMware; Nina Narodytska, Leonid Ryzhyk,
Sahan Gamage, Brian Oki, Pranshu Jain, and Michael Gasch, VMware
https://www.usenix.org/conference/osdi20/presentation/suresh
This paper is included in the Proceedings of the
14th USENIX Symposium on Operating Systems
Design and Implementation
November 4–6, 2020
978-1-939133-19-9
Open access to the Proceedings of the
14th USENIX Symposium on Operating
Systems Design and Implementation
is sponsored by USENIX
Building Scalable and Flexible Cluster Managers Using Declarative Programming
1 2 3
Lalith Suresh, João Loff , Faria Kalim , Sangeetha Abdu Jyothi , Nina Narodytska, Leonid Ryzhyk,
Sahan Gamage, Brian Oki, Pranshu Jain, Michael Gasch
VMware,1IST(ULisboa)/INESC-ID,2UIUC,3UCIrvineandVMware
Abstract Despite the complexity of the largely similar algorith-
Cluster managers like Kubernetes and OpenStack are noto- mic problems involved, cluster managers in various con-
riously hard to develop, given that they routinely grapple with texts tackle the configuration problem using custom, system-
hard combinatorial optimization problems like load balanc- specific best-effort heuristics—an approach that often leads
ing, placement, scheduling, and configuration. Today, clus- to a software engineering dead-end (§2). As newtypesofpoli-
ter manager developers tackle these problems by developing cies are introduced, developers are overwhelmed by having
system-specific best effort heuristics, which achieve scalabil- to write code to solve arbitrary combinations of increasingly
ity by significantly sacrificing the cluster manager’s decision complex constraints. This is unsurprising given that most
quality, feature set, and extensibility over time. This is prov- cluster management problems involve NP-hard combinato-
ing untenable, as solutions for cluster management problems rial optimization that cannot be efficiently solved via naive
are routinely developed from scratch in the industry to solve heuristics. Besides the algorithmic complexity, the lack of
largely similar problems across different settings. separation between the cluster state, the constraints, and the
WeproposeDCM,aradicallydifferent architecture where constraint-solving algorithm leads to high code complexity
developers specify the cluster manager’s behavior declara- and maintainability challenges, and hinders re-use of clus-
tively, using SQL queries over cluster state stored in a rela- ter manager code across different settings (§2). In practice,
tional database. From the SQL specification, the DCM com- even at a large software vendor we find policy-level feature
piler synthesizes a program that, at runtime, can be invoked additions to cluster managers take months to develop.
to compute policy-compliant cluster management decisions
given the latest cluster state. Under the covers, the generated Ourcontribution This paper presents Declarative Cluster
program efficiently encodes the cluster state as an optimiza- Managers(DCM),aradicallydifferent approach to building
tion problem that can be solved using off-the-shelf solvers, cluster managers, wherein the implementation to compute
freeing developers from having to design ad-hoc heuristics. policy-compliant configurations is synthesized by a compiler
WeshowthatDCMsignificantlylowersthebarriertobuild- from a high-level specification.
ing scalable and extensible cluster managers. We validate our Specifically, developers using DCM maintain cluster state
claim by powering three production-grade systems with it: a in a relational database, and declaratively specify the con-
Kubernetes scheduler, a virtual machine management solu- straints that the cluster manager should enforce using SQL.
tion, and a distributed transactional datastore. Given this specification, DCM’s compiler synthesizes a pro-
gramthat, at runtime, can be invoked to pull the latest cluster
1 Introduction state from the database and compute a set of policy-compliant
changes to make to that state (e.g., compute optimal place-
Today’s data centers are powered by a variety of cluster man- ment decisions for newly launched virtual machines). The
agers like Kubernetes [10], DRS [47], Openstack [15], and generated program – an encoder – encodes the cluster state
OpenShift [14]. These systems configure large-scale clusters and constraints into an optimization model that is then solved
and allocate resources to jobs. Whether juggling containers, using a constraint solver.
virtual machines, micro-services, virtual network appliances, In doing so, DCM significantly lowers the barrier to build-
or serverless functions, these systems must enforce numerous ing cluster managers that achieve all three of scalability, high
cluster management policies. Some policies represent hard decision quality, and extensibility to add new features and
constraints, which must hold in any valid system configura- policies. In contrast, today’s cluster managers use custom
tion; e.g., “each container must obtain its minimal requested heuristics that heavily sacrifice both decision quality and ex-
amountofdiskspace”. Others are soft constraints, which re- tensibility to meet scalability goals (§2).
flect preferences and quality metrics; e.g., “prefer to scatter For scalability, our compiler generates implementations
replicas across as many racks as possible”. A cluster manager that construct highly efficient constraint solver encodings that
therefore solves a challenging combinatorial optimization scale to problem sizes in large clusters (e.g., 53% improved
problemoffindingconfigurationsthatsatisfy hard constraints p99 placement latency in a 500 node cluster over the heavily
while minimizing violations of soft constraints. optimized Kubernetes scheduler, §6.1).
USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 827
For high decision quality, the use of constraint solvers DCM Runtime
under-the-covers guarantees optimal solutions for the speci- Solver
fiedproblems,withthefreedomtorelaxthesolutionqualityif Constraints.sql 3. Encoder generates
needed(e.g.,4×betterloadbalancinginacommercialvirtual Schema.sql optimization model
machine resource manager, §6.2). and invokes solver 4. Solution 2. Encoder
Optimization fetches required
Forextensibility, DCM enforces a strict separation between Code model input data from
the a) cluster state, b) the modular and concise representation DCM generate Encoder database
of constraints in SQL, and c) the solving logic. This makes Compiler
it easy to add new constraints and non-trivial features (e.g., 1. User code invokes 5. Return new
makingaKubernetesscheduler place both pods and virtual generated code via runtime configuration
User code
machines in a custom Kubernetes distribution, §6.3). User code
Several research systems [46,53,57,78,92] propose to Figure 1: DCM architecture. Dotted lines show the compila-
use constraint solvers for cluster management tasks. These tion flow. Solid lines show runtime interactions between the
systems all involve a significant amount of effort from opti- DCMruntime,usercodeandtheclusterstate DB.
mizationexpertstohandcraftanencoderforspecificproblems
with simple, well-defined constraints – let alone encode the
full complexity and feature sets of production-grade cluster to power a commercial virtual machine management solution
managers (e.g., Kubernetes has 30 policies for driving ini- where we improved load balancing quality by 4×. Lastly, we
tial placement alone). Even simple encoders are challenging briefly discuss a distributed transactional datastore where we
to scale to large problem sizes and are not extensible even implemented several features with a few lines of SQL.
whentheydoscale(§8). In fact, for these reasons, constraint
solvers remain rarely used within production-grade cluster 2 Motivation
managers in the industry-at-large: none of the open-source
cluster managers use solvers and, anecdotally, nor do widely Our motivating concern is that ad-hoc solutions for cluster
used enterprise offerings in this space. managementproblemsare regularly built from scratch in the
Instead, with DCM, developers need not handcraft heuris- industry, due to the wide range of specialized data-center en-
tics nor solver encodings to tackle challenging cluster man- vironments and workloads that organizations have, for which
agement problems. off-the-shelf solutions do not suffice. Even beyond dedi-
Providing a capability like DCM is fraught with challenges. cated cluster managers like Kubernetes [10], OpenStack [15],
First, cluster managers operate in a variety of modes and and Nomad [50], similar capabilities are routinely embed-
timescales: from incrementally placing new workloads at mil- dedwithinenterprise-grade distributed systems like databases
lisecond timescales, to periodically performing global recon- andstorage systems: e.g., for policy-based configuration, data
figuration (like load balancing or descheduling); we design a replication, or load-balancing across machines, all of which
programming model that is flexible enough to accommodate are combinatorial optimization problems.
these various use cases within a single system (§3). Second, Today, developers handcraft heuristics to solve these clus-
constraint solvers are not a panacea and are notoriously hard ter management problems that incur significant engineering
to scale to large problem sizes. DCM’s compiler uses care- overhead. First, the heuristics are hard to scale to clusters with
fully designed optimization passes that bridge the wide chasm hundreds to thousands of nodes; they often require purpose-
between a high-level SQL specification of a cluster manage- built and inflexible pre-computing and caching optimizations
ment problem and an efficient, low-level representation of an to remain tractable [40,95]. Even then, the heuristics are chal-
optimization model – doing so leverages the strengths of the lenging to get right as developers have to account for arbitrary
constraint solver while avoiding its weaknesses (§4). combinations of constraints. Second, the heuristics sacrifice
Summaryofresults Wereportin-depthaboutourexperi- decision quality to scale (e.g., load balancing quality), which
ence building and extending a Kubernetes Scheduler using is not surprising given that combinatorial optimization prob-
DCM.WeimplementexistingpoliciesinKubernetesinunder lems cannot be solved efficiently via naive heuristics. Third,
20lines of SQL each. On a 500 node Kubernetes cluster on they lead to complex code that makes it hard to extend and
th evolve the cluster manager over time; it is not uncommon for
AWS,DCMimproves95 percentilepodplacementlatencies policy-level feature additions to take multiple months’ worth
byupto2×,is10×morelikelytofindfeasibleplacementsin of effort to deliver.
constrained scenarios, and correctly preempts pods 2× faster Weillustrate the above challenges using Kubernetes as a
than the baseline scheduler. We also report simulation results representative example.
with up to 10K node clusters and experiment with non-trivial
extensions to the scheduler, like placing both pods and VMs Kubernetes example TheKubernetes Scheduler is respon-
within a custom Kubernetes distribution. We also use DCM sible for assigning groups of containers, called pods, to cluster
828 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
Constraints Zone 1 Zone 1
Policy Description 1. Pod 1 and Pod Scheduler Node 1 Node 2 Node 1 Node 2
H1-4 Avoid nodes with resource overload, unavailability or errors 2 cannot be in Queue
H5 Resourcecapacityconstraints:podsscheduledonanodemustnotexceed the same zone
node’s CPU, memory, and disk capacity (anti-affinity) Pod
H6 Ensure network ports on host machine requested by pod are available 2. Pod 1 is affine 1X Pod Pod Pod
H7 Respect requests by a pod for specific nodes to node 1. 2 1 2
H8 If pod is already assigned to a node, do not reassign X
Low Priority Without cross-node preemption With cross-node preemption
H9 Ensure pod is in the same zone as its requested volumes (Pod 1 cannot be placed) (Lower prio. pod preempted)
H10-11 If a node has a ‘taint’ label, ensure pods on that node are configured to High Priority
tolerate those taints Figure 3: An anti-affinity constraint prevents Pod 1 and Pod
H12-13 Nodeaffinity/anti-affinity: pods are affine/anti-affine to nodes according
to configured labels 2frombeinginthesamezone,pod1isaffinetonode1,and
H14 Inter-pod affinity/anti-affinity: pods are affine/anti-affine to each other pod2hasalowerpriority than pod 1. Placing pod 1 on node
according to configured labels
H15 Pods of the same service must be in the same failure-domain 1 requires evicting pod 2 on node 2.
H16-20 Volumeconstraints specific to GCE, AWS, Azure.
S1 Spread pods from the same group across nodes
S2-5 Loadbalance pods according to CPU/Memory load on nodes
S6 Prefer nodes that have matching labels and nodes. For instance, Figure 3 shows a scenario where a
S7 Inter-pod affinity/anti-affinity by labels
S8 Prefer to not exceed node resource limits high priority pod (pod 1) can only be placed on node 1, but to
S9 Prefer nodes where container images are already available do so, the scheduler has to preempt a lower priority pod on
Figure 2: Policies from the baseline Kubernetes scheduler, node 2. Computing this rearrangement requires simultaneous
showing both hard (H) constraints and soft (S) constraints. reasoning about resource and affinity constraints spanning
multiple pods and nodes, which cannot be achieved in the
current architecture. Thus, although such global reconfigu-
nodes (physical or virtual machines). Each pod has a number ration is in high demand among users, it is unsupported in
of user-supplied attributes describing its resource demand Kubernetes [60,64].
(CPU, memory, storage, and custom resources) and place-
ment preferences (the pod’s affinity or anti-affinity to other Extensibility: Best-effort scheduling leads to complex code
pods or nodes). These attributes represent hard constraints Similar to Borg [40,95], Kubernetes needs careful engineer-
that must be satisfied for the pod to be placed on a node (H1– ing to keep scheduling tractable at scale. Several policies like
H20 in Table 2). Kubernetes also supports soft versions of inter-pod affinity (Table 2-H14) and service affinities (Table 2-
placement constraints, with a violation cost assigned to each H15)arecomputeintensive because they require reasoning
constraint (S1–S9 in Table 2). Like other task-by-task sched- over groups of pods. These policies are kept tractable using
ulers [15, 94, 95], the Kubernetes default scheduler uses a carefully designed caching and pre-computing optimizations
greedy, best-effort heuristic to place one task (pod) at a time, that are fragile in the face of evolving requirements. For exam-
drawn from a queue. For each pod, the scheduler tries to find ple,it is hard to extend inter-pod affinity policies to specify the
feasible nodes according to the hard constraints, score them numberofpodspernode[58,59,61–63],andtherearediscus-
according to the soft constraints, and pick the best-scored sions among developers to restrict these policies to make the
node. Feasibility and scoring are parallelized for speed. code efficient [60]. For similar reasons, there are discussions
amongdevelopers to remove the service affinity policy due to
Decision quality: not guaranteed to find feasible, let alone accumulating technical debt around its pre-computing opti-
optimal, placements Podscheduling is a variant of the mul- mizations [69]. Such complexity accumulates to make entire
tidimensional bin packing problem [18,21], which is NP- classes of policies requested by users difficult to implement
hard and cannot, in the general case, be solved efficiently in the scheduler [60,64,73].
with greedy algorithms. This is especially the case when the Beyond policy-level extensions, the tight coupling be-
scheduling problem is tight due to workload consolidation tween the cluster state representation in the scheduler’s data-
and users increasingly relying on affinity constraints for per- structures and the scheduling logic makes it near impossible
formance and availability. to introduce changes to the underlying abstractions (e.g., ex-
Toremainperformant, the Kubernetes scheduler only con- tending the scheduler to also place tasks other than pods, like
siders a randomsubsetofnodeswhenschedulingapod,which virtual machines [71]) without a complete rewrite [66].
might miss feasible nodes [93]. Furthermore, the scheduler
maycommittoplacingapodanddenyfeasiblechoices from
pods that are already pending in the scheduler’s queue (a 3 Declarative Programming with DCM
commonweaknessamongtask-by-taskschedulers [40]).
Our position is that developers should specify cluster man-
Featurelimitations:best-effortschedulingdoesnotsupport agement policies using a high-level declarative language, and
global reconfiguration Manyscenarios require the sched- let an automated tool like DCM generate the logic that effi-
uler to simultaneously reconfigure arbitrary groups of pods ciently computes policy-compliant decisions. Architecturally,
USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 829
no reviews yet
Please Login to review.