Mohamed Malek Naouach, Master’s candidate
David R. Cheriton School of Computer Science
Modern datacenters run diverse applications whose communication requirements vary in bandwidth and deadlines. Deadlines in particular drive web-search workloads, e.g., submitting a request to the Bing search engine or loading the Facebook home page. Serving such requests in a timely fashion depends on meeting the deadlines of the scatter/gather flows generated for each request. Current flow-schedulers, however, are unaware of deadlines: they start each flow as soon as it arrives, provided the resource is available.
In this thesis, we present Artemis, a workload-driven flow-scheduler at the end-hosts that uses reinforcement learning to learn how to schedule deadline flows so that they meet their deadlines. The flow-scheduling policy in Artemis is not hard-coded; it is computed in real time by a reinforcement-learning control loop. We model the flow-scheduling problem as a deep reinforcement learning task and solve it with an actor-critic architecture.
Flows in Artemis do not start as soon as they arrive. Instead, a source starts sending an application flow only after requesting and acquiring a token from the flow's destination node. The token-request, issued by the source node, exposes the flow's requirements to the destination. At the destination, the Artemis flow-scheduler is a decision-making agent that learns how to serve the pending token-requests based on their embedded requirements, using the deep reinforcement learning actor-critic model.
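The token exchange described above can be illustrated with a minimal Python sketch. All names here (`TokenRequest`, `Destination`, `request_token`, `grant_next`, `earliest_deadline`) are hypothetical and not taken from Artemis. The learned actor-critic agent is replaced by a pluggable `policy` callable, and the stand-in policy shown simply picks the earliest deadline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TokenRequest:
    """A source's request to send one flow (hypothetical field names)."""
    source: str
    size_kb: float
    deadline_ms: float


# A policy maps the list of pending requests to the index of the one to serve.
Policy = Callable[[List[TokenRequest]], int]


class Destination:
    """Destination-side scheduler: queues token-requests, grants one at a time."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy
        self.pending: List[TokenRequest] = []

    def request_token(self, req: TokenRequest) -> None:
        # The request exposes the flow's size and deadline to the destination.
        self.pending.append(req)

    def grant_next(self) -> Optional[TokenRequest]:
        # The source of the returned request may now start sending its flow.
        if not self.pending:
            return None
        return self.pending.pop(self.policy(self.pending))


def earliest_deadline(pending: List[TokenRequest]) -> int:
    """Stand-in policy (an assumption): in Artemis this choice is learned."""
    return min(range(len(pending)), key=lambda i: pending[i].deadline_ms)


dest = Destination(earliest_deadline)
dest.request_token(TokenRequest("h1", 250, 50))
dest.request_token(TokenRequest("h2", 350, 40))
first = dest.grant_next()
second = dest.grant_next()
```

Keeping the policy as a separate callable mirrors the idea that the scheduling decision is not hard-coded: a learned agent could be swapped in without changing the request/grant path.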
We use two variants of a 10-to-1 gather workload to demonstrate (1) Artemis's ability to learn on its own how to schedule deadline flows and (2) its effectiveness in meeting their deadlines. We compare Artemis against Earliest Deadline First (EDF) and two other rule-based flow-scheduling policies that, unlike EDF, are aware of both the sizes and the deadlines of the flows: Largest Size Deadline ratio First (LSDF) and Smallest Size Deadline ratio First (SSDF). LSDF schedules the arrived flows with the largest size-deadline ratio first, while SSDF does the inverse.
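As a rough illustration of the three rule-based baselines, the sketch below expresses each one as a sort order over waiting flows. The `Flow` type and its field names are hypothetical. Two assumptions are made that the text does not state: EDF orders by absolute deadline (arrival time plus relative deadline), and the size-deadline ratio is the flow size divided by its relative deadline.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Flow:
    name: str
    size_kb: float
    deadline_ms: float       # deadline relative to arrival (assumption)
    arrival_ms: float = 0.0  # arrival time of the flow (assumption)


def ratio(f: Flow) -> float:
    # Size-deadline ratio, assumed here to be size / relative deadline.
    return f.size_kb / f.deadline_ms


def edf(flows: List[Flow]) -> List[Flow]:
    # Earliest Deadline First: ignores sizes, orders by absolute deadline.
    return sorted(flows, key=lambda f: f.arrival_ms + f.deadline_ms)


def lsdf(flows: List[Flow]) -> List[Flow]:
    # Largest Size Deadline ratio First.
    return sorted(flows, key=ratio, reverse=True)


def ssdf(flows: List[Flow]) -> List[Flow]:
    # Smallest Size Deadline ratio First.
    return sorted(flows, key=ratio)


flows = [
    Flow("a", size_kb=250, deadline_ms=50, arrival_ms=0),
    Flow("b", size_kb=350, deadline_ms=40, arrival_ms=20),
]
edf_order = [f.name for f in edf(flows)]
lsdf_order = [f.name for f in lsdf(flows)]
ssdf_order = [f.name for f in ssdf(flows)]
```

With these two example flows, which use the (size, deadline) pairs from the workloads and made-up arrival times, EDF and SSDF serve `a` first while LSDF serves `b` first, since `b` has the larger ratio (8.75 versus 5).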
Our experimental results show that the Artemis flow-scheduler captures the structure of the gather workloads, maps the requirements of the arrived flows to the order in which they need to be served, and computes a flow-scheduling strategy accordingly. For the first workload, a 50-50 split of (350KB, 40ms) and (250KB, 50ms) flows, Artemis met 35.58\% more deadlines than EDF and 24.93\% more than SSDF, and performed marginally better than LSDF, with 4.42\% more. For the second workload, a 60-40 split of (350KB, 40ms) and (250KB, 50ms) flows, Artemis outperformed all three flow-schedulers, meeting 16.34\% more deadlines than the second-best scheduler, SSDF.