Selected Publications

Traffic optimizations (TO, e.g. flow scheduling, load balancing) in datacenters are difficult online decision-making problems. Previously, they were handled with heuristics that rely on operators’ understanding of the workload and environment. Designing and implementing proper TO algorithms thus take at least weeks. Encouraged by recent successes in applying deep reinforcement learning (DRL) techniques to solve complex online control problems, we study whether DRL can be used for automatic TO without human intervention. To this end, we develop AuTO: an end-to-end automatic TO system that can collect network information, learn from past decisions, and perform actions to achieve operator-defined goals.
ACM SIGCOMM’18, 2018

Management tasks in datacenters are usually executed in-band with the data plane applications, making them susceptible to faults and failures in the data plane. In this paper, we introduce power line communication (PLC) to datacenters as an out-of-band management channel. We design PowerMan, a novel datacenter management network that can be readily built into existing datacenter power systems. With commercially available PLC devices, we implement a small 2-layer prototype with 12 servers. Using this real testbed, as well as large-scale simulations, we demonstrate the potential of PowerMan as a management network in terms of performance, reliability, and cost.
USENIX NSDI’18, 2018

Existing wired optical interconnects face a challenge of supporting wide-spread communications in production clusters. Initial proposals are constrained to only support hotspots between a small number of racks (e.g., 2 or 4) at a time, reconfigurable at milliseconds. Recent efforts on reducing optical circuit reconfiguration time from milliseconds to microseconds partially mitigate this problem by rapidly time-sharing optical circuits across more nodes, but are still limited by the total number of parallel circuits available simultaneously. In this paper, we seek an optical interconnect that can enable unconstrained communications within a computing cluster of thousands of servers. In particular, we present MegaSwitch, a multi-fiber ring optical fabric that exploits space division multiplexing across multiple fibers to deliver rearrangeably non-blocking communications to 30+ racks and 6000+ servers. We have implemented a 5-rack 40-server MegaSwitch prototype with real optical devices, and used testbed experiments as well as large-scale simulations to explore MegaSwitch’s architectural benefits and tradeoffs.
USENIX NSDI’17, 2017

Cloud applications generate a mix of flows with and without deadlines. Scheduling such mix-flows is a key challenge; our experiments show that trivially combining existing schemes for deadline/non-deadline flows is problematic. For example, prioritizing deadline flows hurts flow completion time (FCT) for non-deadline flows, with only minor improvement in deadline miss rate. We present Karuna, the first systematic solution for scheduling mix-flows.
ACM SIGCOMM’16, 2016

Recent Publications

(2018). AuTO: Scaling Deep Reinforcement Learning to Enable Datacenter-Scale Automatic Traffic Optimization. ACM SIGCOMM’18.

(2018). PowerMan: An Out-of-Band Management Network for Datacenters Using Power Line Communication. USENIX NSDI’18.

(2017). PIAS: Practical Information-Agnostic Flow Scheduling for Commodity Data Centers. IEEE/ACM ToN.

(2017). Enabling Wide-spread Communications on Optical Fabric with MegaSwitch. USENIX NSDI’17.

(2016). Enabling ECN over Generic Packet Scheduling. ACM CoNEXT’16.

(2016). Online flow size prediction for improved network routing. IEEE ICNP’16.

(2016). Scheduling Mix-flows in Commodity Datacenters with Karuna. ACM SIGCOMM’16.

(2016). CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark. ACM SIGCOMM’16.

(2016). Enabling ECN in Multi-Service Multi-Queue Data Centers. USENIX NSDI’16.

(2015). Fully programmable and scalable optical switching fabric for petabyte data center. OpEx.

Projects

RDMA over Converged Ethernet (RoCE) For Large-scale Deep Learning with Amber

With the rapid growth of model complexity and data volume, deep learning systems require more and more servers to perform parallel training. Currently, deep learning systems with multiple servers and multiple GPUs are usually implemented in a single cluster, which typically employs an InfiniBand fabric to support Remote Direct Memory Access (RDMA), so as to achieve high throughput and low latency for inter-server transmission. It is expected that, with ever-larger models and data, deep learning systems must scale to multiple network clusters, which necessitates a highly efficient inter-cluster networking stack with RDMA support. Since InfiniBand is only suited for small-scale clusters of fewer than thousands of servers, we believe RDMA over Converged Ethernet (RoCE) is a more appropriate networking technology for multi-cluster, datacenter-scale deep learning. Therefore, we endeavor to incorporate RoCE as the networking technology for deep learning systems such as TensorFlow and Tencent’s Amber.

Angel: Network-Accelerated Large-Scale Machine Learning

Angel is an in-house large-scale machine learning framework at Tencent. We cooperated with the Technology Engineering Group (TEG) and developed a network accelerator. Via algorithm-specific flow scheduling, we achieved a 70x reduction in job completion time compared to vanilla Apache Spark.

Chukonu: Application-Aware Networking

Datacenters exist because a standalone server or rack can no longer meet the requirements of modern applications: web search, ad recommendation, online commerce, machine learning, etc. Unlike traditional networks, datacenter networks enjoy high bandwidth, low latency, and minimal packet loss. These features, however, are not fully utilized today, because application developers are usually unfamiliar with the datacenter environment and/or the networking stack and its tuning. We aim to design a system that lets application developers access networking functions in datacenters and unlock the network's full potential.

Professional Experience

Senior Software Engineer at Tencent (Jun 2018 - Now)

Software Engineering Intern at Tencent (Jun 2015 - Jul 2017)

Certified Instructor at NVIDIA Deep Learning Institute (Oct 2017 - Now):

  • Deep Learning Demystified
  • Best Practices for Starting a Deep Learning Project
  • Applications of Deep Learning with Caffe, Theano and Torch
  • Image Classification with DIGITS
  • Object Detection with DIGITS
  • Image Segmentation with TensorFlow
  • Neural Network Deployment

Teaching Assistant at HKUST (2011 - 2018):

  • COMP 3511 Operating Systems
  • COMP 4621 Computer Communication Networks I
  • ELEC 2100 Signals and Systems
  • ELEC 2600 Probability and Random Processes in Engineering
  • ELEC 4120 Computer Communication Networks
  • ELEC 5350 Multimedia Networking

Awards & Honors

  • MSRA Ph.D. Fellowship (2016)

  • HKUST Postgraduate Studentship (2011 - 2018)

  • HKUST Research Travel Grant (2013)

  • Meritorious Winner of the Mathematical Contest in Modeling (2010)

  • The Commercial Radio 50th Anniversary Scholarships (2010)

  • HKUST Scholarship for Continuing UG Students (2007 - 2011)