Chen (陈) Li (力) works on topics at the intersection of AI and networking. He joined the Data Center Networking group at Tencent in June 2018. He is a Microsoft Research Asia Ph.D. Fellow and has published 10+ peer-reviewed papers in top journals and conferences. His network acceleration subsystems and scheduling algorithms have been deployed in Big Data systems at Huawei and Tencent.
Li received his Ph.D. in Computer Science and Engineering, and his M.Phil. and B.Eng. (with First Class Honors and a Minor in Mathematics) in Electronic and Computer Engineering, all from the Hong Kong University of Science and Technology, in 2018, 2013, and 2011, respectively. A full CV and references are available upon request.
Ph.D. in Computer Science & Engineering
HKUST (2013 - 2018)
M.Phil. in Electronic & Computer Engineering
HKUST (2011 - 2013)
B.Eng. in Electronic & Computer Engineering
HKUST (2007 - 2011)
With the rapid growth of model complexity and data volume, deep learning systems require more and more servers for parallel training. Today, multi-server, multi-GPU deep learning systems are usually deployed within a single cluster, which typically employs an InfiniBand fabric to support Remote Direct Memory Access (RDMA), achieving high throughput and low latency for inter-server transfers. We expect that, with ever-larger models and data, deep learning systems must scale to multiple network clusters, which calls for a highly efficient inter-cluster networking stack with RDMA support. Since InfiniBand is best suited to clusters of at most a few thousand servers, we believe RDMA over Converged Ethernet (RoCE) is a more appropriate networking technology for multi-cluster, datacenter-scale deep learning. We therefore work to incorporate RoCE as the networking technology for deep learning systems such as TensorFlow and Tencent's Amber.
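As a back-of-envelope illustration of why inter-server throughput dominates at scale (a generic sketch, not taken from any of the systems above), consider the standard ring all-reduce used for gradient synchronization in data-parallel training: each worker transmits 2(N-1)/N times the model size per iteration, so per-worker traffic approaches twice the model size as the cluster grows, regardless of N.

```python
def ring_allreduce_bytes_per_worker(model_bytes: int, num_workers: int) -> float:
    """Bytes each worker sends per iteration under the standard ring all-reduce.

    Each worker forwards (N - 1) chunks of size model_bytes / N in the
    reduce-scatter phase and another (N - 1) in the all-gather phase,
    for a total of 2 * (N - 1) / N * model_bytes.
    """
    return 2 * (num_workers - 1) / num_workers * model_bytes

# A 1 GB model on 128 workers: each worker ships ~1.98 GB per iteration,
# which would take over 1.5 s on a plain 10 Gb/s link -- motivating
# RDMA-class transports for inter-server (and inter-cluster) traffic.
traffic = ring_allreduce_bytes_per_worker(10**9, 128)
```

The takeaway is that communication volume per worker saturates near 2x the model size, so the network, not the worker count, becomes the bottleneck.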
Angel is an in-house large-scale machine learning framework at Tencent. In cooperation with Tencent's Technology Engineering Group (TEG), we developed a network accelerator for it. Via algorithm-specific flow scheduling, we achieved a 70x reduction in job completion time compared to vanilla Apache Spark.
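To give a flavor of why flow scheduling order matters (a toy single-link model, not Angel's actual algorithm), serving shorter flows first minimizes average flow completion time, which is the classic intuition behind schedulers of this kind:

```python
def avg_completion_time(flow_sizes, bandwidth=1.0):
    """Average flow completion time when flows share one link serially,
    served shortest-first (SJF). flow_sizes in bytes, bandwidth in bytes/s."""
    elapsed, total = 0.0, 0.0
    for size in sorted(flow_sizes):  # shortest-flow-first ordering
        elapsed += size / bandwidth  # this flow finishes at time `elapsed`
        total += elapsed
    return total / len(flow_sizes)

# Flows of size 3, 1, 2: shortest-first finishes them at t = 1, 3, 6,
# so the average completion time is 10/3, versus 13/3 in arrival order.
act = avg_completion_time([3, 1, 2])
```

In practice the scheduler must also exploit algorithm-specific structure (which flows a job actually waits on), but the ordering principle is the same.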
Datacenters exist because a standalone server or rack can no longer meet the requirements of modern applications: web search, ad recommendation, online commerce, machine learning, etc. Unlike traditional networks, data center networks enjoy high bandwidth, low latency, and minimal packet loss. These features, however, are not fully utilized today, because application developers are usually unfamiliar with the datacenter environment and/or the networking stack and its tuning. We aim to design a system that lets application developers access networking functions in datacenters and unlock their full potential.
Senior Software Engineer at Tencent (Jun 2018 - Now)
Software Engineering Intern at Tencent (Jun 2015 - Jul 2017)
Certified Instructor at NVIDIA Deep Learning Institute (Oct 2017 - Now)
Teaching Assistant at HKUST (2011 - 2018)
MSRA Ph.D Fellowship (2016)
HKUST Postgraduates Studentship (2011 - 2018)
HKUST Research Travel Grant (2013)
Meritorious Winner of the Mathematical Contest in Modeling (2010)
The Commercial Radio 50th Anniversary Scholarships (2010)
HKUST Scholarship for Continuing UG Students (2007 - 2011)