Interview with Tushar Krishna - Communications in on-chip network, accelerators, and beyond

This blog post was written as part of an assignment for the GT CS 7001 course.

Introduction

Because recent machine learning (ML) models such as OpenAI’s GPT-3 and Google’s Gemini are so large, large-scale clusters of ML accelerators such as GPUs and TPUs are used more than ever. Collective communication libraries like NCCL and MSCCL provide fast data transfers between the devices in these distributed systems. They implement collective algorithms such as All-Reduce, Reduce-Scatter, and All-Gather. As these ML applications grow in importance, researchers like today’s interviewee are focusing on optimizing and deploying efficient inter-device networks.
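To make the collectives named above concrete, here is a minimal single-process sketch of what each one computes. It only simulates the data movement with Python lists; it is not how NCCL or MSCCL implement them, and the function names are chosen here for illustration. It assumes every device holds a buffer of the same length and that the length divides evenly by the number of devices.

```python
from typing import List

Vector = List[float]


def reduce_scatter(buffers: List[Vector]) -> List[Vector]:
    """Device r ends up with the element-wise sum of chunk r only."""
    n = len(buffers)
    size = len(buffers[0])
    # Assumption of this sketch: the buffer splits evenly across devices.
    assert size % n == 0, "buffer length must be divisible by device count"
    chunk = size // n
    result = []
    for rank in range(n):
        lo, hi = rank * chunk, (rank + 1) * chunk
        result.append([sum(buf[i] for buf in buffers) for i in range(lo, hi)])
    return result


def all_gather(chunks: List[Vector]) -> List[Vector]:
    """Every device ends up with the concatenation of all devices' chunks."""
    gathered = [x for chunk in chunks for x in chunk]
    return [list(gathered) for _ in chunks]


def all_reduce(buffers: List[Vector]) -> List[Vector]:
    """All-Reduce expressed as Reduce-Scatter followed by All-Gather."""
    return all_gather(reduce_scatter(buffers))


if __name__ == "__main__":
    device_buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two simulated devices
    print(reduce_scatter(device_buffers))  # each device holds one summed chunk
    print(all_reduce(device_buffers))      # each device holds the full sum
```

With two devices holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, Reduce-Scatter leaves device 0 with `[11, 22]` and device 1 with `[33, 44]`, and All-Reduce leaves both with `[11, 22, 33, 44]`.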

Today, we interviewed Tushar Krishna, an Associate Professor in the School of ECE and CS at Georgia Tech. Professor Krishna earned his PhD in EECS at MIT. In a short interview, we learned how he approaches research and briefly discussed his recent paper, TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning.

Professor Krishna’s Synergy Lab covers various topics in computer architecture, with one overarching theme: networks and communication. During his PhD, he worked on on-chip networks. To challenge himself, he later moved out of his “comfort zone” and now works on accelerator and GPU interconnection networks, specifically analyzing the communication patterns in ML workloads. Currently, he and his PhD students are working on distributed ML simulators like ASTRA-Sim, as well as FPGA-level optimization tools such as FEATHER.

Advice for early-year PhD students

At the beginning of the interview, we asked several questions about life in academia and sought advice for early-year PhD students. His first piece of advice was to be open to learning new things, to step outside your comfort zone (as he did when moving from on-chip networks to inter-device networks), and to analyze research papers critically. He added that when reading a paper, even one published at a top-tier conference, good PhD students shouldn’t just think, “Oh, this paper is interesting,” but instead, “This is good work, but it has these limitations and assumptions.” He also emphasized that, since computer architecture sits at the boundary between hardware and software, PhD students should familiarize themselves with topics ranging from low-level digital circuits to higher-level compilers, OS, and system design.

The second key point he emphasized was the ability to quickly grasp the big picture rather than getting too caught up in details. He shared an anecdote from his early PhD days, when he was presenting his on-chip network solution and showcasing its performance gains over prior work. One audience member asked him what the ideal performance gain would be for the problem. Although the question was high-level and didn’t delve into technical details, it was crucial: you need to understand the big picture before getting lost in intricate solutions and implementations, no matter how fancy they are. For example, if the ideal solution provides a 60% performance gain and the proposed solution shows a 50% improvement over the baseline, the contribution is significant. However, if there is still 100x room for improvement after the proposed solution, the research may not be as impactful. With limited time as a PhD student, he stressed the importance of learning to think critically about papers quickly and always considering the big picture of the research problem.
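The arithmetic behind that example can be sketched in a few lines. This is only an illustration: it assumes a "60% gain" means a 1.6x speedup over the baseline, and the 150x ideal in the second case is a hypothetical number picked to reproduce the "100x room for improvement" scenario. The function names are likewise made up for this sketch.

```python
def remaining_headroom(achieved_speedup: float, ideal_speedup: float) -> float:
    """How many times faster the ideal solution is than the proposed one."""
    if achieved_speedup <= 0 or ideal_speedup <= 0:
        raise ValueError("speedups must be positive")
    return ideal_speedup / achieved_speedup


def fraction_of_ideal_gain(achieved_speedup: float, ideal_speedup: float) -> float:
    """Share of the ideal gain over the baseline (speedup 1.0) that was captured."""
    if ideal_speedup <= 1.0:
        raise ValueError("ideal speedup must exceed the baseline of 1.0")
    return (achieved_speedup - 1.0) / (ideal_speedup - 1.0)


if __name__ == "__main__":
    # Case 1: ideal is a 60% gain (1.6x), proposed achieves 50% (1.5x).
    print(f"{fraction_of_ideal_gain(1.5, 1.6):.2f}")  # share of the ideal gain
    print(f"{remaining_headroom(1.5, 1.6):.2f}")      # little room left

    # Case 2 (hypothetical): same 1.5x result, but the ideal is 150x.
    print(f"{remaining_headroom(1.5, 150.0):.0f}")    # large room left
```

In the first case the proposed solution captures roughly five sixths of the achievable gain; in the second, the same 1.5x result leaves a 100x gap to the ideal.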

Summary of the TACOS paper

We then shifted the conversation to the contributions and limitations of the TACOS paper. TACOS is an automated synthesizer for deploying collective communication on arbitrary network topologies of accelerators such as GPUs or TPUs. Since collective communication is a key bottleneck in running most modern AI models, improving its efficiency enables faster and more scalable model training and deployment, with a broad impact across industries. By expanding the topology into a Time-Expanded Network (TEN) and applying a greedy algorithm, TACOS efficiently synthesizes collective algorithms for the given topology. TACOS achieves a 4.27x performance improvement over the best prior work, and its synthesis time scales quadratically with the number of devices. Since greedy algorithms do not guarantee optimal solutions, many would ask how close the solution from TACOS is to the optimum. In the interview, Professor Krishna added that the authors are trying to add experiments to the camera-ready version of the paper that compare the performance of TACOS with the ideal solution. He suggested that TACOS achieves 95% of the performance of the optimal solution. He additionally noted that William Won, the first author of the TACOS paper, is working to extend the research to address the limitations and modeling discrepancies in TACOS.
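To give a feel for greedy synthesis over a time-expanded network, here is a deliberately simplified sketch for All-Gather. It is not the TACOS algorithm or its code. It assumes every directed link carries exactly one chunk per time step, that device i starts with chunk i, and it breaks ties deterministically. The function name `greedy_all_gather` is hypothetical. Each loop iteration corresponds to one layer of the time-expanded network: a send on link (src, dst) at step t moves a chunk from src at time t to dst at time t + 1.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]            # directed link (src, dst)
Send = Tuple[int, int, int, int]  # (time_step, src, dst, chunk)


def greedy_all_gather(num_devices: int, links: List[Link]) -> List[Send]:
    """Greedily schedule chunk transfers until every device holds every chunk."""
    # holds[d] = chunks device d owns at the start of the current time step
    holds: Dict[int, Set[int]] = {d: {d} for d in range(num_devices)}
    schedule: List[Send] = []
    step = 0
    while any(len(h) < num_devices for h in holds.values()):
        # Chunks that will arrive at each device at the end of this step.
        arriving: Dict[int, Set[int]] = {d: set() for d in range(num_devices)}
        for src, dst in links:
            # Useful chunks: src has it, dst lacks it and is not receiving it.
            candidates = holds[src] - holds[dst] - arriving[dst]
            if candidates:
                chunk = min(candidates)  # deterministic tie-break (assumption)
                arriving[dst].add(chunk)
                schedule.append((step, src, dst, chunk))
        if not any(arriving.values()):
            raise ValueError("no progress possible; topology is not strongly connected")
        for d in range(num_devices):
            holds[d] |= arriving[d]
        step += 1
    return schedule


if __name__ == "__main__":
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]  # 4-device unidirectional ring
    for send in greedy_all_gather(4, ring):
        print(send)
```

On a 4-device unidirectional ring, this sketch finishes in 3 time steps with 12 transfers, which matches the familiar ring All-Gather. On irregular topologies a greedy choice like this may fall short of the optimum, which is the gap the comparison with the ideal solution is meant to quantify.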

Conclusion

With the rise of machine learning and the ever-increasing size of ML models, distributed computing is becoming more popular. Tushar Krishna continues to publish fascinating work on improving communication efficiency in distributed systems. Better communication enables ML models to scale more effectively, allowing more advanced AI technologies to be developed and widely deployed. These scalable AI models can power smarter virtual assistants and improve decision-making in autonomous systems such as self-driving cars.