[ Software Research Lunch ]


Program for the fall quarter of 2023.


9/21: Legate Sparse: Distributed Sparse Computing in Python

Time: Thursday, September 21, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Rohan Yadav

Abstract: The sparse module of the popular SciPy Python library is widely used across applications in scientific computing, data analysis, and machine learning. The standard implementation of SciPy is restricted to a single CPU and cannot take advantage of modern distributed and accelerated computing resources. We introduce Legate Sparse, a system that transparently distributes and accelerates unmodified sparse-matrix-based SciPy programs across clusters of CPUs and GPUs, and composes with cuNumeric, a distributed NumPy library. Legate Sparse uses a combination of static and dynamic techniques to efficiently compose independently written sparse and dense array programming libraries, providing a unified Python interface for distributed sparse and dense array computations. We show that Legate Sparse is competitive with single-GPU libraries like CuPy and achieves 65% of the performance of PETSc on up to 1280 CPU cores and 192 GPUs of the Summit supercomputer, while offering the productivity benefits of idiomatic SciPy and NumPy.
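
For context, a minimal sketch of the kind of idiomatic SciPy program the abstract is about, using only stock SciPy; the claim is that such code runs unmodified under Legate Sparse (the solver and problem here are our own illustration, not from the paper):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # A 1D Poisson (tridiagonal) system in CSR format.
    n = 10_000
    A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    # Solve A x = b with conjugate gradient; info == 0 signals convergence.
    x, info = spla.cg(A, b, atol=1e-8)
    assert info == 0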



9/28: Generalized Hierarchical Collective Communication

Time: Thursday, September 28, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Mert Hidayetoglu

Abstract: Collective communication is essential in distributed high-performance computing, but optimizing collective operations across diverse network architectures is a challenging task. For this reason, the mainstream GPU-aware library implementations, such as MPI and NCCL, may perform collectives sub-optimally or be limited to a few functions optimized for the vendor's hardware. To improve performance portability, we introduce HiCCL, a novel library that decouples the design of a collective pattern from its optimization on a specific machine. HiCCL 1) enables users to compose collective patterns as a superposition of two kinds of generalized primitives, 2) factors the data movement for a specific network hierarchy, and 3) executes the resulting asynchronous data movement via the native point-to-point communication API of MPI, NCCL, or IPC. Our evaluation on five high-performance systems comprising Nvidia, AMD, and Intel GPUs with different network architectures shows that HiCCL achieves higher throughput than MPI and NCCL functions on Broadcast, Reduce, Gather, Scatter, All-Gather, Reduce-Scatter, All-Reduce, and All-to-All collectives, with a geometric-mean speedup of 5.66x over MPI on all systems and 1.16x over NCCL on Nvidia GPUs.
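
For context, a minimal sketch of the kind of decomposition the abstract describes, written with mpi4py (an assumption on our part; HiCCL is not a Python library and this is not its API): a broadcast factored into two simpler steps, a scatter followed by an all-gather, the classic bandwidth-optimal decomposition of that collective:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n = 1 << 20                          # message length, assumed divisible by size
    full = np.empty(n, dtype=np.float32)
    if rank == 0:
        full[:] = np.random.rand(n).astype(np.float32)

    # Step 1: scatter one shard of the root's message to each rank.
    shard = np.empty(n // size, dtype=np.float32)
    comm.Scatter(full if rank == 0 else None, shard, root=0)

    # Step 2: all-gather the shards so every rank holds the full message.
    comm.Allgather(shard, full)

Per the abstract, HiCCL generalizes this idea: the pattern is composed once from primitives, and the factoring across the network hierarchy is derived per machine.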



10/5: LeanDojo: Theorem Proving with Retrieval-Augmented Language Models

Time: Thursday, October 5, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Kaiyu Yang

Abstract: Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables programmatic interaction with the proof environment. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection, a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 96,962 theorems and proofs extracted from Lean's math library. It features a challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers built without any proprietary datasets, released under a permissive MIT license to facilitate further research.
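
For context, a toy Lean 4 example of what premise selection means in practice (our own illustration, not drawn from the LeanDojo benchmark): closing the goal hinges on retrieving the right lemma from the library, which is the step ReProver's retriever automates.

    -- Proving this reduces to finding the right premise (Nat.add_comm).
    theorem my_add_comm (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b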



10/12: Hardware-Aware Static Optimization of Hyperdimensional Computations

Time: Thursday, October 12, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Pu (Luke) Yi

Abstract: Binary spatter code (BSC)-based hyperdimensional computing (HDC) is a highly error-resilient approximate computational paradigm suited for error-prone, emerging hardware platforms. In BSC HDC, the basic datatype is a hypervector, a typically large binary vector, whose size has a significant impact on the fidelity and resource usage of the computation. Typically, the hypervector size is dynamically tuned to deliver the desired accuracy; this process is time-consuming and often produces hypervector sizes that lack accuracy guarantees and give poor results when reused for very similar workloads. We present Heim, a hardware-aware static analysis and optimization framework for BSC HD computations. Heim analytically derives the minimum hypervector size that meets the target accuracy requirement, thereby minimizing resource usage. Heim guarantees that the optimized computation converges to the user-provided accuracy target in expectation, even in the presence of hardware error. Heim deploys a novel static analysis procedure that unifies theoretical results from the neuroscience community to systematically optimize HD computations. We evaluate Heim against dynamic tuning-based optimization on 25 benchmark data structures. Given a 99% accuracy requirement, Heim-optimized computations achieve 99.2%-100.0% median accuracy, up to 49.5% higher than dynamic tuning-based optimization, while achieving 1.15x-7.14x reductions in hypervector size compared to HD computations with comparable query accuracy, and finding parametrizations 30.0x-100167.4x faster than dynamic tuning-based approaches. We also use Heim to systematically evaluate the performance benefits of analog CAMs and multiple-bit-per-cell ReRAM over conventional hardware at iso-accuracy; for both emerging technologies, we find usages where the emerging hardware imparts significant benefits.
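
For context, a minimal sketch of the standard BSC primitives the abstract builds on: binding as XOR, bundling as bitwise majority, similarity as the fraction of agreeing bits. The hypervector size D below is an arbitrary placeholder; choosing it soundly is precisely what Heim automates.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000  # hypervector size; Heim derives this analytically

    def random_hv():
        return rng.integers(0, 2, size=D, dtype=np.uint8)

    def bind(x, y):        # XOR associates two hypervectors
        return x ^ y

    def bundle(*hvs):      # bitwise majority superimposes hypervectors
        return (np.sum(hvs, axis=0) * 2 > len(hvs)).astype(np.uint8)

    def similarity(x, y):  # fraction of agreeing bits
        return 1.0 - np.count_nonzero(x != y) / D

    # Encode two key-value pairs into one record, then query one back.
    k1, v1, k2, v2 = (random_hv() for _ in range(4))
    record = bundle(bind(k1, v1), bind(k2, v2))
    noisy = bind(record, k1)  # unbinding recovers v1 up to noise
    assert similarity(noisy, v1) > similarity(noisy, v2)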



10/19: vLLM: High-Throughput and Memory-Efficient LLM Serving with PagedAttention

Time: Thursday, October 19, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Woosuk Kwon

Abstract: LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. To address this problem, we are developing vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes PagedAttention, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention achieves up to 24x higher throughput than Hugging Face Transformers, without requiring any model architecture changes. vLLM has been developed at UC Berkeley and deployed for Chatbot Arena and Vicuna Demo for the past 6 months. It is also being used by many companies. In this talk, we will discuss the motivation, features, and implementation of vLLM in depth.
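
For context, a toy sketch of the paging idea behind PagedAttention (an illustration of the concept only, not vLLM's implementation; all names here are made up): the KV cache is split into fixed-size physical blocks, and a per-sequence block table maps logical token positions to blocks, so sequences can grow without large contiguous allocations.

    BLOCK = 16  # tokens per physical block

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free = list(range(num_blocks))  # pool of physical blocks
            self.table = {}                      # seq_id -> list of block ids
            self.store = {}                      # (block id, offset) -> (key, value)

        def append(self, seq_id, pos, key, value):
            # Assumes tokens are appended in order (pos = 0, 1, 2, ...).
            blocks = self.table.setdefault(seq_id, [])
            if pos // BLOCK == len(blocks):      # crossed into a new block
                blocks.append(self.free.pop())   # allocate on demand
            self.store[(blocks[pos // BLOCK], pos % BLOCK)] = (key, value)

        def lookup(self, seq_id, pos):
            return self.store[(self.table[seq_id][pos // BLOCK], pos % BLOCK)]

Because blocks need not be contiguous, memory is allocated only as sequences grow and freed blocks can be reused across requests, which is where the throughput gains come from.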



10/26: Organizational Lunch

Time: Thursday, October 26, 2023, 12 noon - 1pm
Location: Gates 415

Organizational lunch. Come sign up to give a talk during the quarter.



11/2: Finch: Compiler Techniques for Versatile Sparse and Structured Data Processing

Time: Thursday, November 2, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Willow Ahrens

Abstract: From Fortran to NumPy, TensorFlow, and Halide, programming languages have evolved to handle dense arrays effectively. However, real-world data often exhibits underlying structure such as sparsity, run-length encoding, or symmetry. This talk discusses how programming languages can provide a versatile vocabulary to describe and analyze tensor data structures and operations, including their asymptotic costs and intermediate representations. We introduce Finch, a compiler that automatically adapts dense array programs to the structure of the data. Finch uses a language of basic loop building blocks called Looplets to hierarchically decompose structured sparsity and generate efficient code. Our approach enables new loop optimizations across multiple domains, unifying techniques from sparse tensor algebra, databases, and lossless compression.
Related links:
https://dl.acm.org/doi/10.1145/3579990.3580020
https://dl.acm.org/doi/10.1145/3519939.3523442
https://github.com/willow-ahrens/Finch.jl
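
For context, a hand-written Python illustration of the kind of specialization the abstract describes (not Finch output; Finch is a Julia compiler and derives such loops automatically): the dense dot product sum(a[i] * b[i] for i in range(n)) becomes a two-pointer merge over sorted coordinate lists, so work scales with the nonzeros rather than the dimension.

    def sparse_dot(a_idx, a_val, b_idx, b_val):
        # Merge two sorted coordinate lists, multiplying where they intersect.
        i = j = 0
        acc = 0.0
        while i < len(a_idx) and j < len(b_idx):
            if a_idx[i] == b_idx[j]:
                acc += a_val[i] * b_val[j]
                i += 1
                j += 1
            elif a_idx[i] < b_idx[j]:
                i += 1
            else:
                j += 1
        return acc

    # a = [0, 0, 3, 0, 5] and b = [1, 0, 4, 0, 0] in coordinate form:
    assert sparse_dot([2, 4], [3.0, 5.0], [0, 2], [1.0, 4.0]) == 12.0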



11/9: Automated Mapping of Task-Based Programs onto Distributed and Heterogeneous Machines

Time: Thursday, November 9, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Thiago Teixeira

Abstract: In a parallel and distributed application, a mapping is a selection of a processor for each computation or task and of memories for the data collections that each task accesses. Finding high-performance mappings is challenging, particularly on heterogeneous hardware with multiple choices for processors and memories. We show that fast mappings are sensitive to the machine, the application, and its input: porting to a new machine, modifying the application, or using a different input size may necessitate re-tuning the mapping to maintain the best possible performance. We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification, whereas hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41× faster than custom, hand-written mappers.
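
For context, a minimal sketch of coordinate-wise descent over a discrete mapping space; the option lists, cost model, and feasibility check below are made-up placeholders, not AutoMap's (AutoMap additionally constrains the search and evaluates candidates against real executions).

    def coordinate_descent(dims, cost, feasible, sweeps=3):
        # dims: one list of candidate choices per decision,
        # e.g. a processor kind for each task.
        point = [opts[0] for opts in dims]     # assume the first choices are feasible
        for _ in range(sweeps):
            for d, opts in enumerate(dims):    # improve one coordinate at a time
                for o in opts:
                    cand = point[:d] + [o] + point[d + 1:]
                    if feasible(cand) and cost(cand) < cost(point):
                        point = cand
        return point

    # Toy usage: place two tasks; GPUs are faster, co-location avoids movement.
    dims = [["cpu", "gpu"], ["cpu", "gpu"]]
    cost = lambda p: (sum(1.0 if x == "cpu" else 0.5 for x in p)
                      + (0.3 if p[0] != p[1] else 0.0))
    assert coordinate_descent(dims, cost, lambda p: True) == ["gpu", "gpu"]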



11/16: Helping Developers Use Type Systems for Memory Safety

Time: Thursday, November 16, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Will Crichton

Abstract: Programming languages are trending towards ever more complex type systems. This talk poses two questions in response: how can we help programmers keep up? And how can we use advances in type systems to help programmers beyond correctness? I will present my research on these questions in the context of Rust's ownership types, its key mechanism for memory safety without garbage collection. First, to help programmers understand ownership types, I designed a novel conceptual model of ownership based on the metaphor of permissions that can visually explain key ideas such as undefined behavior and incompleteness. A large-scale deployment of the model with thousands of Rust learners showed that it improves learning outcomes. Second, to help programmers use ownership beyond correctness, I developed a static program slicer for Rust. The cognitive motivation comes from a series of experiments that show how working memory limits a person's ability to read code. The technical insight is to use ownership types to modularize the slicing analysis in a sound and precise way.
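
For context, a toy backward slicer showing what a program slicer computes (purely illustrative: Crichton's slicer targets Rust and leans on ownership types to modularize the analysis, none of which this Python sketch models):

    def backward_slice(stmts, target):
        # stmts: (defined_var, used_vars) per statement, in program order.
        # Returns the indices of statements that can affect `target`.
        needed, keep = {target}, []
        for i, (defd, used) in reversed(list(enumerate(stmts))):
            if defd in needed:
                needed |= set(used)
                keep.append(i)
        return sorted(keep)

    # a = ...; b = f(a); c = ...; d = g(b)  -- the slice on d omits c.
    prog = [("a", []), ("b", ["a"]), ("c", []), ("d", ["b"])]
    assert backward_slice(prog, "d") == [0, 1, 3]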



11/30: Making OCaml Safe for Performance Engineering

Time: Thursday, November 30, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Ron Minsky

Abstract: Over the last couple of years, Jane Street has greatly increased its investment in OCaml and has started developing major extensions to OCaml's type system, with the primary goal of making OCaml a better language for writing high-performance systems.
In this talk, I'll attempt to provide a developer's-eye view of these changes. We'll cover two major directions of innovation: first, the addition of modal types to OCaml, which opens up a variety of ambitious features, like memory-safe stack allocation, type-level tracking of effects, and data-race freedom guarantees for multicore code. The second is the addition of a kind system to OCaml, which provides more control over the representation of memory, in particular allowing structured data to be represented in a cache- and prefetch-friendly tabular form. Together, these features pull together some of the most important features for writing high-performance code in Rust, while maintaining the relative simplicity of programming in OCaml.
In all of this, I'll focus less on the type theory and more on how these features are surfaced to users, the practical problems they help us solve, and where this leaves us in the design space of programming languages.



12/7: Scalable Static Analysis of Parallel Programs with Tens of Millions of Lines of Code

Time: Thursday, December 7, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Jeff Huang

Abstract: Scalable static analysis of parallel programs is crucial for ensuring the correctness, safety, and performance of large-scale software applications. As software projects grow in size and complexity, analyzing them becomes more challenging. Analyzing parallel programs, which can run concurrently on multiple processors or cores, poses additional challenges due to the intricacies of synchronization, data races, and deadlocks. In this talk, I will present a static analysis tool developed over the last five years by my research team at Texas A&M and a startup company, Coderrect Inc., on top of the LLVM toolchain. The tool has been used to analyze tens of millions of lines of source code in languages such as C/C++, Fortran, Rust, and Smalltalk, and has identified hundreds of previously undetected concurrency issues in popular open-source and proprietary projects. I will focus on a few precision-preserving techniques for scaling the compiler analyses, including an abstraction of origins, pointer analysis solvers, OpenMP logical threading, and an IR for dynamic languages. Some of these ideas have been published in recent PLDI, SC, and ICSE conferences, including an ACM SIGSOFT Distinguished Paper Award.



12/14: Compiling Shape Operators for Sparse Array Programming

Time: Thursday, December 14, 2023, 12 noon - 1pm
Location: Gates 415

Speaker: Alexander J. Root

Abstract: We show how to build a compiler for a sparse array language that supports shape operators (e.g., reshaping or concatenating arrays) in addition to compute operators. Existing sparse array programming systems implement generic shape operators for only some sparse data structures, reduce shape operators on other data structures to those, and do not support fusion. Our system compiles sparse array expressions to code that efficiently iterates over reshaped views of irregular sparse data structures, without needing to materialize temporary storage for intermediates. Our evaluation shows that our approach generates sparse array code competitive with popular sparse array libraries: shape operators perform on average 1.49x faster than hand-written kernels in scipy.sparse and 28x faster than hand-written kernels in pydata/sparse; for operators whose library implementations perform data structure conversions, our generated code performs on average 4.19x faster than scipy.sparse and 54.6x faster than pydata/sparse. Lastly, we show that fusing shape and compute operators can improve the performance of some sparse array programs, with 1.16x-2.79x speed-ups on selected programs over unfused code.
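
For context, a minimal sketch of why sparse shape operators need not materialize intermediates (illustrative Python, not the paper's generated code): reshaping a COO matrix is just arithmetic on coordinates, and that arithmetic can be fused into whatever loop consumes the reshaped view.

    import numpy as np

    def reshape_coo(rows, cols, old_shape, new_shape):
        # Linearize row-major coordinates, then split against the new shape;
        # values are untouched and no dense intermediate is built.
        flat = rows * old_shape[1] + cols
        return flat // new_shape[1], flat % new_shape[1]

    # A 2x6 matrix with nonzeros at (0, 4) and (1, 3), viewed as 3x4:
    r, c = reshape_coo(np.array([0, 1]), np.array([4, 3]), (2, 6), (3, 4))
    assert (r == [1, 2]).all() and (c == [0, 1]).all()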
