Demand for data center capacity reached an all-time high during the COVID-19 pandemic. Global spending on data center infrastructure was $216 billion in 2021, a 27% increase from 5 years ago. This sounds warning bells about the environmental footprint: the emissions from training a common Natural Language Processing (NLP) model called BERT on a GPU cluster are roughly equal to those of a trans-American flight. Further, the number of IoT devices has grown, and IDC predicts that IoT devices will generate 79.4 zettabytes of data by 2025. Storing all this data on top-of-the-line flash drives of 1-terabyte capacity would require 79.4 billion such drives. Distributed sensing and computation reduce this load on data centers while also enabling more rapid decision making. Rightsizing computation is a major driver of Dr. Chaterji's work, whether by algorithmically optimizing the number and size of virtual machine (VM) instances for a specific analytics task or by squeezing the size of neural networks while still achieving “good enough” computational accuracy.

On the computational genomics front, Dr. Chaterji's vision is to use ML to unearth patterns from genomics data to extend our healthspan. She has focused both on multifaceted DNA regulatory regions, such as enhancers and promoters, and on regulatory RNA networks, e.g., gene-microRNA networks. Pinpointing the targets of these microRNAs can improve the targeting of genome engineering molecules, a major pain point of RNA-based therapeutics, genome editors (e.g., CRISPR/Cas9), and mRNA vaccines. Moreover, while advances have been made in long-read sequencing, these instruments are still error-prone. Dr. Chaterji has created data-driven error correction techniques that will speed up the adoption of long-read sequencing in our clinical enterprise. In line with this, Dr. Chaterji wishes to turbocharge the development of software packages that encapsulate such innovations. An important first step toward this has been her development of Sarvavid, the first Domain Specific Language (DSL) in the field of computational genomics. Sarvavid encapsulates the common algorithmic building blocks of genomics applications as scalable software functions.

On the IoST front, Dr. Chaterji’s vision is to enable algorithms to distill rich patterns from the deluge of data generated by ubiquitous sensing. Her algorithms are designed to run on resource-constrained devices, the kinds that will make up the bulk of our next-generation computational infrastructure. Her work enables the analytics to run closer to the sources of data, removing the need to transport all data to our data centers. For example, in digital agriculture, her work enables the IoST ecosystem by creating end-to-end data pipelines with energy-aware edge computation. This is achieved for on-premises computation (for latency or privacy), on cloud-hosted VMs, or on the latest serverless cloud infrastructure.

Working across these two distinct domains creates synergy between them. For example, in scRNA-seq data, autoencoders extract the salient features, making the analytics more interpretable. Similarly, in computer vision, encoders compress the information in streaming video files before they are transferred across networks, thus saving bandwidth.