Dr. Chaterji's lab has two research thrusts: (1) IoT and Cloud Computing, and (2) Computational Genomics. The unifying theme is that both thrusts deal with large volumes of data, whether streaming video or genomics, to support data-driven decisions. The unique aspect of her analytics algorithms is that they can adapt to execute on different kinds of devices, small and large, and can meet user-provided service level objectives (SLOs) for performance metrics. Such SLOs are related to throughput (often more important in genomics), latency (timeliness, more relevant to IoT, e.g., self-driving cars), or energy (e.g., for battery-powered devices).
Thrust 1: IoT and Cloud Computing
An increasing number of networked devices are out there in the wild, ranging from security cameras to aerial drones to small Raspberry Pi single-board computers. The total installed base of connected devices worldwide is projected to reach 30.9 billion units by 2025. These devices are becoming smaller in terms of their physical form factor and there is a compelling need to use these devices as analytics tools in the palms of our hands. The big question that Dr. Chaterji is addressing is:
As the resources on these devices are limited, can hungry machine learning (ML) algorithms be approximated to fit within them?
These ML algorithms are ravenous in multiple dimensions: compute-hungry, network-hungry, power-hungry, and data-hungry. Therefore, they need to be approximated and rightsized to fit within the devices (i.e., within the memory and compute cores). Because the approximated algorithms use fewer compute cycles, they also consume less energy. Thus, the goal is to make on-device analytics smart, safe, and agile. These devices will not merely be the sensors they are today, but will transform into intelligent decision-making devices. To enable this, Dr. Chaterji's work shows how to adapt the computation based on dynamic conditions, such as the video content characteristics or the amount of computational resources available. Prior adaptive solutions, which reconfigure at runtime, often underperform relative to static baselines. In contrast, Dr. Chaterji's work uses a cost-benefit analysis to select the best approximation, achieving the right balance of accuracy and latency.
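The cost-benefit selection described above can be illustrated with a minimal sketch. It assumes each approximation has been profiled offline for accuracy, per-frame latency, and a one-time switching cost; the `Config` fields, the `pick_config` function, and all numbers are hypothetical and are not taken from Dr. Chaterji's systems. The sketch picks the most accurate configuration whose latency, with the switching cost amortized over the frames until the next decision, still meets the latency SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One approximation of a model (all values are hypothetical profiles)."""
    name: str
    accuracy: float        # expected accuracy in [0, 1]
    latency_ms: float      # expected per-frame latency
    switch_cost_ms: float  # one-time cost of switching to this config


def pick_config(configs, current, latency_slo_ms, frames_until_next_decision):
    """Return the most accurate config whose amortized latency meets the SLO.

    The switching cost is the "cost" side of the analysis: it is spread over
    the frames that will run before the next reconfiguration decision.
    """
    best = None
    for c in configs:
        cost = 0.0 if c.name == current else c.switch_cost_ms
        amortized = c.latency_ms + cost / frames_until_next_decision
        if amortized <= latency_slo_ms:
            if best is None or c.accuracy > best.accuracy:
                best = c
    if best is None:
        # Nothing meets the SLO: fall back to the fastest config.
        best = min(configs, key=lambda c: c.latency_ms)
    return best


configs = [
    Config("small", accuracy=0.60, latency_ms=10.0, switch_cost_ms=5.0),
    Config("medium", accuracy=0.72, latency_ms=25.0, switch_cost_ms=20.0),
    Config("large", accuracy=0.80, latency_ms=60.0, switch_cost_ms=40.0),
]
choice = pick_config(configs, current="small", latency_slo_ms=33.0,
                     frames_until_next_decision=10)
print(choice.name)
```

With a long decision window the switch to a more accurate configuration pays off; with a short window the same switch can violate the SLO, so the sketch stays on the current configuration.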
Thrust 1's key high-level contributions:
- How to use networks of sensors and drones in sensing and actuation for digital agriculture and energy-aware surveillance [e.g., https://schaterji.io/publications/2021/dronet/].
- How to enable demanding streaming video processing algorithms to run on sensor nodes, often in network-constrained sensorized farms [e.g., https://schaterji.io/publications/2022/litereconfig/].
- How to run complex ML workloads on a serverless cloud computing platform [e.g., https://schaterji.io/publications/2022/wisefuse/].
Thrust 2: Computational Genomics
Single-cell RNA sequencing (scRNA-seq) quantifies gene expression at the genome-wide level in thousands of cells. This has expanded our knowledge of cell heterogeneity, as well as our understanding of changes in human tissues, at single-cell granularity. scRNA-seq outputs high-dimensional expression matrices of cells and genes. Intrinsically, these cells reside on a lower-dimensional manifold, which can then be partitioned into compact, interpretable clusters. The typical pipeline involves dimensionality reduction, followed by clustering in the lower-dimensional space, such that cells in one cluster have similar characteristics (e.g., gene expression profiles). Dr. Chaterji has developed a combination of techniques, including data-to-graph models, denoising, and graph autoencoders, to infer contextual information and to create interpretable clusters of cells that are meaningful to domain scientists. Prior to this, Dr. Chaterji worked on regulatory RNA modeling, developing classification and regression algorithms to decipher the precise targets of regulatory RNA molecules (e.g., microRNA) and regress their interaction strengths in a context-aware manner (e.g., in different physiological states). This enables these regulatory RNA molecules to be used to extend our healthspan with fewer off-target effects.
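The typical two-step pipeline (dimensionality reduction, then clustering) can be sketched on a toy expression matrix. This is a deliberately simple stand-in: it projects cells onto their leading principal component using power iteration and then runs two-cluster k-means on the one-dimensional scores. It does not implement the graph autoencoder or denoising techniques mentioned above, and the function names and toy data are hypothetical.

```python
def project_on_top_component(rows, iters=100):
    """Project mean-centered rows onto their leading principal direction."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Arbitrary non-uniform start vector (an assumption of this sketch).
    v = [float(j + 1) for j in range(d)]
    for _ in range(iters):
        # w = X^T (X v), which is proportional to (covariance matrix) * v.
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(a * a for a in w) ** 0.5
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    return [sum(x[j] * v[j] for j in range(d)) for x in centered]


def two_means_1d(scores, iters=20):
    """Two-cluster k-means on 1-D scores, seeded with the min and max."""
    lo, hi = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iters):
        labels = [0 if abs(s - lo) <= abs(s - hi) else 1 for s in scores]
        group0 = [s for s, lab in zip(scores, labels) if lab == 0]
        group1 = [s for s, lab in zip(scores, labels) if lab == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return labels


# Toy matrix: 6 cells x 4 genes. Cells 0-2 express genes 0-1 strongly,
# cells 3-5 express genes 2-3 strongly.
cells = [
    [9, 8, 1, 0], [10, 9, 0, 1], [8, 9, 1, 1],
    [1, 0, 9, 10], [0, 1, 8, 9], [1, 1, 10, 9],
]
labels = two_means_1d(project_on_top_component(cells))
print(labels)
```

The sign of a principal component is arbitrary, so which group receives label 0 may vary; only the grouping of cells is meaningful.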
Thrust 2's key high-level contributions:
- How to cluster single-cell genomics datasets with high fidelity into interpretable clusters [e.g., https://schaterji.io/publications/2022/kratos/].
- How to use natural language processing (NLP) techniques for error correction in sequenced genome datasets [e.g., https://schaterji.io/publications/2022/lerna/].
- How to infer context-specific microRNA-gene regulatory networks and the strength of their regulatory effects [e.g., https://schaterji.io/publications/2021/simultaneous/].
The theme of big data in IoT and genomics pervades the unique philosophy of Dr. Chaterji's teaching. Dr. Chaterji has developed and taught three innovative new courses and is offering a fourth this semester: two 1-credit modules and two 3-credit modules on various aspects of applied ML for IoT, cloud computing, and genomics. Funded through NSF-CAREER and NIFA awards, her learning platforms bring a range of active learning techniques, podcasting, and invited speakers and entrepreneurs in data science and engineering into her classes. She has been successful at attracting technically diverse students to her courses through an exciting agile curriculum that she is developing for data science and data engineering within ABE. This is crucially important because computing is key for digital agriculture and for next-generation biological engineering.
Dr. Chaterji is a Purdue Scholarship of Engagement Fellow and through this, is developing the broader impacts direction of her NSF-CAREER project. She has developed building blocks to speed up the pipeline of developing algorithms and visualization software, and a complementary set of testbeds to validate the analytics software. This includes a “living testbed” in actual farms in the WHIN (Wabash Heartland Innovation Network) area to educate farmers about the bounds of technologies in the Cyber-Physical Systems (CPS) domain.
Dr. Chaterji is involved in an REEU program through NIFA to recruit underrepresented students and won the NSF-CAREER REU supplement this year to engage undergraduates in her lab. She believes it is her professional and moral responsibility to improve the historically low enrollment of underrepresented minority and women students in our graduate programs. Consequently, she engages with diverse students through on- and off-campus programs, such as the multi-university USDA/NIFA higher education challenge grant and the IEEE Geographical Actions Committee (GAC). She is also invited to serve on conference program committees of top venues in ML, computer systems, and genomics, fueling her outreach activities.
Dr. Chaterji is recognized as a pioneer in bringing data analytics to IoT systems, concurrent with resilience and security concerns. This is increasingly important as the world moves toward IoT systems that demand high criticality, whether in battlefields (Internet of Battlefield Things, IoBT), civilian relief and rescue (e.g., drone-based surveillance), or digital agriculture (Internet of Small Things, IoST). Her work allows these systems to be deployed in spaces where the network is unstable or bandwidth-constrained, or where the sensor nodes are energy-constrained.
Dr. Chaterji's work in this space has developed the first distributed cloud database, uniquely tailored to handle IoT data (dynamic, time-varying streaming data with anomalies). This database is automatically reconfigured to meet low latency/high throughput user requirements, even as the application patterns change, often unpredictably. Dr. Chaterji has developed data-driven techniques to rightsize computer vision algorithms to mobile GPUs and IoT nodes, enabling these intensive ML algorithms to execute on IoT systems. This has resulted in KeyByte, her academic startup founded in 2021, for complex and time-varying workloads from ML applications.
In computational genomics, Dr. Chaterji's work has created a software suite that can lead to precision RNA therapeutics (with fewer off-target effects) and interpretable clusters of cells from scRNA-seq data. Dr. Chaterji has also created the first domain-specific language (DSL) for genomics to speed up the design and evolution of new computational genomics algorithms.
Demand for data center capacity reached an all-time high with the COVID-19 pandemic. Spending on global data center infrastructure was $216 billion in 2021, a 27% increase from five years earlier. This sounds warning bells in terms of the environmental footprint: the emissions from training a common Natural Language Processing (NLP) model called BERT on a GPU cluster are roughly equal to the emissions of a trans-American flight. Further, the number of IoT devices has grown, and IDC predicts that IoT devices will generate 79.4 Zettabytes of data by 2025. Storing all this data on top-of-the-line flash drives of 1-Terabyte capacity would require 79.4 billion such drives. Distributed sensing and computation reduce this load on the data centers while also enabling more rapid decision making. Rightsizing computation is a major driver of Dr. Chaterji's work, whether by algorithmically optimizing the number and size of virtual machine (VM) instances for a specific analytics task, or by squeezing the size of neural networks while achieving “good enough” computational accuracy.
On the computational genomics front, Dr. Chaterji's vision is to use ML to unearth patterns from genomics data to extend our healthspan. She has focused both on multi-faceted DNA regulatory regions, such as enhancers and promoters, and on regulatory RNA networks, e.g., gene-microRNA networks. Pinpointing the targets of these microRNAs can improve the targeting of genome engineering molecules, a major pain point of RNA-based therapeutics, genome editors (e.g., CRISPR/Cas9), and mRNA vaccines. Moreover, while advances have been made in long-read sequencing, these instruments are still error-prone. Dr. Chaterji has created data-driven error correction techniques that will speed up the adoption of long-read sequencing in our clinical enterprise. In line with this, Dr. Chaterji wishes to turbocharge the development of software packages that encapsulate such innovations. An important first step toward this has been her development of Sarvavid, the first domain-specific language (DSL) in the field of computational genomics. The common algorithmic building blocks in genomics applications have been encapsulated into scalable software functions in Sarvavid.
On the IoST front, Dr. Chaterji’s vision is to enable algorithms to distill rich patterns from the deluge of data generated by ubiquitous sensing. Her algorithms are uniquely qualified to run on resource-constrained devices, the kinds that will make up the bulk of our next-generation computational infrastructure. Her work enables the analytics to run closer to the sources of data, removing the need to transport all data to our data centers. For example, in digital agriculture, her work enables the IoST ecosystem, creating end-to-end data pipelines with energy-aware edge computation. This holds whether the computation runs on premises (for latency or privacy), on cloud-hosted VMs, or on the latest serverless cloud infrastructure.
Working across two distinct domains creates synergy between them. For example, in scRNA-seq data, autoencoders extract the salient features, making the analytics more interpretable. In computer vision, encoders compress the information in streaming video files before it is transferred across networks, thus saving bandwidth.
On the IoT and digital agriculture front, our focus is on the data engineering side, especially designing algorithms for lightweight in-sensor analytics, partitioning algorithms across different platforms of computation (sensor → edge → cloud), and designing robust backend databases for high-performance data lakes for IoT in digital agriculture.
For Thrust 1, Prof. Chaterji is a part of the Data Science/Digital Agriculture area of ABE and here is a blurb portraying the area's highlights:
Data science and engineering is revolutionizing agriculture and improving data flow from sensors to end users by decreasing data latency, improving the precision and predictive power of tools, and identifying bottlenecks in food and agricultural systems. We work on leading edge topics such as harnessing heterogeneous data from biosensors, hyperspectral imaging systems, GPU/FPGA accelerators, aerial UAVs, remote-sensing satellites, and streaming data analytics to revolutionize agriculture and other domains involving automation. Through software engineering, applied machine learning, data engineering, and robotics, we look to create the next-generation techniques for the technical community and corresponding tools to benefit society.
On the precision health and genome engineering front, our focus is on the computational genomics and synthetic biology sub-domains.
For Thrust 2, Prof. Chaterji is a part of the Biological Engineering area of ABE and here is a portrayal of the area's highlights:
Biological Engineering includes modeling, instrumentation and hardware, and lab-scale and high-throughput experiments to modify cells and biological materials for applications relating to humans and other living systems. In addition, modeling can involve interrogating the genome of the organism and advancements in high-throughput tools ranging from sequencing to genome assembly. Our approach involves a combination of computational genomics, systems and synthetic biology, and experimental methodologies to create new entities and generate data and validate them through data analytics and further experimentation. All of these will rapidly advance and open opportunities in cell engineering, bioenergy and biofuel production, bioremediation, and biodefense.
Here I will provide new dimensions and storylines from the work that we do at ICAN. In general, I will write this so that it is accessible to a general technical audience.
Summer 2020: Work inspired by funding from the Lilly Endowment and Army Research Laboratories [ARL].
My lab [https://schaterji.io], called Innovatory for Cells and Machines and abbreviated ICAN, started in 2018. Since then, we have been branching out in two distinct, albeit interrelated, technical areas: first, Internet-of-Things (IoT) and edge computing for digital agriculture and other related IoT application domains, and second, genome engineering. They are related in the sense that my lab develops machine learning-based algorithms for efficient data analytics on the IoT side and for decoding the genome on the genome engineering side. Let me describe our novel approach in these two areas.
IoT and Edge Computing for Digital Agriculture: We are developing algorithms for handling large volumes of IoT data and deriving actionable insights from them. This is relevant in digital agriculture, where data is generated by ground sensors and aerial drones. Our novel approach allows the data to be stored and processed close to the source of the data rather than being ferried all the way to a data center. This is important to ensure the privacy of the IoT data (e.g., it can be made to stay completely within the premises) and to ensure that the scarce wireless bandwidth and constrained energy resources are not spent carrying raw and irrelevant data to the backend. As part of our solution approach, we have developed:
- On-device computation: In this approach, machine learning algorithms, such as object recognition or activity recognition based on video data, are approximated to run on small form factor devices.
- Federated machine learning: In this approach, data does not need to be centralized to build machine learning models, since different data owners may have privacy misgivings about doing that. Rather, the model can be built through coordination among many local models, with data staying private to each data owner. Our approach allows this to be done in a reliable and secure manner, tolerating the failure or malicious behavior of some of the local clients.
- Distributed databases: The data needs to be stored in distributed databases that can be hosted close to the source of the data. Such a database should be able to scale up or down depending on the load and changing characteristics of the application. Our approach enables this for the dominant class of databases called NoSQL databases. We have two patent-pending technologies in this space, ready to be licensed from Purdue’s OTC.
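The federated learning item above, in which a shared model is built from local updates while tolerating failed or malicious clients, can be sketched with a generic robust aggregation rule. The coordinate-wise median used here is a well-known textbook technique chosen only for illustration; it is not claimed to be the method used in ICAN's work, and the function name and sample updates are hypothetical.

```python
from statistics import median


def robust_aggregate(client_updates):
    """Combine client model updates with a coordinate-wise median.

    Each update is a list of floats (one value per model parameter).
    A value of None stands for a client that failed to report.
    The median limits the influence of a minority of extreme updates.
    """
    alive = [u for u in client_updates if u is not None]
    if not alive:
        raise ValueError("no client updates received")
    dim = len(alive[0])
    return [median(u[j] for u in alive) for j in range(dim)]


updates = [
    [1.0, 2.0],
    [1.2, 1.8],
    None,               # a client that dropped out
    [0.9, 2.1],
    [1.1, 1.9],
    [100.0, -100.0],    # a client sending a malicious update
]
print(robust_aggregate(updates))
```

A plain average of the same updates would be pulled far off by the single extreme client, which is why a robust statistic is used in this sketch. Raw data never leaves the clients; only the parameter updates are shared.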
Genome engineering: Genome reads are still rather error-prone, especially as we look toward newer technologies like nanopore sequencing. ICAN treats these reads as words of a language: just as Natural Language Processing (NLP) can correct errors in language, we are adapting NLP techniques to correct genomic reads. Our approach allows the rapid progress in deep neural networks (DNNs) to be applied to understanding the “language” of the genome, so that we can derive actionable insights from it, such as developing precise genome editing technologies.
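To make the "reads as words" analogy concrete, the sketch below splits each read into overlapping k-mers, which play the role of word tokens, and flags k-mers that occur rarely across the read set as candidate sequencing errors. This k-mer counting heuristic is a simple, generic illustration under assumed parameters (`k`, `min_count`); it is not the DNN-based method described above, and the function names and toy reads are hypothetical.

```python
from collections import Counter


def kmers(read, k):
    """Split a read into its overlapping k-mers (the 'words' of the read)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def rare_kmers(reads, k, min_count=2):
    """Return k-mers seen fewer than min_count times across all reads.

    Rare k-mers are candidate errors: a true genomic k-mer tends to be
    covered by several reads, while an erroneous one tends to be unique.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c < min_count}


# Toy reads: the third read differs from the others at one base.
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(rare_kmers(reads, k=4)))
```

In a language-model setting, the same tokens would be fed to a model that predicts the likely "word" in context rather than relying on raw counts.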