Arpita Saha

Computer Science Researcher, Music Enthusiast.

Hi! I am a passionate Computer Science Researcher. My research revolves around applying Machine Learning to build robust automated systems and perform knowledge discovery. Currently, I am a Post-Masters Research Associate at Brandeis University, working with Dr. Subhadeep Sarkar at the SSD Lab. Prior to that, I was a Graduate Research Assistant working with Dr. Xia Ning at Ohio State. I received my Master's from The Ohio State University and my Bachelor's from Bangladesh University of Engineering and Technology.

Research Associate

Brandeis University

I am working with Prof. Subhadeep Sarkar on the use of Machine Learning techniques to make NoSQL systems adaptive to workload changes for optimized performance.

October 2023 - Present

Graduate Research Assistant

Ohio State University

I worked with Dr. Xia Ning on time-series modeling of large-scale EHR data using deep learning.

January 2023 - August 2023

Graduate Teaching Assistant

Ohio State University

Mentored a class of 100 Computer Science students and assisted Professor Michelle Mallon in running the course.

January 2022 - December 2022

Lecturer

United International University, Bangladesh.

Taught courses including Discrete Mathematics, Operating Systems, and Programming Languages. Prepared course materials, mentored students, and evaluated their performance.

September 2008 - June 2010

Education

The Ohio State University

Master of Science

Computer Science and Engineering

GPA: 3.81/4.00

Thesis research on Machine Learning and AI in Health Informatics: link

August 2021 - August 2023

Bangladesh University of Engineering and Technology

Bachelor of Science

Computer Science and Engineering

GPA: 3.82/4.00

February 2016 - February 2021

Research

COVID-19 Mortality Prediction and Patient Phenotyping from Large-Scale EHR Data

Arpita Saha, Maggie Samaan, Xia Ning

ACM BCB 2023 (PAPER LINK)

This work studies how patient data, such as demographics, lab results, and comorbidities, relate to disease outcomes, and uses that data for patient phenotyping. We use large-scale Electronic Health Record (EHR) data from the COVID-19 Research Data Commons (CoRDaCo). We built a GRU-based time-series deep learning model from scratch that achieved an AUC-ROC score of 97% in predicting COVID-19 patient mortality. Our model outperforms the state of the art on the same dataset using only 11K parameters, whereas transformer models perform worse even with 700K parameters. We also investigated the expressive power of the patient representation embeddings generated by our model by clustering them into distinct phenotypes and studying the trends of mortality-related risk factors across these phenotypes, with the goal of supporting efficient resource allocation during a pandemic. This work was published at ACM BCB 2023 and is my first first-author publication.

August 2022 - August 2023

KVBench: A Key-Value Benchmarking Suite

Zichen Zhu, Arpita Saha, Manos Athanassoulis, Subhadeep Sarkar

DBTest 2024 (ACM SIGMOD Workshop) (PAPER LINK)

This work builds a workload generator that produces synthetic workloads for NoSQL data systems. Such a tool is integral to stress testing NoSQL systems for correctness and to performance benchmarking. It fills the gap between classical workload generators and complex real-life workloads by offering a richer array of knobs, such as the proportion of empty point queries, point and range deletes with specified selectivity, and customized distributions for queries and updates. As a result, it provides better support for emulating real-life workloads than state-of-the-art key-value workload generators.

February 2024 - April 2024

Performance Benchmarking of Various Implementations of the LSM Memory Buffer

Arpita Saha, Alex Ott, Shubham Kaushik, Subhadeep Sarkar

In this project, we study how different data structure implementations of the LSM memory buffer (memtable) affect the performance (latency and throughput) of NoSQL databases under varied workloads. Insert-only workloads benefit from an unsorted vector as the memtable, while a skiplist performs better in the presence of point queries. We are therefore implementing data structures such as an unsorted vector, a trie, and a binary search tree, and studying both existing and new data structures to identify the optimal memtable configuration for a given workload composition. We use RocksDB, an open-source NoSQL database, for benchmarking and for building our dataset as a prelude to the optimization problem.

February 2024 - April 2024

Toward Workload-Aware Self-Designing LSM Engines for NoSQL Databases

Arpita Saha, Subhadeep Sarkar

NEDBDay 2024 (POSTER LINK)

In this research project, we aim to automate the tuning of the LSM Tree data structure, the backbone of NoSQL data systems. NoSQL data systems expose a multitude of knobs that can be tuned to obtain the desired read and write latencies and throughput. However, hand-tuning these systems does not guarantee an optimal configuration because of the vastness of the design space and the complex interactions among knobs, hardware, and workloads. We therefore leverage Machine Learning techniques to navigate the design space of LSM Trees and learn the optimal combination of tuning knobs in response to dynamic workloads. By optimizing key performance metrics, we aim to make NoSQL data systems more efficient and more responsive to changing workloads. The overarching goal of this research is to use data-driven techniques to develop adaptive and efficient self-designing data systems.

October 2023 - May 2024

QT-GILD: Quartet based gene tree imputation using deep learning improves phylogenomic analyses despite missing data

Sazan Mahbub, Shashata Sawmya, Arpita Saha, M Sohel Rahman, Md. Shamsuzzoha Bayzid

RECOMB 2022 (PAPER LINK)

In this project, we aimed to solve the Quartet Distribution Imputation problem in incomplete gene trees. We built a variational autoencoder-based semi-supervised approach, empowered by NLP techniques such as masked language modeling and positional encoding, to impute missing taxa in incomplete gene trees. This improved the state of the art in species tree estimation, enabling better phylogenomic analyses. This work was published in RECOMB 2022 and later in the Journal of Computational Biology.

April 2019 - May 2021

Projects

Website development for Tour Planning System

Developed a website as my level-4/term-1 project

Tools: Django, SQLite

Github link

Micro-controller Project for 3-D Audio Equalizer

Can detect a car in real time and send a warning message with its GPS location to the owner's mobile phone when the car exceeds the speed limit or crosses a safety zone. The owner can also get the GPS location of the car with a single message.

Tools: Arduino, MSGEQ7 seven-band graphic equalizer, LED Cube (built from scratch)

Youtube video link

Piano Player

A first-year project: a desktop UI for playing piano using the keyboard.

Tools: C, iGraphics

Miscellaneous

Developing a C compiler using lexical analyzer and parser design tools
Designing a 4-bit computer model using Atmel Studio and the MIPS architecture
Simulating the Mancala game in an AI lab course
Implementing and modifying some functionalities of the XV6 operating system
Modifying some computer network functionalities in NS2

Publications

Zichen Zhu, Arpita Saha, Manos Athanassoulis, Subhadeep Sarkar. "KVBench: A Key-Value Benchmarking Suite", DBTest '24: Proceedings of the Tenth International Workshop on Testing Database Systems (ACM SIGMOD PODS).

Arpita Saha, Maggie Samaan, Bo Peng, Xia Ning. "A Multi-Layered GRU Model for COVID-19 Patient Representation and Phenotyping from Large-Scale EHR Data", Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2023).

Arpita Saha. "DePCoM: Deep Phenotyping of COVID-19 Patients Using a Multi-Layered GRU Model on Large-Scale EHR Data", The Ohio State University: MS Thesis.

Sazan Mahbub, Shashata Sawmya, Arpita Saha, Rezwana Reaz, M Sohel Rahman, Md Shamsuzzoha Bayzid. "Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data", Journal of Computational Biology.

Sazan Mahbub, Shashata Sawmya, Arpita Saha, Rezwana Reaz, M Sohel Rahman, Md Shamsuzzoha Bayzid. "QT-GILD: Quartet based gene tree imputation using deep learning improves phylogenomic analyses despite missing data", International Conference on Research in Computational Molecular Biology (RECOMB 2022).

Shashata Sawmya, Arpita Saha, Sadia Tasnim, Naser Anjum, Md Toufikuzzaman, Ali Haisam Muhammad Rafid, Mohammad Saifur Rahman, M Sohel Rahman, Tanvir Alam. "Phylogenetic Analyses of SARS-CoV-2 Strains Reveal Its Link to the Spread of COVID-19 Across the Globe", MEDINFO 2021: One World, One Health–Global Partnership for Digital Innovation.

Shashata Sawmya, Arpita Saha, Sadia Tasnim, Naser Anjum, Md Toufikuzzaman, Ali Haisam Muhammad Rafid, Mohammad Saifur Rahman, M Sohel Rahman. "Analyzing hCov genome sequences: applying machine intelligence and beyond", bioRxiv.

Skills

Languages: Python, C++, C, Java, MATLAB, SQL, JavaScript, Shell

Database: Oracle, MySQL, SQLite, Neo4j

Frameworks: Django, Java Swing, PyQt5, JavaFX

Libraries: PyTorch, TensorFlow, NumPy, Pandas, scikit-learn, Matplotlib

Tools/Infrastructure: Git, SLURM, Linux, UNIX, Java Unit Testing, Agile, Scrum

Cloud/HPC: Chameleon Cloud, SSH, SLURM

Technical Writing: LaTeX, Overleaf

Others: VSCode, Eclipse JDT, IntelliJ Platform SDK