Arpita Saha

Computer Science Researcher, Music Enthusiast.

Hi! I am a passionate Computer Science Researcher. My research revolves around applying Machine Learning to build robust automated systems and perform knowledge discovery. Currently, I am a Post-Master's Research Associate at Brandeis University, working with Dr. Subhadeep Sarkar at the SSD Lab. Before that, I was a Graduate Research Assistant working with Dr. Xia Ning at Ohio State. I received my Master's from The Ohio State University and my Bachelor's from Bangladesh University of Engineering and Technology.

Recent News
  • Presented a poster at NEDBDay 2024 on Self-Designing Data Systems (link) [May '24]
  • Published "KVBench: A Key-Value Benchmarking Suite" at DBTest 2024 (ACM SIGMOD/PODS workshop) [April '24]
  • Joined the Smart and Scalable DataSystems (SSD) Lab as a Research Associate [Oct '23]
  • Graduated with a Master's in Computer Science and Engineering from The Ohio State University (CGPA 3.81 out of 4.0) [Aug '23]

Experience

Research Associate

Brandeis University
I am working with Prof. Subhadeep Sarkar on the use of Machine Learning techniques to make NoSQL systems adaptive to workload changes for optimized performance.
October 2023 - Present

Graduate Research Assistant

Ohio State University
I worked with Dr. Xia Ning on time-series modeling of large-scale EHR data using deep learning.
January 2023 - August 2023

Graduate Teaching Assistant

Ohio State University
Mentored a class of 100 Computer Science students and assisted Professor Michelle Mallon in running the course.
January 2022 - December 2022

Lecturer

United International University, Bangladesh.
Taught courses including Discrete Mathematics, Operating Systems, and Programming Languages. Prepared course materials, mentored students, and evaluated their performance.
September 2008 - June 2010

Education

The Ohio State University

Master of Science
Computer Science and Engineering

GPA: 3.81/4.00

Thesis research on Machine Learning and AI in Health Informatics: link

August 2021 - August 2023

Bangladesh University of Engineering and Technology

Bachelor of Science
Computer Science and Engineering

GPA: 3.82/4.00

February 2016 - February 2021

Research

Covid-19 Mortality Prediction and Patient Phenotyping from large-scale EHR data

Arpita Saha, Maggie Samaan, Xia Ning

ACM BCB 2023 (PAPER LINK)

This work studies how patient data, such as demographics, lab results, and comorbidities, relate to disease outcomes, and uses that data for patient phenotyping. We use large-scale Electronic Health Record (EHR) data from the COVID-19 Research Data Commons (CoRDaCo). We built a GRU-based time-series deep learning model from scratch that achieves an AUC-ROC of 97% in predicting COVID-19 patient mortality. Our model outperforms the prior state of the art (SOTA) on the same dataset with only 11K parameters, whereas transformer models perform worse even with 700K parameters. We also examined the expressive power of the patient representation embeddings generated by our model by clustering them into distinct phenotypes and studying the trends of mortality-related risk factors across these phenotypes, with the aim of supporting efficient resource allocation during a pandemic. The work was published at ACM BCB 2023 and is my first first-author publication.

August 2022 - August 2023




KVBench: A Key-Value Benchmarking Suite

Zichen Zhu, Arpita Saha, Manos Athanassoulis, Subhadeep Sarkar

DBTest 2024 (ACM SIGMOD Workshop) (PAPER LINK)

This work builds a workload generator that produces synthetic workloads for NoSQL data systems. Such a tool is integral to stress-testing NoSQL systems for correctness and to performance benchmarking. It fills the gap between classical workload generators and complex real-life workloads by offering a richer array of knobs, such as the proportion of empty point queries, point and range deletes with a specified selectivity, and customized distributions for queries and updates. As a result, it provides better support for emulating real-life workloads than state-of-the-art key-value workload generators.
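To give a feel for the kind of knobs described above, here is a minimal Python sketch of a synthetic key-value workload generator. It is not KVBench's interface: the function name, the parameter names, the tuple format, and the choice to interpret selectivity as a fraction of the key domain are all assumptions made for illustration, and keys are drawn uniformly rather than from a customized distribution.

```python
import random


def generate_workload(num_inserts, num_point_queries, empty_query_fraction,
                      num_range_deletes, range_delete_selectivity, seed=0):
    """Return a list of operation tuples for a toy key-value workload.

    All knob names are hypothetical. Assumes num_inserts > 0 and
    0 <= range_delete_selectivity <= 1.
    """
    rng = random.Random(seed)
    key_space = num_inserts * 10  # assumed key domain, 10x the inserted keys

    # Inserts: distinct keys drawn uniformly from the key domain.
    present = rng.sample(range(key_space), num_inserts)
    ops = [("INSERT", key) for key in present]

    # Point queries: a chosen fraction targets keys that were never inserted
    # ("empty" queries); the rest target inserted keys.
    present_set = set(present)
    absent = [key for key in range(key_space) if key not in present_set]
    for _ in range(num_point_queries):
        if rng.random() < empty_query_fraction:
            ops.append(("POINT_QUERY", rng.choice(absent)))
        else:
            ops.append(("POINT_QUERY", rng.choice(present)))

    # Range deletes: each covers [start, end) spanning a fixed fraction
    # (the selectivity) of the key domain.
    span = max(1, int(range_delete_selectivity * key_space))
    for _ in range(num_range_deletes):
        start = rng.randrange(key_space - span + 1)
        ops.append(("RANGE_DELETE", start, start + span))

    return ops


if __name__ == "__main__":
    for op in generate_workload(5, 3, 0.5, 1, 0.2, seed=42):
        print(op)
```

Fixing the seed makes the generated workload reproducible, which matters when the same workload has to be replayed against different system configurations.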

February 2024 - April 2024

Performance Benchmarking of the various implementations of LSM Memory Buffer

Arpita Saha, Alex Ott, Shubham Kaushik, Subhadeep Sarkar

In this project, we study how different data structure implementations of the LSM memory buffer (memtable) affect the performance (latency and throughput) of NoSQL databases under varied workloads. Insert-only workloads benefit from an unsorted vector as the memtable, while a skiplist performs better in the presence of point queries. We are therefore implementing different data structures, such as an unsorted vector, a trie, and a binary search tree, and benchmarking both existing and new ones to find the best memtable configuration for a given workload composition. We use RocksDB, an open-source NoSQL database, for benchmarking and for building our dataset as a prelude to the optimization problem.
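The trade-off between an unsorted buffer and a sorted one can be illustrated with a small Python sketch. This is not RocksDB code or the project's implementation: both class names are hypothetical, and a sorted array searched with `bisect` stands in for a skiplist. The stand-in shares the skiplist's logarithmic lookups, but its inserts shift elements and so cost linear time, unlike a real skiplist.

```python
import bisect


class UnsortedVectorBuffer:
    """Append-only buffer: cheap writes, linear-time point lookups."""

    def __init__(self):
        self.entries = []

    def put(self, key, value):
        # O(1) amortized append; no ordering is maintained.
        self.entries.append((key, value))

    def get(self, key):
        # Scan newest to oldest so the most recent write wins.
        for k, v in reversed(self.entries):
            if k == key:
                return v
        return None


class SortedBuffer:
    """Sorted buffer (stand-in for a skiplist): O(log n) point lookups."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        # Binary search for the slot, then overwrite or insert in order.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


if __name__ == "__main__":
    for buffer in (UnsortedVectorBuffer(), SortedBuffer()):
        buffer.put("b", 1)
        buffer.put("a", 2)
        buffer.put("b", 3)  # overwrite: the latest value should win
        print(type(buffer).__name__, buffer.get("b"), buffer.get("z"))
```

The unsorted buffer does no work at write time and pays for it on every lookup, while the sorted buffer pays at write time to keep lookups fast, which is the intuition behind matching the memtable to the workload mix.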

February 2024 - April 2024

Toward Workload-Aware Self-Designing LSM Engines for NoSQL databases

Arpita Saha, Subhadeep Sarkar

NEDBDay 2024 (POSTER LINK)

In this research project, we aim to automate the tuning of the LSM-tree, the data structure at the core of many NoSQL data systems. NoSQL data systems expose a multitude of knobs that can be tuned to reach the desired read/write latency and throughput. However, hand-tuning these systems does not guarantee an optimal configuration, because the design space is vast and the knobs, hardware, and workloads interact in complex ways. We therefore leverage Machine Learning techniques to navigate the design space of LSM-trees and learn the best combination of tuning knobs in response to dynamic workloads. By optimizing key performance metrics, we aim to make NoSQL data systems more efficient and more responsive to changing workloads. The overarching goal of this project is to use data-driven techniques to develop adaptive, efficient, self-designing data systems.


October 2023 - May 2024





QT-GILD: Quartet based gene tree imputation using deep learning improves phylogenomic analyses despite missing data

Sazan Mahbub, Shashata Sawmya, Arpita Saha, M Sohel Rahman, Md. Shamsuzzoha Bayzid

RECOMB 2022 (PAPER LINK)

In this project, we aimed to solve the Quartet Distribution Imputation problem in incomplete gene trees. We built a variational autoencoder-based semi-supervised approach, empowered by NLP techniques such as masked language modeling and positional encoding, to impute missing taxa in incomplete gene trees. This improved on the SOTA in species tree estimation, enabling better phylogenomic analyses. This work was published in RECOMB 2022 and later in The Journal of Computational Biology.

April 2019 - May 2021


Project

  • Website development for Tour Planning System
  • Developed a website as my level-4/term-1 project

    Tools: Django, SQLite

    Github link

  • Micro-controller Project for 3-D Audio Equalizer
  • Can detect a car in real time and send a warning message with the GPS location to the owner's mobile phone when the car exceeds the speed limit or crosses a safety zone. The owner can also get the car's GPS location with a single message.

    Tools: Arduino, MSGEQ7 seven-band graphic equalizer, LED Cube (built from scratch)

    Youtube video link

  • Piano Player
  • A first-year project: a desktop UI for playing the piano using the computer keyboard.

    Tools: C, iGraphics

  • Miscellaneous
    • Developing a C compiler using lexical analyzer and parser designing tools
    • Designing a 4-bit Computer Model using Atmel Studio, MIPS architecture
    • Simulating Mancala game in AI lab course
    • Implementing and modifying some functionalities of XV6 Operating System
    • Modifying some functionalities of Computer Network in NS2


Publications


Skills


CV

Click here to download my complete and updated CV


Awards & Certifications