Bio
I am a Junior at the Texas Academy of Mathematics and Science (TAMS), studying Computer Science.
I'm interested in Computer Vision and Multimodal Learning (Video-Audio tasks).
In my free time, I like to play the Cello!
I use arch btw
Number of people converted to Arch Linux: 2
Selected Projects
Click a project to get the code!
-
Early Fusion for Sound Separation and Localization via Sound and Video
Inspired by the Sound of Pixels paper (Zhao et al., 2018), I slightly modified its architecture: the original audio generation does not take into account visual cues within a frame that could help indicate where and how sound is produced in a region.
Using the early-fusion method from Co-Separating Sounds of Visual Objects (Gao and Grauman, 2019), I swapped out the image backbone for a video backbone (a 3D ResNet) to encode video frames.
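The fusion idea above can be sketched in PyTorch. This is a minimal illustration, not the trained model: the module sizes, the pooled video-feature dimension, and the broadcast-and-concatenate fusion are all placeholder assumptions standing in for the real 3D-ResNet and audio U-Net.

```python
import torch
import torch.nn as nn

class EarlyFusionSeparator(nn.Module):
    """Illustrative early-fusion sketch: pooled video features are injected
    into the audio network's bottleneck (by concatenation) instead of only
    modulating its output. All layer sizes are placeholders."""
    def __init__(self, audio_ch=1, vis_dim=512, hidden=64):
        super().__init__()
        # audio encoder: spectrogram -> downsampled bottleneck feature map
        self.audio_enc = nn.Sequential(
            nn.Conv2d(audio_ch, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        # project pooled video-backbone features to the audio channel width
        self.vis_proj = nn.Linear(vis_dim, hidden)
        # decoder predicts a separation mask the same size as the input
        self.audio_dec = nn.Sequential(
            nn.ConvTranspose2d(2 * hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, audio_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, spec, vis_feat):
        a = self.audio_enc(spec)              # (B, hidden, H/4, W/4)
        v = self.vis_proj(vis_feat)           # (B, hidden)
        v = v[:, :, None, None].expand_as(a)  # broadcast over the spectrogram grid
        fused = torch.cat([a, v], dim=1)      # early fusion by concatenation
        return self.audio_dec(fused)          # mask in [0, 1], same size as spec

mask = EarlyFusionSeparator()(torch.randn(2, 1, 64, 64), torch.randn(2, 512))
```

Because the video feature is fused at the bottleneck rather than applied to the finished masks, the decoder can use visual context while reconstructing the spectrogram.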
I was unable to finish training due to insufficient compute: on an RTX 3060 (12 GB) I could train for one epoch, but only with a batch size of 2 because of VRAM limits. The cost of training the full model was also prohibitive. :(
-
UCF-101 Transformer Model
A transformer-based encoder for action recognition across 101 action classes. Achieves 75-77% accuracy with two attention layers (model dimension 512, 8 heads).
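A minimal sketch of that encoder configuration, assuming per-frame features (e.g. from a CNN backbone) are precomputed and mean-pooled over time before the classification head; the learned positional embedding and pooling choice are my assumptions, not necessarily the actual pipeline.

```python
import torch
import torch.nn as nn

class ActionTransformer(nn.Module):
    """Sketch: two-layer transformer encoder (d_model=512, 8 heads) over a
    sequence of frame features, classifying into 101 UCF-101 actions."""
    def __init__(self, feat_dim=512, heads=8, layers=2, num_classes=101, max_len=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, feat_dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                 # x: (B, T, feat_dim)
        x = self.encoder(x + self.pos[:, :x.size(1)])
        return self.head(x.mean(dim=1))   # mean-pool over time, then classify

logits = ActionTransformer()(torch.randn(2, 16, 512))  # (2, 101)
```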
-
KMNIST DC-GAN
Custom deep convolutional GAN (DC-GAN) for generating Kuzushiji (cursive Japanese) characters.
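The generator half of a DC-GAN for 28x28 grayscale KMNIST images can be sketched as below; the channel widths and latent size are illustrative assumptions, and the Tanh output assumes images scaled to [-1, 1] as in the original DC-GAN recipe.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Sketch: DC-GAN generator mapping a latent vector to a 28x28 image
    via transposed convolutions (1x1 -> 7x7 -> 14x14 -> 28x28)."""
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            # latent vector -> 7x7 feature map
            nn.ConvTranspose2d(z_dim, ch * 2, 7, 1, 0),
            nn.BatchNorm2d(ch * 2), nn.ReLU(),
            # 7x7 -> 14x14
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),
            nn.BatchNorm2d(ch), nn.ReLU(),
            # 14x14 -> 28x28; Tanh matches images normalized to [-1, 1]
            nn.ConvTranspose2d(ch, 1, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

fake = Generator()(torch.randn(4, 100))  # (4, 1, 28, 28)
```

The discriminator mirrors this with strided convolutions in the opposite direction.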
-
Fine-tuning Long-Short Context Network (LoCoNet) on AVDIAR Dataset
Goal: Learn how to fine-tune a model and work with a large codebase.
Using the AVDIAR2ASD GitHub repo, I tested the LoCoNet model on this dataset. Before fine-tuning, it achieves 85-86% accuracy. After fine-tuning for 15 epochs with the last 3 layers unfrozen (chosen by trial and error to prevent catastrophic forgetting), I reached 88% accuracy.
-
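The partial-unfreezing step can be sketched generically. The helper below (`freeze_all_but_last`, a hypothetical name) freezes every top-level child module except the last few, so only those layers receive gradient updates; the toy `nn.Sequential` stands in for LoCoNet, whose real layer structure is more involved.

```python
import torch.nn as nn

def freeze_all_but_last(model: nn.Module, n_unfrozen: int):
    """Freeze every top-level child module except the last `n_unfrozen`,
    and return only the parameters the optimizer should update."""
    children = list(model.children())
    for child in children[:-n_unfrozen]:
        for p in child.parameters():
            p.requires_grad = False  # frozen: no gradient updates
    return [p for p in model.parameters() if p.requires_grad]

# toy stand-in for a pretrained network with 5 layers; unfreeze the last 3
net = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8),
                    nn.Linear(8, 8), nn.Linear(8, 2))
trainable = freeze_all_but_last(net, 3)  # pass `trainable` to the optimizer
```

Freezing the early layers keeps the pretrained low-level features intact, which is what guards against catastrophic forgetting on a small fine-tuning set.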
3 Different CIFAR-10 Models
Implemented a vanilla CNN, a custom 3-layer ResNet, and a transfer-learning ResNet-18 on the CIFAR-10 dataset.