Bio

I am a junior at the Texas Academy of Mathematics and Science (TAMS), studying Computer Science. I'm interested in Computer Vision and Multimodal Learning (video-audio tasks).
In my free time, I like to play the Cello!
I use arch btw
Number of people converted to Arch Linux: 2


Selected Projects

Click a project to get the code!

  • Early Fusion for Sound Separation and Localization via Sound and Video

    Inspired by the Sound of Pixels paper (Zhao et al., 2018), I slightly modified its architecture: the original audio generation does not take into account visual cues within a frame that could help the model understand where and how sound is produced in a region.

    Using the early fusion method from Co-Separating Sounds of Visual Objects (Gao and Grauman, 2019), I swapped the image backbone for a video backbone (3D ResNet) to encode video frames.

    I was unable to finish training due to insufficient compute: on an RTX 3060 (12 GB) I could train for one epoch, but only with a batch size of 2 because of VRAM limits, and the cost of training the full model was prohibitive.
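The early-fusion step above can be sketched with array shapes alone. This is a minimal illustration, not the actual model: the channel counts, spatial grid, and feature names are assumptions, and dense random arrays stand in for real backbone outputs.

```python
import numpy as np

# Illustrative shapes only (not the real model's):
# 3D-ResNet video features after temporal pooling -> (batch, C_v, H, W)
video_feat = np.random.randn(2, 512, 14, 14)
# Audio network features, resized to the same spatial grid -> (batch, C_a, H, W)
audio_feat = np.random.randn(2, 256, 14, 14)

# Early fusion: concatenate along the channel axis before further layers,
# so the separation network can condition on visual cues at each region.
fused = np.concatenate([video_feat, audio_feat], axis=1)
print(fused.shape)  # (2, 768, 14, 14)
```

The key design point is that fusion happens before the later layers, so every spatial location carries both modalities from that point on.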

  • UCF-101 Transformer Model

    A transformer-based encoder for action-recognition classification over 101 action classes. It achieves 75-77% accuracy with 2 attention layers (model dimension 512, 8 heads).
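The core of each attention layer in an encoder like this is scaled dot-product attention. A minimal NumPy sketch, with toy shapes (the token count and batch size are made up; only the model dimension of 512 comes from the project description):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, the core of each attention layer."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy input: batch of 2 clips, 16 frame tokens, model dimension 512.
x = np.random.randn(2, 16, 512)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # (2, 16, 512)
```

A real encoder splits this across 8 heads and adds projections, residual connections, and a feed-forward block, but the attention math is the same.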

  • KMNIST DC-GAN

    A custom deep convolutional generative adversarial network (DC-GAN) for generating Kuzushiji (cursive Japanese) characters.
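A DC-GAN generator upsamples with transposed convolutions, and the spatial sizes follow a simple formula. The sketch below shows that arithmetic for the common kernel 4 / stride 2 / padding 1 configuration; the specific resolutions are illustrative, not necessarily those of this model.

```python
def convT_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a transposed conv (no output_padding):
    out = (in - 1) * stride - 2 * pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel

# A typical DC-GAN generator doubles the resolution at each block,
# e.g. 4 -> 8 -> 16 -> 32.
s = 4
for _ in range(3):
    s = convT_out(s)
print(s)  # 32
```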

  • Fine-tuning Long-Short Context Network (LoCoNet) on AVDIAR Dataset

    Goal: learn how to fine-tune a model and work with a large codebase.

    Using the AVDIAR2ASD GitHub repo, I evaluated the LoCoNet model on this dataset. Before fine-tuning it achieves 85-86% accuracy. After fine-tuning for 15 epochs with only the last 3 layers unfrozen (chosen by trial and error to prevent catastrophic forgetting), I reached 88%.
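The "unfreeze only the last N layers" selection can be sketched in plain Python. The layer names here are hypothetical (LoCoNet's real modules differ); in PyTorch the equivalent step is setting requires_grad = False on the parameters of the frozen layers.

```python
# Hypothetical layer ordering, earliest to latest (not LoCoNet's actual names).
layers = ["visual_frontend", "audio_frontend", "fusion",
          "temporal_1", "temporal_2", "classifier"]

def split_for_finetuning(layers, n_unfrozen=3):
    """Freeze everything except the last n_unfrozen layers, so the
    pretrained early features are preserved (avoiding catastrophic
    forgetting) while the later layers adapt to the new dataset."""
    cut = len(layers) - n_unfrozen
    return layers[:cut], layers[cut:]

frozen, trainable = split_for_finetuning(layers)
print(trainable)  # ['temporal_1', 'temporal_2', 'classifier']
```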

  • 3 Different CIFAR-10 Models

    Implemented a vanilla CNN, a custom 3-layer ResNet, and transfer learning with ResNet-18 on the CIFAR-10 dataset.
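The piece that distinguishes the ResNet variants from the vanilla CNN is the identity-shortcut residual block, out = ReLU(F(x) + x). A minimal NumPy sketch, where dense weights stand in for the 3x3 convolutions of a real ResNet:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity-shortcut residual block: out = ReLU(F(x) + x).
    The skip connection lets gradients bypass F, easing optimization."""
    return relu(relu(x @ w1) @ w2 + x)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))            # batch of 4 feature vectors
w1 = rng.standard_normal((64, 64)) * 0.1    # stand-ins for conv weights
w2 = rng.standard_normal((64, 64)) * 0.1
out = residual_block(x, w1, w2)
print(out.shape)  # (4, 64)
```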