Overview
152 bird species from passive audio — hierarchical classification at Order and Family levels
A deliberate choice to use structured acoustic feature engineering rather than end-to-end deep learning — demonstrating signal processing transferable to clinical audio.
The BirdCLEF 2022 Kaggle competition presents passive audio recordings from Hawaiian bird monitoring stations under real field conditions: background noise, wind, and overlapping calls across 152 species. Instead of deep learning on raw spectrograms, this project centers on systematic acoustic feature engineering with librosa and hierarchical XGBoost classification on 107 tabular features. The models predict taxonomic Order first (17 classes), then narrow to Family (47 classes), rather than predicting species directly.
Why hierarchical classification matters clinically: The same decomposition principle applies to clinical diagnostic AI — predicting broad clinical category first (e.g., cardiac vs. respiratory), then narrowing. It's more interpretable than a single 152-class head and makes the AI's reasoning auditable at each level. The acoustic feature engineering pipeline (MFCC, Chroma, Spectral Contrast, Spectral Rolloff) is directly transferable to clinical audio signals: auscultation, phonocardiography, lung sound analysis, and wearable biosignal processing.
Feature Engineering
107-feature acoustic pipeline — built with librosa from scratch
All features extracted from raw .ogg audio with a 1–8 kHz Butterworth bandpass filter to isolate typical birdsong frequencies and suppress environmental noise. Base set: 13 MFCC means, 12 Chroma pitch-class means, 7 Spectral Contrast means, Spectral Centroid, Bandwidth, Zero Crossing Rate, and geographic coordinates (encoded as Cartesian x/y/z on a unit sphere to preserve continuity at ±180°). Augmented set adds temporal derivatives (delta-MFCC), standard deviations, and additional energy/rolloff features — 107 total.
Hierarchical classification pipeline
XGBoost with Bayesian hyperparameter search (BayesSearchCV) at both Order and Family level. SMOTE class-balancing before fitting. Stratified 70/15/15 split. Feedforward neural network comparison (grid search over depth/width/dropout). AWS Streamlit deployment with per-class confidence scores.
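One common way to wire the two levels together at inference time is to route each sample through an Order model and then through a Family model specific to the predicted Order. The sketch below illustrates that routing only. Whether the Family stage here was trained per Order or as a single 47-class model is not stated, so the per-Order layout is an assumption, and `predict_hierarchical` plus the stub models and their probabilities are hypothetical placeholders, not project results.

```python
from typing import Callable, Dict, Sequence

Probs = Dict[str, float]
Model = Callable[[Sequence[float]], Probs]


def predict_hierarchical(
    features: Sequence[float],
    order_model: Model,
    family_models: Dict[str, Model],
) -> Dict[str, object]:
    """Pick the most likely Order, then the most likely Family within it."""
    order_probs = order_model(features)
    order = max(order_probs, key=order_probs.get)

    # Route to the Family model that belongs to the predicted Order.
    family_probs = family_models[order](features)
    family = max(family_probs, key=family_probs.get)

    return {
        "order": order,
        "order_confidence": order_probs[order],
        "family": family,
        "family_confidence": family_probs[family],
    }


# Stub models with made-up probabilities, standing in for trained classifiers.
def stub_order_model(features):
    return {"Passeriformes": 0.8, "Charadriiformes": 0.2}


stub_family_models = {
    "Passeriformes": lambda f: {"Fringillidae": 0.75, "Monarchidae": 0.25},
    "Charadriiformes": lambda f: {"Laridae": 0.6, "Scolopacidae": 0.4},
}

result = predict_hierarchical([0.0] * 107, stub_order_model, stub_family_models)
```

Returning the confidence at each level separately keeps the intermediate decision visible, which is what makes the two-stage output auditable.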
What this project built that transfers to clinical AI
Acoustic signal processing from raw audio: Understanding what MFCC, Chroma, and Spectral Rolloff physically represent, rather than just calling them as functions, is directly transferable to clinical audio and biosignal analysis (auscultation, ECG, respiratory monitoring).
Hierarchical taxonomic classification design: Decomposing a 152-class problem into a two-level structure with separate models at each level introduced structured multi-stage inference, the same pattern followed by clinical decision support systems that triage before diagnosing.
Spherical geographic encoding: Raw longitude values jump at ±180°, so two stations on either side of the antimeridian appear almost 360° apart and distance-based computation breaks. Cartesian (x, y, z) encoding on a unit sphere removes the jump. The feature engineering decision is rooted in domain geometry, not just statistics.
AWS Streamlit deployment: First project in this portfolio with live cloud deployment. End-to-end pipeline from feature extraction to deployed inference endpoint — the deployment pattern carried into all subsequent clinical AI systems.
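For the spherical geographic encoding described above, a minimal sketch of the conversion follows. It uses the standard latitude/longitude to unit-sphere formula; the function name `latlon_to_unit_xyz` and the two example points are hypothetical, chosen only to show the behavior at the ±180° boundary.

```python
import math


def latlon_to_unit_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


# Two nearby points on opposite sides of the antimeridian.
west = latlon_to_unit_xyz(20.0, 179.9)
east = latlon_to_unit_xyz(20.0, -179.9)

raw_lon_gap = abs(179.9 - (-179.9))   # looks huge in raw degrees
chord = math.dist(west, east)         # stays small in x/y/z space
```

In raw degrees the two points differ by nearly a full turn in longitude, while their Cartesian coordinates stay close together, so a model working on x/y/z sees them as neighbors.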
My contributions
- Full feature engineering pipeline: bandpass filter, 107 acoustic features, geographic encoding, all 6 feature dataset files
- XGBoost hierarchical classification: Bayesian hyperparameter search, OvA strategy, feature combination experiments
- FNN grid search (depth/width/dropout), SMOTE, evaluation pipeline
- AWS Streamlit deployment: live demo with hierarchical predictions and per-class confidence scores