Classic Data Science Algorithms from Scratch
A collection of foundational machine learning and statistical algorithms implemented from first principles in Python. Each implementation emphasizes the underlying mathematics rather than hiding it behind black-box library calls.
Algorithms Implemented
K-Means Clustering
Complete implementation of the K-means clustering algorithm with support for different distance metrics and normalization. Applied to real-world datasets including Sacramento housing data and earthquake coordinates with spherical coordinate transformations.
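As an illustration of the core loop, here is a minimal K-means sketch using Euclidean distance. The `kmeans` function name, iteration cap, and empty-cluster handling are simplifications for this example, not the project's full implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]  # keep old centroid if a cluster empties
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```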
Random Forest
Full decision tree and random forest implementation including entropy calculations, information gain splitting, and ensemble voting. Includes tree visualization capabilities and performance testing on classification datasets.
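The split criterion comes down to entropy and information gain. Below is a hedged sketch of those two calculations; the function names are illustrative rather than taken from the project's code.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, y_left, y_right):
    """Reduction in entropy achieved by splitting y into y_left and y_right."""
    n = len(y)
    weighted = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y) - weighted
```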
Naive Bayes Classifier
Text classification implementation using Naive Bayes with Laplace smoothing. Built for spam detection with feature extraction from raw text messages and probabilistic classification.
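A minimal sketch of training and prediction with Laplace smoothing, assuming whitespace-tokenized documents and word-count features. The function names and the tiny example corpus are illustrative, not the project's actual spam dataset.

```python
import numpy as np
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Per-class log priors and Laplace-smoothed log likelihoods for each vocabulary word."""
    classes = sorted(set(labels))
    vocab = sorted({w for doc in docs for w in doc.split()})
    word_idx = {w: i for i, w in enumerate(vocab)}
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = np.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values())
        # Laplace smoothing: every word gets a pseudo-count of alpha
        log_likelihood[c] = np.array([
            np.log((counts[w] + alpha) / (total + alpha * len(vocab))) for w in vocab
        ])
    return classes, word_idx, log_prior, log_likelihood

def predict(doc, classes, word_idx, log_prior, log_likelihood):
    """Score each class in log space and return the argmax."""
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in doc.split():
            if w in word_idx:
                score += log_likelihood[c][word_idx[w]]
        scores[c] = score
    return max(scores, key=scores.get)

# Toy usage with made-up messages
docs = ["win cash now", "meeting at noon", "cash prize win", "lunch at noon"]
labels = ["spam", "ham", "spam", "ham"]
model = train_naive_bayes(docs, labels)
print(predict("win cash", *model))  # -> "spam"
```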
Kalman Filter
State estimation implementation for tracking dynamic systems with process and measurement noise. Includes predict and update phases with covariance propagation for optimal state estimation.
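A compact sketch of a single predict/update cycle for a linear Kalman filter, assuming a constant dynamics matrix F, observation matrix H, and noise covariances Q and R. Variable names follow standard textbook notation rather than the project's code.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter.
    x: state estimate, P: state covariance, z: new measurement."""
    # Predict: propagate state and covariance through the dynamics model
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: blend prediction and measurement via the Kalman gain
    y = z - H @ x_pred                   # innovation
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```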
Metropolis-Hastings MCMC
Bayesian inference implementation using Markov Chain Monte Carlo sampling. Includes posterior distribution sampling for statistical parameters and Ising model simulation for statistical physics applications.
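A minimal random-walk Metropolis sketch for a one-dimensional target specified by its log density. The step size, burn-in length, and the standard-normal example target are placeholders, not the posteriors or Ising model used in the project.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=10_000, step=0.5, burn_in=1_000, seed=0):
    """Random-walk Metropolis sampler for a 1-D target given by its log density."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for i in range(n_samples + burn_in):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, target(proposal) / target(x)), evaluated in log space
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        if i >= burn_in:
            samples.append(x)
    return np.array(samples)

# Example: sample from a standard normal (log density up to a constant)
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
```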
Technical Implementation
Core Features
- Pure Python: All algorithms are built using only NumPy, SciPy, and the standard library
- Mathematical Foundations: Each implementation follows the theoretical formulations from the literature
- Real Data Applications: Tested on actual datasets including earthquake coordinates, housing data, and text corpora
- Visualization: Comprehensive plotting and analysis of algorithm performance and results
Key Challenges Addressed
Numerical Stability: Implemented proper handling of edge cases like empty clusters in K-means and numerical precision issues in probability calculations for Naive Bayes.
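As an illustration of why log-space arithmetic matters here, the toy comparison below shows a product of many small probabilities underflowing to zero while the equivalent sum of logarithms stays usable; the numbers are made up purely for demonstration.

```python
import numpy as np

# Multiplying many small probabilities underflows to 0.0 in floating point;
# summing their logarithms keeps the class comparison well-defined.
probs = np.full(1000, 1e-5)
print(np.prod(probs))         # 0.0 (underflow)
print(np.sum(np.log(probs)))  # about -11512.9 (usable score)
```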
Efficiency Optimization: Vectorized operations using NumPy for performance while maintaining readability and mathematical clarity.
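For example, the broadcasting pattern below computes every point-to-centroid distance in a single NumPy call and matches the naive double loop; the array sizes are arbitrary and chosen only for illustration.

```python
import numpy as np

X = np.random.rand(1000, 2)  # points
C = np.random.rand(5, 2)     # centroids

# Loop version: one distance at a time
dists_loop = np.array([[np.linalg.norm(x - c) for c in C] for x in X])

# Vectorized version: broadcasting computes all 1000 x 5 distances at once
dists_vec = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)

assert np.allclose(dists_loop, dists_vec)
```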
Statistical Accuracy: Proper implementation of convergence criteria, acceptance rates, and burn-in periods for MCMC methods to ensure valid statistical inference.
Results
The implementations successfully replicate the behavior of their scikit-learn counterparts while providing complete transparency into the algorithmic mechanics. Performance testing shows comparable accuracy on standard datasets with clear visualization of algorithm internals.
What I Learned
Building these algorithms from scratch deepened my understanding of the mathematical foundations underlying modern machine learning. The process revealed the importance of numerical stability, convergence criteria, and proper statistical methodology that are often hidden in high-level libraries.