Classical ML Projects

Avazu Click-Through Rate (CTR) Prediction

In online advertising, click-through rate (CTR) is a key metric for evaluating ad performance, so click prediction systems are widely used in sponsored search and real-time bidding. This project predicts CTR from data provided by Avazu. Exploratory data analysis (EDA) is performed first, followed by a preprocessing pipeline: feature standardization, automatic outlier detection, Variance Inflation Factor (VIF) analysis, automatic feature selection, and balancing of the imbalanced classes using SMOTE (Synthetic Minority Over-sampling Technique). Models are then built with XGBoost, Multi-Layer Perceptrons and Logistic Regression. Finally, a 95% confidence interval is constructed for how likely an ad is to be clicked.
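
The description above does not say how the 95% confidence interval was built. As a minimal sketch, assuming the interval is taken over an observed click proportion, one common choice is the Wilson score interval. The function name `wilson_interval` and the click counts below are hypothetical and are not taken from the project.

```python
import math
from typing import Tuple


def wilson_interval(clicks: int, impressions: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a click proportion (z = 1.96 gives roughly 95%)."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    p_hat = clicks / impressions                      # observed CTR
    denom = 1 + z ** 2 / impressions
    centre = (p_hat + z ** 2 / (2 * impressions)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / impressions + z ** 2 / (4 * impressions ** 2)
    )
    return centre - half_width, centre + half_width


# Hypothetical counts, for illustration only.
low, high = wilson_interval(clicks=170, impressions=1000)
print(f"Observed CTR 0.170, approximate 95% interval: [{low:.4f}, {high:.4f}]")
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] even when the observed CTR is very small, which is typical for ad data.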

King County Housing Price Prediction

Real estate is a booming business, and AI can help us better understand housing price trends and estimate the price of a house from parameters such as number of bedrooms, geographic location and square footage. This project models housing price data for King County, USA using Generalized Linear Models (GLMs), Gradient Boosting Machines (GBMs) and Random Forests (RF), with L2 regularization (ridge regression). The data was tested for normality using Q-Q plots, and homoscedasticity was observed and maintained. The K-Best features were selected using ANOVA and variance thresholding, and the final models were evaluated using R-squared and adjusted R-squared.
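
The two evaluation metrics named above can be sketched in a few lines. This is an illustrative computation only: the prices, predictions and feature count below are made up, and the helper names `r_squared` and `adjusted_r_squared` are hypothetical rather than taken from the project.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2: float, n_samples: int, n_features: int) -> float:
    """Penalize R-squared for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up prices (in thousands of dollars) and predictions, for illustration only.
actual = [300, 450, 500, 620, 710, 800]
predicted = [310, 430, 520, 600, 700, 820]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_features=2)
print(f"R-squared: {r2:.4f}, adjusted R-squared: {adj_r2:.4f}")
```

Adjusted R-squared is the more useful of the two when comparing models built on different numbers of selected features, since plain R-squared never decreases as features are added.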

Amazon Product Reviews Sentiment Prediction

Understanding user sentiment toward a product is key in the e-commerce space for judging the product's success and driving future decisions about it. In this project, we used Selenium to scrape ~5000 reviews of the Amazon Kindle Paperwhite. Each review came with a user rating. The review text was encoded using four techniques: word-level TF-IDF (Term Frequency-Inverse Document Frequency), character-level TF-IDF, n-gram-level TF-IDF and count vectorizers. The transformed data was then modeled with Naive Bayes to predict the user rating on a scale of 0 (Bad) to 5 (Great) from the text of a review.
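
To make the three TF-IDF variants concrete, here is a minimal standard-library sketch of the encoding step only (the Naive Bayes model is not shown). It assumes a smoothed IDF variant and simple whitespace tokenization. The project may well have used a library vectorizer instead, and the sample reviews and function names are invented for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def word_tokens(text: str) -> List[str]:
    """Word-level tokens: lowercase, split on whitespace."""
    return text.lower().split()


def word_ngrams(text: str, n: int = 2) -> List[str]:
    """N-gram-level tokens: runs of n consecutive words."""
    words = word_tokens(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Character-level tokens: runs of n consecutive characters."""
    lowered = text.lower()
    return [lowered[i:i + n] for i in range(len(lowered) - n + 1)]


def tfidf(docs: List[str], tokenize: Callable[[str], List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


# Invented sample reviews, not scraped data.
reviews = [
    "great screen and great battery",
    "battery died after a week",
    "screen is great",
]
word_vectors = tfidf(reviews, word_tokens)
bigram_vectors = tfidf(reviews, word_ngrams)
char_vectors = tfidf(reviews, char_ngrams)
print(sorted(word_vectors[0].items(), key=lambda item: -item[1]))
```

The same `tfidf` function serves all three variants; only the tokenizer changes, which is what distinguishes word-level, n-gram-level and character-level encodings.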

Dysphonia Detection Using Gaussian Mixture Models

Dysphonia is a voice disorder that usually occurs in people who use their voice frequently, such as singers and teachers. It affects the larynx (the voice box, the organ forming an air passage to the lungs) and introduces hoarseness in the voice. Subjective detection methods exist and often rely on the human sense of hearing. This work is an effort to create an objective solution to the problem of dysphonia detection. The speech signals are preprocessed to extract approximately 25 speech features, including but not limited to Mel-Frequency Cepstral Coefficients (MFCCs), jitter, shimmer and short-time energy. This low-dimensional feature space is projected to a higher dimension using i-vectors (a processing technique), which use a Universal Background Model (UBM) based on a Gaussian Mixture Model (GMM). The high-dimensional feature space is then used as a speaker recognition prior, after which the transformed speech signals are classified using Support Vector Machines (SVMs), K-Nearest Neighbors (KNN) and Naive Bayes.
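
Three of the features named above have simple textbook definitions, sketched below. This is not the project's extraction code: it assumes pitch periods and peak amplitudes have already been measured per glottal cycle, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence


def short_time_energy(signal: Sequence[float], frame_len: int, hop: int) -> List[float]:
    """Sum of squared samples in each frame of length frame_len, stepping by hop."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def relative_perturbation(values: Sequence[float]) -> float:
    """Mean absolute difference of consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def local_jitter(periods: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the pitch period (local jitter)."""
    return relative_perturbation(periods)


def local_shimmer(amplitudes: Sequence[float]) -> float:
    """Cycle-to-cycle variation of the peak amplitude (local shimmer)."""
    return relative_perturbation(amplitudes)


# Hypothetical per-cycle measurements, for illustration only.
periods = [0.0100, 0.0102, 0.0099, 0.0101]   # seconds
amplitudes = [0.80, 0.78, 0.83, 0.79]        # arbitrary units

jitter = local_jitter(periods)
shimmer = local_shimmer(amplitudes)
energy = short_time_energy([1, 2, 3, 4], frame_len=2, hop=2)
print(f"jitter={jitter:.4f}, shimmer={shimmer:.4f}, energy={energy}")
```

A perfectly steady voice would give jitter and shimmer of zero, so larger values of either indicate the kind of cycle-to-cycle irregularity associated with hoarseness.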