Scikit-Learn ML Foundations - Classification & Regression Coursework

A five-part scikit-learn coursework series building hands-on fluency in classification and regression - from loading toy datasets through train/test splits to evaluating models with confusion matrices and prediction plots.

My Role

Student

Company

Baylor University

Industry

AI/ML, Data Science, Coursework

Business Problem

Coursework framing rather than a market problem. The Advanced Python module needed to move from language fundamentals into applied machine learning, building the muscle memory to load a dataset, split it, fit a model, predict, and evaluate the results - the same loop that underlies almost every real ML project before any deep learning or production tooling enters the picture.

Tools Used

Pythonscikit-learnpandasNumPyMatplotlibSeabornHugging Face Datasets

Key Features

Iris classification: KNN on the classic 150-sample three-species dataset, with a Seaborn confusion matrix to visualize misclassifications
Handwritten Digits classification: KNN on 8x8 pixel digit images, plotting sample inputs and reporting accuracy plus a 10x10 confusion matrix
Diabetes regression: Linear Regression on the scikit-learn diabetes dataset, printing coefficients and intercept and plotting predicted vs. expected with a y=x reference line
Auto MPG regression: Linear Regression on the Hugging Face auto-mpg dataset, with cleaning of non-numeric columns, scatter of weight vs. mpg, and an actual-vs-predicted plot
California Housing regression: Linear Regression across all eight features, per-feature scatterplots vs. median house value, and an expected-vs-predicted plot bounded to a shared axis range
Consistent workflow across all five: load, inspect shape and DESCR, train_test_split with a fixed random_state, fit, predict, and visualize - so the pattern stays visible across problem types

My Role & Contribution

Sole student author. Wrote each script end-to-end, picked the visualization that best exposed model behavior for each dataset (confusion matrix for classification, prediction-vs-actual scatter for regression), and kept the train/test/evaluate loop structurally identical across the five so the underlying scikit-learn pattern is easy to read at a glance.

Biggest Challenge

The Auto MPG dataset shipped with non-numeric values (notably '?' in the horsepower column) that broke LinearRegression.fit silently with confusing dtype errors. Solved by dropping the non-numeric car-name column, coercing everything else with pd.to_numeric(errors='coerce'), and dropping rows with NaNs before splitting - a small but real lesson in why data cleaning is the unglamorous majority of any ML project.

What I Learned

The scikit-learn API is deliberately uniform: load, split, fit, predict, score - the same five steps regardless of whether the model is KNN, Linear Regression, or anything else in the library. That uniformity is the whole point. Once the loop is muscle memory, swapping algorithms or datasets becomes cheap, and attention shifts to the parts that actually matter: feature quality, data cleaning, and choosing the evaluation that exposes how the model is wrong. These foundations underpin the more applied AI work elsewhere on this portfolio.