Supervised Learning: Instance-based Learning and K-Nearest Neighbors
This is the 5th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.
What is “Instance Based”?
Recall that Supervised Learning approximates a function. Predictions are then made by plugging values into that function, without any reference to the actual data.
An alternative approach just puts all the raw data (“all instances”) in a database, and, when queried, looks up the corresponding output. No time is spent doing any learning, so it is fast and simple.
However, of course, we lose out on generalization (for example if we are asked something close to but not exactly what we have data for) and we are prone to overfitting (for example if our data is noisy).
So Instance Based Learning looks at the nearest neighbors to decide what any queried point should be. Specifically, the k nearest neighbors.
K Nearest Neighbors
Given:
- Training Data D = {Xi, Yi}
- Distance Metric d(q, x) (representing your domain knowledge - a way to quantify the similarity of one thing to another)
- Number of neighbors k (also relying on your domain knowledge as to what matters)
- A query point q
The algorithm is simply: given D, find the k nearest neighbors (K-NN) based on your distance metric d(q, x).
You can do both classification and regression this way:
- Classification: based on a vote of the Yi’s - your point gets whatever label the most neighbors have
- Regression: the average of the Yi’s
Instead of a simple vote or simple average that weighs all neighbors the same, you can also weight them by closeness so that closer neighbors count more.
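The whole procedure fits in a few lines of Python. Here is a minimal sketch (the helper names and toy dataset are mine, not from the course), covering the vote, the average, and the inverse-distance-weighted variant:

```python
from collections import Counter
import math

def knn(D, q, k, d):
    """Return the k nearest (distance, y) pairs to the query q."""
    return sorted(((d(q, x), y) for x, y in D))[:k]

def knn_classify(D, q, k, d):
    # Plurality vote among the k nearest labels.
    votes = Counter(y for _, y in knn(D, q, k, d))
    return votes.most_common(1)[0][0]

def knn_regress(D, q, k, d, weighted=False):
    neighbors = knn(D, q, k, d)
    if not weighted:
        return sum(y for _, y in neighbors) / k
    # Weighted variant: closer neighbors count more (inverse-distance weights).
    w = [(1.0 / (dist + 1e-9), y) for dist, y in neighbors]
    return sum(wi * y for wi, y in w) / sum(wi for wi, _ in w)

def euclidean(q, x):
    return math.dist(q, x)  # Python 3.8+

# Toy data: (feature vector, output) pairs
D = [((0, 0), 0.0), ((1, 0), 1.0), ((0, 1), 1.0), ((5, 5), 10.0)]
print(knn_regress(D, (0.5, 0.5), k=3, d=euclidean))  # average of the 3 nearest Yi's
```

Passing weighted=True switches the regression from a plain average to the closeness-weighted average described above.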
Lazy vs Eager
You can loosely break up these algorithms into learning and querying stages:
- The learning stage computational complexity of K-NN is O(1), while Linear Regression is O(N).
- The querying stage computational complexity of K-NN is O(log N), while Linear Regression is O(1).
Thus more work is frontloaded by Linear Regression, making it an “eager” learning algorithm, whereas K-NN does more work while querying, making it “lazy.”
In Python
For some light programming practice, try solving the problems pictured here:
Here’s a sample Repl solution calculating nearest neighbors by Euclidean and Manhattan distance.
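In the same spirit as that Repl (the toy points below are my own, not the pictured problem), the two metrics are easy to write down, and the same query can have a different nearest neighbor under each:

```python
import math

def euclidean(q, x):
    # Straight-line distance: square root of summed squared differences.
    return math.sqrt(sum((qi - xi) ** 2 for qi, xi in zip(q, x)))

def manhattan(q, x):
    # City-block distance: sum of absolute differences.
    return sum(abs(qi - xi) for qi, xi in zip(q, x))

q = (0, 0)
a, b = (3, 0), (2, 2)
print(euclidean(q, a), euclidean(q, b))  # 3.0 vs ~2.83 -> b is closer
print(manhattan(q, a), manhattan(q, b))  # 3 vs 4 -> a is closer
```

So the choice of metric really is domain knowledge: it changes which neighbors get a vote.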
Note that the real answer (based on the hidden function) was 18, which K-NN doesn’t get close to by any of these methods.
KNN’s biases
KNN’s preference bias (its beliefs about what makes a good hypothesis) is:
- Locality - assumes that Near Points are “similar”
- Smoothness - using averaging
- All features matter equally <— this assumption leads to the Curse…
Curse of Dimensionality
As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially!
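A quick back-of-the-envelope illustration of why (the 0.2-wide neighborhood is my own arbitrary choice): if "near" means within 0.1 of the query along every axis of a unit hypercube, the fraction of space that counts as near shrinks exponentially with the dimension d, so the number of uniformly-spread points needed to expect even one near neighbor grows exponentially:

```python
# A box of side 0.2 around the query covers 0.2**d of a unit hypercube,
# so you need roughly 1 / 0.2**d uniformly-spread points to expect
# a single "near" neighbor.
for d in [1, 2, 3, 10]:
    fraction = 0.2 ** d
    print("d=%2d  near fraction=%.2e  points needed ~%.0f" % (d, fraction, 1 / fraction))
```

At d = 10 you already need nearly ten million points, which is the Curse in a nutshell.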
A blended approach
Instead of a Euclidean/Manhattan weighted/unweighted average approach to estimating the final result, we can also combine a KNN with regression to create locally-weighted regression to have the best of both worlds. This isn’t just limited to regression - within the defined nearest neighbors, you can use any of the methods we have covered so far, like neural networks or decision trees.
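As one concrete instance of this blend (a sketch under my own simplifying assumptions: 1-D inputs and ordinary least squares as the local model), locally-weighted regression can fit a line to just the k nearest neighbors of the query:

```python
def local_linear_predict(D, q, k):
    """Fit a least-squares line to the k points nearest q, then evaluate it at q."""
    neighbors = sorted(D, key=lambda p: abs(p[0] - q))[:k]
    xs = [x for x, _ in neighbors]
    ys = [y for _, y in neighbors]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    denom = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / denom if denom else 0.0
    intercept = y_mean - slope * x_mean
    return slope * q + intercept

# On locally-linear data this interpolates within the neighborhood
# instead of just averaging the neighbors' outputs.
D = [(0, 0.0), (1, 2.0), (2, 4.0), (10, 0.0)]
print(local_linear_predict(D, 1.5, k=3))  # fits y = 2x on the 3 nearest -> 3.0
```

Swapping the least-squares fit for a decision tree or small neural network over the same neighborhood gives the other combinations mentioned above.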
Optimal Data Structure
We can also consider more efficient data structures for kNN than a simple lookup table. Ball Trees are promising in this regard.
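The O(log N) query cost mentioned earlier comes from exactly this kind of structure. In one dimension the idea reduces to keeping the data sorted and binary-searching it (a minimal stdlib sketch; Ball Trees, such as scikit-learn's BallTree, generalize the same divide-and-prune idea to many dimensions):

```python
import bisect

def nearest_1d(sorted_xs, q):
    """Binary-search a sorted list for q's nearest neighbor in O(log N)."""
    i = bisect.bisect_left(sorted_xs, q)
    # At most two candidates straddle the insertion point.
    candidates = sorted_xs[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda x: abs(x - q))

xs = sorted([1, 4, 9, 16, 25])
print(nearest_1d(xs, 10))  # -> 9
```

A brute-force scan of the same list is O(N) per query, which is why the tree-based structures matter once the database gets large.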
Next in our series
Hopefully that was a good introduction to Instance-based Learning and K-Nearest Neighbors. I am planning more primers and would love your feedback and questions on:
- Overview
- Supervised Learning
- Unsupervised Learning
- Randomized Optimization
- Information Theory
- Clustering - week of Feb 25
- Feature Selection - week of Mar 4
- Feature Transformation - week of Mar 11
- Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- “True” RL - week of Apr 1
- Game Theory - week of Apr 15