Unsupervised Learning: Feature Transformation
Feature transformation is the problem of preprocessing a set of features to create a new (smaller, more compact) feature set, while retaining as much relevant or useful information as possible. In linear algebra terms, for example, you can create new features that are linear combinations of existing features.
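As a small illustration of "new features as linear combinations of old ones" (the numbers here are invented for the sketch), a projection matrix does exactly this:

```python
import numpy as np

# Four samples with three original features each (toy data).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [2.0, 0.0, 1.0]])

# P maps each 3-feature row to 2 new features; each new feature
# is a linear combination of the old ones (columns of P are the weights).
P = np.array([[0.5, -1.0],
              [0.5,  0.0],
              [0.5,  1.0]])

X_new = X @ P  # shape (4, 2): the smaller, transformed feature set
```

Every feature-transformation method below amounts to a principled way of choosing a matrix like `P`.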
Any problem with a large number of features that produce plenty of false positives and negatives will want some form of feature transformation.
One example: the information retrieval (or ad hoc retrieval) problem, like Google needing to look up documents based on given search terms. For Google, all the words in a document are features (give or take some cleaning). That’s a lot of words, which brings us the curse of dimensionality. But most relevantly here, the words also aren’t very good indicators, because polysemy and synonymy cause a lot of false positives and negatives. Feature selection alone will not solve this problem.
PCA (Principal Components Analysis) is an example of an eigenproblem. PCA finds the directions of maximum variance, which are mutually orthogonal. With all the principal components you can perfectly reconstruct (or describe) your data, but more commonly you keep only a subset, which minimizes the L2 reconstruction error for the number of dimensions you pick.
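To make the eigenproblem concrete, here is a minimal numpy sketch (the correlated synthetic data is my own assumption) that extracts principal components from the covariance matrix, checks that they are orthogonal, and measures the L2 error of reconstructing from only the top component:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature is mostly a scaled copy of the first.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])
Xc = X - X.mean(axis=0)  # PCA assumes centered data

# PCA as an eigenproblem on the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # returned in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The principal directions are mutually orthogonal.
dot = eigvecs[:, 0] @ eigvecs[:, 1]             # ~0

# Keeping only the top component and projecting back gives the best
# (minimum squared error) 1-D reconstruction of the data.
Z = Xc @ eigvecs[:, :1]
X_rec = Z @ eigvecs[:, :1].T
err = np.mean((Xc - X_rec) ** 2)
```

Using both eigenvectors instead of one would reconstruct `Xc` exactly, which is the "perfect reconstruction" property above.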
Eigenvalues and Dimensionality Reduction
Eigenvalues indicate the importance of each principal component, and a 0 eigenvalue means that dimension is irrelevant. PCA does dimensionality reduction by sorting the eigenvalues and picking the components with the top k values. You can visualize this with a scree plot.
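The scree-plot logic can be written directly in code. A hedged sketch with scikit-learn (the 5-D synthetic data, built from 2 latent signal directions plus noise, is an assumption for illustration): `explained_variance_ratio_` is the sorted eigenvalue spectrum, normalized, and top-k selection is a cutoff on its cumulative sum.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 5-D data where only 2 latent directions carry signal.
latent = rng.normal(size=(300, 2))
mix = rng.normal(size=(2, 5))
X = latent @ mix + 0.01 * rng.normal(size=(300, 5))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # eigenvalue spectrum, descending

# Top-k selection: keep the fewest components explaining 99% of the variance.
k = int(np.searchsorted(np.cumsum(ratios), 0.99)) + 1
```

Plotting `ratios` against component index gives the scree plot; the "elbow" here sits at the two signal components.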
You use PCA to transform the features into a new space where you know how to do the filtering (drop the components with the smallest eigenvalues).
PCA is about finding correlation by maximizing variance.
ICA (Independent Components Analysis) instead tries to maximize independence: it transforms a set of features X into new features Y with no mutual information between any pair of Y features, but maximum mutual information between X and Y (so you can still reconstruct the data).
This is applied in the Blind Source Separation problem (aka the Cocktail Party Problem), where microphones try to isolate the sound of individual speakers despite picking up extra noise. Each microphone receives some linear combination of the multiple sources, and you can model this with ICA and work backwards to recover sound streams of individual voices.
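A minimal blind-source-separation sketch with scikit-learn's `FastICA` (the sine and square "voices" and the 2x2 mixing matrix are assumptions standing in for real audio): two microphones each record a linear mix, and ICA works backwards to the independent sources.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two independent "voices": a sine wave and a square wave.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.column_stack([s1, s2])

# Each microphone hears a different linear combination of the sources.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# ICA recovers maximally independent sources from the mixtures,
# up to scale, sign, and ordering.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
```

The "up to scale, sign, and ordering" caveat is exactly the "unsorted" property contrasted with PCA below.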
In contrast to PCA, ICA’s components are local and unsorted. ICA focuses on unique traits, which is a very nice property for sound and natural images - on natural images, its components turn out to be edge detectors! When you apply it to our original problem of documents, ICA gives you topics: collections of words that select for specific subjects.
RCA: Random Components Analysis (random projection). Instead of finding the dimensions with highest variance, it just projects onto random directions. It works remarkably well if the next thing you do is classification, because you still maintain information while lowering dimensionality, so downstream you can still pick up the correlations. It’s less efficient per dimension than PCA, but much faster to compute.
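A hedged sketch of random projection feeding a classifier, using scikit-learn's `GaussianRandomProjection` (the 200-feature synthetic classification task is an assumption for illustration):

```python
from sklearn.random_projection import GaussianRandomProjection
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# High-dimensional labeled toy data.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

# Project onto 20 random directions instead of principal components.
rp = GaussianRandomProjection(n_components=20, random_state=0)
X_small = rp.fit_transform(X)

# A downstream classifier can often still pick up the signal
# in the much smaller random subspace.
clf = LogisticRegression(max_iter=1000).fit(X_small, y)
acc = clf.score(X_small, y)
```

Note that fitting `rp` only draws the random matrix; it never looks at the variance of the data, which is why it is so cheap.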
LDA: Linear Discriminant Analysis finds a projection that discriminates based on the label (like supervised learning) - it separates the classes. Since the 90’s, Latent Dirichlet Allocation has become the more popular meaning of the acronym LDA.
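A short sketch with scikit-learn's `LinearDiscriminantAnalysis` on the Iris dataset (chosen here as a convenient labeled example): unlike PCA or ICA, the `fit` call takes the labels `y`, and the learned projection separates the classes.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# With c classes, LDA yields at most c - 1 discriminative components
# (3 classes here, so 2 components).
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit_transform(X, y)  # note: labels are required
```

The same fitted object also acts as a classifier in that projected space.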
Further notes on this topic:
Hopefully that was a good introduction to Feature Transformation. I am planning more primers and would love your feedback and questions on:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- “True” RL - week of Apr 1
- Game Theory - week of Apr 15