0. (think for yourself; we will discuss this next class)--what is "rank" of a matrix? what happens to the rank of a matrix if you pad it with rows that are linear combinations of its existing rows, and pad it with columns that are linear combinations of existing columns? What could this possibly have anything to do with LSI and dimensionality reduction?
1. Give a scenario where two words may be close in the context of a query, but not close over the whole document corpus (and vice versa).
What does this tell us about global vs. local thesauri construction?
2. We mentioned that LSI is a special case of LDA. In particular, the most important dimension, as unearthed by LSI, may not be the most important
dimension to separate relevant vs. irrelevant docs w.r.t. a query. Given this, could we make any justification for LSI analysis (as against LDA?)
3. I mentioned that LSI does not capture "nonlinear" dependencies, and mentioned "circular" fish data (i.e., a bunch of artificial fish, which, when plotted in the length/width space, will look like a circle (or, if you prefer, a very nice cubic curve or a sine wave etc.).
3.a. Since we don't quite know up front whether or not the data under consideration has non-linear dependecies among its dimensions, what is the "risk" we are taking in applying LSI?
3.b.[Tricky] Consider the following off beat idea. A nonlinear dependency may still be viewed as a "linear" dependency between non-linear dimensions. This suggests the following highly quixotic idea for dimensionality reduction:
First *explode* the dimensionality by adding all possible higher order combinations of base dimensions (if you start with x and y dimensions, consider
making up new terms such as x^2, y^2, x*y etc)
Now, do LSI in this (vastly higher dimensional) space; pick the top k most important dimensions.
Suppose I do the above for only up to nth order terms (i.e, terms of type x^k y^(j-k) with j less than or equal to n+1). Can you see this idea actually doing
dimensionality reduction at least conceptually? What do you see as the main computational hurdle?