Tuesday, September 9, 2008

Thinking Cap qns on Correlation analysis..

0. (think for yourself; we will discuss this next class)--what is "rank" of a matrix? what happens to the rank of a matrix if you pad it with rows that are linear combinations of its existing rows, and pad it with columns that are linear combinations of existing columns? What could this possibly have anything to do with LSI and dimensionality reduction?

1. Give a scenario where two words may be close in the context of a query, but not close over the whole document corpus (and vice versa).
    What does this tell us about global vs. local thesauri construction?

2. We mentioned that LSI is a special case of LDA. In particular, the most important dimension, as unearthed by LSI, may not be the most important
   dimension to separate relevant vs. irrelevant docs w.r.t. a query. Given this, could we make any justification for LSI analysis (as against LDA?)

3. I mentioned that LSI does not capture "nonlinear" dependencies, and mentioned "circular" fish data (i.e., a bunch of artificial fish, which, when plotted in the length/width space, will look like a circle (or, if you prefer, a very nice cubic curve or a sine wave etc.).

3.a. Since we don't quite know up front whether or not the data under consideration has non-linear dependecies among its dimensions, what is the "risk" we are taking in applying LSI?

3.b.[Tricky]  Consider the following off beat idea. A nonlinear dependency may still be viewed as a "linear" dependency between non-linear dimensions. This suggests the following highly quixotic idea for dimensionality reduction:

        First *explode* the dimensionality by adding   all possible higher order combinations of base dimensions (if you start with x and y dimensions, consider  
              making up new terms such as  x^2, y^2, x*y etc)

       Now, do LSI in this (vastly higher dimensional) space; pick the top k most important dimensions.

  Suppose I do the above for only up to nth order terms (i.e, terms of type x^k y^(j-k) with j  less than or equal to n+1).  Can you see this idea actually doing
dimensionality reduction at least conceptually? What do you see as the main computational hurdle?


Nishant said...

1. Considering the case of terms being close to each other in the context of the query, local thesauri

construction will give a better estimation of the co-relation between the terms provided a good query log

is available. But if a query log is not avialable then probably global thsauri construction will be the

best shot but then intially since the corpus could be huge, the thershold value of the corelation should be

kept low.

2. The most important thing that is achieved by LSI is in terms of reducing the dimensionality curse. By

reducing the no of dimensions, LSI helps in reducing the search space which means less computations.

3.a We will be risking in getting a bad estimation of the data as LSI might reject the dimensions which

could give better linerality if given more data samples

b. Well , the idea of adding dimensions like x^k seems cool where in the non linear patterns can be viewed

as linear in terms of these non linear dimensions but I guess while adding more and more dimensions, the

points ( read terms to terms vector space) will lie more towards the boundary of the sphere if we consider

the sphere as n dimension space, so there wont be much discrimination between dimensions.

These are my intial comments , looking forward to the thrashing :P

Pierre said...

2. I am not sure of it, but I don't think we can compare LDA and LSI : they are not on the same level of abstraction. LDA is just the global description of a principle. LSI is one — simple — method applying this principle.

3a. it might link not-related things and separate related things, leading to give poor results.

3b. yes, this would be a good idea to reduce dimensions. I would compare that to the SVG format for storing images. Instead of saving each pixel's color, shapes and their colors are saved. This gives much much lighter images.

The main hurdle is of course the cost of trying all x^k y^(j-k). But one solution might be not to try all x^k y^(j-k) but instead, only the most common relations, such as gaussian, x^2, 1/x, ... In biology and physics, we often note that most relations can be categorized with just a few set of model.

lakshmi said...

1. (baby, crawling) are two words which are highly correlated in a general context, but in the context of queries related to web data mining, they are not.

2. (images, matrices) are two words which are highly correlated in the context of image processing, but are not close in the general context.

lakshmi said...

This tells us that global and local thesauri can vary by a high degree.

Subbarao Kambhampati said...

Pierre said...

2. I am not sure of it, but I don't think we can compare LDA and LSI : they are not on the same level of abstraction. LDA is just the global description of a principle. LSI is one — simple — method applying this principle.

I assume you followed my discussion in the last slide of the class today. If so, can you elaborate as to why you still think LDA and LSI are incomparable?


Vijayakrishnan Nagarajan said...

Example: Federer wiki

The global thesauri would have very small value for a connection between those two words. Since the local is a subset of the global, and local has only those pages which have the two words, the connection would be strong. If the two words have connection in the global(have a small value) i think the local should be proportionally higher than the global as we are just magnifying.

lakshmi said...

I am guessing that LSI finds the best dimension to represent the corpus of data and is thus independent of the query. So, it needn't be done for every query.

If LDA requires a query to determine the meaning of relevance (am not sure of this), it needs to be done for every query which is expensive.

Vijayakrishnan Nagarajan said...
This comment has been removed by the author.
Vijayakrishnan Nagarajan said...

3b. The computation would increase with the size of the information. So, it would be difficult to do on web which has millions of pages.

Vidya said...

LSI is a dimension reduction technique that is done prior to classifying the document set as relevant or irrelevant, possibly using LDA, given a query. It is used to eliminate redundant dimensions that are transforms of other key dimensions.

These two methods are meant to do different tasks, and hence cannot be compared.

Pierre said...

Rao said...

I assume you followed my discussion in the last slide of the class today. If so, can you elaborate as to why you still think LDA and LSI are incomparable?

I am not sure if I understood well, please correct me if this is wrong.

What I understood is that LDA is a principle to reduce dimension. This principle is applied into concrete methods : the LSI is one of them, but one might also use other methods, with other discriminant functions. Then we should compare the LSI with those other concrete methods. LDA "as such" does not give results but a general idea to build methods.

Eric Helser said...

1. On the topic of psychology, the phrase "word salad" may appear very frequently in a handful of documents. However, for the entire corpus, this word pair will not appear very often.

This shows that individual documents can have a vastly different local thesaurus than the global thesaurus.

2. LSI could be improved by lumping documents into groups if they can be easily identified. For example, if a large number of outliers exists for a particular query, separate them, and perform LSI separately on them.

3a. The risk we are taking is assuming that the data has some linear dependency. If it doesn't, then we're heading in the wrong direction and are bound to get garbage results.

3b. Yes, this will improve your ability to reduce the dimensionality. This is a common practice in the field of statistics and linear regression that takes into account variable interaction.

The main computational hurdle here is that it will make the matrix much larger, and therefore longer to process.

Farooq said...

Hmm this is in theory my first blog post and more of a test run I guess.

But I decided to try to spread the idea of the first question to include a few more words.

George Bush failure or perhaps I could just use Bush failure.
George Bush is failure. But the use of stop-words woudl probably take out the "is". These two words appear in a lot of documents. And its sort has occured due to the evolution of how people of thought about these two words together. It certainly is not in the entire corpus tho.

Girish said...

1. I assume that people generally do not associate "Cancer and Metastasis" together in a query since metastasis is a spread of disease from one organ part to another, which does not necessarily needs to be associated with Cancer, but we find several documents that have Cancer, also has Metastasis going along together. So in this case construction of a local thesauri of medical terminologies would prove to be more beneficial.
Whereas, "Java, Island " would not return any relevant document in the context of queries related to programming language, therefore one can think of a global thesauri here.