Thursday, September 4, 2008

Thinking cap questions II: Distance measures; high-dimensional spaces etc.

First, a couple of fun links, followed by more thinking-cap questions:

0. A really good paper that discusses vexing search issues in an important web-search domain

1. The Lincoln play by Woody Allen that I mentioned (about "how long should a man's legs be") is here

2. Here is a bag-of-words analysis of Bush's State of the Union addresses through 2007

3. Link to the click-bias study (which showed that people tend to assume Google knows what is relevant better than they themselves do). This is a good paper to skim to get an idea of what is involved in careful user studies (contrast their user studies with the user studies in "0" above--which were much more profitable--as in footnote 1 of that paper).

==========Thinking cap questions=============

0. A comment on the "Reduced Shakespeare Company" is that the more you know the original, the funnier it gets. Keeping that in mind, what is a sedate but academic summary of the point of the paper in "0" above?
  (For example, why do they say that instead of idf they found the df measure to be more suitable for their purposes?)

1. What is the highest value the normalized Euclidean distance measure can take (recall that it is just the distance between two unit vectors in n dimensions)?

2. What is the effect of alpha/beta/gamma values in the Rocchio relevance feedback method on precision and recall?

3. Does it make sense to do stop-word elimination in the context of vector similarity ranking with tf/idf weights?



Nishant said...

1. The maximum value is sqrt(2).
3. For stop words the tf will be pretty high but the idf will be marginal, so there is no major effect from stop-word elimination.

Ravi Gummadi said...
This comment has been removed by the author.
Ravi Gummadi said...

0. The striking point of the paper is that when you consider different vertical domains on the web, it is always possible to find interesting metrics that allow a search engine to work extremely well in those domains.

We can consider this overfitting, and implementing such tweaks in a generic search engine like Google or Yahoo isn't recommended.

But this is also primarily why search engines implement vertical counterparts like Google Scholar and Microsoft Libra, which do extremely well in specific domains.

Anupam said...

0. Since the testing data was so biased (998/1000) towards the +ve examples, the IDF measure would have given almost zero weight to the discriminating keywords, and the classifier would not have had those features to learn from. Instead, DF boosted the importance of keywords, although they were statistically uninteresting. The DF approach is not expected to work well in scenarios where only a fraction of docs are relevant, and where the generic words do not help in telling the relevant docs from the irrelevant ones.

1. The max Euclidean distance would be the square root of 2 (that's what seems to be the case in 2D and 3D spaces at least).

2. The parameter alpha seems to control how much allegiance the system should give to the previous query. It should somehow be based on how well the results of the original query agree with Dr and Dn.

Precision would tend to go down with decreasing values of gamma, and recall would tend to increase with higher values of beta. The fact that beta + gamma = 1 seems to reflect the precision-recall trade-off.

3. It would still make sense. Although (useless) stop words would be taken care of by the IDF measure, they would still dominate within each document and interfere with TF values. So stop-word elimination would help the TF measure be controlled by more domain/content-specific terms.
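The interference with TF values can be seen in a tiny sketch (a made-up toy document, using max-normalized tf, which is one common weighting choice; the stop-word list here is just illustrative):

```python
# Toy illustration of how a stop word skews max-normalized term
# frequency, even though idf would down-weight it to near zero.
doc = "the cat sat on the mat near the door".split()

def norm_tf(term, words):
    # tf normalized by the count of the most frequent term in the document
    return words.count(term) / max(words.count(w) for w in set(words))

print(norm_tf("cat", doc))       # ≈ 0.33: "the" (count 3) sets the scale

no_stop = [w for w in doc if w not in {"the", "on", "near"}]
print(norm_tf("cat", no_stop))   # 1.0: content terms set the scale
```

With "the" left in, every content term's normalized tf is deflated by a factor of three, which is exactly the interference described above.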

Pierre said...

0. As Ravi said, it depends on the domain. We could say that in pornography the data is already well labelled. Thus, the user enters labels and not really keywords. With labels, tf*idf is indeed irrelevant, and their tf*df is better.

Pornography may be the most advanced part of the web in semantics ;-)

1. I agree with Nishant, sqrt(2).

2. Increasing beta increases recall: as new keywords from relevant documents are added, other relevant documents might be found thanks to those additional keywords. However, it might decrease precision: as we add keywords that were not explicitly given by the user, we might get some completely off-topic documents.

Increasing gamma increases precision and might decrease recall, by the same reasoning.

I'm not sure about alpha... to me, we could just always set alpha to 1 and only modify beta and gamma. So I don't think alpha has much influence on precision and recall.

3. It might not make much sense, as tf*idf will anyway marginalize the stop words, as Nishant wrote. But eliminating some stop words might lighten the tf*idf computation and make retrieval faster.

Eric Helser said...

1. The answer to this one depends on whether you restrict the unit vectors to the first quadrant or not. If so, then sqrt(2) would be correct. If the unit vectors have no restrictions, then this distance can be 2 (i.e., unit vectors u and -u).
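Both cases can be checked numerically (a quick sketch with 2-D vectors for simplicity). Since tf/idf weights are non-negative, document vectors live in the first quadrant/orthant, so two orthogonal unit vectors (documents sharing no terms) give the realizable maximum of sqrt(2); a distance of 2 requires a negative component:

```python
import math

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = unit([3.0, 0.0])   # doc about term 1 only
v = unit([0.0, 5.0])   # doc about term 2 only (orthogonal to u)
w = [-x for x in u]    # antipodal unit vector (needs negative weights)

print(euclid(u, v))    # 1.4142135623730951 == sqrt(2)
print(euclid(u, w))    # 2.0
```

The identity ||u - v||^2 = 2 - 2 cos(theta) for unit vectors also makes both answers immediate: theta = 90 degrees gives sqrt(2), theta = 180 degrees gives 2.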

sainath said...

0. The paper's point is that in domain-specific search there is a lot of scope to improve precision. It proposes two such features.

1. Imperative detector:
They discovered an abundance of imperatives in the relevant documents, while imperatives were infrequent in the irrelevant documents. So they built an imperative detector to improve precision.

2. TF * DF:
TF * IDF produces excellent results when only a small number of documents in the whole retrieved set are relevant. But in domain-specific search, as their statistics show, 998 documents out of 1000 were found relevant. So if the queried term is frequent across those 1000 documents, all the better for the document in which it is most frequent.
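The contrast is stark in numbers (a hypothetical back-of-the-envelope calculation using the 998/1000 figure from the comment above; the tf value is made up):

```python
import math

N, df = 1000, 998   # the term appears in 998 of 1000 retrieved docs
tf = 5              # say the term occurs 5 times in one document

idf = math.log(N / df)   # ≈ 0.002: idf nearly zeroes the term out
print(tf * idf)          # ≈ 0.01 -- tf*idf marginalizes the term
print(tf * df)           # 4990  -- tf*df rewards in-domain frequency
```

In a collection this skewed, idf treats the domain's most characteristic terms as if they were stop words, which is exactly why the paper switches to df.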

sainath said...

In my previous post, please consider queried term to be queried terms

sainath said...

3. I agree with Nishant and Pierre.
With respect to ranking there will not be any major effect, since the IDF factor for stop words would be marginal. But response time and index space will be affected significantly: stop words are very common, so they take up space (in the lexicon and postings lists) and a significant amount of processing time. This is why stop-word elimination should be encouraged.

Sanil said...

0. I believe their experiments yielded better results for df as compared to idf because of the choice of domain. In the chosen domain, the words used are seldom found in irrelevant documents (those which don't fall in this domain). As a result, the higher the df, the higher the relevance of that term for pornography classification. Hence the result.

2. According to various publications, the values of alpha, beta and gamma for the Rocchio algorithm are 0, 16 and 4. However, as per the experiments carried out by Cornelis H.A. Koster and Jean G. Beney in "On the Importance of Parameter Tuning in Text Categorization", optimal query results are obtained for beta and gamma values of 0.7 and 1. If negative term frequencies are omitted, beta can be set to 0.1. If we choose a very small value for beta, both precision and recall fall catastrophically.

3. Yes, it does make sense to eliminate stop words even while using tf*idf in vector similarity ranking algorithms. Though the idf for stop words is very small, their tf can be large. In any case, stop words hardly relate to the relevance of documents when single terms, and not phrases, are evaluated as vector components.

Pierre said...

Some comments about the "fun links" :

- The bag-of-words analysis of Bush's State of the Union addresses is quite basic: it just looks for the exact word, without stemming and without taking care of word separation. This creates some issues - searching "war" you get "warning", and searching "economy" you don't get "economic". Here good stemming would be useful.

- It might be logical that "people tend to assume Google knows what is relevant better than they themselves do". Looking for information we didn't know where to get, we used to ask a librarian or some other knowledgeable person. Now one might see Google as such an entity and interact with it the same way: by trusting it.

Vidya said...

2. α does not have anything to do with precision or recall; it only retains a weighted component of the initial query in the optimized query.

I don't believe β+γ=1 has to be always true. If we are dealing with a trustworthy classification of the documents into relevant and irrelevant, both β and γ can be given a high value. In such a case, since β pushes the query vector closer to the relevant documents, it increases recall. Similarly, since γ pushes the vector away from the irrelevant documents, it increases the precision by making sure that documents from the irrelevant set are not retrieved.

However, in practice, we want to trust the user-identified relevant documents more than the ones that he has overlooked and are assumed irrelevant. Hence the higher β value and lower γ value.
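The roles the comments above assign to α, β and γ can be sketched in a few lines (toy vectors over a three-term vocabulary; the vectors are made up, and the default weights are just commonly cited illustrative values, not from any paper discussed here):

```python
def centroid(vectors):
    # component-wise mean of a list of equal-length vectors
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    cr, cn = centroid(rel), centroid(nonrel)
    # alpha keeps the original query; beta pulls toward the relevant
    # centroid; gamma pushes away from the non-relevant centroid.
    # Clamp at 0 since tf/idf weights are non-negative.
    return [max(0.0, alpha * q[i] + beta * cr[i] - gamma * cn[i])
            for i in range(len(q))]

q      = [1.0, 0.0, 0.0]                     # user's original query
rel    = [[0.0, 1.0, 0.0], [0.0, 0.8, 0.2]]  # judged relevant
nonrel = [[0.5, 0.0, 0.0]]                   # judged non-relevant

print(rocchio(q, rel, nonrel))   # ≈ [0.925, 0.675, 0.075]
```

Note how terms 2 and 3, absent from the original query, acquire positive weight (query expansion, helping recall), while the weight on term 1 is pushed down by the non-relevant centroid (helping precision).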

Vidya said...

Another thing that I do not agree with is the precision-recall trade-off, as mentioned in some of the previous comments. I do not think that an increase in precision implies a decrease in recall or vice versa.

In fact, as β takes the query vector closer to the relevant centroid, for a fixed number of results, the precision of the result set increases as a result of increased recall.

Dejun Yang said...

1. The answer is sqrt(2) for two document vectors, because the angle between them can be at most 90 degrees. It is 2 for two general vectors.

bharadhwaj said...

The fact that the test conducted is domain-specific forces the use of df instead of idf: there is less uniqueness among keywords, and eliminating commonly occurring terms essentially eliminates highly relevant documents from the results.

The highest value, as most have discussed, is sqrt(2), which occurs between two totally unrelated documents represented as normalized vectors.

The need for a new query Q1 arises when the IR system learns that the current results are not satisfactory. Increasing both beta and gamma will ensure more user-centric / user-satisfying results, but only a proportionate value of alpha ensures relevance to the original search query. Again, it comes down to whether the user has interpreted his/her query right or wrong!! ;)

Mithila said...

Answer for Q0:

The paper talks of using TF*DF since IDF is best suited for a set of documents where a particular term is not found in all or most of the documents. If IDF were used here, TF*IDF would turn out to be low, since most documents in this domain would contain the term that is searched for (998/1000). Hence for a domain like pornography, using DF is most appropriate.

Answer for Q1:

Like most say: the max Euclidean distance is sqrt(2).

Harish said...

Q3. Stop-word elimination, I would say, is pretty essential. Though the IDF factor takes into account how widespread a particular word is across documents, the TF factor would still be affected by stop words. That is, say occurrences of 'the' dominate a document. Then the TF factor might take the max number of occurrences of 'the' into account instead of some other important or valid term. So it's best to do stop-word elimination before the calculations...

Vidya said...

3. Since the vector similarity ranking uses tf/idf weights, stop-word elimination does not improve the ranking. Computational benefit might be the only advantage.

Girish said...

2. I would agree with Vidya here that α only retains the weight of the previous query in the new optimized query. α, β and γ are attached to the terms to control the balance between the judged/proposed document sets and the query. Now, if we have a lot of judged documents, we would like a higher β and γ, since the new query then moves some distance toward the centroid of the relevant documents and some distance away from the centroid of the non-relevant documents. Also, since positive feedback is assumed to be more valuable than negative feedback, we can expect most IR systems to set γ < β. Therefore, relevance feedback can improve both precision and recall, depending on what the user is looking for.

Girish said...

3. Though eliminating stop words would definitely increase the chances of fetching documents containing terms that are more relevant to the query, the effect of stop words can also be negated with the help of idf.
Therefore, stop-word elimination should not make a major/significant difference in the context of vector similarity ranking.