Wednesday, October 29, 2008

Office Hours Location

Given that the project is due tomorrow, there's a chance more students will show up to today's office hours. If you plan to come, note that I'll be holding my office hours in the 2nd floor lab instead of on the fifth floor.

Thanks,
Garrett



Next topic: Text classification (required reading: Chapter 14 of Manning et al's book)

Our next topic is classification techniques. Classification is an enormous area and entire courses are devoted to it.
We will only spend about 1.5 classes on it, getting a bird's-eye view of the main issues, and look at the Naive Bayes Classifier--a technique that works about as well
as a default strategy in text classification as K-means does as a default strategy for clustering.
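If you want a concrete preview, here is a minimal sketch of multinomial Naive Bayes scoring (all class names, words, and probabilities below are made-up placeholders; the chapter develops the estimation details, including smoothing, properly):

import java.util.HashMap;
import java.util.Map;

public class NaiveBayesSketch {
    // P(class) and P(word|class), assumed already estimated from training
    // counts (with smoothing) -- estimation is not shown here.
    Map<String, Double> classPrior = new HashMap<>();
    Map<String, Map<String, Double>> wordGivenClass = new HashMap<>();

    // Score a document (bag of words) for one class; work in log space to
    // avoid underflow from multiplying many small probabilities.
    double logScore(String cls, String[] words) {
        double score = Math.log(classPrior.get(cls));
        for (String w : words) {
            Double p = wordGivenClass.get(cls).get(w);
            if (p != null) score += Math.log(p); // real code smooths unseen words
        }
        return score;
    }

    // Pick the class with the highest log score (maximum a posteriori).
    String classify(String[] words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : classPrior.keySet()) {
            double s = logScore(c, words);
            if (s > bestScore) { bestScore = s; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.classPrior.put("spam", 0.4);
        nb.classPrior.put("ham", 0.6);
        nb.wordGivenClass.put("spam", Map.of("free", 0.05, "meeting", 0.001));
        nb.wordGivenClass.put("ham", Map.of("free", 0.005, "meeting", 0.03));
        System.out.println(nb.classify(new String[]{"free", "free"})); // spam
    }
}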

The classification techniques are also useful in content-based filtering (which will come up in the discussion of recommender systems to start next week).

The reading for tomorrow is chapter 14 of Manning et al's book

rao

Tuesday, October 28, 2008

Lost Backpack

I noticed a backpack sitting on the ground in the second floor lab near where I was sitting for office hours.  Did one of you leave your bag by chance?

 

Garrett

Project 2 Fix

It was pointed out that some of the page names in the LinkExtract file have a leading space and therefore might not be matched when looked up by the name without the space. Some of you have worked around this already, but if you are having problems, there are two quick fixes.

1)  Simply edit the Hashedlinks file, replacing ", " with "," (i.e., replace each comma followed by a space with just a comma).

2)  Edit the LinkExtract.java file and add the following line in the constructor, right after this line: sin = sin.substring(sin.indexOf("[")+1,sin.indexOf("]"));

sin = sin.replace(", ", ",").trim();

You can add this line yourself and recompile, or you can download the updated file from here (I didn't want the .java file to get caught by virus checkers):

http://asu.mine.nu/CSE494/LinkExtract.zip
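If you want to sanity-check what the fix does, here is a tiny standalone sketch (the input string below is made up; the real data comes from the Hashedlinks/LinkExtract files):

public class FixCheck {
    public static void main(String[] args) {
        String sin = "[pageA, pageB, pageC]";
        // Same two steps as in the LinkExtract constructor:
        sin = sin.substring(sin.indexOf("[") + 1, sin.indexOf("]"));
        sin = sin.replace(", ", ",").trim();
        // Splitting on "," now yields names without leading spaces.
        for (String name : sin.split(",")) {
            System.out.println("[" + name + "]"); // [pageA] [pageB] [pageC]
        }
    }
}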



Garrett


Additional Office Hours

Just a reminder that today I'll be holding additional office hours in the 2nd floor lab. I will be there right after today's class for those who may have last-minute questions regarding project 2.

Garrett

Monday, October 27, 2008

Additional Office Hours

Several students have expressed interest in meeting tomorrow so I will be holding additional office hours right after the class.  Due to the number of students I’m planning to meet, I’ll be holding the office hours in the 2nd floor lab.  I’ll be answering questions regarding the project and the basic approach/algorithm or other similar issues you may be having.  Please don’t come expecting a lesson in Java or help debugging your code as I want to give priority to those students who are having trouble with the actual approach or who need some clarification regarding the project description.

 

Thanks,

Garrett

 

 

GAAP (Grade Anxiety Amelioration Program)

As the drop date is approaching, I am getting emails from people worried about their grades.
I haven't completed mid-term grading and am not likely to get it done this week because of some other
pressing deadlines.

As I said, I don't go with a 95-->A+, 90-->A sort of scale. So you shouldn't worry too much based just on your
cumulatives.

I also recognize that people who can't keep up with the demands of the course leave--thus I normally don't have the
same kind of bottom distribution as might be present in other classes.

My general advice to anyone who has come this far in the course, has done all the assignments, and is enjoying the
course is to de-stress (note the "de", not "di") and continue. Your personal mileage may, of course, vary.

Rao

ps: If it helps, here are the final cumulatives and the actual reported grades for a previous offering of this course. The first
three are 471 and the rest are 598. This is for illustrative purposes only. There are *no* implicit guarantees that if you get those points, you will get those grades. If you have other questions, feel free to
send me email either directly or anonymously.

88.40 A+
79.57 A
62.83 B

 
91.97 A+
90.61 A+
87.55 A+
85.76 A
85.11 A
84.34 A
84.00 A
83.88 A
83.85 A
81.27 A-
79.92 B+
78.76 B+
78.63 B+
74.76 B
69.64 B
69.35 B

Project 2 Due Date Approaching

I’ve heard from a few of you regarding project 2, so I’m assuming the rest of you aren’t having any issues so far. If it would be helpful, I can hold additional office hours before or after class tomorrow because, as you know, the project due date is posted as this Thursday the 30th.

 

If anyone is interested, let me know and I will fix a time.

 

Garrett

Thursday, October 23, 2008

Fwd: Faculty Seminar Series Presentation on October 24th 2008 (Friday) by Professor Arun Sen: ORG-ENG-CSE_Online_Com

Folks:

We have an embarrassment of riches tomorrow--as far as talks relevant to CSE494 go. You already know about Zaiqing Nie's talk on object-level search in the morning.
This talk by Prof. Sen will likely connect to and/or extend the class discussion on social networks. Hope you can make it and enjoy both talks.

Rao


---------- Forwarded message ----------
From: Chitta Baral <chitta.baral@asu.edu>
Date: Thu, Oct 23, 2008 at 9:57 PM
Subject: Faculty Seminar Series Presentation on October 24th 2008 (Friday) by Professor Arun Sen: ORG-ENG-CSE_Online_Com
To: "ORG-ENG-CSE_Online_Com"


(Students are most welcome to these Faculty seminar series presentations.)

Date: Oct 24th 2008
Time: 3:00 PM (cookies and coffee from 2:45 PM)
Place: BYAC 110 (Same place as in previous weeks)

Title : A hitchhiker's guide to Network Science

By: Prof. Arun Sen

Abstract:

Network Science is a newly emerging discipline that draws upon a variety of well-established disciplines such as graph theory, game theory, probability theory, algorithms and complexity theory, machine learning, and data mining. The techniques from Network Science can be applied to a variety of application domains--computer networks, electric power grid networks, social networks, transportation networks, and biological networks, to name just a few. In this talk we will undertake a journey through the galaxy of networks and find out how the basic tenets of Network Science rule all these different planets.

Next topic: Clustering--readings are Chapters 16 & 17 from Manning et al's IR book (as linked)

My slides are up.

rao

Reminder: Talk tomorrow by Zaiqing Nie (ASU Alumnus; currently at Microsoft Research) on 10/24 on "Object-level Web Search"


[If you are interested in talking to Nie one-on-one, let me know.]

Title: Object-level Web Search

Speaker: Zaiqing Nie, Lead Researcher, Microsoft Research Asia.

Venue: Room BY 210; 10/24; 10:30 AM

Abstract:

Current Web search engines can be considered page-level general search engines whose main functionality is to rank web pages according to their relevance to a given query. However, there are various kinds of objects embedded in static Web pages or Web databases. Typical objects are people, products, papers, organizations, etc. We can imagine that if these objects could be extracted and integrated from the Web, powerful object-level search engines could be built to meet users' information needs more precisely. In this talk, I will discuss this new trend of building what we call "object-level search engines". The research problems that we need to address include large-scale web classification, object-level information extraction, object identification and integration, and object relationship mining and ranking. I will introduce the overview and core technologies of object-level search engines as implemented in three working systems: Libra Academic Search (http://libra.msra.cn), Windows Live Product Search (http://products.live.com), and Guanxi – a search engine for entity relationship discovery (http://renlifang.msra.cn).

 

Bio:

Zaiqing Nie is a lead researcher in the Web Search & Mining Group at Microsoft Research Asia. He graduated in May 2004 with a Ph.D. in Computer Science from Arizona State University. He received both his Master's and Bachelor of Engineering degrees in Computer Science from Tsinghua University. His research interests include data mining, machine learning, and Web information integration and retrieval. Nie has many publications in prestigious conferences and journals including SIGKDD, ICML, WWW, CIDR, ICDE, TKDE, and JMLR. His recent academic activities include serving as program committee co-chair of IIWeb 2007 and as a program committee member of KDD 2008, SIGIR 2008, WWW 2008, ICML 2008, ACL 2008, AAAI 2008, etc. Some technologies he developed have been transferred to Microsoft products/services including Windows Live Product Search and Windows Live Search in China, Libra Academic Search, and Renlifang Guanxi Search.


Tuesday, October 21, 2008

Re: About Calculator

Rainforests schmainforests. What have they ever done for us, other than providing a breeding ground for billiard-ball-sized mosquitoes?

But if you insist: using laptops to look at the slides and calc.exe is fine (I thought I had already mentioned this.. but maybe not).

Rao


On Tue, Oct 21, 2008 at 12:14 PM, Nicholas Vaidyanathan <Nicholas.Vaidyanathan@asu.edu> wrote:
Similarly, I store all my notes on my laptop and would like to avoid contributing to wasted paper and destruction of rainforests...can I keep the slides and calc.exe open and use it (no trawling the net, etc etc)?


On Tue, Oct 21, 2008 at 7:42 AM, Subbarao Kambhampati <rao@asu.edu> wrote:
This would be fine. Honor code will be enforced.

rao

ps: The innumeracy of America seems to be on the rise. The "calculator"-related questions seem to be the single largest class of questions I have received :-( Maybe on Thursday we should do Kumon...


On Mon, Oct 20, 2008 at 11:26 PM, Farooq Khera <fkhera@asu.edu> wrote:
Can we use a regular calculator and just promise not to use the matrix functionality? 'cuz I'm sure most engineers only own one calculator, usually the Texas Instruments kind.


On Mon, Oct 20, 2008 at 9:43 PM, Subbarao Kambhampati <rao@asu.edu> wrote:
Yes--basic calculators will be allowed (but not those with matrix or vector operations).

rao


On Mon, Oct 20, 2008 at 3:30 PM, Shuiwang Ji <shuiwang.ji@gmail.com> wrote:
Prof. Rao,

Is calculator (with basic arithmetic functionality) allowed for the exam?

Thanks,

Shuiwang
 





--
Namaste,
Nicholas Vaidyanathan
Doctoral Student @ Arizona State University
President of IEEE CS Society (2008)
School of Computing and Informatics


Re: About Calculator

This would be fine. Honor code will be enforced.

rao

ps: The innumeracy of America seems to be on the rise. The "calculator"-related questions seem to be the single largest class of questions I have received :-( Maybe on Thursday we should do Kumon...


On Mon, Oct 20, 2008 at 11:26 PM, Farooq Khera <fkhera@asu.edu> wrote:
Can we use a regular calculator and just promise not to use the matrix functionality? 'cuz I'm sure most engineers only own one calculator, usually the Texas Instruments kind.


On Mon, Oct 20, 2008 at 9:43 PM, Subbarao Kambhampati <rao@asu.edu> wrote:
Yes--basic calculators will be allowed (but not those with matrix or vector operations).

rao


On Mon, Oct 20, 2008 at 3:30 PM, Shuiwang Ji <shuiwang.ji@gmail.com> wrote:
Prof. Rao,

Is calculator (with basic arithmetic functionality) allowed for the exam?

Thanks,

Shuiwang
 


Monday, October 20, 2008

Re: About Calculator

Yes--basic calculators will be allowed (but not those with matrix or vector operations).

rao


On Mon, Oct 20, 2008 at 3:30 PM, Shuiwang Ji <shuiwang.ji@gmail.com> wrote:
Prof. Rao,

Is calculator (with basic arithmetic functionality) allowed for the exam?

Thanks,

Shuiwang
 

Re: Doubt on random graphs

Sorry--the (1/2)N^2 is a mistake. That is the *total number* of edges in a complete graph: N(N-1)/2, which is approximately (1/2)N^2. k is the average degree--i.e., the number of edges each vertex is involved in. For a complete graph it is N-1, or approximately N (there is an edge from each vertex to each of the other vertices). In general, for edge probability p, the expected degree is k = p(N-1), approximately pN, while the expected total number of edges is pN(N-1)/2.

I corrected the slide.


Rao


On Mon, Oct 20, 2008 at 8:21 PM, Mithila Nagendra <mnagendr@asu.edu> wrote:
Hello Dr. Rao
 
In the slides for random graphs, there is a point where the expected degree k for p=1.0 is given as equal to (1/2)N^2. The following slide says that k=N for p=1.0, which makes sense since k=pN. I don't understand how k can also equal (1/2)N^2.
 
Thank you
Mithila Nagendra

(should work now) Re: answers to qn 5 & 6 of hw2 posted online

it should work now...

rao


On Mon, Oct 20, 2008 at 2:51 PM, Garrett Wolf <Garrett.Wolf@asu.edu> wrote:
FYI - the link to the additional solutions is not working...





On Mon, Oct 20, 2008 at 2:45 PM, Subbarao Kambhampati <rao@asu.edu> wrote:



answers to qn 5 & 6 of hw2 posted online


qn re: the use of Rocchio method




While doc-doc similarity computation can be costlier than doc-(short-query) similarity computation, it is still considered a small cost (after all, you have to do it to cluster documents, for example).

Having said that, the most obvious place where current search engines use this method is when you click the "similar results" link under any result. (See also the discussion in http://nlp.stanford.edu/IR-book/html/htmledition/relevance-feedback-on-the-web-1.html )

rao



On Mon, Oct 20, 2008 at 10:58 AM, Liang Sun <sun.liang@asu.edu> wrote:
Dear Prof.
I have one question about the Rocchio method. In the Rocchio method, we use documents to update the query. Initially the query is very short; after feedback, however, the query is very long, its length comparable to that of a document. Thus it may take a long time to compute the similarity between the new query and the documents. My question is: is the Rocchio method used in real IR systems, e.g., Google? Thanks.


On Mon, Oct 20, 2008 at 9:23 AM, Subbarao Kambhampati <rao@asu.edu> wrote:
Recall that in relevance feedback, we show K docs to the user, who marks r of them to be relevant (and thus K-r of them to be irrelevant). We then compute the new query as alpha*(old query) + beta*(sum of relevant doc vectors)/r - gamma*(sum of irrelevant doc vectors)/(K-r).

Since we are showing only *one* document to the user (K=1), and the user says it is relevant (r=1), the number of irrelevant documents we have shown is K-r = 0, which means the gamma term drops out.
[You can't assume that the other 1023 are irrelevant documents--we haven't shown them to the user--so how do we know that she would have found them to be irrelevant?]

3.[3pt] Suppose the user is shown D in response to the query Q, and the user says
that D is relevant to his query. If we now use relevance feedback to modify Q, what
will the query vector become? Assume that alpha, beta and gamma are all 1.


rao

On Wed, Oct 15, 2008 at 8:59 AM, Farooq Khera <fkhera@asu.edu> wrote:
You were going to provide some insight about the specimen midterm question on relevance feedback, in reference to irrelevant documents?
I was confused about how you calculate the sum of the irrelevant documents. It would seem there are 1023 irrelevant documents,
and so we know the number of irrelevant documents, such that part of the formula is gamma/(# of irrelevant documents) * (sum of the irrelevant document vectors).

However, it's not well defined what the irrelevant document vectors are--
unless somehow you can just take the whole corpus as a vector and then subtract the relevant document parts... which makes sense now I think lol...

-Farooq Khera




--
Liang Sun
Research Associate
Department of Computer Science & Engineering
Arizona State University


Re: midterm is open notes/book?

Yes (but I already said that this was a very high probability event..)



rao


On Mon, Oct 20, 2008 at 9:29 AM, Farooq Khera <fkhera@asu.edu> wrote:
You said you would clarify this before the day of the midterm.
Is the midterm open notes/book?

Re: question about midterm

Recall that in relevance feedback, we show K docs to the user, who marks r of them to be relevant (and thus K-r of them to be irrelevant). We then compute the new query as alpha*(old query) + beta*(sum of relevant doc vectors)/r - gamma*(sum of irrelevant doc vectors)/(K-r).

Since we are showing only *one* document to the user (K=1), and the user says it is relevant (r=1), the number of irrelevant documents we have shown is K-r = 0, which means the gamma term drops out.
[You can't assume that the other 1023 are irrelevant documents--we haven't shown them to the user--so how do we know that she would have found them to be irrelevant?]

3.[3pt] Suppose the user is shown D in response to the query Q, and the user says
that D is relevant to his query. If we now use relevance feedback to modify Q, what
will the query vector become? Assume that alpha, beta and gamma are all 1.
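For concreteness, here is a minimal sketch of this update in code (the 3-term vectors below are made up):

public class RocchioSketch {
    // q' = alpha*q + (beta/|Dr|)*sum(relevant) - (gamma/|Dn|)*sum(irrelevant)
    static double[] rocchio(double[] q, double[][] rel, double[][] irrel,
                            double alpha, double beta, double gamma) {
        double[] out = new double[q.length];
        for (int i = 0; i < q.length; i++) {
            out[i] = alpha * q[i];
            for (double[] d : rel)   out[i] += beta  * d[i] / rel.length;
            for (double[] d : irrel) out[i] -= gamma * d[i] / irrel.length;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] q = {1, 0, 0};            // old query
        double[][] rel = {{0, 1, 1}};      // the one shown doc, marked relevant
        double[][] irrel = {};             // K - r = 0 docs judged irrelevant
        // With irrel empty, the gamma loop never runs -- exactly the
        // "gamma term drops out" argument above.
        double[] qNew = rocchio(q, rel, irrel, 1, 1, 1);
        System.out.println(java.util.Arrays.toString(qNew)); // [1.0, 1.0, 1.0]
    }
}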


rao

On Wed, Oct 15, 2008 at 8:59 AM, Farooq Khera <fkhera@asu.edu> wrote:
You were going to provide some insight about the specimen midterm question on relevance feedback, in reference to irrelevant documents?
I was confused about how you calculate the sum of the irrelevant documents. It would seem there are 1023 irrelevant documents,
and so we know the number of irrelevant documents, such that part of the formula is gamma/(# of irrelevant documents) * (sum of the irrelevant document vectors).

However, it's not well defined what the irrelevant document vectors are--
unless somehow you can just take the whole corpus as a vector and then subtract the relevant document parts... which makes sense now I think lol...

-Farooq Khera

Sunday, October 19, 2008

Homework 2

You can pick up your graded homework 2 from my desk (557AD) between 12pm and 5pm. I may also be at my desk prior to 10:30am if you want to stop by and check.

I apologize for the delay in grading; something important came up and I wasn't able to have them graded in time. I also needed to go back and re-grade several of them. One suggestion I would like to make: please keep your homework in the order of the problems assigned. A large number of students handed in homework that wasn't in order. Also, please be sure to circle your answers and leave some space between problems. I found it very difficult to determine the start/end of problems for many students this time around.

Here are the stats:

Total Points:  43

Max:  43
Min:     7
Avg:  36.4


Extra office hours on Monday 2-3pm

I will be available for exam-related consultations between 2-3pm tomorrow (Monday).

Rao



Friday, October 17, 2008

Results of poll on "discussion" classes--FYI


In your opinion, do discussion classes like the one today have any utility?


40% (12) Yes. Let us have more of them.

60% (18) Somewhat. But let us not get carried away

0% (0) Mostly a vehicle to slow the class down. Nothing really learned.

30 voters have answered this question.



If you have more nuanced comments, feel free to write them down (these will be anonymous)







Many of the questions were very interesting. I really benefited from the discussion.


Well, I liked the way today's session went, but something was lacking... I'm not able to pinpoint it. Maybe the next go would turn out better.

Helps get a general overview of the subject at hand. Definitely interesting!
Re-stating the question is useful, as we often can't hear other people.
I would prefer shorter in-class discussions. We already have the opportunity to discuss everything on the blog. If there were hot blog discussions, it would probably make sense to bring them into class.
The idea of discussions in itself seems good, but today's discussion was more like summarizing what we've been doing in the class. Most of the questions/comments were at a very high level of abstraction, and not targeted at the core technical points.
Many people felt forced to talk and spoke without having any opinion.

I feel that, in the future, such discussions should/could be conducted after every topic is completed

Discussion classes could be really fruitful if well directed and planned, and used only for very important issues. I consider them especially useful for issues that have not been completely understood or are still somewhat obscure. Probably, next time, the class could be polled to know whether a discussion is worth an entire class session.


I did think that it was very helpful. One problem I had is that certain students were bringing up irrelevant issues that were (a) not pertinent to the discussion topic and (b) probably would be covered later in class. I think a perhaps better solution would be to interject discussion topics WITHIN lectures. So dedicate 5-10 minutes of open discussion about a lecture topic. That way we do not fall behind on the subject matter, we don't miss out on your amazing (and funny) lectures, BUT DO have the opportunity to talk about things that we do not understand.


I think the discussions are good once in a while to mix up the regular routine.
Discussions like these help put everything that we learn in class together and see the whole picture.

Too many questions; not really a discussion. Maybe it should be moderated more, and there should be predefined specific topics for the discussion.

Thursday, October 16, 2008

Online poll on discussion class effectiveness

Please vote on the effectiveness of having discussion classes like today's in the online, single-question poll available at

http://www.misterpoll.com/polls/362296

Vote early, but *not* often.

regards
Rao

Wednesday, October 15, 2008

To the people who suddenly found that they cannot post to the blog..

To those of you who have realized recently that you cannot post to the blog:

This is because you never bothered to accept the invitation that was sent to you. After the 14-day drop-add date, I flushed all the pending and unaccepted invitations (since the invites were also sent to students who eventually dropped). So, if you didn't accept the invite within the 14 days, you don't have permission, and you will have to be issued a new invitation. Ask either Garrett or me for it.

(I also find it disappointing that the realization that you didn't have permission to post came only after I requested a mandatory post to the blog. This seems to suggest that you never even *considered* responding to the many thinking-cap questions before, because they were not mandatory.)

Rao

Links brought about by PageRank paper

I subscribe to the alien time machine theory myself...


Talking about systems design, here's a great article about search engines from former Stanford scientist, then Google employee, and now Cuil co-founder Anna Patterson


But will all this really be necessary? What about when mashups start doing the heavy lifting for us? Enter the Ubiquitous experience...

Tuesday, October 14, 2008

Mid-term postponed to Tuesday (In-class); Thursday will be mandatory discussion (+rationale for the decision)


I have decided to postpone the mid-term exam to next Tuesday 10/21. It will be held in class.

On Thursday, 10/16, we will continue our discussion of the google paper. If the discussion runs dry, I will start the
next topic (on clustering).

In addition to the google paper you have already read, I highly encourage you to read the following two documents as well (both are optional readings
in the "overall search engine" topic):

http://rakaposhi.eas.asu.edu/cse494/henzinger02challenges.pdf

http://googleblog.blogspot.com/2008/07/technologies-behind-google-ranking.html



--------------------------
Here is the rationale for the decision--in the order of importance--if you care:

1. Garrett said he will have the homeworks graded by Thursday. Since some of you wanted graded homeworks in hand, it makes sense to
have the exam after Thursday.

2. It makes sense to complete the discussion of the google paper now, before you forget what you read, rather than next Tuesday. Also, the discussion might
serve as a review of sorts for most of the topics we have covered until now (and thus may help in your exam preparation).

3. Of the three people who expressed a preference against shifting the exam to Tuesday, two have since changed their minds. One of them did make travel plans
that would have taken them out of town on Tuesday. I have offered them a make-up exam (it is a wash for me, since I had to give a make-up for another person who
would have missed the exam if it was on Thursday..)

...

...

...

100. Having thrice as much time to prepare the exam will help me make it a third as easy.

...
.
...

1000. It seems the honorable thing to do is to "suspend" the exam so all of you will have a chance to hear the final Obama-McCain debate tomorrow and see the two
gentlemen viciously tear each other down (so Ralph Nader will get elected and everyone gets a free membership in Costco).

regards
Rao



*IMPORTANT* re: the midterm exam date (please read)

Folks

I seem to have made a terrible faux pas by asking the questions about exam-postponement preferences in the order I did.

If I had been thinking clearly, I would have *first* asked whether anyone has irreconcilable conflicts if the exam date is shifted to Tuesday, and only if no one did should I have asked whether the majority prefer shifting.

By asking the questions in the opposite order, I apparently created a bit of social pressure on the
people who said they have irreconcilable conflicts. The upshot is that one of them came by my office and changed their vote!

Creating social pariahs has not been my intention, and I apologize.

Nevertheless, since the damage has been done, if the others who voted to express irreconcilable conflicts also want to change their
vote, they should let me know by tonight.


*Unless* you get a mail stating otherwise, we will be having the mid-term exam on Thursday.


cheers
rao

ps: This episode makes the point that once you make your plans public, they are no longer just your plans--other agents can make commitments based on them. See http://rakaposhi.eas.asu.edu/replan-08.pdf for a discussion of how an intelligent agent has to keep commitments to other agents in mind while trying to exploit opportunities.. (this is from the other part of my research life).


Sunday, October 12, 2008

a specimen midterm exam (from the past)

Can be found at

http://rakaposhi.eas.asu.edu/s07-specimen-exam.pdf

Note that the coverage of topics was a bit different that time around--so the specimen exam has questions on link-analysis and clustering
(as I said, link-analysis will not be covered in this exam, and clustering hasn't been done yet). We have instead done more on indexing and
tolerant dictionaries, and also more on social networks.

cheers
Rao

Re: Midterm syllabus

Mid-term syllabus is everything covered by the two homeworks (i.e., all topics until the end of social networks).

Link-analysis topics (page-rank; authorities/hubs etc) are not included.

Rao


On Sun, Oct 12, 2008 at 3:43 AM, Vijayakrishnan Nagarajan <vnagara5@asu.edu> wrote:
Dear Professor,

Can you tell us what is the syllabus for midterm?

Thanks,
Vijay

Thursday, October 9, 2008

New Search Engine

This is a new search engine which searches our web memory.

www.infoaxe.com

Vijay

Links from today's class..

1. Here is a link about the Search King/Google lawsuit:

 http://www.pandia.com/sw-2003/21-searchking.html

2. Here is a link to an NYT article where Amit Singhal--Google search fellow--admits that page rank is just one of some 240 pieces of information they combine in making their ranking:

http://rakaposhi.eas.asu.edu/s07-cse494-mailarchive/msg00082.html

3. Here is interesting, but dated, information about the mechanics of Google dance  (when google would update its index, link matrix and page ranks)

http://dance.efactory.de/


4. Here is a link--back from late 2005--when Google first started tweaking its index for China--because China asked it so nicely that Google just couldn't refuse ;-)

http://rakaposhi.eas.asu.edu/f05-cse494-mailarchive/msg00106.html

(here is a more complete article from BBC  http://news.bbc.co.uk/2/hi/technology/4645596.stm )

Here is a general wikipedia article that also discusses current search engine status in China:

http://en.wikipedia.org/wiki/Internet_censorship_in_mainland_China


cheers
Rao









Results of the class survey

Dear all:

Thanks for taking the time to complete the class feedback survey. Thanks to Vijaykrishnan Nagarajan's exceedingly prompt and free service, we already have the tabulated results from the survey. Here are two links: the first contains the stats for the entire class; the second contains stats just for the 494 students.


http://rakaposhi.eas.asu.edu/cse494/survey-08-results-all.htm   <--the whole class

http://rakaposhi.eas.asu.edu/cse494/survey-08-494-alone.htm   <--just the 494 students

My cursory reading didn't raise any major red flags. I will look at them more carefully and respond and/or fine-tune the remaining semester. You might also want to look at them to see how your perceptions stack up against the rest of the class's.

 As before, you can use the anonymous email link sent to you earlier to give any other feedback you want to provide.

cheers
Rao




*VERY* Important: Mandatory reading/review/discussion for next class 10/14

Folks:

A large part of next Tuesday's class will be devoted to a discussion of the paper


http://rakaposhi.eas.asu.edu/cse494/google-paper.pdf

(this is the top one in the "overall search engine" category).

You should all

(1) read it and

(2) bring to class an at-most-one-page review with questions and/or observations about
the paper, given all that you know about how search engines could work.
Focus on which aspects of the architecture surprised you
and/or differ from what we discussed in class.

(3) Post your comments also onto the blog (as comments on this post)--preferably before the class.
This will also serve as a good review of a lot of what we have done in the last several weeks.

The class itself will consist of

1. A short first part where I will complete the discussion of Trust Rank and Efficient computation of Pagerank

and

2. Us discussing the questions you bring up.

This is a mandatory class and your participation is mandatory..


cheers
Rao

Wednesday, October 8, 2008

More on how the pagerank paper got rejected from SIGIR (and some philosophy on conferences)--comments?

[You can read the following either as a discussion of conference reviews etc., or as a discussion of importance estimation in
social networks using link-analysis.]

I mentioned in passing yesterday that the original pagerank paper was rejected from SIGIR (the 1998 conference). It never did
get published as a separate paper. All the cites to the idea are redirected instead to their WWW 1998 paper.

This brings up the issue of how we are supposed to judge the importance of a conference (and researcher). Here are two methods:

1. Consider the importance of a conference in terms of its *acceptance ratio*--i.e., what exceedingly small fraction of the papers submitted to the conference
are actually accepted for publication. Researchers can then be judged in terms of how many hard-to-get-into conferences they got their papers into.
The attractiveness of this approach is its immediacy--as soon as the conference acceptance decisions are made, we have the acceptance ratios, which can be
used to judge the conferences as well as the researchers. The disadvantage is that there is really no connection between the long-term impact of a paper/conference and the acceptance ratio. The pagerank example above shows that just because SIGIR is very selective doesn't mean it is right (actually, there is anecdotal evidence that the total sum of citations to all SIGIR papers is smaller than the citations to the pagerank paper it rejected ;-).
Similarly, a high acceptance ratio doesn't necessarily mean that the selection process is easy. Consider that if the Presidential race were seen as a conference, it has a whopping 50% acceptance ratio (I hope the Ralph Nader fans will forgive me for saying this). In the end, acceptance ratios tell us how much of a fad the area currently is.


2. Consider the importance of a conference in terms of the citations to the papers appearing in that conference. This is link-analysis--except generalized to a "cluster" of nodes (corresponding to the papers of that conference) rather than to individual nodes. The importance of a researcher is then measured in terms of the aggregate citations to their papers
(an example measure is the H-index: http://en.wikipedia.org/wiki/Hirsch_number ). This statistic is *not* immediately available--citations are a "lagging" measure. [The day after
Lincoln's Gettysburg address, the newspapers around the country were split almost 50-50 in their judgment of the importance of that address--with many papers calling it a
lackluster speech by a third-rate politician.] The advantage is that citations do tell us about the long-term importance/half-life of a paper (and a conference's ability to select such papers). I tell my students that if they ever find themselves writing a paper for a conference A and mostly citing papers from conferences B and C, they should reconsider their decision to send the paper to A--or admit that they are second-class ;-).

Of course, even this measure can be further improved. There may be conferences on Scientology and/or Creationism whose papers are very well cited--by other papers appearing in those same conferences--while the world at large completely ignores them. To evaluate this, you need to separate citations from within the conference/domain from those across domains. [Notice the connection to the idea of giving separate weights to intra-domain and transverse links that we mentioned in yesterday's class.]

Not surprisingly, this is why serious ranking sites go by citations rather than acceptance ratios (although authors do tend to put the acceptance ratios in their CVs!). See

http://citeseer.ist.psu.edu/impact.html  for a ranking of impact of about 1200 CS publication venues (and check where your conference occurs)

see http://libra.msra.cn for a more refined ranking--where each conference/researcher is ranked with respect to each area (considering in-domain vs. inter-domain citations). Zaiqing Nie--the architect of the Libra system--will be here on October 24th to give a talk partly about how this system is built.

===========
1 and 2 above also have an impact on the way conferences are run (and researchers do research). Coming to conferences: it is easy enough to optimize for lower acceptance ratios. But if what we want to optimize for is long-term expected impact, how do the conference--and its reviewers--do this? Notice that by rejecting just the pagerank paper, SIGIR lost about half its cumulative citations. The essay by Ken Church at http://rakaposhi.eas.asu.edu/church-precision-reviews.pdf provides a few thoughtful suggestions.
One important point the paper makes is that while precision is easy to judge, recall is harder (you can tell whether there are relevant results in the top-10 pages returned by a search engine, but you can't tell whether it left out any really important pages from the top-10). [Another related article is http://rakaposhi.eas.asu.edu/patterson-conferences.pdf ]

cheers
Rao


Tuesday, October 7, 2008

Thinking Cap questions on Page rank/ Authorities/hubs..

------
Digression:

Here is the link to the Montana history professor who took "sponsorship" from a burrito joint for his class:

http://www.missoulian.com/articles/2008/10/01/news/top/news01.txta

(Check out "Wait Wait Don't Tell me" on NPR on Sundays if you want to be kept abreast of this kind of important news ;-)
----------

Here is a thinking cap question:

The following are several "comparisons" between pagerank and Authorities/Hubs methods for computing page importance. Comment on whether or not these comparisons make sense (a bare-bones power-iteration sketch follows the list, if it helps ground the discussion).

1. A/H analysis is too costly because it has to be done for each query, while page-rank analysis, which can be done off-line and once for all queries, is much better.

2. page rank analysis is not as good at taking importance in the context of the query into account, while A/H analysis is much better.

3. A/H analysis has major issues with stability (i.e., its ranking can change a lot with just small random changes to link structure), while page rank is much more stable.

4. It's all a bunch of ballyhoo; underneath it all, A/H and pagerank both give the same ranking of importance to all the pages.
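To ground these comparisons, here is a bare-bones power-iteration sketch for pagerank on a made-up 3-page graph (damping factor 0.85, uniform reset; A/H iterates similarly, but on A'A and AA' for the query-specific subgraph):

public class PageRankSketch {
    public static void main(String[] args) {
        // adj[i][j] = 1 if page i links to page j (toy graph, no dead ends)
        int[][] adj = {{0, 1, 1}, {1, 0, 0}, {0, 1, 0}};
        int n = adj.length;
        double c = 0.85;                       // probability of following a link
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);  // start uniform
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - c) / n); // random-surfer reset
            for (int i = 0; i < n; i++) {
                int out = 0;
                for (int j = 0; j < n; j++) out += adj[i][j];
                for (int j = 0; j < n; j++)
                    if (adj[i][j] == 1) next[j] += c * rank[i] / out;
            }
            rank = next;
        }
        System.out.println(java.util.Arrays.toString(rank));
    }
}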


------------
Rao


Talk by Zaiqing Nie (ASU Alumnus; currently at Microsoft Research) on 10/24 on "Object-level Web Search"

[If you are interested in talking to Nie one-on-one, let me know.]

Title: Object-level Web Search

Speaker: Zaiqing Nie, Lead Researcher, Microsoft Research Asia.

Venue: Room BY 210; 10/24; 10:30 AM

Abstract:

Current Web search engines can be considered page-level general search engines whose main functionality is to rank web pages according to their relevance to a given query. However, there are various kinds of objects embedded in static Web pages or Web databases. Typical objects are people, products, papers, organizations, etc. We can imagine that if these objects could be extracted and integrated from the Web, powerful object-level search engines could be built to meet users' information needs more precisely. In this talk, I will discuss this new trend of building what we call "object-level search engines". The research problems that we need to address include large-scale web classification, object-level information extraction, object identification and integration, and object relationship mining and ranking. I will introduce the overview and core technologies of object-level search engines as implemented in three working systems: Libra Academic Search (http://libra.msra.cn), Windows Live Product Search (http://products.live.com), and Guanxi – a search engine for entity relationship discovery (http://renlifang.msra.cn).

 

Bio:

Zaiqing Nie is a lead researcher in the Web Search & Mining Group at Microsoft Research Asia. He graduated in May 2004 with a Ph.D. in Computer Science from Arizona State University. He received both his Master's and Bachelor of Engineering degrees in Computer Science from Tsinghua University. His research interests include data mining, machine learning, and Web information integration and retrieval. Nie has many publications in prestigious conferences and journals including SIGKDD, ICML, WWW, CIDR, ICDE, TKDE, and JMLR. His recent academic activities include serving as program committee co-chair of IIWeb 2007 and as a program committee member of KDD 2008, SIGIR 2008, WWW 2008, ICML 2008, ACL 2008, AAAI 2008, etc. Some technologies he developed have been transferred to Microsoft products/services including Windows Live Product Search and Windows Live Search in China, Libra Academic Search, and Renlifang Guanxi Search.


Monday, October 6, 2008

Re: [CSE 494/598 Fall 2008 Blog] New comment on Re: Question about Homework 2 on document coordina....

I added the following hint to that question for those of you who are confused by the U,S,V notation

[[Oct 6, 2008] Note that M here is t-d, so the SVD of M will be of the
form t-d = t-f * f-f * d-f'. So U = t-f, S = f-f, and V = d-f.]

cheers
Rao

ps: There is nothing wrong with U,S,V notation--it is in fact the standard notation for SVD. However, the d-f, f-f, t-f notation is more mnemonic




On Mon, Oct 6, 2008 at 3:49 PM, Mike Balzer <mjbalzer@asu.edu> wrote:
Mike Balzer has left a new comment on your post "Re: Question about Homework 2 on document coordina...":

Then, should the wording for question 3.1 Homework 2 be changed, as it references U, S, and V?



Posted by Mike Balzer to CSE 494/598 Fall 2008 Blog at October 6, 2008 3:49 PM

Re: Question about Homework 2 on document coordinates in LSI space

[Folks--please make sure that you are not looking at hidden slides, as these may have inconsistencies and incomplete statements.]

SVD gives

d-t = d-f * f-f * t-f'

the new coordinates of the docs are d-f * f-f

I pointed this out in class when doing the main animated slide. (I think you and the student are looking at some hidden slides which may have been left over from long back--I did all my discussion in terms of d-f * f-f * t-f', and not in terms of U, Sigma, and V!)
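If it helps to see this concretely, here is a small sketch using the Apache Commons Math library (assuming commons-math3 is on the classpath; the 4x3 doc-term matrix below is made up):

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class LsiCoordinates {
    public static void main(String[] args) {
        double[][] docTerm = {   // rows = docs, columns = terms (made up)
            {1, 0, 1},
            {0, 1, 1},
            {1, 1, 0},
            {0, 0, 1}
        };
        RealMatrix dt = new Array2DRowRealMatrix(docTerm);        // d-t
        SingularValueDecomposition svd = new SingularValueDecomposition(dt);
        // d-t = (d-f) * (f-f) * (t-f)'   i.e.   U * Sigma * V'
        RealMatrix coords = svd.getU().multiply(svd.getS());      // d-f * f-f
        // Keep only the first k factor columns for a rank-k LSI space.
        int k = 2;
        RealMatrix lsiCoords = coords.getSubMatrix(0, 3, 0, k - 1);
        System.out.println(lsiCoords);
    }
}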


rao


On Mon, Oct 6, 2008 at 1:51 PM, Garrett Wolf <garrett.wolf@asu.edu> wrote:
Rao, I've had two students ask me this question today, and after looking through the slides I wasn't clear which they should be using for the homework, as both were mentioned in the slides (one seemed to be marked as the one to use, but I just wanted to check to be sure).


Garrett



---------- Forwarded message ----------
From: Jianhui Chen <jchen74@asu.edu>
Date: Mon, Oct 6, 2008 at 1:48 PM
Subject: Question about Homework 2
To: Garrett.Wolf@asu.edu


Hi, Garrett,
 
I am a student in Dr. Rao's information retrieval class. I have a question regarding the homework.
 
In LSI, assume X is the doc-term matrix and denote its SVD by X = U \Sigma V'.
 
Should we use "U \Sigma" as the new document representation, or only "U"?

It seems both approaches are mentioned in Dr. Rao's slides. In the homework, we need to use the new document representation to compute the similarities and plot the documents.

Please advise. Thanks.
 
Jianhui


Homework 1 Stats and Comments

Here are a few stats for homework 1:

Max:     40
Min:     27
Avg:     36.7
Stddev:  3.7

Overall, people did well on this first homework, but I have a few suggestions for future homeworks:

- Make sure to show your work: without it, I cannot tell how severe a mistake is, and a wrong answer is then unlikely to get partial credit.

- For most of the questions (obviously not all), some sort of explanation is expected (I went easy on this for the first homework, but later homeworks will be stricter). If you have to compute an answer, it's always good to include some text describing it rather than just providing a number.

- For the relevance feedback question, in the beta/|Dr| and gamma/|Dn| parts, |Dr| and |Dn| are the *numbers* of relevant and irrelevant documents respectively (not the norms of the document vectors). For question 3.4, where d4 and d6 are shown to the user and the user clicks on d6 but not d4, keep in mind that |Dr| = 1 and |Dn| = 1. Some people put |Dr| = 2, presumably because 2 documents were shown to the user. However, just because the system thinks the documents are relevant (and thus shows them) does not make them relevant in the relevance feedback calculation--only the user's judgment does. (A small worked example follows this list.)
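For example (with made-up vectors, alpha = beta = gamma = 1): if d4 and d6 are shown and the user clicks only on d6, then Dr = {d6}, Dn = {d4}, and the updated query is q_new = q + d6/1 - d4/1 = q + d6 - d4.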

 

Some people may have answers which differ from those in the slides for problem 3. I had sent out an updated doc-term matrix, yet some people used the matrix from the slides when answering the questions. If you did this, note that I used the original matrix when grading your homework. Please use any updated versions if they are sent out for future homeworks.

 

Finally, the grading scale in the solutions is not the one I used. Here is the grading scale used for this homework. If you have any questions or think something was graded incorrectly, please let me know (but as I mentioned, I went a little easier on this first homework).

 

Total Points:  40

Q1.1:  1
Q1.2:  1
Q1.3:  2
Q1.4:  3
Q1.5:  3
Q1.6:  3
Q1.7:  3
Q1.8:  3
Q2.1:  4
Q3.1:  3
Q3.2:  3
Q3.3:  3
Q3.4:  4
Q3.5:  4

Garrett

Sunday, October 5, 2008

Homework 1 solutions posted; graded homeworks will be returned on Tuesday

Homework 1 solutions posted; graded homeworks will be returned on Tuesday.

(I should mention that the hard-working TA gave me the graded homeworks last Tuesday. I wanted to go over them once
and return them to you on Thursday. I forgot the returning part.)

rao

Homework 2--three questions added. Due Thursday in class

Folks:
I added three questions to the homework--one on social networks, one on string distances, and one on inverted indexes. These questions are not very hard, but are there to make you look at the notes.

the homework as a whole is due on Thursday 10/9 in class

Rao

Friday, October 3, 2008

Where the Web is Going: Web 2.0, 3.0 and Beyond

Interesting round table style panel on Web 3.0 and beyond...

http://link.brightcove.com/services/link/bcpid980795693/bctid1790936412

(If you watch far enough through, you'll notice Peter Norvig happens to mention his spelling corrector that was shown in the "Indexing and Tolerant Dictionaries" lecture)

Thursday, October 2, 2008

Beyond Text Ranking

I am personally more interested in applying the ranking and retrieval techniques developed in IR to image ranking. In particular, I am working on a project in which, given a query biological image, the task is to find similar images in the database. You can find more information about this project at: http://www.flyexpress.net/

Recently, it seems that people are beginning to apply text retrieval techniques to image ranking, and I noticed two papers which I found interesting and want to share with you.

@article{ 10.1109/TPAMI.2008.121,
author = {Yushi Jing and Shumeet Baluja},
title = {VisualRank: Applying PageRank to Large-Scale Image Search},
journal ={IEEE Transactions on Pattern Analysis and Machine Intelligence},
volume = {30},
number = {11},
year = {2008},
pages = {1877-1890},
}

@ARTICLE{Sivic:PAMI08,
AUTHOR = {J. Sivic and A. Zisserman},
TITLE = {Efficient Visual Search of Videos Cast as Text Retrieval},
JOURNAL = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
VOLUME = {},
NUMBER = {},
PAGES = {},
YEAR = {2008},
NOTE = {In press},
}

Shuiwang

added new slides at the beginning of page-importance measures lecture

Folks:

I revised the first slide and added a new second slide in the page-importance measures lecture. I encourage you to look at them.

rao

Regarding Page Importance

Sir,

When you were discussing page importance, you mentioned the amount of access a page gets as one of the criteria for evaluating importance. I have 2 questions on that.

1. There are lots of websites which display the number of visitors for that page. How do they compute that?
2. Is that number at all relevant to computing the page rank for that page? If yes, can website owners increase the page rank of their website by visiting it again and again themselves, and make it to the top of the ranking?

On the question of when eigen values are guaranteed to be positive (and the property of positive definiteness)

There was a question in class (by Dmitry, I think) as to when we can be sure that the eigenvalues of a matrix are all positive.

Here is the answer

All symmetric matrices are guaranteed to have real-valued eigenvalues, but not all of them are guaranteed to have positive eigenvalues.

Having positive eigenvalues turns out to be closely related to a *very* important property of matrices called "positive definiteness".

However, if the symmetric matrix M under consideration can be written as B*B', where B is a non-singular matrix (i.e., has a non-zero determinant),
then it is guaranteed to have positive eigenvalues. (Notice that the matrices we power-iterated on in the A/H computation are of this form.)

-------------

While we are at it, we should note that the property of having positive eigenvalues is intimately linked to another very deep and useful property called
"positive definiteness".

A matrix M is called positive definite if, for *any* non-zero vector u,

 u * M * u' > 0  (it is called positive semi-definite if  "greater than" is replaced by "greater than or equal to")

It turns out that M is positive definite if and only if it has all positive eigenvalues (you can sort of see the "if" case by viewing vector/matrix multiplication as
stretching the vector's projections along the eigenvector directions by amounts equal to the eigenvalues).
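For the M = B*B' case mentioned above, the check is one line (with u a non-zero row vector):

u * M * u' = u * B * B' * u' = (u*B) * (u*B)' = ||u*B||^2 > 0

which is strictly positive because B being non-singular guarantees that u*B is non-zero whenever u is.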


cheers
Rao





Wednesday, October 1, 2008

Search on google's 2001 index

Hi all,
If you don't know yet, Google has put up its 2001 index for searching, as a way to celebrate its 10th birthday. For searching back in time, you can visit
and for more on their 10th birthday, visit

I searched for "apple ipod" and got 36 documents as results, with none of them pointing to apple.com.