Tuesday, September 30, 2008

A somewhat erroneous statement I made in today's class..

In today's class, I said that it is possible to make the Poisson distribution as "peaked" as you want by
reducing the variance parameter (as part of making the point that the "drop off" in the power law distribution
is not as important as the "slow decay/long tail").

Well--not exactly. The Poisson distribution, as you saw, has only *one* parameter--lambda--and the
mean as well as the variance of the distribution will be lambda (this is unlike the normal distribution,
for example, where you can set the mean and variance parameters independently). The only way to reduce
the variance is thus to reduce the mean...  (you may also recall that for random networks, lambda is p*N, i.e., the expected degree).
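
Here is a quick numerical check of this point--a minimal sketch, assuming numpy is available:

    import numpy as np

    # For a Poisson distribution the mean and the variance are both lambda,
    # so the only way to shrink the variance is to shrink the mean with it.
    for lam in (1.0, 4.0, 16.0):
        samples = np.random.poisson(lam, size=100000)
        print(f"lambda={lam}: sample mean={samples.mean():.3f}, sample variance={samples.var():.3f}")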

Here is a link on the Poisson distribution, as well as on how it is the limiting case of the binomial distribution:

http://en.wikipedia.org/wiki/Poisson_distribution

cheers
rao

Speaking of the Long Tail...

http://www.thelongtail.com/the_long_tail/2005/09/long_tail_101.html

This article sparked quite a frenzy in the business world a while back...it can be seen as a direct cause of the renewed interest in non-Google engines and other businesses trying to capitalize on the Long Tail.

Project Deliverables

Just a reminder to bring a printout of your commented source code, example query results, and report to today’s class. 

We are not using digital drop box or any other electronic submission for the course…remaining homeworks/projects will also be submitted in class.

Garrett

Sunday, September 28, 2008

Project Help - Sunday

It seems as though several people are interested in meeting today regarding the project, so I will plan on being in the Brickyard from around 3pm – 5pm.  I'll plan on sitting in the second floor lab during this time (assuming I have access…otherwise I'll be on the 5th floor, in which case you can just send me an email and I'll come down and meet you on the second floor).

Garrett

CSE494/598 Project

I have responded to several questions regarding the project, and it seems there may be people with additional ones.  With the project deadline fast approaching, I wanted to see if anyone would find it useful if I were to make some time available tomorrow (Sunday) for you to come and ask questions face to face.

For those who are seriously interested, please let me know.  If there are people expressing interest, then I can be available later in the afternoon (after 3pm or 4pm) Sunday for a couple of hours.

Thanks,

Garrett

Saturday, September 27, 2008

Information Retrieve this!

SEO is a burgeoning business for those interested in seeing IR concepts taken to a business level...it also may be destroying the heart and soul of the Internet by turning a medium that's supposed to be about the free flow of information into a purely commercial endeavor, but that's probably just my own humble opinion. Anyway, check out some of the cool new stuff Google's doing that has that industry all atwitter!

Friday, September 26, 2008

Thinking cap questions: Social networks..

Here are a couple of questions you can discuss (feel free to add other questions you want discussed too). You might want to look at the
blog to see other comments already made before you add your own. Also feel free to only answer the hard ones..


A. We considered three types of measures of centrality and prominence: based on degree, based on "closeness" to other entities in the network, and
based on "between-ness"--i.e., how often the entity is going to be on the geodesic path between other actor pairs. W.r.t. these, think of the following:

A0. Consider three canonical networks of size n: a star network, where one entity has (undirected) edges to all other entities; a circle network; and a line network.
Think about which nodes are the most "important" in each of the networks, if we measure importance by degree-based, closeness-based and between-ness-based
notions respectively. (A small networkx sketch for experimenting with this appears after the A questions.)

A1. Are all these measures equally applicable to both directed and undirected social networks?

A2. In the case of directed social networks, can you attach different notions of importance to in-degree and out-degree?

A3. Can you think of generalizing the in- and out-degree based notions to transitively take into account the importance of the actor from/to
which the edge is directed? (Basically, you somehow want to say that the importance of a link from a node is proportional in some way to the
importance of the node itself.)

A4. In the questions above, we kept silent about the label of the edge. Suppose in one network the edges signify the relation "respects" and in
another network they signify the relation "hates". Does this difference throw a wrench into the kind of analysis the measures in A3 are trying to compute?

A5. Can you think of cases where closeness and between-ness measures of importance could be useful in the web-based networks (social networks, blogs or
page networks)?
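
If you want to experiment with A0 empirically, here is a minimal sketch using the networkx library (the 8-node size is an arbitrary choice):

    import networkx as nx

    # Compare the three centrality notions from A0 on the three canonical
    # 8-node networks: star, circle (cycle), and line (path).
    graphs = {
        "star": nx.star_graph(7),       # node 0 is the hub
        "circle": nx.cycle_graph(8),
        "line": nx.path_graph(8),
    }
    for name, G in graphs.items():
        for measure in (nx.degree_centrality, nx.closeness_centrality,
                        nx.betweenness_centrality):
            scores = measure(G)
            best = max(scores, key=scores.get)
            print(f"{name:6s} {measure.__name__:22s} top node: {best}")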


B. We talked about small-world networks, where the longest geodesic path between any pair of
connected nodes in the network, also called the diameter of the network, grows as the log of the size of the network. (A small sketch for checking this empirically appears after the B questions.)

B0: which of the networks in A0 are small-world networks by this definition?

B1: Suppose you are an actor in a small-world network of size N, and you only have local information (i.e., you have access only to the links
incident on you). Suppose you are trying to look for a path to another actor/entity in the network with a certain "goal" property g().
We are interested in characterizing how much work you will need to do, in the worst case and the best case, to find this actor. Answer the
following:

B.1.1. What is the difference between this scenario and the usual graph-search problems you learned about in 310 or 471?

B.1.2. What are the best and worst case complexities (you decide how to measure complexity) of the task?

B.1.3. How does your answer to B.1.2. change if you have access to the whole network (i.e., you can see all the connections between all the nodes)?
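
As with A0, here is a minimal networkx sketch for exploring the diameter claim empirically; the edge probability p = 2*ln(N)/N is an assumption chosen to keep the random graph connected with high probability (nx.diameter raises an error on a disconnected graph):

    import math
    import networkx as nx

    # In a random (Erdos-Renyi) network whose average degree stays well
    # above the connectivity threshold, the diameter grows roughly like log N.
    for N in (100, 300, 1000):
        p = 2 * math.log(N) / N
        G = nx.gnp_random_graph(N, p, seed=1)
        print(N, "diameter:", nx.diameter(G), " log N:", round(math.log(N), 1))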


========that is all for now================

Fwd: Faculty Seminar Series: Today's presentation by Prof. Huan Liu ( 3 PM at BYAC 110): ORG-ENG-CSE_Online_Com

So done... (Huan's talk should gel well with the social networks discussion we started in the class)

rao


---------- Forwarded message ----------
From: Farooq Khera <fkhera@asu.edu>
Date: Fri, Sep 26, 2008 at 9:21 AM
Subject: Fwd: Faculty Seminar Series: Today's presentation by Prof. Huan Liu ( 3 PM at BYAC 110): ORG-ENG-CSE_Online_Com
To: Subbarao Kambhampati <rao@asu.edu>


This lecture sounds really interesting - perhaps you can forward to class :-D


---------- Forwarded message ----------
From: Chitta Baral <chitta.baral@asu.edu>
Date: Fri, Sep 26, 2008 at 7:00 AM
Subject: Faculty Seminar Series: Today's presentation by Prof. Huan Liu ( 3 PM at BYAC 110): ORG-ENG-CSE_Online_Com
To: "ORG-ENG-CSE_Online_Com"


(NOTE: Location changed to BYAC 110 as it became available. For the rest of the semester we will be in BYAC 110.)

Time & Date: September 26th, 3 PM - 4:15 PM
(Coffee and cookies from 2:45 PM)
Location: BYAC 110

Title: Data Mining and Social Computing: Moving Forward in Interdisciplinary Research

Speaker: Prof. Huan Liu

Abstract:
Data mining finds actionable nuggets from massive data. Social computing is concerned with the study of social behaviors and social context based on computational systems, including social media on the Web. Challenges from data are numerous; two of the common ones are high dimensionality and exceedingly large numbers of data points. Feature selection is one effective means to tackle data of high dimensionality. It differs from feature extraction in that it selects a relevant subset of the original features, which is often required by domain experts in many real-world applications.

With the pervasive use of social media such as wikis, social networking sites, and blogging, more and more users can conveniently express themselves in endless ways. It is imperative to develop novel approaches to help understand what's happening and how to tap into the increasing volume of data generated from social media. In this talk, we first present an overview of our lab's research activities in data mining and social computing, then elaborate on two research lines, feature selection and identifying influential bloggers, and conclude with some representative challenging research issues we are currently working on, such as multi-source feature selection, finding "familiar strangers" in the blogosphere, and trust in Health 2.0.

=== Upcoming Seminars in October ===

3rd October: Prof. Dijiang Huang (BYAC 110)
10th October: Prof. Winslow Burleson (BYAC 110)
17th October: Prof. Hari Sundaram (BYAC 110)
24th October: Prof. Arunabha Sen (BYAC 110)
31st October: Prof. Yi Chen (BYAC 110)


Sunday, September 21, 2008

Question Regarding Project

A few of you have noticed that certain terms appear in the index more than once.  The reason for this is the term's field type.  For terms appearing in the content of the document, you will notice that term.field() == "contents".  There are other term fields as well; in fact, there are the following field types: contents, title, modified, and uid.

If you simply consider those with field type equal to contents, you will be getting all the terms.  The additional types such as title are simply used to indicate that a term may be more important because it appears in the title of a page; however, you are not required to give it additional weight.  If you feel like doing some extra work, you could always experiment and see how the results would be affected if the title terms were given additional weight, but like I said, it's not required.
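
To illustrate the filtering idea in miniature, here is a toy Python sketch of the concept--not the actual Lucene API (in the project code you would test term.field() instead), and the terms below are made up:

    # Each index term is tagged with the field it came from; keeping only
    # the "contents" terms gives you the full document-body vocabulary.
    index_terms = [
        ("contents", "retrieval"), ("title", "retrieval"),
        ("contents", "vector"), ("uid", "doc42"), ("modified", "20080921"),
    ]
    content_terms = [text for field, text in index_terms if field == "contents"]
    print(content_terms)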

Garrett

Tuesday, September 16, 2008

Project Recitation Slides

You can download the project recitation slides from here

http://asu.mine.nu/CSE494/project.html


Garrett



Monday, September 15, 2008

CSE494/598 - Reminder on Project Recitation and Linear Algebra Review

This is just a reminder that tomorrow, during the normal class time (and in the same room), I will be holding a recitation to discuss any questions/concerns regarding the project.  In addition, Raju will be giving a brief linear algebra review for those who are interested.  If you plan on attending the recitation, please take some time to look over the code so you'll be prepared to ask questions and clear up any doubts you might have.

Garrett

Thursday, September 11, 2008

DT matrix

It seems as if the image I sent didn't make it through correctly…here are the DT matrices.

HW1 Question 3 Part 4

As some of you have noticed, the TF-IDF document-term matrix on slide 79 is somewhat mixed up and missing a row.  You can use the TF-IDF values below when answering question 3.4 on the homework.

For the other questions (e.g. question 3.2), you cannot simply hand in the rows from this table; you must also show your work so I can see how you arrived at the answer.
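
As a reminder of the kind of work to show, here is a minimal sketch of one common TF-IDF weighting, tf * log2(N/df); the slides may use a different normalization, so treat the exact formula (and the toy counts) here as an assumption:

    import numpy as np

    # Rows are documents, columns are terms; weight = tf * log2(N/df).
    tf = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 0]])
    N = tf.shape[0]                  # number of documents
    df = (tf > 0).sum(axis=0)        # document frequency of each term
    print((tf * np.log2(N / df)).round(3))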

Garrett

(no) make-up class tomorrow

Dear all:

 Since I have pretty much covered everything I wanted to w.r.t. LSI in today's class, and since I don't want to start a topic and get back to it 10 days later,
I decided not to have a make-up session tomorrow.

You should have enough things to keep you busy between project 1, homework 1, and homework 2 (which will be opened, with questions on correlation analysis
and LSI, in a minute).

I do hope to make up for this class at a more opportune time (e.g. the Wednesday before Thanksgiving ;-). If I fail, I will just speak faster in the last few classes
and get you some donuts one day to make up for the loss of your tuition dollar value..

cheers
Rao

Readings for latent semantic indexing.. (and other links)

I revised the readings list and made the Manning et al. chapter the main reading for latent semantic indexing. Although my own lecture is not based on that specific chapter, I think the chapter provides the
best stand-alone short intro.  For the mathematically brave, I also put in a link to a proof of the connection between eigen decomposition and optimal low-rank approximation.
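
If you want to see that connection in action, here is a minimal numpy sketch (the matrix and sizes are arbitrary): truncating the SVD to the top k singular values gives a rank-k matrix, and by the result linked above it is the best rank-k approximation in the least-squares (Frobenius) sense.

    import numpy as np

    # Best rank-k approximation via truncated SVD (the Eckart-Young result).
    A = np.random.rand(6, 4)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print("rank:", np.linalg.matrix_rank(A_k),
          "approximation error:", round(float(np.linalg.norm(A - A_k)), 4))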


Kepler's laws of planetary motion are here: http://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion
I was trying to remember the third law (which states: "The squares of the orbital periods of the planets are directly proportional to the cubes of the semi-major axes of their orbits.").

The Far Side explanation of how Einstein discovered the special theory of relativity is here: http://rakaposhi.eas.asu.edu/cse494/notes/einstein-farside.jpg

Before you laugh at it, consider that historians of science think Kepler probably did discover his laws empirically, by poring over (and looking for regularities in) Tycho Brahe's planetary motion data.
In fact, one of the first empirical discovery systems in Machine Learning, called Bacon, written by our very own Pat Langley for his PhD thesis, shows how this approach can be
"automated" (one of the examples it uses is Kepler's laws--I recall this because I reimplemented the Bacon system as a class project for my intro to AI course in 1984). Here is a link to a short paper
on Bacon (from 1979): http://rakaposhi.eas.asu.edu/cse494/notes/bacon-langley.pdf

cheers
Rao

Wednesday, September 10, 2008

Make-up lecture slides/audio posted

Folks:

For those of you who missed the make-up class today, I posted the slides and audio. Unfortunately, although I brought my camcorder to video-record the class, I seem to have picked up the wrong power cord for it, so the camcorder had to run on battery and died too soon.  I did rearrange and edit the slides after the class so that they are in the correct order w.r.t. the audio, and I also added slides corresponding to the additional discussion that took place in class. I believe you should be able to make up for the lecture by viewing the slides while listening to the audio.

regards
Rao

Tuesday, September 9, 2008

Thinking Cap qns on Correlation analysis..

0. (Think for yourself; we will discuss this next class.) What is the "rank" of a matrix? What happens to the rank of a matrix if you pad it with rows that are linear combinations of its existing rows, and pad it with columns that are linear combinations of existing columns? What could this possibly have to do with LSI and dimensionality reduction? (A small numpy experiment on this appears after these questions.)

1. Give a scenario where two words may be close in the context of a query, but not close over the whole document corpus (and vice versa).
    What does this tell us about global vs. local thesaurus construction?


2. We mentioned that LSI is a special case of LDA. In particular, the most important dimension, as unearthed by LSI, may not be the most important
   dimension for separating relevant vs. irrelevant docs w.r.t. a query. Given this, can we make any justification for LSI analysis (as against LDA)?


3. I mentioned that LSI does not capture "nonlinear" dependencies, and mentioned the "circular" fish data (i.e., a bunch of artificial fish which, when plotted in the length/width space, will look like a circle (or, if you prefer, a very nice cubic curve or a sine wave, etc.)).

3.a. Since we don't quite know up front whether or not the data under consideration has non-linear dependencies among its dimensions, what is the "risk" we are taking in applying LSI?

3.b. [Tricky] Consider the following offbeat idea. A nonlinear dependency may still be viewed as a "linear" dependency between non-linear dimensions. This suggests the following highly quixotic idea for dimensionality reduction:

        First, *explode* the dimensionality by adding all possible higher-order combinations of the base dimensions (if you start with x and y dimensions, consider making up new terms such as x^2, y^2, x*y etc.).

        Now, do LSI in this (vastly higher dimensional) space, and pick the top k most important dimensions.

Suppose I do the above for only up to nth-order terms (i.e., terms of type x^k y^(j-k) with j less than or equal to n+1). Can you see this idea actually doing
dimensionality reduction, at least conceptually? What do you see as the main computational hurdle? (A small numpy experiment on this, using the circular fish data, appears below.)
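
For question 0, here is a minimal numpy experiment (the matrix and its size are arbitrary):

    import numpy as np

    # Padding a matrix with rows/columns that are linear combinations of
    # existing ones leaves its rank unchanged.
    A = np.random.rand(3, 4)
    padded = np.vstack([A, 2 * A[0] + 0.5 * A[1]])                 # dependent row
    padded = np.hstack([padded, padded[:, [0]] + padded[:, [2]]])  # dependent column
    print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(padded))  # both 3

And for question 3.b, a sketch of the explode-then-LSI idea on the circular fish data: on a circle x^2 + y^2 is constant, so the nonlinear dependency shows up as a (near-)zero singular value once the second-order terms are added.

    import numpy as np

    # Points on a circle: a nonlinear dependency between length (x) and
    # width (y) becomes linear in the exploded second-order feature space.
    t = np.linspace(0, 2 * np.pi, 200)
    x, y = np.cos(t), np.sin(t)
    exploded = np.column_stack([x, y, x * x, x * y, y * y])
    centered = exploded - exploded.mean(axis=0)
    print(np.linalg.svd(centered, compute_uv=False).round(3))
    # the smallest singular value is ~0: the x^2 + y^2 direction carries no
    # variance, which is exactly the circle constraint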




Reminder: Make-up class tomorrow (9/10: Wed: 10:30--11:45 in BY 210)

All of you who have the time are highly encouraged to attend in person. I will record the session for those who can't show up--but I would hate to
record while staring at white walls.

Rao

Cool presentation on LSA

Last year in Methods of Biomedical Informatics I, Dr. Trevor Cohen presented a talk on LSA, which is available here.



It offers a fascinating look at the way other fields frame the same fundamental research question, and at the slant they take. Plato's Problem is quite thought-provoking.

Monday, September 8, 2008

Questions Regarding Homework Format and Method of Submission

There have been a number of questions regarding the format for the homework and the method of submission, so I just wanted to pass along a reminder to everyone.  You may submit the homework in typewritten or handwritten format (please make sure it's easy enough to read), and submission will be in class, so please remember to bring a hardcopy to hand in.

If there are any more questions, please let me know.

Garrett

Thursday, September 4, 2008

Follow up to precision/recall curve question

This is just a follow-up to the mail I sent regarding the plotting of precision/recall curves…I realize that the example in the slides shows 10 relevant documents, and it may seem as if this is the reason for plotting the 11-point curve; however, this is not the case…

11-point curves are frequently used irrespective of the number of relevant documents…for some additional information, please refer to these links, send me a mail, or stop by and see me:

http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization#11-point_average_precision

http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

Garrett

Question on homework - Plotting precision/recall curve

There was  a question regarding the homework:

“Clarification on #2: Are we supposed to plot the precision and recall numbers after each document is fetched in the query? My interpretation of the initial formulation seems to yield only 1 point, which is a bit nonsensical for plotting a curve...”

The answer is yes, you should plot the precision/recall after each relevant document is fetched.  In the slides, the frequently used 11-point precision/recall curve is shown…this is done by interpolating to find each of the eleven points…if you plot the curve for each document (relevant or not) and don't interpolate, you oftentimes end up with a sawtooth-shaped curve…
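
For the curious, here is a minimal sketch of the interpolated computation; the interpolation rule assumed here is the standard one (the precision at recall level r is the maximum precision at any recall at or above r):

    # ranked_relevance: True/False for each result, in ranked order;
    # total_relevant: number of relevant documents in the whole corpus.
    def eleven_point_curve(ranked_relevance, total_relevant):
        precisions, recalls = [], []
        hits = 0
        for i, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)
                recalls.append(hits / total_relevant)
        curve = []
        for level in [i / 10.0 for i in range(11)]:  # recall 0.0, 0.1, ..., 1.0
            ps = [p for p, r in zip(precisions, recalls) if r >= level]
            curve.append(max(ps) if ps else 0.0)
        return curve

    print(eleven_point_curve([True, False, True, True, False], total_relevant=3))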

Garrett

Thinking cap questions II: Distance measures; high-dimensional spaces etc.

First, a couple of fun links, followed by more thinking-cap questions.


The fun links:

0. A really good paper that discusses vexing search issues in an important web-search domain
       http://www.jmlg.org/papers/wankendon05.pdf

1. The Lincoln play by Woody Allen that I mentioned (about "how long should a man's legs be") is here:
http://rakaposhi.eas.asu.edu/lincoln-query-woody-allen.pdf

2. Here is a bag-of-words analysis of Bush's State of the Union addresses through 2007:
http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html

3. Link to the click-bias study (which showed that people tend to assume Google knows what is relevant better
than they themselves do). This is a good paper to skim to get an idea of what is involved in careful user studies
(contrast their user studies with the user studies in "0" above--which were much more profitable--as in footnote 1 of that paper):

http://www.cs.cornell.edu/People/tj/publications/joachims_etal_05a.pdf


==========Thinking cap questions=============


0. A comment on the "Reduced Shakespeare Company" is that the more you know the original, the funnier it gets. Keeping that in mind, what is a sedate but academic summary of the point of the paper in "0" above?
  (For example, why do they say that instead of idf they found the df measure to be more suitable for their purposes?)

1. What is the highest value the normalized Euclidean distance measure can take (recall that it is just the distance between two unit vectors in n dimensions)?


2. What is the effect of the alpha/beta/gamma values in the Rocchio relevance feedback method on precision and recall? (A small sketch of the Rocchio update appears after these questions.)


3. Does it make sense to do stop-word elimination in the context of vector similarity ranking with tf/idf weights?


=====================
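
To make question 2 concrete, here is a minimal numpy sketch of the standard Rocchio update, q' = alpha*q + beta*(centroid of relevant docs) - gamma*(centroid of irrelevant docs); the particular alpha/beta/gamma defaults and the toy vectors are illustrative assumptions, not prescribed values:

    import numpy as np

    def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        # Move the query toward relevant documents, away from irrelevant ones.
        q_new = alpha * q
        if len(rel_docs):
            q_new += beta * np.mean(rel_docs, axis=0)
        if len(nonrel_docs):
            q_new -= gamma * np.mean(nonrel_docs, axis=0)
        return np.maximum(q_new, 0.0)   # negative term weights are usually clipped

    q = np.array([1.0, 0.0, 0.5])                       # toy 3-term query
    rel = np.array([[0.9, 0.1, 0.4], [0.8, 0.0, 0.6]])
    nonrel = np.array([[0.0, 1.0, 0.2]])
    print(rocchio(q, rel, nonrel))

Raising beta relative to gamma pulls the query harder toward the relevant centroid; thinking about what that does to the result ranking is a good way into the precision/recall effects the question asks about.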




Lecture reflections

In the vector model we discussed, we mentioned the well-known Curse of Dimensionality. It is well known in data mining that, given a high-dimensional space, data tends to accumulate in subspaces (which is what I clumsily tried to express with "dominant dimensions", since the eigenvector bases are the most important dimensions in an n-dimensional space, holding the most information). But given the context of the last lecture, I wonder:

So when you're doing multidimensional scaling, are you finding the dependencies between words?



Also, with regard to the Rocchio method, I'm not entirely sure I understood it. After the lecture on 9/2, I had the following thought:

It seems to me that the search engine could query the user, after the user hits a results page, about whether the page was what they were looking for or not; the machine could then take that feedback and apply learning algorithms. It could specifically look at ∆Relevance/∆Keyword…


I envisioned a system that worked like the old search-engine approach of asking for direct relevance feedback, but now it seems that "Similar Pages" is the de facto automated standard for query adjustment? What other methods are there?

Wednesday, September 3, 2008

*Important* note on readings for the IR lectures

Folks:

As I mentioned, the primary reading for each topic is the first unindented paper for that topic in the readings list. So for the Traditional Information Retrieval topic we are discussing now, the required reading is the link with the name

"Text Retrieval (a draft chapter from Wei Meng, SUNY Binghamton. Used with Dr. Meng's permission)."

I should point out, however, that this is sort of the "Cliffs Notes" version. If you want a more up-to-date and comprehensive treatment, read the relevant chapters from the IR book by Manning et al., for which a link is provided in the readings.  The order of chapters in that book, however, is different from the order in which we will cover the material. Here is a guide for the next two weeks:

Intro to IR and Boolean retrieval (chapter 1)

Evaluating IR using Precision/recall (chapter 8)

Vector space model (chapter 6--6.1 to 6.5)

Relevance feedback (chapter 9)

Latent semantic indexing (chapter 18)

regards
rao



Tuesday, September 2, 2008

Class composition after drop/add

Now that the standard add/drop period is over, here is the class composition.
I updated the mailing list to reflect the current enrollment. If you are still getting mail
and shouldn't be, let me know (and if you are supposed to get mail and are not, you too should
let me know, but I guess that is harder..)


43 students

7 registered for 494
  6 senior
  1 exchange

36 registered for 598
  8 phd
  22 MS
  5 MCS
  1 exchange

Number of students "signed" into class blog:
  28

Using my powers of ratiocination, I think we have representatives of the following countries (Let me know if I am missing any):

            USA, India, China, Germany, Turkey,  France, Russia, Ecuador, South Korea

As of now, I can put names to about 14 faces in the class. I would like to make this number higher quickly.

Added two more questions to homework 1


Thinking Cap Questions

Occasionally, I will send you "thinking cap" questions. These are broad questions (some very simple, some somewhat deep) for which, if you have an answer or opinion, you should post it on the blog (as a comment to the question).

Here goes the first batch:


1. I mentioned that any two vectors, not just orthogonal ones, can be used as coordinate axes to define a space. We know that if we have a vector
P = (4,7), its "coordinate" in the Y direction is 7 and in the X direction is 4.  If our coordinate axes are not the X and Y axes but rather two arbitrary vectors u and v (assume u and v are unit vectors), then what are the coordinates of P in this space? What is the restriction on u and v such that we can always give coordinates to any point in the space this way? (A small numpy sketch for experimenting with this appears after these questions.)

2. How does stemming affect precision and recall?

3. I pointed out two of the problems with the canned relevance judgments used in TREC: that they represent just one user, and that they typically represent "binary" judgments. Recalling the form of the R( ) function we discussed in class, can you think of other ways these judgments are not really enough to evaluate a ranked list of results?
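
For question 1, here is a minimal numpy sketch you can experiment with (the particular u and v are arbitrary unit vectors):

    import numpy as np

    # The coordinates c of P in the (u, v) basis satisfy [u v] c = P, which
    # is solvable for every P exactly when u and v are linearly independent
    # (i.e., not parallel).
    P = np.array([4.0, 7.0])
    u = np.array([1.0, 0.0])
    v = np.array([np.sqrt(0.5), np.sqrt(0.5)])   # unit vector at 45 degrees
    c = np.linalg.solve(np.column_stack([u, v]), P)
    print(c)   # the (u, v)-coordinates of P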






CSE 494/598 Info Retrieval/Mining/Integration TA Office Hours

For those who did not receive this message through the class blog, here it is again.  I'm planning to hold my office hours on Wednesdays from 3:30pm - 4:30pm in 557AD, right outside the CSE main office on the fifth floor.

If you have any questions, feel free to come see me or send me a mail at garrett.wolf@asu.edu.  If you cannot make it during office hours, let me know and we can schedule a separate time to meet.

Garrett