Comments on CSE 494/598 Fall 2008 Blog: Blog qn for Homework 4 (You should post your answer as a comment to this thread and also enclose it in your hw 4 submission)

Overall, I appreciate the discussion of social net...

2008-12-09T20:47:00.000-08:00

Overall, I appreciated the discussion of social networks most. At KDD'08, social networks seemed to be the hottest topic, and everyone was talking about them. This class covered many interesting topics in social networks. Personally, I wish we could have discussed more material on this topic.

Some other thoughts are as follows.

1. Simple approaches are always preferred. Many approaches discussed in class are quite simple, but they perform well in practice. A typical example is the Naïve Bayes classifier, in which we make the independence assumption. I once talked to a researcher from Siemens Medical who also works in machine learning and data mining. In his practice, simple supervised learning methods, such as SVM, are always preferred, and complex semi-supervised learning methods are not considered. If there is not enough label information, the solution is to obtain the labels first, even at some cost.
2. Dimensionality reduction is very important for large-scale data processing, especially for text documents. Several dimensionality reduction techniques were discussed in class, including PCA and LDA. Prof. Rao explained the idea of PCA very clearly and gave some intuition behind these models. PCA reduces to an eigenvalue problem. Generally speaking, many dimensionality reduction techniques can be reduced to eigenvalue problems, such as PCA, CCA (Canonical Correlation Analysis), PLS (Partial Least Squares), and even some recently proposed manifold-learning techniques (e.g., LPP, Locality Preserving Projection). As a result, how to solve large-scale eigenvalue problems is a very important issue in practice. In class we discussed a simple technique, the power method. More sophisticated techniques, such as the Lanczos algorithm, can also be applied. An alternative is to transform the eigenvalue problem into a simpler, equivalent formulation that can be solved efficiently. I am currently doing research in this area. We have proved that, under some very mild conditions, CCA and LDA can be transformed into an equivalent least squares problem, which can be solved very efficiently.
3. The discussion of social networks was amazing. Many interesting phenomena were revealed in class, such as the small-world phenomenon and scale-free networks. At KDD 2008 I heard a lot about social networks, but I could not understand the talks well. Thanks to Prof. Rao’s discussion, we now know some of the hottest topics.
4. I appreciated the discussion of the motivation for XML. XML bridges the database and data mining communities. I should probably learn more about databases and XML in the future.
5. I liked the course project. In this project, we were required to implement the ideas and models learned in class. In particular, some basic parts of the project were provided, such as link extraction and index construction with Lucene. Therefore, we could focus on the most essential parts, such as the vector space model and the PageRank algorithm. I wonder whether we could do a project on social networks.
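
For readers who want to see the power method from point 2 concretely, here is a minimal sketch in Python. The function name, the 2x2 example matrix, and the tolerance are assumptions made for illustration and do not come from the class. The sketch also assumes the dominant eigenvalue is positive and strictly larger in magnitude than the others.

```python
def power_iteration(matrix, num_iters=100, tol=1e-10):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix.

    Assumes the dominant eigenvalue is positive and strictly larger in
    magnitude than all other eigenvalues.
    """
    n = len(matrix)
    vec = [1.0] * n  # arbitrary nonzero starting vector
    eigenvalue = 0.0
    for _ in range(num_iters):
        # Multiply the matrix by the current vector.
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        # Rescale so the largest component has magnitude 1.
        norm = max(abs(x) for x in nxt)
        nxt = [x / norm for x in nxt]
        converged = max(abs(a - b) for a, b in zip(nxt, vec)) < tol
        vec, eigenvalue = nxt, norm
        if converged:
            break
    return eigenvalue, vec


# Hypothetical example matrix with eigenvalues (5 +/- sqrt(5)) / 2.
A = [[2.0, 1.0], [1.0, 3.0]]
value, vector = power_iteration(A)
print(value, vector)
```

Each step costs one matrix-vector product, which is why the method stays attractive for very large sparse matrices.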

1.Seeing that ideas like bag of words work reasona...

2008-12-09T17:24:00.000-08:00

1. Seeing that ideas like bag of words work reasonably well, I found it interesting that we do not need exact solutions, only reasonable heuristics applied well. The facts that the web is not random and that humans can be made mediators in challenging tasks can also be leveraged well.

2. The concept of scale-free vs. random networks, and the realisation (and the flow of topics leading to it) that human social networks are modelled by scale-free networks. The DoS attack example and the military attack question (in the exam) were very useful in understanding the concept, its importance, and its potential applications.

3. The idea that logical inference can be done on web entities using RDF Schema is a really useful one. If it can be achieved, we would come quite close to realising the superhuman web secretary. However, I doubt people will provide schemas for their pages, and the alternative techniques seem difficult.

4. The realisation that Google is a lot more than the PageRank it pretends to be. Smart ideas like MapReduce, distributed indexes, and other engineering ideas contribute as much, if not more.

5. The topic of social networks was particularly interesting, as it can potentially be used to model many problems: the fact that popularity begets popularity, and how that changes the graph. The fact that a Zipfian curve exists in something like text too is amazing. It was also interesting to see how businesses can make money by catering to the long tail instead of, or in addition to, the short head (e.g., the wide shoes example).

1.That, Bag of words , in spite of being a “gross ...

2008-12-09T14:20:00.000-08:00

1. That bag of words, in spite of being a “gross approximation”, works very well. Until we have at least some sort of working natural language processing engine, we will be using bag of words!

2. In a social network, which is a random network with a non-uniform distribution of degrees, one can drastically reduce connectivity by deliberately taking out a few nodes. This fact is used not only for malicious purposes but also for good causes, such as disease prevention by quarantining super-spreaders. I learnt a medicine-related fact in a computer science class!

3. It was good to see a formal explanation for the ‘rich get richer’ phenomenon in the form of the power law.

4. I was quite surprised to learn that the most popular word is twice as frequent as the second most popular word! I had always assumed the top few words would be equally likely. Also surprising was the revelation that Li (1992) showed that just random typing of letters and spaces leads to a “language” with a Zipfian distribution.

5. Because of this course, I think everyone, not just me, knows a lot more about Google. I experimented with Google a lot, and it was great to see in reality what I had studied in class. For example, in one of my experiments I juxtaposed a real word with some random letters.

Query: ghgjhpotterfjdfh

And this is the response I got

Did you mean: ghgjh potter fjdfh

However, for the Query
Query: ghgjhpotterfjdfhdjhdsjkdska

The response was
Did you mean: ghgjh potterfjdfhdjhdsjkdska

Thus, we see Google is implementing concepts such as Levenshtein distance. However, they still have a lot of work to do, as can be seen from the second experiment, where it cannot detect words once the amount of junk goes beyond a certain limit.

6. Deep Web: We can find data in the databases of big commercial vendors' websites by indirect methods, without having direct access to them. For example, one can find the number of books with more than 100 pages in Amazon.com’s repository by methodically querying its web service and counting the results that have more than 100 pages.
This can be, and I am sure already is, used by companies against their competitors.

7. Google is more than PageRank. It is about "Engineering".
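
As a side note on point 5, the Levenshtein (edit) distance itself is easy to sketch. The version below is the standard dynamic-programming formulation; it only illustrates the distance, and is not a claim about how Google actually produces its "Did you mean" suggestions. The function name and test words are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("poter", "potter"))
```

Keeping only two rows of the table keeps memory proportional to the length of the second string.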

I am very interested in Natural Language Processin...

2008-12-09T12:47:00.000-08:00

I am very interested in natural language processing, and hence I found your class on information integration the most interesting, since it came closest to that topic.

Apart from that, the topics on social networks and how they have been applied on the web were very interesting too. They helped me understand the basics behind social networks such as Orkut and Facebook.

The clustering techniques (K-Means, Buckshot) and their vast applications have made me certain that they can be extended to many more fields. Hence I think it is a very useful trade I have picked up in class.

Studying the structure of the Google search engine, I learnt that a lot more goes into it than just page ranking. It helped me get an idea of how the early engine used linking, crawling, and other components together.

The topics on cosine similarity and how it wins over Cosine similarity in certain situations would also prove very useful, as their applications are numerous.

The first concept which amazed me was tf-idf weigh...

2008-12-09T12:26:00.000-08:00

The first concept which amazed me was tf-idf weighting. The effectiveness of the heuristic is hidden by the simplicity of its computation. In this respect tf-idf is comparable to K-Means, where the computation is also simple but the clusters are obtained in a definite amount of time.
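
To show how little computation tf-idf needs, here is a minimal sketch. It assumes raw term counts for tf and log(N / df) for idf, which is only one of several common variants; the toy documents and the function name are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A word such as "the" that occurs in every document gets weight zero, while a word that occurs in only one document gets the largest idf.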

Scale-free networks are another area which aroused great interest in me. Trivial as it may seem, the power law is present in most real-world scenarios, for example the Forbes rich list, the top 100 diggs on digg.com, etc.

Many concepts we discussed were like, "oh! we just thought X was going to be added to make Y better or more efficient" (however, after X was explained, for example after the discussion of K-Means and HAC, the intuitive effect of combining the two to produce Buckshot is a clear winner).
Authorities and hubs computation is, I feel, a wonderful method of measuring prestige. One can go on to think of applying it in the physical world to get information on who is popular where (not using the person's webpage, I mean).
Let's see if my profile increases in rank for what I'm posting here :D
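
For anyone curious what the authorities and hubs computation looks like in code, here is a rough sketch of the iterative update. The tiny link graph, the function name, and the fixed iteration count are assumptions made for illustration.

```python
def hits(links, num_iters=50):
    """links: {page: [pages it links to]}. Returns (authority, hub) score dicts."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(num_iters):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize to unit length so the scores do not blow up.
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Hypothetical graph: a and b both point to c, and c points to d.
links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
auth, hub = hits(links)
print(max(auth, key=auth.get))
```

In this toy graph, c ends up as the strongest authority, and a and b end up as equally good hubs.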

thank you all for the participation
cheers..

Non trivial ideas i was able to appreciate during ...

2008-12-09T12:16:00.000-08:00

Non-trivial ideas I was able to appreciate during the course of the semester:

1. Vector Similarity - the idea of using tf-idf to weigh words so that important words are given more weight, no matter how infrequently they occur.

2. Authorities and Hubs - learned how the link structure of the web affects the overall results produced, and how authority pages and hub pages influence the result of a query.

3. PageRank - the idea of using the same link structure to rank pages based on their importance. How sink nodes affect the rank of a page and the results. How to make a random surfer stick to a web page and make him follow web pages within a domain.

4. Link Analysis - found out how the whole world is a network of links and how trust is used to detect and eliminate spam.

5. Clustering - the idea of using clustering to improve precision was appreciated. Learnt how clustering can also be directed towards a group of users.
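
The sink-node issue from point 3 can be made concrete with a small sketch. This version assumes a damping factor of 0.85 and spreads the rank held by sink pages evenly over all pages, which is one common convention. The graph and names are hypothetical.

```python
def pagerank(links, damping=0.85, num_iters=100):
    """links: {page: [pages it links to]}. Returns {page: rank}, ranks sum to 1."""
    pages = sorted(set(links) | {q for t in links.values() for q in t})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(num_iters):
        # Rank sitting on sink pages (no out-links) is redistributed evenly.
        sink_mass = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * sink_mass / n
        new_rank = {p: base for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page splits its rank equally among its out-links.
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical graph; "e" is a sink with no out-links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": []}
rank = pagerank(links)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Without the sink handling, the rank that flows into "e" would leak out of the system at every iteration.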

1. Among all of the topics, I like Link Analysis t...

2008-12-09T12:14:00.000-08:00

1. Among all of the topics, I like Link Analysis the most. I knew some of the related concepts before, but not in depth. Dr. Rao gave a thorough, detailed explanation of the Authority/Hub scores and the PageRank algorithm and how they can be computed effectively. One more thing: the explanation of why eigenvectors can be computed via iterative multiplication was very clear and useful, and the explanation of why the eigengap determines the rate of convergence was very descriptive and interesting.

2. I also like the topic of the Semantic Web and XML. Through the class, I learned for the first time how the web evolved from plain text to structured data and from HTML to XML, and how information is organized and presented over the internet. An understanding of the Semantic Web could be very useful for my future research. One suggestion: since I am not very familiar with XML and RDF, I find it a little hard to understand them deeply. A short tutorial on some basic concepts (like a summary of those markup languages and schemas) could be very useful.

3. Social networks are a hot research topic these days, and Dr. Rao provided a thorough introduction to them. I enjoyed Dr. Rao's comparison of scale-free networks and random networks. Before the class I knew the two concepts, but I did not know how they differ in robustness. The introduction to Zipf's law was interesting and practical. This concept/model can probably be used in my future research.

4. The three projects are another good thing about the class. By doing the projects, I could systematically connect all of the learned terms and concepts and understand them deeply. For example, by doing Project A, I could connect TF/IDF with inverted files and understand why they are designed the way they are. Moreover, the introduction to effective retrieval was also interesting and useful.

5. Overall, I like the organization of the class. Dr. Rao is extremely well prepared; I noticed he seldom turned around to look at the slides on the screen. I also enjoy his way of teaching by explaining the underlying intuition, which helps understanding a great deal. And I feel that most of the time the whole class was very interactive, which helps students follow the lectures better.

Selecting relevant features is a more critical pro...

2008-12-09T11:53:00.000-08:00

Selecting relevant features is a more critical problem than the actual classification or clustering in high-dimensional data. It would be interesting to see how classification and clustering perform as the number of irrelevant features increases or decreases.

In data integration, the need to do horizontal data aggregation (i.e., retrieving relevant attributes from other tables) in the absence of join information poses interesting challenges on how different views can be joined.

Collaborative filtering used in recommender systems can be used in similar scenarios where the system needs to identify users who are intentionally trying to give bad ratings to a particular set of entries. Such users can be blocked from the system.

The way everything is modeled in the current web is based on the fact that “Response Time to the query is very critical”. So the approaches we discussed mostly rely on offline computation, and that makes sense. But if we look at the same problem in a different setup, where response time is not so critical and precision and recall have higher importance, the model needs to be radically changed.

The fact that there are “True dimensions” in data, and how LSI finds them and uses them to cluster the data, is very interesting. Initially, it is not very intuitive that clustering based on true dimensions would produce clusters that make sense when seen from the perspective of the real dimensions. It is also interesting to see how we pick different axes based on the task at hand (finding clusters or using it as a distance metric).

1. Eigen values are everywhere – from page rank, a...

2008-12-09T11:51:00.000-08:00

1. Eigenvalues are everywhere – from PageRank and authority scores to LSI! I am still not sure if things were designed so, or if they just turned out to be so. And the power iteration is such a bonus!
2. It’s a small world! The small-world phenomenon and Zipf’s law show that there exist patterns in the real world, in places that are not obvious.
3. We duplicate and store the same data in many forms – inverted index, forward index, lexicon, buckets. I realized, during the project, that this is a necessity when handling large-scale corpora.
4. Recommendation systems – I had never imagined that these make such a difference in revenue for online shopping websites until I actually saw Netflix’s competition to improve their recommendation system. The problem is definitely more complicated than Naïve Bayes, especially when there are eccentric movies whose ratings take extreme values – “Like it” or “Hate it” – and are seldom anywhere in between.
5. The Google paper gave us a very good implementation-level view of the whole search business. It helped us understand what is done at query time and what is done offline. It gave us enough background to think about using a distributed architecture to improve performance.

1.The idea to use SVD to do the dimension reductio...

2008-12-09T11:50:00.000-08:00

1. The idea of using SVD for dimensionality reduction: using a mathematical tool to capture the most important features of an object. In this way, noise can also be effectively eliminated.
2. Benford’s law can be used to detect made-up data in research papers. The reason is that the leading digits 1 through 9 are not uniformly distributed as the first digit of a number.
3. PageRank cannot converge when the graph behind the transition matrix M is not strongly connected. This can be solved by adding links from sink nodes to all the other pages.
4. The connection between the principal eigenvector and PageRank. This connection comes from the fact that the principal eigenvector of a matrix can be computed using the power iteration algorithm, which is what the PageRank algorithm does.
5. In the Buckshot algorithm, to keep the time complexity linear in n, only a sample of √n instances is used as the input to HAC.
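
The Benford's law claim in point 2 can be checked numerically. The sketch below compares the expected leading-digit shares, log10(1 + 1/d), with the leading digits of the first 100 powers of 2. The choice of powers of 2 as test data and the function names are assumptions for illustration; real fraud detection would need a proper statistical test.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(numbers):
    """Observed share of each leading digit among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


expected = benford_expected()
observed = leading_digit_shares([2 ** k for k in range(100)])
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Made-up numbers tend to have leading digits that are closer to uniform, which is what makes the comparison useful.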

1. With semantic web, a page becomes more than bag-...

2008-12-09T11:49:00.000-08:00

1. With the semantic web, a page becomes more than a bag of words. The latent information in it starts becoming more apparent. In such a scenario, it's interesting to think about more appropriate indexing schemes, since keyword indexing would not capture the latent information. Indexing also needs to become more semantics-oriented (not just at the ontology level, but at the content level).

2. Connection of PageRank computation to MDP in terms of prioritized sweeping was interesting. Also, I am curious about the idea of doing PageRank on a smaller representative (sampled) network and using resulting values as seeds to reach faster convergence.

3. Learning from only positive examples (for wrapper induction) at first seemed counter-intuitive, as negative examples narrow down the hypothesis space. Connecting it to grammar acquisition in children was interesting. It seems to support the bias for learning minimalistic models (Occam's razor).

4. Focussed crawling is quite attractive from an engineering point of view. The idea that it can be implemented as A* search, although not particularly deep, is extremely neat.

1. Earlier I had imagined that search engines work...

2008-12-09T11:43:00.000-08:00

1. Earlier I had imagined that search engines worked using 3 things: whether or not the word occurs in the page, page-rank and how close the query terms occurred to each other in a page. Seeing LSI and correlation actually being used to search for documents was interesting to me. Something that I knew in theory before (dimensionality reduction) being actually used in a service was quite exciting.
2. The engineering of a search engine was quite exciting, and that was something we had to do as a part of the project. Optimizing the algorithm for speed while keeping a low memory footprint required a lot of effort. What was weird was that something that would ordinarily appear to humans to take a lot of time (PageRank – the actual power iteration) seemed to take less time for the computer than something very basic (disk access). Reading the Google paper on how they solved these problems was enlightening.
3. The fact that trust can be propagated and distrust cannot was another interesting part of this course. I had never thought about how something that is intuitively justified by common sense could be unintuitive when viewed from the mathematical angle. Without this course, I would have assumed that it was possible to propagate both trust and distrust via the trust propagation algorithm.
4. The semantic web and RDF are very exciting – specifically the thought that it might one day be possible for engines to process units of knowledge and come up with logical answers, and that too from the greatest repository of knowledge today – the internet. This is a very exciting field, since it makes me think that we will be able to have a computer that passes the Turing test if it has access to a semantic web.

I was impressed by:1. the fact that abstract analy...

2008-12-09T11:22:00.000-08:00

I was impressed by:
1. the fact that abstract analysis of data (like LSI) can give us good correlations and generalized similarities of texts, though the extent to which this analysis can be applied is still an open question (for me at least).
2. the ubiquity of scale-free networks and power laws, the connection between them, and the simple models of how these networks can be constructed.
3. the fact that Google uses thousands of features for search.

Thanks to Rao’s course. I now know why my vocab is...

2008-12-09T10:57:00.000-08:00

Thanks to Rao’s course, I now know why my vocab is so bad in spite of me trying so hard all these years.. reading more books is not helping me.. my mind is following the “principle of least effort”, and there is Zipf’s law to corroborate that..

Let me get serious (like all others...)

The concept of the inverted index and its application to ranking documents helped me realize how a seemingly complicated problem like search can be simplified by representing data in the right way. The ever-growing web could never be indexed and searched efficiently without the inverted index.
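
A toy version of the idea, just to show the data structure: the sketch below maps each term to the set of documents containing it and answers boolean AND queries. Ranking, term positions, and compression are left out, and the documents and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical mini-collection.
docs = {
    1: "web search engines",
    2: "search the web graph",
    3: "graph theory",
}
index = build_inverted_index(docs)
print(search(index, "web search"))
```

A query only touches the posting sets of its own terms, so the cost depends on those sets rather than on the size of the whole collection.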

It was interesting to see how LSI/SVD could be used in dimensionality reduction. It got me comparing and contrasting it with DCT and FFT that are used in image processing. The relation between the loss of variance and the Eigen gap also got me pondering. It was pleasing to see some of the applications of the otherwise dry linear algebra.

The explanation of what happens when a vector is multiplied by a matrix, in terms of the eigenvectors and eigenvalues of the matrix, was very interesting. The application of eigendecomposition in LSI was non-intuitive but very handy. The use of power iteration in general, and its use in authorities and hubs calculation in particular, was interesting.

The topics of knowledge representation using RDF and RDF-Schema got me pondering over issues like representing exceptions to rules, and knowledge and deductions that depend on cultural differences.

It is very interesting that many intuitive algorit...

2008-12-09T10:51:00.000-08:00

It is very interesting that many intuitive algorithms can be mapped to basic mathematical concepts like linear algebra, set theory, etc. Especially how different formulations of the matrices in eigen analysis lead to different results like authorities/hubs, page rank, LSI, etc.

How the enormity of the web has turned out to be more of an advantage than a problem (collaborative filtering, trust propagation, etc).

How there is always a trade-off between different factors and how difficult it is to optimize all of them (robustness to adversarial attacks vs. random attacks, flexibility vs. amount of upfront work in query processing, etc.).

The difficulty of the problem of information extraction and its effect on information integration. How failure to map a single word or set of words can make us lose a whole source of information.

The nature of the web which causes the openness of most problems and the complete lack of absolute stable ground truth. And how heuristics lead to good results in most cases.

At first I didn’t really appreciate or understand ...

2008-12-09T10:34:00.000-08:00

At first I didn’t really appreciate or understand how LSI is useful considering the scaling problems and computation power required to do it. After we learned it in class it was reinforced by several practical ideas, most recently the Netflix article. This really drove the point home and I appreciated the “real life” touch that broke this idea out of being strictly academic for me.

Even though some questioned its usefulness, I really enjoyed the Google paper discussion. This paper was one of the highlights of the semester for me and I was thoroughly engrossed with the discussion that ensued.

One of the highlights, lecture-wise, was the social network lectures. The background and attention to detail in these lectures were quite good and made it a joy to attend them.

One of my favorite topics was that of the clustering and clustering techniques. These techniques have already proved useful in other topics and discussions outside of class, so this sort of one-two combination punch that went with them really illustrated their usefulness.

The other aspect of the class I enjoyed the most was the project. The lectures were quite fast paced and high level in my opinion, yet we were expected to gather low-level knowledge quickly. The projects helped reinforce this; I learned far more from the projects than from any of the homeworks.

First of all, in IR almost every area we learn can...

2008-12-09T08:42:00.000-08:00

First of all, in IR almost every area we learn can be applied to other areas, and the areas work together. There are plenty of examples. The idea of classification is used when relevant pages are found. PageRank can be applied as a metric of page trustworthiness to decide the importance of sources in information integration; LSI can be extended to LDA in clustering; and in information integration, when the coverage of each source is determined, query statistics are grouped using HAC.

In recommendation systems, the idea of how people can be influenced by others’ opinions fascinated me. I have been surprised by how similar people's tastes can be when I see the recommendations on a web site like Amazon. To improve this, I think more research on psychological analysis is needed. Many variations may be possible depending on which features are chosen for classification. When it comes to movies, as in the Netflix example, many factors can affect a user’s decision; in my case, for example, my favorite directors and actors are the primary factor in choosing movies.

In the anatomy of how Google works, many things such as link structure and HTML tags are taken into account, and there is more than that, though. It is no wonder that lay users like me can be really satisfied by the results.

In information extraction and integration, a DB can be converted to HTML for search in the deep web. HTML or text can also be converted to XML for the semantic web, or to a DB table for query processing. Information can be converted in various ways according to the semantics needed or the tools available upfront.

Performance can be one of the most important factors in determining whether an algorithm is useful. For example, even though the A/H algorithm works fine, it’s not as popular as PageRank. Likewise, the clustering algorithm K-Medoid can help eliminate outliers, but it’s not commonly used for performance reasons.

1. I appreciate most the overall organization of t...

2008-12-09T08:25:00.000-08:00

1. I appreciate most the overall organization of this class where various important topics, ranging from traditional topics such as vector space model to more modern topics such as XML, information integration, are fused into one class. This can provide a big picture for students, and they can easily know what the state-of-the-art is, and what the challenges of IR research are.

2. I also appreciate the way this class was taught where a lot of intuitions have been presented whenever possible. Intuition may be the best way to keep the ideas learned in this class longer in one’s mind. This also helps to understand why such ideas are useful in IR.

3. I also like the class projects, which are carefully designed and controlled. Through these projects, the ideas learned in class are thoroughly understood. Moreover, students get hands-on experience with IR problems. On the other hand, the tedious processes of crawling and generating the inverted index are pre-built, so students do not need to spend their time on these. More importantly, the major features of modern search engines are all implemented in these projects.

4. Although I have been involved in research and classes in numerical linear algebra, I found the interpretation of the matrix vector multiplication during the class is very intuitive and illuminating. This explains vividly why the repeated multiplication of a vector with a matrix can result in the principal eigenvector. This intuitive explanation also clearly explains why the eigen-gap determines the convergence rate of the algorithm, and under which condition the algorithm cannot converge to the principal eigenvector.

5. The introduction of the structure into IR as an intermediate level between traditional database and unstructured IR gives very clear understanding why schemes such as XML and XQuery are designed in these ways. It is evident that these schemes are designed to accommodate both fully structured traditional database and the modern IR paradigms.
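
The intuition in point 4 can be written out as a short derivation. It assumes the matrix $A$ is diagonalizable with eigenpairs $(\lambda_i, v_i)$ ordered so that $|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$, and that the starting vector $x_0$ has a nonzero component $c_1$ along $v_1$. The symbols are notation chosen here, not taken from the lecture.

```latex
x_0 = \sum_{i=1}^{n} c_i v_i, \qquad A v_i = \lambda_i v_i

A^k x_0 = \sum_{i=1}^{n} c_i \lambda_i^k v_i
        = \lambda_1^k \left( c_1 v_1
          + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^{k} v_i \right)
```

Every term in the remaining sum shrinks at least as fast as $|\lambda_2 / \lambda_1|^k$, so the direction of $A^k x_0$ approaches $v_1$ at a rate set by the eigengap. The argument breaks down when $c_1 = 0$ or when $|\lambda_1| = |\lambda_2|$, which are the conditions under which the iteration does not converge to the principal eigenvector.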

It is really difficult to pick only five things fr...

2008-12-09T07:40:00.000-08:00

It is really difficult to pick only five things from all those interesting topics. Well, I'll give it a shot.

Though TF and IDF were non-trivial concepts, I knew them before joining this course. The topic that really caught my attention in this part was relevance feedback. It is really difficult to get feedback from users voluntarily, because nobody bothers. For example, recently we were all asked to fill in teacher evaluations. Though we do it in the end, we keep procrastinating. In this light, what methods like Rocchio do is amazing.
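
For reference, the Rocchio update is simple enough to sketch in a few lines: the query vector is moved toward the centroid of the documents marked relevant and away from the centroid of those marked non-relevant. The weights alpha, beta, and gamma below are illustrative values, the vectors are made up, and clipping negative weights to zero is a convention assumed here.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the updated query vector after one round of relevance feedback."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * s
               for q, r, s in zip(query, rel_c, non_c)]
    # Assumed convention: negative term weights are clipped to zero.
    return [max(0.0, w) for w in updated]


# Hypothetical 3-term vocabulary.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
non_relevant = [[0.0, 0.0, 1.0]]
new_query = rocchio(query, relevant, non_relevant)
print(new_query)
```

The second term, which the user never typed, enters the query because it co-occurs in the documents the user liked.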

LSI was a totally new concept to me: first, how important dimensionality reduction is; second, how it can be achieved with such low loss of information. It's really a pain that it has to be so computationally expensive.

The discussion of social networks brought a new dimension to the course (which was high-dimensional already). Concepts like "rare is not so rare" were incredible at first. Scale-free networks, their generation, and their properties were a good learning experience.

I had only heard about PageRank before, but I actually implemented it in this course. I would further like to know what those other 240-odd metrics are that Google uses for ranking nowadays.

Content-based and collaborative filtering were again totally new concepts to me. After learning how difficult it is to implement them, it is not surprising that perfect recommender systems still don't exist.

I know this is the sixth one, but I have to mention that concepts like the semantic web and mediator systems for information integration are really hard to believe.
I really think that topics like information extraction and information integration should have been discussed in greater detail. (I don't blame Dr. Rao, but rather the short duration we had for this course.)

1) Before this course, I had totally taken for gra...

2008-12-09T06:50:00.000-08:00

1) Before this course, I had totally taken the Google search engine for granted. Reading the Google paper made me realize how an idea can be turned into reality. Combining vector similarity and PageRank to answer queries seems almost perfect. But there are many other factors that can help give faster and better results, like query logs and query feedback.

2) When there are ambiguous queries like "bush", clustering seems to be an effective technique. The project gave good insight into the clustering algorithms: K-Means gives faster but less accurate results, while Buckshot gives better results because the seed documents chosen at the start are better.

3) The concept of the Zipfian distribution was interesting, especially the example of how a related skewed distribution of leading digits (Benford's law) is used for detecting forged documents. Even though we talk so much about probability, in reality a leading digit of 1 occurs much more often than a leading digit of 9. So this aspect should also be kept in mind while designing a search engine.

4) The long tail concept was also quite interesting; for example, some sites and networks are created specifically to cater to a small percentage of the population (a niche), while other sites satisfy the majority of the population. We could likewise choose to develop search engines for the majority of the population or for a small niche, say computer scientists.

5) Content-based and collaborative filtering are widely used by popular sites like Amazon and YouTube. They can be used for selling products online and for suggesting movies and songs. The idea in collaborative filtering of finding a person's twin (soul mate) and then giving recommendations based on it was interesting.
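
The "twin" idea in point 5 can be sketched as a tiny user-based collaborative filter. It assumes cosine similarity over rating vectors and a single nearest neighbour; real recommender systems are far more elaborate. The users, movies, ratings, and function names are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two {item: rating} dicts."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(ratings, user, min_rating=4):
    """Find the most similar other user (the 'twin') and suggest items
    the twin rated highly that the user has not rated yet."""
    others = [u for u in ratings if u != user]
    twin = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    picks = sorted(item for item, r in ratings[twin].items()
                   if r >= min_rating and item not in ratings[user])
    return twin, picks


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"matrix": 5, "alien": 4, "amelie": 1},
    "bob": {"matrix": 5, "alien": 5, "amelie": 1, "blade runner": 5},
    "cat": {"matrix": 1, "amelie": 5, "notting hill": 4},
}
twin, picks = recommend(ratings, "ann")
print(twin, picks)
```

Here ann's twin is bob, so she is offered the one film bob loved that she has not rated.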