Here is the "interactive review" question for homework 4 that I promised.
"List five nontrivial ideas you came to appreciate during the course of this semester"
(You cannot list generic statements like "I thought pagerank was kewl". Here is an example:
The deep connection between feature selection and dimensionality reduction: By understanding LSI, LDA and MI-based feature selection, I was able to see the deep connections between dimensionality reduction (which works without taking any particular classification task into account) and feature selection (which typically takes the classification task into account).
)
=============
40 comments:
I found TF-IDF weighting to be a much more robust way of handling keyword queries than plain text searching, as this scheme gives priority to certain words over others (a small sketch of the weighting appears after this list).
I have always wondered how clustering worked (k-means, buckshot, HAC, etc.) and now that I understand it, I feel like I might be able to apply it to other fields. I just need to make sure I have a way to find distances between items.
Social network topics such as authority/hub values now give a whole new meaning to “the popularity game”. I can now quantify how relatively popular my friends are using simple matrix multiplication.
Stemming and stop-word elimination are useful tools in indexing files. I want to write a search engine, and this is immediately one topic that comes to mind. I want to search for “drive” and be able to locate documents that may contain “drives” or “driving” as well as my original word.
The recommendation systems topic was also very enlightening. Recently, I was playing a board game with a friend of mine who shares a lot of my interests in games. He recommended another card game that he enjoyed, and sure enough, I enjoyed it as well.
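A minimal sketch of the tf-idf weighting mentioned in the first point of this comment; the toy corpus and the log-based idf are my own illustration of the general scheme, not the exact formula from the course:

```python
import math
from collections import Counter

# Toy corpus; documents are just bags of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dogs and cats make good pets".split(),
]

N = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term by term frequency times inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

# A term like "the", which appears in every document, gets zero weight here,
# while rarer, more discriminative terms are boosted.
for doc in docs:
    print(tfidf(doc))
```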
(sorry if this gets posted twice, I got an error the first time)
1. By understanding the difference between uniform random networks and scale-free networks, I was able to understand the pros and cons of both, and how to use their strengths and exploit their weaknesses (for instance, a scale-free network is vulnerable to targeted attack because removing key nodes will prevent most communication between nodes, but it can also be made more robust against any kind of spreading, because removing those same key nodes will efficiently stop the spreading).
2. Understanding the utility of Singular Value Decomposition for dimensionality reduction helped me realize that the dimensionality of data is a parameter like any other, which can be processed by algorithms and not only by humans. Moreover, it was a good illustration of an abstract mathematical method in a concrete example.
3. Learning how to compute authority/hub scores and PageRank helped me understand how to solve some issues created by the "adversarial" aspect of the web by giving good stability to a "rank". It also helped me understand how trust-rank propagation works to counter spam, and allowed me to imagine many kinds of such "rank propagation".
4. Studying the Google search engine structure made me understand that a good search engine requires a lot more than a good ranking algorithm. It helped me link together the crawler, the index and the ranking algorithm.
5. Studying k-gram methods and edit distances such as the Levenshtein distance helped me understand that tolerant retrieval can be very simple (a small sketch appears after this list). I no longer think that Google is very skilled in spelling, and I wonder why systems such as MediaWiki do not use such devices.
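A minimal sketch of the Levenshtein edit-distance recurrence behind point 5, assuming unit costs for insertion, deletion and substitution:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("levenshtein", "levensthein"))  # 2: the swapped letters cost two substitutions
```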
One of the most interesting discussions in class was when we dissected the Google Paper. It gave me an insight into what a complete search engine should “supposedly” have, though we all know there’s more to it than what Page and Brin would like us to know!
“Tyranny of Majority” and “Impact of Bridges” in Authority and Hubs caught my attention. It’s intriguing to know that a single page can make such a huge difference; but I still haven’t been able to figure out what the impact would be if the bridge were between node 9 and node 1 instead of node 4 (SLIDE 28, PPT No. 7 Link Analysis to Predict Web-page Importance).
The discussion on Social Networks and the lectures that followed gave me an insight into how important link analysis is in Information Retrieval. I was able to understand that Trust Analysis and propagation of trust through links play a vital role in fighting Spam links during Web page ranking.
I have always wondered about the “Did you mean: ______” feature in Google. I assumed it was magic! With the knowledge I gained in learning about edit distances, spelling correction and query logs I can surely make better assumptions in future!
Lastly, I found the topic on Clustering interesting. I have come to discover that Clustering Algorithms not only have a role in IR, but play an important role in other fields as well (Information Hiding for example).
1. The power iteration in the A/H computation results in Tyranny of Majority, but also shows which community of pages is the largest. Seemed like a useful side-effect.
2. LSI can be used to cluster results by using the df*ff matrix, where each row represents the original doc in the LSI space. This may have been an interesting clustering technique to try in the project.
3. Concatenating a document to itself affects Euclidean distance similarity, but not cosine similarity (since the angles do not change); a quick numerical check appears after this list.
4. I thought the insight into how search engines rank by leveraging page structure was enlightening (terms in the title tag, font sizes/bold font, word proximity, anchor tags, etc.). Also, I found it interesting that once people realize the details of the search engine ranking, they exploit it for their own pages (sometimes deviously).
5. Regarding knowledge representation on the web, XML tags and nesting of tags have no meaning (semantics). Using RDF (base facts) and RDF Schema/OWL (background theory) is a way of capturing the logical relationships to form the semantic web. I’ve used XML in the past, and just assumed anyone would know what my tags meant!
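A quick numerical check of point 3, using made-up term-frequency vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

doc     = [3, 1, 0, 2]            # term-frequency vector of a document
query   = [1, 1, 0, 1]
doubled = [2 * x for x in doc]    # the document concatenated with itself

print(cosine(query, doc), cosine(query, doubled))        # identical: the angle is unchanged
print(euclidean(query, doc), euclidean(query, doubled))  # different: the vector length doubled
```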
1. The way weights for each term in a document are calculated, using not only TF but also IDF, and how similarity and normalization are computed on these weights.
2. The mathematical background of LSA. In general: how correlations between words can be found (even if I didn't like doing the SV decomposition manually :)) and how to choose which dimensions can be dropped (by calculating the loss rate).
3. Link analysis in general. In detail: the Tyranny of the Majority in the Auth/Hub calculation is absolutely not intuitive, but a "real" phenomenon. For PageRank: how a random surfer is mathematically simulated and how this model of the random surfer helps to rank pages.
4. Clustering. Although clustering seems pretty intuitive, I found the implementation of k-means especially helpful in understanding the concepts and the importance of clustering search results (a bare-bones sketch follows below). I think I can use clustering again later for other projects. It also reminded me of certain fields in machine learning.
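A bare-bones sketch of the k-means loop referred to in point 4; random seeding, plain Euclidean distance and a fixed iteration count are simplifying assumptions here:

```python
import random

def kmeans(points, k, iters=20):
    """Very small k-means: points are tuples, distance is squared Euclidean."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)
```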
1. From the very first lecture, the idea of being unsatisfied with text retrieval systems and search engines that merely perform keyword search provided the required mindset to start the course.
2. It was interesting for me to see the applications of linear algebra to reduce the dimensions of document vectors without much loss of information.
3. The ways in which we can utilize the link structure of webpages are simply amazing. Apart from finding importance (PageRank, authorities and hubs), we can find trust. The text around a link can also provide a good description of the page linked to, and can be used more effectively for describing the document vector.
4. I appreciated the importance of indexing, hardware (caching) and its right use after reading the Google paper and getting to know the amount of hardware utilized by Google as compared to other text retrieval systems.
5. I also appreciated the importance of crawling the deep web. These days, most web pages are dynamically generated; the information required to populate such pages comes from web servers' databases. Hence, it is absolutely necessary to crawl the deep web in addition to static web pages.
1. The way simple weighting algorithms can produce very reasonable and relevant results when searching. The tf-idf algorithm and cosine similarity are not very complex and run very quickly, but produce quite relevant results.
2. Using LSI and SVD to reduce dimensionality seems at first like it would be a shot in the dark, but it often finds relationships in the data that make sense from a real-world perspective, despite being completely ignorant of the data's meaning (a small SVD sketch follows this list).
3. Social networking and IR were, in my mind, completely unrelated at the beginning of the semester. It turns out that social networking, trust propagation, and link analysis are very important in IR, and are used extensively to provide better weights for documents when searching. The ability to fight spam by using social networking concepts with relation to the web at large is something I would have never guessed otherwise.
4. It was easy to think that given the simplicity of the k-means algorithm the clusters produced for the corpus in the project would be marginal at best. Instead, the clusters for a given query were usually very well structured and made sense in terms of real world groupings of the returned results. Moreover, a very simple iterative approach to determining the representative terms for a cluster produced viable and useful results.
5. The importance of data extraction and integration was foreign to me, and learning how big a deal it is becoming was quite interesting in light of the current state of the web. Mashups and data aggregation websites are becoming more and more popular, and knowing both how they work and the ways source sites try to stop them was fascinating.
6. The sheer volume of data we dealt with in the project helped show me how crucial small optimizations are in terms of overall speed of an algorithm. Simply abstracting one operation from a loop and doing it afterwards can provide an overall runtime difference of seconds or more. This is the first time I have had algorithms dealing with this magnitude of data, and learning how to optimize and think in terms of very large data sets was very valuable.
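A small sketch of the LSI/SVD idea from point 2 above: take the SVD of a term-document matrix and keep only the top-k singular values. The tiny matrix and the choice of k = 2 are invented purely for illustration:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 2, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # keep the two largest "concept" dimensions
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The fraction of variance retained indicates how little is lost by dropping dimensions.
print("retained variance:", (s[:k] ** 2).sum() / (s ** 2).sum())
print(np.round(A_k, 2))
```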
In no particular order:
1. Exploitation of link structure: I thought that link analysis algorithms (Pagerank, A/H) were interesting. It paved the way to understanding how to perform IR on a scalable data set while maintaining high quality results. It also was interesting to see how trust propagation could be implemented by simply analyzing the web graph.
2. Inverted Indexes: While I have seen inverted indexes and full-text searching in the database realm, it was nonetheless interesting to see their application in IR. Furthermore, our theoretical discussion of inverted indexes, coupled with the implementation details in the Google paper, led me to realize how indexes can be optimized.
3. Clustering / KMeans: Using Kmeans made me understand how clustering is done on the web. Also, Buckshot showed how you can get perhaps better cluster results by half-using HAC and getting better initial centroids.
4. LSI / TF-IDF: By using TF-IDF I understood how you can provide better results by focusing only on the most relevant terms and avoiding terms that are overused or exploited (spam, etc.). LSI can be used to reduce the dimensionality of the data.
5. Programming for large data sets: I do not think that I have ever programmed for such a large data set before. I learned several techniques for optimizing performance while working on the projects; for example, using hashes instead of static arrays for sparse vectors (a small illustration follows), and caching division results because division is an architecturally expensive operation.
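A minimal illustration of the sparse-vector point above: storing document vectors as hashes (Python dicts here) so that a dot product only touches the nonzero entries rather than the whole vocabulary:

```python
# Sparse document vectors: term -> weight, only nonzero entries stored.
doc_a = {"page": 0.7, "rank": 0.5, "web": 0.2}
doc_b = {"rank": 0.4, "web": 0.9, "spam": 0.3}

def sparse_dot(u, v):
    """Iterate over the smaller dict; cost depends on nonzeros, not vocabulary size."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

print(sparse_dot(doc_a, doc_b))  # 0.5*0.4 + 0.2*0.9 = 0.38
```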
As a last note, I would also like to say that I was quite impressed with the level of mathematics used in the class. Obviously I have taken high level math classes but this class really started to get me thinking mathematically before computationally, which is quite different from many other CS classes. The theory aspect was quite refreshing and much appreciated (even if it was dry at times).
Correlation Analysis and Latent Semantic Indexing:
The ability of scalar clusters to capture transitive correlation seemed very interesting (something we could not achieve through the basic vector similarity model). Even though it seems intuitive that "database" and "sql" are related, scalar clusters helped put a quantifiable approach to it. LSI seemed overwhelming at first; however, the sequence in which topics were discussed, right from association clusters to scalar clusters and then LSI, allowed a better and quicker understanding of LSI. The advantages of LSI, in particular the concept of query expansion based on synonymy and correlation analysis, motivated in-depth thinking.
Authority/Hub and PageRank analysis:
The concept of the tyranny of the majority was something new to me and was not quite obvious intuitively. After understanding the mathematics behind it, things became much clearer. The most interesting aspect of PageRank analysis was understanding the issues related to PageRank computation. The approaches taken for handling dangling nodes and cycles (artificial boosting of intra-domain links) helped me understand how PageRank can be made more robust.
Recommendation systems:
The most interesting concept in this section was the combination of content-based and collaborative filtering. Pseudo-rankings in particular were enlightening and truly demonstrated the concept of "something is better than nothing". These techniques helped me understand the importance of approximations based on the data available and ways of coming up with good approximations.
Information Integration:
The surfacing, mediator and warehousing approaches helped provide an overall view of the things learnt so far related to the Web and their connection to traditional databases. Discussions of the issues faced during information integration were particularly interesting. Learning about source-selection statistics and the difficulties faced while dealing with non-collaborative sources prompted further thought.
Social networks:
The comparative analysis of scale-free networks and uniform random networks helped me understand the areas in which each could score over the other. It helps in understanding the basics behind social networking services like Orkut and Facebook. It was fascinating to see, based on the examples seen in class, that the power law actually applies to real-world scenarios.
Information Retrieval:
The concepts of inverted files and tolerant dictionaries were a surprise to me because of their simple and yet powerful features. The ability provided by the first to search huge amounts of data and to recover relevant results in less than a second is undoubtedly the essence of many of the advances in Information Retrieval. Similarly, the ability of an IR system to identify misspellings in the terms of a query as well as the ability to retrieve relevant documents despite their existence, adds more versatility to the way users interact with it.
LSI:
First, I consider that association clusters are an extremely useful addition to query expansion techniques. In addition, the enormous benefit offered by LSI of finding the most important dimensions within a document space, reducing noise and omitting redundant data, is (totally?) diminished by the extremely costly computations required by it. I consider that this is just unfortunate.
Link analysis:
It was pretty amazing to see the strength of the Authorities/Hubs algorithm to compute the importance of a page in two totally different senses and, even more, its ability to retrieve pages highly correlated to the query but that do not contain any of its terms. Similarly, it was amazing to realize the sensitivity of Pagerank to adversarial forces such as the page farms pointing to each other to increase their importance.
Clustering:
The way the results returned by VSM are organized into their natural categories made me think of clustering as a way to improve relevance feedback by means of helping the user have a better idea of what he/she is looking for.
Information Integration:
It was extremely useful to analyze the technologies behind the integration of heterogeneous data sources. In particular, the orchestration needed to make them all work correctly, as well as the crucial task of the mediator facing problems like the schema mapping and record linkage. Additionally, I found interesting the relationship of these issues with semantic web and machine learning.
Fairly simple methods are surprisingly effective for obtaining useful information (Bag of Words, Inverted Indices, linear systems, Naive Bayes models)
The deeply interconnected structure of the web is both a blessing (PageRank, Collaborative Filtering) and a curse (Tyranny of the Majority, Information Integration).
There are underlying bases and patterns that can be derived and applied to almost any problem, the key is figuring out how to properly conceptualize them (Eigenvectors + LSI, Clustering/Classification)
Scalable, robust solutions on a WWW level require lots of optimization and design work, as opposed to the prevailing attitude that any dunderhead can whip up a good web solution in a few hours/days (the Google Paper, Information Extraction/Integration)
Many "real-world" phenomena can be modeled by social networks ("the six degrees experiment"), and such analysis may give insights into social evolution ("Tyranny of the Majority", "Zipf's Law") and events (the economic downturn modeled as a social network)
IDF was a simple yet effective concept. The fact that we could extend it to get icf to obtain cluster summaries in the last phase of the project showed us new ways in which the original concepts can be extended.
The entire discussion on social networks was both interesting and informative. The fact that the small-world phenomenon is not so unlikely was surprising.
The discussion on Google paper helped realize that there are various issues that need to be tackled during implementation. The design and implementation play a key role in the performance of a Search Engine.
Recommendation systems extended the course beyond text retrieval and brought into focus that much more can be achieved from the web. It is fascinating how recommendation systems work.
The fact that the deep web consists of databases leaves a lot to be accomplished in terms of text retrieval. There need to be new ways to index these sources, thereby increasing the scope of search.
1. The idea of vector similarity and the concepts of bags of words/letters/shingles were nice. I understood that representing a document as a vector in a space with terms as its axes and finding similarity is the easiest way to make more money :). Though multiple criteria are added in a search engine, for most of the techniques vector similarity forms the base for improving performance.
2. Social networks have become predominant these days. Simple techniques like trust propagation can be used in many places apart from search engines. This has made me think a lot about the security aspect of social networks and trust management.
3. Clustering techniques like k-means and HAC actually play a vital role in a search engine, though this isn't obvious to an end user. Now I understand that showing one result from each cluster is mandatory if you don't know the context in which the user asks the query.
4. I understood that correlation methods along with clustering would actually improve search quality by profiling the user and getting the right documents that he/she is looking for.
5. Though caching and precomputing results are obvious ways to improve the performance of any code given the amount of data a search engine processes, I felt that Google's paper would convince anyone that their architecture is the best way to think about a search engine. I still wonder how a small/start-up search engine company could build such a massive architecture.
This course can precisely be described as a bit of everything: starting off with linear algebra (arghhhh ..), from information retrieval to Latent Semantic Indexing, a huge digression to social networks (which was a refreshing welcome, though), on to text classification, and finally information integration and information extraction.
Starting with the bag of words, finding tf-idf (with stemming and stop-word elimination techniques used to improve similarity) and going on to extend it to the Rocchio model and Jaccard similarity was interesting to learn. Also, extending the concept of tf-idf to icf in clustering taught us its possible applications. The idea of the forward index and inverted index helped a great deal and helped me increase my understanding gradually while implementing every phase of the project.
Social networks and their relation to IR were something that I was totally unaware of and was able to gather from this course. The idea of the small-world phenomenon and the understanding of betweenness centrality were some of the striking points that I found to be very useful and was also able to relate to my other courses.
The whole idea of the link structure of webpages and the way we can manage the entire link structure using authorities/hubs and the PageRank of every page, recomputing things whenever a new page gets indexed into the link structure, was quite a novel concept to learn. It also taught us how we can use vector similarity in authority/hub and PageRank computations, thereby establishing their interrelatedness. To add to it, the idea of clustering webpages based on their similarity further extended my understanding of vector similarity, and I could also learn how effective clusters of data are from an end-user point of view. The idea of content-based and collaborative filtering actually made me understand how most online shopping websites work.
Finally, a note on XQuery and information extraction, which helped us learn two completely different concepts: a sneak peek into databases and different ways of querying over them with the help of XQuery, and the idea of information extraction based on the mediator schema, LAV, GAV, etc. The whole idea of analyzing the intricacies involved in integrating the huge dataset on the web, and also handling the problems of schema mapping and linkage of old data (the NASA example discussed in class), was extremely thought-provoking.
1) Before this course, I had totally taken the Google search engine for granted. Reading the Google paper made me realize how an idea could be turned into reality. Combining vector similarity and PageRank to give results to queries sort of seems perfect, but there are many factors which can help give results faster and better, like query logs and query feedback.
2) When there are ambiguous queries like "bush", clustering seems to be an effective technique. The project gave a good insight into the clustering algorithms: KMeans gives faster but less accurate results, while Buckshot gives better results since the seed documents taken at the start are better.
3) The concept of the Zipfian distribution was interesting, especially the example of how it is used for detecting forged documents. Even though we talk so much about probability, in reality the occurrence of the digit 1 (as a leading digit) is much higher than the occurrence of 9. So this aspect should also be kept in mind while designing a search engine.
4) The long-tail concept was also quite interesting; for example, some sites, networks, etc. are specifically created to cater to a small percentage of the population (a niche), while other sites satisfy the majority of the population. We could also choose to develop search engines for the majority of the population or for a small niche, say computer scientists.
5) Content-based and collaborative filtering concepts are being widely used by popular sites like Amazon and YouTube. They can be used for selling online products and suggesting movies and songs. The idea behind collaborative filtering, of finding a person's twin (soul mate) and then giving recommendations based on it, was interesting.
It is really difficult to pick only five things from all those interesting topics; well, I'll give it a shot.
Though TF-IDF was a nontrivial concept, I knew it before joining this course. The topic that really caught my attention in this part was relevance feedback. It is really difficult to get feedback from users voluntarily, because nobody bothers. For example, recently we were all asked to complete teacher evaluations; though we do it in the end, we keep procrastinating the task. In this light, what methods like Rocchio do is amazing (the standard update is sketched after this comment).
LSI was a totally new concept to me. First, how important dimensionality reduction is; second, how it can be achieved with so little loss of information. It's really a pain that it has to be so computationally expensive.
The discussion of social networks brought a new dimension to the course (which was high-dimensional already). Concepts like "rare is not so rare" were incredible at first. Scale-free networks, their generation and their properties were a good learning experience.
I had only heard about PageRank before, but I practically implemented it in this course. I would further like to know what those other 240-odd metrics that Google uses for ranking nowadays are.
Content-based and collaborative filtering were again totally new concepts to me. After learning how difficult they are to implement, it is not surprising that perfect recommender systems still don't exist.
I know this is the sixth one, but I have to mention that concepts like the semantic web and mediator systems for information integration are really hard to believe.
I really think that topics like information extraction and information integration should have been discussed in greater detail. (I don't blame Dr. Rao; we just had such a short duration for this course.)
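For reference, here is the standard Rocchio update the comment above alludes to, with the usual α, β, γ weights over the sets of relevant (D_r) and non-relevant (D_nr) documents; the particular weight values are tuning choices, not part of the method:

```latex
\vec{q}_{\text{new}} = \alpha\,\vec{q}_{\text{orig}}
  + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
  - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```

In words: nudge the query vector toward the centroid of the documents the user marked relevant and away from the centroid of the non-relevant ones.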
1. I appreciate most the overall organization of this class, where various important topics, ranging from traditional ones such as the vector space model to more modern ones such as XML and information integration, are fused into one class. This can provide a big picture for students, and they can easily know what the state of the art is and what the challenges of IR research are.
2. I also appreciate the way this class was taught, where a lot of intuition was presented whenever possible. Intuition may be the best way to keep the ideas learned in this class longer in one's mind. This also helps in understanding why such ideas are useful in IR.
3. I also like the class projects, which are carefully designed and controlled. Through these projects, the ideas learned in the class are thoroughly understood. Moreover, students get hands-on experience with IR problems. On the other hand, the tedious processes of crawling and generating the inverted index are pre-built, and thus students do not need to spend their time on these. More importantly, the major features of modern search engines are all implemented in these projects.
4. Although I have been involved in research and classes in numerical linear algebra, I found the interpretation of matrix-vector multiplication given in class very intuitive and illuminating. It explains vividly why the repeated multiplication of a vector by a matrix can result in the principal eigenvector. This intuitive explanation also clearly shows why the eigen-gap determines the convergence rate of the algorithm, and under which condition the algorithm cannot converge to the principal eigenvector (restated briefly below).
5. The introduction of structure into IR as an intermediate level between traditional databases and unstructured IR gives a very clear understanding of why schemes such as XML and XQuery are designed the way they are. It is evident that these schemes are designed to accommodate both fully structured traditional databases and the modern IR paradigms.
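As a brief restatement of point 4, assuming A is diagonalizable with eigenvectors v_i, eigenvalues |λ1| > |λ2| ≥ ..., and a starting vector x_0 = Σ c_i v_i with c_1 ≠ 0:

```latex
A^{k} x_0 = \sum_i c_i \lambda_i^{k} v_i
          = \lambda_1^{k}\Big( c_1 v_1 + c_2 \big(\tfrac{\lambda_2}{\lambda_1}\big)^{k} v_2 + \cdots \Big)
```

The non-principal terms decay like (λ2/λ1)^k, so a larger eigen-gap means faster convergence; when |λ1| = |λ2| the iterate need not settle on v_1 at all.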
First of all, in IR almost every area we learn can be applied to any other area, and the areas work together. There are plenty of examples: the idea of classification is used when relevant pages are found; PageRank can be applied as a metric for page trustworthiness to decide the importance of sources in information integration; LSI can be extended to LDA in clustering; and in information integration, when coverage for each source is determined, query statistics are grouped using HAC.
In recommendation systems, the idea of how people can be influenced by others' opinions fascinated me. I have been surprised at how similar people's tastes can be when I see the recommendations on a website like Amazon. In order to improve this, I think more research involving psychological analysis is needed. Many variations may be possible based on which features are chosen for classification. When it comes to movies, as seen in the Netflix example, a lot of factors can affect a user's decision; in my case, for example, my favorite directors and actors are the primary factor in my choice of movies.
In the anatomy of how Google works, a lot of things like link structure, HTML tags, etc. are taken into account, and there is more to it than that. It's no wonder that lay users like me can be really satisfied by the results.
In information extraction and integration, a DB can be converted to HTML for search in the deep web. Also, HTML or text can be converted to XML for the semantic web, or to a DB table for query processing. Information can be converted in various ways according to the semantics of the need or the tools available up front.
Performance can be one of the most important factors in determining whether an algorithm is useful or not: even though the A/H algorithm works fine, it is not as popular as PageRank. Likewise, the K-Medoid clustering algorithm can help eliminate outliers, but it is not commonly used, for performance reasons.
At first I didn’t really appreciate or understand how LSI is useful considering the scaling problems and computation power required to do it. After we learned it in class it was reinforced by several practical ideas, most recently the Netflix article. This really drove the point home and I appreciated the “real life” touch that broke this idea out of being strictly academic for me.
Even though some questioned its usefulness, I really enjoyed the Google paper discussion. This paper was one of the highlights of the semester for me and I was thoroughly engrossed with the discussion that ensued.
One of the highlights, lecture-wise, was the social network lectures. The background and attention to detail in these lectures were quite good and made it a joy to attend them.
One of my favorite topics was that of the clustering and clustering techniques. These techniques have already proved useful in other topics and discussions outside of class, so this sort of one-two combination punch that went with them really illustrated their usefulness.
The other aspect of the class I enjoyed the most was the project. The lectures were quite fast paced and high level in my opinion, yet we were expected to gather low-level knowledge quickly. The projects helped reinforce this; I learned far more from the projects than from any of the homeworks.
It is very interesting that many intuitive algorithms can be mapped to basic mathematical concepts like linear algebra, set theory, etc. Especially how different formulations of the matrices in eigen analysis lead to different results like authorities/hubs, page rank, LSI, etc.
How the enormity of the web has turned out to be more of an advantage than a problem (collaborative filtering, trust propagation, etc).
How there is always a trade-off between different factors and how difficult it is to optimize all of them (robustness to adversarial attacks vs. random attacks, flexibility vs. amount of upfront work in query processing, etc.).
The difficulty of the problem of information extraction and its effect on information integration: how failure to map a single word or set of words can make us lose a whole source of information.
The nature of the web which causes the openness of most problems and the complete lack of absolute stable ground truth. And how heuristics lead to good results in most cases.
Thanks to Rao’s course, I now know why my vocabulary is so bad in spite of me trying so hard all these years.. reading more books is not helping me.. my mind is following the “principle of least effort”, and there’s Zipf’s law to corroborate that..
Let me get serious (like all others...)
The concept of the inverted index and its application to ranking documents helped me realize how a seemingly complicated problem like search could be simplified by representing data in the right way. The ever-increasing web could never be indexed and searched efficiently if not for the inverted index.
It was interesting to see how LSI/SVD could be used in dimensionality reduction. It got me comparing and contrasting it with the DCT and FFT that are used in image processing. The relation between the loss of variance and the eigen gap also got me pondering. It was pleasing to see some applications of otherwise dry linear algebra.
The explanation of what happens when a vector is multiplied by a matrix, with respect to the eigenvectors and eigenvalues of the matrix, was very interesting. The application of eigen decomposition in LSI was non-intuitive but very handy. The use of power iteration in general, and its use in authority and hub calculation in particular, was interesting.
The topics of knowledge representation using RDF and RDF Schema got me pondering over issues like representing exceptions to rules, and knowledge and deductions that depend on cultural differences.
I was impressed by:
1. the fact that abstract analysis of data (like LSI) can give us good correlations and generalized similarities between texts, though the extent to which this analysis can be applied is still in question (for me at least).
2. the ubiquity of scale-free networks and power laws, their connection, and also the simple models of how these networks can be constructed.
3. the fact that Google uses thousands of features for search.
1. Earlier I had imagined that search engines worked using 3 things: whether or not the word occurs in the page, page-rank and how close the query terms occurred to each other in a page. Seeing LSI and correlation actually being used to search for documents was interesting to me. Something that I knew in theory before (dimensionality reduction) being actually used in a service was quite exciting.
2. The engineering of a search engine was quite exciting, and that was something we had to do as a part of the project. Trying to optimize the algorithm for speed, as well as keeping a low memory footprint required a lot of effort and optimization. What was weird was that what would ordinarily appear to humans to take a lot of time (page rank – the actual power iteration) seemed to take less time for the computer than something very basic (disk access). Reading the google paper regarding how they solved these problems was enlightening.
3. The fact that trust can be propagated and distrust cannot was another interesting part in this course. I had never thought of how something that is intuitively justified by our common sense could be unintuitive when thought of from the mathematical angle. Without this course, I would have assumed that it was possible to propagate both trust and distrust via the trust propagation algorithm.
4. The semantic web and RDF are very exciting – specifically the thought that it might be possible for engines to one day process units of knowledge and come up with logical answers, and that too from the greatest repository of knowledge today – the internet. This is a very exciting field, since it makes me think that we will be able to have a computer that passes the Turing test if it has access to a semantic web.
1. With the semantic web, a page becomes more than a bag of words. The latent information in it starts becoming more apparent. In such a scenario, it's interesting to think about more appropriate indexing schemes, since keyword indexing would not capture the latent information. Indexing also needs to become more semantics-oriented (not just at the ontology level, but at the content level).
2. Connection of PageRank computation to MDP in terms of prioritized sweeping was interesting. Also, I am curious about the idea of doing PageRank on a smaller representative (sampled) network and using resulting values as seeds to reach faster convergence.
3. Learning from only positive examples (for wrapper induction) at first seemed counter-intuitive, as negative examples narrow down the hypothesis space. Connecting it to grammar acquisition in children was interesting. It seems to support the bias for learning minimalistic models (Occam's razor).
4. Focussed crawling is quite attractive from an engineering point of view. The idea that it can be implemented as A* search, although not particularly deep, is extremely neat.
1. The idea of using SVD to do dimensionality reduction: a mathematical tool is used to capture the most important features of an object, and in this way noise can also be effectively eliminated.
2. Benford's law can be used to detect made-up data in research papers. The reason is that the ten digits are not uniformly distributed as the first digit of a number.
3. PageRank cannot converge when the transition matrix M is not strongly connected. This can be solved by adding links from sink nodes to all the other pages.
4. The connection between the principal eigenvector and PageRank: the principal eigenvector of a matrix can be computed using the power iteration algorithm, which is what the PageRank algorithm is doing (a small sketch combining points 3 and 4 appears after this list).
5. In the Buckshot algorithm, to make the time complexity linear in n, only a sample of √n instances is taken as the input for HAC.
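A compact sketch tying points 3 and 4 together: power iteration on the random-surfer matrix, with sink (dangling) nodes given links to all pages as described above. The toy graph and the damping factor of 0.85 are illustrative assumptions, not fixed by the algorithm:

```python
import numpy as np

# Toy web graph: links[i] = pages that page i links to.  Page 3 is a sink (no outlinks).
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
n = len(links)

# Column-stochastic transition matrix: M[j, i] = probability of following a link i -> j.
M = np.zeros((n, n))
for i, outs in links.items():
    if outs:
        for j in outs:
            M[j, i] = 1.0 / len(outs)
    else:
        M[:, i] = 1.0 / n              # sink node: treat it as linking to every page

alpha = 0.85                           # damping factor (random reset with probability 0.15)
G = alpha * M + (1 - alpha) / n        # the reset term makes the chain irreducible

r = np.full(n, 1.0 / n)
for _ in range(100):                   # power iteration -> principal eigenvector of G
    r = G @ r
print(np.round(r, 3))
```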
1. Eigenvalues are everywhere – from PageRank and authority scores to LSI! I am still not sure whether things were designed that way or just turned out to be so. And the power iteration is such a bonus!
2. It’s a small world! Small world phenomenon and the Zipf’s law show that there exist patterns in the real world, in places that are not obvious.
3. We duplicate and store the same data in many forms – inverted index, forward index, lexicon, buckets. I realized, during the project, that this is a necessity while handling large scale corpora.
4. Recommendation systems – I had never imagined that these make such a difference in revenue for online shopping websites until I actually saw Netflix's competition to improve their recommendation system. The problem is definitely more complicated than Naïve Bayes, especially when there are eccentric movies that take extreme values – "Like it" or "Hate it" – and are seldom anywhere in between.
5. The google paper gave us a very good implementation level view of the whole search business. It helped to understand what is done during the query time, and what is done offline. It gave us enough background to think about using distributed architecture to improve performance.
Selecting relevant features is a more critical problem than actual classification or clustering in high-dimensional data. It would be interesting to see how classification and clustering perform as number of irrelevant features increase or decrease.
In data integration, the need to do horizontal data aggregation (i.e. retrieving relevant attributes from other tables) in the absence of join information poses interesting challenges on how different views can be joined.
Collaborative filtering used in recommender systems can also be applied in scenarios where the system needs to identify users who are trying to intentionally give bad ratings for a particular set of entries. Such users can be blocked from the system.
The way everything is modeled in the current web is based on the fact that "response time to the query is very critical". So all the approaches we discussed mostly rely on offline computation, and they make sense as well. But if we see the same problem in a different setup, where response time is not so critical and precision and recall have higher importance, the model needs to change radically.
The fact that there are "true dimensions" in data, and how LSI finds them and uses them to cluster the data, is very interesting. Initially, it is not very intuitive that clustering based on the true dimensions would produce clusters which make sense when we see them from the perspective of the real dimensions. It is also interesting to see how we pick different axes based on the task at hand (finding clusters or using it as a distance metric).
1. Among all of the topics, I like link analysis the most. I knew some of the related concepts before, but not in depth. Dr. Rao gave a thorough, detailed explanation of the authority/hub scores and PageRank algorithms and how they can be computed effectively. One more thing: the explanation of why eigenvectors can be computed via iterative multiplication is very clear and useful, and the explanation of why the eigen gap determines the rate of convergence is very descriptive and interesting.
2. I also like the topic of the Semantic Web and XML. Through the class, for the first time I learned how the web evolved from plain text to structured data, and from HTML to XML, and how information is organized and presented over the internet. The understanding of the Semantic Web could be very useful for my future research. One suggestion: since I am not very familiar with XML and RDF, I found it a little hard to understand them deeply. A short tutorial about some basic concepts (like a summary of those markup languages and schemas) could be very useful.
3. Social networks have recently become a hot research topic, and Dr. Rao provides a thorough introduction to them. I enjoyed Dr. Rao's explanation comparing scale-free networks and random networks. Before the class, I knew the two concepts, but I did not know their difference in robustness. The introduction to Zipf's law is interesting and practical; this concept/modeling can probably be used in my future research.
4. The three projects are another good part of the class. By doing the projects, I could systematically connect all of the learned terms/concepts together and understand them deeply. For example, by doing Project A, I could connect TF/IDF with inverted files and understand deeply why they are designed in such a way. Moreover, the introduction to efficient retrieval was also interesting and effective.
5. Overall, I like the organization of the class. Dr. Rao is extremely well prepared; I noticed he seldom turned around to look at the slides on the screen. I also enjoy his way of teaching by explaining the underlying intuition, which helps in understanding much, much better. And I feel that most of the time the whole class is very interactive, which helps students to better follow the lectures.
Nontrivial ideas I was able to appreciate during the course of the semester:
1. Vector similarity - the idea of using tf-idf to weight words so that important words, no matter how infrequently they occur, are given more weight.
2. Authorities and hubs - I found out how the link structure of the web affects the overall results produced, and how authority pages and hub pages influence the result of a query (a small sketch appears after this list).
3. PageRank - the idea of using the same link structure to rank pages based on their importance; how sink nodes affect the rank of a page and the results; and how to make a random surfer stick to a web page and follow web pages within a domain.
4. Link analysis - I found out how the whole world is a network of links and how trust is used to detect and eliminate spam.
5. Clustering - I appreciated the idea of using clustering to improve precision, and learnt how clustering can also be directed towards a group of users.
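A small sketch of the authority/hub computation mentioned in point 2: repeated multiplication by the adjacency matrix and its transpose, with normalization only to keep the numbers bounded. The toy link matrix is an invented example:

```python
import numpy as np

# A[i, j] = 1 if page i links to page j.
A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

auth = np.ones(A.shape[0])
hub = np.ones(A.shape[0])
for _ in range(50):
    auth = A.T @ hub               # good authorities are pointed to by good hubs
    hub = A @ auth                 # good hubs point to good authorities
    auth /= np.linalg.norm(auth)
    hub /= np.linalg.norm(hub)

print("authorities:", np.round(auth, 3))
print("hubs:       ", np.round(hub, 3))
```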
The first concept which amazed me was tf-idf weighting. The effectiveness of the heuristic is somewhat hidden by the simplicity of the computation. In this respect tf-idf is comparable to KMeans, where the computation is also simple but clusters are achieved in definite time.
Scale-free networks are another area which induced a great level of interest in me. Trivial as it may seem, the power law is present in most real-world scenarios, for example the Forbes rich list, the top 100 diggs on digg.com, etc.
Many concepts we discussed were like "oh! we just thought X was going to be added to make Y better or more efficient". For example, after the discussion on KMeans and HAC, the intuitive effect of combining the two to produce Buckshot is a clear winner.
Authority/hub computation, I feel, is a wonderful method of measuring prestige. One can go on to think of applying the authority/hub computation in the physical world to get information on who's popular where (not using the person's webpage, I mean).
Let's see if my profile increases in rank for what I'm posting here :D
Thank you all for the participation.
Cheers..
I am very interested in natural language processing, and hence I found the class on information integration to be the most interesting, since it came closest to touching on that topic.
Apart from that, the topics on social networks and how they have been applied on the web were very interesting too. They helped me understand the basics behind social networks such as Orkut and Facebook.
The clustering techniques (K-Means, Buckshot) and their vast applications have made me certain they can be extended to many more fields. Hence I think this is a very useful trade I have picked up in class.
Studying the Google search engine structure, I learnt that a lot more goes into it than just page ranking. It helped me get an idea of how the early engine used linking, crawling and other components together.
The topic of cosine similarity, and how it wins over other similarity measures in certain situations, would also prove very useful, as its applications are numerous.
1. That bag of words, in spite of being a "gross approximation", works very well. Till we have at least some sort of a working natural language processing engine, we will be using bag of words!
2. In a social network, which is a random network with a non-uniform distribution of degrees, one can drastically reduce the connectivity by deliberately taking out a few nodes. This fact is used not only for malicious purposes but also for good causes, such as disease prevention by quarantining super-spreaders. I learnt a medicine-related fact in a computer science class!
3. It was good to see a formal explanation for the phenomenon of 'rich get richer' in the form of the power law.
4. I was quite surprised to learn that the most popular word is twice as frequent as the second most popular word! I had always assumed the top few words would be equally likely. Also surprising was the revelation that Li (1992) showed that just randomly typing letters and spaces will lead to a "language" with a Zipfian distribution (a quick simulation of this appears after this list).
5. Because of this course, I think everyone, not just me, knows a lot more about Google. I experimented with Google a lot, and it was great to see in reality whatever I had studied in class. For example, in one of my experiments I juxtaposed a real word with some random letters:
Query: ghgjhpotterfjdfh
And this is the response I got
Did you mean: ghgjh potter fjdfh
However, for the Query
Query: ghgjhpotterfjdfhdjhdsjkdska
The response was
Did you mean: ghgjh potterfjdfhdjhdsjkdska
Thus, we see Google is implementing the concepts of Levenshtein distance et al. However, they still have a lot of work to do, as can be seen from the second experiment, where it cannot detect the words if the amount of junk goes beyond a certain limit.
6. Deep web: we can find data from the databases of big commercial vendors' websites without having direct access to them, by indirect methods. For example, one can find the number of books with more than 100 pages in Amazon.com's repository by methodically querying its web service and counting the results that have more than 100 pages.
This can be, and I am sure already is, used by companies against their competitors.
7. Google is more than PageRank. It is about "engineering".
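A quick way to reproduce the Li (1992) observation from point 4: generate text by hitting a few letter keys and the space bar at random, then look at word frequencies by rank. The alphabet size and text length are arbitrary choices; with so few letters the curve is step-like, but the power-law flavor is visible:

```python
import random
from collections import Counter

random.seed(0)
alphabet = "abcde "                      # a few letters plus the space bar
text = "".join(random.choice(alphabet) for _ in range(200000))
counts = Counter(text.split())           # "words" are the runs between spaces

ranked = [c for _, c in counts.most_common()]
# Frequencies fall off with rank in a rough, step-like power-law fashion.
for rank in (1, 2, 4, 8, 16):
    if rank <= len(ranked):
        print(rank, ranked[rank - 1])
```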
1. Seeing that ideas like bag of words work reasonably well, I found it interesting that we do not need exact solutions but only reasonable heuristics applied well. The fact that the web is not random, and that humans can be made mediators in challenging tasks, can also be leveraged well.
2. The concept of scale-free vs. random networks, and the realisation (and the flow of topics toward) how the human social network is modelled by scale-free networks. The DoS attack example and the military attack question (in the exam) were very useful in understanding the concept, its importance and potential applications.
3. The idea that logical inference can be done on web entities using RDF Schema is a really useful one. If it can be achieved, we would come quite close to realising the superhuman web secretary. However, I doubt people will provide schemas for their pages, and the alternative techniques seem difficult.
4. The realisation that Google is a lot more than the PageRank they pretend to be. Smart ideas like MapReduce, distributed indexes and other engineering ideas contribute as much, if not more.
5. The topic of social networks was particularly interesting, as it can potentially be used to model many problems: the fact that popularity begets popularity and how it changes the graph. The fact that a Zipfian curve exists in something like text too is amazing. It was also interesting to see how businesses could make money by catering to the long tail instead of, or in addition to, the short head (the wide-shoes example).
Overall, I appreciated the discussion of social networks most. At KDD'08, social networks seemed to be the hottest topic, and everyone was talking about them. This class covers many interesting topics in social networks. Personally, I wish we could have discussed more material on this topic.
Some other thoughts are as follows.
1. Simple approaches are always preferred. Many approaches discussed in the class are quite simple, but they perform well in practice. A typical example is the Naïve Bayes classifier, in which we make the independence assumption. Once I talked to a researcher from Siemens Medical who is also doing research in the field of machine learning and data mining. In his practice, simple supervised learning methods, such as SVM, are always preferred, and complex semi-supervised learning methods are not considered. If the label information is not enough, the solution is to obtain the label information first, even at some cost.
2. Dimensionality reduction is very important for large-scale data processing, especially the text document data. Several dimensionality reduction techniques are discussed in class, including PCA and LDA. Prof. Rao explained the idea of PCA very clearly and gave some intuitive ideas behind these models. For PCA, it is reduced to an eigenvalue problem. Generally speaking, many dimensionality reduction techniques can be reduced to eigenvalue problems, such as PCA, CCA (Canonical Correlation Analysis), PLS (Partial Least Squares), even some nonlinear mapping techniques proposed recently (e.g., LPP, i.e., Locality Preserving Projection). As a result, how to solve the large-scale eigenvalue problem is a very important issue in practice. In class we discuss a simple technique, power method. Some complex techniques, such as Lanczos algorithm can be applied. An alternative method is to transform the eigenvalue problem into some simple and equivalent formulation which can be solved efficiently. Currently I am doing research in this field. We have proved that under some very mild conditions, CCA and LDA can be transformed into an equivalent Least Squares problem, which can be solved very efficiently.
3. The discussion of social networks was amazing; many interesting phenomena were revealed in class, such as the small-world phenomenon and scale-free networks. At KDD 2008 I heard a lot about social networks, but I could not understand them well. Thanks to Prof. Rao's discussion, we now know some of the hottest topics.
4. I appreciate the discussion of the motivation for XML. XML bridges the community of database and data mining. Probably I should know more about database and XML in the future.
5. I like the course project. In the project, we are required to implement the ideas and models learned in class. In particular, some basic parts of the project are provided, such as link extraction and index construction by Lucene. Therefore, we can focus on the most essential parts, such as the vector space model and the PageRank algorithm. I am wondering whether we could also do a project on social networks.