Folks:
A large part of next Tuesday's class will be devoted to a discussion of the paper
http://rakaposhi.eas.asu.edu/cse494/google-paper.pdf
(this is the top one in the "overall search engine" category).
You should all
(1) read it and
(2) bring to class an at-most-one-page review with questions and/or observations about
the paper, given all that you know about how search engines could work.
Focus on the aspects of the architecture that surprised you
and/or differ from what we discussed in class.
(3) Also post your comments on the blog (as comments on this post), preferably before the class.
This will also serve as a good review of a lot of what we have done in the last several weeks.
The class itself will consist of
1. A short first part in which I will complete the discussion of TrustRank and efficient computation of PageRank
and
2. A discussion of the questions you bring up.
This class is mandatory, and so is your participation.
cheers
Rao
30 comments:
A first quick pass:
It is interesting to see that the general formula for computing the PageRank of a page minimizes the impact of untrustworthy "page farms" trying to increase their link-based importance.
A lot of effort is put into making the search engine efficient, with respect to data structures, OS parameters and memory issues.
One interesting point, as I understand it, is that Google gets an idf-like effect from its count-weights: the weights increase linearly with counts at first but taper off once the counts pass a limit, so very high-frequency terms get little extra credit.
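To make the tapering idea concrete, here is a toy count-weight function (purely illustrative; the paper does not give the actual curve): it grows roughly linearly for small counts and saturates for large ones, so very frequent terms stop gaining credit.

def count_weight(count, knee=8, cap=1.0):
    # Linear up to `knee` occurrences, then tapering toward `cap`;
    # the knee and cap values here are made up.
    if count <= knee:
        return cap * count / (2 * knee)
    return cap - cap * knee / (2 * count)

for c in (1, 4, 8, 50, 5000):
    print(c, round(count_weight(c), 3))   # the weight flattens out for large counts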
The concept of short and long barrels also seems to be a quicker way to answer queries.
Although we briefly discussed proximity in the context of focused crawling, this paper gives a more detailed view.
I am not quite clear on whether the weights are normalized or not.
I have some simple questions:
- What are the advantages of Big Files?
- How do the Google crawlers avoid revisiting a page? Checksum? Does the URLServer do this in advance? (One possible scheme is sketched after these questions.)
- What are the criteria to decide what pages are crawled?
- What is the advantage of using Python?
- Is the use of "barrels" in Google different from the usage explained in class?
- Has Google improved its mechanism of relevance feedback?
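On the question above about revisiting pages: the paper does not spell out the de-duplication scheme, so this is only a guess for illustration, but one simple approach is to fingerprint every URL (and optionally the fetched content) and skip anything already seen.

import hashlib

seen_urls, seen_content = set(), set()

def should_fetch(url):
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    if h in seen_urls:
        return False
    seen_urls.add(h)
    return True

def is_duplicate_page(html):
    # A content checksum catches the same page reached via different URLs.
    h = hashlib.md5(html.encode("utf-8")).hexdigest()
    if h in seen_content:
        return True
    seen_content.add(h)
    return False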
It is interesting to see that Google maintains two sets of inverted barrels. It first searches for the term in the title of the web page or in anchor text pointing to the page; only if the number of results thus obtained is too small is the actual text of the page taken into consideration.
Also, the paper mentions that the searcher stops once a certain number of matching documents (40,000) is retrieved. I wonder whether it still has a limit on the number of documents, or perhaps some other mechanism to reduce the number of pages for which similarity is computed.
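For what it's worth, here is how I picture the two-tier lookup (my own simplification, not actual Google code): consult the short barrel (title and anchor hits) first, and fall back to the full-text barrel only if too few documents matched. The 40,000 cutoff is from the paper; the fallback threshold is a made-up placeholder.

MAX_MATCHES = 40000            # cutoff mentioned in the paper
MIN_SHORT_RESULTS = 100        # made-up threshold for falling back to the full barrel

def lookup(term, short_barrel, full_barrel):
    # Each barrel maps term -> list of (doc_id, hit_list).
    docs = list(short_barrel.get(term, []))
    if len(docs) < MIN_SHORT_RESULTS:
        docs += full_barrel.get(term, [])
    return docs[:MAX_MATCHES]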
Figure 4, step 4, states that the doclists are scanned until all search terms are matched in a document.
Nowhere in the algorithm do they state that documents matching only part of the query are ranked.
Adding to my previous post, Google even now only searches for documents matching all query terms. This is different from the strategy taught in class.
It is hard to believe that Google's repository contains the full HTML of every web page, considering that there are a trillion unique URLs and the web is growing by many billions of web pages per day.
Categorizing hits into fancy hits and plain hits is interesting. Does it mean that fancy hits go into one barrel and the plain hits become part of the other, larger barrel? In plain hits, all positions higher than 4095 are labelled 4096; does this mean that text occurring late in the document is treated the same?
Does Google now use the concept of LSI?
1. Google improves precision by using the "link structure of the web" to evaluate the PageRank of each web page, and it "utilizes link to improve search results" - is there a difference between these two features?
2. How does Google take care of synonymy and polysemy?
3. How do they prevent the crawlers from going back and forth between the same links - avoiding infinite looping?
Google has no doubt succeeded in its aim of supporting "novel research activities" online, but Google's goal was also to "push development and understanding into the academic realm" of search engines (page 3). They claimed it to be a "black art" since search engine development took place at companies, with very little publication of the related technical details. How IRONIC is that claim looking at Google now!
This rare paper was published when Google was taking its first steps - I wonder how much of the architecture described still exists in the present day.
Looking at the formula for PageRank given in the Google paper
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
and the bowtie/butterfly structure of the web, it can be seen that pages that are unconnected to the rest of the world have PR = (1-d) because they have no inlinks. Also, pages that lie in the "IN" part (i.e. ones which only point to others but which no one points to) have PR = (1-d).
1. I wonder whether this is reasonable.
2. Would this amount to somewhat discounting the importance of hubs (the kind like survey papers) in a purely PageRank-based setting? It is possible (in the extreme case) that such an "IN" page is a hub that points to many important pages (but importance does not flow backwards). Whatever the case, such a page is more important than an isolated page.
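To check this observation, here is a quick iteration of the formula above on a made-up four-page graph (a toy sketch, not the actual algorithm); the two pages with no inlinks do indeed settle at PR = 1 - d = 0.15:

# Toy graph: page -> pages it points to
links = {
    "isolated": [],
    "in_page": ["a", "b"],   # points outward, but nothing points to it
    "a": ["b"],
    "b": ["a"],
}
d = 0.85
pr = {p: 1.0 for p in links}
for _ in range(50):
    new = {}
    for p in links:
        incoming = [q for q in links if p in links[q]]
        new[p] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    pr = new
print({p: round(v, 3) for p, v in pr.items()})
# -> "isolated" and "in_page" both end up at 0.15, while "a" and "b" reach 1.425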
I notice that they designed some special data structures to save space. I wonder whether space is still such a big issue for Google now when they design the system.
While reading the Ranking and Feedback sections, I was wondering how they knew when to stop changing the rank function, or even when they knew it needed to be changed. Is it all based on user feedback, or are there concrete benchmarks?
According to the PageRank formula, the damping factor d is the probability that at each page the "random surfer" will get bored and request another random page. How is this probability chosen? And how does it make it nearly impossible to deliberately mislead the system into a higher ranking?
In 2.2 Anchor Text,
...the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens. I wonder how this works.
It is interesting that when the hit list is computed, hits occurring in a URL, title, anchor text, or meta tag are fancy hits and everything else is a plain hit. What about font color, then?
Although it is not stated clearly in the paper, I think partial matching would be possible via the type-prox-weights, and LSI could be used when the IR score is measured.
@shruthi
Disconnected pages on the web will not receive a PageRank, because they will not be crawled by the search engine in the first place. And I think it is very reasonable to have a 1-d PageRank for pages that do not have any citations. This is the lowest PageRank that any page in the connected web can get, and we know that PageRank is biased towards authorities.
1. "PageRank corresponds to the principal eigen vector of the normalized link matrix of the web." How are these related at all?
It is very interesting how they have handled so many issues in web search with what little they had during their time.
What's interesting to note is:
1. Their primitive form of indexing. They had four major data structures to implement it: BigFiles, the repository, the document index, and the lexicon.
This, I feel, involves a lot of overhead and was not the most efficient way of indexing. But considering the time, they came up with something truly significant, paving the way for further improvements.
2. The PageRank computation uses a damping factor 'd', which is very important for reducing false importance given to pages.
3. How today's TF-IDF concept was implemented as the count-weights concept back then.
4. Even term correlation was handled in terms of hits occurring close together (type-prox-weights).
5. While the whole concept here is to scan until a match for all the terms is found in a document, this has since evolved into computing similarity with the documents and then ranking them using PageRank and clustering.
@Mithila
2. Polysemy of the query word can be highlighted by clustering the search results according to their semantics. Google does not do this; it clusters search results only by server name, to make sifting through results easier.
The top 10 query results for "cricket" show 9 results for the sport and 1 result for the insect, which shows that it concentrates on the most popular meaning of the query word. However, a more qualified query, "cricket insect", shows only documents that talk about the insect.
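As an aside, here is a tiny sketch of the kind of clustering I mean, using the cricket example (scikit-learn here is just a convenient way to illustrate the idea; it is not something Google does):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Made-up result snippets for the query "cricket".
snippets = [
    "cricket match score england australia wicket bowler",
    "cricket match bat ball batsman innings wicket",
    "cricket insect chirping sound in the grass",
    "the cricket is an insect related to grasshoppers",
]
X = TfidfVectorizer().fit_transform(snippets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # sport snippets and insect snippets should typically land in different clusters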
Coming to the questions,
1. I couldn't clearly understand how the fancy hits and plain hits are handled, or how they are used once the query comes in. The paper says they use two inverted barrels, so presumably the fancy hits are given more importance than the plain hits. Is this the right way to approach it?
2. And what is meant by recording all occurrences after 4095 as 4096? Is this the way TF-IDF was approached in the implementation? It does nothing but taper occurrences beyond a certain point. Is this an efficient way of doing it?
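On the 4095/4096 point, my reading is simply that the position field is capped, so all late occurrences collapse onto one value (a toy illustration, not the actual two-byte bit packing):

def encode_position(pos, cap=4096):
    # Positions higher than 4095 are all labelled 4096, as the paper says.
    return pos if pos < cap else cap

for p in (10, 4000, 4095, 4096, 250000):
    print(p, "->", encode_position(p))
# 4096 and 250000 encode to the same value, so hits late in a very long
# document really are treated the same, as asked above.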
@Vidya
Crawlers will crawl pages depending on the seed URLs. So, it is possible that they can crawl this isolated component. There are other techniques too, like web harvesting.
The bias of PageRank toward deeming authorities more important than hubs is what I was also trying to point out.
Also, the fact remains that isolated pages are not the same as hubs in terms of importance.
Some observations:
1. The paper is really engineering, not science. I mean that the main value was not in the ideas expressed there but in the Google system they created. Even now it is probably unclear whether such papers should be published before some practical evaluation.
2. From the very beginning, Google has been winning not only through PageRank but through the talent of its engineers, who could create a complex, reliable, distributed, scalable system that outperformed the big commercial competitors of the day.
3. The paper is a reminder of good news for those who want to compete with MS, Google, etc.: Moore's law outpaces the growth of web data, so cheap hardware will always be available to those with great ideas.
@Mithila and Vidya
I think Mithila is right in saying that the version of Google in the paper does not care about synonymy and polysemy - "In order to rank a document with a single word, Google looks at the document's hit list for THAT word".
However, this could be incorporated by rewriting the query as multiple queries over synonyms looked up from a thesaurus; WordNet is commonly used.
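A quick sketch of that kind of query rewriting with NLTK's WordNet interface (one possible way to do it, not what Google does; it needs the wordnet corpus downloaded first via nltk.download("wordnet")):

from nltk.corpus import wordnet

def expand_query(terms):
    expanded = set(terms)
    for t in terms:
        for syn in wordnet.synsets(t):
            for lemma in syn.lemma_names():
                expanded.add(lemma.replace("_", " ").lower())
    return expanded

print(expand_query(["car"]))   # adds synonyms such as "automobile", "auto", ...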
@Yang
And I also wonder why they still need compression, unless they use it for faster pattern (term) search. Compression and decompression also take time.
It seems that the system uses compression/decompression only during index creation, which is an offline process. Queries would be answered from the index itself, which has enough information contained in the hit lists.
1. Hidden web: One important aspect this paper misses is the hidden web and web services. Much of the useful and important data these days lives behind web services or in dynamic web pages that cannot be indexed. I couldn't find any mention of this in the paper.
Pages like JSP and ASP with form-based interfaces, where you can only query the services, are an important part of the web. In fact, I would say they are some of the most authoritative and reliable sources of information.
2. Structured information on the web: Apart from pure HTML, there is a lot of structured information (XML and relational databases) on the web in the form of tables, which can indeed give precise answers to some queries. Hybrid approaches based on the format of the underlying data, rather than a "one size fits all" approach, are the order of the day.
3. Node failures: Given the huge distributed setup they have, how they handle server failures and network bandwidth congestion is not mentioned at all.
@Anupam
I was talking about the runtime of the crawler and indexer. Even though the process is offline, compression and decompression still use up time. It seemed unnecessary to me, and I was wondering whether its only use is so that the indexer can pull terms out of a compressed page faster than out of an uncompressed one (in terms of linear search/pattern matching).
I was particularly interested in their solutions/hacks for employing parallelism in different processes:
1. Distributed indexing of documents into barrels: (what about data being skewed?)
The idea of a base lexicon (with 14 million words) seems reasonable, but there is a minor point. I suppose the wordID ranges for the barrels would not be constant but based on the distribution of words, in order to keep all the barrels more or less equal in size; that would have helped in the sorting phase. But what if the word distribution of the actual crawled data disagreed to a large degree with the expected distribution?
2. Distributed crawling
The benefit of distributing the crawling process depends on how little overlap there is among the crawlers, and it is an interesting problem to consider.
Performing duplicate detection is not a practical solution, even for an offline process. The architecture diagram seems to suggest that the entities called crawlers are merely fetchers (in the sense that they just fetch pages based on URLs from the URLServer). Given the list of URLs to be crawled (or rather fetched), the question remains: where did the URLServer get those URLs in the first place? Or, if URLs were being discovered as the pages were fetched, how did they ensure that different crawlers didn't run into each other's regions?
3. Proximity search
The system performs proximity search by scanning multiple hit lists in parallel (a small sketch of this kind of scan follows below). That possibly explains the high (1-10 second) response time, since they did not perform phrase detection up-front.
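Here is the small sketch of point 3 mentioned above: a merge-style parallel scan over two terms' position lists within one document (the data layout is invented for illustration):

def within_proximity(positions_a, positions_b, window=4):
    # Both lists are assumed sorted, as hit lists normally are.
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= window:
            return True
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False

print(within_proximity([3, 57, 120], [9, 200], window=10))   # True: positions 3 and 9 are close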
One of my questions is: how do you combine the different rankings? In this paper, several ranking signals are considered, including PageRank and the weighting for different types of data. In class we discussed a simple scheme for combining them, but how do you determine the weights for the different ranking systems, and for the different types of data? The weights have a significant influence on the performance of the search engine, so they should be chosen very carefully.
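For concreteness, the simple scheme from class is just a weighted sum of the individual signals; the weights below are made up, and tuning them well is exactly the open question raised above.

# Hypothetical weights, purely for illustration.
WEIGHTS = {"pagerank": 0.4, "ir_score": 0.4, "anchor_score": 0.2}

def combined_score(signals):
    return sum(WEIGHTS[name] * value for name, value in signals.items())

print(combined_score({"pagerank": 0.7, "ir_score": 0.5, "anchor_score": 0.9}))   # 0.66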
Another interesting topic is the implementation of the parser. I have tried to analyze HTML pages in the past and failed because of the many different kinds of errors in real pages. I used htmlParser, a tool for analyzing HTML pages; in practice it performed very poorly, because the great variety of errors often made my analyzer get stuck.
Does Google replicate its data center (as described in Figure 1) in many, many locations, so that we can obtain search results very quickly wherever we access the internet?
In the paper, they mention one of the failings of the vector space model, which we probably did not discuss in class: it favors shorter documents over longer ones even when both documents contain the same terms from the query with the same frequency. I figure it does so because similarity is basically the dot product between the query vector and the document vector divided by the magnitudes of the query and document vectors respectively. Hence, even when the dot products of the shorter and the longer documents with the query are equal, the longer document will still have lower similarity because its magnitude is higher.
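A small numeric check of this (toy vectors, made up for illustration): both documents contain the query terms with the same counts, but the longer one has extra terms, a larger norm, and therefore a lower cosine similarity.

import math

def cosine(q, d):
    dot = sum(q.get(t, 0) * d.get(t, 0) for t in q)
    nq = math.sqrt(sum(v * v for v in q.values()))
    nd = math.sqrt(sum(v * v for v in d.values()))
    return dot / (nq * nd)

query     = {"search": 1, "engine": 1}
short_doc = {"search": 2, "engine": 1}
long_doc  = {"search": 2, "engine": 1, "web": 3, "page": 4, "link": 2}

print(round(cosine(query, short_doc), 3))   # 0.949
print(round(cosine(query, long_doc), 3))    # 0.364, lower despite identical query-term counts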
One thing that I did not fully understand is how Google knows which sites (or perhaps URLs) should be crawled more frequently. Is there any human intervention? I would imagine that this is handled by the URLServer: if a page needs to be crawled more frequently, the URLServer gives its URL to a crawler more frequently. But how does it make these decisions?
In the Google query evaluation process, the paper mentions that once a certain number of matching documents are found, the searcher begins to sort them. As the paper itself notes, this might lead to sub-optimal results, but it doesn't offer any clear solution to the problem.