
Googleを支える技術 ~巨大システムの内側の世界 (WEB+DB PRESSプラスシリーズ)


Section 4.3, "Crawling the Web," of the same Web Search Engine paper recounts other interesting episodes as well.

The Anatomy of a Large-Scale Hypertextual Web Search Engine



While I was at it, I also read other parts of "The Anatomy of a Large-Scale Hypertextual Web Search Engine."

3.2 Differences Between the Web and Well Controlled Collections

The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database). On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well. For example, compare the usage information from a major homepage, like Yahoo's which currently receives millions of page views every day with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.

Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.

3.2 Differences between the Web and controlled collections of content



The point that web search and document search inside a company call for different considerations was also made in a session at Microsoft's Tech Ed last year.

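  • There are few links to begin with, so a mechanism like PageRank cannot be used.
  • Access control
  • Use of metadata that differs from the Web's (author, last modifier, Office document properties)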
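
To make the three points above concrete, here is a rough Python sketch. It is not taken from the Tech Ed session or from the Google paper. The `Document` class, the `search` function, the file names, and the group names are all hypothetical, and `pagerank` is a plain textbook power iteration with an assumed damping factor of 0.85. It only illustrates that a collection with no links gives every document the same score, and that an enterprise search has to filter by access rights and can filter by metadata such as the author.

```python
from dataclasses import dataclass, field


def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {page: [outgoing pages]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank


@dataclass
class Document:
    """Hypothetical enterprise document with metadata and an access list."""
    name: str
    text: str
    author: str
    allowed_groups: set = field(default_factory=set)


def search(docs, query, user_groups, author=None):
    """Return names of matching documents that the user is allowed to read."""
    hits = []
    for doc in docs:
        if not (doc.allowed_groups & user_groups):
            continue  # access control: skip documents the user cannot read
        if author is not None and doc.author != author:
            continue  # metadata filter: restrict to one author
        if query.lower() in doc.text.lower():
            hits.append(doc.name)
    return hits


# A file share with no links between documents: all scores come out equal,
# so the ranking cannot tell the documents apart.
unlinked = {"spec.docx": [], "budget.xlsx": [], "notes.txt": []}
scores = pagerank(unlinked)

docs = [
    Document("spec.docx", "search design spec", "sato", {"engineering"}),
    Document("budget.xlsx", "search budget", "tanaka", {"finance"}),
]
# A user in the engineering group only sees the document shared with that group.
visible = search(docs, "search", user_groups={"engineering"})
```

With the unlinked collection every entry in `scores` is the same value, which is the sense in which a link-based ranking has nothing to work with. A real system would presumably apply the access check inside the index rather than after matching, but that detail is outside what the session summary above says.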

