Technologies behind Google search ranking - Diary of m-tanaka, a PM around forty

From the Google Official Blog
http://googleblog.blogspot.com/2008/07/technologies-behind-google-ranking.html



Technologies behind Google ranking

 

7/16/2008 10:53:00 AM 

In my previous post, I introduced the philosophies behind Google ranking. As part of our effort to discuss search quality, I want to tell you more about the technologies behind our ranking. The core technology in our ranking system comes from the academic field of Information Retrieval (IR). The IR community has studied search for almost 50 years. It uses statistical signals of word salience, like word frequency, to rank pages. (See "Modern Information Retrieval: A Brief Overview" for a quick overview of IR technology.) IR gave us a solid foundation, and we have built a tremendous system on top using links, page structure, and many other such innovations.

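Technologies behind Google search ranking
7/16/2008 10:53:00 AM
I introduced the philosophy behind Google's ranking in my previous post. As part of our effort to talk about search quality, I would like to say a little more about the technology behind our ranking. The core technology of our ranking system comes from the academic field of information retrieval (IR). The IR community has been studying search for nearly 50 years. To rank pages, it uses statistical features of words, such as how often a word appears. (See "Modern Information Retrieval: A Brief Overview", http://singhal.info/ieee2001.pdf.) On the solid foundation that IR provides, we have built a large system that uses links, page structure, and many other innovations.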
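
As a rough illustration of ranking by word-frequency statistics, here is a minimal TF-IDF sketch. It is a textbook IR weighting, not Google's ranking function; the tokenizer, the smoothing formula, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for this toy example.
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Smoothed inverse document frequency (one common variant).
            idf = math.log((n_docs + 1) / (df[term] + 1)) + 1.0
            score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# Hypothetical mini-corpus.
docs = [
    "coffee beans roasted fresh coffee",
    "kofi annan biography",
    "fresh bread and tea",
]
scores = tf_idf_scores("coffee beans", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranking[0]` is the index of the best-matching document. Signals such as links and page structure, which the post mentions, are outside the scope of this sketch.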



Search in the last decade has moved from "give me what I said" to "give me what I want." User expectations from search have rightly increased. We work hard to fulfill the expectations of each and every user, and to do that we need to better understand the pages, the queries, and our users. Over the last decade we have pushed the technologies for understanding these three components (of the search process) to completely new dimensions.

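Over the past ten years, search has moved from "give me what I said" to "give me what I want." Users' expectations of search keep growing. To meet the expectations of each and every user, we have to work at understanding pages, queries, and users more deeply. Over these ten years we have pushed the technologies for understanding these three components of the search process to an entirely new dimension.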



When we talk about queries at Google, we use square brackets [ ] to mark the beginning and end of queries (see "How to write queries" by Matt Cutts), a notation I will use throughout this post. (Pages and search results change frequently, so in time, some examples used here may not behave as explained.)

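When we write about queries at Google, we enclose the query in square brackets [ ] to mark where it begins and ends (see "How to write queries" by Matt Cutts). I will use this notation in this post as well. (Pages and search results change frequently, so some of the examples in this post may no longer behave as described.)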



Understanding pages: Over the years we have invested heavily in our crawl and indexing system. As a result we have a very large and very fresh index. In addition to size and freshness, we have improved our index in other ways. One of the key technologies we have developed to understand pages is associating important concepts with a page even when they are not obvious on the page. We find the official homepage for Sprovieri Gallery in London for the Italian query [galleria sprovieri londra], even though the official page does not have either London or Londra on it. In the U.S., a user searching for [cool tech pc vancouver, wa] finds the homepage www.cooltechpc.com even though the page does not mention anywhere that they are in Vancouver, WA. Other technologies we have developed include distinctions between important and less important words in the page and the freshness of the information on the page.

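Understanding pages: Over many years we have invested heavily in our crawling and indexing systems. As a result, we have been able to build a very large and very fresh index. Beyond size and freshness, we have improved the index in other ways as well. One of the key technologies we developed for understanding pages associates important concepts with a page even when they do not appear on the page itself. For the Italian query [galleria sprovieri londra], we find the official homepage of the Sprovieri Gallery in London even though neither London nor Londra is written on that page. In the U.S., for the query [cool tech pc vancouver, wa], we find www.cooltechpc.com even though nothing on the page says the shop is in Vancouver, WA. Other technologies distinguish important words on a page from less important ones and gauge the freshness of the information.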



Understanding queries: It is critical that we understand what our users are looking for (beyond just the few words in their query). We have made several notable advances in this area including a best-in-class spelling suggestion system, an advanced synonyms system, and a very strong concept analysis system. 

Most users have used our spelling suggestion system at one time or another. It knows that someone searching for [kofee annan] is really searching for Mr. Kofi Annan, and prompts "Did you mean: kofi annan", whereas someone searching for [kofee beans] is actually looking for coffee beans. Doing this internationally with very high accuracy is hard, and we do it well.

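Understanding queries: It is very important for us to understand what users are looking for, beyond just the words in the query. We have made notable advances in this area, including a spelling suggestion system, an advanced synonyms system, and a strong concept analysis system.
Most users have used the spelling suggestion system at some point. It recognizes that the query [kofee annan] is really a search for Mr. Kofi Annan and displays "Did you mean: kofi annan", while it recognizes that [kofee beans] is really a search for coffee beans. Doing this internationally and with high accuracy is very hard, and we do it well.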
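
The [kofee annan] versus [kofee beans] example shows that the right correction depends on the other words in the query. The sketch below illustrates that idea with edit distance plus word-pair counts. It is only a toy under stated assumptions: `VOCAB` and `PAIR_COUNTS` are hypothetical data invented for the example, and nothing here reflects how Google's system actually works.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


# Hypothetical toy data: known words and how often word pairs co-occur.
VOCAB = {"kofi", "coffee", "annan", "beans"}
PAIR_COUNTS = {("kofi", "annan"): 50, ("coffee", "beans"): 80}


def suggest(query, max_dist=2):
    """Replace unknown words with the nearby vocabulary word that best fits the context."""
    words = query.lower().split()
    corrected = []
    for i, word in enumerate(words):
        if word in VOCAB:
            corrected.append(word)
            continue
        context = words[:i] + words[i + 1:]
        candidates = sorted(v for v in VOCAB if edit_distance(word, v) <= max_dist)
        if not candidates:
            corrected.append(word)  # nothing close enough; keep the word as typed
            continue

        def support(cand):
            # How often the candidate co-occurs with the other query words.
            return sum(
                PAIR_COUNTS.get((cand, c), 0) + PAIR_COUNTS.get((c, cand), 0)
                for c in context
            )

        corrected.append(
            max(candidates, key=lambda c: (support(c), -edit_distance(word, c)))
        )
    return " ".join(corrected)
```

Both "kofi" and "coffee" are two edits away from "kofee", so the edit distance alone cannot decide; the neighbouring word ("annan" or "beans") breaks the tie.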



Synonyms are the foundation of our query understanding work. This is one of the hardest problems we are solving at Google. Though sometimes obvious to humans, it is an unsolved problem in automatic language processing. As a user, I don't want to think too much about what words I should use in my queries. Often I don't even know what the right words are. This is where our synonyms system comes into action. Our synonyms system can do sophisticated query modifications, e.g., it knows that the word 'Dr' in the query [Dr Zhivago] stands for Doctor whereas in [Rodeo Dr] it means Drive. A user looking for [back bumper repair] gets results about rear bumper repair. For [Ramstein ab], we automatically look for Ramstein Air Base; for the query [b&b ab] we search for Bed and Breakfasts in Alberta, Canada. We have developed this level of query understanding for almost one hundred different languages, which is what I am truly proud of.

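Synonyms are the foundation of our query understanding. This is one of the hardest problems at Google. Although the answer is sometimes obvious to a human, it remains an unsolved problem in automatic language processing. Users do not want to think hard about which words to put in a query. Sometimes they do not even know which words are right. This is where our synonyms system helps. It makes sophisticated adjustments to the query. For example, it recognizes that Dr in [Dr Zhivago] means Doctor while Dr in [Rodeo Dr] means Drive. A user who searches for [back bumper repair] also gets results about rear bumper repair. For [Ramstein ab] we return results for Ramstein Air Base, and for [b&b ab] results for bed and breakfasts (lodging with breakfast included) in Alberta, Canada. We do this level of query analysis for nearly one hundred languages, which is something I am very proud of.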
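
The point of the Dr and ab examples is that an expansion is chosen by context, not by a fixed one-to-one table. A minimal sketch of that idea follows, assuming a hypothetical hand-written table `CONTEXT_EXPANSIONS` keyed on (word, context word). A real system would presumably learn such associations from data; this table only encodes the examples from the post.

```python
# Hypothetical rules: (word, another word in the same query) -> expansion.
CONTEXT_EXPANSIONS = {
    ("dr", "zhivago"): "doctor",
    ("dr", "rodeo"): "drive",
    ("back", "bumper"): "rear",
    ("ab", "ramstein"): "air base",
    ("ab", "b&b"): "alberta",
    ("b&b", "ab"): "bed and breakfast",
}


def expand_query(query):
    """Return, for each query word, the word itself plus any context-specific alternatives."""
    words = query.lower().split()
    expanded = []
    for i, word in enumerate(words):
        alternatives = [word]
        for other in words[:i] + words[i + 1:]:
            expansion = CONTEXT_EXPANSIONS.get((word, other))
            if expansion and expansion not in alternatives:
                alternatives.append(expansion)
        expanded.append(alternatives)
    return expanded
```

For example, `expand_query("Dr Zhivago")` offers "doctor" as an alternative for "dr", while `expand_query("Rodeo Dr")` offers "drive" instead.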



Another technology we use in our ranking system is concept identification. Identifying critical concepts in the query allows us to return much more relevant results. For example, our algorithms understand that in the query [new york times square church] the user is looking for the well-known church in Times Square and not for articles from the New York Times. We don't just stop at identifying concepts; we further enhance the query with the right concepts when, for instance, someone looking for [PC and its impact on people] is in fact looking for impact of computers on society, or someone who searches for [rainforest instructional activities for vocabulary] is really looking for rain forest lesson plans. Our query analysis algorithms have many such state-of-the-art techniques built into them, and once again, we do this internationally in almost every language we serve.

Another technology used in our ranking system is concept identification. Identifying the concepts in a query lets us return more relevant results. For example, our algorithms recognize that a user who types [new york times square church] is looking for the well-known church in Times Square, not for articles from the New York Times. We do not stop at identifying concepts; we also enhance the query in line with them. For instance, a search for [PC and its impact on people] is really a search for the impact of computers on society, and a search for [rainforest instructional activities for vocabulary] is really a search for rain forest lesson plans. Our query analysis system has many such state-of-the-art techniques built in, and, once again, they are built into almost every language we serve.
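
One way to picture the [new york times square church] example is as query segmentation: choosing how to split the query into phrases so that the total phrase score is highest. The sketch below uses dynamic programming over a hypothetical `PHRASE_SCORES` table. The scores are invented for illustration, and this is one plausible technique, not a description of Google's algorithm.

```python
# Hypothetical phrase scores; higher means a stronger known concept.
PHRASE_SCORES = {
    ("new", "york"): 5.0,
    ("new", "york", "times"): 6.0,
    ("times", "square"): 5.0,
    ("times", "square", "church"): 9.0,
}


def segment(query, max_len=3):
    """Split the query into phrases so that the sum of phrase scores is maximal."""
    words = tuple(query.lower().split())
    n = len(words)
    # best[i] holds (score, segmentation) for the first i words.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        options = []
        for start in range(max(0, end - max_len), end):
            piece = words[start:end]
            # Multi-word pieces are only allowed if they are known phrases.
            if len(piece) > 1 and piece not in PHRASE_SCORES:
                continue
            prev_score, prev_seg = best[start]
            score = prev_score + PHRASE_SCORES.get(piece, 0.0)
            options.append((score, prev_seg + [" ".join(piece)]))
        best[end] = max(options, key=lambda option: option[0])
    return best[n][1]
```

With these scores, "new york" plus "times square church" (5 + 9) beats "new york times" followed by the leftover words "square" and "church" (6 + 0 + 0), which matches the interpretation described in the post.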



Understanding users: Our work on interpreting user intent is aimed at returning results people really want, not just what they said in their query. This work starts with a world class localization system, and adds to it our advanced personalization technology, and several other great strides we have made in interpreting user intent, e.g. Universal Search.

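Understanding users: The aim of interpreting user intent is to return the results users really need, not just what they specified in the query. This work starts with a world-class localization system and adds our advanced personalization technology and other major steps in interpreting user intent, such as Universal Search.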



Our clear focus on "best locally relevant results served globally" is reflected in our work on localization. The same query typed in multiple countries may deserve completely different results. A user looking for [bank] in the US should get American banks, whereas a user in the UK is either looking for the Bank Fashion line or for British financial institutions. The results for this query should return local financial institutions in other English speaking countries like Australia, Canada, New Zealand, South Africa. The fun really starts when this query is typed in non-English-speaking countries like Egypt, Israel, Japan, Russia, Saudi Arabia, Switzerland. Likewise the query [football] refers to entirely different sports in Australia, the UK, and the US. These examples mostly show how we get the localized version of the same concept correctly (financial institution, sport, etc.). However, the same query can mean entirely different things in different countries. For example, [Cote d'Or] is a geographic region in France - but it is a large chocolate manufacturer in neighboring French-speaking Belgium; and yes, we get that right too :-).

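Our clear goal of "the best locally relevant results, served globally" is reflected in our localization work. The same query entered in different countries may deserve completely different results. A search for [bank] in the US returns results about American banks, whereas in the UK it returns results about Bank Fashion or about British financial institutions. In other English-speaking countries (Australia, Canada, New Zealand, South Africa, and so on), the query likewise returns results about that country's financial institutions. Things get more interesting when the same query is entered in non-English-speaking countries (Egypt, Israel, Japan, Russia, Saudi Arabia, Switzerland, and so on). Similarly, the query [football] means entirely different sports in Australia, the UK, and the US. These examples show how we correctly pick out the localized version of the same concept (financial institution, sport, and so on). The same query can also mean entirely different things depending on the country. For example, [Cote d'Or] is a region of France, but it is also a large chocolate manufacturer in Belgium. Of course, we handle this correctly too.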



Personalization is another strong feature in our search system which tailors search results to individual users. Users who are logged-in while searching and have signed up for Web History get results that are more relevant for them than the general Google results. For example, someone who does a lot of football-related searches might get more football-related results for [giants], while other users might get results related to the baseball team. Similarly, if you tend to prefer results from a particular shopping site, you will be more likely to get results from that site when you search for products. Our evaluation shows that users who get personalized results find them to be more relevant than non-personalized results.

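Personalization is another important feature, one that tailors search results to individual users. Users who are logged in and have signed up for Web History get results that are more relevant to them than the general Google results. For example, a user who does many football-related searches may get football-related results for [giants], while other users may get results related to the baseball team. Likewise, if you tend to prefer results from a particular shopping site, more of your product search results will come from that site. Our evaluation shows that users who receive personalized results find them more relevant than non-personalized results.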
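
The [giants] example can be pictured as re-ranking: results whose topic matches the user's search history get a boost. The following is a minimal sketch under assumptions. The result tuples, topic labels, base scores, and the `boost` parameter are all hypothetical; the post does not say how personalization is actually computed.

```python
def personalize(results, history_topics, boost=0.5):
    """Re-rank (title, topic, base_score) results using per-topic history counts."""
    total = sum(history_topics.values())

    def score(result):
        _title, topic, base_score = result
        # Share of the user's past searches that fall under this topic.
        share = history_topics.get(topic, 0) / total if total else 0.0
        return base_score * (1.0 + boost * share)

    return sorted(results, key=score, reverse=True)


# Hypothetical results for the query [giants].
results = [
    ("San Francisco Giants", "baseball", 1.0),
    ("New York Giants", "football", 0.9),
]

football_fan = personalize(results, {"football": 9, "baseball": 1})
no_history = personalize(results, {})
```

For the football-heavy history the football result moves to the top; with no history the original order is kept.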



Another case of user intent can be observed for the query [chevrolet magnum]. Magnum is actually made by Dodge and not Chevrolet. So we present the results for Dodge Magnum with the prompt See results for: dodge magnum in our result set.

User intent can also be seen in the query [chevrolet magnum]. The Magnum is made by Dodge, not Chevrolet. So we show results for the Dodge Magnum along with the prompt "See results for: dodge magnum".



Our work on Universal Search is another example of how we interpret user intent to give them what they (sometimes) really want. Someone searching for [bangalore] not only gets the important web pages, they also get a map, a video showing street life, traffic, etc. in Bangalore -- watching this video I almost feel I am there :-) -- and at the time of writing there is relevant news and relevant blogs about Bangalore.

Finally let me briefly mention the latest advance we have made in search: Cross Language Information Retrieval (CLIR). CLIR allows users to first discover information that is not in their language, and then using Google's translation technology, we make this information accessible. I call this advance: give me what I want in any language. A user looking for Tony Blair's biography in Russia who types the query in Russian [Тони Блэр биография] is prompted at the bottom of our results to search the English web with:

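Our work on Universal Search is another example of understanding what users really want. A search for [bangalore] returns not only the important web pages but also a map and a video showing street life, traffic, and so on in Bangalore. Watching the video, you almost feel you are there.
Finally, let me mention the latest advance in search: Cross Language Information Retrieval (CLIR). CLIR lets users discover information that is not in their own language, and Google's translation technology then makes that information accessible. I call this advance "give me what I want in any language." A user in Russia who wants Tony Blair's biography searches with the query [Тони Блэр биография], and a prompt to search the English web appears at the bottom of the results.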



Similarly a user searching for Disney movie songs in Egypt with the query [أغاني أفلام ديزني] is prompted to search the English web. We are very excited about CLIR as it truly brings us closer to our mission to organize the world's information and make it universally accessible and useful.

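Similarly, a user in Egypt who searches for Disney movie songs is shown the prompt to search the English web. We are very excited about CLIR because it brings us closer to our mission of organizing the world's information and making it universally accessible and useful.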



I could go on and on showing examples of state-of-the-art technology that we have developed to make our ranking system as good as it is, but the fact is that search is nowhere close to being a solved problem. Many queries still don't get satisfactory results from Google, and each such query is an opportunity to improve our ranking system. I am confident that with numerous techniques under development in our group, we will make large improvements to our ranking algorithms in the near future.

I hope my two posts about Google ranking have made it clear that we live and breathe search, and we are more passionate than ever about it. Our fervor for serving all our users worldwide is unprecedented. We pride ourselves on running a very good ranking system, and are working incredibly hard every day to make it even better.

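I could go on introducing the state-of-the-art technology we have developed to make our ranking system as good as it is, but in reality search is far from a solved problem. For many queries Google still cannot return satisfactory results, and each of those queries is an opportunity to improve the ranking system. I am confident that the many techniques under development in our team will bring large improvements to the ranking system in the near future.
I hope my two posts about Google ranking have made it clear that we live and breathe search and that we are more passionate about it than ever. Our enthusiasm for serving users around the world is unprecedented. We take pride in running a very good ranking system, and we work hard every day to make it even better.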

See also
"The technology behind Google ranking": Google explains on its blog
http://japan.cnet.com/marketing/story/0,3800080523,20377405,00.htm