Very interesting article in today’s NYT Business section about Google’s relevancy ranking algorithms: “Google Keeps Tweaking Its Search Engine” by Saul Hansell, New York Times, June 3, 2007. (Annoyingly, WordPress.com doesn’t let me put a COinS in my blog post! Argh! Sorry.)
This article has a subtext (well, not too sub) about how insanely awesome Google is and how far ahead of everyone else they are. No doubt getting press like that is part of the reason Google gave the reporter access to this department, which is usually cloaked in trade secrecy.
Still, that’s definitely part of the story. It’s important to remember/realize that Google’s relevancy ranking algorithms are very sophisticated and complex, and constantly getting more so, in order to give us the simplicity of the good results we see. Our simplistic conception of ‘page rank’ is just one increasingly small part of the whole set of algorithms. So, no, we can’t “just copy what Google does” (not least, but not only, because we are dealing with a different data domain than Google).
The solution to what we need isn’t just waiting out there in the open for us to copy. The solution(s) are waiting for us to discover and invent. On the other hand, of course we want to pay attention to what we can learn from Google and what Google does (in broad principles and, where we can figure them out, specific details) as we work out our own.
Some choice quotes:
Google’s approach to search reflects its unconventional management practices. It has hundreds of engineers, including leading experts in search lured from academia, loosely organized and working on projects that interest them. But when it comes to the search engine — which has many thousands of interlocking equations — it has to double-check the engineers’ independent work with objective, quantitative rigor to ensure that new formulas don’t do more harm than good.
As always, tweaking and quality control involve a balancing act. “You make a change, and it affects some queries positively and others negatively,” Mr. Manber says. “You can’t only launch things that are 100 percent positive.”
Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company’s black boxes, he says, he was surprised that Google’s methods were so far ahead of those of academic researchers and corporate rivals.
“I spent the first three months saying, ‘I have an idea,’ ” he recalls. “And they’d say, ‘We’ve thought of that and it’s already in there,’ or ‘It doesn’t work.’ ”
Mr. Singhal has developed a far more elaborate system for ranking pages, which involves more than 200 types of information, or what Google calls “signals.” PageRank is but one signal. Some signals are on Web pages — like words, links, images and so on. Some are drawn from the history of how pages have changed over time. Some signals are data patterns uncovered in the trillions of searches that Google has handled over the years.
Once Google corrals its myriad signals, it feeds them into formulas it calls classifiers that try to infer useful information about the type of search, in order to send the user to the most helpful pages. Classifiers can tell, for example, whether someone is searching for a product to buy, or for information about a place, a company or a person. Google recently developed a new classifier to identify names of people who aren’t famous. Another identifies brand names.
These signals and classifiers calculate several key measures of a page’s relevance, including one it calls “topicality” — a measure of how the topic of a page relates to the broad category of the user’s query. A page about President Bush’s speech about Darfur last week at the White House, for example, would rank high in topicality for “Darfur,” less so for “George Bush” and even less for “White House.” Google combines all these measures into a final relevancy score.
In the end, it’s hard to gauge exactly how advanced Google’s techniques are, because so much of what it and its search rivals do is veiled in secrecy. In a look at the results, the differences between the leading search engines are subtle, although Danny Sullivan, a veteran search specialist and blogger who runs Searchengineland.com, says Google continues to outpace its competitors.
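To make the quoted description of signals and classifiers a bit more concrete, here is a toy sketch of the general idea: a classifier guesses the type of query, and that guess changes how per-page signals are weighted into one final relevancy score. This is emphatically not Google’s method (which, as the article says, is secret). Every name here (`Page`, `classify_query`, `WEIGHTS`, `relevancy_score`, `rank`), the three signals, and all the weights are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    signals: dict  # hypothetical signal name -> value between 0 and 1


def classify_query(query: str) -> str:
    """Toy classifier: guess the type of search from keywords (an assumption)."""
    words = query.lower().split()
    if "buy" in words or "price" in words:
        return "product"
    return "informational"


# Made-up weights: each query type weights the same signals differently.
WEIGHTS = {
    "product": {"topicality": 0.5, "link_score": 0.2, "freshness": 0.3},
    "informational": {"topicality": 0.6, "link_score": 0.3, "freshness": 0.1},
}


def relevancy_score(page: Page, query: str) -> float:
    """Combine the page's signals into one score as a weighted sum."""
    weights = WEIGHTS[classify_query(query)]
    return sum(w * page.signals.get(name, 0.0) for name, w in weights.items())


def rank(pages: list, query: str) -> list:
    """Return pages ordered from highest to lowest relevancy score."""
    return sorted(pages, key=lambda p: relevancy_score(p, query), reverse=True)
```

Even in this tiny version, the same two pages can come out in a different order depending on how the query is classified, which hints at why a change that helps some queries can hurt others.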