presenting the structured data found on the web to its users. By
"structured data," Halevy was referring to the databases of the "deep
web" - those internet resources that sit behind forms and site-specific
search boxes, unable to be indexed through passive means.
Google's Deep Web Search
Halevy, who heads the "Deep Web" search initiative at Google,
described the "Shallow Web" as containing about 5 million web pages
while the "Deep Web" is estimated to be 500 times the size. This hidden
web is currently being indexed in part by Google's automated systems
that submit queries to various databases, retrieving the content found
for indexing. In addition to that aspect of the Deep Web - dubbed "vertical searching" - Halevy also referenced two other types of Deep Web Search: semantic search and product search.
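As a rough illustration of the query-submission idea described above, the sketch below builds the URLs a crawler might request from a form-backed database. It is a minimal sketch, not Google's actual method: the function name `candidate_urls`, the example form fields, and the `limit` cap are all assumptions made for illustration, and it only covers forms that submit via GET.

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action_url, field_values, limit=100):
    """Build GET URLs for combinations of candidate form values.

    action_url   -- the form's target URL (hypothetical example below)
    field_values -- dict mapping each form field name to values to try
    limit        -- cap on generated URLs, so a form with many fields
                    does not explode into millions of requests
    """
    names = sorted(field_values)  # fixed order keeps the output deterministic
    combos = product(*(field_values[name] for name in names))
    return [
        f"{action_url}?{urlencode(dict(zip(names, combo)))}"
        for combo in islice(combos, limit)
    ]


# Hypothetical used-car search form with two inputs.
urls = candidate_urls(
    "http://example.com/search",
    {"make": ["ford", "honda"], "year": ["2008"]},
)
```

Each resulting URL could then be fetched and the returned page handed to the normal indexing pipeline; choosing which values to try is the hard part and is not addressed here.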
Google also wants to be able to retrieve the data found in
structured tables on the web, said Halevy, citing a table on a page
listing the U.S. presidents as an example. There are 14 billion such
tables on the web, and, after filtering, about 154 million of them are
interesting enough to be worth indexing.
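Halevy did not say how the filtering is done. Purely as an illustration of why most HTML tables drop out, the sketch below applies a few simple checks that separate data tables from tables used for page layout. The function name `looks_relational` and its thresholds are assumptions, not Google's criteria.

```python
def looks_relational(rows, min_rows=3, min_cols=2):
    """Guess whether a table (a list of rows of cell strings) holds data.

    Hypothetical heuristics: enough rows and columns, every row the same
    width, and a first row that reads like column labels, not numbers.
    """
    if len(rows) < min_rows:
        return False
    width = len(rows[0])
    if width < min_cols:
        return False
    # Layout tables often have ragged rows; data tables rarely do.
    if any(len(row) != width for row in rows):
        return False
    # Header cells should be non-empty text rather than bare numbers.
    for cell in rows[0]:
        text = cell.strip()
        if not text or text.replace(".", "", 1).isdigit():
            return False
    return True


# A table of U.S. presidents, like the one Halevy cited, passes the checks.
presidents = [
    ["Name", "Term"],
    ["George Washington", "1789-1797"],
    ["John Adams", "1797-1801"],
]
# A ragged table used for page layout does not.
layout = [["", ""], ["nav"], ["footer", ""]]
```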
Can Google Dig into the Deep Web?
The question that remains is whether or not Google's current search engine
technology is going to be adept at doing all the different types of
Deep Web indexing or if they will need to come up with something new.
As of now, Google uses the Bigtable database and the MapReduce framework for everything search-related, notes Alex Esterkin, Chief Architect at Infobright, Inc.,
a company delivering open source data warehousing solutions. During the
talk, Halevy listed a number of analytical database application
challenges that Google is currently dealing with: schema auto-complete,
synonym discovery, creating entity lists, association between instances
and aspects, and data-level synonym discovery. These challenges are
addressed by Infobright's technology, said Esterkin, but "Google will have to solve these problems the hard way."
Also mentioned during the speech was how Google plans to organize
"aspects" of search queries. The company wants to be able to separate
exploratory queries (e.g., "Vietnam travel") from ones where a user is
in search of a particular fact ("Vietnam population"). The former query
should deliver information about visa requirements, weather and tour
packages, etc. In a way, this is like what the search service offered
by Kosmix is doing. But Google
wants to go further, said Halevy. "Kosmix will give you an 'aspect,'
but it's attached to an information source. In our case, all the
aspects might be just Web search results, but we'd organize them
Yahoo Working on Similar Structured Data Retrieval
The challenges facing Google today are also being addressed by their nearest competitor in search, Yahoo. In December, Yahoo announced that they were taking their SearchMonkey technology in-house
to automate the extraction of structured information from large classes
of web sites. The results of that in-house extraction technique will
allow Yahoo to augment their Yahoo Search results with key information
returned alongside the URLs.
In this aspect of web search, it's clear that no single company has
yet come to dominate. However, even if a non-Google company surges ahead, it
may not be enough to get people to switch engines. Today, "Google" has
become synonymous with web search, just like "Kleenex" is a tissue,
"Band-Aid" is an adhesive bandage, and "Xerox" is a way to make
photocopies. Once that psychological mark has been made on our
collective psyche and the habit formed, people tend to stick with what
they know, regardless of who does it better. That's a
bit troublesome - if better search technology for indexing the Deep Web
comes into existence outside of Google, the world may not end up using
it until such point as Google either duplicates or acquires the invention.
Still, it's far too soon to write Google off. They clearly have
a lead when it comes to search and that came from hard work, incredibly
smart people, and innovative technical achievements. No doubt they can
figure out this Deep Web thing, too. (We hope.)