There was a short article in the Sunday NYT, Exploring a ‘Deep Web’ That Google Can’t Grasp by Alex Wright, drawing attention to the problem of searching the Deep Web. It is excerpted here:
“…To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases…
That approach may sound straightforward in theory, but in practice the vast variety of database structures and possible search terms poses a thorny computational challenge.
‘This is the most interesting data integration problem imaginable,’ says Alon Halevy, a former computer science professor at the University of Washington who is now leading a team at Google that is trying to solve the Deep Web conundrum.
Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters…
In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.
‘The naïve way would be to query all the words in the dictionary,’ Ms. Freire said. Instead, DeepPeep starts by posing a small number of sample queries, ‘so we can then use that to build up our understanding of the databases and choose which words to search.’
Based on that analysis, the program then fires off automated search terms in an effort to dislodge as much data as possible. Ms. Freire claims that her approach retrieves better than 90 percent of the content stored in any given database. Ms. Freire’s work has recently attracted overtures from one of the major search engine companies…”
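The probing strategy Freire describes — start from a handful of seed queries, mine the returned records for promising new keywords, and keep querying until little new content comes back — is easy to picture as a loop. The sketch below is purely illustrative and assumes a hypothetical `search(term)` function standing in for a submission to a site's search form; it is not DeepPeep's actual implementation.

```python
# Illustrative sketch of adaptive database probing (not DeepPeep's real code).
# `search(term)` is a hypothetical callable that submits one term to a hidden
# database's search form and returns the text of the matching records.
from collections import Counter

SEED_TERMS = ["data", "report", "2009"]   # small set of sample queries
MAX_ROUNDS = 5                            # how many probing iterations to run
TERMS_PER_ROUND = 3                       # new keywords to try each round


def probe_database(search, seeds=SEED_TERMS):
    """Retrieve records by iteratively choosing query terms from results seen so far."""
    seen_records = set()
    word_counts = Counter()
    tried = set()
    frontier = list(seeds)

    for _ in range(MAX_ROUNDS):
        for term in frontier:
            tried.add(term)
            for record in search(term):            # one probe of the database
                if record not in seen_records:
                    seen_records.add(record)
                    word_counts.update(record.lower().split())

        # Choose the most frequent not-yet-tried words as the next probes;
        # common words are more likely to match many remaining records.
        frontier = [w for w, _ in word_counts.most_common()
                    if w not in tried][:TERMS_PER_ROUND]
        if not frontier:                           # nothing new to ask about
            break

    return seen_records
```

To see the idea in action, `search` can be faked with a small in-memory collection, e.g. `lambda t: [r for r in records if t in r.lower()]` over a list of strings; the loop then pulls in records that the seed terms alone would never have matched.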
See what the DeepPeep search interface looks like now here: