Drowning in the Deep Web

February 25, 2009 · 2 Comments

The concept of the semantic web is not terribly new. As humans we provide direction to the search engine to retrieve what we’re seeking. The semantic web would go beyond the direction we give it to form its own conclusions and deduce what we want. According to Wikipedia:

semantic web is a vision of information that is understandable by computers, so that they can perform more of the tedious work involved in finding, sharing, and combining information on the web.

It’s difficult for search engines to process data cognitively – they’re only machines after all. The deductive reasoning inherent in the semantic web is a long way off. Or so I thought until I read this article.

The challenges that the major search engines face in penetrating this so-called Deep Web go a long way toward explaining why they still can’t provide satisfying answers to questions like “What’s the best fare from New York to London next Thursday?” The answers are readily available — if only the search engines knew how to find them.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.
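To make the crawler limitation concrete, here is a minimal sketch of link-following over a toy, in-memory "web". Everything in it (the `TOY_WEB` pages, the `crawl` function, the URLs) is invented for illustration and is not how any real search engine is implemented. It only shows why a page that appears solely in response to a typed query is never reached by following links.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the hyperlinks it contains.
# The flight results page exists, but no page links to it; it would only
# be generated in response to a query typed into the search form.
TOY_WEB = {
    "/home": ["/about", "/search-form"],
    "/about": ["/home"],
    "/search-form": [],
    "/flights?from=NYC&to=LON": [],
}


def crawl(start, web):
    """Breadth-first crawl: collect every page reachable by links from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


surface = crawl("/home", TOY_WEB)
deep = set(TOY_WEB) - surface  # pages the crawler never discovered
```

In this toy setup the crawler finds the home, about, and search-form pages, while the flight results page ends up in `deep`: it is part of the site but invisible to link-following alone.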

This type of Deep Web searching is not only good for consumers wanting to book a flight; it would also allow businesses to cross-reference their data with research and news, returning results that reflect the political and social landscape.

Who is pushing the boundary on this? I bet you can guess.

Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters. For example, if the search engine finds a page with a form related to fine art, it starts guessing likely search terms — “Rembrandt,” “Picasso,” “Vermeer” and so on — until one of those terms returns a match. The search engine then analyzes the results and develops a predictive model of what the database contains.
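The guessing game described in that passage can be sketched roughly as follows. This is only an illustration under loud assumptions: `HIDDEN_RECORDS`, `submit_form`, and `probe` are hypothetical names, the "database" is a short in-memory list, and the "predictive model" is reduced to a simple count of how many results each guessed term surfaced. It is not Google's actual method.

```python
from collections import Counter

# Hypothetical stand-in for a fine-art database hidden behind a search form.
HIDDEN_RECORDS = [
    "Rembrandt - The Night Watch",
    "Vermeer - The Milkmaid",
    "Vermeer - Girl with a Pearl Earring",
]


def submit_form(term):
    """Simulate typing a term into the form; return the matching records."""
    return [r for r in HIDDEN_RECORDS if term.lower() in r.lower()]


def probe(candidate_terms, submit):
    """Try each guessed term and record what it surfaces.

    Returns the set of records discovered and a Counter mapping each
    productive term to the number of results it returned.
    """
    discovered = set()
    hits = Counter()
    for term in candidate_terms:
        results = submit(term)
        if results:  # only terms that matched something are kept
            hits[term] = len(results)
            discovered.update(results)
    return discovered, hits


discovered, hits = probe(["Rembrandt", "Picasso", "Vermeer"], submit_form)
```

Here "Picasso" returns nothing and is dropped, while "Rembrandt" and "Vermeer" together reveal all three hidden records. A real system would presumably use the productive terms to choose its next guesses, which this sketch does not attempt.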

In a similar vein, Prof. Juliana Freire at the University of Utah is working on an ambitious project called DeepPeep (www.deeppeep.org) that eventually aims to crawl and index every database on the public Web. Extracting the contents of so many far-flung data sets requires a sophisticated kind of computational guessing game.

I’m excited to start booking my travel.

Categories: Social Web

2 responses so far ↓

  • Matthew Theobald // March 16, 2009 at 1:09 am

    You may find ISEN a “vague” but interesting approach to the deep web issue. 12 years in the works, it probably won’t be ready for public consumption until 2010. Until then you can watch a few younger projects attempt it. But ISEN holds a patent pending that if awarded goes back to 2004. Check out the information on isen.org and blog.isen.org for more info on this project.

    All the Best,

    Matt

  • Chuck Rockey // June 11, 2009 at 10:56 pm

    Glad this is the year you’re getting on board, Torn!!

    If you want to dive a little deeper into the subject, this is a long but easy-to-follow video from this year’s TED. There are pithier ones out there, but sometimes it’s good to get it from the horse’s mouth.

    http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html

    xoxo,

    Chuck
