Google Index Reaches 1 Trillion Unique URL Pages…07.25.08

25 07 2008

Although it reminds me of the old McDonald’s signs that kept a running count of hamburgers “served,” Google announced today:

“We’ve known it for a long time: the web is big. The first Google index in 1998 already had 26 million pages, and by 2000 the Google index reached the one billion mark. Over the last eight years, we’ve seen a lot of big numbers about how much content is really out there. Recently, even our search engineers stopped in awe about just how big the web is these days — when our systems that process links on the web to find new content hit a milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the web at once!

How do we find all those pages? We start at a set of well-connected initial pages and follow each of their links to new pages. Then we follow the links on those new pages to even more pages and so on, until we have a huge list of links. In fact, we found even more than 1 trillion individual links, but not all of them lead to unique web pages. Many pages have multiple URLs with exactly the same content or URLs that are auto-generated copies of each other. Even after removing those exact duplicates, we saw a trillion unique URLs, and the number of individual web pages out there is growing by several billion pages per day.

So how many unique pages does the web really contain? We don’t know; we don’t have time to look at them all! 🙂 Strictly speaking, the number of pages out there is infinite — for example, web calendars may have a “next day” link, and we could follow that link forever, each time finding a “new” page. We’re not doing that, obviously, since there would be little benefit to you. But this example shows that the size of the web really depends on your definition of what’s a useful page, and there is no exact answer.

We don’t index every one of those trillion pages — many of them are similar to each other, or represent auto-generated content similar to the calendar example that isn’t very useful to searchers. But we’re proud to have the most comprehensive index of any search engine, and our goal always has been to index all the world’s data.

To keep up with this volume of information, our systems have come a long way since the first set of web data Google processed to answer queries. Back then, we did everything in batches: one workstation could compute the PageRank graph on 26 million pages in a couple of hours, and that set of pages would be used as Google’s index for a fixed period of time. Today, Google downloads the web continuously, collecting updated page information and re-processing the entire web-link graph several times per day. This graph of one trillion URLs is similar to a map made up of one trillion intersections. So multiple times every day, we do the computational equivalent of fully exploring every intersection of every road in the United States. Except it’d be a map about 50,000 times as big as the U.S., with 50,000 times as many roads and intersections.

As you can see, our distributed infrastructure allows applications to efficiently traverse a link graph with many trillions of connections, or quickly sort petabytes of data, just to prepare to answer the most important question: your next Google search.”
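For anyone curious what "start at a set of well-connected initial pages and follow each of their links" looks like in miniature, here is a rough sketch in Python. It is only a toy under stated assumptions, not Google's actual crawler: the `TOY_WEB` dictionary, the `crawl` function, and the `example` URLs are all made up for illustration, and "exact duplicate" is approximated by hashing the page text.

```python
from collections import deque
import hashlib

# Hypothetical toy "web" (an assumption for illustration only):
# URL -> (page content, list of outgoing links).
TOY_WEB = {
    "a.example/": ("home", ["a.example/about", "a.example/index.html", "b.example/"]),
    "a.example/index.html": ("home", ["a.example/about"]),  # same content as "a.example/"
    "a.example/about": ("about us", ["a.example/"]),
    "b.example/": ("other site", ["b.example/missing"]),  # links to a page that does not exist
}


def crawl(seeds, web):
    """Breadth-first link following with exact-duplicate removal.

    Returns (all URLs discovered, URLs of pages whose content was new).
    """
    seen_urls = set(seeds)   # every link we have found so far
    seen_content = set()     # hashes of page contents already counted
    unique_pages = []        # first URL seen for each distinct content
    queue = deque(seeds)

    while queue:
        url = queue.popleft()
        page = web.get(url)
        if page is None:
            continue  # dead link: discovered, but nothing to fetch
        content, links = page

        # Two URLs with byte-identical content count as one unique page.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen_content:
            seen_content.add(digest)
            unique_pages.append(url)

        # Follow the links on this page to pages we have not seen yet.
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                queue.append(link)

    return seen_urls, unique_pages


urls, pages = crawl(["a.example/"], TOY_WEB)
print(len(urls), "links found;", len(pages), "unique pages")
```

In this toy web the crawl discovers five links but only three unique pages, which is the same gap the quote describes between "individual links" and "unique web pages," just a trillion times smaller.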

I would like to add an excerpt from a blog post from the Australian blog “Libraries Interact” entitled “Size of the Internet”:

“Asking how big the internet is, is a bit like asking how long is a piece of string. The answer is we really don’t know because it is unorganised, uncatalogued and continues to grow at a phenomenal rate.

However, two recent sources are having a guesstimate on where the internet is in terms of a global resource.

The first came from Internet World Stats which is “an International website featuring up to date world Internet Usage, Population Statistics and Internet Market Research Data, for over 233 individual countries and world regions.”

Their statistics put the number of worldwide internet users at 1.407 billion, up from 16 million in 1995.  How things have changed.  This is now 21.1% or more than 1 in every 5 people in the world who use the internet. They have an interesting table and graph, showing the growth over the last 13 years, which are well worth checking out.

These statistics are of course skewed by western nations’ use. Australia/Oceania, for example, has only 0.5% of the world’s population, but 1.4% of the world’s internet users. Of the total population in this corner of the world, 57% are internet users. If we broke that further down to Australia alone, that would be higher.

The other stat comes from the Google blog (thanks to Phil Bradley for the link). According to the Google post – We knew the web was big…., their “systems that process new links on the web to find new content hit a milestone: 1 trillion .. unique URLs on the web at once!” Wow, that’s 1,000,000,000,000 URLs. This does not include duplicated content or auto-generated copies, so it’s not as inflated as it may seem.

Very interesting too is that the first Google index in 1998 (yes, they are 10 years old this year) only had 26 million unique URLs. The post won’t guess at how many unique pages are on the web, although they suggest it could be infinite.

Big numbers, big things happening, just all the more reason for libraries to be there too.”
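As a quick sanity check on the Internet World Stats figures quoted above, the arithmetic can be run in a few lines of Python. This is only a back-of-the-envelope sketch: the inputs are the rounded numbers from the excerpt, the variable names are mine, and the world population is not stated in the excerpt but implied by dividing users by the 21.1% share.

```python
# Figures as quoted in the excerpt (all rounded in the source).
users_2008 = 1.407e9          # worldwide internet users
users_1995 = 16e6             # worldwide internet users in 1995
share_of_world = 0.211        # 21.1% of the world's population

# Assumption: world population is implied by users / share.
implied_world_population = users_2008 / share_of_world
growth_factor = users_2008 / users_1995

# Australia/Oceania: 0.5% of world population, 57% of whom are online.
oceania_pop_share = 0.005
oceania_penetration = 0.57
oceania_users = oceania_pop_share * implied_world_population * oceania_penetration
oceania_user_share = oceania_users / users_2008

print(f"Implied world population: {implied_world_population / 1e9:.2f} billion")
print(f"Growth since 1995: about {growth_factor:.0f}x")
print(f"Oceania share of world users: {oceania_user_share:.2%}")
```

The implied world population comes out a little under 6.7 billion, the user base has grown roughly 88-fold since 1995, and the Oceania share works out to about 1.35%, which agrees with the quoted 1.4% once you allow for rounding in the inputs.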



