Hurricane Sandy crashed Google’s party, forcing it to cancel its big Android event, but it looks like Sandy has caused more damage than that. Parts of Google’s indexing mechanism appear to have been affected by the storm. This is an assumption, but I do have some circumstantial evidence to back it up.
In the last 24 hours I have noticed that new pages are being crawled by Google but are nowhere to be seen in the index as quickly as they should be. So I assume that while the crawlers are doing their job, the index refresh is affected, and the only cause that comes to mind is Sandy. Bear in mind that Google Search is a huge distributed system whose infrastructure spans several continents, and Google has a team of people called Site Reliability Engineers (SREs) whose main focus is to keep Google Search and other services running (more on this here). However, for all the planning and technology that goes into data centers, they are still vulnerable to natural disasters to some degree.
Let’s take a quick look at a typical information retrieval system architecture. As you can see below, you have a crawler, a central repository, an indexer and then a ranking mechanism. In a distributed system any of these components could be affected, and because they are to some extent separate entities, one of them can have an outage without taking down the whole system. So in this case I think the indexer is the part that is affected.
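To make the point about separate components concrete, here is a minimal toy sketch in Python. It is not how Google's systems are built; the class name `ToyPipeline`, the `indexer_up` flag and the alphabetical "ranking" are all assumptions made up for illustration. It only shows that a crawler can keep filling the repository while a stalled indexer leaves new pages out of the search results.

```python
class ToyPipeline:
    """Toy crawler -> repository -> indexer -> ranking chain (illustrative only)."""

    def __init__(self):
        self.repository = {}    # url -> raw page text (the central repository)
        self.index = {}         # term -> set of urls (the searchable index)
        self.indexer_up = True  # hypothetical switch to simulate an indexer outage

    def crawl(self, url, text):
        # The crawler only writes to the repository; it does not touch the index.
        self.repository[url] = text

    def refresh_index(self):
        # The indexer reads the repository; if it is down, nothing new is indexed.
        if not self.indexer_up:
            return 0
        added = 0
        for url, text in self.repository.items():
            for term in text.lower().split():
                urls = self.index.setdefault(term, set())
                if url not in urls:
                    urls.add(url)
                    added += 1
        return added

    def search(self, term):
        # Placeholder "ranking": alphabetical order of matching URLs.
        return sorted(self.index.get(term.lower(), set()))


pipeline = ToyPipeline()
pipeline.indexer_up = False                      # simulate the indexer outage
pipeline.crawl("/seo-nightmares", "SEO nightmares")
pipeline.refresh_index()
during_outage = pipeline.search("seo")           # crawled, but not searchable

pipeline.indexer_up = True                       # indexer recovers
pipeline.refresh_index()
after_recovery = pipeline.search("seo")          # now searchable
```

In this sketch the page sits in the repository the whole time, which matches the symptom described here: the crawl happens on schedule, but the page only becomes visible once the index refresh runs again.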
Now to show some “evidence”. My colleague Dom Calisto published a blog post titled “13 SEO nightmares that will keep you up at night!” on Oct 29, 2012 @ 11:16, see screenshot below:
A while ago I created SEO Crawlytics, a WordPress plugin that tracks robot visits very accurately. Using the plugin, we can see the exact timestamp of each robot visit. As you can see below, Dom’s post was crawled within 15 minutes.
Historically, new pages on our site get crawled within 15 minutes at most and are then usually indexed within 40 minutes. This is something that we constantly monitor, so we know the benchmarks very accurately. But this time around, it took Google around 48 hours to push the page into its index.
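The comparison against our benchmarks can be sketched in a few lines of Python. The publish time is the one quoted above; the crawl and index timestamps are illustrative stand-ins (14 minutes and 48 hours after publication), not values read from our logs, and measuring both latencies from the publish time is an assumption. The function name `latency_report` is hypothetical.

```python
from datetime import datetime, timedelta

# Benchmarks quoted in this post (assumed here to be measured from publication).
CRAWL_BENCHMARK = timedelta(minutes=15)
INDEX_BENCHMARK = timedelta(minutes=40)


def latency_report(published, crawled, indexed):
    """Compare crawl and index latency for one page against the benchmarks."""
    crawl_latency = crawled - published
    index_latency = indexed - published
    return {
        "crawl_latency": crawl_latency,
        "index_latency": index_latency,
        "crawl_ok": crawl_latency <= CRAWL_BENCHMARK,
        "index_ok": index_latency <= INDEX_BENCHMARK,
    }


published = datetime(2012, 10, 29, 11, 16)   # publish time from the post
crawled = published + timedelta(minutes=14)  # illustrative: "within 15 minutes"
indexed = published + timedelta(hours=48)    # illustrative: "around 48 hours"

report = latency_report(published, crawled, indexed)
for key, value in report.items():
    print(f"{key}: {value}")
```

With these stand-in values the crawl latency is inside the benchmark while the index latency is far outside it, which is the pattern that made me suspect the index refresh rather than the crawler.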
I understand this is not conclusive data, but I have noticed the same thing on a few of our sites, and the only explanation I can think of is Sandy. So is Google’s infrastructure affected by it?