Peeking Into Google – The Same Approach for Monte Carlo Simulations?
The InternetNews.com website has an excellent story by Susan Kuchinskas entitled Peeking into Google, which explains that the key to the speed and reliability of Google search is cutting the data up into chunks. Urs Hoelzle, Google's vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture at EclipseCon 2005, a conference on the open-source, extensible platform for software tools. The article continues:
To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.
All machines run on a stripped-down Linux kernel. The distribution is Red Hat, but Hoelzle said Google doesn't use much of the distro. Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.
"The downside to cheap machines is, you have to make them work together reliably," Hoelzle said. "These things are cheap and easy to put together. The problem is, these things break."
In fact, at Google, many will fail every day. So, Google has automated methods of dealing with machine failures, allowing it to build a fast, highly reliable service with cheap hardware.
Google replicates the Web pages it caches by splitting them up into pieces it calls "shards." The shards are small enough that several can fit on one machine. And they're replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set also is replicated several times. The engineers call these "chunk servers."
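The shard-and-replicate scheme Hoelzle describes can be sketched in a few lines of Python. The shard count, replica count, machine names and hash choice below are illustrative assumptions for the sketch, not Google's actual figures:

```python
import hashlib

N_SHARDS = 8        # shards per index (illustrative; real numbers are not public)
REPLICAS = 3        # copies of each shard, each on a distinct machine
MACHINES = [f"chunk-server-{i}" for i in range(12)]

def shard_of(doc_id: str) -> int:
    """Map a document to a shard by hashing its id."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % N_SHARDS

def replicas_for(shard: int) -> list[str]:
    """Place each shard on REPLICAS distinct machines (simple round-robin)."""
    return [MACHINES[(shard + r) % len(MACHINES)] for r in range(REPLICAS)]

def serve(doc_id: str, down: set[str]) -> str:
    """Answer from any live replica, so single-machine failures are invisible."""
    for machine in replicas_for(shard_of(doc_id)):
        if machine not in down:
            return machine
    raise RuntimeError("all replicas of this shard are down")

# Even with two of a shard's three machines down, a replica still answers:
shard = shard_of("page-42")
survivor = serve("page-42", down=set(replicas_for(shard)[:2]))
```

The point of the sketch is the one Hoelzle makes: because every shard lives on several machines, "these things break" stops being an outage and becomes a routing decision.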
As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box.
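The routing idea in that paragraph, sending each query to whichever full copy of the index is least busy, can be sketched as follows. The set names and the least-busy policy are hypothetical simplifications:

```python
# Hypothetical: three complete replicas of the index, each a set of servers.
replica_sets = ["index-set-A", "index-set-B", "index-set-C"]
busy = {name: 0 for name in replica_sets}   # outstanding queries per set

def route(query: str) -> str:
    """Send the query to the least-busy full replica of the index."""
    target = min(replica_sets, key=lambda s: busy[s])
    busy[target] += 1
    return target

first = route("monte carlo")
second = route("eclipsecon")   # routed to a different, idle set
```

This is why replication buys throughput as well as fault tolerance: while one set is occupied, the next query lands on an idle copy instead of queueing.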
In parallel, clusters of document servers contain copies of Web pages that Google has cached. Hoelzle said that the refresh rate is from one to seven days, with an average of two days. That's mostly dependent on the needs of the Web publishers.
The whole article is worth reading, especially for Enterprise Risk Management business people and technologists dealing with Monte Carlo simulations of credit, mortgage-prepayment and other optional cash-flow derivatives. Google makes a point of using many cheap servers, with plenty of redundancy, spread across many data centers. Toomre Capital Markets LLC suggests that the same approach can be used to build a grid of relatively cheap servers to compute the millions of Monte Carlo simulations inherent in a real-time Enterprise Risk Management view of the many risks that medium and large financial institutions bear each day. Please contact TCM if you would like to explore this concept further.