From Wired.com, by
About a decade ago, a handful of Google’s most talented engineers started building a system that seems to defy logic.
Called Spanner, it was the first global database, a way of storing information across millions of machines in dozens of data centers spanning multiple continents, and it now underpins everything from Gmail to AdWords, the company’s primary moneymaker. But it’s not just the size of this creation that boggles the mind. The real trick is that, even though Spanner stretches across the globe, it behaves as if it’s in one place.
Google can change company data in one part of this database—running an ad, say, or debiting an advertiser’s account—without contradicting changes made on the other side of the planet. What’s more, it can readily and reliably replicate data across multiple data centers in multiple parts of the world—and seamlessly retrieve these copies if any one data center goes down. For a truly global business like Google, such transcontinental consistency is enormously powerful.
Before Spanner, this didn’t seem possible. Machines couldn’t keep databases consistent without constant and heavy communication, and communication across the globe took much too long. You know, the speed of light and all that. Google’s engineers needed something like the ansible, a fictional device that first appeared in Ursula Le Guin’s 1966 novel Rocannon’s World and became a sci-fi trope. The ansible can instantly send information across any distance, defying both time and space. Spanner isn’t the ansible. It can’t shrink space. But it works because those engineers found a way to harness time.
No one else has ever built a system like this. No one else has taken hold of time in the same way. And now Google is offering this technology to the rest of the world as a cloud computing service.
Google believes this can provide some added leverage in its battle with Microsoft and Amazon for supremacy in the increasingly important cloud computing market, just because Spanner is unique. And some agree. “If they offer it, people will want it, and people will use it,” says Peter Bailis, an assistant professor of computer science at Stanford University who specializes in massively distributed software systems. But as others point out: Few businesses have the same needs as Google.
In the past, if you built a system that spanned hundreds of machines and multiple data centers, you followed an important rule: Don’t trust time. If a system involved communication between many machines in many different places, time would vary from machine to machine, just because time—precise time—is a hard thing to keep. Services like the Network Time Protocol aimed to provide machines with a common reference point. But this worked only so well, mainly because networks are slow. It takes time to send the time.
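To make "it takes time to send the time" concrete, here is a minimal sketch of the standard four-timestamp estimate that NTP-style protocols use. The function name and the sample numbers are illustrative assumptions, not taken from any real NTP implementation. The point is that a client can never know how the round trip split between the outbound and return legs, so its estimate of the clock offset can be wrong by up to half the round-trip delay.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate the server-minus-client clock offset from one exchange.

    t0: client clock when the request leaves
    t1: server clock when the request arrives
    t2: server clock when the reply leaves
    t3: client clock when the reply arrives
    All values share one unit (milliseconds here).
    Returns (offset_estimate, error_bound).
    """
    # The estimate assumes both network legs took equally long.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    # Time actually spent on the wire, excluding server processing.
    round_trip = (t3 - t0) - (t2 - t1)
    # If the legs were unequal, the estimate is off by at most half of that.
    return offset, round_trip / 2


# Hypothetical exchange: the server clock is really 5 ms ahead of the client,
# the request takes 40 ms, the server spends 1 ms, and the reply takes 10 ms.
offset, bound = estimate_offset(t0=0, t1=45, t2=46, t3=51)
print(f"estimated offset: {offset} ms, give or take {bound} ms")
```

In this made-up exchange the estimate comes out at 20 ms even though the true offset is 5 ms. The answer is still inside the error bound, but the bound is as wide as the network is slow.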
For Google, this was a problem. If a database spanned multiple regions, it couldn’t ensure that transactions in one part of the world lined up with transactions in another. It couldn’t get a truly global picture of its operations. It couldn’t seamlessly replicate data across regions or quickly retrieve replicated data when it was needed. So Google’s top engineers found a way to trust time.
A margin of error still exists, but thanks to so many readings, the masters can bootstrap a far more reliable timekeeping service. “This gives you faster-than-light coordination between two places,” says Peter Mattis, a former Google engineer who founded CockroachDB, a startup working to build an open source version of Spanner.
Google calls this timekeeping technology TrueTime, and only Google has it. Drawing on a celebrated research paper Google released in 2012, Mattis and CockroachDB have duplicated many other parts of Spanner—but not TrueTime. Google can pull this off only because of its massive global infrastructure.
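The article does not describe TrueTime's interface, so the following is only a rough sketch of the general idea that a known margin of error can be enough to trust time. It assumes each clock reading is reported as an interval guaranteed to contain the true time. The names `TimeInterval`, `now_interval` and `definitely_before` are hypothetical and are not Google's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    """A clock reading expressed as a range that contains the true time."""
    earliest: float
    latest: float


def now_interval(reading, max_error):
    """Wrap a raw clock reading in its assumed worst-case error."""
    return TimeInterval(reading - max_error, reading + max_error)


def definitely_before(a, b):
    """True only if a happened before b whatever the actual clock error was."""
    return a.latest < b.earliest


# Hypothetical readings in milliseconds, each with a 3 ms margin of error.
first = now_interval(100.0, 3.0)   # true time is somewhere in [97, 103]
second = now_interval(104.0, 3.0)  # [101, 107] overlaps first: order unknown
third = now_interval(110.0, 3.0)   # [107, 113] lies clear of first

print(definitely_before(first, second))
print(definitely_before(first, third))
```

Under these assumptions, two machines on opposite sides of the world can agree on the order of two events without talking to each other, as long as the events are further apart than the combined margin of error. The smaller that margin, the less often the order is left in doubt.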
A Changing World
To be sure, a few others could build a similar service, namely Amazon and Microsoft. But they haven’t yet. With help from TrueTime, Spanner has provided Google with a competitive advantage in so many different markets. It underpins not only AdWords and Gmail but more than 2,000 other Google services, including Google Photos and the Google Play store. Google gained the ability to juggle online transactions at an unprecedented scale, and thanks to Spanner’s extreme form of data replication, it was able to keep its services up and running with unprecedented consistency.
Now Google wants a different kind of competitive advantage in the cloud computing market. It hopes to convince customers that Spanner provides an easier way of running a global business, an easier way of replicating their data across multiple regions and, thus, guarding against outages. The rub is that few businesses are truly global. But Google is betting its new service will give customers the freedom to expand as time goes on. Among them is JDA, a company that helps businesses oversee their supply chains, which is now testing Spanner. “The volume of data—and velocity with which that data is coming at us—is amplifying significantly,” says JDA group vice president John Sarvari.
Spanner could also be useful in the financial markets, allowing big banks to more efficiently track and synchronize trades happening across the planet. And Google says it’s already in talks with large financial institutions about this kind of thing. Traditionally, many banks were wary of handling trades in the cloud for reasons of security and privacy. But those attitudes are softening. A few years ago, Spanner was something only Google needed. Now, Google is banking on change.