I wrote a quick overview of how I approach the scaling-up of FieldMotion, over at the FieldMotion blog.
Basically, we started off with a huge monolithic block of code and data, then looked carefully at it to see how we could tear it apart into logical chunks.
The simplest place to start was the data, so we split the database out into a MySQL master/slave cluster and a MongoDB sharded cluster with replica sets.
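One practical consequence of a master/slave split is that application code has to route writes to the master and reads to the slaves. The post doesn't describe FieldMotion's actual routing layer, so this is only a minimal sketch of the idea; the hostnames are illustrative, not real endpoints.

```python
import random

# Hypothetical endpoints -- these hostnames are illustrative only,
# not FieldMotion's actual topology.
MASTER = "db-master.internal:3306"
SLAVES = ["db-slave-1.internal:3306", "db-slave-2.internal:3306"]

def pick_endpoint(sql: str) -> str:
    """Route writes to the master; spread reads across the slaves."""
    is_read = sql.lstrip().lower().startswith(("select", "show"))
    return random.choice(SLAVES) if is_read else MASTER
```

A real setup also has to deal with replication lag (a read issued right after a write may hit a slave that hasn't caught up yet), which is why read-your-own-writes queries are often pinned to the master.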
Afterwards, we looked at the actual services the server performed and began separating them out, so that each could run completely independently of the main block.
At the moment, FieldMotion runs across more than 60 servers and can tolerate a sudden catastrophe in pretty much any part of the system. We recently had an incident where an entire datacentre went offline for about 8 hours, but no one noticed, because we make sure each member of a replica set is in a different datacentre.
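The reason the outage went unnoticed is the placement rule: no two replicas of the same data share a datacentre, so losing one datacentre loses at most one copy. The post doesn't say how placements are computed, but a simple round-robin assignment gives that property whenever there are at least as many datacentres as replicas; the datacentre names below are made up.

```python
from itertools import cycle

def place_replicas(replica_count: int, datacentres: list[str]) -> list[str]:
    """Assign replicas to datacentres round-robin.

    When len(datacentres) >= replica_count, every replica of the set
    lands in a distinct datacentre, so a single-datacentre outage
    can take out at most one copy of the data.
    """
    dc = cycle(datacentres)
    return [next(dc) for _ in range(replica_count)]

# Three replicas spread over three (hypothetical) datacentres:
placement = place_replicas(3, ["dc-a", "dc-b", "dc-c"])
```

With that placement, failover is just a matter of the surviving replicas electing a new master while the lost datacentre is rebuilt.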