Like most web applications that don't have millions of users, our simple server architecture sufficed for our company's first couple of years: two front-end servers (EC2 instances), one load balancer, and one database server (RDS instance). Apart from the occasional press hit, when we had to fire up five or six additional front-end servers, this setup served us well. Up until recently.
Having to re-scale your architecture is a good problem to have. It's a problem you only encounter when your website has to accommodate more visitors. At first, we optimized our inefficient and slow queries, but that can only do so much. Next, we addressed the additional traffic by upgrading our servers to have more resources, but eventually we were on Amazon's biggest DB instance, and during high-traffic loads the database server's CPU would go north of 90%. We knew we needed multiple databases to scale.
I suppose we could have entertained the idea of sharding, but we already had a read replica up and running for our business intelligence tool, so the master/slave replication approach was the natural fit.
Replication on AWS RDS
Since we're using MySQL, we had to build for asynchronous replication. The way asynchronous replication works in a master/slave configuration is that a DB write is committed to the master database and then asynchronously propagated out to the replica (slave) databases, so the replicas trail the master slightly. According to Amazon, the optimal replication configuration is for the replica instances to have computing power equal to or greater than the master instance's. By optimal, I mean the setup that creates the least amount of lag (the amount of time the replica database is behind the master database). So for us, we're using 1 master and 2 replicas that are all the same RDS DB instance size (m2.2xlarge).
Replica lag is the enemy. With lag, it's possible that you'll write to the database (writes always go to the master) and when you fetch that data it might not be there yet, since reads go to the replicas, which, in the event of lag, might not have been updated yet at the time of your fetch. For example, your checkout flow could create an order in orders#create and then, in orders#show, that order doesn't exist yet according to your replica. It would be an awful experience for the customer. I'll touch more on how to handle this later.
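To make that failure mode concrete, here's a toy sketch in plain Ruby (no database involved, and not Octopus code): writes land on a "master" immediately, while a "replica" only sees them after a replication tick, so a read issued between the write and the tick comes back empty.

```ruby
# Toy model of asynchronous master/slave replication.
class ToyReplicaSet
  def initialize
    @master  = {}
    @replica = {}
    @pending = []
  end

  def write(key, value)   # all writes go to the master
    @master[key] = value
    @pending << key
  end

  def read(key)           # all reads go to the replica
    @replica[key]
  end

  def replicate!          # apply pending writes to the replica (the "tick")
    @pending.each { |k| @replica[k] = @master[k] }
    @pending.clear
  end
end

db = ToyReplicaSet.new
db.write(:order_1, "created")
db.read(:order_1)   # => nil — the replica hasn't caught up yet
db.replicate!
db.read(:order_1)   # => "created"
```

The gap between the write and `replicate!` is exactly the window where a real app sees stale reads.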
Rails Gems for Replication
Understanding your architecture is essential for finding the right gem for your Rails app, which is why I wanted to give a brief background of the architecture at hand before I jumped into the web application layer.
I did a decent amount of research before deciding which gem to use to help us manage our master/slave configuration. I checked out Masochism, DbCharmer, multi_db, Data Fabric, and Octopus. My CTO taught me a great lesson years ago about gem selection criteria: look at the gem's recent commit activity. If a gem hasn't seen a commit in months or years, it's probably not being maintained, will most likely not work with your Rails version (we run 3.2.11), and will probably not be updated for future Rails versions. Octopus and DbCharmer were the only two repositories with recent activity.
At thredUP, we have multiple Rails applications. One of them, our 'Ops' Rails app, has to communicate with two databases: its own DB and the web app DB. To do this, it uses DbCharmer to switch DB connections for a couple of its models. Because of this, I was inclined to give DbCharmer a chance first, as it would be nice for both of our Rails apps to use the same gem for handling DB connections.
After installing the gem, I went through our Ops app to see how it was using DbCharmer, and then I read through DbCharmer's documentation. DbCharmer has plenty of documentation, but its section on "Using Models in Master-Slave Environments" was surprisingly ambiguous: there was no master/slave database.yml example in the docs, and both of the repository's test projects covered sharding and custom slave reads, while I was looking for a standard master/slave replication example.
I decided to plow ahead anyway, so I updated my database.yml to include a nested set of database configurations (just like their sharding example) under one of our staging server environments. As soon as I tried to initialize that environment, I received an error about my database adapter settings, and after ten minutes of Googling that error and finding nothing, I decided it was time to give Octopus a try.
Setting up Octopus was incredibly easy. First, add the gem to your Gemfile:
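The original snippet isn't reproduced here; assuming the gem name from the Octopus project (it's published as ar-octopus), the Gemfile entry would look like:

```ruby
# Gemfile
gem 'ar-octopus'
```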
Surprisingly, there is only one step left after this to get Octopus up and running: create a config/shards.yml to let Octopus know which Rails environments to set up master/slave replication for. Here is an example of ours:
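Our actual file isn't reproduced here, but a representative shards.yml following Octopus's replicated setup would look something like this (hostnames, database names, and credentials are all placeholders):

```yaml
octopus:
  replicated: true
  environments:
    - staging
    - production
  staging:
    slave1:
      adapter: mysql2
      host: staging-replica.example.com
      database: app_staging
      username: app
      password: secret
  production:
    slave1:
      adapter: mysql2
      host: replica-1.example.com
      database: app_production
      username: app
      password: secret
    slave2:
      adapter: mysql2
      host: replica-2.example.com
      database: app_production
      username: app
      password: secret
```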
In this example, we have master/slave replication configured for our staging and production environments. Octopus determines what to use for your master database from the settings already defined in your config/database.yml file. Once you deploy, all writes will automatically go to your master and all reads to the replicas listed in shards.yml. It's almost scary how easy it is to set this up.
Dealing with Replica Lag
Octopus gives you two ways to deal with replica lag when the data you need to fetch has to be 100% up-to-date. The first is a method you can use on your ActiveRecord model queries.
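The original snippet isn't shown above; a representative query (the Address model is borrowed from the checkout example later in this post) would look like:

```ruby
# Force this one query to run against the master database
Address.using(:master).find(params[:id])
```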
The .using method above allows you to choose where the query should be made. In the example above, I'm specifying 'master' as the database I want this query to use; however, you can specify any database name, such as 'slave1'. Unfortunately, this syntax cannot be used on ActiveRecord associations (something I wish the Octopus team would add support for). Instead, you have to do something like this:
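The original example isn't shown; with Octopus, the block form covers association reads (the user record and its addresses association are assumed here for illustration):

```ruby
# Every query inside the block, including association queries, hits the master
Octopus.using(:master) do
  user.addresses.to_a
end
```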
If you have a big app, it's not going to be easy to know where you'll need these .using(:master) calls, but if you see an ActiveRecord::RecordNotFound exception raised, that's a good starting point.
When I was working on implementing Octopus, my CTO brought up a really good point: in the ideal coding world, you shouldn't need to read directly from master.
Calling .using(:master) is typically done when you recently wrote to the DB and want to make sure you have the most up-to-date data. However, every time you write to the DB (either through updating or inserting), you already have access to the most up-to-date data at that moment: an insert returns the full record, and an update's attributes are already in memory. So in the ideal world, you would cache that object right there and avoid the unnecessary master read later.
A good example would be a user creating a new address for their account. When a user goes to create their shipping address (on addresses/new), they are redirected to addresses/:id/show after a successful address creation. Addresses#show will try to find that new address by id (e.g. Address.find(params[:id])), and if the replica databases are lagging by a significant amount, addresses#show will error. The obvious fix is to do Address.using(:master).find(params[:id]), but what if you cached the address in addresses#create (since ActiveRecord's create returns the full object) and then read from the cache in addresses#show? Doing this would avoid a read against the master (and the replica), and you wouldn't have to worry about the replica lag at all.
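As a sketch of that pattern (the controller shape and cache keys are invented for illustration, and Rails.cache is assumed to be configured), the two actions could look like:

```ruby
class AddressesController < ApplicationController
  def create
    address = Address.create!(params[:address])
    # Cache the freshly created record; it's the freshest copy anywhere
    Rails.cache.write("address/#{address.id}", address, expires_in: 5.minutes)
    redirect_to address_path(address)
  end

  def show
    # Serve from cache when possible; fall back to a master read to dodge lag
    @address = Rails.cache.fetch("address/#{params[:id]}") do
      Address.using(:master).find(params[:id])
    end
  end
end
```

The fallback inside the fetch block means a cache miss still returns correct data; the cache just spares the master most of those reads.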
If you want to do master/slave replication in Rails, Octopus is your buddy, but make sure you understand replica lag and the potential ramifications it will have for your app before you take the plunge into master/slave replication.