A Year with MongoDB

This week marks the one year anniversary of Kiip running MongoDB in production. As of this week, we’ve also moved over 95% of our data off of MongoDB onto systems such as Riak and PostgreSQL, depending which solution made sense for the way we use our data. This post highlights our experience with MongoDB over the past year. A future post will elaborate on the migration process: how we evaluated the proper solutions to migrate to and how we migrated the data from MongoDB.

First, some numbers about our data to give context to the scale being discussed. The figures below represent the peak usage when we were completely on MongoDB — the numbers are actually much higher now but are spread across different data stores.

  • Data size: 240 GB
  • Total documents: 85,000,000
  • Operations per second: 520 (Create, reads, updates, etc.)

The Good

We were initially attracted to MongoDB due to the features highlighted on the website as well as word of mouth from those who had used it successfully. MongoDB delivered on some of its promises, and our early experiences were positive.

  • Schemaless - Being a document data store, the schemaless-nature of MongoDB helps a lot. It is easy to add new fields, and even completely change the structure of a model. We changed the structure of our heaviest used models a couple times in the past year, and instead of going back and updating millions of old documents, we simply added a “version” field to the document and the application handled the logic of reading both the old and new version. This flexibility was useful for both application developers and operations engineers.

  • Simple replication - Replica Sets are easy to setup and work well enough. There are some issues that I’ll talk about later, but for the most part as an early stage startup, this feature was easy to incorporate and appeared to work as advertised.

  • Query Language - Querying into documents and being able to perform atomic operations on your data is pretty cool. Both of these features were used heavily. Unfortunately, these queries didn’t scale due to underlying architectural problems. Early on we were able to use advanced queries to build features quickly into our application.

  • Full-featured Drivers for Many Languages - 10gen curates official MongoDB drivers for many languages, and in our experience the driver for each language we’ve tried has been top-notch. Drivers were never an issue when working with MongoDB.

The Bad

Although MongoDB has a lot of nice features on the surface, most of them are marred by underlying architectural issues. These issues are certainly fixable, but currently limit the practical usage we were able to achieve with MongoDB. This list highlights some of the major issues we ran into.

  • Non-counting B-Trees - MongoDB uses non-counting B-trees as the underlying data structure to index data. This impacts a lot of what you’re able to do with MongoDB. It means that a simple count of a collection on an indexed field requires Mongo to traverse the entire matching subset of the B-tree. To support limit/offset queries, MongoDB needs to traverse the leaves of the B-tree to that point. This unnecessary traversal causes data you don’t need to be faulted into memory, potentially purging out warm or hot data, hurting your overall throughput. There has been an open ticket for this issue since September, 2010.

  • Poor Memory Management - MongoDB manages memory by memory mapping your entire data set, leaving page cache management and faulting up to the kernel. A more intelligent scheme would be able to do things like fault in your indexes before use as well as handle faulting in of cold/hot data more effectively. The result is that memory usage can’t be effectively reasoned about, and performance is non-optimal.

  • Uncompressed field names - If you store 1,000 documents with the key “foo”, then “foo” is stored 1,000 times in your data set. Although MongoDB supports any arbitrary document, in practice most of your field names are similar. It is considered good practice to shorten field names for space optimization. A ticket for this issue has been open since April 2010, yet this problem still exists today. At Kiip, we built field aliasing into our model layer, so a field with name “username” may actually map to “u” in the database. The database should handle this transparently by keeping a logical mapping between field names and a compressed form, instead of requiring clients to handle it explicitly.

  • Global write lock - MongoDB (as of the current version at the time of writing: 2.0), has a process-wide write lock. Conceptually this makes no sense. A write on collection X blocks a write on collection Y, despite MongoDB having no concept of transactions or join semantics. We reached practical limitations of MongoDB when pushing a mere 200 updates per second to a single server. At this point, all other operations including reads are blocked because of the write lock. When reaching out to 10gen for assistance, they recommended we look into sharding, since that is their general scaling solution. With other RDBMS solutions, we would at least be able to continue vertically scaling for some time before investigating sharding as a solution.

  • Safe off by default - This is a crazy default, although useful for benchmarks. As a general analogy: it’s like a car manufacturer shipping a car with air bags off, then shrugging and saying “you could’ve turned it on” when something goes wrong. We lost a sizable amount of data at Kiip for some time before realizing what was happening and using safe saves where they made sense (user accounts, billing, etc.).

  • Offline table compaction - The on-disk data size with MongoDB grows unbounded until you compact the database. Compaction is extremely time consuming and blocks all other DB operations, so it must be done offline or on a secondary/slave server. Traditional RDBMS systems such as PostgreSQL have handled this with auto-vacuums that clean up the database over time.

  • Secondaries do not keep hot data in RAM - The primary doesn’t relay queries to secondary servers, preventing secondaries from maintaining hot data in memory. This severely hinders the “hot-standby” feature of replica sets, since the moment the primary fails and switches to a secondary, all the hot data must be once again faulted into memory. Faulting in gigabytes of data can be painfully slow, especially when your data is backed by something like EBS. Distributing reads to secondaries helps with this, but if you’re only using secondaries as a means of backup or failover, the effect on throughput when a primary switch happens can be crippling until your hot data is faulted in.

What We’re Doing Now

Initially, we felt MongoDB gave us the flexibility and power we needed in a database. Unfortunately, underlying architectural issues forced us to investigate other solutions rather quickly. We never attempted to horizontally scale MongoDB since our confidence in the product was hurt by the time that was offered as a solution, and because we believe horizontally scaling shouldn’t be necessary for the relatively small amount of ops per second we were sending to MongoDB.

Over the past 6 months, we’ve “scaled” MongoDB by moving data off of it. This process is an entire blog post itself, but the gist of the matter is that we looked at our data access patterns and chose the right tool for the job. For key-value data, we switched to Riak, which provides predictable read/write latencies and is completely horizontally scalable. For smaller sets of relational data where we wanted a rich query layer, we moved to PostgreSQL. A small fraction of our data has been moved to non-durable purely in-memory solutions if it wasn’t important for us to persist or be able to query later.

In retrospect, MongoDB was not the right solution for Kiip. Although it may be a bit more upfront effort, we recommend using PostgreSQL (or some traditional RDBMS) first, then investigating other solutions if and when you find them necessary. In future blog posts, we’ll talk about how we chose our data stores and the steps we took to migrate data while minimizing downtime.

EC2 to VPC: Executing a Zero-Downtime Migration

A couple weeks ago, Kiip completed a transition from EC2 to VPC. In a previous post, I talked about the benefits and basic terminology of VPC. In this post, I’ll cover the planning that went into it, some software we’re using with our VPC, and executing our planned migration with zero downtime.

Transitioning from EC2 to VPC is not like transitioning from one set of EC2 machines to another. It’s more like transitioning from one hosting provider to a completely new hosting provider. The main reason for this is because nodes outside the VPC cannot talk to nodes inside the VPC (with exceptions, of course). Therefore, you need to almost bring up an entirely new cluster alongside your existing cluster, and make a big switch to the new machines while making sure everything continues working smoothly. All the while, you need to make sure you have no data loss during this process.

Determine your VPC Layout

Before moving, it is important to decide on the layout of your private network. Amazon does a good job documenting standard VPC scenarios. At Kiip, we decided to go with a standard layout: VPC with a single public and a single private subnet. This gives us around 250 IPs per subnet (Amazon reserves a few for themselves), far more than enough for our infrastructure.

At this point, I recommend spending a day or two creating VPCs and building some scripts around launching nodes into a VPC. This will get you comfortable with the new layout as well as allows you to pre-plan all of your network ACLs, security groups, routing tables, etc. It is important to be very comfortable with VPC when you’re reading to execute a move.

Our exact setup is the following, using example CIDR blocks:

  • VPC: 10.101.0.0/16
  • Two subnets: Public (10.101.0.0/24), Private (10.101.1.0/24)
  • Two routing tables: One that is able to route through an internet gateway, and one that only talks to the internet through a NAT device for the private subnet.
  • Allow-all network ACLs and security groups to begin. We use iptables on our external nodes to whitelist traffic and we decided this was enough to begin.

Create Migration Plans for Nodes

At this point we went through each of our nodes and separated the group of “stateful” nodes with “non-stateful” nodes. Nodes without state include app servers and load balancers. Non-stateful nodes require no real migration planning since they can be created in parallel without fear of data loss.

Left with only stateful nodes, the real planning begins. The following is the list of some of our stateful nodes and how we decided to handle the migration for each. Note I won’t go over each node, since some are very similar to others. Instead, I’ll just highlight a few nodes to cover different cases.

  • Memcached - We went over the cost/benefit and decided that degraded response times while our caches rewarmed was a decent tradeoff. Our caches are able to warm back to 90% of their hit rate within 5 minutes. For those 5 minutes, our response times would go from around 20ms up to 200ms, an order of magnitude. But in terms of migration ease and time, we decided this would be tolerable.
  • Graphite - For Graphite, we would restore from a recent EBS snapshot in the VPC. Bringing up a new node with an EBS snapshot attached takes around 5 minutes. We would lose 5 minutes of server statistical data, but this wouldn’t cause any downtime externally. Tolerable.
  • RabbitMQ - We decided to bring up a brand new MQ, point the new servers at this MQ to fill up a queue of jobs while the old MQ drained, and only when the old MQ drained we would take it down and start draining the new MQ. This would cause some level of backlog in our MQ, but again would cause no external slowdown or downtime except some analytical data for our advertiser dashboard would be delayed by a few minutes. This is perfectly okay since we make no hard real time guarantees about our analytics.

MongoDB was the hardest migration to plan. Our instance constantly sees around 2000 updates per second, and any downtime would cause externally visible errors and inconsistencies that are unacceptable. The migration had to be instant. Therefore our plan was to bring up a new DB instance in the VPC, set it as a slave to the primary outside of the VPC, bring it up to date, then promote the slave in the VPC to primary. On a high level, this plan works. The devil in this case is in the details:

  • All nodes in a MongoDB replica set need to be able to communicate with each other. Unfortunately, nodes outside a VPC cannot easily talk to nodes within a VPC (since they can’t even be addressed!).
  • If we don’t have at least 3 nodes in our replica set, then we’re not safe against node failures. In the midst of a complicated migration, we wanted to minimize the amount that could go wrong as much as possible, so even having a few minutes where we were vulnerable to node failure and data loss was unacceptable.
  • In order to switch the primary node over to a VPC node, all the nodes that communicate with MongoDB must be in the VPC!

Based on these requirements, the MongoDB migration became tricky, but doable:

  1. Prerequisite: All nodes that must communicate with the DB must already be in the VPC.
  2. Bring up 3 nodes in the public subnet of the VPC, make them slaves, bring them up to date.
  3. Promote a public VPC node to primary and take down the EC2 nodes.
  4. Bring up 3 nodes in the private subnet of the VPC, make them slaves, bring them up to date.
  5. Promote a private VPC node to primary and take down the public VPC nodes.

Ouch! The room for error in the above was scary, but we wrote down any failure cases we could think of and how we would handle it. Our best defense here would be our ability to identify and react to any problems quickly and predictably.

Practice and Execute

Once the plan was in place, I spent a day practicing the transition for a staging environment. This was mainly to identify any major issues with my transition checklists but also to bolster more confidence in the major migration. If you can afford the time, I highly recommend it, since this time is surely cheaper than having extended downtime and potential data loss if your migration goes poorly.

Finally, the migration was executed. The exact order of our migration was the following:

  1. Non-stateful nodes and stateful nodes that didn’t require data migration came up first. Load balancers, app servers, memcached, RabbitMQ, etc.
  2. Load balancer IP was switched over to the VPC load balancer since VPC nodes can talk to external nodes through a NAT, so the VPC nodes in our case simply talked to the EC2 stateful nodes.
  3. Stateful nodes transitioned one at a time, according to their plans, with the database last.

Tips and Tricks

Now, some of our useful tips and tricks for migrating. Note that we use Chef, so some of these tips are around that.

  • Have the VPC nodes in a separate Chef environment, and use attributes for each chef role to slowly point nodes into the new VPC environment. For example: while the app servers were transitioned to VPC right away, they still pointed at the non-VPC stateful nodes for some time. To make this easy to switch, we had Chef attributes such as node[:app][:memcached_environment] = "production" which we could then switch to production-vpc the moment we wanted to make the switch.
  • Prior to transitioning, make sure all your configuration management scripts use the internal IP for nodes, since VPC nodes do not have a public IP or FQDN initially.
  • Use network ACLs and routing tables to make sure separate environments in the VPC (staging, QA, etc.) cannot talk to each other. Since you can guarantee the subnets of each environment, its easy to blacklist entire CIDR blocks.

EC2 to VPC: A transition worth doing.

Kiip recently completed a migration from EC2 to VPC. VPC exited beta and became generally available in all regions in August, and allows you to provision compute nodes within a virtual network in AWS. For anything more than simple websites, we believe migrating to VPC is something worth doing, or at the very least worth investigating.

Because Amazon VPC is a new service and requires a substantial amount of domain knowledge, this article will first cover a quick intro to the benefits and parts of building a VPC. Specific details about our VPC architecture here at Kiip, tooling we’ve built around it, and our migration process will be covered in future posts.

Read More