One of the benefits of MongoDB is its ability to shard your data set, opening the door to horizontal scaling.  Quite a bit has been written about how sharding works in general, so this post focuses on one decision: choosing a shard key.

When setting up sharding for MongoDB, one of the first questions you must answer is “What should I use as my shard key for this collection?”  This is a very important question, because your choice of shard key determines how the collection performs under different workloads.  For example, some shard keys are good for read access, and others are great for heavy writes.

What is a shard key?

A shard key is the field (or set of fields) that MongoDB uses to determine which chunk a document will be written to.  MongoDB does a great job of spreading chunks around all your shard servers, and it automatically balances them too.  If one chunk gets too large, it will split.  If one shard server has too many chunks, some will be migrated off to another server.  It’s like magic.
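
To make the chunk idea concrete, here is a rough sketch in Python of range-based chunks that split when they grow too large.  It is a toy model, not how MongoDB is implemented: the ChunkMap class, its method names, and the max_docs_per_chunk limit are all made up for illustration, and it splits on document count where MongoDB splits on data size.

```python
from bisect import bisect_right


class ChunkMap:
    """Toy model of range-based chunks (an illustration, not MongoDB internals)."""

    def __init__(self, max_docs_per_chunk=4):
        # Chunk i holds keys in [lower_bounds[i], lower_bounds[i + 1]).
        # The first bound is "" so every string key lands somewhere.
        self.lower_bounds = [""]
        self.chunks = [[]]
        # Assumption: split on document count; MongoDB splits on data size.
        self.max_docs = max_docs_per_chunk

    def chunk_index(self, key):
        # The owning chunk is the last one whose lower bound is <= key.
        return bisect_right(self.lower_bounds, key) - 1

    def insert(self, key):
        i = self.chunk_index(key)
        self.chunks[i].append(key)
        if len(self.chunks[i]) > self.max_docs:
            self._split(i)

    def _split(self, i):
        keys = sorted(self.chunks[i])
        split_key = keys[len(keys) // 2]
        left = [k for k in keys if k < split_key]
        right = [k for k in keys if k >= split_key]
        if not left:
            # Every key from the median down is identical: no usable split point.
            return
        self.chunks[i] = left
        self.chunks.insert(i + 1, right)
        self.lower_bounds.insert(i + 1, split_key)


chunk_map = ChunkMap()
for name in ["adams", "baker", "clark", "davis", "evans", "ford", "gray", "hill"]:
    chunk_map.insert(name)

print(chunk_map.lower_bounds)          # chunk boundaries after the splits
print(chunk_map.chunk_index("davis"))  # index of the chunk that owns "davis"
```

With these eight names the toy model ends up with three chunks, with boundaries at "clark" and "evans".  Note the early return in _split: if every key in a chunk is identical, there is nowhere to split it, and the chunk simply keeps growing.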

Things to consider

  • Randomness.  A completely random shard key (e.g. an MD5 hash) will be excellent for spreading reads and writes across all your shards.  If you can regenerate your shard key when you need to query, this can be a great way to balance your load.

    However, keep in mind that in order to benefit from sharding when doing reads, you need to use your shard key in your query.  Otherwise Mongo will need to query all shards.  When you use the shard key in your query, it’s able to jump directly to the correct chunk on the correct shard.  This may be hard to do with a random shard key.

  • Selectivity.  You want to be sure there is enough variation in your shard key values to be useful.  For example, if you’re sharding on log_type, which has values of “error”, “warning”, and “notice”, then there will be at most 3 chunks, and they could get very large.  This is not optimal.  To improve selectivity, you could use a compound key.

    There really isn’t any problem with having too many values for a shard key.  If you’re sharding on last_name, and you have 20,000 last names, that’s fine.  Mongo can use ranges.  So A-M might be in chunk1, and N-Z might be in chunk2 when you first start.  Eventually, A-Da may be in chunk1, Db-M in chunk2, and N-Z in chunk3.

  • Compound keys.  A shard key can contain more than one field.  Using the example above, if you wanted log_type in your shard key, you could also add the created_on timestamp.  Now, errors created today may be on one chunk, errors created yesterday on another chunk, and all your notices on a third chunk.

  • Timestamps.  At this point, you may think that a timestamp is the perfect shard key.  There are plenty of values (so it’s very selective), yet it’s not random so it can be easy to use in your query, assuming you know the timestamp at query-time.  A good example would be log data.  You could shard on timestamps if you want to query all log data for yesterday.  Mongo would instantly know which chunks to pull the logs from.  Sweet!

    There is a pitfall to consider though.  Since there is no randomness, and since timestamps are always increasing in value, you will always be writing to one chunk.  The chunk you’re writing to may change over time, but at any given moment all your writes will be directed to the same place.  This could be very bad if you have heavy writes, because a single shard has to absorb all of them.
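
Here is a small sketch in Python of the trade-off between a random key and a timestamp key.  It only counts which chunk each write would land in.  The chunk boundaries, the four-chunk layout, and the helper names (md5_key, write_distribution) are assumptions made up for this illustration, not anything MongoDB exposes.

```python
import hashlib
from bisect import bisect_right
from collections import Counter


def md5_key(value):
    """Random-looking shard key: the MD5 digest of a value, as an integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def write_distribution(keys, lower_bounds):
    """Count how many writes land in each chunk, given sorted lower bounds."""
    return Counter(bisect_right(lower_bounds, key) - 1 for key in keys)


# Assumption: four chunks that split the 128-bit MD5 space evenly.
HASH_BOUNDS = [0, 2**126, 2**127, 3 * 2**126]
# Assumption: four chunks whose boundaries were set by older timestamps.
TIME_BOUNDS = [0, 1000, 2000, 3000]

usernames = ["user%d" % n for n in range(1000)]
hashed_writes = write_distribution([md5_key(u) for u in usernames], HASH_BOUNDS)

# New documents always carry timestamps later than anything already stored.
new_timestamps = range(3500, 4500)
timestamp_writes = write_distribution(new_timestamps, TIME_BOUNDS)

print("hashed key:   ", sorted(hashed_writes.items()))
print("timestamp key:", sorted(timestamp_writes.items()))
```

The hashed keys scatter across all four chunks, while every one of the new timestamps falls into the last chunk.  The flip side is the read problem described under Randomness: to find a document by its hashed key, you have to be able to recompute md5_key at query time.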

Shard key examples

Let’s go through an example.  Let’s say your collection consists of User documents.  This is a very simple and very common use case.  A document might look like:

{
    _id: <integer>, // ex: 1831
    username: <string>, // ex: lehresman
    password: <string>, // sha1 hash
    first_name: <string>, // ex: 'Luke'
    last_name: <string>, // ex: 'Ehresman'
    created_on: <integer>, // timestamp when created
    modified_on: <integer> // timestamp when modified
}

We have many possibilities for shard keys.

  • {password} - This is nice because it’s very random, since it’s a SHA-1 hash.  However, this is not optimal.  What happens if a user changes their password?  This is a question I haven’t been able to answer yet.  I suspect that the document is migrated to a new chunk.  If so, you’re introducing some overhead any time a user changes their password.
  • {created_on} - This is bad for reads, because we won’t know when a user was created at the moment that user is signing in to the site.  Thus, we’ll have to do a general query across all shards since we won’t be able to use the shard key.
  • {username} - This is a very good choice for this collection.  It’s unique, so it’s very selective.  Plus, it’s very likely to be known at query-time.  This would be my choice in this situation.
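
The read-side difference between these choices can be sketched in a few lines of Python.  This is a toy router, assuming the collection is sharded on username.  The chunk boundaries, the shard names, and the shards_for_query helper are hypothetical, and it only understands exact-match queries.

```python
from bisect import bisect_right

# Hypothetical layout for a users collection sharded on username.
SHARD_KEY = "username"
CHUNK_LOWER_BOUNDS = ["", "h", "p"]
CHUNK_TO_SHARD = ["shard0", "shard1", "shard2"]


def shards_for_query(query):
    """Return the shards an exact-match query must visit in this toy model."""
    if SHARD_KEY in query:
        # The shard key is known, so one chunk (and one shard) is enough.
        i = bisect_right(CHUNK_LOWER_BOUNDS, query[SHARD_KEY]) - 1
        return [CHUNK_TO_SHARD[i]]
    # No shard key in the query: every shard has to be asked.
    return sorted(set(CHUNK_TO_SHARD))


print(shards_for_query({"username": "lehresman"}))
print(shards_for_query({"created_on": 1300000000}))
```

A sign-in lookup by username touches only shard1 here, while a query on created_on alone fans out to all three shards.  On a real cluster, the shard key would be declared with something like sh.shardCollection("mydb.users", { username: 1 }) in the mongo shell, where mydb is a placeholder database name.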

Let’s look at another example that is very different.  Log data has very different access patterns from user data.  Log data is typically written “now” and read over a range of time (depending on your application).  The fact that new documents almost always carry a “now” timestamp makes this an interesting use case.  Let’s say a document looks like this:

{
    log_type: <string>, // one of "warning", "notice", "error"
    application: <string>,
    message: <string>,
    created_on: <integer> // timestamp when created
}

Some possible shard keys are:

  • {log_type} - This suffers from not being selective enough (as described above in the “Selectivity” section).
  • {application} - It’s very likely that this isn’t selective enough either.  It’s dependent on how many applications are logging data, but chances are it’s just a handful.
  • {created_on} - If this were used, all writes would go to one chunk.  That may be fine, depending on how intense your writes are.  From the read side this would be great, since we’ll usually be asking for log data within a range of time and so will know this value when querying.
  • {log_type, application, created_on} - If it were up to me, I would probably choose a compound shard key here.  The combination of log_type and application would spread the writes across my various shards, and I would likely know these when querying.
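
As a rough illustration of why the compound key helps, the Python sketch below models the key as a tuple, which Python compares field by field from left to right.  The chunk boundaries, the application names ("billing", "web"), and the helper functions are invented for this example.

```python
from bisect import bisect_right

# Hypothetical chunk lower bounds for (log_type, application, created_on).
LOWER_BOUNDS = [
    ("", "", 0),
    ("error", "billing", 0),
    ("error", "web", 0),
    ("notice", "billing", 0),
    ("notice", "web", 0),
]


def chunk_for(log_type, application, created_on):
    """Index of the chunk that owns one compound shard key value."""
    return bisect_right(LOWER_BOUNDS, (log_type, application, created_on)) - 1


def chunks_for_range(log_type, application, start, end):
    """Chunks that may hold one log_type/application pair for a time range."""
    first = chunk_for(log_type, application, start)
    last = chunk_for(log_type, application, end)
    return list(range(first, last + 1))


now = 1300000000  # every writer stamps its documents with the same "now"
writes = [
    ("error", "billing", now),
    ("error", "web", now),
    ("notice", "billing", now),
    ("notice", "web", now),
]
print(sorted({chunk_for(*key) for key in writes}))
print(chunks_for_range("error", "web", now - 86400, now))
```

Even though all four writes share the same timestamp, they land in four different chunks, and a query for yesterday’s web errors only needs chunk 2.  A query that leaves out log_type and application would lose that targeting, so this layout only pays off if those values really are known at query time.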

Conclusion

Picking a shard key is not easy.  You really need to spend some time considering how data will be read from and written to the collection.  A well-chosen shard key is worth its weight in gold, so think it through ahead of time.