Sunday, March 13, 2011

One way MongoDB will get you in trouble if you're not careful.

One way MongoDB will get you in trouble if you're not careful:

Storing everything in a single document.

An example I see a lot of is storing blog posts and the associated comments inside of a single MongoDB document. For instance, this presentation about scaling with MongoDB uses the following example schema:

{
  _id: ObjectID('Incredibly Long, unfriendly GUID'),
  author: 'roger',
  date: 'Some random Date String',
  text: 'This is the main body of the blog post.',
  comments_count: 1,
  comments: [
    { author: 'Gretchen',
      date: 'Yesterday',
      text: 'This is a spam comment; there are no real comments on the Internet.' }
  ]
}

What's wrong with this schema? Like, 80 million things.

What happens when you want to collect every blog post author and show it somewhere on the blog, let's say as an "authors" sidebar?

When you want to show the site's most recent comments, by date?

Show a billboard of your blog's most prolific commenters and their highest rated comments?

You see where I'm going with this. Theoretically, you could write a map reduce function for all of the above queries -- theoretically. Looking at it from the perspective of a pragmatic programmer, though, do you really want to dip your feet into what are essentially SQL's Stored Procedures?

Make no mistake, the syntax is different, but once you start writing a map reduce function for any of the above queries, you have to maintain it -- whenever the schema changes, you have to update it, whenever the data moves, you have to update it. Poor schema modeling will turn the beauties of an efficient map reduce into a maintenance nightmare.
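To make that concrete, here's a rough sketch in plain JavaScript (no database involved) of what "most recent comments" costs when the comments live inside the posts. The sample data, the Date values, and the recentComments helper are all made up for illustration. The point is only that every post has to be unpacked before you can sort anything.

```javascript
// Hypothetical sample data following the embedded schema above.
const posts = [
  {
    _id: "post1",
    author: "roger",
    text: "This is the main body of the blog post.",
    comments: [
      { author: "Gretchen", date: new Date("2011-03-10"), text: "First!" },
      { author: "Hank", date: new Date("2011-03-12"), text: "Nice post." },
    ],
  },
  {
    _id: "post2",
    author: "roger",
    text: "Another post.",
    comments: [
      { author: "Gretchen", date: new Date("2011-03-11"), text: "Me again." },
    ],
  },
];

// Site-wide recent comments: walk every post, pull out its comments,
// remember which post each came from, and only then sort by date.
function recentComments(allPosts, limit) {
  return allPosts
    .flatMap((post) =>
      post.comments.map((comment) => ({ ...comment, post_id: post._id }))
    )
    .sort((a, b) => b.date - a.date) // newest first
    .slice(0, limit);
}

const latest = recentComments(posts, 2);
console.log(latest.map((c) => `${c.author} on ${c.post_id}`));
```

With two posts this is harmless. With a few thousand posts, each carrying a long comment thread, you're loading the whole blog to fill one sidebar.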

There's a right time to jam everything into one document, and there's a wrong time.

The wrong time: any valuable data you are going to want to isolate and iterate through, without having to load a possibly enormous parent document into memory first. These are the basics in relational models: the user model, the comment model, the blog post model. See above for the laundry list of reasons why.
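For comparison, here's a minimal sketch of the same blog with posts and comments split apart, again using in-memory arrays as stand-ins for two collections. The names (posts, comments, post_id, postWithComments) are my own assumptions, not anything from the presentation. Each question from the list above becomes a short pass over one collection, and pulling a post together with its comments is a small join done in application code.

```javascript
// Hypothetical collections, modeled as plain arrays for illustration.
const posts = [
  { _id: "post1", author: "roger", text: "First post." },
  { _id: "post2", author: "alice", text: "Second post." },
  { _id: "post3", author: "roger", text: "Third post." },
];

const comments = [
  { _id: "c1", post_id: "post1", author: "Gretchen", date: new Date("2011-03-10"), text: "First!" },
  { _id: "c2", post_id: "post2", author: "Hank", date: new Date("2011-03-11"), text: "Nice post." },
  { _id: "c3", post_id: "post1", author: "Gretchen", date: new Date("2011-03-12"), text: "Me again." },
];

// "Authors" sidebar: distinct authors, no comments loaded at all.
const authors = [...new Set(posts.map((p) => p.author))];

// Most recent comments: sort one collection, no posts loaded at all.
const recent = [...comments].sort((a, b) => b.date - a.date).slice(0, 2);

// Application-side join: fetch the post, then its comments by post_id.
function postWithComments(postId) {
  const post = posts.find((p) => p._id === postId);
  const postComments = comments.filter((c) => c.post_id === postId);
  return { ...post, comments: postComments };
}

console.log(authors);
console.log(recent.map((c) => c._id));
console.log(postWithComments("post1").comments.length);
```

In the shell, the first two would look something like db.posts.distinct('author') and db.comments.find().sort({ date: -1 }).limit(2), assuming collections with those names.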

The right time: miscellaneous data. Metadata. Any kind of data that won't fit cleanly into a relational model, or any kind of data that will be awkward to work with when it's been detached from a document.

Too vague?

Let's say you want users of your site to be able to hide certain posts (blog posts OR comments) by certain authors ("it offends my eyes!"). You add an attribute to the User document, "hidden_authors," which is an array of ObjectIDs. This is the right time to embed, for three reasons:


1) The attribute has no meaning outside of the User document.

2) In general, if you need to access a User document's hidden_authors attribute, you will have already loaded that document into memory. If you haven't, thanks to MongoDB's excellent querying you can query inside of the array, which covers most cases ("Is this author blocked by this user?") quite nicely.

3) It literally makes no sense as a separate document. As a separate document, it would end up being more unwieldy: not only would you need the hidden_authors attribute on this second document, but you'd also need an attribute to point to the related User document. What have you gained? Nothing. You've actually lost some flexibility, since now you're dealing with two documents. In a relational database like MySQL or Postgres, you've got no choice, but in MongoDB the easier way is also the better way in this case.
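Here's a small sketch of how that plays out once the User document is already in memory. The short string ids stand in for real ObjectIDs, and the isHidden and visiblePosts helpers and the author_id field are hypothetical names I picked for the example.

```javascript
// Hypothetical User document; short strings stand in for ObjectIDs.
const user = {
  _id: "user1",
  name: "roger",
  hidden_authors: ["author7", "author9"],
};

// Hypothetical posts, each pointing at its author by id.
const posts = [
  { _id: "p1", author_id: "author7", text: "You will never see this." },
  { _id: "p2", author_id: "author2", text: "This one is fine." },
];

// "Is this author blocked by this user?" is a lookup in the embedded array.
function isHidden(userDoc, authorId) {
  return userDoc.hidden_authors.includes(authorId);
}

// Drop anything written by an author the user has hidden.
function visiblePosts(userDoc, allPosts) {
  return allPosts.filter((post) => !isHidden(userDoc, post.author_id));
}

console.log(isHidden(user, "author7"));
console.log(visiblePosts(user, posts).map((p) => p._id));
```

If the document isn't loaded yet, a query shaped like db.users.findOne({ _id: userId, hidden_authors: authorId }) asks the same question on the server, since matching a value against an array field matches any element. The users collection name is an assumption here.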

I know the whole "blog post + comments = single document" is the canonical example for everyone who discusses MongoDB, but I really wish people would just let it die and be replaced by something that makes more sense. If you've had your head deep enough in databases, when you see that example you see something very "brittle" and something that is going to give you some headaches in the near future.

1 comment:

Edemilson Lima said...

Also, a single document is limited to 16 MB of data, so anything embedded cannot grow forever.

The only problem with normalizing in MongoDB is that we don't have JOINs, so the data must be collected and joined in the application.