Monday, September 22, 2014

What is the best granularity of CouchDB documents?

Like many of us, I have a background in relational databases, and it is now time to understand the new use cases and technologies offered by NoSQL databases. There are many flavors of NoSQL databases, and I am starting a series of posts about CouchDB to share my experience on several key subjects. CouchDB is a JSON document database, and one of the first questions anybody will ask is about the granularity of documents. Is it better to define coarse-grained documents with many attributes and sub-objects, or to define smaller ones? To answer this question in the context of CouchDB, I will first define the three design forces that must be balanced: unity, size and concurrency.

The typical design process for a relational database is to start with a well-defined entity-relationship model, derive the normalized table representation, and later denormalize on a case-by-case basis for performance reasons. With the relational model, you end up creating tables with a very fine-grained representation, and nothing distinguishes tables that are typically accessed together from those that are semantically more distant. The application has to join data from the tables to get it back in a more meaningful form. This is where I see the main advantage of a document database: you can keep semantically coupled data together in a single document. This is what I call unity. For example, an order with several line items could be stored like this:

   { _id : "order1",
     type : "order",
     customer : "c1",
     items: [
       {product : "p1", quantity : 1},
       {product : "p2", quantity : 5},
       {product : "p3", quantity : 2}
     ]
   }

The second force that you need to balance is the size of documents. Each time a document is changed, the whole document must be exchanged between the server and the client. There is no partial update as there is with MongoDB. It is difficult to define a size limit, but if a document holds a lot of text, for example, then decomposing it into smaller documents is better.
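
To make this concrete, here is a minimal sketch of what such a full-document update looks like over CouchDB's HTTP API. It assumes a local server at http://localhost:5984, a database named shop, and a JavaScript runtime that provides fetch; all these names are illustrative.

   // Minimal sketch: an update always sends the complete document back.
   const base = "http://localhost:5984/shop";   // illustrative database URL

   async function addItem(orderId, item) {
     // 1. Read the current document, including its _rev.
     const doc = await (await fetch(`${base}/${orderId}`)).json();
     // 2. Modify it in memory.
     doc.items.push(item);
     // 3. PUT the whole document back; CouchDB has no partial update.
     const res = await fetch(`${base}/${orderId}`, {
       method: "PUT",
       headers: { "Content-Type": "application/json" },
       body: JSON.stringify(doc)        // full body, _rev included
     });
     return res.json();                 // { ok: true, id: ..., rev: ... } on success
   }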

Finally, you need to think about concurrency. CouchDB concurrency control is based on document revisions, and there are no transactions. Each document has a revision, and updating a document simply creates a new document identified by the same id but with a new revision. When the application needs to update a document, it has to provide the revision it wants to update. If the document has already been changed by another part of the application, the revision no longer matches and the update is rejected with a conflict error. In that case, you may want to retry the update after potentially merging with the already updated document. However, a highly concurrent application updating the same documents quickly becomes a real problem and will not work.

You then have two major options: either you design your application so that it creates a new document at each update, or you decompose your documents based on access patterns. With the latter option, keep in mind that concurrent updates do not always touch the same part of a document, so splitting it into several smaller documents reduces conflicts. In the example above, if line items are frequently changed concurrently by different parts of the application, creating a document for each line item may be necessary. Accessing the line items of an order will then require a view that fetches them (see the sketch after the listing below).

   { _id : "order1",
     type : "order",
     customer : "c1"
   }
   { _id : "order1.item1",
     type : "item",
     order: "order1",
     product : "p1", 
     quantity : 1
   }
   { _id : "order1.item2",
     type : "item",
     order: "order1",
     product : "p2", 
     quantity : 5
   }
   { _id : "order1.item3",
     type : "item",
     order: "order1",
     product : "p3", 
     quantity : 2
   }
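
As a rough sketch, such a view could be defined in a design document like the one below; the design document name orders and the view name items_by_order are my own choices, not anything imposed by CouchDB.

   {
     "_id": "_design/orders",
     "views": {
       "items_by_order": {
         "map": "function (doc) { if (doc.type === 'item') { emit(doc.order, null); } }"
       }
     }
   }

Querying the view with key="order1" and include_docs=true then returns all the items of the order in a single request.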

As you can see, unity and concurrency are conflicting forces in the case of CouchDB. This is not necessarily true of other document databases. MongoDB has the atomic find-and-modify operation, which helps a lot because you do not need to decompose an update into two steps: getting the revision and sending the update. An open question is why CouchDB does not provide such an atomic operation.
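
For completeness, here is a sketch of what this two-step update, with a retry on conflict, could look like through the HTTP API, under the same illustrative assumptions as the earlier sketch:

   // Sketch of the read-then-write cycle with a retry when CouchDB answers 409.
   const db = "http://localhost:5984/shop";     // illustrative database URL

   async function updateWithRetry(docId, mutate, attempts = 5) {
     for (let i = 0; i < attempts; i++) {
       const doc = await (await fetch(`${db}/${docId}`)).json();  // current revision
       mutate(doc);                                               // apply the change
       const res = await fetch(`${db}/${docId}`, {
         method: "PUT",
         headers: { "Content-Type": "application/json" },
         body: JSON.stringify(doc)
       });
       if (res.status !== 409) return res.json();  // success, or a non-conflict error
       // 409 Conflict: someone else won the race; re-read and try again.
     }
     throw new Error("still conflicting after " + attempts + " attempts");
   }

For example, updateWithRetry("order1.item2", doc => { doc.quantity += 1; }) would increment a quantity and transparently absorb a concurrent change to the same item.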

With this in mind, if the typical size of the data is reasonable, I would recommend starting with coarse-grained documents and decomposing on a case-by-case basis according to concurrency needs. Otherwise, think about smaller documents right away.
