Tuesday, September 30, 2014

How to generate CouchDB document ID?

A CouchDB database is just a bag of documents, and you need to make sure the document IDs are unique. But this is not the only requirement: the ID should also be quick to generate, efficiently used by CouchDB, as short as possible, and provide useful information in logs or monitoring tools.

The first idea is typically to use a UUID. You can generate UUIDs using your host programming language, or ask CouchDB to generate one or more UUIDs for you using the _uuids resource:

GET https://xxx.cloudant.com/_uuids?count=5
{"uuids":["648961210dab8fdffac52cc2f28e143e",
          "648961210dab8fdffac52cc2f28e200f",
          "648961210dab8fdffac52cc2f28e2d2e",
          "648961210dab8fdffac52cc2f28e3263",
          "648961210dab8fdffac52cc2f28e3997"]}
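If you prefer to generate the ID in the host language instead, here is a minimal sketch for Node.js. The function name newDocId is my own, and the value is simply 16 random bytes printed as hex, so it has the same 32-character shape as the values above but is not produced by the same algorithm as CouchDB:

```javascript
const crypto = require('crypto');

// Returns 128 random bits as 32 lowercase hex characters.
// Assumption: a random value is good enough for your uniqueness needs;
// this does not reproduce whatever algorithm the server uses for _uuids.
function newDocId() {
  return crypto.randomBytes(16).toString('hex');
}

console.log(newDocId());
```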
Then, you can create the document using a PUT request with the UUID specified in the URI:
Request:
PUT https://xxx.cloudant.com/blogdb/648961210dab8fdffac52cc2f28e143e
{ "customer" : "c1" ...}

Response:
{"ok":true,"id":"648961210dab8fdffac52cc2f28e143e","rev":"1-9f1fc712b431b44ec6cf09369183a96b"}
Note that if the ID already exists, a conflict is returned:
{"error":"conflict","reason":"Document update conflict."}
Alternatively, you can let CouchDB generate the UUID for you by using a POST request, and the id will be returned:
Request:
POST https://xxx.cloudant.com/blogdb/
{ "customer" : "c1" ...}

Response:
{"ok":true,"id":"f32fee7ca6ce5a755900525f6c87f346","rev":"1-acea01cbb45b0d08d9f534f9651ef7b1"}
UUIDs are very opaque, and this is good in some cases. However, when you look at logs or lists of objects, they do not help you know which object you are referencing, especially when your database holds different types of documents, which will all be mixed up. Also, depending on the algorithm CouchDB uses to update its B-tree indexes, UUIDs may not be the best choice (see comments and some tests).

Finally, generating sequence numbers such as 1, 2, 3... is generally difficult in a distributed environment because it requires some synchronization. Sequence numbers are usually nice for end users but are not good practice for a scalable implementation.

With this in mind, I recommend building document IDs as follows, although the right choice depends on your application and performance needs:

  • Include the document type
  • Include a related identifier such as a user ID or another document ID
  • Include a timestamp such as the number of milliseconds since the Unix epoch
For example, the following id meets my requirements so far: order.1SXDGF.1412020886716. I should add some performance benchmarks later.
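The three parts listed above can be assembled with a small helper. This is only a sketch: the name makeDocId, the "." separator check and the default of Date.now() are my assumptions, not something CouchDB requires.

```javascript
// Builds an ID of the form <type>.<relatedId>.<timestamp in milliseconds>.
function makeDocId(type, relatedId, timestampMs = Date.now()) {
  for (const part of [type, relatedId]) {
    // Reject parts containing the separator so the ID stays easy to split.
    if (typeof part !== 'string' || part === '' || part.includes('.')) {
      throw new Error('ID parts must be non-empty strings without "."');
    }
  }
  return [type, relatedId, timestampMs].join('.');
}

console.log(makeDocId('order', '1SXDGF', 1412020886716));
// prints "order.1SXDGF.1412020886716"
```

Two documents of the same type and related identifier created within the same millisecond would get the same ID, in which case the PUT returns the conflict error shown earlier.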

Wednesday, September 24, 2014

What is the best granularity of CouchDB databases?

After the granularity of documents that I covered in my previous post, the typical next question is about organizing the various document types. Is it better to store documents in different databases or group the documents in a single one?

CouchDB does not have a built-in notion of document type. As a comparison, you can organize documents by collections in MongoDB and you have basic functions to manage your collections (insert, remove, find, drop etc). There is nothing like it with CouchDB. A CouchDB database is able to contain any document of any shape or form. You then need to use views to access a subset of documents, and you can use any criteria. However, a commonly used pattern is to make sure all the documents have a "type" attribute. You can then use the type to create views to return a subset of the documents. For example, if you have documents for orders and line items, you can access them by type using this design document:

   {
     "language": "javascript",
     "views": {
       "orders": {
         "map": "function(doc) { if (doc.type == 'order') emit(null, doc._rev); }"
       },
       "items": {
         "map": "function(doc) { if (doc.type == 'item') emit(null, doc._rev); }"
       }
     }
   }
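To see what such a map function returns without deploying it, you can run the same logic locally. In this sketch, runMap and the sample documents are made up for illustration, and emit is passed in as a parameter because it only exists as a global inside CouchDB:

```javascript
// Same logic as the "orders" view above, with emit injected.
function ordersMap(doc, emit) {
  if (doc.type == 'order') emit(null, doc._rev);
}

// Applies a map function to each document and collects rows shaped
// like the rows of a view result (id, key, value).
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

const sampleDocs = [
  { _id: 'order1', _rev: '1-aaa', type: 'order' },
  { _id: 'order1.item1', _rev: '1-bbb', type: 'item' },
];
console.log(runMap(ordersMap, sampleDocs));
```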

When storing different document types in a single database, you also need to make sure that document ids cannot be in conflict. One approach that I recommend is to use the document type as a prefix of the document id. Deleting all or a subset of the documents of a given type can be achieved using the bulk API, by sending the document ids and their revisions with the _deleted attribute set to true. However, this requires getting the complete list of document ids and revisions before deleting, and the bulk update can fail due to conflicts.
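As a rough sketch of the bulk deletion, the rows returned by a view that emits the revision as its value (like the views above) can be turned into a body for a POST to the database's _bulk_docs resource. The helper names buildBulkDelete and failedIds are mine, and I assume each entry of the response carries an "error" field when the deletion of that document was rejected:

```javascript
// rows: view rows of the form { id, key, value } where value is the revision.
function buildBulkDelete(rows) {
  return {
    docs: rows.map((row) => ({ _id: row.id, _rev: row.value, _deleted: true })),
  };
}

// results: the array returned by the bulk request, one entry per document.
// Returns the ids that were not deleted (for example because of a conflict).
function failedIds(results) {
  return results.filter((r) => r.error).map((r) => r.id);
}

const body = buildBulkDelete([
  { id: 'order1', key: null, value: '1-aaa' },
  { id: 'order2', key: null, value: '3-ccc' },
]);
console.log(JSON.stringify(body));
```

The ids returned by failedIds are the ones to read again and retry.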

On the other hand, you can create new databases and store different document types in each. The major drawback of this approach is that you cannot create views across databases. For example, if you want to list all orders and items of a given customer, using different databases for orders and items will be a problem. You may also want to set up replication later on, and you will then need to create several replications and monitor more things. But if the data is really not related, using different databases is a nice option. Based on other posts, CouchDB should handle a large number of databases, even if some configuration may be necessary (see this post).

The question of database granularity can also come with the multi-tenancy requirement. Is it better to create a single database shared by tenants or to create a database per tenant? I will distinguish between two multi-tenancy use cases:

  • (a) a lot of the infrastructure is shared between tenants and you need to monitor the activity of tenants globally. In this case, using a single database is better, and you need to make sure that all documents have a tenant attribute.
  • (b) each tenant needs to store data of its own and you do not need to aggregate data from different tenants in reports. In this case, using a database per tenant is better.
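For case (a), one way to make sure every document has a tenant attribute is a validation function stored in a design document under "validate_doc_update". The following is only a sketch: the name requireTenant and the error message are mine, and skipping design documents is my own choice. CouchDB passes more arguments (the old document, the user context, the security object) that are not used here.

```javascript
// Rejects any document that has no non-empty "tenant" string.
function requireTenant(newDoc) {
  if (newDoc._deleted) return; // deletions do not carry the attribute
  if (newDoc._id.indexOf('_design/') === 0) return; // not tenant data
  if (typeof newDoc.tenant !== 'string' || newDoc.tenant === '') {
    // Throwing an object with a "forbidden" field rejects the write.
    throw { forbidden: 'Every document needs a "tenant" attribute' };
  }
}

requireTenant({ _id: 'order.1SXDGF.1412020886716', tenant: 't1' }); // accepted
```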

In conclusion, I recommend using a single database to store various document types so that you can create views to manage your data with more flexibility. If your concern is about multi-tenancy, use a single database or a database per tenant depending on the need to aggregate data from different tenants.

Monday, September 22, 2014

What is the best granularity of CouchDB documents?

Like many of us, I have a background in relational databases, but now is the time to understand the new use cases and new technologies provided by NoSQL databases. There are many flavors of NoSQL databases, and I am going to start a series of posts about CouchDB to share my experience on several key subjects. CouchDB is a JSON document database, and probably one of the first questions anybody will ask is about the granularity of documents. Is it better to define coarse-grained documents with many attributes and sub-objects, or to define smaller ones? In order to answer this question in the context of CouchDB, I will first define the three design forces that must be balanced: unity, size and concurrency.

The typical design process for a relational database is to start with a well-defined entity-relationship model, then derive the normalized table representation, and later on denormalize on a case-by-case basis for performance reasons. Using the relational model, you end up creating tables with a very fine-grained representation, and you do not distinguish between tables that would typically be accessed together and the ones that are more distant semantically. The application has to join data from the tables to get the data in a more meaningful form. This is where I see the main advantage of document databases, because you can keep semantically coupled data together in a single document. This is what I call unity. For example, an order having several line items could be stored like this:

   { _id : "order1",
     type : "order",
     customer : "c1",
     items: [
       {product : "p1", quantity : 1},
       {product : "p2", quantity : 5},
       {product : "p3", quantity : 2}
     ]
   }

The second force that you need to balance is the size of documents. Indeed, each time a document is changed, the whole document must be exchanged between the server and the client. There is no partial update as there is with MongoDB. It is difficult to define a size limit, but if you have a lot of text, for example, then decomposing into smaller documents will be better.

Finally, you need to think about concurrency. CouchDB concurrency control is based on document revisions, and there are no transactions. Each document has a revision, and updating a document is just about creating a new document identified by the same id but having a new revision. When the application needs to update a document, it has to provide the revision it wants to update. If the document has already been changed by another part of the application, its revision has changed and the new update will be rejected with a conflict error. In this case, you may want to retry your update after potentially merging it with the already updated document. However, a highly concurrent application updating the same documents quickly becomes a big problem and will not work. You then have two major options: either you design your application so that you create a new document at each update, or you decompose your documents based on access patterns. With the latter option, you need to realize that concurrent updates may not always be about the same part of the document, and so decomposing the document into several small documents reduces the conflicts. In the example mentioned above, if line items are frequently changed by different parts of the application concurrently, creating a document for each line item may be necessary. Accessing the line items of an order will then require a view that will fetch the items.

   { _id : "order1",
     type : "order",
     customer : "c1"
   }
   { _id : "order1.item1",
     type : "item",
     order: "order1",
     product : "p1", 
     quantity : 1
   }
   { _id : "order1.item2",
     type : "item",
     order: "order1",
     product : "p2", 
     quantity : 5
   }
   { _id : "order1.item3",
     type : "item",
     order: "order1",
     product : "p3", 
     quantity : 2
   }
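A view that fetches the items of an order could use the order id as the key, so that querying the view with key="order1" returns only the items of that order. Here is a sketch that can be run locally; itemsOfOrder is a made-up helper that stands in for the view query, and emit is passed as a parameter because it only exists as a global inside CouchDB:

```javascript
// Map function: one row per line item, keyed by the id of its order.
function itemsByOrderMap(doc, emit) {
  if (doc.type == 'item') {
    emit(doc.order, { product: doc.product, quantity: doc.quantity });
  }
}

// Local stand-in for querying the view with a given key.
function itemsOfOrder(docs, orderId) {
  const rows = [];
  for (const doc of docs) {
    itemsByOrderMap(doc, (key, value) => {
      if (key === orderId) rows.push({ id: doc._id, key, value });
    });
  }
  return rows;
}

const orderDocs = [
  { _id: 'order1', type: 'order', customer: 'c1' },
  { _id: 'order1.item1', type: 'item', order: 'order1', product: 'p1', quantity: 1 },
  { _id: 'order2.item1', type: 'item', order: 'order2', product: 'p9', quantity: 4 },
];
console.log(itemsOfOrder(orderDocs, 'order1'));
```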

As you can see, unity and concurrency are conflicting forces in the case of CouchDB. This is not necessarily the case with other document databases. MongoDB has an atomic find-and-modify operation, which helps a lot because you do not need to decompose the update of a document into two steps: getting the revision, and sending the update. The question is: why does CouchDB not provide such an atomic operation?

With this in mind, if the typical size of the data is reasonable, I would recommend starting with coarse-grained documents and decomposing on a case-by-case basis, based on concurrency needs. Otherwise, think about smaller documents right away.
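To make the two steps concrete, here is a small in-memory simulation of the revision check with a retry loop. It does not talk to CouchDB: the store, the put function and the "N-sim" revision strings are simplifications of mine that only mimic the behavior described above.

```javascript
// In-memory stand-in for a database: id -> stored document.
const store = new Map();

// The write succeeds only if the caller supplies the revision that is
// currently stored (or no revision for a new document).
function put(doc) {
  const current = store.get(doc._id);
  const currentRev = current ? current._rev : undefined;
  if (doc._rev !== currentRev) return { error: 'conflict' };
  const generation = currentRev ? parseInt(currentRev, 10) + 1 : 1;
  const saved = { ...doc, _rev: generation + '-sim' };
  store.set(doc._id, saved);
  return { ok: true, id: saved._id, rev: saved._rev };
}

// Step 1: read the latest revision. Step 2: send the update.
// On conflict, read again and re-apply the change.
function updateWithRetry(id, change, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const latest = store.get(id);
    const result = put(change({ ...latest }));
    if (result.ok) return result;
  }
  throw new Error('still in conflict after ' + maxAttempts + ' attempts');
}

put({ _id: 'order1', type: 'order', customer: 'c1' });
console.log(updateWithRetry('order1', (doc) => ({ ...doc, customer: 'c2' })));
```

The more writers compete for the same document, the more often the loop has to go around, which is why decomposing the document helps.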

Thursday, September 18, 2014

How to enforce secured connections with IBM Bluemix?

IBM Bluemix has a DataPower appliance in front of all deployed applications (see Bluemix security). In particular, the DataPower terminates secured connections, so requests that arrive over a secured connection are forwarded to the application over a non-secured one. For example, all the HTTPS traffic is forwarded as HTTP traffic to your applications.

This is very nice because you have nothing to configure in your app to accept secured connections. In addition, all the compute power needed to decrypt messages is on the DataPower, and not your application server instance.

However, it has one very important drawback. It does not enforce the use of secured connections. For example, a client application could use HTTP where HTTPS should have been used. This would be a major issue when using Basic Authentication, or anytime an Authorization header or access token is used.

Fortunately the DataPower sets some interesting headers to indicate several attributes of the original connection. One of them is the header $WSIS that indicates if the original connection was secured or not. With this in mind, we can easily write a servlet filter like this:


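  @Override
  public void doFilter(ServletRequest request, ServletResponse response,
      FilterChain chain) throws IOException, ServletException {
    if (request instanceof HttpServletRequest && response instanceof HttpServletResponse) {
      HttpServletRequest req = (HttpServletRequest) request;
      HttpServletResponse res = (HttpServletResponse) response;
      // $WSIS is set by the DataPower; header names are matched case-insensitively.
      String wsis = req.getHeader("$WSIS");
      // Header absent (for example when testing locally): let the request through.
      // Header present but not "true": the original connection was not secured.
      if (wsis != null && !wsis.equalsIgnoreCase("true")) {
        res.setStatus(HttpServletResponse.SC_FORBIDDEN); // 403
        return;
      }
    }
    chain.doFilter(request, response);
  }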
With this filter, 403 (FORBIDDEN) is returned if non-secured connections were used, so that secured connections are enforced when deployed to Bluemix. In addition, when you test your application locally you can still use the non-secured connections because the header $WSIS will not be present.

To conclude, I recommend using this simple filter.