Sunday, February 8, 2015

How to retrieve incrementally a time series using a REST API?

In this post, I would like to propose a REST API pattern to retrieve incrementally a time series.

Let's take the example of a system generating a time series such as price change, log events, machine status change etc. This system is able to send a sequence of data points using a messaging protocol, but as with many messaging protocols, the delivery and the order of delivery are not guaranteed. The goal is to propose a service that will record these messages and provide a REST API to incrementally access existing data points or new data points when they are available and in order.

First of all, we can easily record these messages in a data store, let's say using JSON format and in a MongoDB collection or in a CouchDB database. We can also notice that messages should be ordered and assigned a sequence number at the source. This way, it is possible to tell if 2 messages are contiguous or if there are one or more messages not yet received in between. So the first part of the service is about subscribing to the data stream, and storing the messages as they come.

Then, the second part of the service is to provide a REST API to access the time series (no Websockets or long polling for now). Well, the REST API can be very simple as this one:

where TS is the name of the time series and X the sequence index at which we need to start returning the data points. The returned data could be of this form with X=10:
  { "index" : 10,
    "date" : 1423259728987,
    "value" : 10},
  { "index" : 11,
    "date" : 1423259730094,
    "value" : 42},

However, there are a couple of things to consider:

  • The list of data points can be very large, and could cause out-of-memory crashes on the client or on the server depending on the implementation.
  • Data points are received asynchronously, and there is no guarantee that the returned list will contain a continuous sequence.
  • Some data point may be lost and will not be delivered, and again there is no guarantee that the returned list will contain a continuous sequence.
  • Finally, the data stream may have ended, so that no more data points will be available and we need to indicate this.

With this in mind, I would like to propose the following design:

  • There must be a server side limit on the number of data points returned at a time. This limit can be a default and potentially a lower limit could be specified by the client, but the important thing is to define a reasonable maximum limit to enforce. With this approach, the client can iterate over the data points, and specify the next start index as the last index + 1
  • As some data points may be received out of order, gaps in the sequence can happen. In this case, the returned list should stop at the first gap detected. So if we recorded data points X, X+1, X+2, X+4, the call would return only X, X+1 and X+2.
  • But some data point may be lost, and we need to set a maximum delay when we will consider that the data point will not be returned. If we continue the previous example, the next call will be with X=X+3. If the elapsed time between X+4 and the current time is over a maximum delay, we should assume X+3 lost, and we can return a fake data point with an attribute missing set to true, along with the rest of the sequence.
  • Finally, if we know there is no more data points because the stream ended, we can indicate this by returning a fake data item with an attribute stop set to true,

In conclusion, the client can poll the server until it receives the stop flag. At each call it will receive no more than the maximum block size defined. It can specify the next call by adding 1 to the last index received. The API also guarantees that the data items are returned in order, and that if a data item is not available after some maximum delay, it will be returned and flagged as missing along with the rest of the data points. I believe this approach can be of a general interest.