Tuesday, May 26, 2015

Why is Cloud Foundry a revolution?

I recently attended the Cloud Foundry Summit 2015 in Santa Clara, and I wanted to share the excitement around this growing technology. First, last year's summit in San Francisco drew something like 1,000 attendees; this year there were about 1,500, which by itself shows the growing interest. More importantly, you can really see that Cloud Foundry (CF) is being adopted by many public PaaS providers (Pivotal, IBM, HP, SAP, EMC, Huawei) and also by companies setting up internal or industry-specific PaaS (GE, Comcast, Allstate, Verizon, Lockheed Martin). IBM mentioned that Bluemix, their PaaS based on Cloud Foundry, was getting 8,000 new users per week. So what is going on, and why is this wave coming? Here are several reasons showing the benefits of Cloud Foundry:

CF empowers developers

There is no question that CF empowers developers by letting them develop and deploy their ideas very quickly. There is no need to request machines, set up networks and access rights, and so on. IBM Bluemix is an example of that: once your account is set up, you can deploy in minutes. And because deployment is so easy, it is a revolution for prototyping and testing. Instead of deploying to a shared test system, a developer can deploy all the components into his or her own space. To some extent, there is no longer a need to mock dependent services; you can test with the real components.

CF reduces the barriers between Dev and Ops, moving towards the DevOps model

In the end, the goal is to develop and deploy applications that are useful, easy to use and robust. The separation between Dev and Ops does not work anymore. We need speed. We need to close the loop between users, Ops and Dev. With CF, Dev and Ops can talk about the same things: the apps. Developers can stay involved in how their apps are performing and can react quickly with new releases. Ops can better understand the underlying architecture, spot bottlenecks and talk to Dev in a more proactive way. Companies could even leverage CF to support the Amazon model of "you build it, you run it".

CF provides a framework to apply the latest best practices

A lot has been said about the benefits of twelve-factor apps and microservices. CF actually provides the framework where these principles can be applied in a repeatable way. It is pretty impressive to see developers using CF apply these principles, sometimes without even knowing about them, simply because CF guides them that way.

CF enables companies to catch up

And because these best practices are at the core of the development and deployment model, companies can deploy CF in their own data centers or through a PaaS provider to quickly catch up. Deploying at scale, in a reliable and repeatable way, is no longer reserved for large Web companies.

CF enables companies to simplify processes and reduce time-to-market

Several companies at the summit mentioned that they were able to drastically simplify their processes and reduce time-to-market from months to days (Allstate, Humana). This is really a revolution in companies where the process could previously take up to 100 days just to get an environment set up. Once upper management agrees to the new process, results can be delivered very fast. It also puts the ball in the court of app designers, marketing and architects, who must now come up with ideas for the business, knowing that developing and deploying is no longer a barrier.

CF helps avoid PaaS vendor lock-in

CF provides a framework that can be deployed directly in your own data centers or on an IaaS such as AWS, or accessed through PaaS providers (IBM, GE, EMC, SAP, Huawei...). So, on paper, it could avoid vendor lock-in. In practice, these environments will still offer different services on top of CF, and it is unclear how you could migrate from one provider to another. Still, many of the concepts and tools will be the same, and an active community can be leveraged.

CF enables continuous innovation and continuous disruption...

When you put all these benefits together, you can see that CF enables continuous innovation by liberating people and removing lots of the existing barriers. If companies can take advantage of these new tools and processes, this could mean continuous disruption.

However, there are some challenges ahead. I believe it will only work for companies willing to change their processes and for developers capable of leaving their comfort zone, learning new tools and languages, and following an ecosystem that is moving very fast. So for companies and developers alike, it is time to seize the opportunity.

Friday, April 17, 2015

What is DOcloud?

DOcloud is the short name of the IBM Decision Optimization on Cloud service. This service lets you solve CPLEX and OPL problems in the cloud. You can access the interactive service called DropSolve, or you can use the API to integrate the service into your application.

CPLEX is a mathematical programming solver for linear programming, mixed integer programming, and quadratic programming. CPLEX is known as the leading and most efficient solver of its kind. It is a prescriptive analytics tool that helps you make better decisions about your business. OPL is a modeling language that facilitates the use of CPLEX. These tools used to be available on-premises only, with a nice framework for building applications called Decision Optimization Center.

The good news is that you can now access these tools online!

DropSolve is a simple web interface where you can literally drag and drop your model and data files, and they will be executed on IBM servers. If you want to build an application, you can use the API. The API is pretty simple: you create a job with the model and data, then you submit and monitor it, and finally you get the results once it is completed. There is a developer community where you can access the API documentation, the examples and a forum.
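
To give an idea of the flow, here is a rough JavaScript sketch of the create/submit/monitor/retrieve cycle. The base URL, paths, header name and response fields below are placeholders of my own, not the actual DOcloud endpoints, so refer to the API documentation on the developer community for the real ones:

// Hypothetical endpoints for illustration only -- check the official API documentation.
const BASE = "https://example.com/docloud/jobs";        // placeholder base URL
const HEADERS = { "X-Api-Key": "...your API key..." };  // placeholder auth header

async function solve(modelFile, dataFile) {
  // 1. Create a job and attach the OPL model and the data.
  const job = await (await fetch(BASE, { method: "POST", headers: HEADERS })).json();
  await fetch(`${BASE}/${job.id}/attachments/model.mod`,
              { method: "PUT", headers: HEADERS, body: modelFile });
  await fetch(`${BASE}/${job.id}/attachments/data.dat`,
              { method: "PUT", headers: HEADERS, body: dataFile });

  // 2. Submit the job and poll its status until it completes.
  await fetch(`${BASE}/${job.id}/execute`, { method: "POST", headers: HEADERS });
  let status;
  do {
    await new Promise(resolve => setTimeout(resolve, 2000));
    status = (await (await fetch(`${BASE}/${job.id}`, { headers: HEADERS })).json()).status;
  } while (status !== "PROCESSED" && status !== "FAILED");

  // 3. Retrieve the results once the job is done.
  return (await fetch(`${BASE}/${job.id}/results`, { headers: HEADERS })).json();
}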

To help you get started, I have created a YouTube playlist where you will find tutorials. The blog "IT Best Kept Secret Is Optimization" is also a must-read about optimization and DOcloud.

Sunday, February 8, 2015

How to incrementally retrieve a time series using a REST API?

In this post, I would like to propose a REST API pattern for incrementally retrieving a time series.

Let's take the example of a system generating a time series such as price changes, log events or machine status changes. This system is able to send a sequence of data points using a messaging protocol, but as with many messaging protocols, delivery and delivery order are not guaranteed. The goal is to propose a service that will record these messages and provide a REST API to incrementally access existing and new data points, in order, as they become available.

First of all, we can easily record these messages in a data store, say as JSON documents in a MongoDB collection or a CouchDB database. Note that messages should be ordered and assigned a sequence number at the source; this way, it is possible to tell whether two messages are contiguous or whether one or more messages in between have not yet been received. So the first part of the service is about subscribing to the data stream and storing the messages as they come.

Then, the second part of the service is to provide a REST API to access the time series (no WebSockets or long polling for now). The REST API can be as simple as this:

   GET https://www.myservice.com/timeseries/TS?start=X
where TS is the name of the time series and X is the sequence index from which data points should be returned. With X=10, the returned data could look like this:
[ 
  { "index" : 10,
    "date" : 1423259728987,
    "value" : 10},
  { "index" : 11,
    "date" : 1423259730094,
    "value" : 42},
  ...
]

However, there are a couple of things to consider:

  • The list of data points can be very large, and could cause out-of-memory crashes on the client or on the server depending on the implementation.
  • Data points are received asynchronously, and there is no guarantee that the returned list will contain a continuous sequence.
  • Some data points may be lost and will never be delivered, and again there is no guarantee that the returned list will contain a continuous sequence.
  • Finally, the data stream may have ended, so that no more data points will be available and we need to indicate this.

With this in mind, I would like to propose the following design:

  • There must be a server-side limit on the number of data points returned at a time. The server can define a default, and the client can potentially request a lower limit, but the important thing is to enforce a reasonable maximum. With this approach, the client can iterate over the data points, specifying the next start index as the last index + 1.
  • As some data points may be received out of order, gaps in the sequence can happen. In this case, the returned list should stop at the first gap detected. So if we recorded data points X, X+1, X+2, X+4, the call would return only X, X+1 and X+2.
  • But some data points may be lost for good, so we need to set a maximum delay after which we consider that a data point will never arrive. Continuing the previous example, the next call will start at X+3. If the elapsed time between the reception of X+4 and the current time exceeds the maximum delay, we should assume X+3 is lost and return a fake data point with an attribute missing set to true, along with the rest of the sequence.
  • Finally, if we know there are no more data points because the stream has ended, we can indicate this by returning a fake data item with an attribute stop set to true.

In conclusion, the client can poll the server until it receives the stop flag. At each call, it will receive no more than the maximum block size defined. It can build the next call by adding 1 to the last index received. The API also guarantees that data items are returned in order, and that if a data item is still not available after the maximum delay, it will be returned flagged as missing along with the rest of the data points. I believe this approach can be of general interest.
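
To make the pattern concrete, here is a minimal Express sketch of such an endpoint. The in-memory store, the MAX_BLOCK and MAX_DELAY values and the series structure are assumptions made for illustration, not a reference implementation:

const express = require("express");
const app = express();

const MAX_BLOCK = 100;        // maximum number of data points returned per call
const MAX_DELAY = 60 * 1000;  // delay (ms) after which a missing point is given up

// One entry per time series: the points received so far (keyed by index)
// and a flag set when the source signals the end of the stream.
const series = new Map(); // name -> { points: Map<number, point>, ended: boolean }

app.get("/timeseries/:ts", (req, res) => {
  const s = series.get(req.params.ts) || { points: new Map(), ended: false };
  const start = parseInt(req.query.start, 10) || 0;
  const result = [];

  for (let i = start; result.length < MAX_BLOCK; i++) {
    const p = s.points.get(i);
    if (p) { result.push(p); continue; }

    // Gap at index i: look at the next point that was actually received.
    const later = [...s.points.keys()].filter(k => k > i).sort((a, b) => a - b)[0];
    if (later !== undefined && Date.now() - s.points.get(later).date > MAX_DELAY) {
      // The gap is old enough: report the point as lost and keep going.
      result.push({ index: i, missing: true });
    } else {
      break; // recent gap (or nothing received yet): stop here, the client will retry
    }
  }

  // If the stream has ended and nothing remains after the last returned index,
  // append the stop marker.
  const nextIndex = start + result.length;
  if (s.ended && result.length < MAX_BLOCK && ![...s.points.keys()].some(k => k >= nextIndex)) {
    result.push({ index: nextIndex, stop: true });
  }
  res.json(result);
});

app.listen(3000);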

Tuesday, December 16, 2014

How to deal with replication conflicts in CouchDB?

In a previous post I introduced the different types of conflicts in CouchDB: creation conflicts, update conflicts, replication conflicts and deletion replication conflicts. This time, I share more details about replication conflicts and how I recommend resolving them.

A replication conflict occurs when the same document is updated on nodes A and B at the same time, while the replication between A and B has not yet been fully processed. When the replication actually occurs, the nodes end up with two revisions of the same document. CouchDB will pick one of the revisions as the winner and will store the other one in the _conflicts attribute. CouchDB has an algorithm ensuring that the same revision will be picked as the winner by all the nodes. This situation can occur in a cluster when the rate of modification of a document is higher than the throughput of the replication.

The application has no control over the winning revision, nor over how merging could be attempted at the time of the replication. Resolving the conflict means getting the document, getting all the conflicting revisions, merging the updates, saving back the document and deleting the discarded revisions. At least this is what is explained in the CouchDB wiki and in the Cloudant documentation. You can do this either on the fly when accessing documents, or using a background process, or both.
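
For reference, here is a minimal JavaScript sketch of that classic resolution flow against the CouchDB HTTP API. The database URL and the mergeDocs() helper are assumptions of this example:

const BASE = "http://localhost:5984/mydb"; // assumed database URL

async function resolveConflicts(docId, mergeDocs) {
  // 1. Get the winning revision together with the list of conflicting revisions.
  const doc = await (await fetch(`${BASE}/${docId}?conflicts=true`)).json();
  const losers = doc._conflicts || [];
  if (losers.length === 0) return doc;

  // 2. Fetch each conflicting revision.
  const revs = await Promise.all(
    losers.map(rev => fetch(`${BASE}/${docId}?rev=${rev}`).then(r => r.json()))
  );

  // 3. Merge the updates into the winner (mergeDocs is application-specific
  //    and must keep the winner's _id and _rev).
  const merged = mergeDocs(doc, revs);

  // 4. Save the merged document back.
  await fetch(`${BASE}/${docId}`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(merged)
  });

  // 5. Delete the discarded revisions; this is what populates _deleted_conflicts.
  await Promise.all(
    losers.map(rev => fetch(`${BASE}/${docId}?rev=${rev}`, { method: "DELETE" }))
  );
}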

However, there is a problem. When the discarded revisions are deleted, they are added to the _deleted_conflicts array. And, as I explained before, this field is also very useful to implement a simple management of delete conflicts where delete always wins. But if you do so, the merged documents would be considered deleted, and that is not what you want... There is actually no way to distinguish between a real deletion and a deletion due to a conflict resolution. Very few people have mentioned this issue (ref1, ref2).

One solution that has been proposed is to add a new attribute, in addition to _deleted, to flag documents deleted by the application. This way, you can perform the merge as explained above. However, you need to update a lot of logic to handle this new flag. When you read a document that has a _deleted_conflicts array, you need to fetch all the revisions to know whether one of them was a real delete where the flag was set. If it was, you should delete the document; otherwise you continue. The problem is that you need to perform this check all the time, even for documents that were just merged.

I would like to propose another solution: adding a new collection of merged revisions to the document. With this approach, you keep track of the revisions you have merged, and you do not delete them. Each time you merge a revision, just add it to the merged ones. This way, you can keep the simple deletion process that deletes any document with a _deleted_conflicts array. You also don't waste time fetching previous revisions all the time, because if the revisions from _conflicts are already in the merged list, there is nothing to do. The only drawback is probably that conflicting revisions cannot be purged by compaction, but if conflicts are rare and you eventually delete your documents in the application, that's not a big problem.

So the recipe is the following:

  • Implement a function to merge a list of documents of the same type.
  • When fetching a document, always set the query parameter conflicts=true, and if the returned _conflicts contains revisions that have not been merged yet, merge them, add them to the merged list and save the document before returning it (see the sketch after this list).
  • When accessing lists of documents, always set the query parameter conflicts=true, and merge the documents as explained above if necessary.
  • In the background, implement a process that will identify documents to merge and merge them as explained above. You need to do this because some documents may never be accessed, and so never merged on the fly, while still being used in views or other aggregations. To identify the documents to merge, just create a view that emits a document only if there is at least one revision listed in _conflicts that is not in the merged list.
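
Here is a minimal JavaScript sketch of the on-the-fly resolution from the second bullet, reusing the same assumed database URL and application-specific mergeDocs() helper as in the previous sketch, with merged as the new collection of merged revisions:

async function getResolved(docId, mergeDocs) {
  const doc = await (await fetch(`${BASE}/${docId}?conflicts=true`)).json();
  const conflicts = doc._conflicts || [];
  const alreadyMerged = doc.merged || [];

  // Only the revisions that are not yet in the merged list need work.
  const pending = conflicts.filter(rev => !alreadyMerged.includes(rev));
  if (pending.length === 0) return doc;

  const revs = await Promise.all(
    pending.map(rev => fetch(`${BASE}/${docId}?rev=${rev}`).then(r => r.json()))
  );

  // Merge and record the revisions, but do NOT delete them, so that
  // _deleted_conflicts keeps meaning "really deleted".
  const merged = mergeDocs(doc, revs);
  merged.merged = alreadyMerged.concat(pending);

  await fetch(`${BASE}/${docId}`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(merged)
  });
  return merged;
}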

Saturday, October 25, 2014

How to configure WAS Liberty to use Apache Wink and Jackson?

As we have seen in the previous post, IBM WebSphere Liberty comes with JAX-RS and JSON support. In this post, I will show you how to explicitly use Apache Wink for the JAX-RS runtime and use Jackson as the JSON provider instead of the default providers. The updated code can be found on my GitHub repository.

The first step is to update the Maven dependencies to add Wink and Jackson, like this:

<dependency>
  <groupId>org.apache.wink</groupId>
  <artifactId>wink-server</artifactId>
  <version>1.4</version>
</dependency>

<dependency>
  <groupId>com.fasterxml.jackson.jaxrs</groupId>
  <artifactId>jackson-jaxrs-json-provider</artifactId>
  <version>2.4.3</version>
</dependency>

Then you need to declare the servlet in your web.xml file. Instead of using the WAS Liberty JAX-RS servlet, you just need to indicate the class name of the Apache Wink servlet.

<servlet>
  <description>JAX-RS Tools Generated - Do not modify</description>
  <servlet-name>JAX-RS Servlet</servlet-name>
  <servlet-class>org.apache.wink.server.internal.servlet.RestServlet</servlet-class>
  <init-param>
    <param-name>javax.ws.rs.Application</param-name>
    <param-value>com.mycloudtips.swagger.MctApplication</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
  <enabled>true</enabled>
  <async-supported>false</async-supported>
</servlet>
<servlet-mapping>
  <servlet-name>JAX-RS Servlet</servlet-name>
  <url-pattern>/jaxrs/*</url-pattern>
</servlet-mapping>

And in the application class (MctApplication class) you need to add the Jackson provider.

@Override
public Set<Class<?>> getClasses() {
    Set<Class<?>> classes = new HashSet<Class<?>>();

    classes.add(ApiDeclarationProvider.class);
    classes.add(ResourceListingProvider.class);
    classes.add(ApiListingResourceJSON.class);

    classes.add(JacksonJsonProvider.class);

    return classes;
}

Finally, make sure you remove the jaxrs-1.1 feature from your server.xml and replace it with a simple servlet-3.0 feature. That's it, easy peasy.

Monday, October 20, 2014

How to document your JAX-RS API using Swagger, WAS Liberty Profile and Bluemix?

Swagger has become the de facto standard for REST API documentation. It is also a pretty generic framework, so developers need to know how to configure it for their specific environment. In this post, I will review the steps required to document a JAX-RS API developed with IBM WebSphere Application Server Liberty Profile. The complete example is available on my GitHub repository.

I will assume that you have created a Maven Dynamic Web project in Eclipse (project name and web context root are set to 'swagger-liberty'), and that you have defined a WAS Liberty server environment. Setting up your environment is outside the scope of this post, but you can find more information here.

In order to develop and document your JAX-RS API, you will need to follow these steps:

  • Declare the required maven dependencies.
  • Declare the JAX-RS and Swagger servlets.
  • Declare the Swagger JAX-RS providers and your JAX-RS resources.
  • Implement and document your APIs using Java annotations.
  • Copy the Swagger UI web resource files.
  • Activate the JAX-RS feature of Liberty.
  • Test your server locally.

The first step is to add the Maven dependencies to your Maven project. You need to add the Swagger JAX-RS bridge, the logging bridge and the Java EE 6 APIs:

<dependency>
  <groupId>com.wordnik</groupId>
  <artifactId>swagger-jaxrs_2.10</artifactId>
  <version>1.3.10</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-jdk14</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>javax</groupId>
  <artifactId>javaee-web-api</artifactId>
  <version>6.0</version>
  <scope>provided</scope>
</dependency>

Then you need to declare the servlets in your web.xml file. The first servlet tells the JAX-RS runtime where to find your JAX-RS application.

<servlet>
  <description>JAX-RS Tools Generated - Do not modify</description>
  <servlet-name>JAX-RS Servlet</servlet-name>
  <servlet-class>com.ibm.websphere.jaxrs.server.IBMRestServlet</servlet-class>
  <init-param>
    <param-name>javax.ws.rs.Application</param-name>
    <param-value>com.mycloudtips.swagger.MctApplication</param-value>
  </init-param>
  <load-on-startup>1</load-on-startup>
  <enabled>true</enabled>
  <async-supported>false</async-supported>
</servlet>
<servlet-mapping>
  <servlet-name>JAX-RS Servlet</servlet-name>
  <url-pattern>/jaxrs/*</url-pattern>
</servlet-mapping>
The second servlet configures the Swagger runtime and indicates where to find the API metadata (the base path, which is made of the web context root and the JAX-RS servlet mapping).
<servlet>
  <servlet-name>DefaultJaxrsConfig</servlet-name>
  <servlet-class>com.wordnik.swagger.jaxrs.config.DefaultJaxrsConfig</servlet-class>
  <init-param>
    <param-name>api.version</param-name>
    <param-value>1.0.0</param-value>
  </init-param>
  <init-param>
    <param-name>swagger.api.basepath</param-name>
    <param-value>/swagger-liberty/jaxrs</param-value>
  </init-param>
  <load-on-startup>2</load-on-startup>
</servlet>

The application class (the MctApplication class) is the place where you need to declare the Swagger JAX-RS providers and your JAX-RS resource (the MctResource class). Note that I usually declare the resources as singletons so that they are not re-created on each request.

@Override
public Set<Class<?>> getClasses() {
    Set<Class<?>> classes = new HashSet<Class<?>>();

    classes.add(ApiDeclarationProvider.class);
    classes.add(ResourceListingProvider.class);
    classes.add(ApiListingResourceJSON.class);
    return classes;
}

@Override
public Set<Object> getSingletons() {
    Set<Object> singletons = new HashSet<Object>();
    singletons.add(new MctResource());
    return singletons;
}

The resource class is the place where you develop and document your APIs, using the JAX-RS and Swagger annotations. Here is an example declaring a method that returns a list of books:

@GET
@ApiOperation(value = "Returns the list of books from the library.", 
              response = MctBook.class, responseContainer = "List")
@ApiResponses(value = { @ApiResponse(code = 200, message = "OK"),
	@ApiResponse(code = 500, message = "Internal error") })
public Collection<MctBook> getBooks() {
  return library.values();
}

The server will include the Swagger UI, so you need to copy the web resources (index.html, o2c.html, swagger-ui.js, swagger-ui.min.js, and the lib, images and css files and directories). You can find these files in the Swagger UI JAX-RS sample or in my GitHub repository. You also need to adjust a path in the index.html file to point to your API:

   $(function () {
      window.swaggerUi = new SwaggerUi({
      url: "/swagger-liberty/jaxrs/api-docs",
      ...
    });

At this point, your project should compile fine and you are almost ready to test. Before doing so, you need to activate the JAX-RS support in Liberty. Remember that Liberty is very flexible and lets you decide which features will be loaded. To do so, add the jaxrs-1.1 feature in the server.xml file.

<featureManager>
  <feature>jaxrs-1.1</feature>
  <feature>localConnector-1.0</feature>
</featureManager>

Finally, you can add your application to your server runtime and start it. You should then be able to access the Swagger UI:

http://localhost:9080/swagger-liberty/
As an optional step, not covered in this post, you can easily deploy this server to IBM Bluemix.

Friday, October 10, 2014

How to build a document archive with CouchDB?

Let's imagine you have a database where documents are created and deleted, but you need to keep a record of the deleted documents in an archive. How do you set up such an archive with CouchDB? Well, using the features of CouchDB, and more specifically replication, it is actually pretty simple.

A CouchDB replication copies the new revisions of the documents from a source database to a target one. It also deletes documents from the target as they are deleted from the source. So if you do nothing, the target database will not be an archive but just a copy of the source database. The trick is to define a filter that will not propagate the deletions. Here is such a simple filter:

"filters": {
      "archiveFilter": "function(doc) {return !doc._deleted }"
     },
Note that you can customize this filter so that you archive only specific types of documents.
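
For instance, assuming your documents carry a type attribute (an assumption about your schema), a variant that archives only documents of type "order" could look like this:

"filters": {
  "archiveFilter": "function(doc) { return !doc._deleted && doc.type === 'order'; }"
},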

You can run the replication on demand or continuously. In this case, it will generally be better to set up a continuous replication so that your archive stays up to date automatically. The replication document will then look like the following:

{
  "_id": "myarchive",
  "source": {
    "url": "...source URL...",
    "headers": {
      "Authorization": "..token..."
    }
  },
  "target": {
    "url": "...target URL...",
    "headers": {
      "Authorization": "...token..."
    }
  },
  "continuous": true,
  "filter": "archive/archiveFilter"
}