Some thoughts on async: when and how you should use it

In my first two months working as a full-time engineer at Factual, my main job has been rebuilding an unmaintained but mission-critical internal proxy server. The complexity of the project is low: I was asked to make it work with the new data platform, and to improve its performance and stability where applicable. After reviewing the original code base, I found some major flaws that hurt performance and prevent it from running for any reasonable length of time without human intervention. Since fixing those flaws piecemeal would take too long, I decided to take the chance and rebuild it (more or less) from scratch. The details of its functionality, and how I simplified its architecture to improve stability, are uninteresting; what I would like to share is how I made it up to 10x faster with one commonly known but widely misunderstood programming pattern.

The proxy server I mentioned has no computationally heavy tasks: it looks up a table, monitors and loads resources on the managed servers when needed, and passes client requests through to them. When the performance problem occurs, the server shows very low CPU utilization, yet every request, regardless of how long its underlying function actually takes, is delayed for a long time before responding.

In my investigation, the original server did use Jetty, an async HTTP server library, but it stopped there. Jetty being async means it uses only a few threads to achieve high performance. If the application code on top of it does not take advantage of this, it doesn't work well: tasks occupy the worker threads for too long, blocking other tasks from running on them, just to wait for a managed server to respond. And some requests to the managed servers do take a long time. Apparently, async won't work if you don't use it well, no matter how powerful the server library you build on.

So, what is the right way to employ async programming and make it serve us well in our projects?

The most popular feature of almost all high-performance server software is async, short for asynchronous; it also goes by names like event-driven or non-blocking I/O. Despite its fame, it seems that hardly any of its users know how it works and what it is actually for. It is a misused term and technology in many places, and many developers assume that introducing async will improve performance for sure.

Well, one thing I learned from my research projects is that there is no silver bullet in software engineering. Every technology has its trade-offs; you can't take the benefits of any solution for granted. For example, Java's standard collection packages contain many classes that expose the same interfaces, functions, and constructors. A junior programmer without proper training in data structures can hardly tell ArrayList from LinkedList, or HashSet from TreeSet. They are literally interchangeable without triggering any compiler error or warning, and you may feel no difference even at run time. But those data structures are designed for different purposes: fetching the nth element from a linked list too often will hurt performance, while an array list won't.

Asynchronous programming, in my opinion, is something like that. Many languages, C# for example, provide async primitives in the language itself to make asynchronous programming as easy as possible. Some platforms, like Node.js, even force you to make everything asynchronous. But should you reach for it every time you want your tasks to finish in the shortest time?

Let's first understand what async is and how it can improve performance (in some cases), then come back to what does not fit this model.

Asynchronous, in its literal sense, means not waiting for every process to finish its job one by one, but staying busy serving other jobs in the meantime. Think of a KFC restaurant with a line of customers waiting to order. The synchronous counterpart would work like this: ask a customer what they would like, wait for the order, enter the order for checkout, ask for cash, wait for the customer to dig their wallet out of a bag and hand over the cash (or open the payment app on their iPhone), operate the machine to send the order to the kitchen, wait for the ticket to be printed, and finally hand the ticket to the customer and ask them to wait for their food. You can see how much waiting there is in this process. An experienced async worker would start taking the next customer's order while the first customer is still looking for their wallet.

From the example above, we can see that the most suitable use case for async programming is when you are waiting for something. In software engineering, that can be a database query, disk I/O, a task running in another thread pool, or even a user making up their mind. Async programming lets the working thread stay busy accepting events and doing its job, instead of waiting for events to happen before taking action. The other use of async is as a cleaner model for multi-threaded programming, but here we are talking about the I/O use case.
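
As a toy sketch in plain Clojure (the library used in the actual project comes later), where `slow-io` just sleeps to stand in for a database query or a remote call:

```clojure
(defn slow-io []
  (Thread/sleep 200)
  :result)

;; Synchronous: the calling thread is parked for the whole 200 ms.
(defn handle-sync []
  (str "got " (slow-io)))

;; Asynchronous, callback style: register what to do with the value
;; and move on immediately; the callback runs when the result is ready.
(defn handle-async [callback]
  (future (callback (str "got " (slow-io)))))
```

A real event-driven runtime goes further and multiplexes many such waits onto a handful of threads instead of spawning one per call, but the shape is the same: hand over a continuation instead of sitting and waiting.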


This can bring performance gains because one thread can handle multiple jobs at a time, and threads are expensive. You cannot simply create an unlimited number of threads, one per job, to achieve concurrency in the real world. Too many threads lead to frequent CPU context switching, which costs far more than an ordinary memory access. Before Nginx showed up, the most popular web server was Apache. Apache is a traditional web server that maintains one thread per request. That works well at first, but under heavy load it hits the worker limit easily even when CPU load is modest. You can raise the limit, but that is not a long-term solution. Nginx, on the other hand, is an event-driven web server: it can accept almost unlimited client connections (in practice the limit depends on your operating system), and it performs much better at serving static files and acting as a transparent proxy.

These benefits do not come for free. Async building blocks like NIO on the Java platform took years to mature and rely on designs like channels and selectors to achieve this model. Compared with the regular synchronous model, there are a few differences to keep in mind.

  1. It comes with additional run-time overhead. Although you may not need to spin up an executor to run all of your tasks, async functions require some coordination to work. You can shard your data and parallelize a function, but tasks that are too fine-grained won't be worthwhile. Where the balance lies varies with your use case; just don't overuse it.
  2. Once you employ async programming, you need to keep it that way. An async function does not return its actual result, but a state machine, or what we may call a pipeline. In a dynamically typed language like Clojure, you have to make sure the whole data pipeline, from the sources through the intermediate processors to the spout, works in an async fashion, or it won't work well.
  3. You need to reason about data flow within the pipeline; code outside it can hardly affect it, and the only way to do so correctly is to alter the pipeline itself.

To explain those concepts, let me first introduce the library I chose for a clean and concise demonstration: manifold, built by Zach Tellman. You don't need to read its documentation for this article; I will explain as we go.
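
Before walking through the rules, here is a minimal sketch of manifold's core idea; the toy values are my own, but `deferred`, `success!`, and `chain` are the pieces we will lean on below.

```clojure
(require '[manifold.deferred :as d])

;; A deferred is a placeholder for a value that will arrive later.
(def result (d/deferred))

;; `chain` describes what to do with the value once it arrives,
;; without blocking the current thread.
(def pipeline (d/chain result inc #(* % 10)))

;; Somewhere else, e.g. an I/O callback, delivers the value...
(d/success! result 4)

;; ...and the pipeline realizes to 50. Dereferencing blocks, so it is
;; only done here to show the result.
@pipeline ;=> 50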

The first rule should be easy to understand: if coordination takes more time than the actual computation, async is futile. This rule is easy to ignore, and finding the balance point takes careful planning.
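
As a toy illustration (the functions here are mine, not from the project): wrapping trivial per-element work in deferreds buys nothing, because the coordination becomes the dominant cost.

```clojure
(require '[manifold.deferred :as d])

;; Too fine-grained: one deferred per element just to add 1.
;; Allocating, scheduling, and resolving each deferred costs far more
;; than the increment itself.
(defn bump-all-async [xs]
  @(apply d/zip (map #(d/future (inc %)) xs)))

;; The plain version does the same work with no coordination at all.
(defn bump-all [xs]
  (mapv inc xs))
```

Async starts to pay off only when each task spends most of its time waiting rather than computing.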

For the second: if you are running a web server and want to keep it async end to end (with an event-driven library like Jetty or aleph) to achieve theoretically unlimited concurrency (which never happens in the real world, because CPU work kicks in as soon as the I/O wait is lifted), you have to make the whole request path run asynchronously. If you are just running a parallel batch job, you still have to wait for all of the async jobs to finish. For example, an aleph handler may return either a regular value or a deferred, the basic unit of a manifold pipeline. If we want the event-driven part of aleph, we have to return a deferred.
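
Here is a hedged sketch of the difference, assuming aleph and manifold are on the classpath; the upstream URL and port are placeholders.

```clojure
(require '[aleph.http :as http]
         '[manifold.deferred :as d])

;; Blocking style: dereferencing the upstream call parks a worker
;; thread for the whole round trip.
(defn blocking-handler [req]
  (let [resp @(http/get "http://managed-server.internal/status")]
    {:status 200 :body (:body resp)}))

;; Async style: return the deferred itself. aleph writes the response
;; when it is realized, and the thread is free to serve other requests.
(defn async-handler [req]
  (d/chain (http/get "http://managed-server.internal/status")
           (fn [resp] {:status 200 :body (:body resp)})))

(defn -main [& args]
  (http/start-server async-handler {:port 8080}))
```

Returning the deferred is what lets the server keep the connection open without parking a thread; the response goes out whenever the upstream call completes.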

The third rule means you need an alternative way to handle exceptions or to profile those tasks. The code outside only ever holds the async pipeline; it has no view into what happens inside it.
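
For example, wrapping the call that merely builds the pipeline in try/catch does nothing; the handler has to live inside the pipeline. A sketch, where `fetch-record` is a hypothetical stand-in for some real I/O:

```clojure
(require '[manifold.deferred :as d])

(defn fetch-record [id]                     ; hypothetical stand-in for real I/O
  (throw (ex-info "upstream failed" {:id id})))

(defn lookup [id]
  (d/chain (d/future (fetch-record id)) :value))

;; Wrong: this try/catch only guards *building* the pipeline; by the
;; time the exception is thrown on another thread, it has already returned.
(defn lookup-wrong [id]
  (try (lookup id)
       (catch Exception e {:error (.getMessage e)})))

;; Right: attach the error handler to the pipeline itself.
(defn lookup-right [id]
  (d/catch (lookup id)
           (fn [e] {:error (.getMessage e)})))
```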

There is still a lot more to say about the async programming pattern. Employing async in an application effectively is not as simple as we think it should be, but it is worth the effort when done right.