Node.js: Cluster vs. Async (synsem.com)
98 points by skazka16 on 2015-05-22 | 58 comments




It makes sense to have both event-based and non-event-based options for server-side JavaScript development.

OS-level multitasking won't be able to achieve the same level of concurrency, but the simplicity and maintainability of the application will go up. The right choice depends on the needs of the application, of course.

Both evented and non-evented approaches have their place, and most server-side languages allow development with either approach: Ruby, Python, C, and Java all have solid options for evented and non-evented solutions.


> OS-level multitasking won't be able to achieve the same level of concurrency

Do you have a source for this claim? I've seen it repeated many times, especially in the node.js community, but I've yet to see any evidence to back it up. From what I've read, a synchronous threaded model can be just as fast as an event-based system [1].

[1] http://www.mailinator.com/tymaPaulMultithreaded.pdf


It's the design that has allowed tools like nginx and HA Proxy to scale so well. There's a lot of good material here:

http://www.kegel.com/c10k.html


To be fair, that link is almost 15 years old. Back then we had 32-bit address spaces, and that was the main limiting factor for threads (because you'd often allocate 2MB of address space for each stack). And we didn't have multi-core processors.

These days you could actually reasonably have 10k threads. In theory switching between threads shouldn't be much different performance-wise than switching between callbacks in an event loop (either way you take some cache misses), and the thread stack is probably more cache-friendly than scattering objects all over the heap (and certainly easier to use).

But now you have the problem that synchronization between threads (whether mutex locking or by lock-free algorithms) is complicated and surprisingly slow, specifically because you have to worry about all the ways simultaneous memory access might confuse the CPU or its caches. Whereas with single-threaded async each callback is effectively a transaction, without requiring any slow synchronization.

Of course if you're doing single-threaded async then you probably aren't fully utilizing even one core. You see, even if you think you are doing everything in a non-blocking way, that's not really the case all the way down the stack. If you try to access memory that is paged out, guess what? You are now blocked on disk I/O. And because you aren't using threads, the OS can't schedule any other work while you wait. And even if you're pretty sure you never touch memory that is paged out, you surely do sometimes touch memory that is not in the CPU cache, which also takes a while. If your CPU supports hyper-threading, it could be executing another thread in the meantime... but you don't have any other threads.

And then multicore. The previous paragraph was a lot more interesting before multicore, but now it's just obvious that you can't utilize your CPU with a single thread.

The heavy-duty high-scalability servers out there (like nginx and, I'd guess, HA Proxy) actually use both threads and async, but while this gets the best of both worlds, it also gets the worst: complicated synchronization and callback hell.

Basically, all concurrency models suck.

https://plus.google.com/+KentonVarda/posts/D95XKtB5DhK


A big problem with one-thread-per-connection is that you open yourself to slowloris-type DoS attacks.[1] Normal load (and even extreme load) is fine, but a few malicious clients can use up all of your threads and take down your server.

This is touched upon in the slides you linked to. On slide 62 (SMTP server) a point says, "Server spends a lot of time waiting for the next command (like many milliseconds)." A malicious client could send bytes very slowly, using up a thread for a much longer period of time. If the client has an async architecture, it can open multiple slow connections with little overhead. The asymmetry in resource usage can be quite staggering.

1. http://en.wikipedia.org/wiki/Slowloris_(software)


You seem to be imagining a case where you only allocate a small fixed thread-pool and when it runs out you just stop and wait. I think the slide deck is advocating that you just keep allocating more threads.

I'm talking about hitting OS or resource limits. Let's say a server is configured to time-out requests after 2 minutes. A malicious client could do something like...

Every second:

1. Open 40 connections to the server.

2. For all open connections, send one byte.

Repeat indefinitely.

Steady state would be reached at 4,800 open connections. At 1 byte of actual data per second per connection, data plus TCP overhead would use around 200KB/s of bandwidth. The server would have to run 4,800 threads to handle this load. Depending on memory usage per thread, this could exhaust the server's RAM.
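
For concreteness, here's the back-of-the-envelope arithmetic as a sketch (the 40 connections/second and 2-minute timeout are the hypothetical numbers from above; the per-thread stack size is an assumption that varies by OS and configuration):

  var connsPerSecond = 40;
  var timeoutSeconds = 2 * 60;  // server times out requests after 2 minutes
  var steadyState = connsPerSecond * timeoutSeconds;  // 4,800 open connections

  var stackKB = 1024;  // assumed 1MB of stack per thread
  var stackMB = steadyState * stackKB / 1024;
  console.log(steadyState + ' threads, ~' + stackMB + 'MB of thread stacks');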

There are ways to mitigate this simple example attack, but the only way to defend against more sophisticated variants is to break the one-thread-per-connection relationship.


What I am truly missing is a good benchmark comparing async vs. sync. Everybody says that async is best, but I don't see much evidence. For example, how would 4,800 threads exhaust the server's RAM when the thread stack size can be as small as 48kB? That's around 230MB of memory.

I'm not saying that the threaded approach is better, just that almost everyone comes around with some theoretical statement while nobody seems to care to find hard evidence.


You are right to distrust these claims. The reality is that threads can be significantly faster than async -- async code has to do a lot of bookkeeping and that bookkeeping has overhead. OTOH, threads have their own kind of overhead that can also be bad.

The slide deck that bysin linked above is pretty good:

http://www.mailinator.com/tymaPaulMultithreaded.pdf

This is by Paul Tyma, who at the time worked on Google's Java infrastructure team with Josh Bloch and other people who know what they're doing. Apparently he found threads to be faster in a number of benchmarks.

Ultimately which is actually faster will always depend on your use case. Unfortunately this means that general benchmarks aren't all that useful; you need to benchmark your system. And you aren't going to write your whole system both ways in order to find out which is faster. So probably you should just choose the style you're more comfortable with.

Async is kind of like libertarianism: It works pretty well in some cases, pretty poorly in others, but it has a contingent of fans who think they've discovered some magic solution to all problems and if you disagree then you must just not understand and you need to be educated.

(Note: The code I've been writing lately is heavily async, FWIW.)


Why is 4800 threads a problem, and 4800 heap-allocated callbacks not a problem? Are you assuming a thread consumes significantly more memory than the state you'd need to allocate in the async case? This isn't necessarily true.

This is great. I wonder how long till the node community "discovers" that using a dedicated httpd and communicating over standardized middleware (fcgi, wsgi, rack, etc.) is also a superior approach to handling http directly.

FastCGI, really?

And Node has the equivalent of WSGI and Rack: Connect/Express middleware.
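
(For the unfamiliar, a minimal sketch of that middleware style with Express; assumes Express is installed from npm:)

  var express = require('express');
  var app = express();

  // Middleware wraps every request, much like WSGI/Rack wrappers.
  app.use(function (req, res, next) {
    console.log(req.method, req.url);
    next(); // hand off to the next middleware or route
  });

  app.get('/', function (req, res) {
    res.send('hello');
  });

  app.listen(3000);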


For such a smug comment, you don't seem to know much about node.

I recently started programming more using promises and I can say that I am very satisfied with the way it does away with the callback hell problem.

Instead of taking a callback, a function returns a promise, which can be daisy-chained to do work.

Ex:

  file.read().then(console.log);

Or using the example in the article:

  var a = fileA.read();
  var b = fileB.read();
  Promise.all([a, b]).then((data) => console.log(data[0], data[1]));


Just wait till you try generators + async/await; it can be a lovely experience. No callbacks, and working try/catch.

+1 for generators. As a former PHP developer, I jumped into using generators that return promises in io.js (as well as es6 classes), and the code reads very much like PHP code, except all of the i/o is now asynchronous.
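
(For readers who haven't seen the generators-plus-promises style, here's a minimal sketch in the spirit of libraries like co; the run driver and the readFileAsync helper are illustrative, not any particular library's API:)

  var fs = require('fs');

  // Drive a generator that yields promises, so its body reads synchronously.
  function run(genFn) {
    var gen = genFn();
    function step(result) {
      if (result.done) return Promise.resolve(result.value);
      return Promise.resolve(result.value).then(
        function (value) { return step(gen.next(value)); },
        function (err) { return step(gen.throw(err)); }
      );
    }
    return step(gen.next());
  }

  // Hypothetical promise-returning wrapper, for illustration.
  function readFileAsync(path) {
    return new Promise(function (resolve, reject) {
      fs.readFile(path, 'utf8', function (err, data) {
        if (err) reject(err); else resolve(data);
      });
    });
  }

  run(function* () {
    var data = yield readFileAsync('config.json'); // looks synchronous
    console.log(data);
  });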

There is no benefit to I/O being async in itself until you have many users. The immediate and more accessible benefit is speeding up individual requests, due to the ease with which you can perform I/O in parallel. But if you just sprinkle await/yield everywhere (which everyone unfortunately does), you don't even get this benefit.

Sure you can.

  let [a, b] = await Promise.all([asyncA(), asyncB()])

I am referring to usage of generators without promises (or rather, code that uses generators in a way where it wouldn't matter if promises or thunks were used). And even then, I didn't say that you couldn't; even when using promises and generators together, most people make their code unnecessarily sequential.

There was a post on HN recently on correctly using promises. A must read for anyone getting started with promises: http://pouchdb.com/2015/05/18/we-have-a-problem-with-promise...

OpenResty does something similar. Code can be written synchronously, but all the network I/O, for example, happens in a non-blocking manner. The code still looks synchronous, though, without callback hell. This doesn't come without issues, as you need to change libraries to use OpenResty (Nginx) network primitives. Overall it is one of the nicest platforms I have worked with. A great web server (Nginx) that can be programmed with a great language (Lua + LuaJIT).

At nginx.conf, the Nginx developers were showing interest in bringing JavaScript to the platform. They said they would take a similar approach to the one OpenResty uses (i.e. no callback hell).


OpenResty is an evented model too. Coroutines are used to avoid using callbacks.

I don't understand these straw men. The virtue of the asynchronous programming model is low memory overhead compared to threads AND low latency for IO-bound tasks in highly concurrent scenarios.

Request-per-process/thread has all the same memory overhead implications it has always had. It's almost as if the author is ignorant of the reason for node.js' success, or why it was built in the first place.

Also, "callback hell" is just FUD. Nobody who does this for a living and knows what they're doing really has an issue with this. Promises solve the unreliability issues, and async/await solves the syntax complexity issues.

I'd like to see this same analysis for 1000 req concurrency measuring memory overhead and using async/await for code comparison. Cooperative multitasking will always be capable of lower latency when you know what you're doing, and async programming is lightyears simpler than multi-threaded programming.


I'd say at least 10K simultaneous requests on a single instance with ease, let alone several. Just a simple echo web-server... http://localhost/foo => "hello foo" ...
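
(A sketch of such an echo server, using nothing but Node's built-in http module; the port is arbitrary:)

  var http = require('http');

  // GET /foo => "hello foo"; every request is handled on the event loop.
  http.createServer(function (req, res) {
    res.end('hello ' + req.url.slice(1) + '\n');
  }).listen(8000);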

Launching that many threads will quickly hit bottlenecks on most systems... I've seen this happen in a poorly written simulation server (each actor had its own worker thread)... the server would freeze up randomly with only a handful of connections in a test scenario... changing to an event loop using an async thread pool resolved most of these issues (this was before node).


>Cons: Larger memory footprint

The more annoying con is the lack of shared memory. A single process can be much less complex when it doesn't have to worry about messaging systems and off-process caching.


"Asynchronous event-driven programming is at the heart of Node.js, however it is also the root cause of Callback Hell."

I'd argue the root cause is... callbacks.

Asynchronous programming can be done elegantly, in a synchronous style, using "async/await": originally (?) in C# [1], likely to be added to the next version of JavaScript [2], also in Dart [3], Hack [4], and Python 3.5 [5]. It can also be emulated in languages with coroutines/generators [6][7][8] (which in turn can be implemented by a fairly simple transpiler [9][10]).

This:

    function foo(a, callback) {
      bar(a, function(err, b) {
        if (err) {
          callback(err)
        } else {
           baz(b, function(err, c) {
             if (err) {
               callback(err)
             } else {
               // some more stuff
               callback(null, d)
             }
           })
        }
      })
    }
Becomes this:

    async function foo(a) {
      var b = await bar(a)
      var c = await baz(b)
      // some more stuff
      return d;
    }
And you'll see even greater improvements when using other constructs like try/catch, conditionals, loops, etc.
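
For instance, error propagation collapses into an ordinary try/catch instead of an if (err) branch in every callback (reusing the hypothetical bar/baz from above):

    async function foo(a) {
      try {
        var b = await bar(a)
        var c = await baz(b)
        return c
      } catch (err) {
        // one handler replaces every if (err) branch
        console.error(err)
        throw err
      }
    }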

[1] https://msdn.microsoft.com/en-us/library/hh191443.aspx

[2] http://jakearchibald.com/2014/es7-async-functions/

[3] https://www.dartlang.org/articles/await-async/

[4] http://docs.hhvm.com/manual/en/hack.async.asyncawait.php

[5] https://lwn.net/Articles/643786/

[6] https://github.com/petkaantonov/bluebird/blob/master/API.md#...

[7] https://github.com/kriskowal/q/tree/v1/examples/async-genera...

[8] http://taskjs.org/

[9] https://babeljs.io/docs/learn-es6/#generators

[10] https://facebook.github.io/regenerator/


It's not only the simplified code you write with async/await (using babeljs for this today)... there are a lot of complex processes that are easier to reason about in JS than in low-level C.

Not to mention that the article's example only compares a small piece of work... A single node instance can easily manage 10K simultaneous requests; launch a thread per request and you'll hit resource bottlenecks at the CPU pretty quickly, compared to several node instances approaching a million simultaneous connections on a single server. Node uses not only an event loop, but a shared thread pool for isolating work without blowing out resource contention.

I've seen issues with even a few thousand threads in poorly written simulation servers... going to an event-loop against a threadpool always worked out better under true load.


Chapter and verse of this marvelous tale? [Github link] to something that I can pull, throw 5k RPS at, and make proxy a result from Google?

Callbacks suck, no doubt, and the sooner we have a real alternative (sorry Promises, not you) for them in JS, the better.

In the meantime, for those of us stuck with contemporary JS, there are mitigating code styles, in particular "early return":

    function foo(a, callback) {
      bar(a, function(err, b) {
        if (err) return callback(err)

        baz(b, function(err, c) {
          if (err) return callback(err)

          // some more stuff
          callback(null, d)
        })
      })
    }
Just to be clear, this is still inferior to a linear style.

It is however much clearer than the "traditional" style.


OK, I'll shoot: Why not Promises?

They don't handle streaming data, but they're not supposed to; they're supposed to model the traditional JS function call model (single value return or single value exception) which they do well. For streaming data use some other paradigm (perhaps Observables from https://github.com/jhusain/asyncgenerator), but that's orthogonal to Promises, not a replacement for them.

The only other reasonable complaint I've heard is that they aren't cancellable, which is legit. There are proposals for that, though (https://github.com/petkaantonov/bluebird/blob/master/API.md#...).

At least with Promises functions actually return, as opposed to the continuation-passing-style of callbacks. (CPS is a fine intermediate representation for compilers, but it really shouldn't be written by hand; that's why we have compilers!)


Not the op, but I can give you one important reason for me why Promises aren't as great when compared to callbacks (specifically wrt the ES7 async/await syntax): with callbacks, I have greater flexibility in choosing the best approach for running asynchronous code in parallel.

For example, let's take the wonderful async.js library. There are a variety of different tools at my disposal on picking how to run tasks in parallel. My favorite one is async.auto. With this method, I can define the asynchronous tasks and their dependencies, then async.auto finds the best way to run the tasks in parallel while still ensuring the tasks end up having their dependencies met. This is more in tune to declarative programming's ideal of defining what results you want, rather than specifying how to accomplish getting those results.
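
(A minimal sketch of what that looks like; note that in async.js v1 a dependent task's function receives (callback, results), while later versions flipped the order:)

  var async = require('async');

  async.auto({
    get_data: function (callback) {
      callback(null, 'data');
    },
    make_folder: function (callback) {
      callback(null, 'folder');
    },
    // Runs only once get_data and make_folder have both finished.
    write_file: ['get_data', 'make_folder', function (callback, results) {
      callback(null, 'filename');
    }]
  }, function (err, results) {
    console.log(results);
  });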

Now with ES7's async/await, everything is still sequential. Better than blocking synchronous programming, but it could be better.


Promises are just as easy to parallelize as callbacks, though. Easier, IMO, since you can pull their values more than once.

Here's an example of some code I had to write to query a somewhat nasty Web API:

  async function apiCall(method, path, qs, body) { ... }

  async function getPaginated(path, qs, extractionFn) {
      let data = [];
      let pagePromises = [
          apiCall('GET', path, Object.assign({ }, qs, { page: 1 }))
      ];

      let firstPage = await pagePromises[0];
      for (let page = 2; page <= firstPage.pagination.page_count; ++page) {
          pagePromises.push(
              apiCall('GET', path, Object.assign({ }, qs, { page }))
          );
      }

      let pages = await Promise.all(pagePromises);
      for (let page of pages) {
          let extracted = await extractionFn(page, qs);
          data = data.concat(extracted);
      }

      return data;
  }
Note that because of how promises work, I didn't have to special-case the storage of the first page (which needs to be fetched first to get the number of pages); I can just wait on it again and get the promise-cached answer. I don't know what the async.js code to do that would look like, but I doubt it'd be as succinct.

async functions magically return promises. You don't have to just await them, you can choose to do whatever you want with the resulting promises. And promises, being actual values and not just functions called at the right time, are much easier to compose.

Anyway, if you like the async.auto functionality, it wouldn't be very hard to write a Promise.auto that would let you do (compare to the async.auto example at https://github.com/caolan/async#auto):

  let results = await Promise.auto({
      get_data: async function(){
          console.log('in get_data');
          // async code to get some data
          return [ 'data', 'converted to array' ];
      },
      make_folder: async function(){
          console.log('in make_folder');
          // async code to create a directory to store a file in
          // this is run at the same time as getting the data
          return 'folder';
      },
      write_file: ['get_data', 'make_folder', async function(results){
          console.log('in write_file', JSON.stringify(results));
          // once there is some data and the directory exists,
          // write the data to a file in the directory
          return 'filename';
      }],
      email_link: ['write_file', async function(results){
          console.log('in email_link', JSON.stringify(results));
          // once the file is written let's email a link to it...
          // results.write_file contains the filename returned by write_file.
          return {'file':results.write_file, 'email':'user@example.com'};
      }]
  });
  console.log('results = ', results);
...which is the same functionality, except you don't have explicit callbacks running around everywhere. For that example, though, I'd rather just write:

  let [ data, folder ] = await Promise.all([get_data, make_folder]);
and do the rest as simple "sequential-ish" code.

I've written code with async.js, and it's nice. I vastly prefer Promises. The only thing I still use from async.js is the queueing stuff, and I wrap that up with Promises.


As an example of non-trivial parallelism with Promises and async/await, here's a (slightly naive) implementation of Promise.auto I just whipped up.

  async function Promise_auto(clauses) {
      let keys = Object.keys(clauses);
      let results = { };
      let clausePromises = { };
      function runClause(key) {
          // Memoize each clause's promise so shared dependencies run only once.
          if (!clausePromises[key]) {
              clausePromises[key] = (async () => {
                  let fn = clauses[key];
                  if (fn instanceof Array) {
                      // [dep1, ..., depN, fn]: wait for every dependency first.
                      let deps = fn.slice(0, -1);
                      fn = fn.slice(-1)[0];
                      await Promise.all(deps.map(runClause));
                  }
                  console.log('start', key);
                  results[key] = await fn(results);
                  console.log('end', key);
              })();
          }
          return clausePromises[key];
      }
      await Promise.all(keys.map(runClause));
      return results;
  }
With the above (note clauses are intentionally out-of-order):

  (async () => {
      let results = await Promise_auto({
          write_file: ['get_data', 'make_folder', async function(results){
              console.log('in write_file', results);
              return 'filename';
          }],
          email_link: ['write_file', async function(results){
              console.log('in email_link', results);
              return {'file':results.write_file, 'email':'user@example.com'};
          }],
          get_data: async function(){
              console.log('in get_data');
              return [ 'data', 'converted to array' ];
          },
          make_folder: async function(){
              console.log('in make_folder');
              return 'folder';
          },
      });
      console.log('results = ', results);
  })();

  =>

  start get_data
  in get_data
  start make_folder
  in make_folder
  end get_data
  end make_folder
  start write_file
  in write_file { get_data: [ 'data', 'converted to array' ],
    make_folder: 'folder' }
  end write_file
  start email_link
  in email_link { get_data: [ 'data', 'converted to array' ],
    make_folder: 'folder',
    write_file: 'filename' }
  end email_link
  results =  { get_data: [ 'data', 'converted to array' ],
    make_folder: 'folder',
    write_file: 'filename',
    email_link: { file: 'filename', email: 'user@example.com' } }
Note that there was no topo-sorting or any other stuff to explicitly figure out an execution order, it just takes advantage of the fact that Promises can have their values pulled out multiple times to sequence things properly.

We (http://Clara.io) run multiple NodeJS instances per machine and our code base is async. I believe this gives us the best of both worlds.

Also, the Sync versions of calls in NodeJS are likely to be deprecated, so the synchronous approach won't even be possible in NodeJS going forward.


Sync calls wouldn't be deprecated IMHO as not everything utilizing Node.js is an HTTP server (think of build scripts, background jobs, etc.).

I often use it for ad hoc scripts, similar to the way many people use Python.

Sometimes. Measure, don't assume. For example, we ran across a bug recently with docker and AUFS which caused multiple instances of Dockerized node to all wait on a single mutex, resulting in no improvement in speed over a single instance.

Just add a pickle Promises

This is unsuitable for real-world applications, where you will, inevitably, need at least a little mutable shared state. Async handles this reasonably well at a decent performance cost; shared-memory threads (near-certain catastrophic failure) and database-only state (awful performance) do not.

The only real competition is transactional memory but it hasn't become mainstream yet.


I'm not sure what you're saying here exactly. What is unsuitable for real-world applications?

I will say this, though, regarding the need for shared, mutable state: you can communicate by sharing memory, but you can also share memory by communicating. This is, I guess, what you allude to with database-only state, but it doesn't need to be a database. It could just as easily be a memcached server or some other fast key-value store.
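
(For example, a sketch of sharing a counter between processes through Redis instead of in-process memory; assumes the node_redis client and a local Redis server:)

  var redis = require('redis');
  var client = redis.createClient();

  // Any process in the cluster can bump and read the same counter.
  client.incr('page:views', function (err, views) {
    if (err) throw err;
    console.log('views so far:', views);
  });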


Now forgive my naivety, but isn't this just threads?

I understand that because of the funny scoping rules, threading is actually surprisingly hard? Surely you'd want more control over your threaded event loops?


Do your apps run single-threaded? Why wouldn't you cluster? In-memory state is easily avoidable. Using in-memory sessions even prints a warning in express by default.

This article is wrong on so many levels...

- It implies that async execution is equivalent to callback hell. In reality there are excellent ways to have async code which looks just like sync (generators, async/await).

- It benchmarks multi-core (sync) vs single-core (async) and makes claims based on the results.

- It presents async execution as the antithesis of clustering. In reality it's a best practice to make use of both.

...and everything that follows is just irrelevant.


In addition... "PUT commands write to a file, GET commands read the file back, sort and reduce the data, then return the file"

I guess the author had to add the sort/reduce to "prove" his point...


There are so many statements in the article that make me laugh.

> async cons: Adds latency with parallel overhead

What parallel overhead? This should be a con for multi-threaded code running on a single-core CPU.

> async pros: Callbacks help enforce error-prone synchronization

How do callbacks help with error-prone synchronization? It's not the callbacks; it's the single-threaded model that does that.


Edit: The whole idea behind clustering is to run an application instance per thread/core for better performance and load balance requests between the application instances. This article seems absurd in its intention to force you to choose between multi-threaded synchronous application instances or a single application instance using callbacks.
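
(A minimal sketch of that pattern with Node's built-in cluster module: one async worker per core, with the master load-balancing connections between them:)

  var cluster = require('cluster');
  var http = require('http');
  var numCPUs = require('os').cpus().length;

  if (cluster.isMaster) {
    // Fork one application instance per core.
    for (var i = 0; i < numCPUs; i++) {
      cluster.fork();
    }
  } else {
    // Each worker runs its own async event loop.
    http.createServer(function (req, res) {
      res.end('handled by worker ' + cluster.worker.id + '\n');
    }).listen(8000);
  }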

We've been running a Koa.js API server using Cluster in production for over a year now with no hiccups (on a Windows machine).

I've been thinking about making the switch to iisnode, as it handles clustering, graceful shutdown and zero-downtime from within IIS (and does a couple of other things). It uses named pipes to proxy connections and also supports web sockets among other things.

With the nodeProcessCommandLine configuration setting, you can pass parameters to node (e.g. --harmony), use babel-node or io.js.

See: http://www.hanselman.com/blog/InstallingAndRunningNodejsAppl...

A blog post I wrote a while ago: https://shelakel.co.za/hosting-almost-any-application-on-iis...


I've adapted the node-fibers library to write synchronous-style code. It works really well for my needs, but I do understand that my approach litters the Function prototype, which is not ideal.

Code looks like this:

  var sync = require('./sync');
  
  sync(function () {
    try {
      var result = someAsyncFunc1.callSync(this, 'param1', 'param2');
      if (result.something === true) {
        var result2 = someAsyncFunc2.callSync(this, 'param1');
      } else {
        var result2 = someAsyncFunc3.callSync(this, 'param1');
      }
      
      console.log(result2.message);      
    } catch (ex) {
      // One of them returned an err param in their callback
    }
  });
I haven't tested the performance, so I have no idea if it's running like a dog.

Meteor does something similar with Fibers, but you wrap each function ahead of time, so you don't need to worry about littering Function.prototype.

https://www.discovermeteor.com/blog/wrapping-npm-packages/

It feels a bit like "get that sync code off my Javascript lawn!" but once you get used to it, it's pretty great in practice, and good for introducing newcomers.
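
(A sketch of the wrapping approach, using Meteor.wrapAsync; this has to run server-side inside a Fiber, which Meteor arranges for you:)

  var fs = require('fs');

  // Wrap a callback-style function once, then call it synchronously.
  var readFile = Meteor.wrapAsync(fs.readFile);

  var contents = readFile('config.json', 'utf8'); // throws if err was passed
  console.log(contents);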


If you look more closely at the actual code, this exercise compares the performance of readFileSync (an operation that deliberately blocks) on 1 core vs 2 cores.

Why is this even being upvoted? It confuses important concepts, and what isn't wrong is irrelevant.

Hogwash. It seems like this person doesn't understand that node is an event-based, asynchronous platform, and that's one of its big advantages over languages that force threads, or that don't generally offer parallel execution.

If this had compared Node.js with clustering and async vs. a synchronous language like Ruby, it might have been interesting. Maybe. But non-asynchronous operations in Node are an antipattern that core node contributors are trying to remove (the -Sync functions in the node stdlib are slated for removal).

Good coding conventions, promises, generators, and async/await are your friends for making callback hell go away.


Indeed, I've noticed that quite a few -Sync methods aren't documented anymore. My hope is that node will have a major release which removes all -Sync methods and replaces callbacks with Promises. Then we'll all be happy and use cluster when we can.
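
(In the meantime you can wrap callbacks yourself; a hand-rolled promisify sketch, since node ships nothing like it yet:)

  // Turn an (err, result)-callback function into a promise-returning one.
  function promisify(fn) {
    return function () {
      var args = Array.prototype.slice.call(arguments);
      var self = this;
      return new Promise(function (resolve, reject) {
        args.push(function (err, result) {
          if (err) reject(err); else resolve(result);
        });
        fn.apply(self, args);
      });
    };
  }

  var readFile = promisify(require('fs').readFile);
  readFile('package.json', 'utf8').then(console.log, console.error);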

No please! More sync functions, please!

Yes, if I'm writing a web server I want everything to be async. But, just as I want to share JS on the server and the client I also want to use JS as my build language. At least for me, I find it much faster to build using a sync style.

Maybe I just haven't learned the async way but for example I tried to make a build system using node. I needed to spawn out to a builder to build some stuff, copy files, spawn git in various ways to see if repos are dirty or clean, git add, git commit, and a few other things. I found it a massive nightmare and after 2 days I switched to synchronous python for my building. Was done in 2 hours.

If there are articles or tips that will make me as comfortable with async for building as I am with sync in Python, then please point me at them. But, for whatever reason, I'm struggling with async for really complex tasks. (And yes, I'm using promises.)


If you truly need to do sync programming for whatever reason, you probably aren't using the right language. It's not just web programming that benefits from asynchrony, it's pretty much any non-trivial algorithm. Sure, it can require a bit more code to do right, but you don't want to block all your other background events from firing just because you have to write a bit more code.

If you really prefer synchronous style, and don't mind the drawbacks, there are plenty of languages out there that'll work better for you. PHP, Ruby, and Python readily come to mind.


Yes, but a big draw of using node at all is that it's JavaScript. I write some function "templateThisString" and I can use it on both the server and the client. Now I also want to use it in my build process. If I switch to another language I have double the work.

I actually have a tip for async programming on small projects. Consider storing as much of your state as possible in a central location if it's feasible for your application. As much as possible, avoid relying on closures except for data hiding. This makes it easier to think about what is going on in your application even if the code flow is not exactly linear.

For example use a global object called State that contains information about the list of build tasks to be run and the state of each of the stages for the tasks.
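
(A sketch of that tip; the State shape here is hypothetical, just to show the idea for a build script:)

  // Central, inspectable state instead of state trapped in closures.
  var State = {
    tasks: ['compile', 'copy-assets', 'git-status'],
    stages: {}  // task name -> 'pending' | 'running' | 'done' | 'failed'
  };

  function setStage(task, stage) {
    State.stages[task] = stage;
    console.log(task, '->', stage);
  }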


Well, okay, quick and dirty scripts where you are expecting (or want) things to be blocking are probably the one place where you need things to be sync.

I suppose though that you could always just call process.exit() when a callback was called or a promise resolved. Personally I'd love a file-level await statement.

