Tuesday, February 3, 2015

Maxing out at 50 concurrent connections with Play or Netty on OS X? Here's a fix

I recently ran across a strange problem with the Play Framework and Netty: on Linux, my Play app could easily handle thousands of concurrent connections; on OS X, the same app maxed out at around 50 concurrent connections. It took a while to figure out the problem, so in this post, I'm documenting the solution in case other folks run into the same issue in the future. Note: if you're using raw Netty, the fix is very straightforward; if you're using Play, it's much trickier, and I hope it will be fixed in Play itself in the near future.

tl;dr: set the "backlog" option on Netty's Bootstrap object to a higher value.

Symptoms

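On OS X, your Play or Netty app cannot handle more than ~50 concurrent connections. A simple way to test this is to point Apache Bench (ab) at your app, for example ab -n 1000 -c 100 http://localhost:9000/ (9000 is Play's default HTTP port).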


As soon as you set the concurrency level (the -c parameter) above 50, you'll get the error "apr_socket_recv: Connection reset by peer (54)". Moreover, your Play or Netty app will never actually see these requests, and nothing will show up in the logs.

However, if you run the same experiment against the same app on Linux, all requests complete successfully, without errors, even with the concurrency level set to several hundred or several thousand. Therefore, something OS-specific must be causing this problem.

To be fair, it's rare to use OS X in any production or high-traffic capacity (at LinkedIn, for example, we use OS X in dev but Linux in prod), so the concurrency limitation is rarely a problem. However, we had a few use cases where, even in dev mode, we had to make many concurrent calls to the same app, so we had to find a solution.

The Cause

It turns out that 50 is the default value of the "backlog" parameter in Java's ServerSocket. The Javadoc even spells it out:
The maximum queue length for incoming connection indications (a request to connect) is set to 50. If a connection indication arrives when the queue is full, the connection is refused.
Therefore, whatever code manages sockets in Netty must be configuring the backlog differently on Linux and OS X. This code is likely tied to the selector implementation for each OS: I'm guessing the Linux version uses epoll while the OS X version uses kqueue, and that the former sets the backlog to some reasonable value (perhaps from OS settings) while the latter just uses the default of 50.
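To see where the 50 comes from in plain Java, here is a minimal JDK-only sketch (the class and method names are hypothetical; this is not Netty code). It binds a NIO ServerSocketChannel, the channel type Netty's NIO transport is built on, once without an explicit backlog and once with a larger one. According to the Javadoc for ServerSocketChannel.bind, a backlog of 0 or less means "use an implementation-specific default", which in OpenJDK is the same 50 quoted above; any positive value is handed to the OS, which may still cap it.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BacklogDemo {

    /**
     * Binds a server channel to an ephemeral loopback port.
     * A backlog of 0 (or less) asks the JDK for its implementation-specific default;
     * a positive value is passed on to the OS, which may silently cap it.
     */
    static ServerSocketChannel listen(int backlog) throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), backlog);
        return channel;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel withDefault = listen(0);      // JDK default backlog
             ServerSocketChannel withLarger = listen(1024)) {  // explicit, larger backlog
            System.out.println("default backlog bound to " + withDefault.getLocalAddress());
            System.out.println("backlog 1024 bound to " + withLarger.getLocalAddress());
        }
    }
}
```

Note that the backlog cannot be read back from the socket afterwards, which is part of why this limit is so easy to miss.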

The Solution (pure Netty)

After some more digging, I found a StackOverflow thread revealing that the Netty ServerBootstrap class lets you set a "backlog" option to override the default.

If you're using pure Netty, setting that option to a higher value is all it takes, and the 50 concurrent connection limit will vanish immediately!
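In Netty 3, the option is set with bootstrap.setOption("backlog", 1024) before binding (1024 is just an example value). So that it runs without Netty on the classpath, the sketch below is a hypothetical, JDK-only model of the same pattern rather than Netty's own code: options live in a string-keyed map, and the "backlog" entry, if present, is what gets passed to bind. In this model, leaving the option out passes 0, which the JDK treats as "use the default" (the 50 from the Javadoc).

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical JDK-only model of a bootstrap with string-keyed options (not Netty's code). */
public class OptionsBootstrap {
    private final Map<String, Object> options = new HashMap<>();

    /** Mirrors the shape of Netty 3's Bootstrap.setOption(String, Object). */
    public void setOption(String key, Object value) {
        options.put(key, value);
    }

    /** Backlog to pass to bind(); 0 means "let the JDK pick its default". */
    int backlog() {
        Object value = options.get("backlog");
        return value instanceof Number ? ((Number) value).intValue() : 0;
    }

    /** Binds to an ephemeral loopback port using the configured backlog. */
    public ServerSocketChannel bind() throws IOException {
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), backlog());
        return channel;
    }

    public static void main(String[] args) throws IOException {
        OptionsBootstrap bootstrap = new OptionsBootstrap();
        // With real Netty 3, the equivalent line is bootstrap.setOption("backlog", 1024)
        bootstrap.setOption("backlog", 1024);
        try (ServerSocketChannel channel = bootstrap.bind()) {
            System.out.println("listening on " + channel.getLocalAddress()
                    + " with backlog " + bootstrap.backlog());
        }
    }
}
```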

Also worth noting: this issue exists in Netty 3.x, but apparently Netty 4.x picks a better default than 50 on all OSes, so upgrading Netty may be another solution.

The Solution (Play)

Play instantiates the ServerBootstrap class inside NettyServer.scala. Unfortunately, neither the NettyServer class nor the bootstrap instance inside it is accessible to app code. This should be easy to fix with a pull request, but until that happens and a new version is released, here is a two-part workaround to get moving.

Note: this is an ugly hack with lots of copy/paste from the original Play source code, and it is only meant as a temporary workaround. It has been tested with Play 2.2.1; figure out which version of Play you're on and be sure to copy the code from that release!

Step 1: make a local copy of NettyServer.scala called TempNettyServer.scala

You'll want to put TempNettyServer.scala in a different SBT project than your normal app code; that is, don't just put it in the app folder. See SBT Multi-Project Builds for more info.

In the folder structure, my-app is my original Play app and monkey-patch is a new SBT project that holds TempNettyServer.scala.

Copy the contents of the original NettyServer.scala into TempNettyServer.scala, with two changes:
  1. Replace all NettyServer references with TempNettyServer.
  2. In the newBootstrap method, set the "backlog" option on the bootstrap so that it can be configured.
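How you make the value configurable is up to you. One possibility, sketched here in Java using only the JDK, is to read it from a JVM system property so it can be set with -D when the JVM starts. The property name http.netty.backlog and the fallback of 1024 are assumptions invented for this example; Play does not read such a property. Inside newBootstrap, the result would then be passed to the bootstrap's "backlog" option.

```java
public class BacklogConfig {

    /** Hypothetical property name, chosen for this example only. */
    static final String BACKLOG_PROPERTY = "http.netty.backlog";

    /** Hypothetical fallback used when the property is missing, unparseable, or not positive. */
    static final int DEFAULT_BACKLOG = 1024;

    /**
     * Reads the backlog from a JVM system property, e.g. -Dhttp.netty.backlog=2048.
     * Integer.getInteger returns the fallback if the property is unset or not an integer.
     * Values below 1 are replaced too, since the JDK treats them as "use the implementation default".
     */
    static int configuredBacklog() {
        int value = Integer.getInteger(BACKLOG_PROPERTY, DEFAULT_BACKLOG);
        return value >= 1 ? value : DEFAULT_BACKLOG;
    }

    public static void main(String[] args) {
        // In TempNettyServer's newBootstrap, this value would be handed to the
        // bootstrap's "backlog" option before the server binds.
        System.out.println("backlog = " + configuredBacklog());
    }
}
```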

Now, configure this new SBT project in project/Build.scala.


Step 2: override the run and start commands to use TempNettyServer

Ready for more copy/paste?

Grab PlayRun.scala and copy it into the project folder under another name, such as TempPlayRun.scala, and make two changes:
  1. Replace all PlayRun references with TempPlayRun: there should only be one, which is the class name.
  2. Replace all NettyServer references with TempNettyServer: there should be two, both in String literals used by the "run" and "start" commands to fire up the app.
Now, update the settings in project/Build.scala to use your versions of the "run" and "start" commands.


A note on OS limits

After making the changes above, you should be able to handle more than 50 concurrent connections. However, depending on how your OS is configured, you might still hit a limit at 128 or so. This is probably due to the kernel config kern.ipc.somaxconn, which controls "the size of the listen queue for accepting new TCP connections" and has a default of 128.
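Put differently, the queue you end up with is roughly the smaller of what the app asks for and what the kernel allows. The helper below is a simplified model of that arithmetic, not a kernel or JDK API: it assumes the JDK substitutes 50 when no positive backlog is requested and that the kernel silently truncates anything above kern.ipc.somaxconn.

```java
public class EffectiveBacklog {

    /** Default the JDK uses when no positive backlog is requested (per the ServerSocket Javadoc). */
    static final int JDK_DEFAULT_BACKLOG = 50;

    /**
     * Simplified model: the requested backlog (or the JDK default if none is set),
     * silently capped by the kernel's somaxconn limit.
     */
    static int effectiveBacklog(int requested, int somaxconn) {
        int asked = requested >= 1 ? requested : JDK_DEFAULT_BACKLOG;
        return Math.min(asked, somaxconn);
    }

    public static void main(String[] args) {
        int somaxconn = 128; // the OS X default mentioned above
        System.out.println(effectiveBacklog(0, somaxconn));    // 50: no backlog option set
        System.out.println(effectiveBacklog(1024, somaxconn)); // 128: option set, default kernel limit
        System.out.println(effectiveBacklog(1024, 2048));      // 1024: option set, somaxconn raised
    }
}
```

This is why both knobs matter: raising only the backlog option gets you from 50 to 128, and raising only somaxconn leaves you at 50.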

To raise this limit, you can use sysctl, for example: sudo sysctl -w kern.ipc.somaxconn=1024

Your Netty or Play app should now be able to handle over 1000 concurrent connections (or more, depending on what limits you set above).
