Benchmarking the benchmarks (part 3)

This is a continuation of the previous post.

Optimizing the operating system

First a few words about optimization in general.

There is a lot to be said for default configurations. They tend to be very well tested and (hopefully) balanced for good all-round performance. Do not underestimate the value of this. Introducing an optimization may look great in your synthetic test case, but could very well lower performance in a real-life scenario. Or it could improve performance for 95% of your users but leave the remaining 5% complaining about service failures. None of the 95% will give you a gold medal for saving some servers, but you can be sure the users with problems will do the opposite.

If you are running a large-scale operation, change management becomes a huge challenge. You have to maintain the system, but you also have to make sure all changes are tested. The problem is that tests are always approximations; the real test is when you roll something into production. Staying close to a default configuration means the operating system and other components have likely already been tested in the same context, which is worth a lot. And sometimes updating a component means an optimization stops working, or suddenly creates problems that couldn't have been predicted.

That said, in some situations certain optimizations are vital, and not applying them will seriously cripple you. For a production environment in general, I would suggest that you only apply an optimization if you are sure the performance gain is sufficient, that you don't apply it blindly but make sure you actually understand how and why it works, and that you actually test it and include it in your regression tests.


After some feedback from the Nginx team I removed the “worker_cpu_affinity” option and added “accept_mutex off;” in the events section. I decided to keep G-WAN in the race despite the fact that it caches the 1kB file, which makes the comparison somewhat unfair, and added “-b” to G-WAN to use TCP_DEFER_ACCEPT, which added some performance. I’m also adding Ulib to the servers being benchmarked.
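For reference, the events-section change might look like this in nginx.conf (a sketch; everything else is left at its existing values):

```nginx
events {
    # disable the accept mutex so all workers accept connections directly
    accept_mutex off;
}
```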

I’m going to use pounce alone for a while, since not running every benchmark twice saves a lot of time.


I downloaded the latest source for Ulib, configured and built it with default options, and then ran userver_tcp. I will probably need some feedback from the author regarding the recommended configuration.


The below applies to the hardware specified earlier, Arch Linux running Linux 3.0.6, and the local network as of 11/10 2011.

Some options that are not recommended

  • net.ipv4.tcp_syncookies – Defaults to 1 in Arch but doesn’t make any noticeable difference
  • fs.file-max – Defaults to 808886, and we’re not running out of file descriptors here at least, but you do need to increase the shell limit with “ulimit -n 100000” or similar when running high-concurrency tests
  • net.core.[rw]mem_max, net.core.[rw]mem_default – Commonly recommended for optimization, but in these tests increasing them actually has a very noticeable effect in the opposite direction
  • net.ipv4.tcp_mem – Does not help us
  • net.ipv4.tcp_sack, net.ipv4.tcp_timestamps – Default to 1 and are there for a reason; they do nothing significant for this test
  • net.ipv4.tcp_window_scaling – Disabling this will lower your performance with large files and gains you nothing with small files
  • net.ipv4.tcp_congestion_control – Defaults to “cubic” and could be relevant to look at in another setup
  • net.ipv4.tcp_ecn – No difference
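To check what the defaults actually are on your own machine before changing anything, the values can be read straight out of /proc (a minimal sketch, assuming a Linux system; the key list is just a sample from above):

```shell
#!/bin/sh
# Print current values for a few of the sysctl keys discussed above.
# A sysctl key maps to a /proc/sys path with the dots replaced by slashes.
for key in net.ipv4.tcp_syncookies net.ipv4.tcp_window_scaling \
           net.ipv4.tcp_congestion_control fs.file-max; do
    path="/proc/sys/$(echo "$key" | tr . /)"
    printf '%s = %s\n' "$key" "$(cat "$path")"
done
```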

Relevant options

  • ulimit -n 100000
  • net.ipv4.tcp_max_syn_backlog = 262144 – See below
  • net.core.netdev_max_backlog = 262144 – See below
  • net.core.somaxconn = 262144 – Important if you are testing high concurrency; without it you will suffer 3-second penalties on connection attempts that are lost due to a full backlog
  • net.ipv4.tcp_wmem="4096 87380 16777216", net.ipv4.tcp_rmem="4096 87380 16777216" – Possibly a small increase with large files
  • net.ipv4.ip_local_port_range="1024 65000" – You might need many ports
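Collected in one place, the relevant settings above could be kept in a file such as /etc/sysctl.d/99-bench.conf (the filename is my own choice) and loaded with “sysctl -p /etc/sysctl.d/99-bench.conf”. Note that “ulimit -n” is a per-shell limit and has to be set in the shell running the server or the tool, not via sysctl:

```
# /etc/sysctl.d/99-bench.conf -- the "relevant options" from the list above
net.ipv4.tcp_max_syn_backlog = 262144
net.core.netdev_max_backlog = 262144
net.core.somaxconn = 262144
net.ipv4.tcp_wmem = 4096 87380 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.ip_local_port_range = 1024 65000
```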

Not very exciting, but this is a special case: we are using the local network, which, as noted before, is an extremely simplified environment.

Benchmark 4

Results (pounce, average of 8 runs with optimal number of tool cores noted)

before-after/size=1kB/n=10000/c=100/no keep-alive

before-after/size=1MB/n=10000/c=100/no keep-alive

before-after/size=50MB/n=100/c=100/no keep-alive


  • The increase for Nginx and G-WAN is due to the configuration changes in the applications
  • The only real relevant change here was “net.ipv4.tcp_[rw]mem” so there is not much to see
  • I probably need some feedback from the Ulib author about configuration

Benchmark 5

There is really nothing of interest to see. Let’s try the suggestion from the G-WAN author.

fs.file-max = 5000000
net.core.netdev_max_backlog = 400000
net.core.optmem_max = 10000000
net.core.rmem_default = 10000000
net.core.rmem_max = 10000000
net.core.somaxconn = 100000
net.core.wmem_default = 10000000
net.core.wmem_max = 10000000
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_congestion_control = bic
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_max_syn_backlog = 12000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_mem = 30000000 30000000 30000000
net.ipv4.tcp_rmem = 30000000 30000000 30000000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_wmem = 30000000 30000000 30000000
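Before applying a wholesale list like this, it is worth snapshotting the current values so the experiment can be reverted afterwards; a minimal sketch (assuming a Linux system, with a shortened key list for illustration):

```shell
#!/bin/sh
# Save the current values of the keys we are about to change, in the
# "key = value" format that "sysctl -p sysctl-backup.conf" can restore.
keys="net.core.rmem_max net.core.wmem_max net.core.somaxconn net.ipv4.tcp_ecn"
for key in $keys; do
    path="/proc/sys/$(echo "$key" | tr . /)"
    printf '%s = %s\n' "$key" "$(cat "$path")"
done > sysctl-backup.conf
```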

Results (pounce, average of 8 runs with optimal number of tool cores noted)

before-after/size=1kB/n=10000/c=100/no keep-alive

before-after/size=1MB/n=10000/c=100/no keep-alive

before-after/size=50MB/n=100/c=100/no keep-alive


  • There is a tiny increase in performance in the 1kB test
  • There is a 10-20% decrease in performance in the 1MB test
  • There is a 20-30% decrease in performance in the 50MB test
  • Ulib is the only one that actually gained any benefits from the change

Final conclusion

In this test, with all the approximations noted, optimizing the operating system and network does not help us. With higher concurrency, somaxconn etc. will become relevant, but here we are still running with a small number (100) of concurrent clients.

If you are running 10GbE networks, or having problems with congestion, packet loss, etc., the situation is different and you might benefit from looking over the optimizations above.


6 thoughts on “Benchmarking the benchmarks (part 3)”

    1. These are not single client thread tests.

      Only the first benchmark was done with a single thread, to allow comparisons with the Apache “ab” benchmark. Starting with part 2, “Scaling the benchmark tool”, “ab” is abandoned, “weighttp” is run with “-t4”, and “pounce” with “-d4”, which works similarly to the “weighttp” option.

      This is stated in the beginning of part 2.

  1. Hi Fredrik,

    I am glad that you have included Ulib in your analysis. Thank you very much.

    I am surprised by the performance. So far I have used ab for my personal tests.

    Can you try once with ab to confirm the comparison?

    Is it possible to have your client “pounce”, so I can investigate in my familiar context?

    However, I have now started trying “weighttp” and I see some strange behaviour…

    Thanks in advance

    1. Hi Stefano,

      With Ulib and ab/weighttp/pounce and 1kB files I get this:

      > ab -n100000 -c100
      Requests per second: 13918.91 [#/sec] (mean)
      > weighttp -n100000 -c100 -t4
      finished in 4 sec, 540 millisec and 513 microsec, 22023 req/s, 5075 kbyte/s
      requests: 100000 total, 100000 started, 100000 done, 0 succeeded, 100000 failed, 0 errored
      > pounce -n100000 -c100 -d2
      4070197 us, 24568 rps, 19900 kB, 4889 kB/s

      It seems weighttp fails all requests for some reason, while ab and pounce consider all requests valid.

      I’m planning to release pounce ASAP, but as usual the last 20% takes 80% of the time, and some other things got in the way…


  2. Hi Fredrik,

    I made some changes; now weighttp doesn’t complain about the response.
    (It requires one, and only one, space in the “Content-Length:” header of the response…)

    I also added a configuration option (SENDFILE_THRESHOLD_NONBLOCK, default 1M) to regulate sendfile() blocking management;
    maybe it can improve the throughput for big files.
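The strictness described in the comment above is easy to reproduce on its own; a hypothetical check (the regex is mine, encoding the “exactly one space after the colon” rule as described):

```shell
#!/bin/sh
# A header line passes only with exactly one space after the colon.
match() {
    if printf '%s\r\n' "$1" | tr -d '\r' | grep -qE '^Content-Length: [0-9]+$'; then
        echo "ok:  $1"
    else
        echo "bad: $1"
    fi
}
match 'Content-Length: 1024'   # exactly one space: accepted
match 'Content-Length:1024'    # no space: rejected
match 'Content-Length:  1024'  # two spaces: rejected
```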



  3. When are you releasing pounce, your benchmarking tool? I am curious how it compares to ab and httperf.
    Can you give me a download link or something?
