Benchmarking the benchmarks (part 2)

This is a continuation of the previous post.

Scaling the benchmark tool

We were able to improve on the original “ab” benchmark quite a bit, especially for large files, but as the authors of both “weighttp” and “G-WAN” point out, the benchmark tool is only running on a single core. Here we leave “ab” behind, since it has no such capabilities.

“weighttp” is built to be able to scale over a number of cores using the “-t” option specifying the number of threads to run.

“pounce” will (now) by default spawn a process per core but this can be specified with a “-d” option similar to the one above.
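As a sketch, the two tools might be invoked like this on the 4-core test box (the URL and request counts are illustrative; “pounce” is my own tool, so its flags are only as described above):

```shell
# weighttp: -t sets the number of benchmark threads (one per core here)
weighttp -n 10000 -c 100 -t 4 "http://127.0.0.1/index.html"

# pounce: one process per core by default; -d overrides the process count
pounce -n 10000 -c 100 -d 4 "http://127.0.0.1/index.html"
```

With -t 4 and -n 10000, each weighttp thread issues roughly 10000/4 = 2500 requests.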


If we are going to try to push the daemons, we likely need to optimize the configuration. I’m going to use a few common recommendations but not dive too deeply into this. Please note that the configurations are based on the fact that we have 4 cores on this system.


I’m going to leave this as it is for now. If this is upsetting, I’ll come back to it later.


worker_processes  4;
worker_cpu_affinity 0001 0010 0100 1000;

events {
    worker_connections  10240;
}

http {
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    tcp_nopush      on;
    tcp_nodelay     on;
    open_file_cache max=1000 inactive=20s;
    open_file_cache_valid    30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors   on;
    keepalive_timeout  65;

    server {
        listen       80 backlog=1024;
        server_name  localhost;
        access_log   off;

        location / {
            root   html;
            index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}


Seems to be quite optimized already.


server.port                  = 80
server.username              = "http"
server.groupname             = "http"
server.document-root         = "/srv/http"
server.errorlog              = "/var/log/lighttpd/error.log"
dir-listing.activate         = "enable"
index-file.names             = ( "index.html" )
mimetype.assign              = ( ".html" => "text/html", ".txt" => "text/plain", ".jpg" => "image/jpeg", ".png" => "image/png" )
server.max-fds               = 10000
server.event-handler         = "linux-sysepoll"
server.network-backend       = "linux-sendfile"
server.use-noatime           = "enable"
server.max-worker            = 4


Is optimized out of the box.

Benchmark 3

Results (weighttp/pounce, average of 8 runs with optimal number of tool cores noted)

weighttp/pounce/size=1kB/n=10000/c=100/no keep-alive

weighttp/pounce/size=1MB/n=10000/c=100/no keep-alive

weighttp/pounce/size=50MB/n=100/c=100/no keep-alive


OK, now it seems we’re finally getting somewhere. There are still a lot of assumptions and simplifications, of course, but we’ve come some way since the initial “ab” benchmarks.

  • We’ve probably been unfair towards Apache by not optimizing its configuration
  • With very small files, sharing resources between the benchmark tool and the daemon seems more difficult, though this also appears to depend on the daemon implementation
  • G-WAN has a slight advantage with very small files
  • Cherokee, Lighttpd and G-WAN give similar results with larger files
  • weighttp performs better with multithreading but falls somewhat behind pounce, especially with larger files
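One way to reduce the resource sharing noted above, when client and server run on the same box, is to pin them to disjoint cores with taskset (a sketch assuming the 4-core machine; the core masks and invocations are illustrative):

```shell
# Give the daemon cores 0-1 and the benchmark tool cores 2-3,
# so they are not scheduled onto the same CPUs:
taskset -c 0,1 nginx
taskset -c 2,3 weighttp -n 10000 -c 100 -t 2 "http://127.0.0.1/index.html"
```

Note that this would conflict with the worker_cpu_affinity line in the nginx configuration above, which spreads workers over all 4 cores and would need adjusting to match.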

I’ll be back with some keep-alive and concurrency tests later hopefully.

… continued here


12 thoughts on “Benchmarking the benchmarks (part 2)”

  1. Hello Fredrik,

A couple of remarks that could help explain the poor top performance in your tests:


    Hope it helps.


    1. Pierre,

      You are missing the point here, please read the posts before you trash them.

      Some pointers
      – I am testing how using different benchmarking tools affects the result
      – The posts so far have been for non keep-alive as is very clearly stated, and I also clearly state that keep-alive tests are coming up
      – Concurrency is 100 in these tests and as I state concurrency tests are coming up
      – Arch Linux is 64-bit, and only exists as a 64-bit operating system
      – SND buffer is a socket option, not a TCP/IP option, but if I can configure G-WAN to be more optimized with regards to large files please enlighten me


  2. Fredrik,

    > “SND buffer is a socket option, not a TCP/IP option, but if I can configure G-WAN to be more optimized with regards to large files please enlighten me”

    I was just trying to explain that if Cherokee is faster for large files and slower for small files then this is NOT because Cherokee is faster.

    This is because Cherokee is using large SEND buffers for sockets (a TCP/IP option).

    Letting readers understand the results of your tests has value in my eyes – hence my attempt to explain.
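For reference, the socket send-buffer limits in question can be inspected on Linux like this (read-only checks; these are the wmem knobs, not a tuning recommendation):

```shell
# Default and maximum send buffer per socket, in bytes;
# setsockopt(SO_SNDBUF) requests are capped at wmem_max:
cat /proc/sys/net/core/wmem_default
cat /proc/sys/net/core/wmem_max
# TCP-specific min/default/max send-buffer sizes:
cat /proc/sys/net/ipv4/tcp_wmem
```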


      1. Hi Pierre,

Network and operating system optimizations are definitely a relevant parameter. What I’m trying to do here is start with a default base, using “ab”, and add to it gradually to see in what way the introduced variables influence the result, or don’t.

So, let’s add optimizations and look at the result. Of course the benchmark is already very artificial, since the “network” we are using lacks all the problems the network stack is trying to solve: no packet loss, no packets out of order, zero latency, bandwidth limited only by bus/CPU/memory, and so on. Whatever changes we make could very well improve performance in our already close-to-irrelevant setup, but actually worsen performance in a real-life production environment. But it’s interesting to see how it affects the client and server behaviour.

I’d be happy to let you, and others here, suggest a set of optimizations. The kernel is version 3.0.6, so it is very recent.


      2. Fredrik,

        Here is the best-working kernel setup that I have found:

        “Performance Scalability of a Multi-Core Web Server”, Nov 2007
        (Bryan Veal and Annie Foong, Intel Corporation, Page 4/10)

        Intel researchers describe these options as:

        “Table 1 shows modifications made to the Linux kernel default sysctl settings. We increased operating system limits where necessary to scale. We also set TCP options to values we believe are similar to those used on high-performance commercial web servers.”

        sudo gedit /etc/sysctl.conf
        sudo sysctl -p /etc/sysctl.conf

        fs.file-max = 5000000
        net.core.netdev_max_backlog = 400000
        net.core.optmem_max = 10000000
        net.core.rmem_default = 10000000
        net.core.rmem_max = 10000000
        net.core.somaxconn = 100000
        net.core.wmem_default = 10000000
        net.core.wmem_max = 10000000
        net.ipv4.conf.all.rp_filter = 1
        net.ipv4.conf.default.rp_filter = 1
        net.ipv4.tcp_congestion_control = bic
        net.ipv4.tcp_ecn = 0
        net.ipv4.tcp_max_syn_backlog = 12000
        net.ipv4.tcp_max_tw_buckets = 2000000
        net.ipv4.tcp_mem = 30000000 30000000 30000000
        net.ipv4.tcp_rmem = 30000000 30000000 30000000
        net.ipv4.tcp_sack = 1
        net.ipv4.tcp_syncookies = 0
        net.ipv4.tcp_timestamps = 1
        net.ipv4.tcp_wmem = 30000000 30000000 30000000

        This Intel paper also explains that to saturate the Apache server they have used 8 NICs tied to 8 CPU Cores.

        If you don’t have such a monster machine, use localhost to remove the network from the equation.

        With these options on localhost, G-WAN reaches 749,574 HTTP requests per second (Nginx 207,558 and Lighty 215,614).

