Benchmarking the benchmarks (part 1)

Why another benchmark

I’ve spent a lot of time benchmarking over the past few years while building distributed systems and working on cost/performance optimization: everything from infrastructure, hardware and storage to database and application solutions.

One understanding that gradually comes to you along the way is that benchmarking is actually very difficult. That is to say, benchmarking in itself is easy, but producing valuable data is not. The results seem to vary from “at least gives an indication” at best to “totally useless in real-life scenarios”, with the latter occurring more often.

The reason, of course, is that there are just too many variables and unknowns, and to add to the difficulty, some of them are quite complicated to simulate realistically. To be able to produce any data at all we make a lot of assumptions, simplify as much as possible, and keep maybe one or two variables free, hoping that the result will at least to some degree reflect what we want to see.

Keeping this in mind, it is of course obvious that you can’t put a lot of trust in a single benchmark, and even more obvious that you likely can’t trust someone who benchmarks with an agenda at all. Lying with benchmarks is as easy as lying with statistics: you just pick the set of assumptions and fixed variables under which you perform at your best and your opposition at their worst. Knowing who has an agenda can be difficult, but someone who is benchmarking their own product, well, maybe has one…

This being said, I spent some time looking at the web proxy Varnish this summer, and since I was curious about the potential performance gain I did some benchmarks and decided to share them. I will actually redo them to make them a bit more up to date, and I will probably skip Varnish itself since it is a somewhat different solution than a pure web server.

So this will be just another benchmark of web distribution of static content. If nothing else it will be an additional, for a brief period the most recent, indication of the performance of web server daemons running on Linux. Hopefully there will be a few valuable thoughts along the way.

Software tested

I will benchmark

  • Apache v2.2.21 – The old work horse
  • Nginx v1.1.5 – Probably the most common Linux alternative to Apache
  • Cherokee v1.2.99 – “The fastest free Web Server out there”
  • Lighttpd v1.4.29 – Which will “scale several times better with the same hardware than with alternative web-servers”
  • G-WAN v2.10.6 – According to the vendor the silver bullet that makes all other software regardless of purpose obsolete (and will cure disease and solve the world’s conflicts along the way)

Of these, G-WAN is the one that sticks out by not being open-source. This obviously has implications that will discourage some, but that subject is out of scope for this benchmark. (Note: the vendor is commercial and sells services like support around the product, but the software in itself is free of cost.)

As to having an agenda, I’m writing this as a private individual without any commercial interests. I’ve obviously used Apache, who hasn’t, but have over the last years worked much more with Nginx. I’ve spent some time looking at Lighttpd, and much less at G-WAN and Cherokee. I will try to be as unbiased as I can.

Benchmarking environment

So, a few assumptions and simplifications

  • I will use a single Dell PowerEdge 1950 with dual Xeon 5130 CPUs @ 2.00GHz (dual core, so four cores in total) and 16GB of RAM, running an updated Arch Linux (which finally convinced me to leave the OpenBSD world) with a 3.0.6 Linux kernel
  • I will benchmark concurrent downloads of a single file with random content of the sizes of 1kB, 1MB and 50MB
  • I will, at least initially, look at non keep-alive requests
  • I will, at least initially, simulate load using 100 concurrent clients
  • I will, at least initially, use default configurations for the operating system and applications
  • I will measure pure throughput in requests-per-second or Mbps

As you can see this will not be a full-fledged final benchmark, as I am more interested in looking at the process of benchmarking than in finally resolving the issue of which web server is “the best and finest”. I had to lower the max file size to 50MB since I’m using XFS on all partitions except for /boot which is small, and G-WAN promptly refuses to run on XFS…
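
To make the setup concrete, here is a rough sketch of how the test files can be prepared (the web root path and file names are placeholders, not necessarily the exact ones I used):

    import os

    # Placeholders: document root and file names; adjust to whatever
    # the web daemon under test actually serves.
    WEB_ROOT = "/srv/http"
    SIZES = {"1kB.bin": 1024, "1MB.bin": 1024 ** 2, "50MB.bin": 50 * 1024 ** 2}

    for name, size in SIZES.items():
        with open(os.path.join(WEB_ROOT, name), "wb") as f:
            left = size
            while left > 0:
                chunk = min(left, 1024 ** 2)
                f.write(os.urandom(chunk))   # random, incompressible content
                left -= chunk

    # Each file is then fetched repeatedly, e.g. with
    #   ab -n 10000 -c 100 http://127.0.0.1/1kB.bin
    # (no -k flag, so no keep-alive); every figure below is the
    # average of 8 such runs.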

What can we, with a critical eye, say about these assumptions? Well, probably a lot…

  • The hardware is quite old, with only 4 cores in total, and different software will scale differently depending on the number of cores
  • The benchmarking tool will run in a single process, without scaling over the available cores, to be comparable with “ab”
  • We will run clients and servers on the same hardware over the local network, sharing resources
  • All clients will use the exact same implementation, will run with virtually zero latency, without packet loss, congestion, or any other network issues
  • We only look at static content, and all clients will download the same meaningless single file
  • Changes in performance or stability over time will not be visible
  • How the different software uses resources such as CPU, memory, etc. to produce the result will not be presented
  • We could go on and on…

To summarize, it’s hard to see how this scenario could even come close to being called relevant. In any real-life production environment we would probably have a very large number of different files, the bottleneck could very well be something completely different such as disk or network I/O, and parameters such as compliance, reliability, security and stability over time would be more or less decisive. But let’s move on anyway.

Ok, maybe too much information… Time for some pictures to liven this up.

Benchmark 1

Results (using ab, average of 8 runs)

ab/size=1kB/n=10000/c=100/no keep-alive

ab/size=1MB/n=10000/c=100/no keep-alive

ab/size=50MB/n=100/c=100/no keep-alive

Conclusions

So, how boring this looks on a graph: there is almost nothing to tell from the results. G-WAN is somewhat faster with small files, and Cherokee, slightly to my surprise, wins the race for 1MB files.

One thing that comes to mind is the possibility that we’re not actually measuring the web daemon at all, but maybe instead the operating system or the benchmark tool itself.

When I did the benchmarks this summer it seemed to me that everyone was using the “ab” benchmark tool, and I failed to find an alternative, so I wrote my own, called “pounce”. I will release it somewhere right after this benchmark. I recently learned that Lighttpd has its own tool, called “weighttp”, so let’s try these and see if they make any difference.
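
Pounce’s source isn’t out yet, but to give an idea of what “event driven, no threads” means in practice, here is a minimal sketch (in Python, with placeholder host, port and file name) of the general approach such a client can take. It is an illustration of the technique, not pounce itself:

    import selectors
    import socket
    import time

    # Placeholders: point these at whatever the web daemon under test serves.
    HOST, PORT, PATH = "127.0.0.1", 80, "/1kB.bin"
    CONCURRENCY, TOTAL = 100, 10000
    REQUEST = ("GET %s HTTP/1.1\r\nHost: %s\r\n"
               "Connection: close\r\n\r\n" % (PATH, HOST)).encode()

    sel = selectors.DefaultSelector()
    started = 0
    completed = 0

    def open_conn():
        # Start one non-blocking connection and wait for it to become writable.
        global started
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        s.connect_ex((HOST, PORT))          # returns immediately
        sel.register(s, selectors.EVENT_WRITE)
        started += 1

    t0 = time.time()
    for _ in range(CONCURRENCY):
        open_conn()

    while completed < TOTAL:
        for key, events in sel.select():
            s = key.fileobj
            if events & selectors.EVENT_WRITE:
                # Connected and writable: send the request, then wait for the reply.
                s.sendall(REQUEST)
                sel.modify(s, selectors.EVENT_READ)
            else:
                # Readable: drain data; an empty read means the server closed
                # the connection (Connection: close), i.e. one request is done.
                if not s.recv(65536):
                    sel.unregister(s)
                    s.close()
                    completed += 1
                    if started < TOTAL:
                        open_conn()

    elapsed = time.time() - t0
    print("%d requests in %.2fs (%.1f req/s)"
          % (completed, elapsed, completed / elapsed))

The point is simply that a single process can keep 100 connections in flight by reacting to readiness events, and how efficiently a tool does this turns out to matter for the numbers below.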

Benchmark 2

Results (using ab/weighttp/pounce, average of 8 runs)
(ab is the first set of benchmarks, weighttp the second, pounce the third)

ab/weighttp/pounce/size=1kB/n=10000/c=100/no keep-alive

ab/weighttp/pounce/size=1MB/n=10000/c=100/no keep-alive

ab/weighttp/pounce/size=50MB/n=100/c=100/no keep-alive

Conclusions

  • After talking to the weighttp author it has become clear that the program is intended to be run with the multithreading option enabled, so its result here is not representative and should be taken with a grain of salt
  • The benchmark tool implementation clearly matters, with “pounce” performing noticeably better on small files and much better on large ones
  • The result seems to depend on the combination of tool and daemon, with Apache for example coming out on top for 50MB files using “weighttp”, indicating that sharing resources between the benchmark tool and the web daemon is indeed an issue
  • Relying on “pounce”, G-WAN wins the race here by a small margin

So are we done here? Probably not, but this post has become long enough. There are other things to consider, but I’ll write about them in another post later on.

… continued here


25 thoughts on “Benchmarking the benchmarks (part 1)”

  1. Hi,

    I’m the author of weighty (weighttp). I’m curious why pounce should be so much faster than weighty. Did you use the multiple threads feature?
    Without looking into it further, I can imagine pounce using multiple threads while you didn’t with weighty. “ab” of course doesn’t support that.

    Which brings me to my main point: please publish all configs and parameters you used.

    Thanks

    1. Hi,

      Thanks for weighttp, it’s somewhat strange that ab has been the only option for so long. Pounce is completely event driven and does not use threads. There are actually still improvements to be made but I’ll put that in another post. I’ll release the code shortly as well.

      I’m using 100% “default” (i.e. default for Arch Linux pacman packages) configurations, except for Nginx where the latest pacman package is not up to date. For Nginx I used the plain default build configuration.

      I might post an optimized benchmark later that would reflect the potential better of the different daemons.

      Thanks

      1. Please use the -t parameter with values 2 or 4. It will make better use of the hardware. I’m really interested why pounce would be so much faster. Weighty should be quite efficient and there isn’t much that it actually does 🙂
        On the other hand I’m sure it can be tuned quite a bit since I didn’t tune it at all, I just wanted to be slightly faster than “ab” and be able to use multiple cores.

        I’d be happy if I could take a look at the source.

        Regarding Lighty: I checked the default config that Arch provides.

        Setting server.max-keep-alive-requests to something like 1000 should help if you used keep-alive (which you didn’t, so it doesn’t apply).
        I’d guess using server.network-backend = “linux-sendfile” could improve things quite a bit by avoiding unnecessary copying of data.
        Also server.event-handler = “linux-sysepoll” might be worth a try.
        Plus in some cases (like this), server.max-worker = 4 could be used without fearing its drawbacks.
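
        Put together, that would look roughly like this in lighttpd.conf (just a sketch as a starting point, I haven’t benchmarked these exact values here):

          # suggested settings, sketch only
          server.max-keep-alive-requests = 1000    # only relevant with keep-alive
          server.network-backend = "linux-sendfile"
          server.event-handler = "linux-sysepoll"
          server.max-worker = 4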

      2. Using multiple cores was/is going to be the next post, but you are right, the results are much more even using multithreading in weighttp. Pounce is indeed also a very simple and small program. Meanwhile I added a note saying the weighttp result is not really fair without the multithreading option.

        I’ll try to make some optimizations the next round as well!

        Thanks

  2. Dear Fredrik,

    I am G-WAN’s author. Thanks for posting your test on our forum.

    May I ask why you found it relevant to test SMP servers with a single client thread?

    I mean, it’s not as if you did not read the prose recently posted on our forum, since you are using G-WAN v2.10.6 and are only being ironic about it:

    [G-WAN] “the silver bullet that makes all other software regardless of purpose obsolete (and will cure disease and solve the world’s conflicts along the way)”.

    May I also ask you to correct this erroneous declaration:

    “Of these G-WAN is the one that sticks out by being commercial and not open-source.”

    Freeware without any strings attached is even less “commercial” than “open-source” products that require licenses for anything other than non-profit use.

    Nobody is under any obligation to pay for using G-WAN – whatever the use.
    Not all open-source servers can claim being so “free”.

    Pierre.

  3. Hi Pierre,

    Thank you for a great software!

    The point of the post is to look at how different benchmarking tools can produce very different results under the same circumstances, most specifically looking at Apache ab, which has been used in almost all web benchmarks I’ve seen until now. The idea is to look, in further posts, at how the results change when using several cores, optimized configurations, keep-alive, etc., depending on how much time I end up having.

    The version 2.10.6 I believe was the latest version yesterday. I must admit I’m being a bit ironic about claims of the different daemons, and since G-WAN (to me) seems to be the most aggressive in this regard, I’m afraid you got a large portion of the irony.

    To me it seems that G-WAN is a commercial product, since the “Buy Now” menu is among the first things you see when entering your site, and since the source code isn’t available it is different from the other daemons benchmarked. This is not the scope of the post, but I do of course support the idea that creative people should be rewarded for their time and effort! Personally though I have trust issues with source I can’t get access to.

    Thanks,
    Fredrik

    1. >To me it seems that G-WAN is a commercial product since the “Buy Now” menu is among the first things you see when entering your site

      If you click the “Buy Now” menu the first things you see are “Why Buy Free Software?” and “What you buy below is technical support…” – as many open source projects do, he is selling support. The software itself is freeware.

      You can find the reasons stopping him from opening the source in G-WAN’s official forum if you’re curious enough.

      1. I will write something to clarify this. The distinction of what makes a product “commercial” or not can of course be discussed. Selling support for 150,000 USD/year indicates a strong commercial interest, but I suppose it is a question of semantics.

      2. The only thing you need to do to clarify it is to substitute the word commercial with freeware. It’s not semantics – just plain definitions: you don’t have to pay anything to legally use it in any way you want. Open-source projects don’t stop being so if they have paid support – the same applies to freeware.

      3. “Freeware” is also a loosely defined term, but as you say I will clarify that the vendor is commercial and sells services around it, but that the software in itself is free of cost.

      4. Getting a little off-topic, but – should Ubuntu users be warned that Canonical has a “strong commercial interest” in it? 🙂

      5. Haha, no I missed that too. Indeed, who doesn’t have a commercial interest, after all there’s no such thing as a free lunch, and so forth. Let’s skip the whole commercial aspect. 🙂

      6. Sorry – could you remove this and my last comment above? Fairness achieved – no need to point fingers 🙂

  4. I am sorry, but I highly disagree with your statement regarding freeware. You’re lying to yourself here.
    The reason I say this is because you’re drinking Coca-Cola and eating stuff from the supermarket, you buy hardware, and you live in a closed-source world.
    To say that you don’t trust freeware while drinking Coke at the same time is ironic. You’ll never know their business secrets and they will never tell you how, for a good reason obviously, and those business laws are enforced a lot harder than other laws. Crying out loud for OpenSource won’t make people open their sources; the OpenSource “industry” is an illusion and it’s hard to open the eyes to realize that it’s not more than crowd-sourcing. I don’t believe in security by OpenSource, I don’t believe in freedom by OpenSource, but I believe in flexibility by OpenSource.

    For example, MSN, Yahoo, Skype and ICQ are all closed source, and people reverse engineered their protocols to offer open-source clients, which of course must be maintained because of protocol changes. Open-sourcing something doesn’t make it better, but testing something very hard, like in your benchmarks for example, helps every player in this field become better.

    1. You’re reading a lot into a few words here. This is not the subject of the post, but I’m not saying I don’t trust freeware, or that I expect software to be free of cost. There is a lot to be said about this and the dynamics involved of course, too much to write here.

      What I was saying is that I have trust issues with source code that I or other independent agents can’t review. I’m saying this having worked as a penetration tester for many years and in the security field in general for most of my adult life, but this is just my own “paranoia” and my own trust issues.

      Of course there are plenty of excellent products that aren’t open source, and not seldom they are more mature, stable and well written than the open source alternatives, but when they are on par I personally prefer to build on open source when possible since it gives me more control. But this is just my own preference, I’m not saying it applies to anybody else.

      1. Hi Lonewolfer, thanks a lot for giving me an answer and staying cool, really appreciate it 🙂
        You’re right on that point of course; I was just trying to read more detail into it that I thought was missing.

        I’m doing pen-testing myself and operate on a custom pen-testing Linux. I share the same good paranoia as you do; that’s why I was stating that food, hardware and other harmless things are back-doored too, even though they’re freely accessible. Freeware != Freeware

        Could you please open-source your (Kernel/OS) settings so that the benchmark becomes reproducible?

        I also have trust issues with software, but that applies to all software and not only closed-source software. But you’re right, trying to hide something makes one more suspicious of course. Currently I think it’s a good idea for Pierre to hide his code until he has some market, otherwise I think he would lose his advantage, and I’m sure Microsoft or other companies would copy the code, regardless of the license. I’m sure they have done that with many parts of the Linux kernel and lots of other software too. Everything Microsoft has made so far started with a rip or rip-off.

        Regarding your last paragraph, I think opening the source is a strategic decision and should only be done at the right time and place.
        If you have a business idea, you shouldn’t go open-source with your beta software or prototype; even going open-source with a final product might be wrong strategically. I’ve seen cases where this doesn’t apply, but that always depends, because it’s a decision relative to the person.
        My personal experience is that I’ve seen people living the elbow culture, meaning that they don’t share their knowledge so they can capitalize on it for their own benefit as much as possible. I hate such people, but sharing information with everyone, especially these people, is stupid in my opinion unless they get a course in sanity.

        That’s why I have no personal interest in seeing pounce’s code if it could harm you in any way, but I am curious to learn how it works and would like to have the binary at hand.

        Cheers
        X4

      2. Hi X4,

        I’ll answer more in depth later on (right now I have to go get some groceries), but regarding operating system settings I’m still running a default, updated Arch Linux system with a 3.0.6 kernel. I just answered Pierre in another post and suggested that he and others propose operating system optimizations so we can see how they affect the data.

        Fredrik

      3. I actually ended up being side-tracked and doing a black box test of G-WAN. A post about this is coming up.

        “pounce” contains absolutely no magic; it’s just a very simple, fast, multiprocess, event-driven routine. You will get the source soon, but probably in a very early version, so I can’t guarantee it will compile on all Linux distributions without some minor changes.

        Strategies for how to make successful software of course vary. I won’t presume to have better answers than anyone else, and people do indeed have different ideas on how to go about this. One point worth a lot, though, is of course the value of collaboration, both in functionality and content, if you are looking to maximize potential. The whole Internet is the result of collaboration, Wikipedia as well, and countless other phenomena. If there is a possibility to tap into this, there is enormous potential to be won.

        As for success stories, I’ve built systems on GlusterFS for a number of years. It’s an excellent open source project with a clean design and a very dedicated team. They went “commercial” a while back, selling support for the product, which makes a lot of sense, and just a few weeks back they sold to Red Hat for 136M USD. It’s a great product and they deserve the success. Personally I’d rather it hadn’t gone into Red Hat, but that’s beside the point.

  5. Interesting write-up. Being a developer, my interest in benchmarking mainly comes from a programming viewpoint, as well as all those times when I get lumped in with the IT crowd and put in charge of the servers. My first ‘measurement of success’ was to make a site ‘Digg proof’. This led me down the path of benchmarking, and mainly towards ab. Often those that helped me would point out some of the limitations. So to have someone with the time and experience to look into this is extremely useful for folks like me.

    I’m also a little surprised at the strength of response by some of the software authors! Perhaps they are so acutely aware of the common limitations and mistakes made by amateur benchmarkers (like myself) that they wished they had started this very post themselves 🙂
