October 12, 2016

Performance at its (CPU-) Core

Who should read this?

This article will be interesting for you if you:
    • Are trying to decide which hardware to buy;
    • Are designing a service that is supposed to run on a specific hardware specification;
    • Are an operator like me; or,
    • Generally like to fiddle around with hardware and tune system performance.


    Many of today's server infrastructures are multi-socket systems, sometimes with hyperthreading if you buy Intel CPUs. Lately, we had a decision to make that seemed trivial at first: Which CPUs should we get for our server replacement, 2x8 cores @ 3.2 GHz or 2x6 cores @ 3.4 GHz? Both CPUs are Intel Xeon brand and therefore support hyperthreading. For further consideration, we will call them X1 and X2, respectively.

    Since each hyperthreading core appears as an extra logical core at the system level, we essentially get 32 logical cores on X1 and 24 logical cores on X2.

    Our target application only uses eight cores simultaneously, so we better go for the faster single-core solution (X2), right?

    Spoiler alert: Wrong. Before I explain why, I need to explain two technologies first:


    Hyperthreading

    Originally introduced by Intel, hyperthreading is an approach to better utilize CPU pipelines by providing an additional set of CPU registers. As a result, a single pipeline can simultaneously process, say, a floating-point operation (done by the FPU) and a logical shift (done by the ALU). In the execution step, the two instructions use different resources. The same is true for memory read/write instructions and arithmetic operations, for instance. It's a great technology, but there is a "but."

    As soon as identical instructions are executed in succession, the performance gain is lost. Each operation has to wait for the previous one to finish in order to free the needed resource, just as it would on a CPU without hyperthreading. An example would be a set of ten successive memory reads. As a result, the actual performance gain from hyperthreading is not 200% compared to a single core, but more in the range of 130%, depending on the instruction mix.

    Caching on multi-core systems

    A cache is, simply put, a fast buffer for a slower but larger memory. Its sole purpose is to speed up accesses to the main memory. Usually a processor die incorporates three levels of caches:

    - L1I: A small, very fast cache for instructions

    - L1D: A small, very fast cache for data

    - L2: A larger cache for both data and instructions. This cache is sometimes shared among cores, but usually is not.

    - L3: A large cache shared among all cores. It can easily account for 1/3 of the whole processor die, to provide the best hit rate (see picture).

    But what does this mean in practical terms?

    Switching the executed process is called a “context switch.” If a previously interrupted process is scheduled again, it is best to run it on the same core as before; this way, chances are high that the L1 and L2 caches still hold relevant data. A worse scenario is when the process moves to another core, because the warm L1 and L2 caches are lost. Consequently, the worst case is when a process moves to a different CPU socket, because the L3 cache contents are lost as well.

    When a process constantly switches cores and invalidates caches, it is called cache thrashing. That is something that must be avoided.
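    You can watch the scheduler at work with ps: the psr field reports the logical CPU a process last ran on. This is a minimal sketch; the column selection is just an example:

```shell
# The PSR column shows which logical CPU a process last ran on. Running
# this a few times on a busy machine shows processes hopping between cores.
ps -o pid,psr,comm -p $$
```

    Repeating the command for a long-running, unpinned process makes core migrations directly visible.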

    Multi-socket, multi-core, and hyperthreading on Linux

    If you buy a system like X1 and execute htop on your Linux machine, you will see something like this:

    Thirty-two cores, but you now know that half of them are just hyperthreading cores. Furthermore, 16 of these are located on a different socket than the other 16. So, where should we put our 8-core application?

    The first hint is given by the following command (uninteresting output is cut):

    [root@X1 christian]# cat /proc/cpuinfo
    processor       : 0
    model name      : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
    physical id     : 0
    siblings        : 16
    core id         : 0
    cpu cores       : 8

    Processor is an ID to identify each logical core;

    CPU cores tells us that each socket has eight cores;

    Siblings tells us that hyperthreading is enabled. Hence, there are 16 siblings for eight cores;

    Physical ID is the socket number (0 and 1); and, finally,

    Core ID defines the core. Hyperthreading cores share the same core ID.
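    The same numbers can be pulled out of /proc/cpuinfo directly. A quick sketch (the values in the comments refer to the X1 box; yours will differ):

```shell
# Derive the topology numbers discussed above from /proc/cpuinfo.
grep "physical id" /proc/cpuinfo | sort -u | wc -l   # number of sockets (2 on X1)
grep -m1 "cpu cores" /proc/cpuinfo                   # physical cores per socket (8)
grep -m1 "siblings"  /proc/cpuinfo                   # logical CPUs per socket (16)
```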

    An even better overview is given by lscpu:

    [root@X1 christian]# lscpu --extended
    CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
    0   0    0      0    0:0:0:0       yes
    1   0    0      1    1:1:1:0       yes
    2   0    0      2    2:2:2:0       yes
    3   0    0      3    3:3:3:0       yes
    4   0    0      4    4:4:4:0       yes
    5   0    0      5    5:5:5:0       yes
    6   0    0      6    6:6:6:0       yes
    7   0    0      7    7:7:7:0       yes
    8   1    1      8    8:8:8:1       yes
    9   1    1      9    9:9:9:1       yes
    10  1    1      10   10:10:10:1    yes
    11  1    1      11   11:11:11:1    yes
    12  1    1      12   12:12:12:1    yes
    13  1    1      13   13:13:13:1    yes
    14  1    1      14   14:14:14:1    yes
    15  1    1      15   15:15:15:1    yes
    16  0    0      0    0:0:0:0       yes
    17  0    0      1    1:1:1:0       yes
    18  0    0      2    2:2:2:0       yes
    19  0    0      3    3:3:3:0       yes
    20  0    0      4    4:4:4:0       yes
    21  0    0      5    5:5:5:0       yes
    22  0    0      6    6:6:6:0       yes
    23  0    0      7    7:7:7:0       yes
    24  1    1      8    8:8:8:1       yes
    25  1    1      9    9:9:9:1       yes
    26  1    1      10   10:10:10:1    yes
    27  1    1      11   11:11:11:1    yes
    28  1    1      12   12:12:12:1    yes
    29  1    1      13   13:13:13:1    yes
    30  1    1      14   14:14:14:1    yes
    31  1    1      15   15:15:15:1    yes

    This shows that the first eight CPUs (0-7) are on the first socket, while CPUs 8-15 are on the second socket. CPUs 16-23 are on the first socket again, but carry the same core IDs as CPUs 0-7. Therefore, CPUs 0 and 16 "share" a physical core through hyperthreading; the same is true for CPUs 8 and 24, for example. You can even see that each core has a dedicated L2 cache, while the L3 cache is shared among cores but differs from socket to socket.
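    The sibling pairs are also exported via sysfs, which is handy in scripts. A short sketch; on X1, cpu0 would report that it shares a physical core with cpu16:

```shell
# Print each logical CPU's hyperthreading sibling set as reported by sysfs.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  printf '%s: ' "${cpu##*/}"
  cat "$cpu/topology/thread_siblings_list"
done
```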


    Testing the impact

    To test the real impact on our target application, we ran it in several different configurations.
    An important side note: Our application only supports contiguous CPU affinity by assigning a "start" CPU. As a result, it can be assigned to CPUs 1-8, but not to 1-4 and 9-12.

    Furthermore, our application spawns an "engine" for each CPU it utilizes. By default, there is no CPU affinity, which means the operating system assigns processes as it sees fit. Internally, tasks are assigned to each engine based on its workload.
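    For applications without such built-in affinity support, the same pinning can be done externally with taskset from util-linux. A sketch, with "sleep 2" standing in for the real application binary:

```shell
# Pin a process to logical CPUs 0-7 (the first socket on X1).
taskset -c 0-7 sleep 2 &
pid=$!
taskset -p "$pid"   # prints the current affinity mask of that process
wait "$pid"
```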

    Engines  Load    Start CPU        Execution time [s]
    1        12.5%   - (unassigned)   21
    8        100%    - (unassigned)   38-50
    1        12.5%   0 (1 socket)     21
    8        100%    0 (1 socket)     21
    8        100%    4 (2 sockets)    38-42
    8        100%    22 (2 sockets)   38-42

    Unfortunately, we were not able to assign cores 0-3 and 16-19 (one socket, using both hyperthreading siblings), due to the aforementioned restriction of the application. Since these logical CPUs map pairwise to the same four physical cores, we expect this configuration to perform even worse than the version spread over both sockets.

    This clearly shows that a lightly loaded system performs practically identically, regardless of core affinity. The reason is simple: each core on its own always delivers the same performance, and the application is evidently smart enough to keep a single task on the same engine until it finishes.

    Under full load, however, tasks are switched depending on the availability of engine resources. As a result, they migrate to another core, and possibly to another socket. Without assigned cores, a task may even land on the hyperthreading sibling of an already busy core.


    These results led us to choose the 8-core solution over the faster 6-core one. As a reader, you might now think that multi-socket CPUs or hyperthreading are always detrimental, but that is not the case. Here are some considerations that may help with such a decision:
    • Single-threaded processes run equally fast on each CPU. There is no need to worry about sockets or hyperthreading.
    • Multi-process applications benefit from manual core assignment if tasks are shared between cores.
    • Multi-process applications benefit from manual core assignment if the CPU supports hyperthreading and fewer than half of the logical cores are used.
    • If you cannot set your application's core affinity, and it uses half your available cores or less, it might even be a good idea to turn off hyperthreading via the BIOS.
    • Hyperthreading cores are, by no means, inferior to "normal" cores. In our example, Cores 0 and 16 will deliver exactly the same performance, so long as they are not both loaded completely.
    • If you can fit your application to a single socket, it is a good idea to assign it to the last available cores. This way, the first socket is free to process OS tasks like scheduling, interrupt processing, and everything else an operating system does in its free time.
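    Regarding the BIOS route mentioned above: on newer kernels (4.19+), SMT can also be inspected and toggled at runtime via sysfs. A sketch (toggling requires root, so it is commented out here):

```shell
# Inspect the kernel's SMT state without a trip to the BIOS.
cat /sys/devices/system/cpu/smt/active     # 1 if sibling threads are in use
cat /sys/devices/system/cpu/smt/control    # "on", "off", or "notsupported"
# echo off > /sys/devices/system/cpu/smt/control   # disable SMT (root only)
```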

    A good example is haproxy, which I described in an earlier post. It runs on two 12-core CPUs with hyperthreading, resulting in 48 logical cores. The application uses ten processes, all of which we pinned to cores 37-46. This guarantees the best performance and leaves room for the operating system to handle interrupts efficiently.
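    For haproxy specifically, such pinning can also be expressed directly in the configuration. A sketch of the global section, assuming the pre-2.0 nbproc process model and the 1.8+ "auto" cpu-map syntax (adjust to your version):

```
global
    nbproc 10
    # Map worker processes 1-10 onto logical CPUs 37-46.
    cpu-map auto:1-10 37-46
```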
    In general, it is a good idea to think about core affinity for CPU-intensive applications like databases, web servers, application servers, and the like. In most cases, it is quite easy to try new settings and perhaps to squeeze some extra performance from your CPUs.
