
RTLinuxPro CPU reservation technology

May 7, 2004 — by LinuxDevices Staff — from the LinuxDevices Archive

Foreword — This whitepaper from FSMLabs engineer Matt Sherer describes the “CPU reservation” capabilities of RTLinuxPro, a dual-kernel real-time Linux operating system from FSMLabs. RTLinuxPro comprises a tiny real-time kernel running a general purpose operating system — either Linux or BSD — as its idle task. Enjoy!


RTLinuxPro CPU Reservation Technology

Some real-time OS vendors in the Linux arena concentrate on interrupt response time and processor reservation as an approach to real-time Linux. The result is a minimally functional environment that provides few services and still no hard guarantees. This article shows how easy RTLinuxPro makes processor reservation and interrupt control for the user, while providing a full POSIX RTOS environment to back it up.

1. Some background

Hard real-time systems used to be single-purpose, non-integrated, 'blinking box in the corner' applications. Linux systems, on the other hand, offer some of the best integration possibilities currently available in computing. When the two meet, the result is a fully capable hard real-time operating system that integrates securely with the rest of the network, as quickly and easily as any machine with a web connection.

RTCore, the OS that provides the hard real-time environment for RTLinuxPro and RTCoreBSD, is a perfect example of this. In addition to supporting the broad range of Linux applications, RTCore provides hard real-time interrupt handlers, threads, signals, mutexes, semaphores, interprocess communication, and more, all in a POSIX[2] environment. Essentially, it provides what you would expect from an RTOS.

Other approaches to achieving this goal have met with very limited results, to the point of scaling their focus back to merely improving interrupt response. While this is admirable, the result is marginal – an improved interrupt response, under a specific Linux version on a specific architecture, still with no hard guarantees. The OS services available to programmers using this method are very limited. This may suffice for the most basic applications, such as one consisting entirely of high-frequency sampling and an immediate, simple response to that input. FSMLabs' experience with hard real-time customers, on the other hand, has shown real-world application domains to be anything but simple[3].

In this article, I'll focus on the processor reservation features of RTCore. To level the field, let's ignore for now that RTCore provides all of the common OS services users expect, and focus entirely on processor reservation. Let's see how well RTCore handles it.

2. Directing threads at a specific CPU

Rather than dragging on the discussion, let's look at an example of a real-time thread that specifically places itself on a given CPU.


#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <time.h>

static pthread_t thread;

/* Periodic real-time thread: wakes up once per millisecond */
void *thread_code(void *t) {
    struct timespec next;

    clock_gettime(CLOCK_REALTIME, &next);
    while (1) {
        timespec_add_ns(&next, 1000*1000);  /* advance by 1ms */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                        &next, NULL);
    }
    return NULL;
}

int main(void) {
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setcpu_np(&attr, 0);  /* place the thread on CPU 0 */
    pthread_create(&thread, &attr, thread_code, 0);

    /* Suspend until the user or system signals an exit */
    rtl_main_wait();

    pthread_cancel(thread);
    pthread_join(thread, NULL);
    return 0;
}

Nothing really new here – a real-time thread is spawned, just like in any other POSIX application. The difference is the single line of code pthread_attr_setcpu_np(&attr, 0); it tells RTCore that when the thread is created, it should be placed on CPU 0. It will never move from this processor. The reason is that when a thread is allowed to migrate from one processor to another, cache effects can hurt real-time performance. By knowing which threads are on which CPU, transient effects are minimized, performance is much higher, and the RTOS's scheduling overhead is minimized.

Something else that may be new is timespec_add_ns(&next, 1000*1000), a convenience function that adds one millisecond to the timespec structure. (POSIX does not define it, so users normally have to normalize the values by hand.) Users familiar with RTLinux should not be surprised – it has been there for years. Users of RTLinuxPro should also recognize rtl_main_wait() – an event loop handler that suspends the program until it receives an exit signal from the user or system, similar to an event loop entry for GUI applications.
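For illustration, here is a minimal sketch of the hand normalization that timespec_add_ns saves you from writing. The helper name add_ns is ours, not part of any API; it assumes a plain POSIX struct timespec:


/* Hypothetical helper: adding nanoseconds to a timespec by hand,
   carrying any overflow into whole seconds. */
static void add_ns(struct timespec *ts, long ns) {
    ts->tv_nsec += ns;
    while (ts->tv_nsec >= 1000000000L) {
        ts->tv_nsec -= 1000000000L;  /* carry one second */
        ts->tv_sec++;
    }
}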

Now that we have a thread on a specific CPU, we can reserve that processor away from the General Purpose OS (GPOS). All tests in this article were done with Linux as the GPOS.

3. Reserving the processor

This may seem anticlimactic – the title of the article implies that processor reservation is the entire focus, and therefore that it must be detailed, involved, and difficult. Actually, it takes only one line of code – just the following modification to the example above:


...
pthread_attr_setcpu_np(&attr, 0);
pthread_attr_setreserve_np(&attr, 1);
pthread_create(&thread, &attr, thread_code, 0);
...

The call pthread_attr_setreserve_np(&attr, 1) sets a boolean attribute for the thread. When the thread is spawned on the intended processor, the GPOS is prevented from running there again until that thread exits.

That's all there is to it. Once this code executes, the GPOS can't run on that processor; it is reserved specifically for that thread (and any other real-time threads placed on that CPU). In some environments, this allows entire real-time applications to live directly in the processor cache. Since the processor doesn't have to go out to RAM for code, performance is right up against its limit. And because the GPOS never executes there, nothing can push the real-time code out of the cache, so the cache remains filled with real-time code. (Linux is fairly heavy, and can have a significant cache footprint when it gets time to run.)

4. Interrupt control

In the above code, the act of reserving a CPU for real-time code has the side effect of redirecting any GPOS interrupts in use on that CPU to another processor. Because Linux is never going to run code there again, it cannot receive interrupts there either, so hardware such as Ethernet devices must get signals to Linux via another CPU. This leaves the reserved processor relieved of all interrupts other than those needed to drive the real-time threads.

Now that the processor is entirely under real-time control, we can focus any real-time interrupts we're interested in back to that processor. As with the threads, the result is better performance due to controlled cache usage, and a minimally active interrupt controller that is only concerned with a small set of interrupts.

Let's look at how to deal with this refocusing – since POSIX doesn't deal with interrupt control at all, there are specific RTCore functions for it. First, we add a 'counting' interrupt handler at the top of the file:


...
static int interrupt_count = 0;
/* Count interrupts on this line, then pass them on to the GPOS */
unsigned int interrupt_handler(unsigned int irq,
                               struct rtl_frame *regs) {
    interrupt_count++;
    rtl_global_pend_irq(irq);  /* pend the IRQ for the GPOS */
    return 0;
}
...

And in the thread code, we focus the interrupt back to CPU 0:


...
static unsigned long affinity = 1;  /* bit 0 set: direct at CPU 0 */
static unsigned long old_affinity;

void *thread_code(void *t) {
    struct timespec next;

    rtl_request_irq(4, interrupt_handler);
    rtl_irq_set_affinity(4, &affinity, &old_affinity);
...

Here we have a very basic interrupt handler that simply increments a count of interrupts received on interrupt line 4. (It also pends the interrupt for further processing in case the GPOS is interested in it, although we're not concerned with that here.) The handler is attached to the line with the rtl_request_irq() function. On cleanup, the handler should be released and the previous affinity mask restored:


...
pthread_join(thread, NULL);
rtl_irq_set_affinity(4, &old_affinity, NULL);  /* restore the old mask */
rtl_free_irq(4);
...

In the thread's startup code, the first affinity call makes sure that interrupt 4 is focused on processor 0 (the affinity parameter is a CPU bitmask, so a 1 in bit 0 means the interrupt should be directed at CPU 0). The previous affinity mask is stored in the old_affinity parameter, so that it can be restored when the application unloads.
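Since the mask is a plain bitmask, directing the interrupt at a different CPU is just a matter of setting the corresponding bit. A minimal sketch, with CPU 2 chosen arbitrarily for illustration:


/* A 1 in bit n directs the interrupt at CPU n. */
static unsigned long affinity = 1UL << 2;  /* CPU 2 */
...
rtl_irq_set_affinity(4, &affinity, &old_affinity);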

5. Real results

Now for some results. For all of the tests done here, we only report worst cases. Average cases are usually much lower, but they have no meaning in hard real-time applications. If you see vendors quoting average cases or showing 'improved' event distribution, either they're trying to make their numbers look good, or it's a soft real-time system. FSMLabs believes that if code needs to be absolutely deterministic, it lives in the hard real-time system, where the worst case is the boundary. If not, it gets pushed out to the GPOS, which is usually non real-time or soft real-time at best.

First off, we'll modify our test application to track the worst case delay between the time it was supposed to run and the time it actually ran. This involves adding the following piece of code to the thread, along with new timespec declarations (shown after the fragment) to track the sample and the worst case:


...
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                &next, NULL);
clock_gettime(CLOCK_REALTIME, &sample);
timespec_sub(&sample, &next);  /* deviation from the intended wakeup */
if (timespec_gt(&sample, &worst)) {
    worst.tv_sec = sample.tv_sec;
    worst.tv_nsec = sample.tv_nsec;
    printf("Worst case - %lds, %ldns\n",
           worst.tv_sec, worst.tv_nsec);
}
...
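The fragment assumes two timespec declarations in the thread, matching the names used there – sample for the measured wakeup time and worst for the tracker mentioned above:


struct timespec sample;            /* actual wakeup time */
struct timespec worst = { 0, 0 };  /* worst deviation seen so far */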

Each test will track this value and print new worst cases as they appear. Since the interest here is in processor reservation and interrupt control, tests are done under RTCore only; stock Linux tests would be unfair, as Linux does not pretend to provide hard real-time guarantees (although it can make a good rough soft real-time effort), nor does it provide user interfaces for interrupt control, real processor affinity, or strict reservation.

The results provided here are intended to demonstrate the simplicity and performance that RTCore provides out of the box. FSMLabs welcomes you to compare these results against any other RTOS vendor's solution, Linux or otherwise.

5.1 Real-time code without processor reservation

First, let's look at the test we've shown above without processor reservation. This involves a real-time thread on a millisecond period, tracking its worst case deviation. (Test hardware is an aging 1.2GHz dual Athlon.)

The timer interrupt will indirectly be used to measure interrupt latency. The latency of the timer interrupt, the time to get into the scheduler, and the time to get into thread code all contribute to the overall delay, so the worst case measured in the test is the sum of all three factors. The worst case interrupt latency alone is therefore guaranteed to be lower than the final result.

After running the test for several days under extreme disk load, VM pressure, network load, and heavy developer use, the worst case value is 16.299us. Comparing this number to our next result, it's easy to see that cache thrashing by Linux affects the system, forcing it to reload from RAM in many cases. In fact, measurements have shown that using NetBSD as the GPOS results in lower worst case numbers, as it is generally lighter and less likely to walk through as much of the cache as Linux when it is allowed to run.

5.2 Real-time code with processor reservation

The same test was rerun under the same set of conditions and for the same duration with processor reservation enabled. The worst case improved markedly, to 2.079us. Heavy IDE usage, Ethernet traffic, and other interrupt load have no effect on the system, as they have been focused away. In general, the worst case value is found immediately, as the cache effects are worked out during the initial load.

That's all there is to the test – no strange setups or considerations to keep in mind, no statistics, rules, or anything else – just a worst case number. Place a periodic real-time thread on a reserved processor, measure how much its scheduling deviates, and print the worst case value. RTCore provides 2.079us as a worst case here, and less on newer hardware, while still providing all of the normal services required of a Linux system.
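For reference, pulling the fragments above together, the reservation test looks roughly like this (same calls as shown earlier; error checking omitted):


#include <stdio.h>
#include <pthread.h>
#include <time.h>

static pthread_t thread;

void *thread_code(void *t) {
    struct timespec next, sample;
    struct timespec worst = { 0, 0 };

    clock_gettime(CLOCK_REALTIME, &next);
    while (1) {
        timespec_add_ns(&next, 1000*1000);  /* 1ms period */
        clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME,
                        &next, NULL);
        clock_gettime(CLOCK_REALTIME, &sample);
        timespec_sub(&sample, &next);       /* deviation */
        if (timespec_gt(&sample, &worst)) {
            worst = sample;
            printf("Worst case - %lds, %ldns\n",
                   worst.tv_sec, worst.tv_nsec);
        }
    }
    return NULL;
}

int main(void) {
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setcpu_np(&attr, 0);      /* pin to CPU 0 */
    pthread_attr_setreserve_np(&attr, 1);  /* reserve CPU 0 away from the GPOS */
    pthread_create(&thread, &attr, thread_code, 0);

    rtl_main_wait();

    pthread_cancel(thread);
    pthread_join(thread, NULL);
    return 0;
}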

5.3 Reducing it further

In some applications, even 2us is too much variation in scheduling control. For these situations, RTCore provides an advance timer feature, which informs the scheduler of how much inherent hardware jitter to account for. In this case, we know there can be as much as 2.079us of jitter to absorb. So, we make the following modification:


...
struct timespec worst = { 0, 0 };
struct timespec advance = { 0, 2500 };  /* account for 2.5us of jitter */
void *thread_code(void *t) {
...

This new advance structure says that the worst case jitter we are accounting for on this hardware is 2.5us. (We could place it right at the 2079ns value, if needed.) Next, the code tells the scheduler to account for this when scheduling the thread:


...
timespec_add_ns(&next, 1000*1000);
clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME|TIMER_ADVANCE,
                &next, &advance);
clock_gettime(CLOCK_REALTIME, &sample);
...

Running the same test under the same load as before, the worst case value we now see is 143 nanoseconds – about a tenth of a microsecond.

The worst case started at 16us, dropped to about 2us with processor reservation, then fell to about a tenth of a microsecond with three lines of code dedicated to advance scheduling. The downside of the advance method is that the scheduler is active during a portion of that advance window, which costs a few extra cycles; the benefit is that the requested thread activity can occur within a few nanoseconds of the intended timeslice.

6. Hyperthreading

Intel's Hyper-Threading (HT)[1] presents some issues for real-time processor reservation. HT essentially places multiple logical processors within the same physical core, so to a GPOS like Linux there appear to be two processors when in reality there is only one.

Logically, this makes sense – the CPU hides the complexity. However, the processor core is still sharing some internal resources, rather than duplicating everything. For Linux, this doesn't matter too much – for some applications there is a net gain, for others a loss. From a real-time perspective, a thread running on logical processor 0 (physical processor 0) may end up stalling slightly while waiting for the chip to handle work for Linux on logical processor 1 (still physical processor 0).

The response is still deterministic, but the jitter is higher – rather than a worst case of 15us on a given board, HT may drive it up to 35us. Processor reservation can bring that number back under control, though. With two HT processors (effectively four logical CPUs), it's a matter of spawning a reserved real-time thread on logical processor 0, and another on logical processor 1 that blocks on a semaphore or does other real-time work out of phase with the first, as sketched below. This still leaves two logical processors (one full physical processor) for Linux, so you still get SMP Linux and good response on the other HT processor.
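A minimal sketch of that arrangement, using the same non-portable calls shown earlier. The spawn_reserved helper and the standby thread are our illustration, not RTCore API:


#include <pthread.h>
#include <semaphore.h>

static sem_t wake;
static pthread_t rt0, rt1;

extern void *thread_code(void *t);  /* the periodic worker from before */

/* Standby thread: keeps the sibling logical CPU reserved but quiet,
   blocking until out-of-phase real-time work is handed to it. */
static void *standby(void *t) {
    while (1) {
        sem_wait(&wake);
        /* ... out-of-phase real-time work ... */
    }
    return NULL;
}

/* Hypothetical helper: spawn a thread pinned to, and reserving, one CPU */
static void spawn_reserved(pthread_t *t, int cpu, void *(*fn)(void *)) {
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setcpu_np(&attr, cpu);
    pthread_attr_setreserve_np(&attr, 1);
    pthread_create(t, &attr, fn, NULL);
}

int main(void) {
    sem_init(&wake, 0, 0);
    spawn_reserved(&rt0, 0, thread_code);  /* logical CPU 0 */
    spawn_reserved(&rt1, 1, standby);      /* logical CPU 1, same physical core */

    rtl_main_wait();
    return 0;
}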

7. In the field

Talking about getting better performance is useless if it doesn't provide a real benefit in the field. FSMLabs' customers have proven that there is a real need and use for this technology. Our customers are involved in large scale engine test stands, astronomy control systems, manufacturing inspection systems, and other performance-critical markets.

When a hard real-time application is running close to the limits of the hardware, merely allowing Linux to run as the idle task is enough to corrupt the cache and push the real-time code past its worst case response time. By reserving the processor away from Linux, interrupt response is tighter, cache hits increase, and code performance is right up at the hardware's limits. By scheduling in advance, inherent hardware delays are minimized, allowing us in this case to hit our deadline within just a few nanoseconds.

It's important to note, though, that RTCore backs all of this up with a fully POSIX RTOS. Other solutions on the market can improve interrupt response to some degree, but provide no hard guarantees and few services beyond that interrupt handler. Even the basic example above, with a single real-time thread, is more than an interrupt-only approach can provide. When you're writing a real-time application, which would you rather have – a simple interrupt hook, or a real RTOS backing you up?

References

Copyright © FSMLabs, Inc. All rights reserved. Reproduced by LinuxDevices.com with permission.


 