
Desirable properties of real-time OSes (Part B)

Jan 2, 1997 — by LinuxDevices Staff — from the LinuxDevices Archive

B. DESIRABLE PROPERTIES

As usual, there are conflicting desires — or at least desires that conflict given the current state of the art. These desires fall into the following categories:

  1. Quality of service
  2. Amount of code that must be inspected to assure quality of service
  3. API provided
  4. Relative complexity of OS and applications
  5. Fault isolation: what non-RT failures endanger RT code?
  6. What hardware and software configurations are supported?

Each of these categories is expanded upon below, and later used to compare a number of proposed realtime approaches for Linux. The discussion goes on for some time, which is not surprising given that it summarizes many hundreds of email messages. ;-)

  1. Quality of Service

    The traditional view is that the entire operating system is either hard realtime, soft realtime, or non-realtime, but this viewpoint is too coarse-grained. Different workloads have different needs, and there is disagreement over the exact definitions of these three categories of realtime. For example, (at least) the following two definitions of “hard realtime” are in use:

    1. In the absence of hardware failures, software provably meets the specified deadlines. This is fine and good, but many applications simply do not need this “diamond hard” realtime.
    2. Failure to meet the specified deadline results in application failure. This is OK, but -only- if there is a corresponding required probability of success. Otherwise, one could claim “hard realtime” by simply failing the application every time it tries to do anything, which is clearly not useful.

    A better approach is to simply specify the required probability of meeting the specified deadline in the absence of hardware failure. A probability of 1.0 is consistent with definition (1). Other applications will be satisfied with a probability such as 0.999999, which might be sufficiently high that the probability of software scheduling failure is “in the noise” compared with the probability of hardware failure. A recent LKML thread called this “metal hard” realtime. Or was it “ruby hard”? ;-)

    Of course, one can increase the reliability of hardware through redundancy, but no hardware configuration provides perfect reliability. For example, clusters can increase reliability: if “p” is the probability of a single node failing and “n” is the number of nodes, then the probability of all n nodes failing is p^n, so the cluster's reliability is 1 - p^n. Note that this expression never reaches 1, no matter how large “n” is. In addition, this mathematical expression assumes that the failover software is perfectly reliable and perfectly configured. This assumption conflicts sharply with my own experience, in which there has always been a point beyond which adding nodes -decreased- cluster reliability.

    The timeframe is also critically important. Any system can provide hard realtime guarantees if the deadline is an infinite amount of time in the future. No computer system that I am aware of at this writing is capable of meeting a 1-picosecond scheduling deadline for any task of non-zero duration, but then neither can dedicated digital hardware. Some applications have definite response-time goals, for example, industrial process-control applications tend to have response-time goals ranging from 100s of microseconds to small numbers of seconds. Other applications can benefit from any improvement in response-time goals — faster is better, think in terms of Doom players — but even in these cases there is normally a point of diminishing returns.

    The services used by the realtime application also figure in. Given current disk technology, it is not possible to meet a 100-microsecond deadline for a 1MB synchronous write to disk. Not even if you cheat and supply the disk with a battery-backed-up DRAM. However, many realtime applications need only a few of the services that an operating system might provide. This list might include interrupt handling, process scheduling, disk I/O, network I/O, process creation/destruction, VM operations, and so on. Keep in mind that many popular RTOSes provide very little in the way of services! They frequently leave the complex stuff (e.g., web serving) to general-purpose operating systems.

    Note that each service can have an associated deadline that it can meet. The interrupt system might be able to meet a 1-microsecond deadline, the real-time process scheduler a 10-microsecond deadline, the disk I/O system a 10-millisecond deadline for moderate-sized I/Os, and so on. The deadline that a service can meet might also depend on the parameters, so that the disk-I/O system would be expected to take longer for larger I/Os.

    Furthermore, the probability might vary from service to service or with the parameters to that service. For example, the probability of network I/O completing successfully in minimal time might well be a function of the number of packets transmitted (to account for the probability of packet loss) as well as of packet size (to account for bit-error rate). To make things even more complicated, the probability of meeting the deadline will vary depending on the length of time allowed. Considering the networking example, a very short deadline might not allow the data transmission to complete, even if it proceeds at wire speed. A longer deadline might allow transmission to complete, but only if there are no transmission errors. An even longer deadline might allow time for a limited number of retransmissions, in order to recover from packet loss due to transmission errors. Of course, a deadline infinitely far into the future would allow guaranteed completion, but I for one am not that patient.

    Finally, the performance and scalability of both realtime and non-realtime applications running on the system can be important. Given the current state of the art, one must pay a performance penalty for realtime support, but the smaller the penalty, the better.

    So, to sum up, here are the components of a quality-of-service metric for realtime OSes:

    1. List of services for which realtime response is supported.
    2. For each service:
      1. Probability of meeting the deadline despite software limitations, ranging from 0 to 1, with the value of 1 corresponding to the hardest possible hard realtime.
      2. Allowable deadline, measured from the time that the request is initiated to the time by which the response must be received.

    3. Performance and scalability provided to both realtime and non-realtime applications.

  2. Amount of Code Inspection Required

    So you add a new feature to a realtime operating system. How much of the rest of the system must you inspect and understand in order to be able to guarantee that your new feature provides the required level of realtime response? The smaller this amount of code, the easier it is to add new features and fix bugs, and the greater the number of people who will be able to contribute to the project. In addition, the smaller the amount of such code, the smaller the probability that some well-intentioned bug fix will break realtime response.

    Each of the following categories of code might need to be inspected:

    1. The low-level interrupt-handling code.
    2. The realtime process scheduler.
    3. Any code that disables interrupts.
    4. Any code that disables preemption.
    5. Any code that holds a lock, mutex, semaphore, or other resource that is needed by the code implementing your new feature.

    Of course, use of automated tools could make such inspection much more reliable and less onerous, but such tools would need to deal with the very large number of CPU architectures and configuration options that Linux supports. The smaller the amount of code that must be inspected, the less chance there is that such a tool will fall victim to configuration-architecture combinatorial explosion.

    Each of the Linux realtime approaches uses a different strategy to minimize the amount of code in these categories. These differences are surprisingly important, and will be discussed in more detail when going over the various approaches to Linux realtime.

  3. API Provided

    I never have learned to -really- like the POSIX API, with the gets() primitive being a particular cause of heartburn, but given the huge amount of software out there that relies on it and the equally huge number of developers who are familiar with it, one should certainly strive to provide it, or at least a sizeable subset of it.

    Other popular APIs include the various Java runtime environments, and of course the feared and loathed, but quite ubiquitous, Windows API.

    There are a lot of developers and a lot of software out there. The more of these existing developers and software your API supports, the more successful your realtime facility is likely to be.

  4. Relative Complexity

    How much realtime capability should be added to the operating system? How much of this burden should the applications take on? Is it better to push some of the complexity into a nanokernel, hypervisor, or other software or firmware layer? Let's first look at the tradeoff between OS and application.

    For example, although it is certainly possible to program for separate realtime and non-realtime operating-system instances, doing so adds complexity to the application. Complexity is particularly deadly in the hard realtime arena, and can be literally so if human lives are at risk.

    Balancing this consideration is the need for simplicity in the operating-system kernel. This balancing act must be carefully considered, taking both the relative complexities and the number of uses into account. Some would argue that it is worthwhile adding 1,000 lines to the OS if that saves 100 lines in each of 1,000 applications. Others would disagree, perhaps citing the greater fault isolation that might be provided by the separation.

    But this balance clearly must be struck somewhere between writing the application to bare metal on the one hand (but achieving a perfectly simple zero-size operating system) and bloating the operating system beyond the limits of maintainability on the other hand.

    Similar arguments can be made for moving some functionality into a hypervisor or nanokernel layer, though fault isolation also comes into play here.

    Many of the most vociferous arguments seem to revolve around this complexity issue.

  5. Fault Isolation

    Can a programming error in a non-realtime application or in a non-realtime portion of the OS harm a realtime application?

    Some applications do not care: in these cases, a failure anywhere causes a user-visible failure, so it is not important to isolate faults. Of course, even in these cases, it may be valuable to isolate faults in order to aid debugging, but, other than that, the fault isolation does not help overall application reliability.

    In other cases, the realtime portion of the application is protecting someone's life and limb, but the non-realtime portion is only compiling statistics and reports. In this case, fault isolation can be of the utmost importance.

    What sorts of faults need isolating?

    • Excessive disabling of interrupts
    • Excessive disabling of preemption
    • Holding a lock, mutex, or semaphore for too long, when that resource must be acquired by realtime code
    • Memory corruption, either via wild pointers or via wild DMA

    These faults might occur in the main kernel, in a loadable module, or in some debugging tool, such as a kprobe procedure or a kernel-debugger breakpoint script, though in the latter case perhaps realtime deadlines should not be guaranteed while actively debugging. After all, straightforward debugging techniques, such as use of printk(), can cause response-time problems even in non-realtime environments.

  6. Hardware and Software Configurations

    Is SMP required? If so, how many CPUs? How many tasks? How many disks? How many HBAs?

    If all the code in the kernel were O(1), it might not matter, but the Linux kernel has not yet reached this goal. Therefore, some applications may choose to restrict the software or the hardware configuration of the platform in order to meet the realtime deadlines. This approach is consistent with traditional RTOS methodology — RTOS vendors have been known to restrict the configurations in which they will support hard realtime guarantees.




 
This article was originally published on LinuxDevices.com and has been donated to the open source community by QuinStreet Inc. Please visit LinuxToday.com for up-to-date news and articles about Linux and open source.


