Why is Malloc Different Under uClinux?

May 30, 2002 — by LinuxDevices Staff — from the LinuxDevices Archive — 208 views

Firstly, under uClinux we don't have Virtual Memory (VM). This means you cannot arbitrarily add memory to an already running process. As VM is usually implemented using a processing unit called a MMU (memory management unit) you will often hear the term NOMMU when traveling in uClinux circles. Let's discuss at a very high level how not having an MMU or VM affects malloc.

With VM, all processes run at the same address (albeit a virtual one) and the VM system takes care of what physical memory is actually mapped to these locations. So even though the virtual memory that the process sees is contiguous, the physical memory it occupies can be scattered about. Some of it may even be on a hard disk in swap!

Without VM, each process must be located at a place in memory where it can be run. In the simplest case, this area of memory must be contiguous, and generally cannot be expanded as there may be other processes above and below it.

With respect to the original topic, malloc is normally implemented at the low level using the sbrk/brk system calls which increase/change the size of a process's address space. The malloc library then manages the extra memory obtained by calling sbrk() on behalf of the application. When it runs out of memory it can get more by calling sbrk() again, it can also decrease memory using brk(), although very few malloc implementations do this. sbrk() works by adding more memory to the end of a process (increasing its size). brk() can arbitrarily set the end of the process to be closer to the start of the process (reduce the process size) or further away (increase the process size). The standard operation of sbrk()/brk() is to increase/decrease a processes size.

Since sbrk/brk cannot increase a processes size under uClinux, malloc needs to undergo some changes at least at the low level in order to work.

There are many possibilities here. Some obvious ones are:

Do not allow processes to allocate dynamic memory.
Allocate a heap per process that sbrk/brk can operate on. The heap size cannot be changed by the process and is allocated at process startup.
Let processes allocate memory from the global pool (system wide) of free memory.

Simply put, (1) is just not useful. Too much application software assumes the ability to allocate memory and preventing it just increases the effort required to utilize the vast amount of Open Source software (OSS).

Option (2) has its merits. It limits the amount of memory a process can use (which can be considered a good thing). But it also means that the memory for the heap is always allocated, even if it is only used temporarily. The heap space has to be allocated at process start time so that the semantics of sbrk/brk still hold and the normal malloc library can be used.

Finally we come to (3). There are pitfalls here. A run away process can use all of the available memory for one. Allocating from the system pool is not compatible with sbrk/brk as they require memory to be added to the end of a processes address space. Thus a normal malloc implementation is no good and a new implementation is needed. The advantage of this method is that only the amount of memory actually required is used. Memory can be returned to the global pool as soon as it's finished with and the implementation can take advantage of the existing kernel allocator for managing this memory.

Currently method 3 is used by uClinux. The simplest malloc implementation calls mmap to obtain memory from the kernel's free memory pool and munmap to return memory to the kernel's free memory pool. This gives a very small malloc implementation (1 system call).

Problems encountered with such a simple malloc implementation include:

The overheads of using mmap under uClinux are about 56 bytes per allocation. This turns out to be extremely bad for applications that do a large number of small allocations (like the Zebra routing daemon). Under uClinux, mmaps from user programs are simply allocated from the pool of free memory using kmalloc, the kernel allocator, and then chained into a linked list attached to the process. The 56 bytes above comes from the kmalloc overhead plus the linked list overheads. Not only is the 56 bytes significant, but a large number of small allocations will result in quite a long list leading to slower deallocations and reallocations (new allocations are put directly at the head of the list).
The standard kernel allocator allocates only power-of-two sized quantities up to 1MB, which is inefficient and limiting. To understand the ramifications of this, especially for large allocations, consider that a 33K allocation is rounded to the next power of two, 64K!

Several small malloc implementations have been created to alleviate point 1 by reducing the effect of the 56 byte overhead by allocating larger blocks and then managing those blocks internally for better results.

The kernel allocator has been modified to increase the maximum allocation size substantially. In some cases this is done by a kernel config option. This allows for larger allocations and thus larger programs.

An alternative kernel allocator has been added to uClinux that no longer enforces power of two allocations, thus eliminating the wastage in allocations quite substantially. This allocator is commonly known as Kmalloc2 and it can substantially reduce the overheads for memory allocation in a NOMMU environment and increase the amount of free memory available for other tasks.

The standard kernel allocator only allocates memory based on a power of two. For example, if you need 12000 bytes, it will allocate 16KB, and you cannot make use of the extra 4KB that was allocated. This can be very expensive when you start an application. For example an application of 130KB in size will actually need 256KB just to run.

Kmalloc2 addresses this by using a power-of-two allocator for allocations up to 1 page in size (a page is 4096 bytes, or 4KB). It then allocates memory rounded up to the nearest page. So for the previous example, an application of 130KB will actually have 132KB allocated to it. A saving of over 124KB on the standard kernel allocator.

Kmalloc2 also takes steps to avoid fragmenting memory. It allocates all amounts of 2 pages (8KB) or less from the end of memory down, and all larger amounts from the start of free memory up. This stops transient allocations for network buffers and so on fragmenting memory and preventing large apps from running.

Kmalloc2 is not perfect, it is quite susceptible to fragmentation of memory, although it could be argued that the standard allocator is more susceptible. Kmalloc2 works well in practice as the embedded environments that run uClinux tend to have a relatively static group of long lived applications.

Now we have discussed the kernel level options for memory allocation a little, lets look at the options available to user programs. Due to the issues mentioned thus far, there are quite a few solutions available for use with user programs. Each has its place.

For a start there are a choice of “libc”s — a topic for another discussion. The choice of “malloc”s usually depends on the libc you are using: uC-libc or uClibc. Both offer a simple allocator, malloc-simple.

malloc-simple uses mmap and munmap to let the kernel actually handle the requests for memory. The implementation is trivial and the code size is negligible, so the cost of including it in an application is very small.

The problem with malloc-simple is, as mentioned earlier, that the kernel overhead on an mmap based allocation is about 56 bytes. So if you have an application that does a lot of small allocations, its memory usage will be quite high. For example, say your program allocates 1000 10 byte amounts, that's a total of 10000 bytes of memory that is required. Because of the 56 byte overhead, it will actually allocate 66000 bytes, a 560% increase over what was actually required. Zebra, a routing daemon, is an example of an application that suffers quite badly from the small allocation problem as it allocates command data structures as it starts (it has a large number of commands/keywords!).

uC-libc used to offer a libsmalloc, a version of malloc that specializes in low overhead small allocations specifically for applications like Zebra. This version of malloc has since been merged with malloc-simple as the extra code required is no longer a significant overhead when using shared libraries for uClinux.

uClibc offers a few choices as well:

malloc — A hunk based allocator that appears to have NOMMU support, although it uses mmap functionality not available on NOMMU systems. At this point it appears that this allocator will not run on NOMMU systems. It may be relatively simple to fix this, and if so, it should have some potential.
malloc-930716 This allocator will only work on MMU systems. It relies on the brk/sbrk calls. Although these will return a small amount of memory on uClinux systems, it is useless as the sole source of memory for an allocator.
malloc-simple The most simple allocator, works on both MMU and NOMMU systems.

Generally the more complex mallocs deliver faster allocations and are more efficient for small allocations. They also add considerable code size to small applications such as one would find in a uClinux environment.

There are a few choices for malloc. Which one should you use? malloc-simple is generally a good default malloc to use. Then you can choose a better malloc for the individual applications that need it. As you can see above, your choices are limited if you want something that works out of the box.

Inevitably, despite your choice of kernel/user allocations, you will run out of memory. One of the common problems new users encounter is the “missing memory” problem. The system is showing a large amount of free memory but my application cannot allocate a buffer of size X. The problem here is memory fragmentation, and all of the solutions available at this time suffer from it. Because of the lack of VM in the uClinux environment it is near impossible to fully utilize memory due to fragmentation. This is best explained by example. Suppose your system has 500KB of memory free and you want to allocate 100KB. It is easy to think that this would be possible. However, it is important to remember that you must have a contiguous 100KB of memory in order to satisfy the allocation. Suppose the memory map looks like this. Each character represents approximately 20KB. X marks areas that are allocated or in use by other programs or the kernel:

 
 0    100   200   300   400   500   600   700   800   900  1000 
-+—–+—–+—–+—–+—–+—–+—–+—–+—–+—–+-
 |XXXXX|XXXXX|—XX|–X–|-X—|XX—|-X—|-XX–|-X—|XXXXX|

As you can see, there is 500KB free, but the largest contiguous block is only 80KB. There are many ways to arrive at such a situation. A program that allocates some memory, then frees most of it leaving a small allocation in the middle of a larger free block is often the cause. Transient programs under uClinux can also affect where and how memory is allocated.

The question is often asked, why can't this memory be defragmented. The problem is that we don't have VM and we cannot move memory that is being used by programs. Programs usually have references to addresses within the allocated memory regions and without VM to make the memory always appear to be at the correct address, the program will crash if we move its memory. There is no solution to this problem under uClinux. The developer needs to be aware of the problem and, where possible, try to utilize smaller blocks of memory.

Memory allocation under uClinux, as we have seen, is quite similar to memory allocation under normal Linux but has its own set of quirks and shortfalls. Further progress will no doubt be made on the memory allocation front now that uClinux is enjoying its first shared library implementations. This reduces the requirement to keep the malloc implementation as small as possible as it can live in a shared library. This will lead to larger, but more efficient user level malloc implementations that help to reduce the problems of fragmentation and malloc overheads.

References

Here are references from the uClinux-distribution sources available for download at www.uclinux.org . . .

Kernel allocators and mmap implementations:

linux-2.4.x/mmnommu/slab.c
linux-2.4.x/mmnommu/page_alloc.c
linux-2.4.x/mmnommu/page_alloc2.c
linux-2.4.x/mmnommu/mmap.c
linux-2.0.x/mmnommu/kmalloc.c
linux-2.0.x/mmnommu/page_alloc.c
linux-2.0.x/mmnommu/kmalloc2.c
linux-2.0.x/mmnommu/page_alloc2.c
linux-2.0.x/mmnommu/mmap.c

uC-libc malloc implementations:

lib/libc/malloc/alloc.c
lib/libc/malloc-simple/alloc.c

uClibc malloc implementations:

uClibc/libc/stdlib/malloc
uClibc/libc/stdlib/malloc-930716
uClibc/libc/stdlib/malloc-simple

Further information on uClinux

About the author: David McCullough is kernel engineer at SnapGear who is currently the maintainer of uClinux's ColdFire distributions, and who has contributed patches to upgrade uClinux to the Linux 2.4 kernel.

This article was originally published on LinuxDevices.com and has been donated to the open source community by QuinStreet Inc. Please visit LinuxToday.com for up-to-date news and articles about Linux and open source.

Comments are closed.

Pages

Archives

Categories

Why is Malloc Different Under uClinux?

Related Posts: