© 2000 Andrea Arcangeli, andrea@suse.de, SuSE
v1.0 3 April 2000, Imola (Italy)
This document is the reference for my talk about the 2.3.x development that will take place at the Free Software Days in Strasbourg, France on 12 and 13 May 2000. The object of this document is to explore a few of the new exciting features that everybody will enjoy in the upcoming 2.4.x Linux kernels. This document can be redistributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
The elevator is a queue where the I/O requests get ordered according to their sector on disk. The object of this ordering is to optimize away disk seeks when possible in order to improve I/O performance. Almost every normal hard disk benefits from avoiding seeks (I would guess the elevator also improves the lifetime of the hard disk).
Suppose there are three processes A, B and C. Assume the three processes do their I/O in this order:
time | process   I/O sector
-----|--------   ----------
  0  |    A          10
  1  |    B          14
  2  |    C          12
  3  |    A          11
  4  |    B          15
  5  |    C          13
Without the elevator the disk would move this way:
(graph: sector on the vertical axis, 10 to 15; time on the horizontal axis. The disk seeks back and forth, visiting the sectors in request order: 10, 14, 12, 11, 15, 13.)
Thanks to the elevator inside the Linux operating system the disk instead moves this way:
(graph: sector on the vertical axis, 10 to 15; time on the horizontal axis. The disk moves monotonically through sectors 10, 11, 12, 13, 14, 15.)
and no seek happens (the hard disk doesn't move back and forth) and the I/O completes much faster.
Elevator starvation means that the elevator, due to its reordering, could delay one of the I/O requests for an indefinitely long time. This is what always happened in Linux before the upcoming 2.4.x.
Assume there are two processes A and B. Process A continuously writes 2 terabytes of data to disk starting from sector 0. This is how the Linux hard disk queue gets filled by the elevator under the load generated by process A (for simplicity assume the queue holds at most 10 I/O requests; the left of the queue is where the I/O controller consumes requests from):
sector  |  0   1   2   3   4   5   6   7   8   9
process |  A   A   A   A   A   A   A   A   A   A
After some time the queue will look like this:
sector  |100 101 102 103 104 105 106 107 108 109
process |  A   A   A   A   A   A   A   A   A   A
Now the I/O controller finishes processing the request on sector 100, the queue gets shifted left by one request, and so there's now space for a new request in the queue:
sector  |101 102 103 104 105 106 107 108 109
process |  A   A   A   A   A   A   A   A   A
But at this point it's process B that tries to read one single sector (a mere 512 bytes) at a 3-terabyte offset:
sector  |101 102 103 104 105 106 107 108 109 6442450944
process |  A   A   A   A   A   A   A   A   A      B
Later the I/O controller completes request 101 as well:
sector  |102 103 104 105 106 107 108 109 6442450944
process |  A   A   A   A   A   A   A      B
At this point process A is able to queue a new request. Precisely, the next request process A wants to queue corresponds to sector 110. The old elevator correctly notices that 109 < 110 < 6442450944 and so it inserts the new request in between:
sector  |102 103 104 105 106 107 108 109 110 6442450944
process |  A   A   A   A   A   A   A   A   A      B
Then the request for sector 102 gets completed too and request 111 gets queued, again in order:
sector  |103 104 105 106 107 108 109 110 111 6442450944
process |  A   A   A   A   A   A   A   A   A      B
As you can see above, with the old elevator in Linux 2.2.x the request queued by process B was delayed indefinitely and process B stalled completely waiting for its request to be completed. Basically process B had to wait for process A to finish writing 2 terabytes of data before being able to read 512 bytes from disk. Writing 2 terabytes of data takes around 29 hours at a throughput of 20Mbyte/sec (a fast hard disk). That's why Linux has never been able to remain responsive during heavy I/O.
To overcome the longstanding starvation problem the I/O elevator has been rewritten, turning it into a real I/O scheduler that enforces a bounded per-request latency.
In 2.4.x each I/O queue has a sequence number (q->elevator.sequence) that gets increased each time a new request is queued. Each request in the queue has a sequence number too (req->elevator_sequence). The per-request sequence number gets initialized at insertion time to the queue sequence number plus a number that indicates the max latency of the request, where max latency means how many other requests are allowed to pass this one. No request can pass a request whose sequence number is lower than the current queue sequence number: such a timed-out request acts as a barrier and the elevator doesn't delay it any further. This fixes the starvation completely.
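To make the barrier rule concrete, here is a minimal sketch of how the two sequence-number helpers used by the code below can be implemented (the real definitions live in include/linux/elevator.h; treat this as indicative rather than authoritative):

	/* sequence number assigned to a new request: current queue sequence
	   plus the residual latency (how many requests may still pass it) */
	#define elevator_sequence(elevator, latency)	((latency) + (elevator)->sequence)

	/* true if sequence number a expired before b; written as a difference
	   so the comparison survives the wraparound of the counters */
	static inline int elevator_sequence_before(int a, int b)
	{
		return (a - b) < 0;
	}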
This is the implementation of the new elevator I/O scheduler (taken from 2.3.99-pre3, drivers/block/elevator.c):
/*
 *  linux/drivers/block/elevator.c
 *
 *  Block device elevator/IO-scheduler.
 *
 *  Copyright (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 */

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/blk.h>
#include <asm/uaccess.h>

static void elevator_default(struct request * req, elevator_t * elevator,
			     struct list_head * real_head,
			     struct list_head * head, int orig_latency)
{
	struct list_head * entry = real_head, * point = NULL;
	struct request * tmp;
	int sequence = elevator->sequence;
	int latency = orig_latency -= elevator->nr_segments, pass = 0;
	int point_latency = 0xbeefbeef;

	while ((entry = entry->prev) != head) {
		if (!point && latency >= 0) {
			point = entry;
			point_latency = latency;
		}
		tmp = blkdev_entry_to_request(entry);
		if (elevator_sequence_before(tmp->elevator_sequence, sequence) || !tmp->q)
			break;
		if (latency >= 0) {
			if (IN_ORDER(tmp, req) ||
			    (pass && !IN_ORDER(tmp, blkdev_next_request(tmp))))
				goto link;
		}
		latency += tmp->nr_segments;
		pass = 1;
	}

	if (point) {
		entry = point;
		latency = point_latency;
	}

 link:
	list_add(&req->queue, entry);
	req->elevator_sequence = elevator_sequence(elevator, latency);
}

[..]
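For reference, the two-way ordering used by IN_ORDER() above only compares device and sector. A sketch modeled on the IN_ORDER macro of drivers/block/blk.h in that kernel era (indicative, not a verbatim copy):

	/* s1 should be serviced before s2 if it sits on a lower device, or on
	   the same device at a lower sector */
	#define IN_ORDER(s1,s2)						\
		((s1)->rq_dev < (s2)->rq_dev ||				\
		 ((s1)->rq_dev == (s2)->rq_dev && (s1)->sector < (s2)->sector))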
There seems to be a widespread idea that a two-way elevator suffers from latency problems compared to a simple one-way CSCAN (circular scan) elevator (like the old one in the Linux kernel). That's not true: a one-way elevator is simply not usable in the real world, and it needs proper per-request latency calculations to work correctly without starvation. A two-way elevator would need proper I/O scheduling functions too, and thus it wouldn't be worse than a one-way elevator once the I/O scheduling part is functional. It's the I/O scheduler part of the elevator that controls the starvation behaviour; the elevator ordering algorithm per se doesn't matter at all with regard to starvation. The ordering algorithm only deals with performance.
It's also possible to tune the I/O scheduler part of the new elevator at runtime (even while writing to disk!) using the elvtune utility that you can get from the latest util-linux at ftp://ftp.win.tue.nl/pub/linux/utils/util-linux/.
Each disk queue is tunable. Here is the elvtune manpage:
ELVTUNE(8)                                                     ELVTUNE(8)

NAME
       elvtune - I/O elevator tuner

SYNOPSIS
       elvtune [ -r r_lat ] [ -w w_lat ] [ -b b_max ] /dev/blkdev1
       [ /dev/blkdev2 ... ]

       elvtune -h

       elvtune -v

DESCRIPTION
       elvtune allows to tune the I/O elevator on a per blockdevice
       queue basis. The tuning can be safely done at runtime. Tuning
       the elevator means being able to change disk performance and
       interactiveness.

       In the output of elvtune the address of the queue tuned will
       be shown and it can be considered as a queue ID. For example
       multiple partitions in the same harddisk will share the same
       queue and so tuning one partition will be like tuning the
       whole HD.

OPTIONS
       -r r_lat
              set the max latency that the I/O scheduler will provide
              on each read.

       -w w_lat
              set the max latency that the I/O scheduler will provide
              on each write.

       -b b_max
              max coalescing factor allowed on writes when there are
              reads pending in the queue.

       -h     help.

       -v     version.

NOTE
       Actually the only fields tunable are those relative to the IO
       scheduler. It's not possible to select a one-way or two-way
       elevator yet.

       For logical blockdevices like LVM the tuning has to be done on
       the physical devices. Tuning the queue of the LVM logical
       device is useless.

RETURN VALUE
       0 on success and 1 on failure.

AUTHORS
       Andrea Arcangeli <andrea@suse.de> SuSE
With elvtune it's possible to change the behaviour of the I/O scheduler completely, making the I/O incredibly responsive (paying with more seeks) or making the disk I/O very smooth but with worse latency, depending on the applications. The default value happens to be a good compromise between the two worlds since it gives good responsiveness and good throughput, so the administrator is normally not supposed to need to tune the elevator; anyway elvtune offers a backdoor into the kernel that the administrator can use for pathological cases.
A few days ago Lawrence Manning sent me some numbers he produced on his machine (an IDE disk) by running ftp://samba.org/pub/tridge/dbench/ with different elvtune values, using the latest 2.3.99-pre kernels with the latest (at the time of this writing) elevator I/O scheduler included. In the table below the rows are the latency enforced on READ requests (the X axis of the graph), the columns are the latency enforced on WRITE requests (the Y axis), and each cell is the dbench throughput in MB/sec. He left the writebomb coalescing factor at the default value of 4.
READ\WRITE       1       2       4       8      16      32      64     128     256     512    1024    2048    4096    8192
         1 6.50377  6.5755 6.62596  6.9788 6.20607  7.0891  7.2836 7.24925 6.17217 6.80693 7.38977 9.59344 13.9536 14.9688
         2 6.29503 6.71732 7.42069 6.69864 6.58865 6.65129 6.75208 6.95417 5.66051 6.29516 7.94299 9.68956 14.0267 17.3723
         4 6.07824  6.2368 6.04078 6.88112 6.64808 6.44013 6.67114 6.04835  5.3561 5.84396 8.38425 10.2627 14.1305 16.1292
         8 6.68223  5.9122  6.9095 6.93605  6.1103 6.91022 6.36142 7.33921 5.71747  6.0396 7.54242 13.0412 12.5244 17.1636
        16 4.82748  4.6775 4.73035 6.20665 5.91671 7.32775 6.10932 6.36999 4.84733 6.34704 7.93269 10.2227 12.2227 16.4733
        32 6.33722 5.88329  6.6827 6.70212 6.50164 6.06152 6.62159 6.78119 4.33677  6.6392 7.60042 10.0367 12.7925 16.1764
        64  4.9119 5.21311  6.6686 5.76938  6.6466 5.73703 6.08781 6.11287 5.74427 6.43403 7.95141 10.1734 13.3854 16.8578
       128 6.24238 6.55747 6.92611 6.98796 6.26899 6.30927  6.1957 6.10965  5.4383 6.27289 9.63812 11.5621  13.718 15.9914
       256 8.29861 8.87839  8.3596 8.64884 8.01177 9.30556 8.25547 8.36583 8.38941 9.98628 10.3707 11.5963 14.1799 14.3723
       512 8.10531 6.92571 5.75492 8.06484 8.22953 8.49832 8.35781  8.9446 12.3957 11.7484 12.2108 13.7929 13.9785 13.0536
      1024 8.62234 9.09723 9.28719 8.97176 8.83318  9.4878  9.2156 9.54619 11.5551 11.9493 12.8216 11.8851 13.6494  12.519
      2048 9.91729 9.45817 9.86336 9.56891 9.81485 9.86392 9.77939 10.0317 11.3267 12.4794 12.9573 12.3045 13.3217  13.576
      4096 9.83617 8.55634 9.19586 10.1315 8.35717 8.23017 8.35075 9.76017 11.3581 11.7681 12.9542 12.1081 13.0081 13.6161
      8192 9.15503 7.51432 7.96411 8.14143  8.5751 8.69921  7.7448  7.6479 11.6386 13.3031 11.9246 12.2524 13.8584 15.3906
Using such input I drew a 3d graph (note: the X and Y axes are in logarithmic scale).
It shows that allowing the elevator to aggressively reorder writes while processing reads quickly provides good throughput. It also shows that the default value (-r 128 -w 8192) is very good for dbench.
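For example (using the option names from the elvtune manpage above, and assuming /dev/hda is the disk being tuned), the default behaviour corresponds to something like elvtune -r 128 -w 8192 /dev/hda, while elvtune -r 1 -w 1 /dev/hda would enforce the tightest latencies shown in the table, trading throughput for responsiveness.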
There are some non-obvious, subtle real-life issues to consider in order to fully understand why responsiveness on writes isn't interesting while read latency is critical. The reason is the nature of reads and writes. When you read, it means you don't have a piece of information yet, so you need it synchronously before you are allowed to do the next step. That happens each time a program is started, for example. Writes have a completely different nature: they are almost always executed asynchronously. With writes we don't care whether the write hits the disk much later or immediately, because we don't have to wait synchronously for the I/O completion. This means we can almost always reorder writes aggressively without risking hurting responsiveness.
The starvation generated by the CSCAN elevator isn't only a problem for responsiveness, it's also an issue for the robustness of the filesystem. If the starved request is a read, the effect will be bad responsiveness, but if the starved requests are a couple of writes and a crash happens during the starvation, the filesystem could end up badly corrupted, with even very old changes never having reached the disk. Avoiding starvation in the elevator is not only a performance issue.
A stable fix for the elevator starvation in 2.2.x can be found at ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.15pre14/elevator-starvation-8.gz. It's inferior to the stock 2.3.x version because 2.2.x doesn't export the request queue structure to the high level (ll_rw_block) layer.
HIGHMEM is a kernel configuration option that optionally allows Linux to handle 4giga of RAM or 64giga of RAM (64giga is possible using PAE, the physical address extension with three-level pagetables) on IA32. HIGHMEM is also going to be used by the alpha to handle more than 2giga of RAM without using the PCI MMU (not all drivers that do any kind of PCI DMA have been converted to use the PCI MMU yet). For simplicity we can analyze only the 4giga support. The 64giga support works in the same way, with the difference that the three-level pagetables of PAE mode, which allow addressing more physical RAM, are used instead of the normal two-level pagetables.
If you hear about BIGMEM don't get confused: BIGMEM and HIGHMEM are the same patch. Originally HIGHMEM was called BIGMEM, and BIGMEM saw the light in kernel 2.3.16. BIGMEM was gratuitously renamed to HIGHMEM during the 2.3.x development cycle. Actually the 2.2.x version of the patch is still called ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/bigmem-2.2.14-9.gz (and it provides support for 4giga of RAM on IA32 and up to terabytes of RAM on alpha).
On IA32 (without PAE mode) we have 4giga of physical RAM and 4giga of virtual address space:
//picture1//
 ------------------------------------------------------------
 | userspace virtual 0-3g              | kernel virtual 3-4g |
 ------------------------------------------------------------
The kernel runs in the kernel virtual space, and in order to access any kind of physical memory it always needs to first set up a proper virtual-to-physical mapping.
The kernel virtual area directly maps the 3-4g virtual range to the 0-1giga physical range, as shown below:
//picture2//
 ------------------------------------------------------------
 | kernel-phys 0-1g | unused-phys-memory 1-4giga             |
 ------------------------------------------------------------
and the userspace virtual addresses (the 0-3giga area in highmem-picture1) map, with 4kbyte (PAGE_SIZE) granularity, anywhere over the physical memory that the kernel can address, according to the user pagetable settings.
So basically the kernel used to do a:
*(virtual_address) = 0;
in order to write to the physical memory at address “virtual_address - 3giga”.
//picture3//
 ------------------------------------------------------------
 | USERSPACE VIRTUAL SPACE 0-3g        | kernel virt 3-4g    |
 ------------------------------------------------------------
                                         |  virtual write here
       ----------------------------------
       |
      \|/  physical write happens here
 ------------------------------------------------------------
 | kernel phys | PHYSICAL SPACE 1-4g                         |
 ------------------------------------------------------------
So where's the problem?
Basically the problem was that the kernel wasn't able to access the physical memory between 1giga and 4giga (the lower part of highmem-picture3).
As you can see from highmem-picture3, the last address that the kernel could access through the ident mapping happens to be 4giga-1, and it points to physical address 1giga-1. Beyond that the virtual addresses wrap around.
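On IA32 the ident mapping is just a constant offset, which is exactly why it stops at physical address 1giga-1. A simplified sketch of the conversion helpers (modeled on include/asm-i386/page.h):

	#define PAGE_OFFSET	0xC0000000UL	/* 3giga: start of the kernel virtual area */

	/* physical address behind a directly-mapped kernel virtual address */
	#define __pa(vaddr)	((unsigned long)(vaddr) - PAGE_OFFSET)
	/* kernel virtual address of a physical address below 1giga */
	#define __va(paddr)	((void *)((unsigned long)(paddr) + PAGE_OFFSET))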
NOTE: we can't use the userspace virtual space to write beyond 1giga physical. This is because the userspace pagetables belong to the process memory layout and we can't change them without clobbering the userspace mappings, and so without flushing away the user TLB and destroying the user MM context.
The whole point of HIGHMEM is to overcome the old pre-highmem-limitation. HIGHMEM allows the kernel to access the physical memory between 1giga and 4giga (physical).
The design of BIGMEM/HIGHMEM consists in reserving a pool of virtual pages at the end of the virtual memory (at around address 4giga) and putting pagetables on top of them, in order to map such virtual addresses to the physical space beyond the first giga of RAM. By changing the mapping of such virtual addresses we are now able to address all the physical RAM in the system.
This is a picture of how the HIGHMEM design works.
//picture4//
 -----------------------------------------------------------------
 | userspace 0-3g VIRTUAL SPACE      | kernel virt 3-4g     |pool|
 -----------------------------------------------------------------
                                       |                    ||||
       --------------------------------                     ||||
       |                        -----------------------------|||
       |                        |      ----------------------|-||
       |                        |      |       --------------|-|
       |                        |      |       |      -------|
       |                        |      |       |      |
      \|/                      \|/    \|/     \|/    \|/
 -----------------------------------------------------------------
 | kern phys direct | NEW PHYSICAL SPACE WE ARE NOW ABLE TO USE   |
 -----------------------------------------------------------------
As you can see, with the pool of virtual pages placed at around virtual address 4giga we can now address all the physical space available from the kernel. Actually in the above picture the physical space is as large as the virtual space (4g). With the new PAE support the code works exactly the same way, with the difference that with the 3-level pagetables of PAE mode the physical space is larger than 4giga: it is 64giga large. So the above picture should be changed by moving the end of the physical space to address 64giga instead of 4giga; that's conceptually the only difference between PAE mode and non-PAE mode (in practice there are some more details to consider, of course).
The userspace in HIGHMEM continues to map pages in the same way as before. The only thing that changes is that the kernel can allow userspace to map memory all over the physical space, because the kernel itself can now access all the physical memory in the first place.
Of course there's a performance penalty in accessing the memory above 1giga, because we have to set up a virt-to-phys mapping and flush away old TLB entries before we can access it. It's very reasonable to pay such a penalty in order to handle much more memory. We measured the performance hit in the worst case (with a maximal page-fault rate) to be 2%. In real-life usage it's not measurable.
This is the pseudocode of the testcase that has been used to generate the pathological case:
#define SIZE (700*1024*1024)

main()
{
	int i;
	char * buf = malloc(SIZE);

	if (!buf)
		...

	i = 0;
	get_start_time();
	/* touch one word per page: maximal page-fault rate */
	for (; i < SIZE; i += 4096, buf += 4096)
		*(int *) buf = 0;
	get_stop_time();

	printf("time %d\n", get_diff_time());
}
Subtle point: one could think that allocating only 700mbyte is not enough to hit the high memory (since the high memory starts above one giga). That's not the case, because the allocator knows when we can handle highmem, and if we can handle highmem it prefers to always return highmem. Anonymous memory can be allocated from highmem, so if there are at least 700mbyte of high memory free, we'll certainly allocate the anonymous memory from there.
To automatically map/unmap (with pagetables) the reserved pool of virtual pages and so access all the physical memory, we created two helper functions called kmap() and kunmap() that deal automagically with setting up and tearing down the virt-to-phys mappings.
The code in 2.3.x that uses the reserved pool to access the memory over 1giga looks like this:
	struct page * page = alloc_page(GFP_HIGHUSER);
	unsigned long vaddr = kmap(page);

	[..] -- here feel free to use the virtual address vaddr to access
	       the physical page `page' [..]

	kunmap(page); -- at this point vaddr is not valid anymore
All the code that has to deal with highmem has to issue a kmap before dereferencing a pointer into that memory. If the memory is not in the high zone, the ident mapping (direct mapping) is used. If the memory is in the high zone, a page of the highmem pool gets mapped onto the physical page.
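Conceptually kmap() behaves like the sketch below (modeled on the IA32 highmem code of this era; the bookkeeping of the virtual-page pool done by the real kmap_high() is omitted here, and kmap_sketch is just a name for this illustration):

	/* sketch only: choose between the direct mapping and the highmem pool */
	static inline unsigned long kmap_sketch(struct page * page)
	{
		if (page < highmem_start_page)
			/* below 1giga: reachable through the ident mapping */
			return page_address(page);
		/* above 1giga: take a virtual page from the reserved pool near
		   4giga and point its pagetable at this physical page (this is
		   what the real kmap_high() does, refcounting included) */
		return kmap_high(page);
	}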
The page LRU is a list (ordered according to how recently the entries were used) where all the freeable cache entities get queued. Each time we allocate a cache entity (for example while reading from a file), we link the cache memory into the page-LRU. Later, when we get low on memory, we browse the page-LRU and we free and unlink pages to allow other allocations to succeed.
Before 2.3.x, when the system was low on memory the kernel had to browse all the physical pages in the system in order to find a freeable page. The browsing was done round robin using a clock algorithm; clock algorithm means that each attempt at recycling the cache started browsing at the point where the previous attempt stopped.
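Purely as an illustration of that description (this is not the real 2.2.x code, and page_is_freeable_cache() is a hypothetical placeholder for whatever freeability test the page stealer applied), the old approach amounted to something like:

	/* illustration of the old clock scan: resume where the previous pass
	   stopped and walk the physical pages one by one, freeable or not */
	static unsigned long clock_hand;	/* persists across invocations */

	static struct page * old_clock_scan(int count)
	{
		while (count--) {
			struct page * page = mem_map + clock_hand;

			clock_hand = (clock_hand + 1) % max_mapnr;	/* round robin */
			if (page_is_freeable_cache(page))	/* hypothetical check */
				return page;
		}
		/* with mostly-anonymous memory the whole pass can be wasted here */
		return NULL;
	}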
The old clock algorithm had a big complexity problem. Suppose there's an application that allocates almost all the physical memory as anonymous user memory (using malloc() for example):
                   physical memory in the machine
 -----------------------------------------------------------------
 | memory allocated in userspace                          | cache |
 -----------------------------------------------------------------
The above shows the worst scenario, the one in which the old clock algorithm would kill the performance of the machine.
Suppose there are 10giga of physical RAM, that the memory allocated in userspace is 9.5giga and that half a giga is allocated in cache. Assume a page size of 8kbytes (like on the alpha). In such a scenario the old algorithm wasted time, in a not schedule-friendly kernel loop, trying to free the 1245184 pages (9.5giga / 8kbytes) that are not part of the cache, at every pass. Only the last 65536 physical pages were actually part of the cache.
The above scenario was also going to fool the VM balancing algorithm, causing random swapouts as soon as the clock hand entered the non-freeable userspace part on the left (the old VM balancing algorithm could only work well if the cache was fragmented across the physical space, and that wasn't going to be the case most of the time).
With the page-LRU, in the same scenario we work only on the half giga of cache and we never waste time trying to free anonymous memory. The page-LRU allows us to reach cache pages in O(1). This also means we no longer rely on the cache being fragmented all over the physical memory to get a regular swap behaviour. With the page-LRU, the random swapouts due to too-long physically contiguous non-cache areas can't happen anymore.
The new page-LRU design is very SMP friendly. The only lock necessary to read or modify the page-LRU is the pagemap_lru_lock global spinlock. This means that the cache freeing scales completely in SMP, and by using a trick (moving the non-freeable pages temporarily into private local LRU lists) we can allow the page stealer to run on multiple CPUs at the same time without risking duplicating the work, really scaling the page-freeing work in SMP.
Once we find an interesting page we unlink it and we drop the pagemap_lru_lock, and the real freeing job is done without the global pagemap_lru_lock held, only with the very fine-grained per-page SMP lock (we atomically downgrade the locking from global to per-page to scale, and we later upgrade it back once the freeing work is finished). The pagemap_lru_lock is acquired only while unlinking the page from the list and later to reinsert the page into one of the local dispose lists if the freeing didn't succeed.
Due to the lock ordering we can't spin on the per-page lock while we hold the pagemap_lru_lock. So to avoid deadlocks we only try to acquire the per-page lock while we still hold the pagemap_lru_lock (there is a fail path and we don't spin). That's what I meant by the page-LRU SMP lock downgrade. If we succeed in acquiring the per-page lock we release the pagemap_lru_lock and we slowly try to free the page. Then, once the page freeing is finished and before releasing the per-page lock, we re-acquire the pagemap_lru_lock with a spin_lock() operation; we can do that since we are in the right locking order at that point.
This is the page stealer code, the function is called shrink_mmap().
NOTE: the name of the function is a bit misleading, since only non-mapped page cache and unused buffers can be freed by shrink_mmap(); shrink_cache() would have been a more appropriate name, but I think the name shrink_mmap() carries a heavy historical legacy.
int shrink_mmap(int priority, int gfp_mask, zone_t *zone)
{
	int ret = 0, count;
	LIST_HEAD(young);
	LIST_HEAD(old);
	LIST_HEAD(forget);
	struct list_head * page_lru, * dispose;
	struct page * page;

	if (!zone)
		BUG();

	count = nr_lru_pages / (priority+1);

	spin_lock(&pagemap_lru_lock);

	while (count > 0 && (page_lru = zone->lru_cache.prev) != &zone->lru_cache) {
		page = list_entry(page_lru, struct page, lru);
		list_del(page_lru);

		dispose = &zone->lru_cache;
		if (test_and_clear_bit(PG_referenced, &page->flags))
			/* Roll the page at the top of the lru list,
			 * we could also be more aggressive putting
			 * the page in the young-dispose-list, so
			 * avoiding to free young pages in each pass.
			 */
			goto dispose_continue;

		dispose = &old;
		/* don't account passes over not DMA pages */
		if (zone && (!memclass(page->zone, zone)))
			goto dispose_continue;

		count--;

		dispose = &young;
		/* avoid unscalable SMP locking */
		if (!page->buffers && page_count(page) > 1)
			goto dispose_continue;

		if (TryLockPage(page))
			goto dispose_continue;

		/* Release the pagemap_lru lock even if the page is not yet
		   queued in any lru queue since we have just locked down
		   the page so nobody else may SMP race with us running
		   a lru_cache_del() (lru_cache_del() always run with the
		   page locked down ;). */
		spin_unlock(&pagemap_lru_lock);

		/* avoid freeing the page while it's locked */
		get_page(page);

		/* Is it a buffer page? */
		if (page->buffers) {
			if (!try_to_free_buffers(page))
				goto unlock_continue;
			/* page was locked, inode can't go away under us */
			if (!page->mapping) {
				atomic_dec(&buffermem_pages);
				goto made_buffer_progress;
			}
		}

		/* Take the pagecache_lock spinlock held to avoid
		   other tasks to notice the page while we are looking at its
		   page count. If it's a pagecache-page we'll free it
		   in one atomic transaction after checking its page count. */
		spin_lock(&pagecache_lock);

		/*
		 * We can't free pages unless there's just one user
		 * (count == 2 because we added one ourselves above).
		 */
		if (page_count(page) != 2)
			goto cache_unlock_continue;

		/*
		 * Is it a page swap page? If so, we want to
		 * drop it if it is no longer used, even if it
		 * were to be marked referenced..
		 */
		if (PageSwapCache(page)) {
			spin_unlock(&pagecache_lock);
			__delete_from_swap_cache(page);
			goto made_inode_progress;
		}

		/* is it a page-cache page? */
		if (page->mapping) {
			if (!Page_Dirty(page) && !pgcache_under_min()) {
				remove_page_from_inode_queue(page);
				remove_page_from_hash_queue(page);
				page->mapping = NULL;
				spin_unlock(&pagecache_lock);
				goto made_inode_progress;
			}
			goto cache_unlock_continue;
		}

		dispose = &forget;
		printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");

cache_unlock_continue:
		spin_unlock(&pagecache_lock);
unlock_continue:
		spin_lock(&pagemap_lru_lock);
		UnlockPage(page);
		put_page(page);
		list_add(page_lru, dispose);
		continue;

dispose_continue:
		list_add(page_lru, dispose);
	}
	goto out;

made_inode_progress:
	page_cache_release(page);
made_buffer_progress:
	UnlockPage(page);
	put_page(page);
	ret = 1;
	spin_lock(&pagemap_lru_lock);
	/* nr_lru_pages needs the spinlock */
	nr_lru_pages--;

out:
	list_splice(&young, &zone->lru_cache);
	list_splice(&old, zone->lru_cache.prev);

	spin_unlock(&pagemap_lru_lock);

	return ret;
}
So far we have examined only the cache stealer that runs as soon as the system goes low on memory. That's the most interesting part of the page-LRU. Of course there are also a few entry/exit points for the page-LRU. They happen to be where we allocate cache and where we destroy cache (for example while destroying an inode we drop the page cache related to that inode, and so we also have to unlink that cache from the page-LRU). Those entry/exit points are handled by the lru_cache_add() and lru_cache_del() macros.
extern spinlock_t pagemap_lru_lock;

/*
 * Helper macros for lru_pages handling.
 */
#define	lru_cache_add(page)				\
do {							\
	spin_lock(&pagemap_lru_lock);			\
	list_add(&(page)->lru, &page->zone->lru_cache);	\
	nr_lru_pages++;					\
	spin_unlock(&pagemap_lru_lock);			\
} while (0)

#define	lru_cache_del(page)				\
do {							\
	if (!PageLocked(page))				\
		BUG();					\
	spin_lock(&pagemap_lru_lock);			\
	list_del(&(page)->lru);				\
	nr_lru_pages--;					\
	spin_unlock(&pagemap_lru_lock);			\
} while (0)
The above macros are basically the only additional fixed cost that the new design requires.
A normal LRU list implicitly assigns an age to its entries only as a function of their position in the list. That's not entirely true in the case of the page-LRU.
We keep using a referenced bit, which carries the aging information for each LRU entry. We do that because we want to avoid having to roll (modify) the LRU list at each cache hit (each time we do a search on the cache and find the interesting information there).
In the cache-hit case we want to be fast, and we only want to set the referenced bit in order to signal to shrink_mmap() that the page is young and that shrink_mmap() shouldn't try to free it before the referenced bit gets cleared.
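As a sketch of that fast path (touch_lru_page() is just a name used here for illustration; in the real 2.3.x code the bit is set directly inside the page cache lookup):

	static inline void touch_lru_page(struct page * page)
	{
		/* mark the page young: shrink_mmap() will clear this bit and
		   give the page another round in the LRU instead of freeing it */
		set_bit(PG_referenced, &page->flags);
	}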
Later, when shrink_mmap() finds the page at the bottom of the LRU, it clears the referenced bit and rolls the referenced page back to the head of the LRU list, at very low cost.
This definitely improves performance and it still provides reasonably precise aging.
Certainly there are many more, these are just the first important ones that came to mind.