===== Exploring some new Linux **2.3.x** features =====

© 2000 Andrea Arcangeli, [[http://www.suse.com/|SuSE]] \\ v1.0, 3 April 2000, Imola (Italy)

----

//This document is the reference for my talk about **2.3.x** development that will take place at the **Free Software Days in Strasbourg, France** on **12 and 13 May 2000**. The object of this document is to explore a few of the exciting new features that everybody will enjoy in the upcoming 2.4.x Linux kernels. This document can be redistributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.//

----

===== 1. Elevator I/O scheduler =====

==== 1.1 What is the elevator? ====

The elevator is a queue where the I/O requests get ordered according to their sector position on disk. The point of this ordering is to avoid disk seeks whenever possible, in order to improve I/O performance. Almost every normal hard disk benefits from avoiding seeks (I would guess the elevator also improves the lifetime of the hard disk :-).

Suppose there are three processes **A**, **B** and **C**, and assume they submit their I/O in this order:

<code>
time | process | I/O sector
-----|---------|-----------
  0  |    A    |    10
  1  |    B    |    14
  2  |    C    |    12
  3  |    A    |    11
  4  |    B    |    15
  5  |    C    |    13
</code>

Without the elevator the disk would move this way (requests are serviced in submission order):

<code>
 sector
     15 |                               *
     14 |          *
     13 |                                      *
     12 |                 *
     11 |                        *
     10 |   *
        |--------------------------------------------
           0      1      2      3      4      5    time
</code>

Thanks to the elevator inside the Linux operating system the disk instead moves this way:

<code>
 sector
     15 |                                      *
     14 |                               *
     13 |                        *
     12 |                 *
     11 |          *
     10 |   *
        |--------------------------------------------
           0      1      2      3      4      5    time
</code>

so no seeks happen (the hard disk doesn't move back and forth) and the I/O completes much faster.
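Just to make the ordering idea concrete, here is a minimal userspace sketch (toy types and names, nothing taken from the kernel source): a pure elevator does nothing more than insert every new request into a queue kept sorted by sector.

<code c>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical toy request: only the starting sector matters here. */
struct toy_request {
	unsigned long sector;
	struct toy_request *next;
};

/* Insert `req' keeping the queue sorted by sector (pure elevator ordering,
 * with no latency control at all -- this is what can starve requests). */
static void toy_elevator_insert(struct toy_request **queue, struct toy_request *req)
{
	while (*queue && (*queue)->sector < req->sector)
		queue = &(*queue)->next;
	req->next = *queue;
	*queue = req;
}

int main(void)
{
	unsigned long submitted[] = { 10, 14, 12, 11, 15, 13 };
	struct toy_request *queue = NULL;

	for (int i = 0; i < 6; i++) {
		struct toy_request *req = malloc(sizeof(*req));
		req->sector = submitted[i];
		toy_elevator_insert(&queue, req);
	}

	/* The queue now drains in sector order: 10 11 12 13 14 15. */
	for (struct toy_request *r = queue; r; r = r->next)
		printf("%lu ", r->sector);
	printf("\n");
	return 0;
}
</code>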
==== 1.2 Elevator starvation ====

Elevator starvation means that the elevator, because of its reordering, could delay one of the I/O requests for an //indefinitely// long time. This is what always happened in Linux before the upcoming **2.4.x**.

Assume there are two processes **A** and **B**. Process **A** continuously writes 2 terabytes of data to disk starting from sector **0**. This is how the Linux hard disk queue gets filled by the elevator under the load generated by process **A** (assume for simplicity that the queue contains at most 10 I/O requests; the left of the queue is where the I/O controller eats requests from):

<code>
sector  |   0   1   2   3   4   5   6   7   8   9
process |   A   A   A   A   A   A   A   A   A   A
</code>

After some time the queue will look like this:

<code>
sector  | 100 101 102 103 104 105 106 107 108 109
process |   A   A   A   A   A   A   A   A   A   A
</code>

Now the I/O controller finishes processing the request at sector **100**, the queue gets shifted left by one request, and there is room for a new request in the queue:

<code>
sector  | 101 102 103 104 105 106 107 108 109
process |   A   A   A   A   A   A   A   A   A
</code>

But at this point it is process **B** that tries to read **one** single sector (a mere 512 bytes) at a 3 terabyte offset:

<code>
sector  | 101 102 103 104 105 106 107 108 109 6442450944
process |   A   A   A   A   A   A   A   A   A          B
</code>

Later the I/O controller completes request **101** too:

<code>
sector  | 102 103 104 105 106 107 108 109 6442450944
process |   A   A   A   A   A   A   A   A          B
</code>

At this point process **A** is able to queue a new request; precisely, the next request that process **A** wants to queue corresponds to sector **110**. The old elevator correctly notices that **109** < **110** < **6442450944** and so it inserts the new request in between:

<code>
sector  | 102 103 104 105 106 107 108 109 110 6442450944
process |   A   A   A   A   A   A   A   A   A          B
</code>

Then the request at sector **102** gets completed too and request **111** gets queued, again in order:

<code>
sector  | 103 104 105 106 107 108 109 110 111 6442450944
process |   A   A   A   A   A   A   A   A   A          B
</code>

As you can see, with the old elevator in Linux **2.2.x** the request queued by process **B** was delayed indefinitely and process **B** stalled completely waiting for its request to complete. Basically process **B** had to wait for process **A** to finish writing 2 terabytes of data before being able to read 512 bytes from disk. Writing 2 terabytes of data takes about **29 hours** at a throughput of 20 Mbyte/sec (a fast hard disk). That's why Linux had never been able to remain responsive during heavy I/O.

==== 1.3 I/O scheduler ====

To overcome this longstanding starvation problem the I/O elevator has been rewritten, turning it into a real I/O scheduler that enforces a bounded per-request latency.

In **2.4.x** each I/O queue has a sequence number (//q->elevator.sequence//) that gets increased every time a new request is queued. Each request in the queue has a sequence number too (//req->elevator_sequence//). The per-request sequence number is initialized at insertion time to the queue sequence number plus a number that indicates the max latency of the request. //Max latency of the request// means how many other requests are allowed to pass that request. No request can pass a request whose sequence number is lower than the current queue sequence number: a request with a timed-out sequence number acts as a barrier and the elevator doesn't delay it any further. This fixes the starvation completely.

This is the implementation of the new elevator I/O scheduler (taken from 2.3.99-pre3, //drivers/block/elevator.c//):

<code c>
/*
 *  linux/drivers/block/elevator.c
 *
 *  Block device elevator/IO-scheduler.
 *
 *  Copyright (C) 2000 Andrea Arcangeli SuSE
 */

#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/elevator.h>
#include <linux/blk.h>
#include <linux/module.h>

static void elevator_default(struct request * req, elevator_t * elevator,
			     struct list_head * real_head,
			     struct list_head * head, int orig_latency)
{
	struct list_head * entry = real_head, * point = NULL;
	struct request * tmp;
	int sequence = elevator->sequence;
	int latency = orig_latency -= elevator->nr_segments, pass = 0;
	int point_latency = 0xbeefbeef;

	while ((entry = entry->prev) != head) {
		if (!point && latency >= 0) {
			point = entry;
			point_latency = latency;
		}
		tmp = blkdev_entry_to_request(entry);
		if (elevator_sequence_before(tmp->elevator_sequence, sequence) || !tmp->q)
			break;
		if (latency >= 0) {
			if (IN_ORDER(tmp, req) ||
			    (pass && !IN_ORDER(tmp, blkdev_next_request(tmp))))
				goto link;
		}
		latency += tmp->nr_segments;
		pass = 1;
	}

	if (point) {
		entry = point;
		latency = point_latency;
	}

link:
	list_add(&req->queue, entry);
	req->elevator_sequence = elevator_sequence(elevator, latency);
}

[..]
</code>
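To make the latency bookkeeping more concrete, here is a minimal userspace sketch (toy types and names, not the kernel implementation above): every new request gets a budget of how many later requests may pass it, and once the queue sequence number catches up with the request's own sequence number the request behaves as a barrier.

<code c>
#include <stdio.h>

/* Toy model of the queue sequence number and of the per-request sequence
 * number described above.  MAX_LATENCY plays the role of the tunable latency. */
#define MAX_LATENCY 4

struct toy_request {
	unsigned long long sector;
	int elevator_sequence;	/* queue sequence at insert time + MAX_LATENCY */
};

static int queue_sequence;

static void toy_enqueue(struct toy_request *req, unsigned long long sector)
{
	req->sector = sector;
	queue_sequence++;
	req->elevator_sequence = queue_sequence + MAX_LATENCY;
}

/* A new request may be inserted before `old' only while `old' has not
 * timed out yet, i.e. while its sequence is still ahead of the queue's. */
static int toy_may_pass(const struct toy_request *old)
{
	return old->elevator_sequence > queue_sequence;
}

int main(void)
{
	struct toy_request starved;

	toy_enqueue(&starved, 6442450944ULL);
	for (int i = 0; i < 8; i++) {
		struct toy_request filler;
		/* the first MAX_LATENCY new requests may pass, then the old
		 * request becomes a barrier and nothing can jump ahead of it */
		printf("new request %d: may it pass the old one? %s\n",
		       i, toy_may_pass(&starved) ? "yes" : "no");
		toy_enqueue(&filler, 110 + i);
	}
	return 0;
}
</code>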
==== 1.4 One-way elevator versus two-way elevator ====

There seems to be a widespread idea that a two-way elevator suffers from latency problems compared to a simple one-way CSCAN (circular scan) elevator (like the old one in the Linux kernel). That's not true: even a one-way elevator is simply not usable in the real world without proper per-request latency accounting to avoid starvation. A two-way elevator would need proper I/O scheduling functions too, and with the I/O scheduling part in place it wouldn't be any worse than a one-way elevator.

It's the I/O scheduler part of the elevator that controls the starvation behaviour; the ordering algorithm per se doesn't matter at all with regard to starvation. The ordering algorithm only affects performance.

==== 1.5 Elevator tuning with `elvtune' ====

It's also possible to tune the I/O scheduler part of the new elevator at runtime (even while writing to disk! :-) using the **elvtune** utility that you can get from the latest util-linux at ftp://ftp.win.tue.nl/pub/linux/utils/util-linux/. Each disk queue is tunable separately. Here is the **elvtune** manpage:

<code>
ELVTUNE(8)                                               ELVTUNE(8)

NAME
       elvtune - I/O elevator tuner

SYNOPSIS
       elvtune [ -r r_lat ] [ -w w_lat ] [ -b b_max ]
               /dev/blkdev1 [ /dev/blkdev2 ... ]

       elvtune -h

       elvtune -v

DESCRIPTION
       elvtune allows to tune the I/O elevator per blockdevice
       queue basis. The tuning can be safely done at runtime.

       Tuning the elevator means being able to change disk
       performance and interactiveness.

       In the output of elvtune the address of the queue tuned
       will be shown and it can be considered as a queue ID.
       For example multiple partitions in the same harddisk
       will share the same queue and so tuning one partition
       will be like tuning the whole HD.

OPTIONS
       -r r_lat
              set the max latency that the I/O scheduler will
              provide on each read.

       -w w_lat
              set the max latency that the I/O scheduler will
              provide on each write.

       -b b_max
              max coalescing factor allowed on writes when
              there are reads pending in the queue.

       -h     help.

       -v     version.

NOTE
       Actually the only fields tunable are those relative to
       the IO scheduler. It's not possible to select a one-way
       or two-way elevator yet.

       For logical blockdevices like LVM the tuning has to be
       done on the physical devices. Tuning the queue of the
       LVM logical device is useless.

RETURN VALUE
       0 on success and 1 on failure.

AUTHORS
       Andrea Arcangeli, SuSE
</code>

With elvtune it's possible to completely change the behaviour of the I/O scheduler, making the I/O incredibly responsive (paying with more seeks) or making the disk I/O very smooth but with worse latency, depending on the applications. The default values happen to be a good compromise between the two worlds, since they give both good responsiveness and good throughput, so the administrator is not normally expected to tune the elevator; nevertheless **elvtune** offers a backdoor into the kernel that the administrator can use for pathological cases.

==== 1.6 `elvtune' numbers ====

A few days ago [[mailto:lawrence@aslak.demon.co.uk|Lawrence Manning]] sent me some numbers he produced on his machine (an IDE disk) by running ftp://samba.org/pub/tridge/dbench/ with different elvtune values on the latest **2.3.99-pre** kernels, which include the latest (at the time of this writing) elevator I/O scheduler. In the table below, the rows (X axis of the graph in the next section) give the latency enforced on **READ** requests and the columns (Y axis) give the latency enforced on **WRITE** requests; each cell is the dbench throughput in MB/sec. He left the write-bomb coalescing factor at its default value of **4**.
<code>
dbench throughput in MB/sec (rows: READ latency -r, columns: WRITE latency -w)

 R\W       1        2        4        8        16       32       64       128      256      512      1024     2048     4096     8192
    1  6.50377  6.5755   6.62596  6.9788   6.20607  7.0891   7.2836   7.24925  6.17217  6.80693  7.38977  9.59344  13.9536  14.9688
    2  6.29503  6.71732  7.42069  6.69864  6.58865  6.65129  6.75208  6.95417  5.66051  6.29516  7.94299  9.68956  14.0267  17.3723
    4  6.07824  6.2368   6.04078  6.88112  6.64808  6.44013  6.67114  6.04835  5.3561   5.84396  8.38425  10.2627  14.1305  16.1292
    8  6.68223  5.9122   6.9095   6.93605  6.1103   6.91022  6.36142  7.33921  5.71747  6.0396   7.54242  13.0412  12.5244  17.1636
   16  4.82748  4.6775   4.73035  6.20665  5.91671  7.32775  6.10932  6.36999  4.84733  6.34704  7.93269  10.2227  12.2227  16.4733
   32  6.33722  5.88329  6.6827   6.70212  6.50164  6.06152  6.62159  6.78119  4.33677  6.6392   7.60042  10.0367  12.7925  16.1764
   64  4.9119   5.21311  6.6686   5.76938  6.6466   5.73703  6.08781  6.11287  5.74427  6.43403  7.95141  10.1734  13.3854  16.8578
  128  6.24238  6.55747  6.92611  6.98796  6.26899  6.30927  6.1957   6.10965  5.4383   6.27289  9.63812  11.5621  13.718   15.9914
  256  8.29861  8.87839  8.3596   8.64884  8.01177  9.30556  8.25547  8.36583  8.38941  9.98628  10.3707  11.5963  14.1799  14.3723
  512  8.10531  6.92571  5.75492  8.06484  8.22953  8.49832  8.35781  8.9446   12.3957  11.7484  12.2108  13.7929  13.9785  13.0536
 1024  8.62234  9.09723  9.28719  8.97176  8.83318  9.4878   9.2156   9.54619  11.5551  11.9493  12.8216  11.8851  13.6494  12.519
 2048  9.91729  9.45817  9.86336  9.56891  9.81485  9.86392  9.77939  10.0317  11.3267  12.4794  12.9573  12.3045  13.3217  13.576
 4096  9.83617  8.55634  9.19586  10.1315  8.35717  8.23017  8.35075  9.76017  11.3581  11.7681  12.9542  12.1081  13.0081  13.6161
 8192  9.15503  7.51432  7.96411  8.14143  8.5751   8.69921  7.7448   7.6479   11.6386  13.3031  11.9246  12.2524  13.8584  15.3906
</code>

==== 1.7 `elvtune' graph ====

Using this input I drew a 3D graph (note: the X and Y axes are in logarithmic scale).

{{dbench-elvtune.gif}}

It shows that allowing the elevator to aggressively reorder writes while serving reads quickly provides good throughput. It also shows that the default values (-r 128 -w 8192) are very good for dbench.

==== 1.8 Real world thoughts ====

There are some non-obvious, subtle real-life issues to consider in order to fully understand why responsiveness on writes isn't interesting while read latency is critical. The reason is the different nature of reads and writes.

When you read, it means you don't yet have some piece of information, so you need it synchronously before you are allowed to do the next step. That happens every time a program is started, for example. Writes have a completely different nature: they are almost always executed asynchronously. With writes we don't care whether the data hits the disk immediately or much later, because we don't have to wait synchronously for the I/O completion. This means we can almost always reorder writes aggressively without risking hurting responsiveness.
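A trivial userspace illustration of this asymmetry (my own example, not from the original text; the filename is arbitrary): read(2) cannot return before the data is actually available, while write(2) normally returns as soon as the data has been copied into the page cache, long before it reaches the platters.

<code c>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	int fd = open("some-file", O_RDWR);	/* arbitrary example file */

	if (fd < 0)
		return 1;

	/* Synchronous by nature: if these 512 bytes are not already cached,
	 * the process sleeps until the disk delivers them, so read latency
	 * is directly visible as application latency. */
	if (read(fd, buf, sizeof(buf)) < 0)
		perror("read");

	/* Normally asynchronous: this returns once the data sits dirty in
	 * the page cache; the kernel is free to reorder and delay the real
	 * disk write without the application noticing. */
	if (write(fd, buf, sizeof(buf)) < 0)
		perror("write");

	close(fd);
	return 0;
}
</code>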
==== 1.9 Robustness of the filesystem with starvation in the I/O layer ====

The starvation generated by the **CSCAN** elevator isn't only a problem for responsiveness, it's also an issue for the robustness of the filesystem. If the starved request is a **read**, the effect is bad responsiveness; but if the starved requests are a couple of **writes** and a crash happens during the starvation, the filesystem can end up badly corrupted, with even very old changes never having reached the disk. So avoiding starvation in the elevator is not only a performance issue.

==== 1.10 Elevator replacement for **2.2.x** ====

A stable fix for the elevator starvation in **2.2.x** can be found at ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.15pre14/elevator-starvation-8.gz. It's inferior to the stock **2.3.x** version because **2.2.x** doesn't export the request queue structure to the high level (ll_rw_block) layer.

----

===== 2. HIGHMEM =====

==== 2.1 What's HIGHMEM? ====

**HIGHMEM** is a kernel configuration option that optionally allows Linux to handle **4 gigabytes** of RAM or **64 gigabytes** of RAM on IA32 (64 gigabytes is possible using PAE, //Physical Address Extension//, three-level pagetables). **HIGHMEM** is also going to be used by the alpha to handle more than 2 gigabytes of RAM without using the PCI MMU (not all drivers that do any kind of PCI DMA have been converted to use the PCI MMU yet).

For simplicity we will analyze only the 4-gigabyte support. The 64-gigabyte support works in the same way, with the difference that the three-level pagetables of PAE mode, which allow addressing more physical RAM, are used instead of the normal two-level pagetables.

==== 2.2 A little bit of history (BIGMEM) ====

If you hear about **BIGMEM** don't get confused: **BIGMEM** and **HIGHMEM** are the **same** patch. Originally **HIGHMEM** was called **BIGMEM**, and **BIGMEM** saw the light in kernel **2.3.16**. **BIGMEM** was then gratuitously renamed to **HIGHMEM** during the **2.3.x** development cycle. The **2.2.x** version of the patch is in fact still called ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.2.14/bigmem-2.2.14-9.gz (and it provides support for 4 gigabytes of RAM on IA32 and up to terabytes of RAM on alpha).

==== 2.3 Previous limitation ====

On IA32 (without PAE mode) we have at most 4 gigabytes of physical RAM and 4 gigabytes of virtual address space:

//picture1//
<code>
 ------------------------------------------------------------
 | userspace virtual 0-3g                 | kernel virt 3-4g |
 ------------------------------------------------------------
</code>

The kernel runs in the kernel virtual space, and in order to access any kind of physical memory it always needs to first set up a proper virtual-to-physical mapping.

The kernel virtual area directly maps the 3-4g virtual range onto the 0-1g physical range, as shown below:

//picture2//
<code>
 ------------------------------------------------------------
 | kernel-phys 0-1g | unused-phys-memory 1-4g                |
 ------------------------------------------------------------
</code>

while the userspace **virtual** addresses (the 0-3g area in picture1) map, with 4-kbyte (PAGE_SIZE) granularity, anywhere over **the physical memory that the kernel can address**, according to the user pagetable settings.

So basically the kernel used to do a:

<code c>
*(virtual_address) = 0;
</code>

in order to write to the physical memory at address "virtual_address - 3g".

//picture3//
<code>
 ------------------------------------------------------------
 | USERSPACE VIRTUAL SPACE 0-3g           | kernel virt 3-4g |
 ------------------------------------------------------------
                                             |
                  virtual write here ---------
                  |
                 \|/
  physical write happens here
 ------------------------------------------------------------
 | kernel phys 0-1g |          PHYSICAL SPACE 1-4g           |
 ------------------------------------------------------------
</code>

So where's the problem? Basically the problem was that the kernel wasn't able to access the physical memory between 1 gigabyte and 4 gigabytes (the lower-right part of picture3). As you can see from picture3, the last address the kernel could write through the identity mapping is 4g-1, and it points to the physical address 1g-1. After that the virtual addresses simply wrap around.

NOTE: we can't use the userspace virtual space to write beyond 1g physical. This is because the userspace pagetables belong to the process memory layout and we can't change them without clobbering the userspace mappings, which would mean flushing away the user TLB and destroying the user MM context.
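The identity (direct) mapping described above is just a constant offset between kernel virtual and physical addresses. Here is a minimal sketch of the arithmetic, assuming the usual IA32 3g/1g split where the kernel direct mapping starts at virtual 0xC0000000 (this mirrors, in simplified form, what the kernel's __pa()/__va() helpers do for the identity-mapped region only):

<code c>
#include <stdio.h>

/* With the standard IA32 3g/1g split the kernel's direct mapping starts
 * at this virtual address (the "3 giga" in the text). */
#define PAGE_OFFSET 0xC0000000UL

/* Simplified virt<->phys conversions, valid only inside the identity
 * mapped region (kernel virtual 3g-4g <-> physical 0-1g). */
static unsigned long virt_to_phys_direct(unsigned long vaddr)
{
	return vaddr - PAGE_OFFSET;
}

static unsigned long phys_to_virt_direct(unsigned long paddr)
{
	return paddr + PAGE_OFFSET;
}

int main(void)
{
	/* The last directly addressable virtual byte (4g - 1) maps to
	 * physical 1g - 1: everything above 1g physical is unreachable
	 * through this window, which is exactly the pre-HIGHMEM limit. */
	printf("virt 0xffffffff -> phys 0x%08lx\n",
	       virt_to_phys_direct(0xffffffffUL));
	printf("phys 0x00000000 -> virt 0x%08lx\n",
	       phys_to_virt_direct(0x0UL));
	return 0;
}
</code>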
==== 2.4 Object ====

The whole point of **HIGHMEM** is to overcome the old pre-HIGHMEM limitation: **HIGHMEM** allows the kernel to access the physical memory between 1 gigabyte and 4 gigabytes.

==== 2.5 Design ====

The design of **BIGMEM/HIGHMEM** consists in reserving a pool of virtual pages at the end of the virtual address space (around address 4g) and putting pagetables on top of it, in order to map those virtual addresses onto the physical space beyond the first gigabyte of RAM. By changing the mapping of those virtual addresses over time, we are now able to address all the physical RAM in the system.

==== 2.6 Result ====

This is a picture of how the **HIGHMEM** design works:

//picture4//
<code>
 -----------------------------------------------------------------
 | userspace 0-3g VIRTUAL SPACE     | kernel virt 3-4g      |pool|
 -----------------------------------------------------------------
                                      |                      ||||
         ------------------------------                      ||||
         |                   ----------------------------------||
         |                   |              --------------------|
         |                   |              |            -------|
         |                   |              |            |     |
        \|/                 \|/            \|/          \|/   \|/
 -----------------------------------------------------------------
 | kern phys direct |  NEW PHYSICAL SPACE WE ARE NOW ABLE TO USE  |
 -----------------------------------------------------------------
</code>

As you can see, with the pool of virtual pages placed around virtual address 4g we can now address all the physical space available from the kernel. In the picture above the physical space is as large as the virtual space (4g). With the new PAE support the code works in exactly the same way, with the difference that with the 3-level pagetables of PAE mode the physical space is larger than 4g: it is 64g instead. So the picture above should be changed by moving the end of the physical space to address 64g rather than 4g; conceptually that's the only difference between PAE mode and non-PAE mode (in practice there are a few more //details// to consider, of course).

Userspace under **HIGHMEM** continues to map pages in the same way as before. The only thing that changes is that the kernel can now let userspace map memory anywhere in the physical space, because the kernel itself can access all the physical memory in the first place.
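A userspace toy model of what the reserved pool buys us (all names here are invented for the illustration; the real mechanism is the kmap()/kunmap() machinery shown in section 2.8 below): a small set of fixed "virtual slots" is remapped on demand onto a much larger "physical" page space, so any high page is reachable, but only through one of the few windows at a time.

<code c>
#include <stdio.h>

#define POOL_SLOTS   4          /* reserved virtual pages near 4g   */
#define PHYS_PAGES   1024       /* pages of "high" physical memory  */

static long slot_to_phys[POOL_SLOTS];   /* -1 = slot unused */

/* "kmap": find a free slot and point it at the given physical page. */
static int toy_kmap(long phys_page)
{
	for (int slot = 0; slot < POOL_SLOTS; slot++)
		if (slot_to_phys[slot] < 0) {
			slot_to_phys[slot] = phys_page;
			return slot;    /* fixed virtual window for this page */
		}
	return -1;                      /* pool exhausted: caller must wait */
}

/* "kunmap": tear the mapping down so the slot can be reused. */
static void toy_kunmap(int slot)
{
	slot_to_phys[slot] = -1;
}

int main(void)
{
	for (int i = 0; i < POOL_SLOTS; i++)
		slot_to_phys[i] = -1;

	/* Any of the PHYS_PAGES "high" pages can be reached, but only
	 * through one of the POOL_SLOTS windows at a time -- exactly the
	 * trade-off the reserved pool makes. */
	int slot = toy_kmap(777);
	printf("high page 777 is reachable through slot %d\n", slot);
	toy_kunmap(slot);
	return 0;
}
</code>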
==== 2.7 Performance ====

Of course there's a performance penalty in accessing the memory above 1 gigabyte, because we have to set up a virt-to-phys mapping and flush away old TLB entries before we can access it. It's a very reasonable penalty to pay in order to handle much more memory. We measured the performance hit in the **worst** case (with a maximal page-fault rate) to be 2%. In real life usage it's not measurable.

This is the testcase used to generate the pathological case (pseudocode in the original, lightly completed here with gettimeofday() so that it compiles):

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define SIZE (700*1024*1024)

int main(void)
{
	char * buf = malloc(SIZE);
	struct timeval start, stop;
	int i;

	if (!buf)
		return 1;

	gettimeofday(&start, NULL);
	/* touch one int every 4096 bytes: one page fault per page */
	for (i = 0; i < SIZE; i += 4096, buf += 4096)
		*(int *) buf = 0;
	gettimeofday(&stop, NULL);

	printf("time %ld usec\n", (stop.tv_sec - start.tv_sec) * 1000000L +
				  (stop.tv_usec - start.tv_usec));
	return 0;
}
</code>

Subtle point: one might think that allocating only 700 Mbyte is not enough to hit the high memory (since the high memory starts above one gigabyte). That's not the case, because the allocator knows when the caller can handle highmem, and if highmem can be handled it prefers to return highmem pages. Anonymous memory can be allocated from highmem, so if there are at least 700 Mbyte of high memory free, the anonymous memory will certainly be allocated from there.

==== 2.8 **HIGHMEM** API ====

To automatically map/unmap (with pagetables) the reserved pool of virtual pages and so access all the physical memory, we created two helper functions called kmap() and kunmap() that deal automagically with setting up and tearing down the virt-to-phys mappings. The code in 2.3.x that uses the reserved pool to access memory above 1 gigabyte looks like this:

<code c>
struct page * page = alloc_page(GFP_HIGHUSER);
unsigned long vaddr = kmap(page);

[..] /* here feel free to use the virtual address `vaddr' to access the physical page `page' */ [..]

kunmap(page); /* at this point `vaddr' is not valid anymore */
</code>

All the code that has to deal with highmem has to issue a kmap() before dereferencing the pointer to the memory. If the page is not in the high zone, the identity (direct) mapping is used; if the page is in the high zone, a page of the highmem pool gets mapped onto the physical page.

----

===== 3. page-LRU =====

==== 3.1 What is the page-LRU? ====

The page-LRU is a list, ordered by how recently its entries were used (//least recently used// order), where all the freeable cache entities get queued. Each time we allocate a cache entity (for example while reading from a file), we link that cache memory into the page-LRU. Later, when we get low on memory, we browse the page-LRU and free and unlink pages in order to allow other allocations to succeed.

==== 3.2 How did the cache recycling work without the page-LRU? ====

Before **2.3.x**, when the system was low on memory the kernel had to browse **all** the physical pages in the system in order to find a freeable page. The browsing was done round-robin using a clock algorithm. //Clock algorithm// means that each attempt to recycle the cache started browsing at the point where the last attempt stopped.

==== 3.3 Problems of the old algorithm ====

The old clock algorithm had a big complexity problem. Suppose there's an application that allocates almost all the physical memory as anonymous user memory (using malloc() for example):

<code>
                   physical memory in the machine
 -----------------------------------------------------------------
 |            memory allocated in userspace              | cache |
 -----------------------------------------------------------------
</code>

The above shows the worst scenario, in which the old clock algorithm would kill the performance of the machine. Suppose there are **10 gigabytes** of physical RAM, that the memory allocated in userspace is **9.5 gigabytes** and that **half a gigabyte** is allocated in cache. Assume a page size of **8 kbytes** (like on the alpha). In such a scenario the old algorithm would waste time, in a not-schedule-friendly kernel loop, walking over the **1245184** pages that are not part of the cache on every pass; only the last **65536** physical pages were part of the cache.

The above scenario would also fool the VM balancing algorithm, causing random swapouts as soon as the clock hand entered the non-freeable userspace part on the left (the old VM balancing algorithm could work well only if the cache was fragmented across the physical space, and that wasn't going to be the case most of the time).
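A minimal userspace sketch of such a clock scan over the page array, using the numbers of the scenario just described (the page_is_freeable_cache() predicate is invented for the illustration, this is not the old kernel code): the cursor persists across calls, and a single call may have to walk past more than a million non-cache pages before finding anything it can free.

<code c>
#include <stdio.h>

#define NR_PAGES  1310720UL	/* 10 gigabytes of RAM with 8 kbyte pages */

/* Toy stand-in for the real check: in the scenario above only the last
 * 65536 physical pages are freeable cache. */
static int page_is_freeable_cache(unsigned long pfn)
{
	return pfn >= NR_PAGES - 65536;
}

static unsigned long clock_hand;	/* persists across calls: the "clock" */

/* Old-style scan: walk the physical page array from where we stopped
 * last time until we hit something freeable. */
static unsigned long clock_scan(unsigned long *skipped)
{
	*skipped = 0;
	for (;;) {
		unsigned long pfn = clock_hand;
		clock_hand = (clock_hand + 1) % NR_PAGES;
		if (page_is_freeable_cache(pfn))
			return pfn;
		(*skipped)++;
	}
}

int main(void)
{
	unsigned long skipped;
	unsigned long pfn = clock_scan(&skipped);

	/* prints: freed page 1245184 after skipping 1245184 non-cache pages */
	printf("freed page %lu after skipping %lu non-cache pages\n",
	       pfn, skipped);
	return 0;
}
</code>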
==== 3.4 Behaviour of the page-LRU ====

With the page-LRU, in the same scenario we only work on the **half gigabyte** of cache and we never waste time trying to free anonymous memory. The page-LRU allows us to reach cache pages in **O(1)**. This also means we no longer rely on the cache being fragmented all over the physical memory to get regular swap behaviour: with the page-LRU the random swapouts caused by long physically contiguous non-cache areas can't happen anymore.

==== 3.5 The page-LRU scales completely in SMP ====

The new page-LRU design is very SMP friendly. The only lock necessary to read or modify the page-LRU is the //pagemap_lru_lock// global spinlock. This means the cache freeing scales completely in SMP, and by using a trick (temporarily moving the non-freeable pages to private local LRU lists) we can let the page stealer run on multiple CPUs at the same time without risking duplicating the work, really scaling the page-freeing work in SMP.

Once we find an interesting page we unlink it and drop the //pagemap_lru_lock//, and the real freeing job is done **without** the **global** //pagemap_lru_lock// held; it is done only with the very fine-grained **per-page** SMP lock (we atomically downgrade the locking from global to per-page to scale, and we later upgrade it again once the freeing work is finished). The //pagemap_lru_lock// is held only while unlinking the page from the list and, later, while reinserting the page into one of the local dispose lists if the freeing didn't succeed.

Due to the lock ordering we can't spin on the per-page lock while we hold the //pagemap_lru_lock//. So, to avoid deadlocks, we only **try** to acquire the per-page lock while we still hold the //pagemap_lru_lock// (there is a fail path and we don't spin). That's what I mean by downgrading the page-LRU SMP lock. If we succeed in acquiring the per-page lock we release the //pagemap_lru_lock// and we slowly try to free the page. Then, once the page freeing is finished, **before** releasing the per-page lock we re-acquire the //pagemap_lru_lock// with a plain spin_lock() operation; we can do that because at that point we are in the correct locking order.
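The locking pattern just described (trylock the inner lock while holding the outer one, then drop the outer lock for the slow work and re-take it at the end) is generic enough to show with POSIX threads; this is only an illustrative sketch of the ordering rules, not the kernel code:

<code c>
#include <pthread.h>
#include <stdio.h>

/* Outer lock: plays the role of pagemap_lru_lock (the global list lock).
 * Inner lock: plays the role of the per-page lock. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

static int try_to_free_one_page(void)
{
	pthread_mutex_lock(&list_lock);

	/* While holding the list lock we may only *try* the page lock:
	 * blocking here could deadlock against code that takes the two
	 * locks in the opposite order. */
	if (pthread_mutex_trylock(&page_lock) != 0) {
		/* fail path: leave the page alone and keep scanning */
		pthread_mutex_unlock(&list_lock);
		return 0;
	}

	/* Downgrade: drop the global lock and do the slow freeing work
	 * protected only by the per-page lock. */
	pthread_mutex_unlock(&list_lock);
	/* ... slow page freeing work would go here ... */

	/* Upgrade again: a plain blocking lock on the list lock is safe
	 * now, because taking list_lock while holding page_lock is the
	 * allowed ordering in this scheme. */
	pthread_mutex_lock(&list_lock);
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&list_lock);
	return 1;
}

int main(void)
{
	printf("freed: %d\n", try_to_free_one_page());
	return 0;
}
</code>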
==== 3.6 New page-LRU based shrink_mmap() ====

This is the page stealer code; the function is called //shrink_mmap()//.

NOTE: the name of the function is a bit misleading, since only non-mapped page cache and unused buffers can be freed by //shrink_mmap()//; //shrink_cache()// would have been a more appropriate name, but I think the name //shrink_mmap()// carries a heavy historical background ;-).

<code c>
int shrink_mmap(int priority, int gfp_mask, zone_t *zone)
{
	int ret = 0, count;
	LIST_HEAD(young);
	LIST_HEAD(old);
	LIST_HEAD(forget);
	struct list_head * page_lru, * dispose;
	struct page * page;

	if (!zone)
		BUG();

	count = nr_lru_pages / (priority+1);

	spin_lock(&pagemap_lru_lock);

	while (count > 0 && (page_lru = zone->lru_cache.prev) != &zone->lru_cache) {
		page = list_entry(page_lru, struct page, lru);
		list_del(page_lru);

		dispose = &zone->lru_cache;
		if (test_and_clear_bit(PG_referenced, &page->flags))
			/* Roll the page at the top of the lru list,
			 * we could also be more aggressive putting
			 * the page in the young-dispose-list, so
			 * avoiding to free young pages in each pass.
			 */
			goto dispose_continue;

		dispose = &old;
		/* don't account passes over not DMA pages */
		if (zone && (!memclass(page->zone, zone)))
			goto dispose_continue;

		count--;

		dispose = &young;
		/* avoid unscalable SMP locking */
		if (!page->buffers && page_count(page) > 1)
			goto dispose_continue;

		if (TryLockPage(page))
			goto dispose_continue;

		/* Release the pagemap_lru lock even if the page is not yet
		   queued in any lru queue since we have just locked down
		   the page so nobody else may SMP race with us running
		   a lru_cache_del() (lru_cache_del() always run with the
		   page locked down ;). */
		spin_unlock(&pagemap_lru_lock);

		/* avoid freeing the page while it's locked */
		get_page(page);

		/* Is it a buffer page? */
		if (page->buffers) {
			if (!try_to_free_buffers(page))
				goto unlock_continue;
			/* page was locked, inode can't go away under us */
			if (!page->mapping) {
				atomic_dec(&buffermem_pages);
				goto made_buffer_progress;
			}
		}

		/* Take the pagecache_lock spinlock held to avoid
		   other tasks to notice the page while we are looking at its
		   page count. If it's a pagecache-page we'll free it
		   in one atomic transaction after checking its page count. */
		spin_lock(&pagecache_lock);

		/*
		 * We can't free pages unless there's just one user
		 * (count == 2 because we added one ourselves above).
		 */
		if (page_count(page) != 2)
			goto cache_unlock_continue;

		/*
		 * Is it a page swap page? If so, we want to
		 * drop it if it is no longer used, even if it
		 * were to be marked referenced..
		 */
		if (PageSwapCache(page)) {
			spin_unlock(&pagecache_lock);
			__delete_from_swap_cache(page);
			goto made_inode_progress;
		}

		/* is it a page-cache page? */
		if (page->mapping) {
			if (!Page_Dirty(page) && !pgcache_under_min()) {
				remove_page_from_inode_queue(page);
				remove_page_from_hash_queue(page);
				page->mapping = NULL;
				spin_unlock(&pagecache_lock);
				goto made_inode_progress;
			}
			goto cache_unlock_continue;
		}

		dispose = &forget;
		printk(KERN_ERR "shrink_mmap: unknown LRU page!\n");

cache_unlock_continue:
		spin_unlock(&pagecache_lock);
unlock_continue:
		spin_lock(&pagemap_lru_lock);
		UnlockPage(page);
		put_page(page);
		list_add(page_lru, dispose);
		continue;

dispose_continue:
		list_add(page_lru, dispose);
	}
	goto out;

made_inode_progress:
	page_cache_release(page);
made_buffer_progress:
	UnlockPage(page);
	put_page(page);
	ret = 1;
	spin_lock(&pagemap_lru_lock);
	/* nr_lru_pages needs the spinlock */
	nr_lru_pages--;

out:
	list_splice(&young, &zone->lru_cache);
	list_splice(&old, zone->lru_cache.prev);

	spin_unlock(&pagemap_lru_lock);

	return ret;
}
</code>

==== 3.7 LRU entry/exit points ====

So far we have examined only the cache stealer, which runs as soon as the system gets low on memory. That's the most interesting part of the page-LRU. Of course there are also a few entry/exit points for the page-LRU: they are where we allocate cache and where we destroy cache (for example, while destroying an inode we drop the page cache related to that inode and so we also have to unlink that cache from the page-LRU). These entry/exit points are implemented by the //lru_cache_add()// and //lru_cache_del()// macros.

<code c>
extern spinlock_t pagemap_lru_lock;

/*
 * Helper macros for lru_pages handling.
 */
#define	lru_cache_add(page)					\
do {								\
	spin_lock(&pagemap_lru_lock);				\
	list_add(&(page)->lru, &page->zone->lru_cache);		\
	nr_lru_pages++;						\
	spin_unlock(&pagemap_lru_lock);				\
} while (0)

#define	lru_cache_del(page)					\
do {								\
	if (!PageLocked(page))					\
		BUG();						\
	spin_lock(&pagemap_lru_lock);				\
	list_del(&(page)->lru);					\
	nr_lru_pages--;						\
	spin_unlock(&pagemap_lru_lock);				\
} while (0)
</code>

The above macros are basically the only additional fixed cost that the new design requires.

==== 3.8 LRU aging ====

A normal LRU list implicitly assigns an age to its entries purely as a function of their position in the list. That's not entirely true in the case of the page-LRU: we keep using a //referenced// bit as additional aging information about each LRU entry. We do that because we want to avoid having to roll (modify) the LRU list on each cache hit (each time we search the cache and find the interesting information there). In the cache-hit case we want to be fast, so we only set the referenced bit in order to signal shrink_mmap() that the page is young and that it shouldn't be freed before the referenced bit gets cleared. Later, when shrink_mmap() finds the page at the bottom of the LRU, it clears the referenced bit and rolls the referenced page back to the head of the LRU list, at a much lower cost. This definitely improves performance and it still provides reasonably precise aging.
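A minimal userspace sketch of this second-chance scheme (toy structures, not the kernel's): the cache-hit path only flips a bit, and only the scanner pays the cost of moving list entries.

<code c>
#include <stdio.h>
#include <string.h>

/* Toy LRU: index 0 is the head (young side), N-1 the tail (old side). */
#define N 8

struct toy_page {
	int id;
	int referenced;		/* plays the role of PG_referenced */
};

static struct toy_page lru[N];

/* Cache hit: O(1), no list manipulation at all. */
static void toy_mark_accessed(struct toy_page *page)
{
	page->referenced = 1;
}

/* Scanner: look at the tail; a referenced page gets a second chance
 * (bit cleared, rolled back to the head), an unreferenced one is reclaimed. */
static int toy_shrink_one(void)
{
	struct toy_page victim = lru[N - 1];

	if (victim.referenced) {
		victim.referenced = 0;
		memmove(&lru[1], &lru[0], (N - 1) * sizeof(lru[0]));
		lru[0] = victim;
		return -1;	/* nothing freed this pass */
	}
	return victim.id;	/* reclaim this page */
}

int main(void)
{
	for (int i = 0; i < N; i++)
		lru[i] = (struct toy_page){ .id = i, .referenced = 0 };

	toy_mark_accessed(&lru[N - 1]);				/* hit on the oldest page */
	printf("first pass reclaims: %d\n", toy_shrink_one());	/* -1: second chance   */
	printf("second pass reclaims: %d\n", toy_shrink_one());/* reclaims page id 6  */
	return 0;
}
</code>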
----

===== 4. List of some other 2.3.x features not examined in this document =====

Certainly there are many more; these are just the first important ones that came to mind :-).

  * fine-grained SMP locking in the page cache, VM and networking (major feature)
  * better SMP scheduler
  * SMP IRQ affinity
  * alpha IRQ rewrite for SMP IRQ scaling
  * LVM (with snapshotting)
  * dynamic inode allocation
  * better dirty cache flushing algorithm
  * several new IDE drivers for new IDE chipsets
  * fully featured USB
  * RAWIO (for allowing userspace to handle distributed caches)
  * IA64 architecture support
  * wake-one wakeups in semaphores, accept(2) (even if for accept(2) it should definitely be a LIFO wakeup, while it's a FIFO wakeup!!!) and in the I/O get_request path
  * LFS (>2 gigabyte file sizes on 32-bit archs)
  * PCI MMU support (PCI DMA above 4 gigabytes of physical RAM)
  * tasklets/softnet (SMP-scalable equivalent of the bottom half handlers)
  * filesystem writes go through the page cache
  * more powerful and cleaner PCI/IO-resources code
  * infrastructure for hotplug PCI
  * zone allocator reimplemented in an object-oriented way
  * NUMA support (avoids wasting memory in the mem_map_t with memory holes)
  * better IA32 memory detection
  * netfilter
  * ATM support
  * CardBus and PCMCIA support in the kernel
  * shmfs namespace
  * Athlon 3DNow! support
  * several new drivers

----