C++博客 (C++ Blog) - JonsenElizee - Category: Linux.Basic

High Memory In The Linux Kernel

JonsenElizee — Sat, 22 Jan 2011 02:44:00 GMT

Feature: High Memory In The Linux Kernel

on February 21, 2004 - 6:02am

As RAM increasingly becomes a commodity, prices drop and computer users are able to buy more. 32-bit architectures face certain limitations with regard to accessing these growing amounts of RAM. To better understand the problem and the various solutions, we begin with an overview of Linux memory management. With a grasp of how basic memory management works, we can define the problem more precisely and then review the various solutions.

This article was written by examining the Linux 2.6 kernel source code for the x86 architecture types.

Overview of Linux memory management

32-bit architectures can reference 4 GB of physical memory (2^32). Processors that have an MMU (Memory Management Unit) support the concept of virtual memory: page tables are set up by the kernel which map "virtual addresses" to "physical addresses"; this basically means that each process can access 4 GB of memory, thinking it's the only process running on the machine (much like multi-tasking, in which each process is made to think that it's the only process executing on a CPU).

The virtual address to physical address mappings are done by the kernel. When a new process is "fork()"ed, the kernel creates a new set of page tables for the process. The addresses referenced within a process in user-space are virtual addresses. They do not necessarily map directly to the same physical address. The virtual address is passed to the MMU (Memory Management Unit of the processor) which converts it to the proper physical address based on the tables set up by the kernel. Hence, two processes can refer to memory address 0x08329, but they would refer to two different locations in memory.

The Linux kernel splits the 4 GB virtual address space of a process in two parts: 3 GB and 1 GB. The lower 3 GB of the process virtual address space is accessible as the user-space virtual addresses and the upper 1 GB space is reserved for the kernel virtual addresses. This is true for all processes.

							      
      +------------+ 4 GB
      |            |
      |   Kernel   |
      |   Virtual  |                +------------+
      |   Space    |                |    High    |
      |   (1 GB)   |                |   Memory   |
      |            |                |  (unused)  |
      +------------+ 3 GB           +------------+ 1 GB
      |            |                |            |
      |            |                |   Kernel   |
      | User-space |                |  Physical  |
      |   Virtual  |                |   Space    |
      |   Space    |                |            |
      |   (3 GB)   |                +------------+ 0 GB
      |            |
      |            |                   Physical
      |            |                    Memory
      +------------+ 0 GB

         Virtual
         Memory

The kernel virtual area (3 - 4 GB address space) maps to the first 1 GB of physical RAM. The 3 GB addressable RAM available to each process is mapped to the available physical RAM.

The Problem

So the basic problem is that the kernel has only 1 GB of virtual address space, which can translate to a maximum of 1 GB of physical memory. This is because the kernel directly maps its entire virtual address range onto the available physical memory.

Solutions

There are some solutions which address this problem:

1. 2G / 2G, 1G / 3G split
2. HIGHMEM solution for using up to 4 GB of memory
3. HIGHMEM solution for using up to 64 GB of memory

1. 2G / 2G, 1G / 3G split

Instead of splitting the virtual address space the traditional way of 3G / 1G (3 GB for user-space, 1 GB for kernel space), third-party patches exist to split the virtual address space 2G / 2G or 1G / 3G. The 1G / 3G split is a bit extreme in that you can map up to 3 GB of physical memory, but user-space applications cannot grow beyond 1 GB. It could work for simple applications; but if one has more than 3 GB of physical RAM, he / she won't run simple applications on it, right?

The 2G / 2G split seems to be a balanced approach to using more than 1 GB of RAM without the HIGHMEM patches. However, server applications like databases always want as much virtual address space as possible, so this approach may not work in those scenarios.

There's a patch for 2.4.23 that includes a config-time option of selecting the user / kernel split values by Andrea Arcangeli. It is available at his kernel page. It's a simple patch and making it work on 2.6 should not be too difficult.

Before looking at solutions 2 & 3, let's take a look at some more Linux Memory Management issues.

Zones

In Linux, the memory available from all banks is classified into "nodes". These nodes indicate how much memory each bank has. This classification is mainly useful for NUMA architectures, but it's also used for UMA architectures, where the number of nodes is just 1.

Memory in each node is divided into "zones". The zones currently defined are ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM.

ZONE_DMA is used by some devices for data transfer and is mapped in the lower physical memory range (up to 16 MB).

Memory in the ZONE_NORMAL region is mapped by the kernel in the upper region of the linear address space. Most operations can only take place in ZONE_NORMAL; so this is the most performance critical zone. ZONE_NORMAL goes from 16 MB to 896 MB.

To address memory above ZONE_NORMAL, the kernel has to map pages from high memory (ZONE_HIGHMEM) into its own virtual address space.

Part of the kernel virtual address space is reserved for kernel data structures that store information about the memory map and page tables, and for mappings other than the direct one. On x86 this area is 128 MB. Hence, of the 1 GB of address space the kernel can use, 128 MB is reserved, which means that the kernel virtual addresses in this 128 MB are not directly mapped to physical memory. This leaves a maximum of 896 MB for ZONE_NORMAL. So, even if one has 1 GB of physical RAM, just 896 MB will actually be available unless HIGHMEM support is enabled.

Back to the solutions:

2. HIGHMEM solution for using up to 4 GB of memory

Since the kernel can't access memory that hasn't been mapped into its address space, physical pages beyond the directly mapped region have to be mapped into the kernel virtual address space before they can be used. This means that pages in ZONE_HIGHMEM must be given a kernel mapping first; unlike pages in ZONE_NORMAL, they do not have one permanently.

The reserved space which we talked about earlier (in case of x86, 128 MB) has an area in which pages from high memory are mapped into the kernel address space.

To create a permanent mapping, the "kmap" function is used. Since this function may sleep, it must not be used in interrupt context. Since the number of permanent mappings is limited (if not, we could've directly mapped all the high memory in the address space), pages mapped this way should be "kunmap"ped when no longer needed.

Temporary mappings can be created via "kmap_atomic". This function doesn't block, so it can be used in interrupt context. "kunmap_atomic" un-maps the mapped high memory page. A temporary mapping remains valid only until the next temporary mapping is made. Moreover, since the mapping and un-mapping functions also disable / enable preemption, it's a bug to not kunmap_atomic a page mapped via kmap_atomic.

3. HIGHMEM solution for using up to 64 GB of memory

This is enabled via the PAE (Physical Address Extension) extension of the PentiumPro processors. PAE addresses the 4 GB physical memory limitation and is seen as Intel's answer to AMD 64-bit and AMD x86-64. PAE allows processors to access physical memory up to 64 GB (36 bits of address bus). However, since the virtual address space is just 32 bits wide, each process can't grow beyond 4 GB. The mechanism used to access memory from 4 GB to 64 GB is essentially the same as that of accessing the 1 GB - 4 GB RAM via the HIGHMEM solution discussed above.

Should I enable CONFIG_HIGHMEM for my 1 GB RAM system?

It is advisable not to enable CONFIG_HIGHMEM merely to gain the extra 128 MB on a system with 1 GB of RAM. I/O devices cannot directly address high memory from PCI space, so bounce buffers have to be used. In addition, the extra mappings add virtual memory management and paging costs. For details on bounce buffers, refer to Mel Gorman's documentation (listed below).

For more information, see

Andrea Arcangeli's article on the original HIGHMEM patches in Linux 2.3
Mel Gorman's VM Documentation for the 2.4 Linux Memory Management subsystem
Linux kernel sources


Linux lib

JonsenElizee — Thu, 30 Sep 2010 18:10:00 GMT

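How does a Linux library work?

Dynamic loading with Linux

Instead of Linux automatically loading and linking libraries for a given program, it is possible to share this control with the application itself. This process is called dynamic loading. With dynamic loading, the application can specify a particular library to load and then use that library as an executable (that is, call functions within it). The shared library used for dynamic loading is no different from a standard shared library (an ELF shared object). In fact, the ld-linux dynamic linker, as the ELF loader and interpreter, is still involved in this process.

The Dynamic Loading (DL) API exists for dynamic loading and allows a shared library to be made available to a user-space program. Although small, the API provides everything that is needed, with much of the hard work done behind the scenes. Table 1 shows the complete API.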

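Table 1. The DL API

Function   Description
dlopen     Makes an object file accessible to a program
dlsym      Obtains the address of a symbol within a dlopen-ed object file
dlerror    Returns a string describing the last error that occurred
dlclose    Closes an object file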

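The process begins with a call to dlopen, providing the file object to access and the mode. The result of calling dlopen is a handle to the object that is used later. The mode argument tells the dynamic linker when to perform relocations. There are two possible values. The first is RTLD_NOW, which indicates that the dynamic linker will complete all necessary relocations at the time dlopen is called. The second is RTLD_LAZY, which performs relocations only when they are needed. This is done internally by redirecting all requests that have not yet been relocated through the dynamic linker. In this way, the dynamic linker knows at request time when a new reference is occurring, and the relocation can proceed as usual. Subsequent calls do not need to repeat the relocation.

Two further modes can be chosen and bitwise OR'ed into the mode argument. RTLD_LOCAL indicates that the symbols of the loaded shared object will not be made available for relocation processing by any other object. If you do want those symbols to be available to other objects (for example, so that shared objects loaded later can resolve against them), use RTLD_GLOBAL instead.

The dlopen function also automatically resolves dependencies among shared libraries. If you open an object that depends on other shared libraries, it loads them automatically. The function returns a handle that is used for subsequent calls to the API. The prototype for dlopen is:

#include <dlfcn.h>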

void *dlopen( const char *file, int mode );

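With a handle to the ELF object, you can identify the address of a symbol within that object by calling dlsym. The function takes a symbol name, such as the name of a function contained within the object. The return value is the resolved address of the symbol in the object:

void *dlsym( void *restrict handle, const char *restrict name );

If an error occurs during a call to the API, you can use the dlerror function to return a human-readable string representing the error. The function takes no arguments and returns a string if an error occurred earlier, or NULL if no error occurred:

char *dlerror( void );

Finally, when no further calls into the shared object are needed, the application can call dlclose to tell the operating system that the handle and object references are no longer required. This is fully reference counted, so multiple users of the same shared object do not conflict with one another (the object stays in memory as long as at least one user is still using it). Any symbols resolved through dlsym for the closed object are no longer available. Note that dlclose returns an int (0 on success), not a string:

int dlclose( void *handle );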


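Dynamic loading example

Now that you've seen the API, let's look at an example of the DL API. In this application, you essentially implement a shell that lets an operator specify a library, a function, and an argument. In other words, the user can name a library and call an arbitrary function within that library (one not previously linked into the application). The DL API is first used to resolve the function in the library, which is then called with the user-defined argument and its result printed. Listing 2 shows the complete application.

Listing 2. A shell that uses the DL API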

	
#include <stdio.h>
#include <dlfcn.h>
#include <string.h>

#define MAX_STRING      80


void invoke_method( char *lib, char *method, float argument )
{
  void *dl_handle;
  float (*func)(float);
  char *error;

  /* Open the shared object */
  dl_handle = dlopen( lib, RTLD_LAZY );
  if (!dl_handle) {
    printf( "!!! %s\n", dlerror() );
    return;
  }

  /* Clear any stale error, then resolve the symbol (method) from the object */
  dlerror();
  func = (float (*)(float)) dlsym( dl_handle, method );
  error = dlerror();
  if (error != NULL) {
    printf( "!!! %s\n", error );
    dlclose( dl_handle );   /* do not leak the handle on failure */
    return;
  }

  /* Call the resolved method and print the result */
  printf("  %f\n", (*func)(argument) );

  /* Close the object */
  dlclose( dl_handle );

  return;
}


int main( int argc, char *argv[] )
{
  char line[MAX_STRING+1];
  char lib[MAX_STRING+1];
  char method[MAX_STRING+1];
  float argument;

  (void)argc; (void)argv;   /* unused */

  while (1) {

    printf("> ");
    fflush(stdout);

    /* Stop on end-of-file or read error */
    if (!fgets( line, sizeof(line), stdin )) break;

    if (!strncmp(line, "bye", 3)) break;

    /* Expect three fields; the width 80 matches MAX_STRING */
    if (sscanf( line, "%80s %80s %f", lib, method, &argument ) != 3) {
      printf( "!!! expected: library method argument\n" );
      continue;
    }

    invoke_method( lib, method, argument );

  }

  return 0;
}

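To build this application, use the following compile line with the GNU Compiler Collection (GCC). The -rdynamic option tells the linker to add all symbols to the dynamic symbol table (to permit backtraces with the use of dlopen). The -ldl option indicates that libdl should be linked into the program.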

gcc -rdynamic -o dl dl.c -ldl

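Returning to Listing 2, the main function serves simply as an interpreter that parses three arguments from the input line (library name, function name, and a floating-point argument). If bye is entered, the application exits. Otherwise, the three arguments are passed to the invoke_method function, which uses the DL API.

First, dlopen is called to gain access to the object file. If a NULL handle is returned, the object could not be found and the process ends. Otherwise, you get a handle to the object that can be used to query it further. The dlsym API function is then used to try to resolve a symbol in the newly opened object file. You get either a valid pointer to the symbol or NULL together with an error.

Once the symbol is resolved in the ELF object, the next step is simply to call the function. Note the difference between this code and ordinary dynamic linking. In this example, you force the address of the symbol in the object file into use as a function pointer and then call it. With dynamic linking, the object name is used as a function, and the dynamic linker makes sure the symbol points to the correct location. Although the dynamic linker can do all the messy work for you, this approach lets you build extremely dynamic applications that can be extended at run time.

After calling the target function in the ELF object, access to the object is closed by calling dlclose.

Listing 3 shows an example of how to use this test program. In this example, the program is first compiled and then executed. It then invokes a few functions from the math library (libm.so). As the session demonstrates, the program can now call arbitrary functions within a shared object (library) using dynamic loading. This is a powerful capability, and it also makes it possible to extend a program with new functionality.

Listing 3. Using the simple program to invoke library functions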

	
mtj@camus:~/dl$ gcc -rdynamic -o dl dl.c -ldl
mtj@camus:~/dl$ ./dl
> libm.so cosf 0.0
  1.000000
> libm.so sinf 0.0
  0.000000
> libm.so tanf 1.0
  1.557408
> bye
mtj@camus:~/dl$


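Tools

Linux provides a number of tools for viewing and parsing ELF objects (including shared libraries). One of the most useful is the ldd command, which you can use to print shared library dependencies. For example, running ldd on the dl application shows the following: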

mtj@camus:~/dl$ ldd dl
        linux-gate.so.1 =>  (0xffffe000)
        libdl.so.2 => /lib/tls/i686/cmov/libdl.so.2 (0xb7fdb000)
        libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7eac000)
        /lib/ld-linux.so.2 (0xb7fe7000)
mtj@camus:~/dl$

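What ldd tells you is that this ELF image depends on linux-gate.so (a special shared object that handles system calls and has no associated file in the file system), libdl.so (the DL API), the GNU C library (libc.so), and the Linux dynamic loader (because the image has shared library dependencies).

The readelf command is a full-featured utility that lets you parse and read ELF objects. One interesting use of readelf is to identify the relocatable items within an object. For our simple program (shown in Listing 2), you can see the symbols that require relocation: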

mtj@camus:~/dl$ readelf -r dl

Relocation section '.rel.dyn' at offset 0x520 contains 2 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
08049a3c  00001806 R_386_GLOB_DAT    00000000   __gmon_start__
08049a78  00001405 R_386_COPY        08049a78   stdin

Relocation section '.rel.plt' at offset 0x530 contains 8 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
08049a4c  00000207 R_386_JUMP_SLOT   00000000   dlsym
08049a50  00000607 R_386_JUMP_SLOT   00000000   fgets
08049a54  00000b07 R_386_JUMP_SLOT   00000000   dlerror
08049a58  00000c07 R_386_JUMP_SLOT   00000000   __libc_start_main
08049a5c  00000e07 R_386_JUMP_SLOT   00000000   printf
08049a60  00001007 R_386_JUMP_SLOT   00000000   dlclose
08049a64  00001107 R_386_JUMP_SLOT   00000000   sscanf
08049a68  00001907 R_386_JUMP_SLOT   00000000   dlopen
mtj@camus:~/dl$

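From this list, you can see the various C library calls that require relocation (to libc.so), including the calls to the DL API (libdl.so). The function __libc_start_main is a C library function that is called before the program's main function (a shell that provides the necessary initialization).

Other utilities that operate on object files include objdump, which displays information about object files, and nm, which lists the symbols from an object file (including debugging information). You can also start an image manually by invoking the Linux dynamic linker directly with the ELF program as its argument: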

mtj@camus:~/dl$ /lib/ld-linux.so.2 ./dl
> libm.so expf 0.0
  1.000000
>

Additionally, you can use the --list option of ld-linux.so to list the dependencies of an ELF image (as the ldd command does). Remember that it is just a user-space program that the kernel bootstraps when needed.