﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++ Blog - beautykingdom - Category: OS</title><link>http://www.cppblog.com/beautykingdom/category/7623.html</link><description /><language>zh-cn</language><lastBuildDate>Sat, 13 Oct 2012 12:05:15 GMT</lastBuildDate><pubDate>Sat, 13 Oct 2012 12:05:15 GMT</pubDate><ttl>60</ttl><item><title>A Complete Anatomy of memcached: Tutorial Series (reposted)</title><link>http://www.cppblog.com/beautykingdom/archive/2012/06/16/179023.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sat, 16 Jun 2012 01:17:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/06/16/179023.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/179023.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/06/16/179023.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/179023.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/179023.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Abstract: A Complete Anatomy of memcached tutorial series &#8211; 1. memcached basics. What is memcached? memcached is software originally developed by a team led by Brad Fitzpatrick at Danga Interactive, the company behind LiveJournal. It is now used by Douban, Facebook, Vox, and many...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2012/06/16/179023.html'>Read the full article</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/179023.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-06-16 09:17 <a href="http://www.cppblog.com/beautykingdom/archive/2012/06/16/179023.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Deadlock and Livelock</title><link>http://www.cppblog.com/beautykingdom/archive/2012/06/08/178097.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 08 Jun 2012 09:15:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/06/08/178097.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/178097.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/06/08/178097.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/178097.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/178097.html</trackback:ping><description><![CDATA[<div class="content" sizset="2" sizcache="13"><pre id="best-answer-content" class="reply-text mb10">I. Livelock 
Suppose transaction T1 has locked data item R and transaction T2 then requests a lock on R, so T2 waits. T3 also requests a lock on R. When T1 releases its lock on R, the system grants T3's request first, and T2 goes on waiting. T4 then requests a lock on R, and when T3 releases its lock the system grants T4's request, and so on; T2 may end up waiting forever. This is livelock. A simple way to avoid livelock is to adopt a first-come, first-served policy.</pre><pre class="reply-text mb10">
II. Deadlock 
Suppose transaction T1 has locked data item R1 and T2 has locked data item R2. T1 then requests a lock on R2; because T2 already holds it, T1 waits for T2 to release R2. T2 then requests a lock on R1; because T1 already holds it, T2 can only wait for T1 to release R1. Now T1 is waiting for T2 while T2 is waiting for T1, and neither transaction can ever finish. This is deadlock. 
1. Deadlock prevention
In a database, deadlock arises when two or more transactions have each locked some data objects and then each requests locks on objects already locked by the others, so that all of them wait indefinitely. Preventing deadlock amounts to breaking the conditions that produce it. There are two common prevention methods: 
&#9312; One-shot locking  
One-shot locking requires every transaction to acquire, in a single step, locks on all the data it will use; otherwise it may not proceed.

One-shot locking does prevent deadlock effectively, but it has a drawback: locking all the data that will be needed later in a single step inevitably enlarges the scope of locking and therefore reduces the system's concurrency.
&#9313; Ordered locking 
Ordered locking prescribes a locking order on data objects in advance; every transaction acquires its locks in that order.

Ordered locking also prevents deadlock effectively, but it too has problems. A transaction's lock requests may be determined dynamically as it executes, so it is hard to know in advance which objects each transaction will lock, and therefore hard to acquire the locks in the prescribed order.
 
Clearly, the deadlock-prevention strategies widely used in operating systems are not a good fit for the characteristics of databases, so in practice DBMSs generally handle deadlock by detecting it and then resolving it.

 2. Deadlock detection and resolution
 
&#9312; The timeout method

 If a transaction has been waiting longer than a prescribed time limit, the system assumes a deadlock has occurred. The timeout method is simple to implement, but its shortcomings are obvious. First, it can misdiagnose deadlock: a transaction whose wait exceeds the limit for some other reason will be treated as deadlocked. Second, if the limit is set too long, a real deadlock will not be detected promptly.
 
&#9313; The wait-for graph method
 
A transaction wait-for graph is a directed graph G = (T, U). T is the set of nodes, each node representing a running transaction; U is the set of edges, each edge representing a wait relationship: if T1 is waiting for T2, draw a directed edge from T1 to T2. The wait-for graph dynamically reflects the waiting relationships among all transactions. The concurrency-control subsystem examines the graph periodically (say, once a minute); if the graph contains a cycle, the system is deadlocked.
 
Once the DBMS's concurrency-control subsystem detects a deadlock, it must resolve it. The usual approach is to pick the transaction whose abort costs least, roll it back, and release all the locks it holds so that the other transactions can continue. Naturally, the data modifications performed by the aborted transaction must then be undone.</pre></div><img src ="http://www.cppblog.com/beautykingdom/aggbug/178097.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-06-08 17:15 <a href="http://www.cppblog.com/beautykingdom/archive/2012/06/08/178097.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Build Your Own Memory Manager for C/C++ Projects&lt;forward&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2012/05/26/176285.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sat, 26 May 2012 14:41:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/05/26/176285.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/176285.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/05/26/176285.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/176285.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/176285.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Abstract: Arpan Sen, Technical Lead, Synapti Computer Aided Design Pvt Ltd; Rahul Kumar Kardam&nbsp;(rahul@syncad.com), Senior Software Engineer, Synapti Computer Aided Design Pvt Ltd. Introduction:&nbsp; Performance optimization of code is a very important task. One often sees functionally correct code, written in C or C++, that...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2012/05/26/176285.html'>Read the full article</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/176285.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-05-26 22:41 <a 
href="http://www.cppblog.com/beautykingdom/archive/2012/05/26/176285.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>TCMalloc : Thread-Caching Malloc</title><link>http://www.cppblog.com/beautykingdom/archive/2012/04/04/170077.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 04 Apr 2012 13:15:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/04/04/170077.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/170077.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/04/04/170077.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/170077.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/170077.html</trackback:ping><description><![CDATA[<h1 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">from:&nbsp;</h1>
<h1 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><a href="http://goog-perftools.sourceforge.net/doc/tcmalloc.html"><span style="font-size: 12pt">http://goog-perftools.sourceforge.net/doc/tcmalloc.html</span></a></h1>
<p style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">&nbsp;</p>
<h1 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc : Thread-Caching Malloc</h1>
<address style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); font-size: medium; font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Sanjay Ghemawat, Paul Menage &lt;opensource@google.com&gt;</address>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Motivation</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc is faster than the glibc 2.3 malloc (available as a separate library called ptmalloc2) and other mallocs that I have tested. ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair. Speed is important for a malloc implementation because if malloc is not fast enough, application writers are inclined to write their own custom free lists on top of malloc. This can lead to extra complexity, and more memory usage unless the application writer is very careful to appropriately size the free lists and scavenge idle objects out of the free list</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc also reduces lock contention for multi-threaded programs. For small objects, there is virtually zero contention. For large objects, TCMalloc tries to use fine grained and efficient spinlocks. ptmalloc2 also reduces lock contention by using per-thread arenas but there is a big problem with ptmalloc2's use of per-thread arenas. In ptmalloc2 memory can never move from one arena to another. This can lead to huge amounts of wasted space. For example, in one Google application, the first phase would allocate approximately 300MB of memory for its data structures. When the first phase finished, a second phase would be started in the same address space. If this second phase was assigned a different arena than the one used by the first phase, this phase would not reuse any of the memory left after the first phase and would add another 300MB to the address space. Similar memory blowup problems were also noticed in other applications.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Another benefit of TCMalloc is space-efficient representation of small objects. For example, N 8-byte objects can be allocated while using space approximately<span class="Apple-converted-space">&nbsp;</span><code>8N * 1.01</code><span class="Apple-converted-space">&nbsp;</span>bytes. I.e., a one-percent space overhead. ptmalloc2 uses a four-byte header for each object and (I think) rounds up the size to a multiple of 8 bytes and ends up using<span class="Apple-converted-space">&nbsp;</span><code>16N</code><span class="Apple-converted-space">&nbsp;</span>bytes.</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Usage</h2>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">To use TCMalloc, just link tcmalloc into your application via the "-ltcmalloc" linker flag.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">You can use tcmalloc in applications you didn't compile yourself, by using LD_PRELOAD:</p><pre style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">   $ LD_PRELOAD="/usr/lib/libtcmalloc.so" &lt;binary&gt;
</pre>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">LD_PRELOAD is tricky, and we don't necessarily recommend this mode of usage.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc includes a<span class="Apple-converted-space">&nbsp;</span><a href="http://goog-perftools.sourceforge.net/doc/heap_checker.html">heap checker</a><span class="Apple-converted-space">&nbsp;</span>and<span class="Apple-converted-space">&nbsp;</span><a href="http://goog-perftools.sourceforge.net/doc/heap_profiler.html">heap profiler</a><span class="Apple-converted-space">&nbsp;</span>as well.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">If you'd rather link in a version of TCMalloc that does not include the heap profiler and checker (perhaps to reduce binary size for a static binary), you can link in<span class="Apple-converted-space">&nbsp;</span><code>libtcmalloc_minimal</code><span class="Apple-converted-space">&nbsp;</span>instead.</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Overview</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.</span> 
<center style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><img src="http://goog-perftools.sourceforge.net/doc/overview.gif"  alt="" /></center>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc treats objects with size &lt;= 32K ("small" objects) differently from larger objects. Large objects are allocated directly from the central heap using a page-level allocator (a page is a 4K aligned region of memory). I.e., a large object is always page-aligned and occupies an integral number of pages.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A run of pages can be carved up into a sequence of small objects, each equally sized. For example a run of one page (4K) can be carved up into 32 objects of size 128 bytes each.</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Small Object Allocation</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Each small object size maps to one of approximately 170 allocatable size-classes. For example, all allocations in the range 961 to 1024 bytes are rounded up to 1024. The size-classes are spaced so that small sizes are separated by 8 bytes, larger sizes by 16 bytes, even larger sizes by 32 bytes, and so forth. The maximal spacing (for sizes &gt;= ~2K) is 256 bytes.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A thread cache contains a singly linked list of free objects per size-class.</p>
<center style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><img src="http://goog-perftools.sourceforge.net/doc/threadheap.gif"  alt="" /></center><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">When allocating a small object: (1) We map its size to the corresponding size-class. (2) Look in the corresponding free list in the thread cache for the current thread. (3) If the free list is not empty, we remove the first object from the list and return it. When following this fast path, TCMalloc acquires no locks at all. This helps speed-up allocation significantly because a lock/unlock pair takes approximately 100 nanoseconds on a 2.8 GHz Xeon.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the application.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">If the central free list is also empty: (1) We allocate a run of pages from the central page allocator. (2) Split the run into a set of objects of this size-class. (3) Place the new objects on the central free list. (4) As before, move some of these objects to the thread-local free list.</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Large Object Allocation</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A large object size (&gt; 32K) is rounded up to a page size (4K) and is handled by a central page heap. The central page heap is again an array of free lists. For<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">i &lt; 256</code><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">, the<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">k</code><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: 
none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">th entry is a free list of runs that consist of<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">k</code><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><span class="Apple-converted-space">&nbsp;</span>pages. The<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">256</code><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">th entry is a free list of runs that have length<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">&gt;= 256</code><span style="widows: 2; text-transform: 
none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><span class="Apple-converted-space">&nbsp;</span>pages:</span> 
<center style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><img src="http://goog-perftools.sourceforge.net/doc/pageheap.gif"  alt="" /></center>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">An allocation for<span class="Apple-converted-space">&nbsp;</span><code>k</code><span class="Apple-converted-space">&nbsp;</span>pages is satisfied by looking in the<span class="Apple-converted-space">&nbsp;</span><code>k</code>th free list. If that free list is empty, we look in the next free list, and so forth. Eventually, we look in the last free list if necessary. If that fails, we fetch memory from the system (using sbrk, mmap, or by mapping in portions of /dev/mem).</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">If an allocation for<span class="Apple-converted-space">&nbsp;</span><code>k</code><span class="Apple-converted-space">&nbsp;</span>pages is satisfied by a run of pages of length &gt;<span class="Apple-converted-space">&nbsp;</span><code>k</code>, the remainder of the run is re-inserted back into the appropriate free list in the page heap.</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Spans</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">The heap managed by TCMalloc consists of a set of pages. A run of contiguous pages is represented by a<span class="Apple-converted-space">&nbsp;</span></span><code style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; white-space: normal; orphans: 2; color: rgb(0,0,0); font-weight: normal; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Span</code><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><span class="Apple-converted-space">&nbsp;</span>object. 
A span can either be<span class="Apple-converted-space">&nbsp;</span></span><em style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: red; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">allocated</em><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">, or<span class="Apple-converted-space">&nbsp;</span></span><em style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: red; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">free</em><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">. If free, the span is one of the entries in a page heap linked-list. If allocated, it is either a large object that has been handed off to the application, or a run of pages that have been split up into a sequence of small objects. If split into small objects, the size-class of the objects is recorded in the span.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A central array indexed by page number can be used to find the span to which a page belongs. For example, span<span class="Apple-converted-space">&nbsp;</span><em style="font-style: normal; color: red">a</em><span class="Apple-converted-space">&nbsp;</span>below occupies 2 pages, span<span class="Apple-converted-space">&nbsp;</span><em style="font-style: normal; color: red">b</em><span class="Apple-converted-space">&nbsp;</span>occupies 1 page, span<span class="Apple-converted-space">&nbsp;</span><em style="font-style: normal; color: red">c</em><span class="Apple-converted-space">&nbsp;</span>occupies 5 pages and span<span class="Apple-converted-space">&nbsp;</span><em style="font-style: normal; color: red">d</em><span class="Apple-converted-space">&nbsp;</span>occupies 3 pages.</p>
<center style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><img src="http://goog-perftools.sourceforge.net/doc/spanmap.gif"  alt="" /></center><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A 32-bit address space can fit 2^20 4K pages, so this central array takes 4MB of space, which seems acceptable. On 64-bit machines, we use a 3-level radix tree instead of an array to map from a page number to the corresponding span pointer.</span> 
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Deallocation</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">When an object is deallocated, we compute its page number and look it up in the central array to find the corresponding span object. The span tells us whether or not the object is small, and its size-class if it is small. If the object is small, we insert it into the appropriate free list in the current thread's thread cache. If the thread cache now exceeds a predetermined size (2MB by default), we run a garbage collector that moves unused objects from the thread cache into central free lists.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">If the object is large, the span tells us the range of pages covered by the object. Suppose this range is<span class="Apple-converted-space">&nbsp;</span><code>[p,q]</code>. We also look up the spans for pages<span class="Apple-converted-space">&nbsp;</span><code>p-1</code><span class="Apple-converted-space">&nbsp;</span>and<span class="Apple-converted-space">&nbsp;</span><code>q+1</code>. If either of these neighboring spans is free, we coalesce it with the<span class="Apple-converted-space">&nbsp;</span><code>[p,q]</code><span class="Apple-converted-space">&nbsp;</span>span. The resulting span is inserted into the appropriate free list in the page heap.</p>
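The neighbor-coalescing step can be sketched like this. The sketch uses invented names (`Span`, `FreeSpan`, a `std::map` standing in for the pagemap) and omits the final insertion into the appropriate page-heap free list:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Illustrative sketch only. A span is a run of pages [first, last]; free
// spans would live on a page-heap free list (modelled here as a flag).
struct Span {
    std::size_t first, last;
    bool free;
};

// Pagemap stand-in: every page of a span maps to the same Span object.
using PageMap = std::map<std::size_t, Span*>;

// Free the span covering [p,q]: look up the spans for pages p-1 and q+1
// and merge with whichever neighbors are free, as described in the text.
Span* FreeSpan(PageMap& pm, Span* s) {
    s->free = true;
    auto it = pm.find(s->first - 1);
    if (it != pm.end() && it->second->free) {   // merge the left neighbor
        Span* left = it->second;
        left->last = s->last;
        for (std::size_t p = s->first; p <= s->last; ++p) pm[p] = left;
        s = left;
    }
    it = pm.find(s->last + 1);
    if (it != pm.end() && it->second->free) {   // merge the right neighbor
        Span* right = it->second;
        s->last = right->last;
        for (std::size_t p = right->first; p <= right->last; ++p) pm[p] = s;
    }
    return s;  // the coalesced span, ready for the page-heap free list
}
```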
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Central Free Lists for Small Objects</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">As mentioned before, we keep a central free list for each size-class. Each central free list is organized as a two-level data structure: a set of spans, and a linked list of free objects per span.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">An object is allocated from a central free list by removing the first entry from the linked list of some span. (If all spans have empty linked lists, a suitably sized span is first allocated from the central page heap.)</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">An object is returned to a central free list by adding it to the linked list of its containing span. If the linked list length now equals the total number of small objects in the span, this span is now completely free and is returned to the page heap.</p>
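The two-level structure and the "completely free" check can be sketched as below. `CentralSpan`, `AllocObject`, and `ReturnObject` are invented for this illustration (a `std::vector` stands in for the per-span linked list of free objects, since ordering within a free list does not matter):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of a central free list: a span plus the list of free objects
// carved out of it.
struct CentralSpan {
    std::size_t total_objects;     // small objects the span was split into
    std::vector<void*> free_objs;  // stand-in for the per-span linked list
};

// Allocating removes one entry from some span's list. (If every span's list
// is empty, a suitably sized span would first be fetched from the page heap
// -- omitted here.)
void* AllocObject(CentralSpan& span) {
    if (span.free_objs.empty()) return nullptr;  // would refill from page heap
    void* obj = span.free_objs.back();
    span.free_objs.pop_back();
    return obj;
}

// Returning an object adds it to its span's list; if every object of the
// span is now free, the span goes back to the page heap (reported here by
// the return value).
bool ReturnObject(CentralSpan& span, void* obj) {
    span.free_objs.push_back(obj);
    return span.free_objs.size() == span.total_objects;  // completely free?
}
```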
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Garbage Collection of Thread Caches</h2><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">A thread cache is garbage collected when the combined size of all objects in the cache exceeds 2MB. The garbage collection threshold is automatically decreased as the number of threads increases so that we don't waste an inordinate amount of memory in a program with lots of threads.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">We walk over all free lists in the cache and move some number of objects from the free list to the corresponding central list.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">The number of objects to be moved from a free list is determined using a per-list low-water-mark<span class="Apple-converted-space">&nbsp;</span><code>L</code>.<span class="Apple-converted-space">&nbsp;</span><code>L</code><span class="Apple-converted-space">&nbsp;</span>records the minimum length of the list since the last garbage collection. Note that we could have shortened the list by<span class="Apple-converted-space">&nbsp;</span><code>L</code><span class="Apple-converted-space">&nbsp;</span>objects at the last garbage collection without requiring any extra accesses to the central list. We use this past history as a predictor of future accesses and move<span class="Apple-converted-space">&nbsp;</span><code>L/2</code><span class="Apple-converted-space">&nbsp;</span>objects from the thread cache free list to the corresponding central free list. This algorithm has the nice property that if a thread stops using a particular size, all objects of that size will quickly move from the thread cache to the central free list where they can be used by other threads.</p>
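The low-water-mark heuristic can be sketched as follows. This is a simplified model (only list lengths are tracked, and the names are invented), but it follows the rule in the text: at garbage-collection time, move `L/2` objects, where `L` is the minimum list length seen since the last collection:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// One thread-cache free list, modelled only by its length.
struct ThreadFreeList {
    std::size_t length = 0;     // current number of free objects
    std::size_t low_water = 0;  // minimum length since the last GC

    void Push() { ++length; }   // thread frees an object of this size

    bool Pop() {                // thread allocates an object of this size
        if (length == 0) return false;
        --length;
        low_water = std::min(low_water, length);
        return true;
    }

    // At GC time, move low_water/2 objects to the central free list: that
    // many could have been given back last time without costing this thread
    // any extra trips to the central list.
    std::size_t GarbageCollect() {
        std::size_t moved = low_water / 2;
        length -= moved;
        low_water = length;  // restart tracking for the next GC cycle
        return moved;        // number of objects handed to the central list
    }
};
```

If a thread stops using a size class entirely, `low_water` stays equal to the full list length, so repeated collections halve the list until it drains -- the property described above.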
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Performance Notes</h2>
<h3 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">PTMalloc2 unittest</h3><span style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; display: inline !important; font: medium Simsun; white-space: normal; orphans: 2; float: none; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">The PTMalloc2 package (now part of glibc) contains a unittest program t-test1.c. This forks a number of threads and performs a series of allocations and deallocations in each thread; the threads do not communicate other than by synchronization in the memory allocator.</span> 
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">t-test1 (included in google-perftools/tests/tcmalloc, and compiled as ptmalloc_unittest1) was run with varying numbers of threads (1-20) and maximum allocation sizes (64 bytes - 32Kbytes). These tests were run on a 2.4GHz dual Xeon system with hyper-threading enabled, using Linux glibc-2.3.2 from RedHat 9, with one million operations per thread in each test. In each case, the test was run once normally, and once with LD_PRELOAD=libtcmalloc.so.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">The graphs below show the performance of TCMalloc vs PTMalloc2 for several different metrics. Firstly, total operations (millions) per elapsed second vs max allocation size, for varying numbers of threads. The raw data used to generate these graphs (the output of the "time" utility) is available in t-test1.times.txt.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">
<table>
<tbody>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.1.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.2.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.3.threads.png"  alt="" /></td></tr>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.4.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.5.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.8.threads.png"  alt="" /></td></tr>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.12.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.16.threads.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspersec.vs.size.20.threads.png"  alt="" /></td></tr></tbody></table></p>
<ul style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px"><li>TCMalloc is much more consistently scalable than PTMalloc2 - for all thread counts &gt;1 it achieves ~7-9 million ops/sec for small allocations, falling to ~2 million ops/sec for larger allocations. The single-thread case is an obvious outlier, since it is only able to keep a single processor busy and hence can achieve fewer ops/sec. PTMalloc2 has a much higher variance on operations/sec - peaking somewhere around 4 million ops/sec for small allocations and falling to &lt;1 million ops/sec for larger allocations.</li><li>TCMalloc is faster than PTMalloc2 in the vast majority of cases, and particularly for small allocations. Contention between threads is less of a problem in TCMalloc.</li><li>TCMalloc's performance drops off as the allocation size increases. This is because the per-thread cache is garbage-collected when it hits a threshold (defaulting to 2MB). With larger allocation sizes, fewer objects can be stored in the cache before it is garbage-collected.</li><li>There is a noticeable drop in the TCMalloc performance at ~32K maximum allocation size; at larger sizes performance drops less quickly. This is due to the 32K maximum size of objects in the per-thread caches; for objects larger than this TCMalloc allocates from the central page heap.</li></ul>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Next, operations (millions) per second of CPU time vs number of threads, for max allocation size 64 bytes - 128 Kbytes.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">
<table>
<tbody>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.64.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.256.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.1024.bytes.png"  alt="" /></td></tr>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.4096.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.8192.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.16384.bytes.png"  alt="" /></td></tr>
<tr>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.32768.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.65536.bytes.png"  alt="" /></td>
<td><img src="http://goog-perftools.sourceforge.net/doc/tcmalloc-opspercpusec.vs.threads.131072.bytes.png"  alt="" /></td></tr></tbody></table></p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Here we see again that TCMalloc is both more consistent and more efficient than PTMalloc2. For max allocation sizes &lt;32K, TCMalloc typically achieves ~2-2.5 million ops per second of CPU time with a large number of threads, whereas PTMalloc achieves generally 0.5-1 million ops per second of CPU time, with a lot of cases achieving much less than this figure. Above 32K max allocation size, TCMalloc drops to 1-1.5 million ops per second of CPU time, and PTMalloc drops almost to zero for large numbers of threads (i.e. with PTMalloc, lots of CPU time is being burned spinning waiting for locks in the heavily multi-threaded case).</p>
<h2 style="line-height: normal; widows: 2; text-transform: none; font-variant: normal; font-style: normal; text-indent: 0px; letter-spacing: normal; font-family: Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Caveats</h2>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">For some systems, TCMalloc may not work correctly with applications that aren't linked against libpthread.so (or the equivalent on your OS). It should work on Linux using glibc 2.3, but other OS/libc combinations have not been tested.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc may be somewhat more memory hungry than other mallocs, though it tends not to have the huge blowups that can happen with other mallocs. In particular, at startup TCMalloc allocates approximately 6 MB of memory. It would be easy to roll a specialized version that trades a little bit of speed for more space efficiency.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">TCMalloc currently does not return any memory to the system.</p>
<p style="widows: 2; text-transform: none; text-indent: 0px; letter-spacing: normal; font: medium Simsun; white-space: normal; orphans: 2; color: rgb(0,0,0); word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px">Don't try to load TCMalloc into a running binary (e.g., using JNI in Java programs). The binary will have allocated some objects using the system malloc, and may try to pass them to TCMalloc for deallocation. TCMalloc will not be able to handle such objects.</p><img src ="http://www.cppblog.com/beautykingdom/aggbug/170077.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-04-04 21:15 <a href="http://www.cppblog.com/beautykingdom/archive/2012/04/04/170077.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>How does the DMA work</title><link>http://www.cppblog.com/beautykingdom/archive/2010/11/14/133602.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 14 Nov 2010 11:23:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/11/14/133602.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/133602.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/11/14/133602.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/133602.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/133602.html</trackback:ping><description><![CDATA[

<div><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">The DMA controller is another pair of chips on your motherboard (usually Intel 8237A-5 chips) that allows you (the programmer) to offload data transfers between I/O boards. DMA stands for 'Direct Memory Access'.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">DMA can transfer memory-&gt;I/O and I/O-&gt;memory. Memory-&gt;memory transfers don't work, but that hardly matters: ISA DMA is so slow it would be useless for that anyway. Furthermore, using DMA for zeroing out memory would trash the contents of the memory caches.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">What about caches and DMA? The L1 and L2 caches work completely transparently. When DMA writes to memory, the caches automatically load or at least invalidate the data going into memory. When DMA reads memory, the caches supply any bytes not yet written back, so the peripheral receives the new values rather than stale ones.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">There are signals DACK&lt;n&gt;, DRQ&lt;n&gt;, and TC. When a peripheral wants to move a byte or 2 bytes into memory (depending on whether an 8-bit or 16-bit DMA channel is in use -- channels 0, 1, 2, 3 are 8-bit; 5, 6, 7 are 16-bit), it raises DRQ. The DMA controller negotiates with the CPU and after some time asserts DACK. Seeing DACK, the peripheral puts its byte on the data bus, and the DMA controller takes it and puts it in memory. If it was the last byte/word to move, the DMA controller also asserts TC during the DACK. 
When the peripheral sees TC, it knows the transfer is complete and may stop requesting further movements.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">In the other direction everything is the same, except that the byte/word is first fetched from memory, then DACK is generated and the peripheral takes the data.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">The DMA controller exposes only 8 address pins of its own. An external ALS573 latch per chip captures the upper byte, so to the programmer each channel appears to have a 16-bit auto-incrementing address counter. A further 8 address bits per channel come from the so-called page register in an LS612, which unfortunately does not increment the way the ALS573 bits do. Together these 24 bits can address 16,777,216 distinct locations.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Recapitulation: for each channel, independently, you see 16 bits of auto-incrementing counter and 8 bits of page register, which does not increment.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">The difference between 16-bit and 8-bit DMA channels is that the address bits of the 16-bit channels are wired one bit to the left on the address bus, so every address is doubled. The lowest bit is 0. The highest bit of the page register would land on address bit 24, which does not exist on ISA, so it is left unconnected. The bus control logic is wired so that every DMA transfer on a 16-bit channel generates a 16-bit cycle, i.e. the ISA device puts 16 bits onto the bus at a time. I don't know what happens if you use a 16-bit DMA channel with an XT peripheral. 
I guess it could work, though probably only slower.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">8-bit DMA: increments by 1, cycles inside 65536 bytes, addresses 16MB, moves 8 bits at a time.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">16-bit DMA: increments by 2, covers only even addresses, cycles inside 131072 bytes, addresses 16MB, moves 16 bits at a time. It uses a 16-bit ISA I/O cycle, so one move takes fewer ticks than with 8-bit DMA.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">An example of DMA usage is the Sound Blaster's ability to play samples in the background. The CPU sets up the sound card and the DMA; when the DMA is told to 'go', it simply shovels the data from RAM to the card. Since this happens off-CPU, the CPU can do other things while the data is being transferred.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Enough basics. Here's how you program the DMA chip.</p><hr style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; "><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">When you want to start a DMA transfer, you need to know several things:</p><ul style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 10px; list-style-type: none; list-style-position: initial; list-style-image: initial; font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; "><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">The number of the DMA channel you want to use</li><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">What page to use</li><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; 
margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">The offset in the page</li><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">The length</li><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">How to tell your peripheral to ask for DMA</li></ul><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; "></p><ul style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 10px; list-style-type: none; list-style-position: initial; list-style-image: initial; font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; "><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">You cannot transfer more than 64K or 128K of data in one shot, and</li><li style="list-style-type: none; list-style-position: initial; list-style-image: initial; margin-top: 0px; margin-right: 0px; margin-bottom: 20px; margin-left: 0px; ">You cannot cross a page boundary. If you cross it, the lower 16 or 17 bits of the address simply wrap, and you suddenly jump 65536 or 131072 bytes lower than where you expected. The hardware handles this without any corruption, and if you take it into account in your program you can even exploit it.</li></ul><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Restriction #1 is rather easy to get around. 
Simply transfer the first block, and when the transfer is done, send the next block.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">For those of you not familiar with pages, I'll try to explain.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Picture the first 16MB region of memory in your system. It is divided into 256 pages of 64K or 128 pages of 128K. Every page starts at a multiple of 65536 or 131072, and the pages are numbered from 0 to 255 or from 0 to 127.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">In plain English, the page is the highest 8 (or 7) bits of the absolute 24-bit address of our memory location, and the offset is the lower 16 (or 17) bits of that address.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Now that we know where our data is, we need to determine the length.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">The DMA has a little quirk regarding length: the controller always transfers one more unit than the count you program, so the true length is count + 1. If you send a length of zero, it actually transfers one byte or word, whereas if you send 0xFFFF, it transfers 64K or 128K. Presumably it was designed this way because programming the DMA to do nothing (a length of zero) would be pointless, and this way a full 64K or 128K span of data can be transferred.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">Now that you know what to send to the DMA, how do you actually start it? 
This brings us to the different DMA channels.</p><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">The following chart lists each channel and its corresponding port numbers:</p><table style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; "><tbody><tr><td>DMA Channel</td><td>Page</td><td>Address</td><td>Count</td></tr><tr><td>0</td><td>87h</td><td>0h</td><td>1h</td></tr><tr><td>1</td><td>83h</td><td>2h</td><td>3h</td></tr><tr><td>2</td><td>81h</td><td>4h</td><td>5h</td></tr><tr><td>3</td><td>82h</td><td>6h</td><td>7h</td></tr><tr><td>4</td><td>8Fh</td><td>C0h</td><td>C2h</td></tr><tr><td>5</td><td>8Bh</td><td>C4h</td><td>C6h</td></tr><tr><td>6</td><td>89h</td><td>C8h</td><td>CAh</td></tr><tr><td>7</td><td>8Ah</td><td>CCh</td><td>CEh</td></tr></tbody></table><p style="font-family: arial, verdana, tahoma, sans-serif; font-size: 16px; ">DMA 4 doesn't exist as a usable channel: it is used to cascade the two 8237A chips. When the first 8237A wants to do a DMA transfer, it raises HRQ, which is wired to the second chip's DRQ 4. The second chip thinks a DMA 4 transfer is being requested, and its acknowledgement is wired back to the first chip's HLDA. The first chip performs its own DMA 0-3 transfer, then tells the second chip "OK second chip, my DMA 4 is complete", so the second chip knows the bus is free again. If this mechanism did not work, the two chips could fight over the bus and the PC would lock up. 
:+)</p></div><div>from:</div><div><a href="http://www.osdever.net/papers/view/how-does-the-dma-work">http://www.osdever.net/papers/view/how-does-the-dma-work</a></div><img src ="http://www.cppblog.com/beautykingdom/aggbug/133602.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-11-14 19:23 <a href="http://www.cppblog.com/beautykingdom/archive/2010/11/14/133602.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>On Asynchronous Read/Write and Communication</title><link>http://www.cppblog.com/beautykingdom/archive/2010/09/06/126028.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 06 Sep 2010 09:33:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/09/06/126028.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/126028.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/09/06/126028.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/126028.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/126028.html</trackback:ping><description><![CDATA[<div><span class="Apple-tab-span" style="white-space:pre">	</span>Introduction</div><div><br></div><div>Generally speaking, a simple asynchronous call works like this: the initiator issues an asynchronous request, notifies the executor, goes on to other work, and waits at some synchronization point for the executor to finish; the executor performs the actual operation and notifies the initiator on completion. There are thus two roles in an asynchronous call, initiator and executor. Both are objects capable of running on their own, which we will call active objects, and there is a synchronization point at which the active objects coordinate. This article discusses asynchronous calls on general-purpose computers running multi-process, multi-threaded time-sharing operating systems. From the operating system's point of view, active objects include processes, threads, and ICs on the hardware; an interrupt can be viewed as briefly borrowing the CPU in the context of whatever process or thread happens to be running. Synchronization can be accomplished with the OS's various primitives: mutexes, semaphores, and so on.</div><div><br></div><div>Let us first look at asynchronous calls in Windows file I/O (throughout this article, "Windows" means NT/2000 unless noted otherwise). Both ReadFile and WriteFile offer an asynchronous interface. Taking ReadFile as an example:</div><div><br></div><div>BOOL ReadFile(HANDLE hFile, LPVOID lpBuffer, DWORD nNumberOfBytesToRead, LPDWORD lpNumberOfBytesRead, LPOVERLAPPED lpOverlapped);</div><div><br></div><div>If the last parameter lpOverlapped is not NULL and the file was opened with the FILE_FLAG_OVERLAPPED flag, the call is asynchronous: ReadFile returns immediately. If the operation did not complete immediately (the call returns FALSE and GetLastError() returns ERROR_IO_PENDING), the caller can later wait on the hEvent in the OVERLAPPED structure, for example with WaitForSingleObject, to synchronize with completion (which may already have happened), and then call GetOverlappedResult to obtain the result of the operation: whether it succeeded, how many bytes were read, and so on. Here the initiator is the application and the executor is the operating system itself; how the executor actually works is discussed later. The two synchronize through a Windows Event.</div><div><br></div><div>Abstracting and generalizing this a little, an asynchronous call must solve two problems: the driving force behind execution, and the scheduling of the active objects. Simply put, the former is how each active object (a thread, a process, or just a piece of code) gets the CPU; the latter is how the active objects cooperate so that the overall flow of the operation remains correct. Processes and threads can generally be scheduled directly by the operating system to obtain the CPU, but finer-grained units such as a piece of code often require a more elaborate model (for example inside the operating system itself, where a thread is too coarse a unit). As for scheduling, when few participants are involved the basic synchronization primitives suffice; in more complex situations, a dedicated scheduling mechanism is usually more practical.</div><div><br></div><div>Driving Force and Scheduling</div><div><br></div><div>As noted above, an asynchronous call mainly has to solve two problems: the driving force behind execution and its scheduling. The most common arrangement is one caller process (or thread) driving the flow plus one or more worker processes (threads), synchronized through the primitives the operating system provides. Generalized, this synchronization consists of one or more barriers, one per synchronization point: every active object that must synchronize at a given point waits at the corresponding barrier until all of them have arrived. In simplified cases, for example when the workers do not care about the caller's progress, the barrier degenerates into a semaphore, and with a single worker it degenerates further into a Windows Event or a condition variable.</div><div><br></div><div>Now consider a more complex case. Suppose several threads cooperate on a job with ordering constraints between their executions, and the operating system is the scheduler of this job, responsible for running the right thread at the right time. A concurrently running thread is inherently asynchronous with respect to another, and if one calls into the other, that call is an asynchronous call. The OS, via basic synchronization primitives, ensures that only eligible threads are scheduled while unfinished ones wait. For example, suppose four threads A, B, C, D complete a job under the constraints A&gt;B; C&gt;D, where "&gt;" means the thread on the left must finish before the one on the right runs, and ";" means the two sides may proceed in parallel. Suppose further that one of B's operations requires calling C to complete. That operation is clearly an asynchronous call. We can place a synchronization point at each "&gt;" and implement it with a semaphore: threads B and C wait on the first semaphore, and D waits on the second. In this example both the driving force and the scheduling come from the operating system's basic mechanisms (thread scheduling and synchronization primitives).</div><div><br></div><div>Abstracting this process: a number of active objects (including plain code) cooperate to complete a job, coordinated by a scheduler, which in practice may be nothing more than a set of scheduling rules. A process or thread gets the CPU whenever it is scheduled, so the interesting question is how a piece of code (say, a function) gets to execute. Having a worker thread call the function is the obvious and general answer, and in user space (user mode) it is the usual approach. In kernel mode, code can also gain the CPU through an interrupt, which takes nothing more than registering an IDT entry and raising a software interrupt. ICs on hardware devices are another source of driving force. For scheduling, the most basic option is again the synchronization primitives above. Another common mechanism is the callback function. Note that a callback generally runs in a context different from the caller's, for example a different thread of the same process, which brings some restrictions; if the callback must run in the caller's process (thread) context, something like Unix signals or the Windows APC mechanism is needed, as elaborated below. What does a callback typically do? Most commonly, combined with a synchronization primitive, it releases a mutex, a semaphore, a Windows Event (or Unix condition variable), and so on, so that the objects waiting on it can be scheduled and resume execution; in effect, it notifies the scheduler (the operating system) that certain waiting active objects are runnable again, and the scheduler reschedules. Some schedulers, however, need no synchronization objects in this process at all, and in some extreme examples the scheduling is not even required to be strictly ordered.</div><div><br></div><div>In practice, depending on the constraints of the environment, the driving force and scheduling of an asynchronous call can be implemented in very different ways, as the following examples illustrate.</div><div><br></div><div>Asynchrony in the Operating System: Windows Asynchronous I/O</div><div><br></div><div>Windows NT/2000 is a preemptive time-sharing operating system whose scheduling unit is the thread. Its I/O architecture is fully asynchronous; synchronous I/O is actually implemented on top of asynchronous I/O. When a user-mode thread requests an I/O, it transitions from user mode to kernel mode (the operating system maps the kernel into the 2GB-4GB range of every process's address space, identically for every process). The transition happens via an interrupt into one of the System Services exported by the kernel; ReadFile, for instance, ends up executing NtReadFile (ZwReadFile). Note that the execution context is still the current thread. NtReadFile is implemented on top of the kernel's asynchronous I/O framework, with the help of the I/O Manager. It is worth pointing out that the I/O Manager is an abstraction made up of a set of APIs; there is no actual I/O Manager thread running.</div><div><br></div><div>Windows I/O drivers are layered in a stack. Each driver exposes uniform entry points for initialization, cleanup, and functional calls. Drivers are invoked through I/O Request Packets (IRPs), rather than passing parameters on a stack the way ordinary function calls do. The operating system and the PnP manager initialize and tear down the corresponding drivers at the appropriate times based on the registry. For an ordinary functional call, the IRP carries the function code along with the context and parameters (the I/O stack location). A driver may call another driver, either synchronously (the thread context does not change) or asynchronously. NtReadFile is implemented, roughly, by sending one or more IRPs to the topmost driver and then either waiting for the corresponding completion event (the synchronous case) or returning directly (the Overlapped case); all of this executes in the requesting thread.</div><div><br></div><div>When a driver processes an IRP, it may complete it immediately, or only in an interrupt: it may, for instance, issue a request to the hardware (typically by writing an I/O port), and when the device finishes the operation it raises an interrupt, in whose handler the result is obtained. Windows has two classes of interrupts, hardware-device interrupts and software interrupts, divided into a number of priority levels (IRQLs). There are two main software interrupts, DPC (Deferred Procedure Call) and APC (Asynchronous Procedure Call), both at low IRQLs. A driver registers an ISR (Interrupt Service Routine) for a hardware interrupt, which essentially means patching the entry of some IDT slot; likewise, the operating system registers appropriate handlers (also in the IDT) for DPCs and APCs.</div><div><br></div><div>It is worth noting that DPCs are per-processor -- each processor has a DPC queue -- while APCs are per-thread: each thread has its own APC queues (actually a kernel APC queue and a user APC queue, with different delivery policies). As one might guess, an APC is not an interrupt in the strict sense, since an interrupt could occur in any thread's context; it is called one mainly because of the IRQL raise (from PASSIVE to APC level), and APC delivery generally happens at points such as thread switches. When an interrupt fires, the operating system calls the registered handler. A hardware-device ISR typically masks the device's interrupt, queues a DPC request, and returns; the guideline is not to burn much CPU time inside a device ISR, mainly because other interrupts might otherwise be lost. Since a hardware device's IRQL is higher than the DPC IRQL, the DPC is held off while the ISR runs; only when the ISR returns and the IRQL drops does the DPC interrupt fire, and in the DPC the driver reads data from the device, re-issues requests, re-enables the interrupt, and so on. An ISR or DPC may execute in whichever thread happened to be interrupted (arbitrary thread context); effectively the thread's context is invisible, and the system merely borrows a slice of its time.</div><div><br></div><div>In summary, Windows' asynchronous I/O architecture has two main driving forces: the requesting thread, in whose context part of the kernel code executes, and the ISRs and DPCs, in which the rest of the kernel code runs inside interrupts, possibly using any thread's context. For scheduling, callbacks and events (KEVENT) are commonly used: when sending a request down to the next driver, you can specify a completion routine, which the lower driver invokes when it completes the request, and which very often does nothing more than signal an event. Linux deserves a brief mention here. Linux 2.6 has a similar interrupt mechanism with more software-interrupt priorities, namely softirqs of different priorities, and the analogue of the DPC is the tasklet. Linux lacks a layered driver architecture as uniform as Windows', so its asynchronous I/O is somewhat cruder: at points that previously blocked, the code now returns -EIOCBRETRY directly, and the caller retries at a suitable later time. In this approach the whole operation can be viewed as a single function that is re-run from the top whenever the operation makes progress; the parts already completed perform no further actual I/O. The great advantage is that existing filesystems and drivers need not be completely rewritten, and a synchronous call can simply block, keeping the changes to the system small. To provide POSIX aio semantics on top of this, some user threads may be needed to drive the retries (recall that Windows can accomplish this with interrupts and DPCs). Solaris handles it similarly: if the device supports asynchronous I/O it is completed via interrupts, otherwise internal LWPs simulate it.</div><div><br></div><div>An Application: Designing an Asynchronous HTTP Server</div><div><br></div><div>Suppose we want to design an HTTP server with the following goals: high concurrency, a small footprint (partial HTTP/1.1 support), and a plug-in architecture -- requirements that arise in quite a few settings. Broadly speaking, an HTTP server is analogous to a multi-threaded operating system: the OS schedules each worker thread to run at the appropriate time, and the worker threads provide the service (namely, processing HTTP requests). On this basis the main consideration is the scheduling granularity: too coarse and concurrency suffers, too fine and task switching (think of OS context switching) erodes efficiency, so it is a trade-off. Like Apache (and other HTTP servers), we can divide HTTP processing into a number of states and build an HTTP processing state machine from them; each state's processing then becomes the unit of scheduling. One scheduling step is: a worker thread takes an HTTP_Context structure off a global task queue; performs the processing for the current state; sets the next state according to the state machine; and puts the context back on the global task queue. In this way a sequence of HTTP states composes, under this scheduling policy, into a complete HTTP processing flow, and evidently each state's hand-off to the next state's processing can be regarded as asynchronous. A design for an HTTP state machine is shown in the figure below.</div><div><img src="http://www.cppblog.com/images/cppblog_com/beautykingdom/gr4o3x9474bo.jpg" id="" width="512" height="398" vspace="0" hspace="0" border="" align="baseline" alt="" longdesc=""><br></div><div>Figure 1. The HTTP state machine</div><div><br></div><div>The worker thread function is just two operations in a loop: take an HTTP_Context off the state queue and call its service() function, over and over. On this skeleton it is easy to introduce asynchronous I/O and the plug-in mechanism. In fact, we can also emulate asynchronous I/O with an event-based (e.g. select/poll) I/O strategy; a single user thread suffices for the implementation.</div><div><br></div><div>For asynchronous I/O and plug-in calls we likewise adopt a retry scheme similar to the aio one in Linux 2.6, with a callback on asynchronous completion. In some state, if the system needs an I/O operation (recv or send), it requests an asynchronous I/O (either the operating system's asynchronous I/O or one emulated by the user thread); the corresponding HTTP_Context does not return to the state queue immediately, but only in the I/O completion callback, where it is put back on the state queue and regains the chance to be scheduled. When the HTTP_Context is rescheduled, it checks the I/O status (a few flag bits suffice): if the I/O is complete, it processes the result, sets the next state, and reschedules; otherwise it may issue a fresh I/O request. Plug-ins can use the same scheme; for example, a plug-in that communicates with an external server puts the HTTP_Context back on the state queue only when that communication completes. Clearly, plug-ins and HTTP states are in a many-to-many relationship: a plug-in can register itself at each of the states it cares about, and can also set up some short-paths to improve processing efficiency.</div><div><br></div><div>Conclusion</div><div><br></div><div>All in all, the design and application of asynchronous calls comes down to managing multiple active objects: how to supply the driving force for execution and how to guarantee the ordering logic of execution. The main considerations are the granularity of the active objects and the execution style -- synchronization or callbacks to accomplish ordered scheduling, or approximate scheduling plus some robust error handling to preserve correct semantics. As an example of the latter, consider event-based sockets: the readable-event notifications may be redundant, that is, more numerous than the readable events that actually occur. Using non-blocking sockets, some read() (or recv()) calls will then simply return EWOULDBLOCK, and the system only has to handle that case (by using non-blocking rather than blocking sockets); when such exceptions are infrequent this is perfectly acceptable, and the event reporting can then be said to be merely approximate.</div><div>from:</div><div><a href="http://hi.baidu.com/hytjfxk/blog/item/d9262cdfcb298c14632798b3.html">http://hi.baidu.com/hytjfxk/blog/item/d9262cdfcb298c14632798b3.html</a></div><div><br></div><img src ="http://www.cppblog.com/beautykingdom/aggbug/126028.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-09-06 17:33 <a href="http://www.cppblog.com/beautykingdom/archive/2010/09/06/126028.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>semaphore and spinlock</title><link>http://www.cppblog.com/beautykingdom/archive/2010/04/01/111257.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 01 Apr 2010 03:50:00
GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/04/01/111257.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/111257.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/04/01/111257.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/111257.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/111257.html</trackback:ping><description><![CDATA[<font color="#000000" size="4">内核同步措施<br>
<br>
&nbsp;&nbsp;&nbsp;      为了在并发执行时避免竞争条件，内核提供了一组同步方法来保护共享数据。
我们的重点不是介绍这些方法的详细用法，而是强调为什么使用这些方法和它们之间的差别。<br>
&nbsp;&nbsp;&nbsp; Linux
使用的同步机制可以说从2.0到2.6以来不断发展完善。从最初的原子操作，到后来的信号量，从大内核锁到今天的自旋锁。这些同步机制的发展伴随
Linux从单处理器到对称多处理器的过渡；伴随着从非抢占内核到抢占内核的过渡。锁机制越来越有效，也越来越复杂。<br>
&nbsp;&nbsp;&nbsp;
目前来说内核中原子操作多用来做计数使用，其它情况最常用的是两种锁以及它们的变种:一个是自旋锁，另一个是信号量。我们下面就来着重介绍一下这两种锁机
制。<br>
<br>
<br>
自旋锁<br>
------------------------------------------------------<br>
<font color="#800000"><strong>&nbsp;&nbsp;&nbsp;
自旋锁是专为防止多处理器并发而引入的一种锁，它在内核中大量应用于中断处理等部分(对于单处理器来说，防止中断处理中的并发可简单采用关闭中断的方式，
不需要自旋锁)。<br>
&nbsp;&nbsp;&nbsp;
自旋锁最多只能被一个内核任务持有，如果一个内核任务试图请求一个已被争用(已经被持有)的自旋锁，那么这个任务就会一直进行忙循环——旋转——等待锁重
新可用。</strong></font>要是锁未被争用，请求它的内核任务便能立刻得到它并且继续进行。自旋锁可以在任何时刻防止多于一个的内核任务同
时进入临界区，因此这种锁可有效地避免多处理器上并发运行的内核任务竞争共享资源。<br>
&nbsp;&nbsp;&nbsp;
事实上，自旋锁的初衷就是：在短期间内进行轻量级的锁定。一个被争用的自旋锁使得请求它的线程在等待锁重新可用的期间进行自旋(特别浪费处理器时间)，所
以自旋锁不应该被持有时间过长。如果需要长时间锁定的话, 最好使用信号量。<br>
自旋锁的基本形式如下：<br>
&nbsp;&nbsp;&nbsp;      spin_lock(&amp;mr_lock);<br>
&nbsp;&nbsp;&nbsp;      //临界区<br>
&nbsp;&nbsp;&nbsp;      spin_unlock(&amp;mr_lock);<br>
<br>
&nbsp;&nbsp;&nbsp;
因为自旋锁在同一时刻最多只能被一个内核任务持有，所以同一时刻只允许一个线程存在于临界区中。这一点很好地满足了对称多处
理器需要的锁定服务。在单处理器上，自旋锁仅仅被当作一个关闭内核抢占的开关；如果内核不支持抢占，那么自旋锁会在编译时被完全剔除出内核。<br>
&nbsp;&nbsp;&nbsp;      简单的说，<font color="#800000"><strong>自旋锁在内核中主要用来防止多处理器中并发访问临界区，防止
内核抢占造成的竞争。</strong></font>另外自旋锁不允许任务睡眠(持有自旋锁的任务睡眠会造成自死锁——因为睡眠有可能造成持有锁的内核
任务被重新调度，而再次申请自己已持有的锁)，<font color="#800000"><strong>它能够在中断上下文中使用</strong></font>。<br>
&nbsp;&nbsp;&nbsp;
死锁：假设有一个或多个内核任务和一个或多个资源，每个内核任务都在等待其中的一个资源，但所有的资源都已经被占用了。于是所有内核任务都在相互等待，
但它们永远不会释放已经占有的资源，任何内核任务都无法获得所需要的资源、无法继续运行，这便意味着死锁发生了。自死锁是说自己占有了某个资源，然后
自己又申请自己已占有的资源，显然不可能再获得该资源，因此就自缚手脚了。<br>
<br>
<br>
信号量<br>
------------------------------------------------------<br>
&nbsp;&nbsp;&nbsp;
Linux中的信号量是一种睡眠锁。如果有一个任务试图获得一个已被持有的信号量时，信号量会将其推入等待队列，然后让其睡眠。这时处理器获得自由去执行
其它代码。当持有信号量的进程将信号量释放后，在等待队列中的一个任务将被唤醒，从而便可以获得这个信号量。<br>
&nbsp;&nbsp;&nbsp;
信号量的睡眠特性，使得信号量适用于锁会被长时间持有的情况；它只能在进程上下文中使用，因为中断上下文是不允许睡眠、不可以被调度的；另外，当代码持有自旋锁时，不可以
再去获取信号量，因为获取信号量的操作可能导致睡眠。<br>
<br>
信号量基本使用形式为：<br>
static DECLARE_MUTEX(mr_sem);//声明互斥信号量<br>
if(down_interruptible(&amp;mr_sem))<br>
&nbsp;&nbsp;&nbsp;      //可被中断的睡眠，当信号来到，睡眠的任务被唤醒 <br>
&nbsp;&nbsp;&nbsp;      //临界区<br>
up(&amp;mr_sem);<br>
<br>
<br>
信号量和自旋锁区别<br>
------------------------------------------------------<br>
&nbsp;&nbsp;&nbsp;      虽然听起来两者之间的使用条件复杂，其实在实际使用中信号量和自旋锁并不易混淆。注意以下原则:<br>
&nbsp;&nbsp;&nbsp;
如果代码需要睡眠——这往往是发生在和用户空间同步时——使用信号量是唯一的选择。由于不受睡眠的限制，使用信号量通常来说更加简单一些。如果需要在自旋
锁和信号量中作选择，应该取决于锁被持有的时间长短。理想情况是所有的锁都应该尽可能短的被持有，但是如果锁的持有时间较长的话，使用信号量是更好的选
择。另外，信号量不同于自旋锁，它不会关闭内核抢占，所以持有信号量的代码可以被抢占。这意味着信号量不会对调度响应时间带来负面影响。<br>
<br>
<br>
自旋锁对信号量<br>
------------------------------------------------------<br>
需求 &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      建议的加锁方法<br>
<br>
低开销加锁 &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;        优先使用自旋锁<br>
短期锁定 &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      优先使用自旋锁<br>
长期加锁 &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;      优先使用信号量<br>
中断上下文中加锁 &nbsp;&nbsp;&nbsp;      &nbsp;&nbsp;&nbsp;&nbsp;     使用自旋锁<br>
持有锁是需要睡眠、调度&nbsp;&nbsp;&nbsp;&nbsp;   使用信号量<br><br>from:<br>http://blog.chinaunix.net/u1/38576/showart_367985.html<br><br></font><img src ="http://www.cppblog.com/beautykingdom/aggbug/111257.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-04-01 11:50 <a href="http://www.cppblog.com/beautykingdom/archive/2010/04/01/111257.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>An In-Depth Look into the Win32 Portable Executable File Format</title><link>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110504.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 25 Mar 2010 07:17:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110504.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/110504.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110504.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/110504.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/110504.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: from:http://msdn.microsoft.com/zh-cn/magazine/cc301808%28en-us%29.aspxDownload the code for this article: PE.exe(98KB)part1:SUMMARYA good understanding of the Portable Executable (PE) fi...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2010/03/25/110504.html'>阅读全文</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/110504.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-03-25 15:17 <a href="http://www.cppblog.com/beautykingdom/archive/2010/03/25/110504.html#Feedback" target="_blank" 
style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>IA-32 保护模式内存管理</title><link>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110503.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 25 Mar 2010 07:03:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110503.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/110503.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/03/25/110503.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/110503.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/110503.html</trackback:ping><description><![CDATA[<h2>  CHAPTER 3
PROTECTED-MODE MEMORY MANAGEMENT</h2>
<h3> 3.1. MEMORY MANAGEMENT OVERVIEW</h3>
<p>  The memory management facilities of the IA-32 architecture are
divided into two parts: segmentation and paging. Segmentation provides a
mechanism of isolating individual code, data, and stack modules so that
multiple programs (or tasks) can run on the same processor without
interfering with one another. Paging provides a mechanism for
implementing a conventional demand-paged, virtual-memory system where
sections of a program&#8217;s execution environment are mapped into physical
memory as needed. Paging can also be used to provide isolation
between multiple tasks. When operating in protected mode, some form of
segmentation must be used. There is no mode bit to disable
segmentation. The use of paging, however, is optional.</p>
<p class="translation">
IA-32架构的内存管理机构（facilities）可划分为两个部分：分段（segmentation）和分页（paging）。分段功能提供了分隔
代码、数据和堆栈的机制，从而使多个进程运行在同一个CPU物理地址空间内而互不影响；分页可用来实现一种&#8220;请求页式（demand-paged）&#8221;的虚
拟内存机制，从而页化程序执行环境，在程序运行时可将所需要的页映射到物理内存。分页机制也可用作隔离多进程任务。分段功能是CPU保护模式必须的，没有
设置位可以屏蔽内存分段；不过内存分页则是可选的。</p>
<p> These two mechanisms (segmentation and paging) can be configured to
support simple single-program (or single-task) systems, multitasking
systems, or multiple-processor systems that used shared memory.</p>
<p> As shown in Figure 3-1, segmentation provides a mechanism for
dividing the processor&#8217;s addressable memory space (called the linear
address space) into smaller protected address spaces called segments.
Segments can be used to hold the code, data, and stack for a program or
to hold system data structures (such as a TSS or LDT). If more than one
program (or task) is running on a processor, each program can be
assigned its own set of segments. The processor then enforces the
boundaries between these segments and insures that one program does not
interfere with the execution of another program by writing into the
other program&#8217;s segments.</p>
<p> <span class="translation">这两种机制（分段和分页）可以被配置成支持简单的单程序（单任务）系统、多任务系统，或使用共享内存的多处理器系统。</span>
</p>
<p class="translation">
如图3-1，内存分段将CPU的可寻址空间（称为线性地址空间）划分为更小的受保护的地址空间，即内存段。段可以用来存放程序的代码、数据和堆栈，也可以存放系统的数据结构
（像TSS或LDT）。如果处理器上运行着多个程序（或任务），那么每个程序都可以分配一组自己独立的段。处理器会强制维护这些段之间的边界，确保一个程序不会因写入其他程序的段而干扰其执行。</p>
<p> The segmentation mechanism also allows typing of segments so that
the operations that may be performed on a particular type of segment can
be restricted.</p>
<p> All the segments in a system are contained in the processor&#8217;s linear
address space. To locate a byte in a particular segment, a logical
address (also called a far pointer) must be provided. A logical address
consists of a segment selector and an offset. The segment selector is a
unique identifier for a segment. Among other things it provides an
offset into a descriptor table (such as the global descriptor table,
GDT) to a data structure called a segment descriptor. Each segment has a
segment descriptor, which specifies the size of the segment, the access
rights and
privilege level for the segment, the segment type, and the location of
the first byte of the segment in the linear address space (called the
base address of the segment). The offset part of the logical address is
added to the base address for the segment to locate a byte within the
segment. The base address plus the offset thus forms a linear address in
the processor&#8217;s linear address space.</p>
<p> <span class="translation">进程的各个段都必须位于CPU的线性空间之内，进程要访问某段的一个字节，必须给出该字节
的逻辑地址（也叫远指针）。逻辑地址由段选择子（segment selector
）和偏移值组成。段选择子是段的唯一标识，指向一个叫段描述符的数据结构；段描述符位于一个叫描述表之内（如全局描述表GDT）;
每个段必须都有相应的段描述符，用以指定段大小、访问权限和段的特权级别（privilege
level）、段类型和段的首地址在线性地址空间的位置（叫段的基地址）。段的基地址加上逻辑地址中的偏移部分，便得到该字节在处理器线性地址空间中的线性地址。</span>
</p>
<p><img  src="http://album.hi.csdn.net/app_uploads/keminlau/20081017/104727378.p.gif" alt="" align="middle">
</p>
<p>If paging is not used, the linear address space of the processor is
mapped directly into the physical address space of processor. The
physical address space is defined as the range of addresses that the
processor can generate on its address bus.</p>
<p> Because multitasking computing systems commonly define a linear
address space much larger than it is economically feasible to contain
all at once in physical memory, some method of &#8220;virtualizing&#8221; the linear
address space is needed. This virtualization of the linear address
space
is handled through the processor&#8217;s paging mechanism.</p>
<p class="translation">
如果不使用分页功能，处理器的线性地址空间就会直接映射到物理地址空间。物理地址空间的大小就是处理器能通过地址总线产生的地址范围。由于多任务系统通常定义的线性地址空间
远大于物理内存一次所能经济地容纳的范围，因此需要某种对线性地址空间进行&#8220;虚拟化（virtualizing）&#8221;的方法，CPU的分页机
制实现了这种虚拟化。</p>
<p> Paging supports a &#8220;virtual memory&#8221; environment where a large linear
address space is simulated with a small amount of physical memory (RAM
and ROM) and some disk storage. When using paging, each segment is
divided into pages (typically 4 KBytes each in size), which are stored
either in physical memory or on the disk. The operating system or
executive maintains a page directory and a set of page tables to keep
track of the pages. When a program (or task) attempts to access an
address location in the linear address space, the processor uses the
page directory and page tables to translate the linear address into a
physical address and then performs the requested operation (read or
write) on the memory location. If the page being accessed is not
currently in physical memory, the processor interrupts execution of the
program (by generating a page-fault exception). The operating system or
executive then reads the page into physical memory from the disk and
continues executing the program. </p>
<p class="translation">
&#8220;虚拟内存&#8221;就是利用物理内存和磁盘来对CPU的线性地址进行模拟（kemin:高级语言源码指定的是符号地址，是虚的，有了虚拟内存即便是用汇编指定一
固定地址也是虚的。问题是这些虚存是怎么管理的）。当使用分页时，进程的每个段都会被分成大小固定的页（通常每页4KB），这些页可能在内存中，也可能在磁盘。操作系统用了
一张页目录（page
directory）和多张页表来管理这些页。当进程试图访问线性地址空间的某个位置，处理器会通过页目录和页表先将线性地址转换成物理地址，然后再访问
（读或写）（kemin：转换细节没有讲）。如果被访问的页当前不在内存，处理就会中断进程的运行（通过产生缺页异常中断）（kemin:怎么判断某页不
在内存？）。操作系统负责从磁盘读入该页并继续执行该进程（kemin:页读入的前前后后没有讲）。</p>
<p> When paging is implemented properly in the operating-system or
executive, the swapping of pages between physical memory and the disk is
transparent to the correct execution of a program. Even programs
written for 16-bit IA-32 processors can be paged (transparently) when
they are run in virtual-8086 mode.</p>
<p><br></p>
<p>from:</p>
<p>http://blog.csdn.net/keminlau/archive/2008/10/19/3090337.aspx<br></p><img src ="http://www.cppblog.com/beautykingdom/aggbug/110503.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-03-25 15:03 <a href="http://www.cppblog.com/beautykingdom/archive/2010/03/25/110503.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Context Switch Definition</title><link>http://www.cppblog.com/beautykingdom/archive/2010/02/25/108457.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 25 Feb 2010 15:09:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/02/25/108457.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/108457.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/02/25/108457.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/108457.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/108457.html</trackback:ping><description><![CDATA[<p><font size="3">A <em> context switch</em>  (also sometimes referred to as a <em> process  switch</em>  or a <em> task switch</em> ) is the switching of the <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/cpu.html">CPU</a> (central processing unit) from one <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/process.html"><em> process</em> </a> or <em> thread</em>  to another. </font></p>
<p><font size="3">A process (also sometimes referred to as a <em> task</em> ) is an  <em> executing</em>  (i.e., running) <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/instance.html">instance</a> of a <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/program.html">program</a>. In <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/linuxdef.html">Linux</a>, threads  are lightweight processes that can run in parallel and share an <em> address  space</em>  (i.e., a range of <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/memory.html">memory</a> locations) and  other resources with their <em> parent</em>  processes (i.e., the processes that  created them). </font></p>
<p><font size="3">A <em> context</em>  is the contents of a CPU's <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/register.html"><em> registers</em> </a> and <em> program counter</em>  at any  point in time. A register is a small amount of very fast memory inside of a CPU  (as opposed to the slower <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/ram.html">RAM</a> main memory outside of the  CPU) that is used to speed the execution of <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/computer.html">computer</a>
programs by providing quick access to commonly used values, generally
those in the midst of a calculation. A program counter is a specialized
register that indicates the position of the CPU in its instruction
sequence and which holds either the address of the instruction being
executed or the address of the next instruction to be executed,
depending on the specific system. </font></p>
<p><font size="3">Context switching can be described in slightly more detail as the <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/kernel.html"><em> kernel</em> </a> (i.e., the core of the <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/operating_systems_list.html">operating system</a>)
performing the following activities with regard to processes (including
threads) on the CPU: (1) suspending the progression of one process and
storing the CPU's <em> state</em>  (i.e., the context) for that process
somewhere in memory, (2) retrieving the context of the next process
from memory and restoring it in the CPU's registers, and (3) returning
to the location indicated by the program counter (i.e., returning to
the line of code at which the process was interrupted) in order to
resume the process. </font></p>
<p><font size="3">A context switch is sometimes described as the kernel suspending <em> execution  of one process</em>  on the CPU and resuming <em> execution of some other  process</em>
that had previously been suspended. Although this wording can help
clarify the concept, it can be confusing in itself because a process <em> is</em> ,  by definition, an executing instance of a program. Thus the wording  <em> suspending progression of a process</em>  might be preferable. </font></p>
<p><font size="3"><strong> Context Switches and Mode Switches</strong> </font> </p>
<p><font size="3">Context switches can occur only in <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/kernel_mode.html"><em> kernel  mode</em> </a>.
Kernel mode is a privileged mode of the CPU in which only the kernel
runs and which provides access to all memory locations and all other
system resources. Other programs, including applications, initially
operate in <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/user_mode.html"><em> user mode</em> </a>, but they can run portions of the  kernel code via <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/system_call.html"><em> system calls</em> </a>. A system  call is a request in a <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/unix-like.html">Unix-like</a> operating system  by an <em> active process</em>  (i.e., a process currently progressing in the CPU)  for a service performed by the kernel, such as <em> input/output</em>  (I/O) or  <em> process creation</em>
(i.e., creation of a new process). I/O can be defined as any movement
of information to or from the combination of the CPU and main memory
(i.e. RAM), that is, communication between this combination and the
computer's users (e.g., via the keyboard or mouse), its <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/storage.html">storage</a> devices (e.g., disk or tape drives), or other  computers. </font></p>
<p><font size="3">The
existence of these two modes in Unix-like operating systems means that
a similar, but simpler, operation is necessary when a system call
causes the CPU to shift to kernel mode. This is referred to as a <em> mode switch</em>  rather than  a context switch, because it does not change the current process. </font></p>
<p><font size="3">Context switching is an essential feature of <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/multitasking.html"><em> multitasking</em> </a>
operating systems. A multitasking operating system is one in which
multiple processes execute on a single CPU seemingly simultaneously and
without interfering with each other. This illusion of <em> concurrency</em>
is achieved by means of context switches that are occurring in rapid
succession (tens or hundreds of times per second). These context
switches occur as a result of processes voluntarily relinquishing their
time in the CPU or as a result of the <em> scheduler</em>  making the switch  when a process has used up its CPU <em> time slice</em> . </font></p>
<p><font size="3">A context switch can also occur as a result of a <em> hardware interrupt</em> ,  which is a signal from a hardware device (such as a keyboard, mouse, modem or  system clock) to the kernel that an <em> event</em>  (e.g., a key press, mouse  movement or arrival of data from a <a href="http://blog.csdn.net/wave_1102/archive/2007/09/04/network.html">network</a>  connection) has occurred. </font></p>
<p><font size="3">Intel 80386 and higher CPUs contain hardware support for context switches.  However, most modern operating systems perform <em> software context  switching</em> , which can be used on any CPU, rather than <em> hardware context  switching</em>
in an attempt to obtain improved performance. Software context
switching was first implemented in Linux for Intel-compatible
processors with the 2.4 kernel. </font></p>
<p><font size="3">One
major advantage claimed for software context switching is that, whereas
the hardware mechanism saves almost all of the CPU state, software can
be more selective and save only that portion that actually needs to be
saved and reloaded. However, there is some question as to how important
this really is in increasing the efficiency of context switching. Its
advocates also claim that software context switching allows for the
possibility of improving the switching code, thereby further enhancing
efficiency, and that it permits better control over the validity of the
data that is being loaded. </font></p>
<p><font size="3"><strong> The Cost of Context Switching</strong> </font> </p>
<p><font size="3">Context
switching is generally computationally intensive. That is, it requires
considerable processor time, which can be on the order of nanoseconds
for each of the tens or hundreds of switches per second. Thus, context
switching represents a substantial <em> cost</em>  to the system in terms of CPU time and can,  in fact, be the most costly operation on an operating system. </font></p>
<p><font size="3">Consequently,
a major focus in the design of operating systems has been to avoid
unnecessary context switching to the extent possible. However, this has
not been easy to accomplish in practice. In fact, although the cost of
context switching has been declining when measured in terms of the
absolute amount of CPU time consumed, this appears to be due mainly to
increases in CPU clock speeds rather than to improvements in the
efficiency of context switching itself. </font></p>
<p><font size="3">One
of the many advantages claimed for Linux as compared with other
operating systems, including some other Unix-like systems, is its
extremely low cost of context switching and mode switching. </font></p>
<br>from:<br>http://blog.csdn.net/wave_1102/archive/2007/09/04/1771745.aspx
<br><br><img src ="http://www.cppblog.com/beautykingdom/aggbug/108457.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-02-25 23:09 <a href="http://www.cppblog.com/beautykingdom/archive/2010/02/25/108457.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>堆和栈的区别</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/11/103023.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 11 Dec 2009 15:53:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/11/103023.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/103023.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/11/103023.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/103023.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/103023.html</trackback:ping><description><![CDATA[<p>堆：　是大家共有的空间，分全局堆和局部堆。全局堆就是所有没有分配的空间，局部堆就是用户分配的空间。堆在操作系统对进程初始化的时候分配，运行过程中也可以向系统要额外的堆，但是记得用完了要还给操作系统，要不然就是内存泄漏。</p>
<p>栈：是线程独有的，保存其运行状态和局部自动变量。栈在线程开始的时候初始化，每个线程的栈互相独立，因此，栈是thread safe的。C++对象如果是局部变量，其数据成员也存放在栈中；每个函数都有自己的栈帧，栈被用来在函数之间传递参数和返回地址。操作系统在切换线程的时候会自动地切换栈，就是切换SS/ESP寄存器。栈空间不需要在高级语言里面显式地分配和释放。</p>
<p>堆和栈的区别</p>
<p>一、预备知识—程序的内存分配<br>一个由c/C++编译的程序占用的内存分为以下几个部分：<br>1、栈区（stack）— 由编译器自动分配释放，存放函数的参数值，局部变量的值等。其操作方式类似于数据结构中的栈。</p>
<p>2、堆区（heap） — 一般由程序员分配释放，若程序员不释放，程序结束时可能由OS回收。注意它与数据结构中的堆是两回事，分配方式倒是类似于链表。</p>
<p>3、全局区（静态区）（static）—，全局变量和静态变量的存储是放在一块的，初始化的全局变量和静态变量在一块区域，未初始化的全局变量和未初始化的静态变量在相邻的另一块区域。 - 程序结束后由系统释放。</p>
<p>4、文字常量区&nbsp;&nbsp; —常量字符串就是放在这里的。程序结束后由系统释放。</p>
<p>5、程序代码区—存放函数体的二进制代码。</p>
<p>二、例子程序<br>//main.cpp<br>int a = 0; 全局初始化区<br>char *p1; 全局未初始化区<br>int main()<br>{<br>int b; 栈<br>char s[] = "abc"; 栈<br>char *p2; 栈<br>char *p3 = "123456"; 123456\0在常量区，p3在栈上。<br>static int c = 0; 全局（静态）初始化区<br>p1 = (char *)malloc(10);<br>p2 = (char *)malloc(20);<br>分配得来的10和20字节的区域就在堆区。<br>strcpy(p1, "123456"); 123456\0放在常量区，编译器可能会将它与p3所指向的"123456"优化成一个地方。<br>}<br>三、堆和栈的理论知识<br>2.1申请方式<br>stack:<br>由系统自动分配。例如，声明在函数中一个局部变量 int b; 系统自动在栈中为b开辟空间<br>heap:<br>需要程序员自己申请，并指明大小，在c中用malloc函数<br>如p1 = (char *)malloc(10);<br>在C++中用new运算符<br>如p2 = new char[20];<br>但是注意p1本身在全局区，p2本身是在栈中的。</p>
<p><br>2.2<br>申请后系统的响应<br>栈：只要栈的剩余空间大于所申请空间，系统将为程序提供内存，否则将报异常提示栈溢出。<br>堆： 首先应该知道操作系统有一个记录空闲内存地址的链表，当系统收到程序的申请时，会遍历该链表，寻找第一个空间大于所申请空间的堆结点，然后将该结点从空闲 结点链表中删除，并将该结点的空间分配给程序，另外，对于大多数系统，会在这块内存空间中的首地址处记录本次分配的大小，这样，代码中的delete语句才能正确的释放本内存空间。另外，由于找到的堆结点的大小不一定正好等于申请的大小，系统会自动的将多余的那部分重新放入空闲链表中。</p>
<p>2.3申请大小的限制<br>栈：在Windows下,栈是向低地址扩展的数据结构，是一块连续的内存的区域。这句话的意思是栈顶的地址和栈的最大容量是系统预先规定好的，在WINDOWS下，栈的大小是2M（也可能是1M，它是一个编译时就确定的常数），如果申请的空间超过栈的剩余空间时，将提示overflow。因此，能从栈获得的空间较小<br>。<br>堆：堆是向高地址扩展的数据结构，是不连续的内存区域。这是由于系统是用链表来存储的空闲内存地址的，自然是不连续的，而链表的遍历方向是由低地址向高地址。堆的大小受限于计算机系统中有效的虚拟内存。由此可见，堆获得的空间比较灵活，也比较大。</p>
<p><br>2.4申请效率的比较：<br>栈由系统自动分配，速度较快。但程序员是无法控制的。<br>堆是由new分配的内存，一般速度比较慢，而且容易产生内存碎片,不过用起来最方便.<br>另外，在WINDOWS下，最好的方式是用VirtualAlloc分配内存，他不是在堆，也不是在栈是直接在进程的地址空间中保留一快内存，虽然用起来最不方便。但是速度快，也最灵活。</p>
<p>2.5堆和栈中的存储内容<br>栈：在函数调用时，第一个进栈的是主调函数中函数调用语句的下一条可执行指令的地址，然后是函数的各个参数，在大多数的C编译器中，参数是由右往左入栈的，然后是函数中的局部变量。注意静态变量是不入栈的。<br>当本次函数调用结束后，局部变量先出栈，然后是参数，最后栈顶指针指向最开始存的地址，也就是主调函数中的下一条指令，程序由该点继续运行。<br>堆：一般是在堆的头部用一个字节存放堆的大小。堆中的具体内容由程序员安排。</p>
<p>2.6存取效率的比较</p>
<p>char s1[] = "aaaaaaaaaaaaaaa";<br>char *s2 = "bbbbbbbbbbbbbbbbb";<br>aaaaaaaaaaa是在运行时刻赋值的；<br>而bbbbbbbbbbb是在编译时就确定的；<br>但是，在以后的存取中，在栈上的数组比指针所指向的字符串(例如堆)快。<br>比如：<br>void main()<br>{<br>char a = 1;<br>char c[] = "1234567890";<br>char *p ="1234567890";<br>a = c[1];<br>a = p[1];<br>return;<br>}<br>对应的汇编代码<br>10: a = c[1];<br>00401067 8A 4D F1 mov cl,byte ptr [ebp-0Fh]<br>0040106A 88 4D FC mov byte ptr [ebp-4],cl<br>11: a = p[1];<br>0040106D 8B 55 EC mov edx,dword ptr [ebp-14h]<br>00401070 8A 42 01 mov al,byte ptr [edx+1]<br>00401073 88 45 FC mov byte ptr [ebp-4],al<br>第一种在读取时直接就把字符串中的元素读到寄存器cl中，而第二种则要先把指针值读到edx中，在根据<br>edx读取字符，显然慢了。</p>
<p><br>2.7小结：<br>堆和栈的区别可以用如下的比喻来看出：<br>使用栈就象我们去饭馆里吃饭，只管点菜（发出申请）、付钱、和吃（使用），吃饱了就走，不必理会切菜、洗菜等准备工作和洗碗、刷锅等扫尾工作，他的好处是快捷，但是自由度小。<br>使用堆就象是自己动手做喜欢吃的菜肴，比较麻烦，但是比较符合自己的口味，而且自由度大。<br></p>
<p><br></p>
<p>下面是另一篇，总结的比上面好：
</p>
<p>堆和栈的联系与区别</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
在bbs上，堆与栈的区分问题，似乎是一个永恒的话题，由此可见，初学者对此往往是混淆不清的，所以我决定拿它第一个开刀。</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  首先，我们举一个例子：</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  void f() { int* p=new int[5]; }</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
这条短短的一句话就包含了堆与栈：看到new，我们首先就应该想到，我们分配了一块堆内存；那么指针p呢？它分配的是一块栈内存。所以这句话的意思就是：
在栈内存中存放了一个指向一块堆内存的指针p。程序会先确定在堆中分配内存的大小，然后调用operator
new分配内存，然后返回这块内存的首地址，放入栈中。它在VC6下的汇编代码如下：</p>
<p>00401028   push        14h</p>
<p>0040102A   call        operator new (00401060)</p>
<p>0040102F   add         esp,4</p>
<p>00401032   mov         dword ptr [ebp-8],eax</p>
<p>00401035   mov         eax,dword ptr [ebp-8]</p>
<p>00401038   mov         dword ptr [ebp-4],eax</p>
<p>Here, for simplicity, we did not release the memory. How should it be released, then? delete p? No, that is wrong; it should be delete []p. This tells the compiler: what I am deleting is an array, so VC6 will use the corresponding cookie information to do the work of freeing the memory.</p>
<p>Good. Back to our topic: what exactly is the difference between heap and stack?</p>
<p>The main differences are the following:</p>
<p>1. Different management;</p>
<p>2. Different space sizes;</p>
<p>3. Different fragmentation behavior;</p>
<p>4. Different growth directions;</p>
<p>5. Different allocation methods;</p>
<p>6. Different allocation efficiency;</p>
<p>Management: the stack is managed automatically by the compiler, without any manual control on our part; for the heap, releasing the memory is controlled by the programmer, which easily produces memory leaks.</p>
<p>Space size: generally speaking, on a 32-bit system heap memory can reach 4 GB of address space, so from this angle heap memory has almost no limit. The stack, however, normally has a fixed size; under VC6, for example, the default stack size is 1 MB (as far as I recall). Of course, we can change it:</p>
<p>Open the project and work through the menus: Project-&gt;Setting-&gt;Link; under Category select Output, then set the stack's maximum value and commit value in Reserve.</p>
<p>Note: the minimum reserve value is 4 bytes. Commit is kept in the virtual-memory page file; setting it larger makes the stack open up a larger region, which may increase memory overhead and startup time.</p>
<p>Fragmentation: for the heap, frequent new/delete inevitably leaves memory discontinuous, producing large amounts of fragmentation and lowering program efficiency. The stack has no such problem, because the stack is a last-in, first-out structure: pushes and pops correspond so strictly that a memory block can never be popped from the middle of the stack; before it is popped, everything pushed after it has already been popped. For details, consult a data structures text; we will not go through them one by one here.</p>
<p>Growth direction: the heap grows upward, that is, toward increasing memory addresses; the stack grows downward, toward decreasing memory addresses.</p>
<p>Allocation method: the heap is always allocated dynamically; there is no such thing as a statically allocated heap. The stack is allocated in two ways: statically and dynamically. Static allocation is done by the compiler, as with local variables. Dynamic allocation is performed by the alloca function; but the stack's dynamic allocation is unlike the heap's, since it is released by the compiler, with no manual work on our part.</p>
<p>Allocation efficiency: the stack is a data structure provided by the machine, with support at the hardware level: a dedicated register holds the stack address, and push and pop have dedicated instructions, which makes the stack quite efficient. The heap is provided by the C/C++ library, and its mechanism is complex: to allocate a block, the library function searches the heap, following some algorithm (see a data structures or operating systems text for specifics), for a free region of sufficient size; if none exists (perhaps because of excessive fragmentation), it may invoke a system facility to enlarge the program's data segment, so that there is a chance of obtaining enough memory, and then returns. Clearly, the heap is far less efficient than the stack.</p>
<p>From this we can see that, compared with the stack, the heap easily accumulates large amounts of fragmentation due to heavy new/delete use; lacking dedicated system support, it is much less efficient; and since it may trigger switches between user mode and kernel mode, allocating memory becomes still more expensive. So the stack is the most widely used structure in programs: even function calls are performed with the stack, where the parameters, return address, EBP, and local variables of a call are all stored. Therefore we recommend using the stack, rather than the heap, whenever possible.</p>
<p>Although the stack has all these advantages, it is not as flexible as the heap; when a large block of memory is needed, the heap is sometimes the better choice.</p>
<p>Whether with the heap or the stack, guard against out-of-bounds access (unless you cause it deliberately): the result is either a program crash or destruction of the program's heap or stack structure, with unpredictable consequences. Even if none of these problems shows up while your program runs, stay careful; it may crash at any time, and debugging it then can be quite hard. :) Oh, one more thing: if someone uses the combined term 堆栈, it means the stack, not the heap. Clear now?</p>
<p><br>From:<br><a href="http://blog.chinaunix.net/u2/76292/showart_1327414.html">http://blog.chinaunix.net/u2/76292/showart_1327414.html</a></p>
<p><a href="http://hi.baidu.com/54wangjun/blog/item/d1b4a74424d5934f510ffedd.html">http://hi.baidu.com/54wangjun/blog/item/d1b4a74424d5934f510ffedd.html<br></a></p><img src ="http://www.cppblog.com/beautykingdom/aggbug/103023.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-11 23:53 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/11/103023.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The concepts of synchronous, asynchronous, blocking, and non-blocking</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/05/102631.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sat, 05 Dec 2009 15:10:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/05/102631.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/102631.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/05/102631.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/102631.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/102631.html</trackback:ping><description><![CDATA[ In network programming we constantly encounter four calling modes: synchronous, asynchronous, blocking, and non-blocking. The concepts are easy to confuse with one another. Below is my understanding of these terms.
Synchronous

  Synchronous means that when a function call is issued, the call does not return until the result has been obtained. By this definition, the vast majority of functions are in fact synchronous calls (for example sin, isdigit). But when we speak of synchronous versus asynchronous, we usually mean tasks that require the cooperation of other components or that take some time to complete. The most common example is SendMessage: it sends a message to a window and does not return until the other side has processed the message; only then does it hand the LRESULT value returned by the message handler back to the caller.
Asynchronous

   Asynchronous is the opposite of synchronous. When an asynchronous call is issued, the caller does not get the result immediately; the component that actually handles the call notifies the caller on completion through status, notification, or a callback. Take the CAsyncSocket class as an example (note: CSocket derives from CAsyncSocket, but its behavior has been changed from asynchronous to synchronous): when a client issues a connection request by calling Connect, the calling thread can continue running at once. When the connection is actually established, the socket layer sends a message to notify the object.
As mentioned, the executing component can return results to the caller through three paths: status, notification, and callback. Which one is available depends on how the executing component is implemented; unless it offers multiple options, this is not under the caller's control. If the component signals through status, the caller has to check it at intervals, which is very inefficient (some beginners in multithreaded programming love polling a variable's value in a loop, which is actually a serious mistake). Notification is efficient, since the executing component needs almost no extra work. A callback function is, in practice, not much different from notification.
Blocking

   A blocking call means that the current thread is suspended until the call's result is returned; the function returns only after the result has been obtained.
Some people equate blocking calls with synchronous calls, but they are different. In a synchronous call, the current thread is often still active; it is only in a logical sense that the current function has not returned. For example, if we call Receive on a CSocket and the buffer holds no data, the function waits until data arrives; meanwhile, the current thread still continues to process all kinds of messages. If the main window and the calling function are in the same thread, then unless you make the call inside special UI handlers, the main UI should still be able to refresh.
The other socket receive function, recv, is an example of a blocking call. When the socket is working in blocking mode and recv is called with no data available, the current thread is suspended until data arrives.
Non-blocking

   Non-blocking is the counterpart of blocking: when the result cannot be obtained immediately, the function does not block the current thread but returns at once.
Blocking mode of objects versus blocking function calls

   Whether an object is in blocking mode and whether a function call blocks are strongly correlated, but not in one-to-one correspondence. A blocking object can be used in a non-blocking way: through certain APIs we can poll its state and call the blocking function only at the right moment, thus avoiding blocking. And on a non-blocking object, calling certain special functions can still enter a blocking call. The function select is such an example.<img src ="http://www.cppblog.com/beautykingdom/aggbug/102631.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-05 23:10 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/05/102631.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Interrupts, DMA, and Channels</title><link>http://www.cppblog.com/beautykingdom/archive/2009/11/10/100672.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Tue, 10 Nov 2009 15:34:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/11/10/100672.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/100672.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/11/10/100672.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/100672.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/100672.html</trackback:ping><description><![CDATA[<p><font size=3><strong>I. Polling</strong><br>Programmed polling of I/O devices was how early computer systems managed them: at fixed intervals the system queries each device in turn for pending requests; any request found is handled, and after servicing the I/O request the processor resumes its work.<br>Although polling takes time, the polling rate is much faster than the I/O devices themselves, so failing to service a request in time is generally not a problem. <br>Of course, even the fastest processor can handle only a limited number of input/output devices. Moreover, programmed polling consumes a considerable share of CPU time, so it is an inefficient technique and is rarely used in modern computer systems.</font></p>
<p><font size=3><strong>II. Interrupts </strong><br>The mismatch between a fast processor and slow input/output devices is a central problem that device management must solve. To raise overall efficiency, it is necessary to reduce the CPU-mediated data transfers used in programmed I/O. <br>Under interrupt-driven I/O, data transfer between the CPU and an I/O device proceeds as follows: <br>(1) When a process needs data, it issues an instruction to start the input/output device preparing the data. <br>(2) Having issued the instruction, the process gives up the processor and waits for the I/O operation to complete; meanwhile, the scheduler dispatches other ready processes onto the processor. <br>(3) When the I/O operation completes, the device controller raises an interrupt signal on the interrupt request line; on receiving it, the processor jumps to the pre-installed interrupt handler to process the transferred data. <br>(4) The process that obtained its data enters the ready state; at some later moment the scheduler selects it to continue running. <br><strong>Pros and cons of interrupts </strong><br>Interrupt-driven I/O raises processor utilization and supports multiprogramming and the parallel operation of I/O devices. <br>Problems remain, however. First, modern computer systems are usually configured with a wide variety of input/output devices. If they all operate in parallel through interrupts, a sharp rise in the number of interrupts can leave the CPU unable to respond and cause data loss. <br>Second, if the I/O controller's data buffer is small, an interrupt fires each time the buffer fills, so interrupts occur frequently during a transfer, consuming a great deal of CPU time.</font> </p>
<p><font size=3><strong>III. Direct Memory Access (DMA) </strong><br>Direct memory access means that data is transferred in blocks directly between memory and an I/O device. <br><strong>Characteristics of DMA </strong><br>DMA has two technical characteristics: direct transfer and block transfer. <br>Direct transfer means that while a data block moves between memory and the I/O device, no intermediate CPU intervention is needed; the CPU merely issues a &#8220;transfer data block&#8221; command to the device when the operation starts, and then learns through an interrupt whether the operation has finished and whether the next one is ready. <br><strong>How DMA works </strong><br>(1) When a process requests device input, the CPU loads the memory start address for the incoming data and the number of bytes to transfer into the DMA controller's memory address register and transfer byte counter, respectively.<br>(2) The process that issued the transfer request enters a wait state; the CPU instruction in progress is suspended, and the scheduler dispatches another process to occupy the CPU.<br>(3) The input device keeps stealing CPU cycles, writing data from its buffer register into memory continuously until all the requested bytes have been transferred.<br>(4) When all bytes have been transferred, the DMA controller raises an interrupt on the interrupt request line; on receiving it, the CPU enters the interrupt handler for follow-up processing.<br>(5) After interrupt handling, the CPU returns to the interrupted process, or switches to a new process context, and continues executing.<br><strong>DMA versus interrupts</strong><br>(1) Interrupt-driven I/O raises an interrupt whenever the data buffer register fills, requiring CPU handling each time, whereas DMA interrupts the CPU only after the entire requested block has been transferred. This greatly reduces the number of interrupts the CPU must handle.<br>(2) With interrupts, data transfer is performed by the CPU during interrupt handling; with DMA, it is performed under the DMA controller's control without the CPU. This eliminates cases where the CPU cannot keep up because too many devices run in parallel, or loses data because of speed mismatches.<br><strong>Pros and cons of DMA</strong><br>Because the I/O device exchanges data blocks directly with memory, DMA gives relatively high I/O efficiency, and since it improves I/O efficiency it is widely used in modern computer systems; many device controllers, particularly block-device controllers, support DMA. <br>The analysis above shows that the capability of the DMA controller is the key factor deciding DMA efficiency. The controller does a great deal of work for each transfer, and a larger transfer unit means fewer transfers. Also, because DMA steals clock cycles, CPU efficiency drops; to steal as few cycles as possible, the DMA controller's performance should be raised, so that the impact on CPU efficiency is smaller.</font> </p>
<font size=3><strong>IV. Channels</strong><br>An input/output channel is a processor independent of the CPU that is dedicated to managing I/O; it controls direct data exchange between devices and memory. It has its own channel instructions, which are initiated by the CPU, and it raises an interrupt to the CPU when an operation completes (see Figure 6-3).<br>Channel control is a memory-centered scheme in which devices and memory exchange data directly. In channel mode, the transfer direction, the memory start address of the data, and the length of the transferred block are all controlled by the channel.<br>Furthermore, one channel can control the data exchange of multiple devices with memory. Channels thus further lighten the CPU's workload and increase the degree of parallelism in the computer system.<br><strong>Channel types</strong><br>By information exchange mode and the kinds of devices attached, channels fall into three types:<br>(1) Byte-multiplexor channel<br>Suitable for connecting low- or medium-speed devices such as printers and terminals. This channel works byte-interleaved: after transferring one byte for one device, it immediately turns to transfer a byte for another.<br>(2) Selector channel<br>Suitable for connecting high-speed devices such as disks and tapes. This channel works in &#8220;burst mode&#8221;, transferring a batch of data each time at a very high rate, but serving only one device during any given period. When one I/O request has been handled, it selects another device and serves it.<br>(3) Block-multiplexor channel<br>This type combines the time-sharing of the byte-multiplexor channel with the high transfer rate of the selector channel. In essence, it applies multiprogramming to channel programs, so that the devices attached to the channel can operate in parallel.<br><strong>How a channel works</strong><br>In channel mode, the I/O device controller (often simply called the I/O controller) has no transfer byte counter or memory address register, but gains a channel device controller and an instruction execution unit. The CPU only needs to issue a start instruction naming the channel operation and the I/O device; this starts the channel, which then fetches the corresponding channel program from memory and executes it.<br>Once the CPU issues the start-channel instruction, the channel sets to work: the I/O channel controls the I/O controllers, and the I/O controllers in turn control the I/O devices. In this way one channel can connect several I/O controllers, and each I/O controller can connect a number of external devices of the same type.<br><strong>Channel connection</strong><br>Since channels and controllers are generally fewer in number than devices, improper wiring tends to create &#8220;bottlenecks&#8221;, so devices are usually cross-connected. The benefits are:<br>(1) Higher system reliability: when one path is broken by a controller or channel failure, another path can be used.<br>(2) Higher device parallelism: for a given device, when the controller or channel on one of its paths is occupied, another idle path can be chosen, reducing the time the device spends waiting for a path.<br><strong>The channel processor</strong><br>A channel is effectively a special-purpose processor. It has its own instruction set, including read, write, control, branch, end, and no-op instructions, and it can execute channel programs written in these instructions.<br>The channel's control components include:<br>(1) Channel Address Word (CAW): records the address where the next channel instruction is stored, functioning like the central processor's program counter.<br>(2) Channel Command Word (CCW): records the channel instruction currently being executed, acting like the central processor's instruction register.<br>(3) Channel Status Word (CSW): records the status of the channel, controller, and device, including I/O-transfer-complete information, error information, retry counts, and so on.<br>
<strong>Channel access to the host</strong><br>A channel generally shares main memory with the host in order to hold channel programs and exchange data; it accesses memory by &#8220;cycle stealing&#8221;.<br>With channels, input/output executes as follows:<br>When the CPU, while running a user program, meets an I/O request, it builds a channel program from the request (the program may also be prewritten), places it in memory, and puts the program's start address into the CAW.<br>The CPU then executes a &#8220;Start I/O&#8221; instruction to set the channel working. The channel receives the &#8220;Start I/O&#8221; signal, takes the channel program's start address from the CAW, fetches the program's first instruction at that address into the CCW, and at the same time sends an acknowledgment back to the CPU, indicating that &#8220;Start I/O&#8221; is complete and the CPU may continue.<br>The channel then executes the channel program, performing the physical I/O operations. After finishing one instruction, it continues with the next if there is one; otherwise the transfer is complete, the channel stops itself, notifies the CPU to handle the channel-end event, and the CPU obtains the channel status from the CSW.<br>In short, with channels, I/O is handled by a dedicated auxiliary processor, which relieves the main processor of the I/O burden. The main processor only issues one I/O command; the channel takes full charge of the rest. When the I/O operation ends, the channel raises an interrupt request to indicate that the operation has completed.<br><strong>Evolution of channels</strong><br>The channel idea grew out of early mainframe systems, which were generally equipped with large numbers of I/O devices. To separate the management of I/O devices from the host, the concept of the I/O channel took shape, and dedicated I/O channel processors were designed.<br>The I/O channel is a very important component of a computer system and played a considerable role in improving overall system performance. As technology advanced and the performance of processors and I/O devices kept rising, dedicated standalone I/O channel processors became rare. But the channel idea has been absorbed into many new technologies and remains in wide use. Because Fibre Channel offers high data transfer rates, long transfer distances, and simpler design of large storage systems, general-purpose Fibre Channel technology is developing rapidly. One such channel can accommodate up to 127 large-capacity hard disk drives. Clearly, general-purpose Fibre Channel has broad prospects in large-capacity, high-speed storage applications.<br><br>Source:<br><a href="http://blog.chinaunix.net/u2/67780/showart_2063742.html">http://blog.chinaunix.net/u2/67780/showart_2063742.html</a></font>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/100672.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-11-10 23:34 <a href="http://www.cppblog.com/beautykingdom/archive/2009/11/10/100672.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>callback function from wikipedia</title><link>http://www.cppblog.com/beautykingdom/archive/2009/05/12/82747.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Tue, 12 May 2009 15:28:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/05/12/82747.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/82747.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/05/12/82747.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/82747.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/82747.html</trackback:ping><description><![CDATA[<h1 class=firstHeading id=firstHeading>Callback (computer science)</h1>
<div id=bodyContent>
<h3 id=siteSub>From Wikipedia, the free encyclopedia</h3>
<div id=contentSub></div>
<!-- start content -->
<div class=dablink>For a discussion of callback with computer <a title=Modem href="http://en.wikipedia.org/wiki/Modem"><u><font color=#0000ff>modems</font></u></a>, see <a title="Callback (telecommunications)" href="http://en.wikipedia.org/wiki/Callback_%28telecommunications%29"><u><font color=#0000ff>callback (telecommunications)</font></u></a>.</div>
<p>In <a title="Computer programming" href="http://en.wikipedia.org/wiki/Computer_programming"><u><font color=#0000ff>computer programming</font></u></a>, a <strong>callback</strong> is <a class=mw-redirect title="Executable code" href="http://en.wikipedia.org/wiki/Executable_code"><u><font color=#0000ff>executable code</font></u></a> that is passed as an <a class=mw-redirect title="Argument (computer science)" href="http://en.wikipedia.org/wiki/Argument_%28computer_science%29"><u><font color=#0000ff>argument</font></u></a> to other code. It allows a lower-level <a title="Abstraction layer" href="http://en.wikipedia.org/wiki/Abstraction_layer"><u><font color=#0000ff>software layer</font></u></a> to call a <a title=Subroutine href="http://en.wikipedia.org/wiki/Subroutine"><u><font color=#0000ff>subroutine</font></u></a> (or function) defined in a higher-level layer.</p>
<div class="thumb tright">
<div class=thumbinner style="WIDTH: 372px"><a class=image title="A callback is often back on the level of the original caller." href="http://en.wikipedia.org/wiki/File:Callback-notitle.svg"><img class=thumbimage height=138 alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Callback-notitle.svg/370px-Callback-notitle.svg.png" width=370 border=0></a>
<div class=thumbcaption>
<div class=magnify><a class=internal title=Enlarge href="http://en.wikipedia.org/wiki/File:Callback-notitle.svg"><img height=11 alt="" src="http://en.wikipedia.org/skins-1.5/common/images/magnify-clip.png" width=15></a></div>
A callback is often back on the level of the original caller.</div>
</div>
</div>
<p>However, while technically accurate, this might not be the most illustrative explanation. Think of it as an "In case of fire, break glass" subroutine. Many computer programs tend to be written such that they expect a certain set of possibilities at any given moment. A common theme is: if "Thing That Was Expected", then "Do something"; otherwise, "Do something else". However, there are many situations in which events (such as fire) could happen at any time. Rather than checking for them at each possible step ("Thing that was expected OR Things are on fire"), it is easier to have a system which detects a number of events and will call the appropriate function upon said event (this also keeps us from having to write programs like "Thing that was expected OR Things are on fire OR Nuclear meltdown OR alien invasion OR the dead rising from the grave OR... etc., etc."). Instead, a callback routine is a sort of insurance policy. If zombies attack, call this function. If the user moves their mouse over an icon, call HighlightIcon, and so forth.</p>
<p>Usually, there is a <a title="Software framework" href="http://en.wikipedia.org/wiki/Software_framework"><u><font color=#0000ff>framework</font></u></a> in which a series of events (some condition is met) in which the running framework (be it a generic <a title="Library (computing)" href="http://en.wikipedia.org/wiki/Library_%28computing%29"><u><font color=#0000ff>library</font></u></a> or unique to the program) will call a registered chunk of code based on some pre-registered function (typically, a handle or a function pointer) The events may be anything from user <a title=Input/output href="http://en.wikipedia.org/wiki/Input/output"><u><font color=#0000ff>input</font></u></a> (such as <a title="Mouse (computing)" href="http://en.wikipedia.org/wiki/Mouse_%28computing%29"><u><font color=#0000ff>mouse</font></u></a> or <a title="Keyboard (computing)" href="http://en.wikipedia.org/wiki/Keyboard_%28computing%29"><u><font color=#0000ff>keyboard</font></u></a> input), <a title="Computer network" href="http://en.wikipedia.org/wiki/Computer_network"><u><font color=#0000ff>network</font></u></a> activity (callbacks are frequently used as message handlers for new network sessions) or an internal <a title="Operating system" href="http://en.wikipedia.org/wiki/Operating_system"><u><font color=#0000ff>operating system</font></u></a> event (such as a POSIX-style signal) The concept is to develop a piece of code that can be registered within some framework (be it a <a title="Graphical user interface" href="http://en.wikipedia.org/wiki/Graphical_user_interface"><u><font color=#0000ff>GUI</font></u></a> toolkit, network library, etc.) that will serve as the handler upon the condition stated at registration. How the flow of control is passed between the underlying framework and the registered callback function is specific to the framework itself.</p>
<div class="thumb tright">
<div class=thumbinner style="WIDTH: 252px"><a class=image title="In another common scenario, the callback is first registered and later called asynchronously." href="http://en.wikipedia.org/wiki/File:Callback-async-notitle.svg"><img class=thumbimage height=278 alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/25/Callback-async-notitle.svg/250px-Callback-async-notitle.svg.png" width=250 border=0></a>
<div class=thumbcaption>
<div class=magnify><a class=internal title=Enlarge href="http://en.wikipedia.org/wiki/File:Callback-async-notitle.svg"><img height=11 alt="" src="http://en.wikipedia.org/skins-1.5/common/images/magnify-clip.png" width=15></a></div>
In another common scenario, the callback is first registered and later called asynchronously.</div>
</div>
</div>
<p><a id=Motivation name=Motivation></a></p>
<h2><span class=mw-headline>Motivation</span></h2>
<p>To understand the motivation for using callbacks, consider the problem of a network server. At any given point in time, it may have an internal state machine that is currently at a point in which it is dealing with one very specific communication session, not necessarily expecting new participants. As a host, it could be dealing with all the name exchange and handshakes and pleasantries, but no real way of dealing with the next dinner party guest that walks through the door. One way to deal with this is for this server to live by a state machine in which it rejects new connections until the current one is dealt with...not very robust (What if the other end goes away unexpectedly?) and not very scalable (Would you really want to make other clients wait (or more likely, keep retrying to connect) until it's their turn?) Instead, it's easier to have some sort of management process that spins off a new thread (or process) to deal with the new connection. Rather than writing programs that keep dealing with all of the possible resource contention problems that could come of this, or all of the details involved in socket code (your desired platform may be more straight-forward than others, but one of your design goals may be cross-platform compatibility), many have opted to use more generic frameworks that will handle such details in exchange for providing a reference such that the underlying framework can call it if the registered event occurs.</p>
<p><a id=Example name=Example></a></p>
<h2><span class=mw-headline>Example</span></h2>
<p>The following code in <a title="C (programming language)" href="http://en.wikipedia.org/wiki/C_%28programming_language%29"><u><font color=#0000ff>C</font></u></a> demonstrates the use of callbacks for the specific case of dealing with a POSIX-style signal (in this case SIGUSR1).</p>
<div dir=ltr style="TEXT-ALIGN: left">
<pre class=source-c>#include &lt;stdio.h&gt;
#include &lt;signal.h&gt;

/* The callback: invoked by the runtime when SIGUSR1 is delivered.
   A signal handler must return void and take an int. */
void sig(int signum)
{
    printf("Received signal number %d!\n", signum);
}

int main(int argc, char *argv[])
{
    signal(SIGUSR1, &amp;sig);  /* register the callback */
    while (1) {}             /* spin, waiting for the signal */
    return 0;
}
</pre>
</div>
<p>The while loop will keep this example from doing anything interesting, but it will give you plenty of time to send a signal to this process. (If you're on a unix-like system, try a "kill -USR1 &lt;pid&gt;" to the process ID associated with this sample program. No matter how or when you send it, the callback should respond.)</p>
<p><a id=Implementation name=Implementation></a></p>
<h2><span class=mw-headline>Implementation</span></h2>
<p>The form of a callback varies among <a title="Programming language" href="http://en.wikipedia.org/wiki/Programming_language"><u><font color=#0000ff>programming languages</font></u></a>.</p>
<ul>
    <li><a title="C (programming language)" href="http://en.wikipedia.org/wiki/C_%28programming_language%29"><u><font color=#0000ff>C</font></u></a> and <a title=C++ href="http://en.wikipedia.org/wiki/C%2B%2B"><u><font color=#0000ff>C++</font></u></a> allow <a title="Function pointer" href="http://en.wikipedia.org/wiki/Function_pointer"><u><font color=#0000ff>function pointers</font></u></a> as arguments to other functions.</li>
    <li>Several programming languages (though especially <a title="Functional programming" href="http://en.wikipedia.org/wiki/Functional_programming"><u><font color=#0000ff>functional programming</font></u></a> languages such as <a title="Scheme (programming language)" href="http://en.wikipedia.org/wiki/Scheme_%28programming_language%29"><u><font color=#0000ff>Scheme</font></u></a> or <a title="ML (programming language)" href="http://en.wikipedia.org/wiki/ML_%28programming_language%29"><u><font color=#0000ff>ML</font></u></a>) allow <a title="Closure (computer science)" href="http://en.wikipedia.org/wiki/Closure_%28computer_science%29"><u><font color=#0000ff>closures</font></u></a>, a generalization of function pointers, as arguments to other functions.</li>
    <li>Several programming languages, especially <a title="Interpreted language" href="http://en.wikipedia.org/wiki/Interpreted_language"><u><font color=#0000ff>interpreted languages</font></u></a>, allow one to pass the <em>name</em> of a function A as a parameter to a function B and have B call A by means of <a title=Eval href="http://en.wikipedia.org/wiki/Eval"><u><font color=#0000ff>eval</font></u></a>.</li>
    <li>In <a title="Object-oriented programming" href="http://en.wikipedia.org/wiki/Object-oriented_programming"><u><font color=#0000ff>object-oriented programming</font></u></a> languages, a call can accept an object that implements some abstract interface, without specifying in detail how the object should do so. The programmer who implements that object may use the interface's methods exclusively for application-specific code. Such objects are effectively a bundle of callbacks, plus the data they need to manipulate. They are useful in implementing various <a class=mw-redirect title="Design patterns (computer science)" href="http://en.wikipedia.org/wiki/Design_patterns_%28computer_science%29"><u><font color=#0000ff>design patterns</font></u></a> like <a title="Visitor pattern" href="http://en.wikipedia.org/wiki/Visitor_pattern"><u><font color=#0000ff>Visitor</font></u></a>, <a title="Observer pattern" href="http://en.wikipedia.org/wiki/Observer_pattern"><u><font color=#0000ff>Observer</font></u></a>, and <a title="Strategy pattern" href="http://en.wikipedia.org/wiki/Strategy_pattern"><u><font color=#0000ff>Strategy</font></u></a>.</li>
    <li>C++ allows objects to provide their own implementation of the function call operation. The <a title="Standard Template Library" href="http://en.wikipedia.org/wiki/Standard_Template_Library"><u><font color=#0000ff>Standard Template Library</font></u></a> accepts these objects (called <em><a title="Function object" href="http://en.wikipedia.org/wiki/Function_object"><u><font color=#0000ff>functors</font></u></a></em>), as well as function pointers, as parameters to various polymorphic algorithms.</li>
    <li>C# .NET Framework provides a <a title="Type safety" href="http://en.wikipedia.org/wiki/Type_safety"><u><font color=#0000ff>type-safe</font></u></a> encapsulating reference, a '<a title="Delegate (.NET)" href="http://en.wikipedia.org/wiki/Delegate_%28.NET%29"><u><font color=#0000ff>delegate</font></u></a>', to manage <a title="Function pointer" href="http://en.wikipedia.org/wiki/Function_pointer"><u><font color=#0000ff>function pointers</font></u></a>. These can be used for callback operations.</li>
    <li><a title=Perl href="http://en.wikipedia.org/wiki/Perl"><u><font color=#0000ff>Perl</font></u></a> supports subroutine references.<sup class=reference id=cite_ref-0><a title="" href="http://en.wikipedia.org/wiki/Callback_%28computer_science%29#cite_note-0"><u><font color=#0000ff><span>[</span>1<span>]</span></font></u></a></sup><sup class=reference id=cite_ref-1><a title="" href="http://en.wikipedia.org/wiki/Callback_%28computer_science%29#cite_note-1"><u><font color=#0000ff><span>[</span>2<span>]</span></font></u></a></sup></li>
    <li>Some systems have built-in programming languages to support extension and adaptation. These languages provide callbacks without the need for separate software development tools.</li>
</ul>
<p><a id=Special_cases name=Special_cases></a></p>
<h2><span class=mw-headline>Special cases</span></h2>
<p>Callback functions are also frequently used as a means to handle exceptions arising within the low level function, as a way to enable <a class=mw-redirect title="Side-effect (computer science)" href="http://en.wikipedia.org/wiki/Side-effect_%28computer_science%29"><u><font color=#0000ff>side-effects</font></u></a> in response to some condition, or as a way to gather operational statistics in the course of a larger computation. <a title="Interrupt handler" href="http://en.wikipedia.org/wiki/Interrupt_handler"><u><font color=#0000ff>Interrupt handlers</font></u></a> in an <a title="Operating system" href="http://en.wikipedia.org/wiki/Operating_system"><u><font color=#0000ff>operating system</font></u></a> respond to hardware conditions, <a class=mw-redirect title="Signal handler" href="http://en.wikipedia.org/wiki/Signal_handler"><u><font color=#0000ff>signal handlers</font></u></a> of a process are triggered by the operating system, and <a title="Event handler" href="http://en.wikipedia.org/wiki/Event_handler"><u><font color=#0000ff>event handlers</font></u></a> process the asynchronous input a program receives.</p>
<p>A <strong>pure callback function</strong> is one which is <a title="Purely functional" href="http://en.wikipedia.org/wiki/Purely_functional"><u><font color=#0000ff>purely functional</font></u></a> (always returns the same value given the same inputs) and free of observable side-effects. Some uses of callbacks require pure callback functions to operate correctly.</p>
<p>A special case of a callback is called a <strong>predicate callback</strong>, or just predicate for short. This is a pure callback function which accepts a single input value and returns a <a title=Boolean href="http://en.wikipedia.org/wiki/Boolean"><u><font color=#0000ff>Boolean</font></u></a> value. These types of callbacks are useful for filtering collections of values by some condition.</p>
<p><a id=See_also name=See_also></a></p>
<h2><span class=mw-headline>See also</span></h2>
<ul>
    <li><a title="Signals and slots" href="http://en.wikipedia.org/wiki/Signals_and_slots"><font color=#0000ff><u>Signals and slots</u></font></a></li>
    <li><a title=Libsigc++ href="http://en.wikipedia.org/wiki/Libsigc%2B%2B"><u><font color=#0000ff>libsigc++</font></u></a>, a callback library for C++</li>
    <li><a title="Implicit invocation" href="http://en.wikipedia.org/wiki/Implicit_invocation"><font color=#0000ff><u>Implicit invocation</u></font></a></li>
    <li><a title="User exit" href="http://en.wikipedia.org/wiki/User_exit"><font color=#0000ff><u>User exit</u></font></a></li>
    <li><a title="Inversion of control" href="http://en.wikipedia.org/wiki/Inversion_of_control"><u><font color=#0000ff>Inversion of control</font></u></a></li>
</ul>
<p><a id=External_links name=External_links><u><font color=#0000ff></font></u></a></p>
<h2><span class=mw-headline>External links</span></h2>
<ul>
    <li><a class="external text" title=http://gotw.ca/gotw/083.htm href="http://gotw.ca/gotw/083.htm" rel=nofollow><font color=#0000ff><u>Style Case Study #2: Generic Callbacks</u></font></a></li>
    <li><a class="external text" title=http://partow.net/programming/templatecallback/index.html href="http://partow.net/programming/templatecallback/index.html" rel=nofollow><font color=#0000ff><u>C++ Callback Solution</u></font></a></li>
    <li><a class="external text" title=http://msdn.microsoft.com/msdnmag/issues/02/12/basicinstincts href="http://msdn.microsoft.com/msdnmag/issues/02/12/basicinstincts" rel=nofollow><font color=#0000ff><u>Basic Instincts: Implementing Callback Notifications Using Delegates</u></font></a></li>
    <li><a class="external text" title=http://www.codeproject.com/aspnet/ScriptCallbackFramework.asp href="http://www.codeproject.com/aspnet/ScriptCallbackFramework.asp" rel=nofollow><font color=#0000ff><u>Implement Script Callback Framework in ASP.NET</u></font></a></li>
    <li><a class="external text" title=http://www.javaworld.com/javaworld/javatips/jw-javatip10.html href="http://www.javaworld.com/javaworld/javatips/jw-javatip10.html" rel=nofollow><u><font color=#0000ff>Implement callback routines in Java</font></u></a></li>
</ul>
<p><a id=References name=References><u><font color=#0000ff></font></u></a></p>
<h2><span class=mw-headline>References</span></h2>
<div class="references-small references-column-count references-column-count-2" style="-moz-column-count: 2">
<ol class=references>
    <li id=cite_note-0><strong><a title="" href="http://en.wikipedia.org/wiki/Callback_%28computer_science%29#cite_ref-0"><u><font color=#0000ff>^</font></u></a></strong> <cite class=web style="FONT-STYLE: normal"><a class="external text" title=http://www.unix.org.ua/orelly/perl/cookbook/ch11_05.htm href="http://www.unix.org.ua/orelly/perl/cookbook/ch11_05.htm" rel=nofollow><u><font color=#0000ff>"Perl Cookbook - 11.4. Taking References to Functions"</font></u></a><span class=printonly>. <a class="external free" title=http://www.unix.org.ua/orelly/perl/cookbook/ch11_05.htm href="http://www.unix.org.ua/orelly/perl/cookbook/ch11_05.htm" rel=nofollow><u><font color=#0000ff>http://www.unix.org.ua/orelly/perl/cookbook/ch11_05.htm</font></u></a></span><span class=reference-accessdate>. Retrieved on 2008-03-03</span>.</cite><span class=Z3988 title=ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=bookitem&amp;rft.btitle=Perl+Cookbook+-+11.4.+Taking+References+to+Functions&amp;rft.atitle=&amp;rft_id=http%3A%2F%2Fwww.unix.org.ua%2Forelly%2Fperl%2Fcookbook%2Fch11_05.htm&amp;rfr_id=info:sid/en.wikipedia.org:Callback_(computer_science)><span style="DISPLAY: none">&nbsp;</span></span></li>
    <li id=cite_note-1><strong><a title="" href="http://en.wikipedia.org/wiki/Callback_%28computer_science%29#cite_ref-1"><u><font color=#0000ff>^</font></u></a></strong> <cite class=web style="FONT-STYLE: normal"><a class="external text" title=http://www.unix.org.ua/orelly/perl/advprog/ch04_02.htm href="http://www.unix.org.ua/orelly/perl/advprog/ch04_02.htm" rel=nofollow><u><font color=#0000ff>"Advanced Perl Programming - 4.2 Using Subroutine References"</font></u></a><span class=printonly>. <a class="external free" title=http://www.unix.org.ua/orelly/perl/advprog/ch04_02.htm href="http://www.unix.org.ua/orelly/perl/advprog/ch04_02.htm" rel=nofollow><u><font color=#0000ff>http://www.unix.org.ua/orelly/perl/advprog/ch04_02.htm</font></u></a></span><span class=reference-accessdate>. Retrieved on 2008-03-03</span>.</cite><span class=Z3988 title=ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.genre=bookitem&amp;rft.btitle=Advanced+Perl+Programming+-+4.2+Using+Subroutine+References&amp;rft.atitle=&amp;rft_id=http%3A%2F%2Fwww.unix.org.ua%2Forelly%2Fperl%2Fadvprog%2Fch04_02.htm&amp;rfr_id=info:sid/en.wikipedia.org:Callback_(computer_science)><span style="DISPLAY: none">&nbsp;</span></span></li>
</ol>
</div>
<div class=printfooter>Retrieved from "<a href="http://en.wikipedia.org/wiki/Callback_%28computer_science%29"><u><font color=#0000ff>http://en.wikipedia.org/wiki/Callback_(computer_science)</font></u></a>"</div>
</div>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/82747.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-05-12 23:28 <a href="http://www.cppblog.com/beautykingdom/archive/2009/05/12/82747.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Critical Section&lt;from wikipedia&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2009/05/11/82547.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 11 May 2009 03:48:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/05/11/82547.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/82547.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/05/11/82547.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/82547.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/82547.html</trackback:ping><description><![CDATA[<h1 id="firstHeading" class="firstHeading">Critical section</h1>
<h3 id="siteSub">From Wikipedia, the free encyclopedia</h3>
<p>In <a  href="http://en.wikipedia.org/wiki/Concurrent_programming" title="Concurrent programming" class="mw-redirect">concurrent programming</a> a <strong>critical section</strong> is a piece of <a  href="http://en.wikipedia.org/wiki/Code" title="Code">code</a>
that accesses a shared resource (data structure or device) that must
not be concurrently accessed by more than one thread of execution. A
critical section will usually terminate in fixed time, and a thread,
task or process will only have to wait a fixed time to enter it (i.e.
bounded waiting). Some <a  href="http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29" title="Synchronization (computer science)">synchronization</a> mechanism is required at the entry and exit of the critical section to ensure exclusive use, for example a <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphore</a>.</p>
<p>By carefully controlling which variables are modified inside and
outside the critical section (usually, by accessing important state
only from within), concurrent access to that state is prevented. A
critical section is typically used when a <a  href="http://en.wikipedia.org/wiki/Multithreaded" title="Multithreaded" class="mw-redirect">multithreaded</a>
program must update multiple related variables without a separate
thread making conflicting changes to that data. In a related situation,
a critical section may be used to ensure a shared resource, for example
a printer, can only be accessed by one process at a time.</p>
<p>How critical sections are implemented varies among operating systems.</p>
<p>The simplest method is to prevent any change of processor control
inside the critical section. On uni-processor systems, this can be done
by disabling interrupts on entry into the critical section, avoiding
system calls that can cause a <a  href="http://en.wikipedia.org/wiki/Context_switch" title="Context switch">context switch</a>
while inside the section and restoring interrupts to their previous
state on exit. Any thread of execution entering any critical section
anywhere in the system will, with this implementation, prevent any
other thread, including an interrupt, from getting the CPU and
therefore from entering any other critical section or, indeed, any code
whatsoever, until the original thread leaves its critical section.</p>
<p>This brute-force approach can be improved upon by using <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphores</a>. To enter a critical section, a thread must obtain a <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphore</a>,
which it releases on leaving the section. Other threads are prevented
from entering the critical section at the same time as the original
thread, but are free to gain control of the CPU and execute other code,
including other critical sections that are protected by different <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphores</a>.</p>
<p>Some confusion exists in the literature about the relationship between different critical sections in the same program.<sup class="noprint template-fact" title="This claim needs references to reliable sources&nbsp;since november 2008" style="white-space: nowrap;">[<em><a  href="http://en.wikipedia.org/wiki/Wikipedia:Citation_needed" title="Wikipedia:Citation needed">citation needed</a></em>]</sup>
In general, a resource that must be protected from concurrent access
may be accessed by several pieces of code. Each piece must be guarded
by a common <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphore</a>. Is each piece now a critical section or are all the pieces guarded by the same <a  href="http://en.wikipedia.org/wiki/Semaphore_%28programming%29" title="Semaphore (programming)">semaphore</a> in aggregate a single critical section? This confusion is evident in definitions of a critical section such as "... a piece of <a  href="http://en.wikipedia.org/wiki/Code" title="Code">code</a> that can only be <a  href="http://en.wikipedia.org/wiki/Execution_%28computers%29" title="Execution (computers)" class="mw-redirect">executed</a> by one <a  href="http://en.wikipedia.org/wiki/Computer_process" title="Computer process" class="mw-redirect">process</a> or <a  href="http://en.wikipedia.org/wiki/Thread_%28computer_science%29" title="Thread (computer science)">thread</a>
at a time". This only works if all access to a protected resource is
contained in one "piece of code", which requires either the definition
of a piece of code or the code itself to be somewhat contrived.</p>
<p><a name="Application_Level_Critical_Sections" id="Application_Level_Critical_Sections"></a></p>
<h2><span class="mw-headline">Application Level Critical Sections</span></h2>
<p><a  href="http://en.wikipedia.org/wiki/Application_software" title="Application software">Application</a>-level critical sections reside in the <a  href="http://en.wikipedia.org/wiki/Random_access_memory" title="Random access memory" class="mw-redirect">memory</a> range of the process and are usually modifiable by the process itself. This is called a <a  href="http://en.wikipedia.org/wiki/User_space" title="User space">user-space</a> object because the program run by the user (as opposed to the <a  href="http://en.wikipedia.org/wiki/Kernel_%28computer_science%29" title="Kernel (computer science)" class="mw-redirect">kernel</a>) can modify and interact with the object. However the functions called may jump to <a  href="http://en.wikipedia.org/wiki/Kernel_space" title="Kernel space" class="mw-redirect">kernel-space</a> code to register the user-space object with the kernel.</p>
<p><strong>Example Code For Critical Sections with POSIX pthread library</strong></p>
<div dir="ltr" style="text-align: left;">
<pre class="source-c"><span class="coMULTI">/* Sample C/C++, Unix/Linux */</span><br><span class="co2">#include &lt;pthread.h&gt;</span><br> <br><span class="coMULTI">/* This is the critical section object (statically allocated). */</span><br><span class="kw4">static</span> pthread_mutex_t cs_mutex = PTHREAD_MUTEX_INITIALIZER;<br> <br><span class="kw4">void</span> f<span class="br0">(</span><span class="br0">)</span><br><span class="br0">{</span><br>    <span class="coMULTI">/* Enter the critical section -- other threads are locked out */</span><br>    pthread_mutex_lock<span class="br0">(</span> &amp;cs_mutex <span class="br0">)</span>;<br> <br>    <span class="coMULTI">/* Do some thread-safe processing! */</span><br> <br>    <span class="coMULTI">/*Leave the critical section -- other threads can now pthread_mutex_lock()  */</span><br>    pthread_mutex_unlock<span class="br0">(</span> &amp;cs_mutex <span class="br0">)</span>;<br><span class="br0">}</span><br></pre>
</div>
<p><strong>Example Code For Critical Sections with Win32 API</strong></p>
<div dir="ltr" style="text-align: left;">
<pre class="source-c"><span class="coMULTI">/* Sample C/C++, Windows, link to kernel32.dll */</span><br><span class="co2">#include &lt;windows.h&gt;</span><br> <br><span class="kw4">static</span> CRITICAL_SECTION cs; <span class="coMULTI">/* This is the critical section object -- once initialized,<br>                               it cannot be moved in memory */</span><br>                            <span class="coMULTI">/* If you program in OOP, declare this in your class */</span><br> <br><span class="coMULTI">/* Initialize the critical section before entering multi-threaded context. */</span><br>InitializeCriticalSection<span class="br0">(</span>&amp;cs<span class="br0">)</span>;<br> <br><span class="kw4">void</span> f<span class="br0">(</span><span class="br0">)</span><br><span class="br0">{</span><br>    <span class="coMULTI">/* Enter the critical section -- other threads are locked out */</span><br>    EnterCriticalSection<span class="br0">(</span>&amp;cs<span class="br0">)</span>;<br> <br>    <span class="coMULTI">/* Do some thread-safe processing! */</span><br> <br>    <span class="coMULTI">/* Leave the critical section -- other threads can now EnterCriticalSection() */</span><br>    LeaveCriticalSection<span class="br0">(</span>&amp;cs<span class="br0">)</span>;<br><span class="br0">}</span><br> <br><span class="coMULTI">/* Release system object when all finished -- usually at the end of the cleanup code */</span><br>DeleteCriticalSection<span class="br0">(</span>&amp;cs<span class="br0">)</span>;<br></pre>
</div>
<p>Note that on <a  href="http://en.wikipedia.org/wiki/Windows_NT" title="Windows NT">Windows NT</a> (not 9x/ME), the function <strong>TryEnterCriticalSection()</strong>
can be used to attempt to enter the critical section. This function
returns immediately so that the thread can do other things if it fails
to enter the critical section (usually due to another thread having
locked it). With the pthreads library, the equivalent function is <strong>pthread_mutex_trylock()</strong>. Note that the use of a CriticalSection is not the same as a Win32 <a  href="http://en.wikipedia.org/wiki/Mutual_exclusion" title="Mutual exclusion">Mutex</a>, which is an <a  href="http://en.wikipedia.org/wiki/Object_%28computer_science%29" title="Object (computer science)">object</a> used for <em>inter-process</em> synchronization. A Win32 CriticalSection is for <em>intra-process</em> synchronization (and is much faster in terms of lock times), but it cannot be shared across processes.</p>
<p><a name="Kernel_Level_Critical_Sections" id="Kernel_Level_Critical_Sections"></a></p>
<h2><span class="mw-headline">Kernel Level Critical Sections</span></h2>
<p>Typically, critical sections prevent process and thread migration between processors and the <a  href="http://en.wikipedia.org/wiki/Preemption_%28computing%29" title="Preemption (computing)">preemption</a> of processes and threads by interrupts and other processes and threads.</p>
<p>Critical sections often allow nesting. Nesting allows multiple critical sections to be entered and exited at little cost.</p>
<p>If the <a  href="http://en.wikipedia.org/wiki/Scheduling_%28computing%29" title="Scheduling (computing)">scheduler</a>
interrupts the current process or thread in a critical section, the
scheduler will either allow the process or thread to run to completion
of the critical section, or it will schedule the process or thread for
another complete quantum. The scheduler will not migrate the process or
thread to another processor, and it will not schedule another process
or thread to run while the current process or thread is in a critical
section.</p>
<p>Similarly, if an <a  href="http://en.wikipedia.org/wiki/Interrupt" title="Interrupt">interrupt</a>
occurs in a critical section, the interrupt's information is recorded
for future processing, and execution is returned to the process or
thread in the critical section. Once the critical section is exited,
and in some cases the scheduled quantum completes, the pending
interrupt will be executed.</p>
<p>Since critical sections may <a  href="http://en.wikipedia.org/wiki/Execution_%28computers%29" title="Execution (computers)" class="mw-redirect">execute</a>
only on the processor on which they are entered, synchronization is
only required within the executing processor. This allows critical
sections to be entered and exited at almost zero cost. No
interprocessor synchronization is required, only instruction stream
synchronization. Most processors provide the required amount of
synchronization by the simple act of interrupting the current execution
state. This allows critical sections in most cases to be nothing more
than a per processor count of critical sections entered.</p>
<p>Performance enhancements include executing pending interrupts at the
exit of all critical sections and allowing the scheduler to run at the
exit of all critical sections. Furthermore, pending interrupts may be
transferred to other processors for execution.</p>
<p>Critical sections should not be used as a long-lived locking primitive. They should be short enough to be entered, executed, and exited without any interrupts occurring, whether from <a  href="http://en.wikipedia.org/wiki/Computer_hardware" title="Computer hardware" class="mw-redirect">hardware</a> or from the scheduler.</p>
<p>Kernel Level Critical Sections are the base of the <a  href="http://en.wikipedia.org/wiki/Software_lockout" title="Software lockout">software lockout</a> issue.</p>
<p><a name="See_also" id="See_also"></a></p>
<h2><span class="mw-headline">See also</span></h2>
<ul>
    <li><a  href="http://en.wikipedia.org/wiki/Lock_%28computer_science%29" title="Lock (computer science)">Lock (computer science)</a></li>
</ul>
<p><a name="External_links" id="External_links"></a></p>
<h2><span class="mw-headline">External links</span></h2>
<p>Critical Section documentation on the <a  href="http://en.wikipedia.org/wiki/MSDN_Library" title="MSDN Library">MSDN Library</a> homepage: <a  href="http://msdn2.microsoft.com/en-us/library/ms682530.aspx" class="external free" title="http://msdn2.microsoft.com/en-us/library/ms682530.aspx" rel="nofollow">http://msdn2.microsoft.com/en-us/library/ms682530.aspx</a></p><img src ="http://www.cppblog.com/beautykingdom/aggbug/82547.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-05-11 11:48 <a href="http://www.cppblog.com/beautykingdom/archive/2009/05/11/82547.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>进程和线程编程&lt;转&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2009/03/26/77949.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 26 Mar 2009 07:56:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/03/26/77949.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/77949.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/03/26/77949.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/77949.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/77949.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: http://man.lupaworld.com/content/develop/joyfire/system/11.html#I255&nbsp;&nbsp;                        进程和线程编程            &nbsp;目 录    进程和线程编程            原始管道       ...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2009/03/26/77949.html'>阅读全文</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/77949.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" 
href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-03-26 15:56 <a href="http://www.cppblog.com/beautykingdom/archive/2009/03/26/77949.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>OS FAQ</title><link>http://www.cppblog.com/beautykingdom/archive/2009/03/15/76637.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 15 Mar 2009 05:20:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/03/15/76637.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/76637.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/03/15/76637.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/76637.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/76637.html</trackback:ping><description><![CDATA[<p>V6:::::::::</p>
<p>1.3 What is the main advantage of multiprogramming?<br>Answer: Multiprogramming makes efficient use of the CPU by overlapping the demands for the CPU and its I/O devices from various users. It attempts to increase CPU utilization by always having something for the CPU to execute.<br><br>1.5 In a multiprogramming and time-sharing environment, several users share the system simultaneously. This situation can result in various security problems.<br>a. What are two such problems?<br>b. Can we ensure the same degree of security in a time-shared machine as we have in a dedicated machine? Explain your answer.<br>Answer:<br>a. Stealing or copying one&#8217;s programs or data; using system resources (CPU, memory, disk space, peripherals) without proper accounting.<br>b. Probably not, since any protection scheme devised by humans can inevitably be broken by a human, and the more complex the scheme, the more difficult it is to feel confident of its correct implementation.<br><br>1.9 Describe the differences between symmetric and asymmetric multiprocessing. What are three advantages and one disadvantage of multiprocessor systems?<br>Answer: Symmetric multiprocessing treats all processors as equals, and I/O can be processed on any CPU. Asymmetric multiprocessing has one master CPU; the remaining CPUs are slaves. The master distributes tasks among the slaves, and I/O is usually done by the master only. Multiprocessors can save money by not duplicating power supplies, housings, and peripherals. They can execute programs more quickly and can have increased reliability. They are also more complex in both hardware and software than uniprocessor systems.<br><br>1.10 What is the main difficulty that a programmer must overcome in writing an operating system for a real-time environment?<br>Answer: The main difficulty is keeping the operating system within the fixed time constraints of a real-time system.
If the system does not complete a task in a certain time frame, it may cause a breakdown of the entire system. Therefore, when writing an operating system for a real-time system, the writer must be sure that the scheduling schemes don&#8217;t allow response time to exceed the time constraint.<br>&nbsp;<br>2.1 Prefetching is a method of overlapping the I/O of a job with that job&#8217;s own computation. The idea is simple. After a read operation completes and the job is about to start operating on the data, the input device is instructed to begin the next read immediately. The CPU and input device are then both busy. With luck, by the time the job is ready for the next data item, the input device will have finished reading that data item. The CPU can then begin processing the newly read data, while the input device starts to read the following data.<br>A similar idea can be used for output. In this case, the job creates data that are put into a buffer until an output device can accept them. Compare the prefetching scheme with the spooling scheme, where the CPU overlaps the input of one job with the computation and output of other jobs.<br>Answer: Prefetching is a user-based activity, while spooling is a system-based activity. Spooling is a much more effective way of overlapping I/O and CPU operations.<br><br>2.3 What are the differences between a trap and an interrupt? What is the use of each function?<br>&nbsp;&nbsp;&nbsp;An interrupt is a hardware-generated change of flow within the system.
An interrupt handler is summoned to deal with the cause of the interrupt; control is then returned to the interrupted context and instruction.<br>&nbsp;&nbsp;&nbsp;A trap is a software-generated interrupt.<br>&nbsp;&nbsp;&nbsp;An interrupt can be used to signal the completion of an I/O operation, obviating the need for device polling.<br>&nbsp;&nbsp;&nbsp;A trap can be used to call operating system routines or to catch arithmetic errors.<br><br>V7::::::::</p>
<font style="BACKGROUND-COLOR: #ffffff">19.3 The Linux 2.6 kernel can be built with no virtual memory system. Explain how this feature may appeal to designers of real-time systems.<br>Answer: By disabling the virtual memory system, processes are guaranteed to have portions of their address spaces resident in physical memory. This results in a system that does not suffer from page faults and therefore does not have to deal with unanticipated costs corresponding to paging the address space. The resulting system is appealing to designers of real-time systems who prefer to avoid variability in performance.</font>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/76637.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-03-15 13:20 <a href="http://www.cppblog.com/beautykingdom/archive/2009/03/15/76637.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Using the Thread-Creation Functions</title><link>http://www.cppblog.com/beautykingdom/archive/2008/09/26/62790.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 26 Sep 2008 01:07:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2008/09/26/62790.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/62790.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2008/09/26/62790.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/62790.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/62790.html</trackback:ping><description><![CDATA[<p>Application scenario:<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I built a client that communicates with the Message Middleware and fetches, in real time, the messages it publishes in topic mode rather than Queue mode (for the Message Middleware, a Queue sends to a single destination, while a topic can send to several).</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; To fetch messages in real time, the client starts a thread that receives the Message Middleware messages and then performs whatever processing the scenario requires. The thread is created as follows:</p>
<pre>
// for compilers which have it, we should use the C RTL function for thread
// creation instead of the Win32 API one, because otherwise we will have
// memory leaks if the thread uses the C RTL (and most threads do)
#if defined(__VISUALC__) || \
    (defined(__BORLANDC__) &amp;&amp; (__BORLANDC__ &gt;= 0x500)) || \
    (defined(__GNUG__) &amp;&amp; defined(__MSVCRT__))
    typedef unsigned (__stdcall *RtlThreadStart)(void *);

    m_hThread = (HANDLE)_beginthreadex(NULL, 0,
                                       (RtlThreadStart)
                                       wxThreadInternal::WinThreadStart,
                                       thread, CREATE_SUSPENDED,
                                       (unsigned int *)&amp;m_tid);
#else // compiler doesn't have _beginthreadex
    m_hThread = ::CreateThread
                (
                 NULL,                              // default security
                 0,                                 // default stack size
                 (LPTHREAD_START_ROUTINE)           // thread entry point:
                 wxThreadInternal::WinThreadStart,  // the function run by the thread
                 (LPVOID)thread,                    // parameter
                 CREATE_SUSPENDED,                  // flags
                 &amp;m_tid                             // [out] thread id
                );
#endif // _beginthreadex/CreateThread
</pre>
<p>Note: a definition of the thread function must appear before these lines, e.g.:<br>&nbsp;DWORD wxThreadInternal::WinThreadStart(wxThread *thread)</p>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/62790.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2008-09-26 09:07 <a href="http://www.cppblog.com/beautykingdom/archive/2008/09/26/62790.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>