﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-beautykingdom-随笔分类-Socket</title><link>http://www.cppblog.com/beautykingdom/category/12297.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 22 May 2012 06:54:34 GMT</lastBuildDate><pubDate>Tue, 22 May 2012 06:54:34 GMT</pubDate><ttl>60</ttl><item><title>Comparing Two High-Performance I/O Design Patterns&lt;forward&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 21 May 2012 03:24:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/175576.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/175576.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/175576.html</trackback:ping><description><![CDATA[<span style="font-size: 18px; color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; ">by Alexander Libman with Vladimir Gilbourd</span><br style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; " /><span style="font-size: 15px; color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; ">November 25, 2005</span>&nbsp;<br /><br /><div style="padding-left: 38px; padding-right: 38px; padding-bottom: 1em; color: #212324; font-family: Arial, Helvetica, 
sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; "><div style="font-size: 1.25em; ">Summary</div>This article investigates and compares different design patterns of high performance TCP-based servers. In addition to existing approaches, it proposes a scalable single-codebase, multi-platform solution (with code examples) and describes its fine-tuning on different platforms. It also compares performance of Java, C# and C++ implementations of proposed and existing solutions.</div><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">System I/O can be blocking, or non-blocking synchronous, or non-blocking asynchronous [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; ">2</a>]. Blocking I/O means that the calling system does not return control to the caller until the operation is finished. As a result, the caller is blocked and cannot perform other activities during that time. Most important, the caller thread cannot be reused for other request processing while waiting for the I/O to complete, and becomes a wasted resource during that time. For example, a<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in blocking mode will not return control if the socket buffer is empty until some data becomes available.</p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">By contrast, a non-blocking synchronous call returns control to the caller immediately. 
The caller is not made to wait, and the invoked system immediately returns one of two responses: If the call was executed and the results are ready, then the caller is told of that. Alternatively, the invoked system can tell the caller that the system has no resources (no data in the socket) to perform the requested action. In that case, it is the responsibility of the caller to repeat the call until it succeeds. For example, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in non-blocking mode may return the number of bytes read or a special return code -1 with errno set to&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">EWOULDBLOCK/EAGAIN</code>, meaning "not ready; try again later."</p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">In a non-blocking asynchronous call, the calling function returns control to the caller immediately, reporting that the requested action was started. The calling system will execute the caller's request using additional system resources/threads and will notify the caller (by callback, for example) when the result is ready for processing. For example, a Windows&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">ReadFile()</code>&nbsp;or POSIX&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>&nbsp;API returns immediately and initiates an internal system read operation. 
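The non-blocking synchronous semantics described above can be sketched with Java NIO, where a non-blocking read that finds no data returns 0 instead of blocking (the analogue of C's -1 with EWOULDBLOCK). This is a minimal, hypothetical demo of the semantics, not code from the article:

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class NonBlockingReadDemo {
    // Returns what a non-blocking read reports when no data is available yet.
    public static int tryRead() throws Exception {
        Pipe pipe = Pipe.open();              // an in-process channel pair
        Pipe.SourceChannel source = pipe.source();
        source.configureBlocking(false);      // switch to non-blocking synchronous mode
        ByteBuffer buf = ByteBuffer.allocate(64);
        int n = source.read(buf);             // nothing written yet: returns 0
                                              // ("not ready; try again later")
                                              // instead of blocking the caller
        source.close();
        pipe.sink().close();
        return n;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("non-blocking read returned: " + tryRead());
    }
}
```

A blocking-mode read on the same empty channel would simply not return until data arrived, which is exactly the wasted-thread scenario described above.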
Of the three approaches, this non-blocking asynchronous approach offers the best scalability and performance.</p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; "><span style="font-family: Arial; font-size: 12pt; ">This article investigates different non-blocking I/O multiplexing mechanisms and proposes a single multi-platform design pattern/solution. We hope that this article will help developers of high-performance TCP-based servers choose an optimal design solution. We also compare the performance of Java, C# and C++ implementations of proposed and existing solutions. We will exclude the blocking approach from further discussion and </span>comparison, as it is the least effective approach in terms of scalability and performance.<br /></p><h1>Reactor and Proactor: two I/O multiplexing approaches</h1><p><span style="font-size: 12pt; ">In general, I/O</span><span style="font-size: 12pt; "> multiplexing mechanisms rely on an event demultiplexor [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">3</span></a><span style="font-size: 12pt; ">], an object that dispatches I/O events from a limited number of sources to the appropriate read/write event handlers. 
The developer registers interest in specific events and provides event handlers, or callbacks.</span><span style="font-size: 12pt; "> The event demultiplexor delivers the requested events to the event handlers.</span></p><p><span style="font-size: 12pt; ">Two patterns that involve event demultiplexors are called Reactor and Proactor [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">]. The Reactor pattern involves synchronous I/O, whereas the Proactor pattern involves asynchronous I/O. In Reactor, the event demultiplexor waits for events that indicate when a file descriptor or socket is ready for a read or write operation. The demultiplexor passes this event to the appropriate handler, which is responsible for performing the actual read or write.</span></p><p><span style="font-size: 12pt; ">In the Proactor pattern, by contrast, the handler&#8212;or the event demultiplexor on behalf of the handler&#8212;initiates asynchronous read and write operations. The I/O operation itself is performed by the operating system (OS). The parameters passed to the OS include the addresses of user-defined data buffers from which the OS gets data to write, or to which the OS puts data read. The event demultiplexor waits for events that indicate the completion of the I/O operation, and forwards those events to the appropriate handlers. For example, on Windows a handler could initiate async I/O (overlapped in Microsoft terminology) operations, and the event demultiplexor could wait for IOCompletion events [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">]. 
The implementation of this classic asynchronous pattern is based on an asynchronous OS-level API, and we will call this implementation the "system-level" or "true" async, because the application fully relies on the OS to execute actual I/O.</span></p><p><span style="font-size: 12pt; ">An example will help you understand the difference between Reactor and Proactor. We will focus on the read operation here, as the write implementation is similar. Here's a read in Reactor:</span></p><ul><li><span style="font-size: 12pt; ">An event handler declares interest in I/O events that indicate readiness for read on a particular socket</span></li><li><span style="font-size: 12pt; ">The event demultiplexor waits for events</span></li><li><span style="font-size: 12pt; ">An event comes in and wakes up the demultiplexor, and the demultiplexor calls the appropriate handler</span></li><li><span style="font-size: 12pt; ">The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher</span></li></ul><p><span style="font-size: 12pt; ">By comparison, here is a read operation in Proactor (true async):</span></p><ul><li><span style="font-size: 12pt; ">A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). 
In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.</span></li><li><span style="font-size: 12pt; ">The event demultiplexor waits until the operation is completed</span></li><li><span style="font-size: 12pt; ">While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete</span></li><li><span style="font-size: 12pt; ">The event demultiplexor calls the appropriate handler;</span></li><li><span style="font-size: 12pt; ">The event handler handles the data from the user-defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.</span></li></ul><h1>Current practice</h1><p><span style="font-size: 12pt; ">The open-source C++ development framework ACE [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">3</span></a><span style="font-size: 12pt; ">] developed by Douglas Schmidt, et al., offers a wide range of platform-independent, low-level concurrency support classes (threading, mutexes, etc.). On the top level it provides two separate groups of classes: implementations of the ACE Reactor and ACE Proactor. 
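The Reactor read sequence listed above can be sketched with java.nio, where the Selector plays the role of the event demultiplexor and registered attachments are the handlers that perform the actual read. This is a minimal, hypothetical illustration; the class and interface names are ours, not ACE's or TProactor's:

```java
import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class ReactorSketch {
    // The callback a user registers; in a Reactor it performs the read itself.
    public interface Handler { void handle(SelectionKey key) throws IOException; }

    private final Selector selector = Selector.open(); // the event demultiplexor

    public ReactorSketch() throws IOException {}

    // Step 1: declare interest in readiness events and attach the handler.
    public void register(SelectableChannel ch, int ops, Handler h) throws IOException {
        ch.configureBlocking(false);
        ch.register(selector, ops, h);
    }

    // Steps 2-4: wait for events, then dispatch each ready key to its handler,
    // which does the actual read (the defining trait of the Reactor pattern).
    public void runOnce(long timeoutMs) throws IOException {
        selector.select(timeoutMs);
        for (SelectionKey key : selector.selectedKeys()) {
            ((Handler) key.attachment()).handle(key);
        }
        selector.selectedKeys().clear();               // reset for the next round
    }
}
```

A handler here typically reads the channel, processes the data, and leaves its interest set registered, matching the last bullet of the Reactor list.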
Although both of them are based on platform-independent primitives, these tools offer different interfaces.</span></p><p><span style="font-size: 12pt; ">The ACE Proactor gives much better performance and robustness on MS-Windows, as Windows provides a very efficient async API, based on operating-system-level support [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">4</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">5</span></a><span style="font-size: 12pt; ">].</span></p><p><span style="font-size: 12pt; ">Unfortunately, not all operating systems provide fully robust async OS-level support. For instance, many Unix systems do not. Therefore, the ACE Reactor is the preferable solution on UNIX (currently UNIX does not have robust async facilities for sockets). As a result, to achieve the best performance on each system, developers of networked applications need to maintain two separate code-bases: an ACE Proactor-based solution on Windows and an ACE Reactor-based solution for Unix-based systems.</span></p><p><span style="font-size: 12pt; ">As we mentioned, the true async Proactor pattern requires operating-system-level support. Due to the differing nature of event handler and operating-system interaction, it is difficult to create common, unified external interfaces for both Reactor and Proactor patterns. That, in turn, makes it hard to create a fully portable development framework and encapsulate the interface and OS-related differences.</span></p><h1>Proposed solution</h1><p><span style="font-size: 12pt; ">In this section, we will propose a solution to the challenge of designing a portable framework for the Proactor and Reactor I/O patterns. 
To demonstrate this solution, we will transform a Reactor demultiplexor I/O solution into emulated async I/O by moving the read/write operations from the event handlers into the demultiplexor (the "emulated async" approach). The following example illustrates that conversion for a read operation:</span></p><blockquote style="color: #212324; background-color: #ffffff; "><ul><li><span style="font-size: 12pt; ">An event handler declares interest in I/O events (readiness for read) and provides the demultiplexor with information such as the address of a data buffer, or the number of bytes to read.</span></li><li><span style="font-size: 12pt; ">Dispatcher waits for events (for example, on&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">);</span></li><li><span style="font-size: 12pt; ">When an event arrives, it wakes up the dispatcher. The dispatcher performs a non-blocking read operation (it has all the necessary information to perform this operation) and on completion calls the appropriate handler.</span></li><li><span style="font-size: 12pt; ">The event handler handles data from the user-defined buffer and declares renewed interest in I/O events, along with the data buffer address and the number of bytes to read. The event handler then returns control to the dispatcher.</span></li></ul></blockquote><p><span style="font-size: 12pt; ">As we can see, by adding functionality to the demultiplexor I/O pattern, we were able to convert the Reactor pattern into a Proactor pattern. In terms of the amount of work performed, this approach is exactly the same as the Reactor pattern; we have simply shifted responsibilities between actors, so there is no performance degradation. 
The following lists of steps demonstrate that each approach performs an equal amount of work:</span></p><p><span style="font-size: 12pt; ">Standard/classic Reactor:</span></p><ul><li><span style="font-size: 12pt; ">Step 1) wait for event (Reactor job)</span></li><li><span style="font-size: 12pt; ">Step 2) dispatch "Ready-to-Read" event to user handler (Reactor job)</span></li><li><span style="font-size: 12pt; ">Step 3) read data (user handler job)</span></li><li><span style="font-size: 12pt; ">Step 4) process data (user handler job)</span></li></ul><p><span style="font-size: 12pt; ">Proposed emulated Proactor:</span></p><ul><li><span style="font-size: 12pt; ">Step 1) wait for event (Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 2) read data (now Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 3) dispatch "Read-Completed" event to user handler (Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 4) process data (user handler job)</span></li></ul><p><span style="font-size: 12pt; ">On an operating system that does not provide an async I/O API, this approach allows us to hide the reactive nature of the available socket APIs and to expose a fully proactive async interface. This allows us to create a fully portable, platform-independent solution with a common external interface.</span></p><h1>TProactor</h1><p><span style="font-size: 12pt; ">The proposed solution (TProactor) was developed and implemented at Terabit P/L [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">6</span></a><span style="font-size: 12pt; ">]. The solution has two alternative implementations, one in C++ and one in Java. 
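The emulated-Proactor steps above can be sketched by letting the dispatcher own both the select()-style wait and the non-blocking read, so the handler only ever sees a "read completed" event with data already in its buffer. This is a hypothetical sketch in java.nio; the names are illustrative and not TProactor's actual API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class EmulatedProactorSketch {
    // Completion callback: the handler only ever sees finished reads.
    public interface ReadCompletion {
        void onReadCompleted(ByteBuffer data) throws IOException;
    }

    private final Selector selector = Selector.open();

    public EmulatedProactorSketch() throws IOException {}

    // Step 1 of the emulated Proactor: the handler hands over its buffer
    // up front, together with the completion callback.
    public void asyncRead(SelectableChannel ch, ByteBuffer buf, ReadCompletion h)
            throws IOException {
        ch.configureBlocking(false);
        ch.register(selector, SelectionKey.OP_READ, new Object[]{buf, h});
    }

    // Steps 2-4: the dispatcher waits, performs the non-blocking read itself,
    // then delivers a "Read-Completed" event -- work is shifted, not added.
    public void runOnce(long timeoutMs) throws IOException {
        selector.select(timeoutMs);
        for (SelectionKey key : selector.selectedKeys()) {
            Object[] op = (Object[]) key.attachment();
            ByteBuffer buf = (ByteBuffer) op[0];
            ((ReadableByteChannel) key.channel()).read(buf); // dispatcher reads
            key.interestOps(0);                              // one-shot operation
            ((ReadCompletion) op[1]).onReadCompleted(buf);   // completion event
        }
        selector.selectedKeys().clear();
    }
}
```

Compare with the Reactor: the underlying select() wait and non-blocking read are identical, but the handler's contract is now purely proactive, which is what makes a common cross-platform interface possible.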
The C++ version was built using ACE cross-platform low-level primitives and has a common unified async proactive interface on all platforms.</span></p><p><span style="font-size: 12pt; ">The main TProactor components are the Engine and WaitStrategy interfaces. Engine manages the async operations lifecycle. WaitStrategy manages concurrency strategies. WaitStrategy depends on Engine and the two always work in pairs. Interfaces between Engine and WaitStrategy are strongly defined.</span></p><p><span style="font-size: 12pt; ">Engines and waiting strategies are implemented as pluggable class-drivers (for the full list of all implemented Engines and corresponding WaitStrategies, see Appendix 1). TProactor is a highly configurable solution. It internally implements three engines (POSIX AIO, SUN AIO and Emulated AIO) and hides six different waiting strategies, based on an asynchronous kernel API (for POSIX- this is not efficient right now due to internal POSIX AIO API problems) and synchronous Unix&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">,&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">poll()</span></code><span style="font-size: 12pt; ">, /dev/poll (Solaris 5.8+),&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">port_get</span></code><span style="font-size: 12pt; ">&nbsp;(Solaris 5.10), RealTime (RT) signals (Linux 2.4+), epoll (Linux 2.6), k-queue (FreeBSD) APIs. TProactor conforms to the standard ACE Proactor implementation interface. 
That makes it possible to develop a single cross-platform solution (POSIX/MS-WINDOWS) with a common (ACE Proactor) interface.</span></p><p><span style="font-size: 12pt; ">With a set of mutually interchangeable "lego-style" Engines and WaitStrategies, a developer can choose the appropriate internal mechanism (engine and waiting strategy) at run time by setting appropriate configuration parameters. These settings may be specified according to specific requirements, such as the number of connections, scalability, and the targeted OS. If the operating system supports an async API, a developer may use the true async approach; otherwise the user can opt for an emulated async solution built on different sync waiting strategies. All of those strategies are hidden behind an emulated async fa&#231;ade.</span></p><p><span style="font-size: 12pt; ">For an HTTP server running on Sun Solaris, for example, a /dev/poll or&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">port_get()</span></code><span style="font-size: 12pt; ">-based engine is the most suitable choice, able to serve a huge number of connections, but for another UNIX solution with a limited number of connections but high throughput requirements, a&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-based engine may be a better approach. Such flexibility cannot be achieved with a standard ACE Reactor/Proactor, due to inherent algorithmic problems of different wait strategies (see Appendix 2).</span></p><p><span style="font-size: 12pt; ">In terms of performance, our tests show that emulating from reactive to proactive does not impose any overhead&#8212;it can be faster, but not slower. 
According to our test results, TProactor gives on average 10-35% better performance (measured in terms of both throughput and response times) than the reactive model in the standard ACE Reactor implementation on various UNIX/Linux platforms. On Windows it gives the same performance as the standard ACE Proactor.</span></p><h1>Performance comparison (JAVA versus C++ versus C#).</h1><p><span style="font-size: 12pt; ">In addition to C++, we also implemented TProactor in Java. As of JDK version 1.4, Java provides only the sync-based approach that is logically similar to C&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">&nbsp;[</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">7</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">8</span></a><span style="font-size: 12pt; ">]. Java TProactor is based on Java's non-blocking facilities (the java.nio packages) and is logically similar to the C++ TProactor with a waiting strategy based on&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">.</span></p><p><span style="font-size: 12pt; ">Figures 1 and 2 chart the transfer rate in bits/sec versus the number of connections. 
These charts represent comparison results for a simple echo-server built on the standard ACE Reactor (running on RedHat Linux 9.0), the TProactor C++ and Java (IBM 1.4 JVM) versions on Microsoft Windows and RedHat Linux 9.0, and a C# echo-server running on the Windows operating system. Performance of the native AIO APIs is represented by the "Async"-marked curves; emulated AIO (TProactor) by the AsyncE curves; and TP_Reactor by the Synch curves. All implementations were bombarded by the same client application: a continuous stream of fixed-size messages over N connections.</span></p><p><span style="font-size: 12pt; ">The full set of tests was performed on the same hardware. Tests on different machines proved that relative results are consistent.</span></p><div style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_1.gif" alt="" /></div><div style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; "><span style="font-size: 12pt; ">Figure 1. Windows XP/P4 2.6GHz HyperThreading/512 MB RAM.</span></div><div style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_2.gif" alt="" /></div><div style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; "><span style="font-size: 12pt; ">Figure 2. Linux RedHat 2.4.20-smp/P4 2.6GHz HyperThreading/512 MB RAM.</span></div><h1>User code example</h1><p><span style="font-size: 12pt; ">The following is the skeleton of a simple TProactor-based Java echo-server. 
In a nutshell, the developer only has to implement the two interfaces:&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">OpRead</span></code><span style="font-size: 12pt; ">&nbsp;with buffer where TProactor puts its read results, and&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">OpWrite</span></code><span style="font-size: 12pt; ">&nbsp;with a buffer from which TProactor takes data. The developer will also need to implement protocol-specific logic via providing callbacks&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">onReadCompleted()</span></code><span style="font-size: 12pt; ">&nbsp;and&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">onWriteCompleted()</span></code><span style="font-size: 12pt; ">&nbsp;in the&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">AsynchHandler</span></code><span style="font-size: 12pt; ">&nbsp;interface implementation. 
Those callbacks will be asynchronously called by TProactor on completion of read/write operations and executed in a thread pool provided by TProactor (the developer doesn't need to write his own pool).</span></p><pre style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; padding-left: 38px; "><div><span style="font-size: 12px;">class EchoServerProtocol implements AsynchHandler</span></div><div><span style="font-size: 12px;">{</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; AsynchChannel achannel = null;</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; ByteBuffer buffer = ByteBuffer.allocate( 4096 ); // read buffer handed to TProactor</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; EchoServerProtocol(Demultiplexor m, SelectableChannel channel)</span><span style="font-size: 12px; "> throws Exception&nbsp;</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; {</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; &nbsp; &nbsp; this.achannel = new AsynchChannel( m, this, channel );</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    public void start() throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">	// called after construction&nbsp;</span></div><div><span style="font-size: 12px;">	System.out.println( Thread.currentThread().getName() + ": EchoServer protocol started" );&nbsp;</span></div><div><span style="font-size: 12px;">&nbsp; &nbsp; &nbsp; &nbsp; achannel.read( buffer );</span></div><div><span style="font-size: 12px;">    }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    public void onReadCompleted( OpRead opRead ) throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">	if (opRead.getError() != null )</span></div><div><span style="font-size: 12px;">	{</span></div><div><span style="font-size: 12px;">		// handle error, do clean-up if needed&nbsp;</span></div><div><span style="font-size: 12px;">		System.out.println( "EchoServer::readCompleted: " + opRead.getError().toString());</span></div><div><span style="font-size: 12px;">		achannel.close();</span></div><div><span style="font-size: 12px;">		return;</span></div><div><span style="font-size: 12px;">	}</span></div><div><span style="font-size: 12px; ">		</span></div><div><span style="font-size: 12px;">	if (opRead.getBytesCompleted() &lt;= 0)</span></div><div><span style="font-size: 12px;">	{</span></div><div><span style="font-size: 12px;">		System.out.println( "EchoServer::readCompleted: Peer closed " + opRead.getBytesCompleted());</span></div><div><span style="font-size: 12px;">		achannel.close();</span></div><div><span style="font-size: 12px;">		return;</span></div><div><span style="font-size: 12px;">	}</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">	ByteBuffer buffer = opRead.getBuffer();</span></div><div><span style="font-size: 12px;">	achannel.write(buffer);</span></div><div><span style="font-size: 12px;">}</span></div><div><span style="font-size: 12px; ">		</span></div><div><span style="font-size: 12px;">public void onWriteCompleted(OpWrite opWrite) throws Exception&nbsp;</span></div><div><span style="font-size: 12px;">{</span></div><div><span style="font-size: 12px;">// logically similar to onReadCompleted &nbsp; &nbsp; &nbsp; &nbsp; ... 
&nbsp; &nbsp;&nbsp;</span></div><div><span style="font-size: 12px;">}</span></div><div><span style="font-size: 12px;">};</span></div></pre><p><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">IOHandler</span></code><span style="font-size: 12pt; ">&nbsp;is a TProactor base class.&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">AsynchHandler</span></code><span style="font-size: 12pt; ">&nbsp;and Multiplexor, among other things, internally execute the wait strategy chosen by the developer.</span></p><h1>Conclusion</h1><p><span style="font-size: 12pt; ">TProactor provides a common, flexible, and configurable solution for multi-platform high- performance communications development. All of the problems and complexities mentioned in Appendix 2, are hidden from the developer.</span></p><p><span style="font-size: 12pt; ">It is clear from the charts that C++ is still the preferable approach for high performance communication solutions, but Java on Linux comes quite close. However, the overall Java performance was weakened by poor results on Windows. One reason for that may be that the Java 1.4 nio package is based on&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style API. 
Indeed, the Java NIO package is a kind of Reactor pattern based on a&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style API (see [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">7</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">8</span></a><span style="font-size: 12pt; ">]). Java NIO allows you to write your own&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style provider (the equivalent of TProactor waiting strategies). Examining the import symbols in jdk1.5.0\jre\bin\nio.dll shows that the Java NIO 1.4.2 and 1.5.0 implementations for Windows are based on the WSAEventSelect() API. That is better than&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">, but slower than IOCompletionPorts for a significant number of connections. Should Java's nio be based on IOCompletionPorts, that should improve performance. If Java NIO used IOCompletionPorts, the conversion from the Proactor pattern to the Reactor pattern would have to be made inside nio.dll. Although such a conversion is more complicated than the Reactor-&gt;Proactor conversion, it can be implemented within the Java NIO interfaces. 
(That is the topic of a future article, but we can provide the algorithm.) At this time, no TProactor performance tests have been done on JDK 1.5.</span></p><p><span style="font-size: 12pt; ">Note: all Java tests were performed on "raw" buffers (java.nio.ByteBuffer) without data processing.</span></p><p><span style="font-size: 12pt; ">Taking into account the latest efforts to develop robust AIO on Linux [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">9</span></a><span style="font-size: 12pt; ">], we can conclude that the Linux kernel API (the io_xxxx set of system calls) should be more scalable than the POSIX standard, though still not portable. In that case, a new TProactor Engine/WaitStrategy pair based on native Linux AIO can easily be implemented to overcome the portability issues and to wrap native Linux AIO in the standard ACE Proactor interface.</span></p><h1>Appendix I</h1><p><span style="font-size: 12pt; ">Engines and waiting strategies implemented in TProactor</span></p><p>&nbsp;</p><center><table border="1"><tbody><tr bgcolor="#CCCCFF"><th>Engine Type</th><th>Wait Strategies</th><th>Operating System</th></tr><tr></tr><tr valign="top"><td>POSIX_AIO (true async)<br /><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_suspend()<br />Waiting for RT signal<br />Callback function</code></td><td>POSIX-compliant UNIX (not robust)<br />POSIX (not robust)<br />SGI IRIX, LINUX (not robust)</td></tr><tr valign="top"><td>SUN_AIO (true async)<br /><code style="font-family: 'Lucida Console', 'American 
Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_wait()</code></td><td>SUN (not robust)</td></tr><tr valign="top"><td>Emulated Async<br />Non-blocking&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code><br /><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code><br />/dev/poll<br />Linux RT signals<br />Kqueue</td><td>generic POSIX<br />Almost all POSIX implementations<br />SUN<br />Linux<br />FreeBSD<br /></td></tr></tbody></table></center><h1>Appendix II</h1><p><span style="font-size: 12pt; ">All sync waiting strategies can be divided into two groups:</span></p><ul><li><span style="font-size: 12pt; ">edge-triggered (e.g. 
Linux RT signals)&#8212;signal readiness only when the socket becomes ready (changes state);</span></li><li><span style="font-size: 12pt; ">level-triggered (e.g.&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">,&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">poll()</span></code><span style="font-size: 12pt; ">, /dev/poll)&#8212;signal readiness at any time while the socket is ready.</span></li></ul><p><span style="font-size: 12pt; ">Let us describe some common logical problems for those groups:</span></p><ul><li><span style="font-size: 12pt; ">edge-triggered group: after executing an I/O operation, the demultiplexing loop can lose track of socket readiness. Example: the "read" handler did not read the whole chunk of data, so the socket remains ready for read, but the demultiplexor loop will not receive the next notification.</span></li><li><span style="font-size: 12pt; ">level-triggered group: when the demultiplexor loop detects readiness, it starts the user-defined read/write handler. Before doing so, however, it should remove the socket descriptor from the set of monitored descriptors; otherwise, the same event can be dispatched twice.</span></li><li><span style="font-size: 12pt; ">Obviously, solving these problems adds extra complexity to development. All of these problems are resolved internally within TProactor, so the developer need not worry about those details, whereas in the sync approach one must apply extra effort to resolve them.</span></li></ul><a name="resources"><h1>Resources</h1></a><p><span style="font-size: 12pt; ">[1] Douglas C. Schmidt, Stephen D. Huston, "C++ Network Programming," 2002, Addison-Wesley, ISBN 0-201-60464-7</span><br /></p><p><span style="font-size: 12pt; ">[2] W. 
Richard Stevens, "UNIX Network Programming," vol. 1 and 2, 1999, Prentice Hall, ISBN 0-13-490012-X&nbsp;</span><br /></p><p><span style="font-size: 12pt; ">[3] Douglas C. Schmidt, Michael Stal, Hans Rohnert, Frank Buschmann, "Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2," Wiley &amp; Sons, NY 2000</span><br /></p><p><span style="font-size: 12pt; ">[4] INFO: Socket Overlapped I/O Versus Blocking/Non-blocking Mode. Q181611. Microsoft Knowledge Base Articles.</span><br /></p><p><span style="font-size: 12pt; ">[5] Microsoft MSDN. I/O Completion Ports.</span><br /><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp</span></a></p><p><span style="font-size: 12pt; ">[6] TProactor (ACE compatible Proactor).</span><br /><a href="http://www.terabit.com.au" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">www.terabit.com.au</span></a></p><p><span style="font-size: 12pt; ">[7] JavaDoc java.nio.channels</span><br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html</span></a></p><p><span style="font-size: 12pt; ">[8] JavaDoc java.nio.channels.spi Class SelectorProvider&nbsp;</span><br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html</span></a></p><p><span style="font-size: 12pt; ">[9] Linux AIO development&nbsp;</span><br /><a 
href="http://lse.sourceforge.net/io/aio.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://lse.sourceforge.net/io/aio.html</span></a><span style="font-size: 12pt; ">, and</span><br /><a href="http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf</span></a></p><p><span style="font-size: 12pt; ">See Also:</span></p><p><span style="font-size: 12pt; ">Ian Barile, "I/O Multiplexing &amp; Scalable Socket Servers," February 2004, DDJ&nbsp;</span><br /></p><p><span style="font-size: 12pt; ">Further reading on event handling</span><br /><a href="http://www.cs.wustl.edu/~schmidt/ACE-papers.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://www.cs.wustl.edu/~schmidt/ACE-papers.html</span></a></p><p><span style="font-size: 12pt; ">The Adaptive Communication Environment</span><br /><a href="http://www.cs.wustl.edu/~schmidt/ACE.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://www.cs.wustl.edu/~schmidt/ACE.html</span></a></p><p><span style="font-size: 12pt; ">Terabit Solutions</span><br /><a href="http://terabit.com.au/solutions.php" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://terabit.com.au/solutions.php</span></a></p><h1>About the authors</h1><p><span style="font-size: 12pt; ">Alex Libman has been programming for 15 years. For the past 5 years, his main area of interest has been pattern-oriented multi-platform network programming using C++ and Java. 
He is a big fan of, and a contributor to, ACE.</span></p><p><span style="font-size: 12pt; ">Vlad Gilbourd works as a computer consultant, but wishes to spend more time listening to jazz :) As a hobby, he started and runs the&nbsp;</span><a href="http://www.corporatenews.com.au/" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">www.corporatenews.com.au</span></a><span style="font-size: 12pt; ">&nbsp;website.</span></p><br /><br />from:<br /><a href="http://www.artima.com/articles/io_design_patterns.html">http://www.artima.com/articles/io_design_patterns.html</a> <p>&nbsp;</p><img src ="http://www.cppblog.com/beautykingdom/aggbug/175576.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-05-21 11:24 <a href="http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html#Feedback" target="_blank" style="text-decoration:none;">Post Comment</a></div>]]></description></item><item><title>Comparing Two High-Performance I/O Design Patterns</title><link>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 08 Sep 2010 09:20:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/126175.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/126175.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/126175.html</trackback:ping><description><![CDATA[<div class="summary" style="padding-left: 39px; padding-right: 39px; padding-bottom: 1em; color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; "><div class="summarytitle" style="font-weight: normal; 
font-size: 1.25em; ">Summary</div>This article investigates and compares different design patterns for high-performance TCP-based servers. In addition to existing approaches, it proposes a scalable single-codebase, multi-platform solution (with code examples) and describes its fine-tuning on different platforms. It also compares the performance of Java, C# and C++ implementations of proposed and existing solutions.</div><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">System I/O can be blocking, or non-blocking synchronous, or non-blocking asynchronous [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">2</a>]. Blocking I/O means that the calling system does not return control to the caller until the operation is finished. As a result, the caller is blocked and cannot perform other activities during that time. Most importantly, the caller thread cannot be reused for other request processing while waiting for the I/O to complete, and becomes a wasted resource during that time. For example, if the socket buffer is empty, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in blocking mode will not return control until some data becomes available.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">By contrast, a non-blocking synchronous call returns control to the caller immediately. The caller is not made to wait, and the invoked system immediately returns one of two responses: if the call was executed and the results are ready, then the caller is told of that. 
Alternatively, the invoked system can tell the caller that it has no resources (no data in the socket) to perform the requested action. In that case, the caller may repeat the call until it succeeds. For example, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in non-blocking mode may return the number of bytes read, or a special return code of -1 with errno set to&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">EWOULDBLOCK/EAGAIN</code>, meaning "not ready; try again later."</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">In a non-blocking asynchronous call, the calling function returns control to the caller immediately, reporting that the requested action was started. The calling system will execute the caller's request using additional system resources/threads and will notify the caller (by callback, for example) when the result is ready for processing. For example, a Windows&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">ReadFile()</code>&nbsp;or POSIX&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>&nbsp;API returns immediately and initiates an internal system read operation. Of the three approaches, the non-blocking asynchronous approach offers the best scalability and performance.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">This article investigates different non-blocking I/O multiplexing mechanisms and proposes a single multi-platform design pattern/solution. We hope that this article will help developers of high-performance TCP-based servers choose an optimal design solution. 
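<p>To make the non-blocking synchronous case concrete, here is a minimal C++ sketch of the "try, and retry on EWOULDBLOCK/EAGAIN" read just described. It is demonstrated on a pipe, which follows the same readiness rules as a socket; the helper names are illustrative and not part of any framework discussed in this article.</p>

```cpp
#include <cerrno>    // errno, EAGAIN, EWOULDBLOCK
#include <cstddef>   // size_t
#include <fcntl.h>   // fcntl, O_NONBLOCK
#include <unistd.h>  // read, pipe, ssize_t

enum class ReadStatus { Data, WouldBlock, Eof, Error };

// Put a descriptor into non-blocking mode.
bool make_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// One non-blocking read attempt: control returns to the caller immediately,
// either with data or with "not ready; try again later".
ReadStatus try_read(int fd, char* buf, size_t len, ssize_t* got) {
    *got = read(fd, buf, len);
    if (*got > 0)  return ReadStatus::Data;
    if (*got == 0) return ReadStatus::Eof;          // peer closed the connection
    if (errno == EWOULDBLOCK || errno == EAGAIN)
        return ReadStatus::WouldBlock;              // caller may retry later
    return ReadStatus::Error;
}
```

<p>A caller normally does not spin on WouldBlock; it hands the descriptor to a readiness notifier such as select() or poll(), which is exactly the role the event demultiplexor plays in the patterns discussed next.</p>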
We also compare the performance of Java, C# and C++ implementations of proposed and existing solutions. We exclude the blocking approach from further discussion and comparison altogether, as it is the least effective approach in terms of scalability and performance.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; "><p>In general, I/O multiplexing mechanisms rely on an event demultiplexor [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">3</a>], an object that dispatches I/O events from a limited number of sources to the appropriate read/write event handlers. The developer registers interest in specific events and provides event handlers, or callbacks. The event demultiplexor delivers the requested events to the event handlers.</p><p>Two patterns that involve event demultiplexors are called Reactor and Proactor [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>]. The Reactor pattern involves synchronous I/O, whereas the Proactor pattern involves asynchronous I/O. In Reactor, the event demultiplexor waits for events that indicate when a file descriptor or socket is ready for a read or write operation. The demultiplexor passes this event to the appropriate handler, which is responsible for performing the actual read or write.</p><p>In the Proactor pattern, by contrast, the handler—or the event demultiplexor on behalf of the handler—initiates asynchronous read and write operations. The I/O operation itself is performed by the operating system (OS). The parameters passed to the OS include the addresses of user-defined data buffers from which the OS gets data to write, or to which the OS puts data read. 
The event demultiplexor waits for events that indicate the completion of the I/O operation, and forwards those events to the appropriate handlers. For example, on Windows a handler could initiate async I/O (overlapped, in Microsoft terminology) operations, and the event demultiplexor could wait for IOCompletion events [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>]. The implementation of this classic asynchronous pattern is based on an asynchronous OS-level API, and we will call this implementation the "system-level" or "true" async, because the application fully relies on the OS to execute the actual I/O.</p><p>An example will help you understand the difference between Reactor and Proactor. We will focus on the read operation here, as the write implementation is similar. Here's a read in Reactor:</p><ul><li>An event handler declares interest in I/O events that indicate readiness for read on a particular socket</li><li>The event demultiplexor waits for events</li><li>An event comes in and wakes up the demultiplexor, and the demultiplexor calls the appropriate handler</li><li>The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher</li></ul><p>By comparison, here is a read operation in Proactor (true async):</p><ul><li>A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). 
In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.</li><li>The event demultiplexor waits until the operation is completed</li><li>While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete</li><li>The event demultiplexor calls the appropriate handler</li><li>The event handler handles the data from the user-defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.</li></ul><h1 style="font-weight: normal; font-size: 28px; ">Current practice</h1><p>The open-source C++ development framework ACE [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">3</a>], developed by Douglas Schmidt et al., offers a wide range of platform-independent, low-level concurrency support classes (threading, mutexes, etc.). At the top level it provides two separate groups of classes: implementations of the ACE Reactor and ACE Proactor. Although both of them are based on platform-independent primitives, these tools offer different interfaces.</p><p>The ACE Proactor gives much better performance and robustness on MS-Windows, as Windows provides a very efficient async API based on operating-system-level support [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">4</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">5</a>].</p><p>Unfortunately, not all operating systems provide full robust async OS-level support. 
For instance, many Unix systems do not. Therefore, the ACE Reactor is the preferable solution on UNIX (which currently does not have robust async facilities for sockets). As a result, to achieve the best performance on each system, developers of networked applications need to maintain two separate code-bases: an ACE Proactor-based solution on Windows and an ACE Reactor-based solution for Unix-based systems.</p><p>As we mentioned, the true async Proactor pattern requires operating-system-level support. Due to the differing nature of event handler and operating-system interaction, it is difficult to create common, unified external interfaces for both the Reactor and Proactor patterns. That, in turn, makes it hard to create a fully portable development framework and encapsulate the interface and OS-related differences.</p><h1 style="font-weight: normal; font-size: 28px; ">Proposed solution</h1><p>In this section, we will propose a solution to the challenge of designing a portable framework for the Proactor and Reactor I/O patterns. To demonstrate this solution, we will transform a Reactor demultiplexor I/O solution into an emulated async I/O solution by moving the read/write operations from the event handlers into the demultiplexor (the "emulated async" approach). The following example illustrates that conversion for a read operation:</p><blockquote><ul><li>An event handler declares interest in I/O events (readiness for read) and provides the demultiplexor with information such as the address of a data buffer, or the number of bytes to read.</li><li>The dispatcher waits for events (for example, on&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>);</li><li>When an event arrives, it wakes up the dispatcher. 
The dispatcher performs a non-blocking read operation (it has all the necessary information to perform this operation) and on completion calls the appropriate handler.</li><li>The event handler handles the data from the user-defined buffer, declares renewed interest in I/O events, along with information about where to put the data buffer and the number of bytes to read. The event handler then returns control to the dispatcher.</li></ul></blockquote><p>As we can see, by adding functionality to the demultiplexor I/O pattern, we were able to convert the Reactor pattern into a Proactor pattern. In terms of the amount of work performed, this approach is exactly the same as the Reactor pattern; we have simply shifted responsibilities between actors. There is no performance degradation, because the amount of work performed is still the same; it is simply performed by different actors. The following lists of steps demonstrate that each approach performs an equal amount of work:</p><p>Standard/classic Reactor:</p><ul><li>Step 1) wait for event (Reactor job)</li><li>Step 2) dispatch "Ready-to-Read" event to user handler (Reactor job)</li><li>Step 3) read data (user handler job)</li><li>Step 4) process data (user handler job)</li></ul><p>Proposed emulated Proactor:</p><ul><li>Step 1) wait for event (Proactor job)</li><li>Step 2) read data (now Proactor job)</li><li>Step 3) dispatch "Read-Completed" event to user handler (Proactor job)</li><li>Step 4) process data (user handler job)</li></ul><p>With an operating system that does not provide an async I/O API, this approach allows us to hide the reactive nature of available socket APIs and to expose a fully proactive async interface. 
This allows us to create a fully portable, platform-independent solution with a common external interface.</p><p><h1 style="font-weight: normal; font-size: 28px; ">TProactor</h1><p>The proposed solution (TProactor) was developed and implemented at Terabit P/L [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">6</a>]. The solution has two alternative implementations, one in C++ and one in Java. The C++ version was built using ACE cross-platform low-level primitives and has a common unified async proactive interface on all platforms.</p><p>The main TProactor components are the Engine and WaitStrategy interfaces. The Engine manages the async operation lifecycle; the WaitStrategy manages concurrency strategies. The WaitStrategy depends on the Engine, and the two always work in pairs. Interfaces between the Engine and WaitStrategy are well defined.</p><p>Engines and waiting strategies are implemented as pluggable class-drivers (for the full list of all implemented Engines and corresponding WaitStrategies, see Appendix 1). TProactor is a highly configurable solution. It internally implements three engines (POSIX AIO, SUN AIO and Emulated AIO) and hides six different waiting strategies, based on an asynchronous kernel API (for POSIX, this is currently not efficient due to internal POSIX AIO API problems) and the synchronous Unix&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>,&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code>, /dev/poll (Solaris 5.8+),&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">port_get</code>&nbsp;(Solaris 5.10), RealTime (RT) signals (Linux 2.4+), epoll (Linux 2.6), and k-queue (FreeBSD) APIs. 
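<p>To illustrate the pluggable Engine/WaitStrategy pairing just described, here is a hedged C++ sketch of the design idea. The interfaces, class names, and the configuration string below are invented for illustration; they are not TProactor's actual classes or signatures.</p>

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// How a paired strategy waits for I/O events (select, epoll, aio_suspend, ...).
struct WaitStrategy {
    virtual ~WaitStrategy() = default;
    virtual std::string name() const = 0;
};

// Manages the async-operation lifecycle; always works with one paired strategy.
struct Engine {
    virtual ~Engine() = default;
    virtual std::string name() const = 0;
    std::unique_ptr<WaitStrategy> strategy;
};

struct SelectStrategy : WaitStrategy {
    std::string name() const override { return "select"; }
};
struct DevPollStrategy : WaitStrategy {
    std::string name() const override { return "/dev/poll"; }
};
struct EmulatedEngine : Engine {
    std::string name() const override { return "emulated"; }
};

// Run-time choice of the engine/strategy pair from a configuration value,
// mirroring the "lego-style" configurability described in the text.
std::unique_ptr<Engine> make_engine(const std::string& strategy_cfg) {
    auto engine = std::make_unique<EmulatedEngine>();
    if (strategy_cfg == "select")
        engine->strategy = std::make_unique<SelectStrategy>();
    else if (strategy_cfg == "/dev/poll")
        engine->strategy = std::make_unique<DevPollStrategy>();
    else
        throw std::runtime_error("unknown wait strategy: " + strategy_cfg);
    return engine;
}
```

<p>The point of the pairing is that user code only sees the Engine's proactive interface; swapping the WaitStrategy (say, select() for /dev/poll) is a configuration change, not a code change.</p>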
TProactor conforms to the standard ACE Proactor implementation interface. That makes it possible to develop a single cross-platform solution (POSIX/MS-WINDOWS) with a common (ACE Proactor) interface.</p><p>With a set of mutually interchangeable "lego-style" Engines and WaitStrategies, a developer can choose the appropriate internal mechanism (engine and waiting strategy) at run time by setting appropriate configuration parameters. These settings may be specified according to specific requirements, such as the number of connections, scalability, and the targeted OS. If the operating system supports an async API, a developer may use the true async approach; otherwise, the user can opt for an emulated async solution built on different sync waiting strategies. All of those strategies are hidden behind an emulated async fa&#231;ade.</p><p>For an HTTP server running on Sun Solaris, for example, the /dev/poll or&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">port_get()</code>-based engines are the most suitable choice, able to serve a huge number of connections, but for another UNIX solution with a limited number of connections but high throughput requirements, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-based engine may be a better approach. Such flexibility cannot be achieved with the standard ACE Reactor/Proactor, due to the inherent algorithmic problems of different wait strategies (see Appendix 2).</p><p>In terms of performance, our tests show that emulating proactive behavior on top of a reactive API does not impose any overhead: it can be faster, but not slower. According to our test results, TProactor gives on average up to 10-35% better performance (measured in terms of both throughput and response times) than the reactive model in the standard ACE Reactor implementation on various UNIX/Linux platforms. 
On Windows it gives the same performance as the standard ACE Proactor.</p><h1 style="font-weight: normal; font-size: 28px; ">Performance comparison (JAVA versus C++ versus C#)</h1><p>In addition to C++, we also implemented TProactor in Java. As of JDK version 1.4, Java provides only the sync-based approach that is logically similar to C&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>&nbsp;[<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">7</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">8</a>]. Java TProactor is based on Java's non-blocking facilities (the java.nio packages) and is logically similar to the C++ TProactor with a waiting strategy based on&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>.</p><p>Figures 1 and 2 chart the transfer rate in bits/sec versus the number of connections. These charts represent comparison results for a simple echo-server built on the standard ACE Reactor (on RedHat Linux 9.0); TProactor C++ and Java (IBM 1.4 JVM) on Microsoft Windows and RedHat Linux 9.0; and a C# echo-server running on the Windows operating system. Performance of the native AIO APIs is represented by the "Async"-marked curves; emulated AIO (TProactor) by the AsyncE curves; and TP_Reactor by the Synch curves. All implementations were bombarded by the same client application: a continuous stream of arbitrary fixed-size messages via N connections.</p><p>The full set of tests was performed on the same hardware. 
Tests on different machines proved that the relative results are consistent.</p><div class="figure" style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_1.gif"></div><div class="figurecaption" style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; ">Figure 1. Windows XP/P4 2.6GHz HyperThreading/512 MB RAM.</div><div class="figure" style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_2.gif"></div><div class="figurecaption" style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; ">Figure 2. Linux RedHat 2.4.20-smp/P4 2.6GHz HyperThreading/512 MB RAM.</div><h1 style="font-weight: normal; font-size: 28px; ">User code example</h1><p>The following is the skeleton of a simple TProactor-based Java echo-server. In a nutshell, the developer only has to implement two interfaces:&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">OpRead</code>&nbsp;with a buffer where TProactor puts its read results, and&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">OpWrite</code>&nbsp;with a buffer from which TProactor takes data. The developer will also need to implement protocol-specific logic by providing the&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">onReadCompleted()</code>&nbsp;and&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">onWriteCompleted()</code>&nbsp;callbacks in the&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">AsynchHandler</code>&nbsp;interface implementation. 
Those callbacks will be called asynchronously by TProactor on completion of read/write operations and executed on a thread pool provided by TProactor (the developer doesn't need to write his own pool).</p><pre class="indent" style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.88em; padding-left: 39px; ">class EchoServerProtocol implements AsynchHandler
{

    AsynchChannel achannel = null;
    ByteBuffer buffer = ByteBuffer.allocate( 4096 );   // read buffer used by start(); the size here is illustrative

    EchoServerProtocol( Demultiplexor m,  SelectableChannel channel ) throws Exception
    {
        this.achannel = new AsynchChannel( m, this, channel );
    }

    public void start() throws Exception
    {
        // called after construction
        System.out.println( Thread.currentThread().getName() + ": EchoServer protocol started" );
        achannel.read( buffer);
    }

    public void onReadCompleted( OpRead opRead ) throws Exception
    {
        if ( opRead.getError() != null )
        {
            // handle error, do clean-up if needed
            System.out.println( "EchoServer::readCompleted: " + opRead.getError().toString());
            achannel.close();
            return;
        }

        if ( opRead.getBytesCompleted() &lt;= 0)
        {
            System.out.println( "EchoServer::readCompleted: Peer closed " + opRead.getBytesCompleted());
            achannel.close();
            return;
        }

        ByteBuffer buffer = opRead.getBuffer();

        achannel.write(buffer);
    }

    public void onWriteCompleted(OpWrite opWrite) throws Exception
    {
        // logically similar to onReadCompleted
        ...
    }
}
</pre><p><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">IOHandler</code>&nbsp;is a TProactor base class.&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">AsynchHandler</code>&nbsp;and the Demultiplexor, among other things, internally execute the wait strategy chosen by the developer.</p><h1 style="font-weight: normal; font-size: 28px; ">Conclusion</h1><p>TProactor provides a common, flexible, and configurable solution for multi-platform, high-performance communications development. All of the problems and complexities mentioned in Appendix 2 are hidden from the developer.</p><p>It is clear from the charts that C++ is still the preferable approach for high-performance communication solutions, but Java on Linux comes quite close. However, overall Java performance was weakened by poor results on Windows. One reason for that may be that the Java 1.4 nio package is based on a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style API: the Java NIO package is a kind of Reactor pattern built on a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style API (see [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">7</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">8</a>]). Java NIO allows you to write your own&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style provider (the equivalent of TProactor's waiting strategies). 
Looking at the Java NIO implementation for Windows (it is enough to examine the import symbols in jdk1.5.0\jre\bin\nio.dll), we can conclude that Java NIO 1.4.2 and 1.5.0 for Windows are based on the WSAEventSelect() API. That is better than&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>, but slower than I/O completion ports for a significant number of connections. Basing Java's nio on I/O completion ports should therefore improve performance; doing so would require converting the Proactor pattern to the Reactor pattern inside nio.dll. Although such a conversion is more complicated than the Reactor-to-Proactor conversion, it can be implemented within the Java NIO interfaces (this is the topic of a future article, but we can provide the algorithm). At this time, no TProactor performance tests have been done on JDK 1.5.</p><p>Note: all tests for Java were performed on "raw" buffers (java.nio.ByteBuffer) without data processing.</p><p>Taking into account the latest efforts to develop robust AIO on Linux [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">9</a>], we can conclude that the Linux kernel API (the io_xxxx set of system calls) should be more scalable than the POSIX AIO standard, but it is still not portable. 
In that case, a TProactor with a new Engine/Wait Strategy pair based on native Linux AIO could easily be implemented to overcome the portability issues and to wrap Linux native AIO in the standard ACE Proactor interface.</p><h1 style="font-weight: normal; font-size: 28px; ">Appendix I</h1><p>Engines and waiting strategies implemented in TProactor</p><p>&#160;</p><center><table border="1"><tbody><tr bgcolor="#CCCCFF"><th>Engine Type</th><th>Wait Strategies</th><th>Operating System</th></tr><tr valign="top"><td>POSIX_AIO (true async)<br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_suspend()<br>Waiting for RT signal<br>Callback function</code></td><td>POSIX compliant UNIX (not robust)<br>POSIX (not robust)<br>SGI IRIX, LINUX (not robust)</td></tr><tr valign="top"><td>SUN_AIO (true async)<br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_wait()</code></td><td>SUN (not robust)</td></tr><tr valign="top"><td>Emulated Async<br>Non-blocking&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">write()</code></td><td><code 
style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code><br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code><br>/dev/poll<br>Linux RT signals<br>Kqueue</td><td>generic POSIX<br>Most POSIX implementations<br>SUN<br>Linux<br>FreeBSD<br></td></tr></tbody></table></center><h1 style="font-weight: normal; font-size: 28px; ">Appendix II</h1><p>All sync waiting strategies can be divided into two groups:</p><ul><li>edge-triggered (e.g. Linux RT signals): signals readiness only when the socket becomes ready (changes state);</li><li>level-triggered (e.g.&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>,&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code>, /dev/poll): signals readiness at any time while the socket is ready.</li></ul><p>Let us describe some common logical problems for these groups:</p><ul><li>edge-triggered group: after executing an I/O operation, the demultiplexing loop can lose the socket's readiness state. Example: the "read" handler did not read the whole chunk of data, so the socket remains ready for reading, but the demultiplexor loop will not receive the next notification.</li><li>level-triggered group: when the demultiplexor loop detects readiness, it starts the user-defined read/write handler. But before it starts the handler, it should remove the socket descriptor from the set of monitored descriptors; otherwise, the same event can be dispatched twice.</li><li>Obviously, solving these problems adds extra complexity to development. 
All of these problems are resolved internally within TProactor, so the developer need not worry about those details, whereas in the synchronous approach one must apply extra effort to resolve them.</li></ul><a name="resources"><h1 style="font-weight: normal; font-size: 28px; ">Resources</h1></a><p>[1] Douglas C. Schmidt, Stephen D. Huston, "C++ Network Programming," 2002, Addison-Wesley, ISBN 0-201-60464-7<br></p><p>[2] W. Richard Stevens, "UNIX Network Programming," vol. 1 and 2, 1999, Prentice Hall, ISBN 0-13-490012-X<br></p><p>[3] Douglas C. Schmidt, Michael Stal, Hans Rohnert, Frank Buschmann, "Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2," Wiley &amp; Sons, NY, 2000<br></p><p>[4] INFO: Socket Overlapped I/O Versus Blocking/Non-blocking Mode. Q181611. Microsoft Knowledge Base Articles.<br></p><p>[5] Microsoft MSDN. I/O Completion Ports.<br><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp" style="color: rgb(0, 51, 153); text-decoration: none; ">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp</a></p><p>[6] TProactor (ACE compatible Proactor).<br><a href="http://www.terabit.com.au/" style="color: rgb(0, 51, 153); text-decoration: none; ">www.terabit.com.au</a></p><p>[7] JavaDoc java.nio.channels<br><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html</a></p><p>[8] JavaDoc java.nio.channels.spi Class SelectorProvider<br><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html</a></p><p>[9] Linux AIO development&nbsp;<br><a 
href="http://lse.sourceforge.net/io/aio.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://lse.sourceforge.net/io/aio.html</a>, and<br><a href="http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf" style="color: rgb(0, 51, 153); text-decoration: none; ">http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf</a></p><p>See Also:</p><p>Ian Barile, "I/O Multiplexing &amp; Scalable Socket Servers," February 2004, DDJ<br></p><p>Further reading on event handling<br><a href="http://www.cs.wustl.edu/~schmidt/ACE-papers.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://www.cs.wustl.edu/~schmidt/ACE-papers.html</a></p><p>The Adaptive Communication Environment<br><a href="http://www.cs.wustl.edu/~schmidt/ACE.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://www.cs.wustl.edu/~schmidt/ACE.html</a></p><p>Terabit Solutions<br><a href="http://terabit.com.au/solutions.php" style="color: rgb(0, 51, 153); text-decoration: none; ">http://terabit.com.au/solutions.php</a></p><h1 style="font-weight: normal; font-size: 28px; ">About the authors</h1><p>Alex Libman has been programming for 15 years. For the past 5 years, his main area of interest has been pattern-oriented, multiplatform networked programming using C++ and Java. He is a big fan of, and contributor to, ACE.</p><p>Vlad Gilbourd works as a computer consultant, but wishes he could spend more time listening to jazz :) As a hobby, he started and runs the&nbsp;<a href="http://www.corporatenews.com.au/" style="color: rgb(0, 51, 153); text-decoration: none; ">www.corporatenews.com.au</a>&nbsp;website.</p><p>from:</p><p><a href="http://www.artima.com/articles/io_design_patterns.html">http://www.artima.com/articles/io_design_patterns.html</a></p>
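The select()-style Reactor that the conclusion attributes to Java NIO can be seen in miniature using only the standard java.nio classes. The sketch below is not from the original article; it is a minimal, level-triggered Selector loop (class name, port choice, and buffer sizes are illustrative) that accepts a single connection and echoes the client's bytes back:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.nio.charset.StandardCharsets;

public class MiniReactorEcho {
    /** Runs one Reactor-style select() loop for a single echo round trip
     *  and returns what the client read back. */
    static String roundTrip(String msg) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));   // ephemeral port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        byte[] payload = msg.getBytes(StandardCharsets.UTF_8);

        // A simple blocking client on another thread drives the demo.
        final StringBuilder reply = new StringBuilder();
        Thread client = new Thread(() -> {
            try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("127.0.0.1", port))) {
                ch.write(ByteBuffer.wrap(payload));
                ByteBuffer in = ByteBuffer.allocate(payload.length);
                while (in.hasRemaining() && ch.read(in) >= 0) { /* read until full */ }
                in.flip();
                reply.append(StandardCharsets.UTF_8.decode(in));
            } catch (IOException e) { throw new RuntimeException(e); }
        });
        client.start();

        int echoed = 0;
        while (echoed < payload.length) {
            selector.select();                        // level-triggered readiness wait
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {             // new connection: register it for reads
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {        // readiness event: we do the read ourselves
                    SocketChannel ch = (SocketChannel) key.channel();
                    ByteBuffer buf = ByteBuffer.allocate(64);
                    ch.read(buf);
                    buf.flip();
                    echoed += buf.remaining();
                    while (buf.hasRemaining()) ch.write(buf);  // echo; small writes, no backpressure handling
                }
            }
            selector.selectedKeys().clear();
        }
        client.join();
        server.close();
        selector.close();
        return reply.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("hello"));       // prints: hello
    }
}
```

Contrast this with the TProactor skeleton above: here the handler is invoked when the socket is *ready* and must perform the read/write itself, whereas a Proactor-style handler such as onReadCompleted() is invoked after the operation has already *completed*.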
<img src ="http://www.cppblog.com/beautykingdom/aggbug/126175.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-09-08 17:20 <a href="http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Linux下getsockopt/setsockopt 函数说明</title><link>http://www.cppblog.com/beautykingdom/archive/2010/09/06/125991.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 06 Sep 2010 03:17:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/09/06/125991.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/125991.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/09/06/125991.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/125991.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/125991.html</trackback:ping><description><![CDATA[<div id="xspace-showmessage" class="xspace-itemmessage" style="word-break: break-all; margin-right: auto; margin-left: auto; margin-top: 0.5em; margin-bottom: 0.5em; width: 658px; overflow-x: auto; overflow-y: hidden; line-height: 1.8em; font-family: Arial, Helvetica, sans-serif; font-size: 12px; "><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><a href="http://hi.chinaunix.net/?uid-21832962-action-viewspace-itemid-40517" target="_self" style="word-break: break-all; text-decoration: underline; color: rgb(101, 109, 119); line-height: normal !important; "><u style="word-break: break-all; line-height: normal !important; "><strong style="word-break: break-all; line-height: normal !important; 
"><br>Linux</strong></u></a> getsockopt/setsockopt function notes<br>
[getsockopt/setsockopt system calls]<br>
<br>
Description:<br>
Gets or sets the options associated with a socket. Options may exist at multiple protocol levels; they are always present at the uppermost (socket) level. When manipulating socket options, you must specify the level at which the option resides and the name of the option. To manipulate options at the socket level, specify the level as SOL_SOCKET. To manipulate options at any other level, supply the protocol number of the protocol that controls the option. For example, to indicate that an option is to be interpreted by the TCP protocol, set the level to the protocol number of TCP.</p><p>Usage:<br>
#include &lt;sys/types.h&gt;<br>
#include &lt;sys/socket.h&gt;</p><p>int getsockopt(int sock, int level, int optname, void *optval, socklen_t *optlen);</p><p>int setsockopt(int sock, int level, int optname, const void *optval, socklen_t optlen);</p><p>Parameters:<br>
sock: the socket whose option is to be set or retrieved.<br>
level: the protocol level at which the option resides.<br>
optname: the name of the option to access.<br>
optval: for getsockopt(), a buffer in which the option value is returned; for setsockopt(), a buffer containing the new option value.<br>
optlen: for getsockopt(), the maximum length of the option value on input and its actual length on output; for setsockopt(), the length of the option value.</p><p>Return value:<br>
Returns 0 on success. On failure, returns -1 and errno is set to one of the following:<br>
EBADF: sock is not a valid file descriptor<br>
EFAULT: optval points outside the process's valid address space<br>
EINVAL: optlen is invalid in a call to setsockopt()<br>
ENOPROTOOPT: the option is unknown at the specified protocol level<br>
ENOTSOCK: sock does not refer to a socket</p><p>Parameter details:</p><p>level specifies the level at which the socket is controlled and can take three values:<br>
1) SOL_SOCKET: generic socket options.<br>
2) IPPROTO_IP: IP options.<br>
3) IPPROTO_TCP: TCP options.<br>
optname specifies the option name; the common options are listed below.</p><p>optval gets or sets the socket option value; it is cast according to the data type of the named option.</p><p>Options at level SOL_SOCKET (option name: description, data type):<br>
SO_BROADCAST: permit sending broadcast data (int)<br>
SO_DEBUG: enable debugging (int)<br>
SO_DONTROUTE: bypass routing (int)<br>
SO_ERROR: get the pending socket error (int)<br>
SO_KEEPALIVE: keep the connection alive (int)<br>
SO_LINGER: linger on close (struct linger)<br>
SO_OOBINLINE: place out-of-band data in the normal data stream (int)<br>
SO_RCVBUF: receive buffer size (int)<br>
SO_SNDBUF: send buffer size (int)<br>
SO_RCVLOWAT: receive buffer low-water mark (int)<br>
SO_SNDLOWAT: send buffer low-water mark (int)<br>
SO_RCVTIMEO: receive timeout (struct timeval)<br>
SO_SNDTIMEO: send timeout (struct timeval)<br>
SO_REUSEADDR: allow reuse of the local address and port (int)<br>
SO_TYPE: get the socket type (int)<br>
SO_BSDCOMPAT: BSD compatibility (int)</p><p>Options at level IPPROTO_IP:<br>
IP_HDRINCL: IP header is included with the data (int)<br>
IP_OPTIONS: IP header options (int)<br>
IP_TOS: type of service (int)<br>
IP_TTL: time to live (int)</p><p>Options at level IPPROTO_TCP:<br>
TCP_MAXSEG: TCP maximum segment size (int)<br>
TCP_NODELAY: disable the Nagle algorithm (int)</p><p>SO_RCVBUF and SO_SNDBUF: every socket has a send buffer and a receive buffer, and these two options change the default buffer sizes.</p><p>// receive buffer<br>
int nRecvBuf=32*1024;        // set to 32K<br>
setsockopt(s,SOL_SOCKET,SO_RCVBUF,(const char*)&amp;nRecvBuf,sizeof(int));</p><p>// send buffer<br>
int nSendBuf=32*1024;        // set to 32K<br>
setsockopt(s,SOL_SOCKET,SO_SNDBUF,(const char*)&amp;nSendBuf,sizeof(int));</p><p>Note:</p><p>When setting the size of a TCP socket's receive buffer, the order of the function calls matters, because TCP's window scale option is exchanged with the peer in the SYN segments when the connection is established. For a client, SO_RCVBUF must be set before calling connect(); for a server, SO_RCVBUF must be set before calling listen().</p><p>Background:</p><p>1. Every socket has a send buffer and a receive buffer. The receive buffer is used by TCP and UDP to hold received data until it is read by the application. TCP: TCP advertises the window size to the other end. A TCP socket's receive buffer cannot overflow, because the peer is not allowed to send more data than the advertised window; this is TCP flow control, and if the peer ignores the window and sends beyond it, the receiving TCP discards the data. UDP: when a received datagram does not fit into the socket's receive buffer, it is discarded. UDP has no flow control; a fast sender can easily overwhelm a slow receiver, causing the receiver's UDP to drop datagrams.<br>
2. We often hear about TCP's three-way handshake, but what exactly is it, and why is it done this way?<br>
First: the client sends a connection request to the server, and the server receives it.<br>
Second: the server returns an acknowledgment to the client, together with a connection request from the server to the client; the client receives it, confirming the client-to-server connection.<br>
Third: the client returns an acknowledgment of the server's request; the server receives it, confirming the server-to-client connection.<br>
From this we can see:<br>
1. every TCP connection direction requires an acknowledgment;<br>
2. the client-to-server and server-to-client connections are established independently.<br>
Recall TCP's characteristics: connection-oriented, reliable, full-duplex. The three-way handshake is exactly what guarantees these properties.</p><p>3. Usage of setsockopt:</p><p>1. To continue reusing a socket after closesocket (which generally does not close immediately but goes through TIME_WAIT):<br>
BOOL bReuseaddr=TRUE;<br>
setsockopt(s,SOL_SOCKET,SO_REUSEADDR,(const char*)&amp;bReuseaddr,sizeof(BOOL));</p><p>2. To force a connected socket to close on closesocket without going through TIME_WAIT:<br>
BOOL bDontLinger = FALSE;<br>
setsockopt(s,SOL_SOCKET,SO_DONTLINGER,(const char*)&amp;bDontLinger,sizeof(BOOL));</p><p>3. When, due to network conditions, send() and recv() cannot be expected to complete promptly, set send and receive timeouts:<br>
int nNetTimeout=1000;  // 1 second<br>
// send timeout<br>
setsockopt(socket,SOL_SOCKET,SO_SNDTIMEO,(char *)&amp;nNetTimeout,sizeof(int));<br>
// receive timeout<br>
setsockopt(socket,SOL_SOCKET,SO_RCVTIMEO,(char *)&amp;nNetTimeout,sizeof(int));</p><p>4. send() returns the number of bytes actually sent (synchronous mode) or the number of bytes copied into the socket buffer (asynchronous mode); by default the system sends and receives 8688 bytes (about 8.5K) at a time. When the amounts of data sent and received are large, you can enlarge the socket buffers to avoid repeated looping over send() and recv():<br style="word-break: break-all; line-height: 
normal !important; ">// 接收缓冲区<br style="word-break: break-all; line-height: normal !important; ">int nRecvBuf=32*1024;//设置为32K<br style="word-break: break-all; line-height: normal !important; ">setsockopt(s,SOL_SOCKET,SO_RCVBUF,(const char*)&amp;nRecvBuf,sizeof(int));<br style="word-break: break-all; line-height: normal !important; ">//发送缓冲区<br style="word-break: break-all; line-height: normal !important; ">int nSendBuf=32*1024;//设置为32K<br style="word-break: break-all; line-height: normal !important; ">setsockopt(s,SOL_SOCKET,SO_SNDBUF,(const char*)&amp;nSendBuf,sizeof(int));</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><br style="word-break: break-all; line-height: normal !important; ">5. 如果在发送数据的时，希望不经历由<a href="http://hi.chinaunix.net/?uid-21832962-action-viewspace-itemid-40517" target="_self" style="word-break: break-all; text-decoration: underline; color: rgb(101, 109, 119); line-height: normal !important; "><u style="word-break: break-all; line-height: normal !important; "><strong style="word-break: break-all; line-height: normal !important; ">系统</strong></u></a>缓冲区到socket缓冲区的拷贝而影响<br style="word-break: break-all; line-height: normal !important; ">程序的性能：<br style="word-break: break-all; line-height: normal !important; ">int nZero=0;<br style="word-break: break-all; line-height: normal !important; ">setsockopt(socket，SOL_S0CKET,SO_SNDBUF，(char *)&amp;nZero,sizeof(nZero));</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><br style="word-break: break-all; line-height: normal !important; ">6.同上在recv()完成上述功能(默认情况是将socket缓冲区的内容拷贝到系统缓冲区)：<br style="word-break: break-all; line-height: normal !important; ">int nZero=0;<br style="word-break: break-all; line-height: normal !important; ">setsockopt(socket，SOL_S0CKET,SO_RCVBUF，(char *)&amp;nZero,sizeof(int));</p><p 
style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><br style="word-break: break-all; line-height: normal !important; ">7.一般在发送UDP数据报的时候，希望该socket发送的数据具有广播特性：<br style="word-break: break-all; line-height: normal !important; ">BOOL bBroadcast=TRUE;<br style="word-break: break-all; line-height: normal !important; ">setsockopt(s,SOL_SOCKET,SO_BROADCAST,(const char*)&amp;bBroadcast,sizeof(BOOL));</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><br style="word-break: break-all; line-height: normal !important; ">8.在client连接服务器过程中，如果处于非阻塞模式下的socket在connect()的过程中可以设置connect()延时,直到accpet()被呼叫(本函数设置只有在非阻塞的过程中有显著的作用，在阻塞的函数调用中作用不大)<br style="word-break: break-all; line-height: normal !important; ">BOOL bConditionalAccept=TRUE;<br style="word-break: break-all; line-height: normal !important; ">setsockopt(s,SOL_SOCKET,SO_CONDITIONAL_ACCEPT,(const char*)&amp;bConditionalAccept,sizeof(BOOL));</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; "><br style="word-break: break-all; line-height: normal !important; ">9.如果在发送数据的过程中(send()没有完成，还有数据没发送)而调用了closesocket(),以前我们一般采取的措施是"从容关闭"shutdown(s,SD_BOTH),但是数据是肯定丢失了，如何设置让程序满足具体应用的要求(即让没发完的数据发送出去后在关闭socket)？<br style="word-break: break-all; line-height: normal !important; ">struct linger {<br style="word-break: break-all; line-height: normal !important; ">u_short l_onoff;<br style="word-break: break-all; line-height: normal !important; ">u_short l_linger;<br style="word-break: break-all; line-height: normal !important; ">};<br style="word-break: break-all; line-height: normal !important; ">linger m_sLinger;<br style="word-break: break-all; line-height: normal !important; ">m_sLinger.l_onoff=1;//(在closesocket()调用,但是还有数据没发送完毕的时候容许逗留)<br style="word-break: break-all; 
line-height: normal !important; ">// 如果m_sLinger.l_onoff=0;则功能和2.)作用相同;<br style="word-break: break-all; line-height: normal !important; ">m_sLinger.l_linger=5;//(容许逗留的时间为5秒)<br style="word-break: break-all; line-height: normal !important; ">setsockopt(s,SOL_SOCKET,SO_LINGER,(const char*)&amp;m_sLinger,sizeof(linger));</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; ">&nbsp;</p><p style="word-break: break-all; line-height: 1.8em !important; margin-top: 10px; margin-right: 0px; margin-bottom: 10px; margin-left: 0px; ">转载出处：<a href="http://blog.csdn.net/chinafe/archive/2008/12/15/3517537.aspx" style="word-break: break-all; text-decoration: underline; color: rgb(101, 109, 119); line-height: normal !important; ">http://blog.csdn.net/chinafe/archive/2008/12/15/3517537.aspx</a><br style="word-break: break-all; line-height: normal !important; ">转载出处：<a href="http://blog.csdn.net/xioahw/archive/2009/04/08/4056514.aspx" style="word-break: break-all; text-decoration: underline; color: rgb(101, 109, 119); line-height: normal !important; ">http://blog.csdn.net/xioahw/archive/2009/04/08/4056514.aspx</a></p></div>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/125991.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-09-06 11:17 <a href="http://www.cppblog.com/beautykingdom/archive/2010/09/06/125991.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A Completion-Port-Based TCP Server Framework, with a Brief Look at IOCP</title><link>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 25 Aug 2010 12:42:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/124731.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/124731.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/124731.html</trackback:ping><description><![CDATA[<span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">If you do not post Overlapped I/O, an I/O Completion Port can only serve you as a queue.&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp; On CreateIoCompletionPort's NumberOfConcurrentThreads parameter:</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 
19px; ">1. It only takes effect when the second parameter, ExistingCompletionPort, is NULL; it is a cap on how many threads the port allows to run concurrently.</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">2. Has anyone set it to a value beyond the CPU count, not merely twice the number of CPUs, but the MAX_THREADS of 100 below, or even larger?</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">As for choosing this value, MSDN never says it has to be twice the CPU count, nor does it drag the cost of thread context switches into the discussion. The I/O Completion Ports page on MSDN says: "If your transaction required a</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>lengthy computation</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">, a</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>larger</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>concurrency value</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">will allow more threads to run. Each completion packet may take longer to finish,</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>but more completion packets will be processed at the same time</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">."</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp; As for struct OVERLAPPED, we usually extend it as follows:</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct {</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WSAOVERLAPPED</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>overlapped</strong></span><span  style="color: rgb(75, 75, 75); font-family: 
verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">; //</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><span style="line-height: 18px; color: rgb(255, 0, 0); ">must be first member</span>?</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;Yes, it must come first. If you are not sure, try it and see.</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; SOCKET client_s;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; SOCKADDR_IN client_addr;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WORD</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>optCode</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">;//</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
"><strong>1--read,2--send.</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; People often define this member, but some do not; the debate is over send/WSASend, and whether a synchronous/asynchronous distinction is really needed there.&nbsp;At any rate, the server below does not use it at all.</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; char buf[MAX_BUF_SIZE];</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WSABUF</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>wsaBuf</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">;//</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><span style="line-height: 18px; color: rgb(255, 0, 0); "><strong>inited ?</strong></span></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; Do not forget to initialize this!</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; DWORD numberOfBytesTransferred;</span><span  
style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; DWORD flags;&nbsp;&nbsp;&nbsp;</span><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">}QSSOverlapped;//<strong>for each connection<br></strong>The basic idea of the server framework below is:<br>one connection versus one thread in a worker thread pool; each worker thread runs completionWorkerRoutine.<br>An acceptor thread is dedicated to accepting sockets, associating them with the IOCP, and calling WSARecv, i.e. posting a receive completion packet to the IOCP.<br>completionWorkerRoutine has the following responsibilities:<br>1. Handle the request; when busy, increase the number of completion worker threads, but not beyond maxThreads, and post the next receive completion packet to the IOCP.<br>2. On timeout, check for idleness and the current number of completion worker threads; when idle, hold steady or shrink toward minThreads.<br>3. Manage the lifecycle of all accepted sockets; this relies on the system's keepalive probes. If you want an application-level "heartbeat" instead, just change QSS_SIO_KEEPALIVE_VALS_TIMEOUT back to the system default of 2 hours.<br><strong>Below, a brief look at IOCP alongside the source code</strong>:<br><strong>socketserver.h<br></strong>#ifndef __Q_SOCKET_SERVER__<br>#define __Q_SOCKET_SERVER__<br>#include &lt;winsock2.h&gt;<br>#include &lt;mstcpip.h&gt;<br>#define QSS_SIO_KEEPALIVE_VALS_TIMEOUT 30*60*1000<br>#define QSS_SIO_KEEPALIVE_VALS_INTERVAL 5*1000</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#define MAX_THREADS 100<br>#define MAX_THREADS_MIN&nbsp; 10<br>#define MIN_WORKER_WAIT_TIMEOUT&nbsp; 20*1000<br>#define MAX_WORKER_WAIT_TIMEOUT&nbsp; 60*MIN_WORKER_WAIT_TIMEOUT</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#define MAX_BUF_SIZE 1024<br><br>/* CSocketLifecycleCallback is called back when a socket is accepted and when the socket closes or fails */<br>typedef void (*CSocketLifecycleCallback)(SOCKET cs,int 
lifecycle);//lifecycle: 0 = OnAccepted, -1 = OnClose. Note: in OnClose the socket is not necessarily usable; it may already have been closed abnormally or hit some other error.<br><br>/* protocol-handling callback */<br>typedef int (*InternalProtocolHandler)(LPWSAOVERLAPPED overlapped);//return -1:SOCKET_ERROR</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct Q_SOCKET_SERVER SocketServer;<br>DWORD initializeSocketServer(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long workerWaitTimeout);<br>DWORD startSocketServer(SocketServer *ss);<br>DWORD shutdownSocketServer(SocketServer *ss);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#endif<br>&nbsp;<strong>qsocketserver.c&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; qss for short; the corresponding OVERLAPPED is abbreviated qssOl.<br></strong>#include "socketserver.h"<br>#include "stdio.h"<br>typedef struct {&nbsp;&nbsp;<br>&nbsp; WORD&nbsp;<strong>passive</strong>;//<strong>daemon</strong><br>&nbsp; WORD port;<br>&nbsp; WORD minThreads;<br>&nbsp; WORD maxThreads;<br>&nbsp; volatile long&nbsp;<strong>lifecycleStatus</strong>;//0-created,1-starting, 2-running,3-stopping,4-exitKeyPosted,5-stopped&nbsp;<br>&nbsp; long&nbsp; workerWaitTimeout;//wait timeout&nbsp;&nbsp;<br>&nbsp; CRITICAL_SECTION QSS_LOCK;<br>&nbsp; volatile long&nbsp;<strong>workerCounter</strong>;<br>&nbsp; volatile long&nbsp;<strong>currentBusyWorkers</strong>;<br>&nbsp; volatile long&nbsp;<strong>CSocketsCounter</strong>;//<strong>reference count of accepted sockets<br></strong>&nbsp; CSocketLifecycleCallback cslifecb;<br>&nbsp; InternalProtocolHandler protoHandler;<br>&nbsp; WORD wsaVersion;//=MAKEWORD(2,0);<br>&nbsp; WSADATA wsData;<br>&nbsp; SOCKET server_s;<br>&nbsp; SOCKADDR_IN serv_addr;<br>&nbsp; HANDLE iocpHandle;<br>}QSocketServer;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 
19px; ">typedef struct {<br>&nbsp; WSAOVERLAPPED overlapped;&nbsp;&nbsp;<br>&nbsp; SOCKET client_s;<br>&nbsp; SOCKADDR_IN client_addr;<br>&nbsp; WORD optCode;<br>&nbsp; char buf[MAX_BUF_SIZE];<br>&nbsp; WSABUF wsaBuf;<br>&nbsp; DWORD numberOfBytesTransferred;<br>&nbsp; DWORD flags;<br>}QSSOverlapped;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>acceptorRoutine</strong>(LPVOID);<br>DWORD&nbsp;&nbsp;<strong>completionWorkerRoutine</strong>(LPVOID);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static void adjustQSSWorkerLimits(QSocketServer *qss){<br>&nbsp;&nbsp;/*adjust size and timeout.*/<br>&nbsp;&nbsp;/*if(qss-&gt;maxThreads &lt;= 0) {<br>&nbsp;&nbsp;&nbsp;qss-&gt;maxThreads = MAX_THREADS;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } else if (qss-&gt;maxThreads &lt; MAX_THREADS_MIN) {&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;maxThreads = MAX_THREADS_MIN;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;minThreads &gt;&nbsp; qss-&gt;maxThreads) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads =&nbsp; qss-&gt;maxThreads;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;minThreads &lt;= 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(1 == qss-&gt;maxThreads) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads = 1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } else {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads = qss-&gt;maxThreads/2;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
}<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;workerWaitTimeout&lt;MIN_WORKER_WAIT_TIMEOUT)&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;workerWaitTimeout=MIN_WORKER_WAIT_TIMEOUT;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;workerWaitTimeout&gt;MAX_WORKER_WAIT_TIMEOUT)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;workerWaitTimeout=MAX_WORKER_WAIT_TIMEOUT;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct{<br>&nbsp;QSocketServer * qss;<br>&nbsp;HANDLE th;<br>}QSSWORKER_PARAM;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static WORD addQSSWorker(QSocketServer *qss,WORD addCounter){<br>&nbsp;WORD res=0;<br>&nbsp;if(qss-&gt;workerCounter&lt;qss-&gt;minThreads||(qss-&gt;currentBusyWorkers==qss-&gt;workerCounter&amp;&amp;qss-&gt;workerCounter&lt;qss-&gt;maxThreads)){<br>&nbsp;&nbsp;DWORD threadId;<br>&nbsp;&nbsp;QSSWORKER_PARAM * pParam=NULL;<br>&nbsp;&nbsp;int 
i=0;&nbsp;&nbsp;<br>&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(qss-&gt;workerCounter+addCounter&lt;=qss-&gt;maxThreads)<br>&nbsp;&nbsp;&nbsp;for(;i&lt;addCounter;i++)<br>&nbsp;&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;pParam=malloc(sizeof(QSSWORKER_PARAM));<br>&nbsp;&nbsp;&nbsp;&nbsp;if(pParam){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pParam-&gt;th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)completionWorkerRoutine,pParam,CREATE_SUSPENDED,&amp;threadId);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pParam-&gt;qss=qss;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ResumeThread(pParam-&gt;th);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;workerCounter++,res++;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;}&nbsp;&nbsp;<br>&nbsp;return res;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static void SOlogger(const char * msg,SOCKET s,int clearup){<br>&nbsp;perror(msg);<br>&nbsp;if(s&gt;0)<br>&nbsp;closesocket(s);<br>&nbsp;if(clearup)<br>&nbsp;WSACleanup();<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static int _InternalEchoProtocolHandler(LPWSAOVERLAPPED overlapped){<br>&nbsp;QSSOverlapped *qssOl=(QSSOverlapped *)overlapped;<br>&nbsp;<br>&nbsp;printf("numOfT:%d,WSARecvd:%s,\n",qssOl-&gt;numberOfBytesTransferred,qssOl-&gt;buf);<br>&nbsp;//Sleep(500);&nbsp;<br>&nbsp;return send(qssOl-&gt;client_s,qssOl-&gt;buf,qssOl-&gt;numberOfBytesTransferred,0);<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>initializeSocketServer</strong>(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long 
workerWaitTimeout){<br>&nbsp;QSocketServer * qss=malloc(sizeof(QSocketServer));<br>&nbsp;qss-&gt;passive=passive&gt;0?1:0;<br>&nbsp;qss-&gt;port=port;<br>&nbsp;qss-&gt;minThreads=minThreads;<br>&nbsp;qss-&gt;maxThreads=maxThreads;<br>&nbsp;qss-&gt;workerWaitTimeout=workerWaitTimeout;<br>&nbsp;qss-&gt;wsaVersion=MAKEWORD(2,0);&nbsp;<br>&nbsp;qss-&gt;lifecycleStatus=0;<br>&nbsp;InitializeCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;qss-&gt;workerCounter=0;<br>&nbsp;qss-&gt;currentBusyWorkers=0;<br>&nbsp;qss-&gt;CSocketsCounter=0;<br>&nbsp;qss-&gt;cslifecb=cslifecb,qss-&gt;protoHandler=protoHandler;<br>&nbsp;if(!qss-&gt;protoHandler)<br>&nbsp;&nbsp;qss-&gt;protoHandler=_InternalEchoProtocolHandler;&nbsp;<br>&nbsp;adjustQSSWorkerLimits(qss);<br>&nbsp;*ssp=(SocketServer *)qss;<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>startSocketServer</strong>(SocketServer *ss){&nbsp;<br>&nbsp;QSocketServer * qss=(QSocketServer *)ss;<br>&nbsp;if(qss==NULL||InterlockedCompareExchange(&amp;qss-&gt;lifecycleStatus,1,0))<br>&nbsp;&nbsp;return 0;&nbsp;<br>&nbsp;qss-&gt;serv_addr.sin_family=AF_INET;<br>&nbsp;qss-&gt;serv_addr.sin_port=htons(qss-&gt;port);<br>&nbsp;qss-&gt;serv_addr.sin_addr.s_addr=INADDR_ANY;//inet_addr("127.0.0.1");<br>&nbsp;if(WSAStartup(qss-&gt;wsaVersion,&amp;qss-&gt;wsData)){&nbsp;&nbsp;<br>&nbsp; /*<strong>A side note: when WSAStartup is called it actually starts an extra thread, which exits on its own a little later</strong>. One wonders what <strong>WSACleanup</strong> does, then...*/</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;SOlogger("WSAStartup failed.\n",0,0);<br>&nbsp;&nbsp;return 0;<br>&nbsp;}<br>&nbsp;qss-&gt;server_s=socket(AF_INET,SOCK_STREAM,IPPROTO_IP);<br>&nbsp;if(qss-&gt;server_s==INVALID_SOCKET){&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("socket failed.\n",0,1);<br>&nbsp;&nbsp;return 
0;<br>&nbsp;}<br>&nbsp;if(bind(qss-&gt;server_s,(LPSOCKADDR)&amp;qss-&gt;serv_addr,sizeof(SOCKADDR_IN))==SOCKET_ERROR){&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("bind failed.\n",qss-&gt;server_s,1);<br>&nbsp;&nbsp;return 0;<br>&nbsp;}<br>&nbsp;if(<strong>listen</strong>(qss-&gt;server_s,<strong>SOMAXCONN</strong>)==SOCKET_ERROR)/*A word about <strong>backlog</strong>: many people don't know what value to pick; I have seen 1, 5, 50 and 100, and some say larger values cost more resources. True, but passing SOMAXCONN here does not mean Windows will literally use SOMAXCONN; rather, "If set to SOMAXCONN, the underlying service provider responsible for socket&nbsp;<em>s</em>&nbsp;will set the backlog to a maximum&nbsp;<strong>reasonable</strong>&nbsp;value." Also, operating systems differ in how deep a TCP accept queue they support, so you might as well let the operating system decide the value. A server like Apache does:<br>#ifndef DEFAULT_LISTENBACKLOG<br>#define DEFAULT_LISTENBACKLOG 511<br>#endif<br>*/<br>&nbsp;&nbsp;&nbsp; {&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("listen failed.\n",qss-&gt;server_s,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return 0;<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;qss-&gt;iocpHandle=<strong>CreateIoCompletionPort</strong>(<strong>INVALID_HANDLE_VALUE</strong>,NULL,0,<strong>/*NumberOfConcurrentThreads--&gt;*/qss-&gt;maxThreads</strong>);<br>&nbsp;//initialize workers for the completion routine.<br>&nbsp;addQSSWorker(qss,qss-&gt;minThreads);&nbsp;&nbsp;<br>&nbsp;qss-&gt;lifecycleStatus=2;<br>&nbsp;{<br>&nbsp;&nbsp;QSSWORKER_PARAM * pParam=malloc(sizeof(QSSWORKER_PARAM));<br>&nbsp;&nbsp;pParam-&gt;qss=qss;<br>&nbsp;&nbsp;pParam-&gt;th=NULL;<br>&nbsp;&nbsp;if(qss-&gt;<strong>passive</strong>){<br>&nbsp;&nbsp;&nbsp;DWORD threadId;<br>&nbsp;&nbsp;&nbsp;pParam-&gt;th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)acceptorRoutine,pParam,0,&amp;threadId);&nbsp;<br>&nbsp;&nbsp;}else<br>&nbsp;&nbsp;&nbsp;return acceptorRoutine(pParam);<br>&nbsp;}<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>shutdownSocketServer</strong>(SocketServer 
*ss){<br>&nbsp;QSocketServer * qss=(QSocketServer *)ss;<br>&nbsp;if(qss==NULL||InterlockedCompareExchange(&amp;qss-&gt;lifecycleStatus,3,2)!=2)<br>&nbsp;&nbsp;return 0;&nbsp;<br>&nbsp;closesocket(qss-&gt;server_s/*<strong>listen-socket</strong>*/);//<strong>..other accepted sockets associated with the listening socket are not closed unless WSACleanup is called..</strong>&nbsp;<br>&nbsp;if(qss-&gt;CSocketsCounter==0)<br>&nbsp;&nbsp;qss-&gt;lifecycleStatus=4,PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);<br>&nbsp;WSACleanup();&nbsp;&nbsp;<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>acceptorRoutine</strong>(LPVOID ss){<br>&nbsp;QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;<br>&nbsp;QSocketServer * qss=pParam-&gt;qss;<br>&nbsp;HANDLE curThread=pParam-&gt;th;<br>&nbsp;QSSOverlapped *qssOl=NULL;<br>&nbsp;SOCKADDR_IN client_addr;<br>&nbsp;int client_addr_leng=sizeof(SOCKADDR_IN);<br>&nbsp;SOCKET cs;&nbsp;<br>&nbsp;free(pParam);<br>&nbsp;while(1){&nbsp;&nbsp;<br>&nbsp;&nbsp;printf("accept starting.....\n");<br>&nbsp;&nbsp;<strong>cs/*Accepted-socket*/</strong>=<strong>accept</strong>(qss-&gt;server_s,(LPSOCKADDR)&amp;client_addr,&amp;client_addr_leng);<br>&nbsp;&nbsp;if(cs==INVALID_SOCKET)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp;printf("accept failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{//<strong>SO_KEEPALIVE,SIO_KEEPALIVE_VALS</strong>&nbsp; here the system's keepalive probes serve as the "heartbeat"; on Linux the equivalents are setsockopt with SOL_TCP: TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct tcp_keepalive alive,aliveOut;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int 
so_keepalive_opt=1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DWORD outDW;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(!setsockopt(cs,SOL_SOCKET,SO_KEEPALIVE,(char *)&amp;so_keepalive_opt,sizeof(so_keepalive_opt))){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.onoff=TRUE;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.keepalivetime=QSS_SIO_KEEPALIVE_VALS_TIMEOUT;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.keepaliveinterval=QSS_SIO_KEEPALIVE_VALS_INTERVAL;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(WSAIoctl(cs,SIO_KEEPALIVE_VALS,&amp;alive,sizeof(alive),&amp;aliveOut,sizeof(aliveOut),&amp;outDW,NULL,NULL)==SOCKET_ERROR){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("WSAIoctl SIO_KEEPALIVE_VALS failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("setsockopt SO_KEEPALIVE failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;CreateIoCompletionPort((HANDLE)cs,qss-&gt;iocpHandle,cs,0);<br>&nbsp;&nbsp;if(qssOl==NULL){<br>&nbsp;&nbsp;&nbsp;qssOl=malloc(sizeof(QSSOverlapped));&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;qssOl-&gt;client_s=cs;<br>&nbsp;&nbsp;qssOl-&gt;wsaBuf.len=MAX_BUF_SIZE,qssOl-&gt;wsaBuf.buf=qssOl-&gt;buf,qssOl-&gt;numberOfBytesTransferred=0,qssOl-&gt;flags=0;//initialize WSABuf.<br>&nbsp;&nbsp;memset(&amp;qssOl-&gt;overlapped,0,sizeof(WSAOVERLAPPED));&nbsp;&nbsp;<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;DWORD lastErr=GetLastError();<br>&nbsp;&nbsp;&nbsp;int ret=0;<br>&nbsp;&nbsp;&nbsp;SetLastError(0);<br>&nbsp;&nbsp;&nbsp;ret=WSARecv(cs,&amp;qssOl-&gt;wsaBuf,1,&amp;qssOl-&gt;numberOfBytesTransferred,&amp;qssOl-&gt;flags,&amp;qssOl-&gt;overlapped,NULL);<br>&nbsp;&nbsp;&nbsp;if(ret==0||(ret==SOCKET_ERROR&amp;&amp;GetLastError()==WSA_IO_PENDING)){<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedIncrement(&amp;qss-&gt;<strong>CSocketsCounter</strong>);//<strong>increment the accepted-socket counter.</strong><br>&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;cslifecb)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;cslifecb(cs,0);<br>&nbsp;&nbsp;&nbsp;&nbsp;qssOl=NULL;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(!GetLastError())<br>&nbsp;&nbsp;&nbsp;&nbsp;SetLastError(lastErr);<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;printf("accept flags:%d ,cs:%d.\n",GetLastError(),cs);<br>&nbsp;}//end while.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;if(qssOl)<br>&nbsp;&nbsp;free(qssOl);<br>&nbsp;if(qss)<br>&nbsp;&nbsp;shutdownSocketServer((SocketServer *)qss);<br>&nbsp;if(curThread)<br>&nbsp;&nbsp;CloseHandle(curThread);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static int postRecvCompletionPacket(QSSOverlapped * qssOl,int SOErrOccurredCode){&nbsp;<br>&nbsp;int SOErrOccurred=0;&nbsp;<br>&nbsp;DWORD lastErr=GetLastError();<br>&nbsp;SetLastError(0);<br>&nbsp;//SOCKET_ERROR:-1,WSA_IO_PENDING:997<br>&nbsp;if(WSARecv(qssOl-&gt;client_s,&amp;qssOl-&gt;wsaBuf,1,&amp;qssOl-&gt;numberOfBytesTransferred,&amp;qssOl-&gt;flags,&amp;qssOl-&gt;overlapped,NULL)==SOCKET_ERROR<br>&nbsp;&nbsp;&amp;&amp;GetLastError()!=WSA_IO_PENDING)//this case lastError maybe 64, 10054&nbsp;<br>&nbsp;{<br>&nbsp;&nbsp;SOErrOccurred=SOErrOccurredCode;&nbsp;&nbsp;<br>&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;if(!GetLastError())<br>&nbsp;&nbsp;SetLastError(lastErr);&nbsp;<br>&nbsp;if(SOErrOccurred)<br>&nbsp;&nbsp;printf("worker[%d] postRecvCompletionPacket SOErrOccurred=%d,preErr:%d,postedErr:%d\n",GetCurrentThreadId(),SOErrOccurred,lastErr,GetLastError());<br>&nbsp;return SOErrOccurred;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>completionWorkerRoutine</strong>(LPVOID ss){<br>&nbsp;QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;<br>&nbsp;QSocketServer * qss=pParam-&gt;qss;<br>&nbsp;HANDLE curThread=pParam-&gt;th;<br>&nbsp;QSSOverlapped * qssOl=NULL;<br>&nbsp;DWORD numberOfBytesTransferred=0;<br>&nbsp;ULONG_PTR completionKey=0;<br>&nbsp;int postRes=0,handleCode=0,exitCode=0,SOErrOccurred=0;&nbsp;<br>&nbsp;free(pParam);<br>&nbsp;while(!exitCode){<br>&nbsp;&nbsp;SetLastError(0);<br>&nbsp;&nbsp;if(GetQueuedCompletionStatus(qss-&gt;iocpHandle,&amp;numberOfBytesTransferred,&amp;completionKey,(LPOVERLAPPED 
*)&amp;qssOl,qss-&gt;workerWaitTimeout)){<br>&nbsp;&nbsp;&nbsp;if(<strong>completionKey==-1</strong>&amp;&amp;qss-&gt;lifecycleStatus&gt;=4)<br>&nbsp;&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] completionKey -1:%d \n",GetCurrentThreadId(),GetLastError());<br>&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;workerCounter&gt;1)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);<br>&nbsp;&nbsp;&nbsp;&nbsp;exitCode=1;<br>&nbsp;&nbsp;&nbsp;&nbsp;<strong>break;<br></strong>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;if(numberOfBytesTransferred&gt;0){&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedIncrement(&amp;qss-&gt;currentBusyWorkers);<br>&nbsp;&nbsp;&nbsp;&nbsp;addQSSWorker(qss,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;handleCode=qss-&gt;protoHandler((LPWSAOVERLAPPED)qssOl);&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedDecrement(&amp;qss-&gt;currentBusyWorkers);&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;if(handleCode&gt;=0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=postRecvCompletionPacket(qssOl,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;}else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=2;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}else{<br>&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] numberOfBytesTransferred==0 ***** closesocket servS or cs *****,%d,%d ,ol is:%d\n",GetCurrentThreadId(),GetLastError(),completionKey,qssOl==NULL?0:1);<br>&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=3;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;}else{ //GetQueuedCompletionStatus rtn FALSE, lastError 64 ,<strong>995</strong>[<strong>timeout worker thread exit</strong>.] 
,WAIT_TIMEOUT:258&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(qssOl){<br>&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=postRecvCompletionPacket(qssOl,4);<br>&nbsp;&nbsp;&nbsp;}else {&nbsp;&nbsp;&nbsp;&nbsp;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] GetQueuedCompletionStatus F:%d \n",GetCurrentThreadId(),GetLastError());<br>&nbsp;&nbsp;&nbsp;&nbsp;if(GetLastError()!=WAIT_TIMEOUT){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exitCode=2;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;}else{//wait timeout&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;lifecycleStatus!=4&amp;&amp;qss-&gt;currentBusyWorkers==0&amp;&amp;qss-&gt;workerCounter&gt;qss-&gt;minThreads){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;lifecycleStatus!=4&amp;&amp;qss-&gt;currentBusyWorkers==0&amp;&amp;qss-&gt;workerCounter&gt;qss-&gt;minThreads){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;workerCounter--;//until qss-&gt;workerCounter decrease to qss-&gt;minThreads<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exitCode=3;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;}//end GetQueuedCompletionStatus.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;&nbsp;if(<strong>SOErrOccurred</strong>){&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(qss-&gt;cslifecb)<br>&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;cslifecb(qssOl-&gt;client_s,-1);<br>&nbsp;&nbsp;&nbsp;/*if(qssOl)*/{<br>&nbsp;&nbsp;&nbsp;&nbsp;closesocket(qssOl-&gt;client_s);<br>&nbsp;&nbsp;&nbsp;&nbsp;free(qssOl);<br>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;if(InterlockedDecrement(&amp;qss-&gt;<strong>CSocketsCounter</strong>)==0&amp;&amp;qss-&gt;lifecycleStatus&gt;=3){&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;//for qss workerSize,PostQueuedCompletionStatus -1<br>&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;lifecycleStatus=4,PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;exitCode=4;<br>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<strong>qssOl=NULL,numberOfBytesTransferred=0,completionKey=0,SOErrOccurred=0;//for net while.<br></strong>&nbsp;}//end while.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;//last to do&nbsp;<br>&nbsp;if(exitCode!=3){&nbsp;<br>&nbsp;&nbsp;int clearup=0;<br>&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(!--qss-&gt;workerCounter&amp;&amp;qss-&gt;lifecycleStatus&gt;=4){//clearup QSS<br>&nbsp;&nbsp;&nbsp;&nbsp;clearup=1;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(clearup){<br>&nbsp;&nbsp;&nbsp;DeleteCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;CloseHandle(qss-&gt;iocpHandle);<br>&nbsp;&nbsp;&nbsp;free(qss);&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;}<br>&nbsp;CloseHandle(curThread);<br>&nbsp;return 1;<br>}<br>------------------------------------------------------------------------------------------------------------------------<br>&nbsp; &nbsp; 对于IOCP的LastError的辨别和处理是个难点,所以请注意我的<strong>completionWorkerRoutine的while结构</strong>,<br>结构如下:<br>while(!exitCode){<br>&nbsp;&nbsp;&nbsp; 
if(<strong>completionKey==-1</strong>){...<strong>break</strong>;}<br>&nbsp;&nbsp;&nbsp; if(<strong>GetQueuedCompletionStatus</strong>){/*In this branch, as long as the OVERLAPPED you posted was not NULL, what you get back here is <strong>that very OVERLAPPED</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(numberOfBytesTransferred&gt;0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*handle the request here, and <strong>remember to post your OVERLAPPED again!</strong>&nbsp;*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*here the client or the server may have called closesocket(the socket), <strong>but the OVERLAPPED is still not NULL, as long as what you posted was not NULL!</strong>*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }else{/*In this branch, although GetQueuedCompletionStatus returned&nbsp;<strong>FALSE</strong>, that does not mean the OVERLAPPED must be NULL. <strong>Especially when the OVERLAPPED is not NULL, do not assume that a LastError means the current socket is useless or has hit a fatal error; with lastError 995, for example, the socket may still be perfectly healthy and usable, and you should not close it</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(OVERLAPPED is not NULL){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<strong>in this case, just post again regardless, and check for errors after posting</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{&nbsp;<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp; if(<strong>socket error occurred</strong>){<br><br>&nbsp; }<br>&nbsp; prepare for next while.<br>}&nbsp;<br></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>&nbsp;&nbsp;&nbsp; This was written in haste, so errors and omissions are inevitable; corrections and comments are very welcome. Thank you!<br></strong></p><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong></strong></span><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong>&nbsp;&nbsp;&nbsp; 
There is still room to improve this model's performance!</strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong><br></strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong>from:</strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong><a href="http://www.cppblog.com/adapterofcoms/archive/2010/06/26/118781.aspx">http://www.cppblog.com/adapterofcoms/archive/2010/06/26/118781.aspx</a></strong></strong></p>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/124731.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-08-25 20:42 <a href="http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>一个基于Event Poll(epoll)的TCP Server Framework,浅析epoll</title><link>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 25 Aug 2010 12:41:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/124730.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/124730.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/124730.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: epoll,event poll,on linux kernel 2.6.x.pthread,nptl-2.12&nbsp;&nbsp;&nbsp;LT/ET:ET也会多次发送event,当然频率远低于LT,但是epoll one shot才是真正的对"one connection&nbsp;VS one thread in worker thread pool,不依赖于任何connection-...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html'>阅读全文</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/124730.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-08-25 20:41 <a href="http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html#Feedback" target="_blank" 
style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>[转]close_wait状态和time_wait状态</title><link>http://www.cppblog.com/beautykingdom/archive/2010/07/17/120646.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sat, 17 Jul 2010 14:37:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/07/17/120646.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/120646.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/07/17/120646.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/120646.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/120646.html</trackback:ping><description><![CDATA[<p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">不久前，我的Socket Client程序遇到了一个非常尴尬的错误。它本来应该在一个socket长连接上持续不断地向服务器发送数据，如果socket连接断开，那么程序会自动不断地重试建立连接。<br style="font: normal normal normal 12px/normal song, Verdana; ">有一天发现程序在不断尝试建立连接，但是总是失败。用netstat查看，这个程序竟然有上千个socket连接处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态，以至于达到了上限，所以无法建立新的socket连接了。<br style="font: normal normal normal 12px/normal song, Verdana; ">为什么会这样呢？<br style="font: normal normal normal 12px/normal song, Verdana; ">它们为什么会都处在<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态呢？<br style="font: normal normal normal 12px/normal song, Verdana; "><strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态的生成原因<br style="font: normal normal normal 12px/normal song, Verdana; ">首先我们知道，如果我们的Client程序处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态的话，说明套接字是被动关闭的！<br style="font: normal normal normal 12px/normal song, Verdana; 
">因为如果是Server端主动断掉当前连接的话，那么双方关闭这个TCP连接共需要四个packet：<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server ---&gt; FIN ---&gt; Client<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server &lt;--- ACK &lt;--- Client<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp; 这时候Server端处于FIN_WAIT_2状态；而我们的程序处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态。<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server &lt;--- FIN &lt;--- Client<br style="font: normal normal normal 12px/normal song, Verdana; ">这时Client发送FIN给Server，Client就置为LAST_ACK状态。<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server ---&gt; ACK ---&gt; Client<br style="font: normal normal normal 12px/normal song, Verdana; ">Server回应了ACK，那么Client的套接字才会真正置为CLOSED状态。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3"><img alt="image" src="http://tech.ccidnet.com/pub/attachment/2004/8/322252.png" border="0"><br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">我们的程序处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态，而不是LAST_ACK状态，说明还没有发FIN给Server，那么可能是在关闭连接之前还有许多数据要发送或者其他事要做，导致没有发这个FIN packet。<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">原因知道了，那么为什么不发FIN包呢，难道会在关闭己方连接前有那么多事情要做吗？<br style="font: normal normal normal 12px/normal song, Verdana; ">还有一个问题，为什么有数千个连接都处于这个状态呢？难道那段时间内，服务器端总是主动拆除我们的连接吗？<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;<br style="font: normal 
normal normal 12px/normal song, Verdana; ">不管怎么样，我们必须防止类似情况再度发生！<br style="font: normal normal normal 12px/normal song, Verdana; ">首先，我们要防止不断开辟新的端口，这可以通过设置SO_REUSEADDR套接字选项做到：<br style="font: normal normal normal 12px/normal song, Verdana; ">重用本地地址和端口<br style="font: normal normal normal 12px/normal song, Verdana; ">以前我总是一个端口不行，就换一个新的使用，所以导致让数千个端口进入<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态。如果下次还发生这种尴尬状况，我希望加一个限定，只是当前这个端口处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态！<br style="font: normal normal normal 12px/normal song, Verdana; ">在调用<br style="font: normal normal normal 12px/normal song, Verdana; ">sockConnected = socket(AF_INET, SOCK_STREAM, 0);<br style="font: normal normal normal 12px/normal song, Verdana; ">之后，我们要设置该套接字的选项来重用：</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">/// 允许重用本地地址和端口:<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 这样的好处是，即使socket断了，调用前面的socket函数也不会占用另一个，而是始终就是一个端口<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 这样防止socket始终连接不上，那么按照原来的做法，会不断地换端口。<br style="font: normal normal normal 12px/normal song, Verdana; ">int nREUSEADDR = 1;<br style="font: normal normal normal 12px/normal song, Verdana; ">setsockopt(sockConnected,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SOL_SOCKET,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SO_REUSEADDR,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (const char*)&amp;nREUSEADDR,<br style="font: normal normal normal 12px/normal song, Verdana; 
">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sizeof(int));</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">教科书上是这么说的：这样，假如服务器关闭或者退出，造成本地地址和端口都处于TIME_WAIT状态，那么SO_REUSEADDR就显得非常有用。<br style="font: normal normal normal 12px/normal song, Verdana; ">也许我们无法避免被冻结在<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态永远不出现，但起码可以保证不会占用新的端口。<br style="font: normal normal normal 12px/normal song, Verdana; ">其次，我们要设置SO_LINGER套接字选项：<br style="font: normal normal normal 12px/normal song, Verdana; ">从容关闭还是强行关闭？<br style="font: normal normal normal 12px/normal song, Verdana; ">LINGER是&#8220;拖延&#8221;的意思。<br style="font: normal normal normal 12px/normal song, Verdana; ">默认情况下(Win2k)，SO_DONTLINGER套接字选项的是1；SO_LINGER选项是，linger为{l_onoff：0，l_linger：0}。<br style="font: normal normal normal 12px/normal song, Verdana; ">如果在发送数据的过程中(send()没有完成，还有数据没发送)而调用了closesocket()，以前我们一般采取的措施是&#8220;从容关闭&#8221;：<br style="font: normal normal normal 12px/normal song, Verdana; ">因为在退出服务或者每次重新建立socket之前，我都会先调用<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 先将双向的通讯关闭<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp; shutdown(sockConnected, SD_BOTH);<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp; /// 安全起见，每次建立Socket连接前，先把这个旧连接关闭<br style="font: normal normal normal 12px/normal song, Verdana; ">closesocket(sockConnected);<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">我们这次要这么做：<br style="font: normal normal normal 12px/normal song, Verdana; 
">设置SO_LINGER为零（亦即linger结构中的l_onoff域设为非零，但l_linger为0），便不用担心closesocket调用进入&#8220;锁定&#8221;状态（等待完成），不论是否有排队数据未发送或未被确认。这种关闭方式称为&#8220;强行关闭&#8221;，因为套接字的虚电路立即被复位，尚未发出的所有数据都会丢失。在远端的recv()调用都会失败，并返回WSAECONNRESET错误。<br style="font: normal normal normal 12px/normal song, Verdana; ">在connect成功建立连接之后设置该选项：<br style="font: normal normal normal 12px/normal song, Verdana; ">linger m_sLinger;<br style="font: normal normal normal 12px/normal song, Verdana; ">m_sLinger.l_onoff = 1;&nbsp; // (在closesocket()调用,但是还有数据没发送完毕的时候容许逗留)<br style="font: normal normal normal 12px/normal song, Verdana; ">m_sLinger.l_linger = 0; // (容许逗留的时间为0秒)<br style="font: normal normal normal 12px/normal song, Verdana; ">setsockopt(sockConnected,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SOL_SOCKET,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SO_LINGER,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (const char*)&amp;m_sLinger,<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sizeof(linger));</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3">总结<br style="font: normal normal normal 12px/normal song, Verdana; ">也许我们避免不了<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态冻结的再次出现，但我们会使影响降到最小，希望那个重用套接字选项能够使得下一次重新建立连接时可以把<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态踢掉。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">Feedback<br style="font: normal normal normal 
12px/normal song, Verdana; "># 回复：[Socket]尴尬的<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态以及应对策略 2005-01-30 3:41 PM yun.zheng&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">回复人： elssann(臭屁虫和他的开心果) ( ) 信誉：51 2005-01-30 14:00:00 得分: 0</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3">我的意思是：当一方关闭连接后，另外一方没有检测到，就导致了<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>的出现，上次我的一个朋友也是这样，他写了一个客户端和 APACHE连接，当APACHE把连接断掉后，他没检测到，出现了<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>，后来我叫他检测了这个地方，他添加了调用 closesocket的代码后，这个问题就消除了。&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">如果你在关闭连接前还是出现<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>,建议你取消shutdown的调用，直接两边closesocket试试。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3">另外一个问题:</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">比如这样的一个例子：&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">当客户端登录上服务器后，发送身份验证的请求，服务器收到了数据，对客户端身份进行验证，发现密码错误，这时候服务器的一般做法应该是先发送一个密码错误的信息给客户端，然后把连接断掉。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">如果把&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">m_sLinger.l_onoff = 1;&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">m_sLinger.l_linger = 0;&nbsp;<br 
style="font: normal normal normal 12px/normal song, Verdana; ">这样设置后，很多情况下，客户端根本就收不到密码错误的消息，连接就被断了。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3"># 回复：[Socket]尴尬的<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态以及应对策略 2005-01-30 3:41 PM yun.zheng&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">elssann(臭屁虫和他的开心果) ( ) 信誉：51 2005-01-30 13:24:00 得分: 0</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3">出现<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>的原因很简单，就是某一方在网络连接断开后，没有检测到这个错误，没有执行closesocket，导致了这个状态的实现，这在TCP/IP协议的状态变迁图上可以清楚看到。同时和这个相对应的还有一种叫TIME_WAIT的。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">另外，把SOCKET的SO_LINGER设置为0秒拖延（也就是立即关闭）在很多时候是有害处的。&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">还有，把端口设置为可复用是一种不安全的网络编程方法。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3"># 回复：[Socket]尴尬的<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态以及应对策略 2005-01-30 3:42 PM yun.zheng&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">elssann(臭屁虫和他的开心果) ( ) 信誉：51 2005-01-30 14:48:00 得分: 0</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font 
size="3">能不能解释请看这里&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; "></font><a href="http://blog.csdn.net/cqq/archive/2005/01/26/269160.aspx" style="text-decoration: underline; color: rgb(0, 68, 182); "><font color="#0000ff" size="3">http://blog.csdn.net/cqq/archive/2005/01/26/269160.aspx</font></a><font size="3"></font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">再看这个图：</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><a href="http://tech.ccidnet.com/pub/attachment/2004/8/322252.png" style="text-decoration: underline; color: rgb(0, 68, 182); "><font color="#0000ff" size="3">http://tech.ccidnet.com/pub/attachment/2004/8/322252.png</font></a><font size="3"></font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">断开连接的时候，&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">当发起主动关闭的左边这方发送一个FIN过去后，右边被动关闭的这方要回应一个ACK，这个ACK是TCP回应的，而不是应用程序发送的，此时，被动关闭的一方就处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态了。如果此时被动关闭的这一方不再继续调用closesocket,那么他就不会发送接下来的FIN，导致自己老是处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>。只有被动关闭的这一方调用了closesocket,才会发送一个FIN给主动关闭的这一方，同时也使得自己的状态变迁为LAST_ACK。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3"># 回复：[Socket]尴尬的<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态以及应对策略 2005-01-30 3:54 PM yun.zheng&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">elssann(臭屁虫和他的开心果) ( ) 信誉：51 
2005-01-30 15:39:00 得分: 0</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3">比如被动关闭的是客户端。。。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">当对方调用closesocket的时候，你的程序正在</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">int nRet = recv(s,....);&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">if (nRet == SOCKET_ERROR)&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">{&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">// closesocket(s);&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">return FALSE;&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">}</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">很多人就是忘记了那句closesocket，这种代码太常见了。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">我的理解，当主动关闭的一方发送FIN到被动关闭这边后，被动关闭这边的TCP马上回应一个ACK过去，同时向上面应用程序提交一个ERROR，导致上面的SOCKET的send或者recv返回SOCKET_ERROR，正常情况下，如果上面在返回SOCKET_ERROR后调用了 closesocket,那么被动关闭的者一方的TCP就会发送一个FIN过去，自己的状态就变迁到LAST_ACK.</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><br style="font: normal normal normal 12px/normal song, Verdana; "><font size="3"># 回复：[Socket]尴尬的<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态以及应对策略 2005-01-30 4:17 PM 
yun.zheng&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">int nRecvBufLength =&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">recv(sockConnected,&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">szRecvBuffer,&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">sizeof(szRecvBuffer),&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">0);&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// zhengyun 20050130:&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// elssann举例说，当对方调用closesocket的时候，我的程序正在&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// recv，这时候有可能对方发送的FIN包我没有收到，而是由TCP代回了&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 一个ACK包，所以我这边程序进入<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>状态。&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 所以他建议在这里判断是否已出错，是就主动closesocket。&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 因为前面我们已经设置了recv超时时间为30秒，那么如果真的是超时了，&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">/// 这里收到的错误应该是WSAETIMEDOUT，这种情况下也可以关闭连接的&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">if (nRecvBufLength == SOCKET_ERROR)&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">{&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">TRACE_INFO(_T("=用recv接收发生Socket错误="));&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">closesocket(sockConnected);&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">continue;&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">}</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font 
size="3">这样可以吗？</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">网络连接无法释放——&nbsp;<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong><br style="font: normal normal normal 12px/normal song, Verdana; ">关键字：TCP ，<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>, Java, SocketChannel</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3"></font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">问题描述：最近性能测试碰到的一个问题。客户端使用NIO，服务器还是一般的Socket连接。当测试进行一段时间以后，发现服务器端的系统出现大量未释放的网络连接。用netstat -na查看，连接状态为<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>。这就奇怪了，为什么Socket已经关闭而连接依然未释放。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">解决：Google了半天，发现关于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>的问题一般是C的，Java似乎碰到这个问题的不多（这有一篇不错的，也是解决<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>的，但是好像没有根本解决，而是选择了一个折中的办法）。接着找，由于使用了NIO，所以怀疑可能是这方面的问题，结果找到了这篇。顺着帖子翻下去，其中有几个人说到了一个问题—— 一端的Socket调用close后，另一端的Socket没有调用close.于是查了一下代码，果然发现Server端在某些异常情况时，没有关闭Socket。改正后问题解决。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">时间基本上花在Google上了，不过也学到不少东西。下面为一张TCP连接的状态转换图：</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">说明：虚线和实线分别对应服务器端(被连接端)和客户端端(主动连接端)。</font></p><p 
style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">结合上图使用netstat -na命令即可知道到当前的TCP连接状态。一般LISTEN、ESTABLISHED、TIME_WAIT是比较常见。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">分析：</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">上面我碰到的这个问题主要因为TCP的结束流程未走完，造成连接未释放。现设客户端主动断开连接，流程如下</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Client&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 消息&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Server</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; close()<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ------ FIN -------&gt;<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
FIN_WAIT1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong><br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;----- ACK -------<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; FIN_WAIT2&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; close()<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;------ FIN ------&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
TIME_WAIT&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; LAST_ACK&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ------ ACK -------&gt;&nbsp;&nbsp;<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CLOSED<br style="font: normal normal normal 12px/normal song, Verdana; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; CLOSED</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">如上图所示，由于Server的Socket在客户端已经关闭时而没有调用关闭，造成服务器端的连接处在&#8220;挂起&#8221;状态，而客户端则处在等待应答的状态上。此问题的典型特征是：一端处于FIN_WAIT2 ，而另一端处于<strong style="color: black; background-color: rgb(255, 255, 102); ">CLOSE_WAIT</strong>. 
不过，根本问题还是程序写的不好，有待提高。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">TIME_WAIT状态<br style="font: normal normal normal 12px/normal song, Verdana; ">根据TCP协议，主动发起关闭的一方，会进入TIME_WAIT状态，持续2*MSL(Max Segment Lifetime)，缺省为240秒，在这个post中简洁的介绍了为什么需要这个状态。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">值得一说的是，对于基于TCP的HTTP协议，关闭TCP连接的是Server端，这样，Server端会进入TIME_WAIT状态，可想而知，对于访问量大的Web Server，会存在大量的TIME_WAIT状态，假如server一秒钟接收1000个请求，那么就会积压240*1000=240，000个 TIME_WAIT的记录，维护这些状态给Server带来负担。当然现代操作系统都会用快速的查找算法来管理这些TIME_WAIT，所以对于新的 TCP连接请求，判断是否hit中一个TIME_WAIT不会太费时间，但是有这么多状态要维护总是不好。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">HTTP协议1.1版规定default行为是Keep-Alive，也就是会重用TCP连接传输多个 request/response，一个主要原因就是发现了这个问题。还有一个方法减缓TIME_WAIT压力就是把系统的2*MSL时间减少，因为 240秒的时间实在是忒长了点，对于Windows，修改注册表，在HKEY_LOCAL_MACHINE\ SYSTEM\CurrentControlSet\Services\ Tcpip\Parameters上添加一个DWORD类型的值TcpTimedWaitDelay，一般认为不要少于60，不然可能会有麻烦。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">对于大型的服务，一台server搞不定，需要一个LB(Load Balancer)把流量分配到若干后端服务器上，如果这个LB是以NAT方式工作的话，可能会带来问题。假如所有从LB到后端Server的IP包的 source address都是一样的(LB的对内地址），那么LB到后端Server的TCP连接会受限制，因为频繁的TCP连接建立和关闭，会在server上留下TIME_WAIT状态，而且这些状态对应的remote address都是LB的，LB的source port撑死也就60000多个(2^16=65536,1~1023是保留端口，还有一些其他端口缺省也不会用），每个LB上的端口一旦进入 Server的TIME_WAIT黑名单，就有240秒不能再用来建立和Server的连接，这样LB和Server最多也就能支持300个左右的连接。如果没有LB，不会有这个问题，因为这样server看到的remote address是internet上广阔无垠的集合，对每个address，60000多个port实在是够用了。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; 
font-size: 12px; "><font size="3">一开始我觉得用上LB会很大程度上限制TCP的连接数，但是实验表明没这回事，LB后面的一台Windows Server 2003每秒处理请求数照样达到了600个，难道TIME_WAIT状态没起作用？用Net Monitor和netstat观察后发现，Server和LB的XXXX端口之间的连接进入TIME_WAIT状态后，再来一个LB的XXXX端口的 SYN包，Server照样接收处理了，而不是想像的那样被drop掉了。翻书，从书堆里面找出覆满尘土的大学时代买的《UNIX Network Programming, Volume 1, Second Edition: Networking APIs: Sockets and XTI》，中间提到一句，对于BSD-derived实现，只要SYN的sequence number比上一次关闭时的最大sequence number还要大，那么TIME_WAIT状态一样接受这个SYN，难不成Windows也算BSD-derived?有了这点线索和关键字 (BSD)，找到这个post，在NT4.0的时候，还是和BSD-derived不一样的，不过Windows Server 2003已经是NT5.2了，也许有点差别了。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">做个试验，用Socket API编一个Client端，每次都Bind到本地一个端口比如2345，重复的建立TCP连接往一个Server发送Keep-Alive=false 的HTTP请求，Windows的实现让sequence number不断的增长，所以虽然Server对于Client的2345端口连接保持TIME_WAIT状态，但是总是能够接受新的请求，不会拒绝。那如果SYN的Sequence Number变小会怎么样呢？同样用Socket API，不过这次用Raw IP，发送一个小sequence number的SYN包过去，Net Monitor里面看到，这个SYN被Server接收后如泥牛入海，一点反应没有，被drop掉了。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">按照书上的说法，BSD-derived和Windows Server 2003的做法有安全隐患，不过至少这样不会出现TIME_WAIT阻止TCP请求的问题，当然，客户端要配合，保证不同TCP连接的sequence number要上涨不要下降。</font></p><p style="font: normal normal normal 12px/normal song, Verdana; border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">本文来自CSDN博客，转载请标明出处：</font><a href="http://blog.csdn.net/lionzl/archive/2009/03/20/4007206.aspx" style="text-decoration: underline; color: rgb(0, 68, 182); "><font color="#0000ff" size="3">http://blog.csdn.net/lionzl/archive/2009/03/20/4007206.aspx</font></a></p>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/120646.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-07-17 22:37 <a href="http://www.cppblog.com/beautykingdom/archive/2010/07/17/120646.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>TCP: SYN ACK FIN RST PSH URG 详解</title><link>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 16 Jul 2010 06:14:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/120546.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/120546.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/120546.html</trackback:ping><description><![CDATA[
<p class="cc-lisence" style="line-height: 180%;">
<a href="http://creativecommons.org/licenses/by/3.0/deed.zh" target="_blank">版权声明</a>：转载时请以超链接形式标明文章原始出处和作者信息及<a href="http://bangzhuzhongxin.blogbus.com/logs/11205960.html" target="_blank">本声明</a><br><a href="http://xufish.blogbus.com/logs/40536553.html">http://xufish.blogbus.com/logs/40536553.html</a><br><br>
</p>
<span style="font-family: Arial; font-size: 12px; line-height: normal; color: #333333;"><span style="color: #000000; font-family: Georgia; font-size: 14px; line-height: 20px;">
<p style="line-height: normal;"><strong style="line-height: normal;">TCP
的三次握手</strong>是怎么进行的呢：发送端发送一个SYN=1，ACK=0标志的数据包给接收端，请求进行连接，这是第一次握手；接收端收到请
求并且允许连接的话，就会发送一个SYN=1，ACK=1标志的数据包给发送端，告诉它，可以通讯了，并且让发送端发送一个确认数据包，这是第二次握手；
最后，发送端发送一个SYN=0，ACK=1的数据包给接收端，告诉它连接已被确认，这就是第三次握手。之后，一个TCP连接建立，开始通讯。</p>
<p style="line-height: normal;">*SYN：同步标志<br style="line-height: normal;">同步序列编号(Synchronize Sequence
Numbers)栏有效。该标志仅在三次握手建立TCP连接时有效。它提示TCP连接的服务端检查序列编号，该序列编号为TCP连接初始端(一般是客户
端)的初始序列编号。在这里，可以把TCP序列编号看作是一个范围从0到4，294，967，295的32位计数器。通过TCP连接交换的数据中每一个字
节都经过序列编号。在TCP报头中的序列编号栏包括了TCP分段中第一个字节的序列编号。</p>
<p style="line-height: normal;">*ACK：确认标志<br style="line-height: normal;">确认编号(Acknowledgement
Number)栏有效。大多数情况下该标志位是置位的。TCP报头内的确认编号栏内包含的确认编号(w+1，Figure-1)为下一个预期的序列编号，
同时提示远端系统已经成功接收所有数据。</p>
<p style="line-height: normal;">*RST：复位标志<br style="line-height: normal;">复位标志有效。用于复位相应的TCP连接。</p>
<p style="line-height: normal;">*URG：紧急标志<br style="line-height: normal;">紧急指针(The urgent pointer)栏有效。紧急标志置位时，表示报文中含有需要接收端优先处理的紧急数据，紧急指针指出紧急数据在分段中的结束位置。</p>
<p style="line-height: normal;">*PSH：推标志<br style="line-height: normal;">该
标志置位时，接收端不将该数据进行队列处理，而是尽可能快将数据转由应用处理。在处理 telnet 或 rlogin
等交互模式的连接时，该标志总是置位的。</p>
<p style="line-height: normal;">*FIN：结束标志<br style="line-height: normal;">带有该标志置位的数据包用来结束一个TCP会话，但对应端口仍处于开放状态，准备接收后续数据。</p>
<p style="line-height: normal;">=============================================================<br></p>
</span>三次握手Three-way Handshake<br style="line-height: normal;"><br style="line-height: normal;">一个虚拟连接的建立是通过三次握手来实现的<br style="line-height: normal;"><br style="line-height: normal;">1. (B) --&gt; [SYN] --&gt;
(A)<br style="line-height: normal;"><br style="line-height: normal;">假如服
务器A和客户机B通讯. 当A要和B通信时，B首先向A发一个SYN (Synchronize) 标记的包，告诉A请求建立连接.<br style="line-height: normal;"><br style="line-height: normal;">注意: 一个
SYN包就是仅SYN标记设为1的TCP包(参见TCP包头Resources).
认识到这点很重要，只有当A收到B发来的SYN包，才可建立连接，除此之外别无他法。因此，如果你的防火墙丢弃所有的发往外网接口的SYN包，那么你将不
能让外部任何主机主动建立连接。<br style="line-height: normal;"><br style="line-height: normal;">2. (B) &lt;-- [SYN/ACK] &lt;--(A)<br style="line-height: normal;"><br style="line-height: normal;">接着，A收到后会发一个对SYN包的确认包(SYN/ACK)回
去，表示对第一个SYN包的确认，并继续握手操作.<br style="line-height: normal;"><br style="line-height: normal;">注意: SYN/ACK包是仅SYN 和 ACK 标记为1的包.<br style="line-height: normal;"><br style="line-height: normal;">3. (B)
--&gt; [ACK] --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">B收到SYN/ACK
包,B发一个确认包(ACK)，通知A连接已建立。至此，三次握手完成，一个TCP连接建立完成<br style="line-height: normal;"><br style="line-height: normal;">Note: ACK包就是仅ACK 标记设为1的TCP包.
需要注意的是当三次握手完成、连接建立以后，TCP连接的每个包都会设置ACK位<br style="line-height: normal;"><br style="line-height: normal;">这就是为何连接跟踪很重要的原因了.
没有连接跟踪,防火墙将无法判断收到的ACK包是否属于一个已经建立的连接.一般的包过滤(Ipchains)收到ACK包时,会让它通过(这绝对不是个
好主意). 而当状态型防火墙收到此种包时，它会先在连接表中查找是否属于哪个已建连接，否则丢弃该包<br style="line-height: normal;"><br style="line-height: normal;">四次握手Four-way Handshake<br style="line-height: normal;"><br style="line-height: normal;">四次握手用来关闭已建
立的TCP连接<br style="line-height: normal;"><br style="line-height: normal;">1.
(B) --&gt; ACK/FIN --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">2. (B) &lt;-- ACK &lt;-- (A)<br style="line-height: normal;"><br style="line-height: normal;">3. (B)
&lt;-- ACK/FIN &lt;-- (A)<br style="line-height: normal;"><br style="line-height: normal;">4. (B) --&gt; ACK --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">注意:
由于TCP连接是双向连接, 因此关闭连接需要在两个方向上做。ACK/FIN 包(ACK 和FIN
标记设为1)通常被认为是FIN(终结)包.然而, 由于连接还没有关闭, FIN包总是打上ACK标记.
没有ACK标记而仅有FIN标记的包不是合法的包，并且通常被认为是恶意的<br style="line-height: normal;"><br style="line-height: normal;">连接复位Resetting a connection<br style="line-height: normal;"><br style="line-height: normal;">四次握手不是关闭
TCP连接的唯一方法. 有时,如果主机需要尽快关闭连接(或连接超时,端口或主机不可达),RST (Reset)包将被发送.
注意，由于RST包不是TCP连接中的必须部分, 可以只发送RST包(即不带ACK标记). 但在正常的TCP连接中RST包可以带ACK确认标记<br style="line-height: normal;"><br style="line-height: normal;">请注意，RST包是不需要接收方确认的。<br style="line-height: normal;"><br style="line-height: normal;">无效的TCP标记Invalid TCP Flags<br style="line-height: normal;"><br style="line-height: normal;">到目前为止，你已经看到了 SYN, ACK, FIN, 和RST 标记.
另外，还有PSH (Push) 和URG (Urgent)标记.<br style="line-height: normal;"><br style="line-height: normal;">最常见的非法组合是SYN/FIN 包. 注意:由于 SYN包是用来初始化连接的,
它不可能和 FIN和RST标记一起出现. 这也是一个恶意攻击.<br style="line-height: normal;"><br style="line-height: normal;">由于现在大多数防火墙已能识别 SYN/FIN 包,
攻击者转而使用别的一些组合,例如SYN/FIN/PSH, SYN/FIN/RST,
SYN/FIN/RST/PSH。很明显，当网络中出现这种包时，你的网络肯定受到攻击了。<br style="line-height: normal;"><br style="line-height: normal;">别的已知的非法包有FIN
(无ACK标记)和"NULL"包。如同早先讨论的，由于ACK/FIN包的出现是为了关闭一个TCP连接，那么正常的FIN包总是带有 ACK
标记。"NULL"包就是没有任何TCP标记的包(URG,ACK,PSH,RST,SYN,FIN都为0)。<br style="line-height: normal;"><br style="line-height: normal;">到目前为止，正常的网
络活动下，TCP协议栈不可能产生带有上面提到的任何一种标记组合的TCP包。当你发现这些不正常的包时，肯定有人对你的网络不怀好意。<br style="line-height: normal;"><br style="line-height: normal;">UDP
(用户数据包协议User Datagram Protocol)<br style="line-height: normal;">TCP是面向连接
的，而UDP是非连接的协议。UDP没有对接收进行确认的标记和确认机制。对丢包(或包的意外到达)的处理是在应用层来完成的。<br style="line-height: normal;"><br style="line-height: normal;">此处需要重点注意的事情是：在正常情况下，当UDP包到达一个关闭的端口时，会返回一个ICMP端口不可达(port unreachable)消息。由于UDP是非面向连接的,
因此没有任何确认信息来确认包是否正确到达目的地。因此如果你的防火墙丢弃UDP包，在端口扫描者看来所有的UDP端口都是开放的。<br style="line-height: normal;"><br style="line-height: normal;">由于Internet
上正常情况下一些包将被丢弃，甚至某些发往已关闭端口(非防火墙的)的UDP包将不会到达目的，也就不会返回ICMP端口不可达消息。<br style="line-height: normal;"><br style="line-height: normal;">因为这个原因，UDP
端口扫描总是不精确、不可靠的。<br style="line-height: normal;"><br style="line-height: normal;">看起来大UDP包的碎片是常见的DOS (Denial of Service)攻击的常见形式 (这里有个DOS攻击的例子，<a href="http://grc.com/dos/grcdos.htm" style="line-height: normal;" target="_blank">http://grc.com/dos/grcdos.htm</a>&nbsp;).<br style="line-height: normal;"><br style="line-height: normal;">ICMP
(网间控制消息协议Internet Control Message Protocol)<br style="line-height: normal;">如同名字一样， ICMP用来在主机/路由器之间传递控制信息的协议。 ICMP包可以包含诊断信息(ping,
traceroute - 注意目前unix系统中的traceroute用UDP包而不是ICMP)，错误信息(网络/主机/端口 不可达
network/host/port unreachable), 信息(时间戳timestamp, 地址掩码address mask
request, etc.)，或控制信息 (source quench, redirect, etc.) 。<br style="line-height: normal;"><br style="line-height: normal;">你可以在<a href="http://www.iana.org/assignments/icmp-parameters" style="line-height: normal;" target="_blank">http://www.iana.org/assignments/icmp-parameters</a>中
找到ICMP包的类型。<br style="line-height: normal;"><br style="line-height: normal;">尽管ICMP通常是无害的，还是有些类型的ICMP信息需要丢弃。<br style="line-height: normal;"><br style="line-height: normal;">Redirect (5), Alternate Host Address (6),
Router Advertisement (9) 能用来转发通讯。<br style="line-height: normal;"><br style="line-height: normal;">Echo (8), Timestamp (13) and Address Mask
Request (17)
能用来分别判断主机是否起来，本地时间和地址掩码。注意它们是和返回的信息类别有关的。它们自己本身是不能被利用的，但它们泄露出的信息对攻击者是有用
的。<br style="line-height: normal;"><br style="line-height: normal;">ICMP
消息有时也被用来作为DOS攻击的一部分(例如：洪水ping flood ping、死亡之ping ping of
death)。<br style="line-height: normal;"><br style="line-height: normal;">包碎片注意A Note About Packet Fragmentation<br style="line-height: normal;"><br style="line-height: normal;">如果一个包的大小超过了TCP的最大段长度MSS
(Maximum Segment Size) 或MTU (Maximum Transmission
Unit)，能够把此包发往目的的唯一方法是把此包分片。由于包分片是正常的，它可以被利用来做恶意的攻击。<br style="line-height: normal;"><br style="line-height: normal;">因为分片的包的第一个
分片包含一个包头，若没有包分片的重组功能，包过滤器不可能检测附加的包分片。典型的攻击是利用分片重叠(overlapping)：第一个分片的包头是正常的，随后的分片将其覆盖为不同的目的IP(或端口)，从而绕过防火墙规则。包分片也能作为 DOS 攻击的一部分，它可以使较老的IP协议栈崩溃，或者耗尽CPU的连接处理能力。<br style="line-height: normal;"><br style="line-height: normal;">Netfilter/Iptables中的连接跟踪代码能自动做分片重组。它仍有弱点，可能
受到饱和连接攻击，可以把CPU资源耗光。<br style="line-height: normal;"><br style="line-height: normal;">握手阶段：<br style="line-height: normal;">序号 方向
seq ack<br style="line-height: normal;">1　　A-&gt;B 10000 0<br style="line-height: normal;">2 B-&gt;A 20000 10000+1=10001<br style="line-height: normal;">3 A-&gt;B 10001 20000+1=20001<br style="line-height: normal;">解释：<br style="line-height: normal;">1：A向B发起
连接请求，以一个随机数初始化A的seq,这里假设为10000，此时ACK＝0<br style="line-height: normal;"><br style="line-height: normal;">2：B收到A的连接请求后，也以一个随机数初始化B的seq，这里假设为20000，意思
是：你的请求我已收到，我这方的数据流就从这个数开始。B的ACK是A的seq加1，即10000＋1＝10001<br style="line-height: normal;"><br style="line-height: normal;">3：A收到B的回复
后，它的seq是它的上个请求的seq加1，即10000＋1＝10001，意思也是：你的回复我收到了，我这方的数据流就从这个数开始。A此时的ACK
是B的seq加1，即20000+1=20001<br style="line-height: normal;"><br style="line-height: normal;"><br style="line-height: normal;">数据传输阶段：<br style="line-height: normal;">序号　　方向　　　　　　seq ack size<br style="line-height: normal;">23 A-&gt;B 40000 70000 1514<br style="line-height: normal;">24 B-&gt;A 70000 40000+1514-54=41460 54<br style="line-height: normal;">25 A-&gt;B 41460 70000+54-54=70000 1514<br style="line-height: normal;">26 B-&gt;A 70000 41460+1514-54=42920 54<br style="line-height: normal;">解释：<br style="line-height: normal;">23:B接收到
A发来的seq=40000,ack=70000,size=1514的数据包<br style="line-height: normal;">24:
于是B向A也发一个数据包，告诉A，你的上个包我收到了。B的seq就以它收到的数据包的ACK填充，ACK是它收到的数据包的SEQ加上数据包的大小
(不包括以太网协议头，IP头，TCP头)，以证实A发过来的数据全收到了。<br style="line-height: normal;">25:A
在收到B发过来的ack为41460的数据包时，一看到41460，正好是它的上个数据包的seq加上包的大小，就明白，上次发送的数据包已安全到达。于
是它再发一个数据包给B。这个正在发送的数据包的seq也以它收到的数据包的ACK填充，ACK就以它收到的数据包的seq(70000)加上包的
size(54)填充,即ack=70000+54-54(全是头长，没数据项)。<br style="line-height: normal;"><br style="line-height: normal;">其实在握手和结束时确认号应该是对方序列号加1,传输数据时则是对方序列号加上对方携带应
用层数据的长度.如果从以太网包返回来计算所加的长度,就嫌走弯路了.<br style="line-height: normal;">另外,如果对
方没有数据过来,则自己的确认号不变,序列号为上次的序列号加上本次应用层数据发送长度</span><img src ="http://www.cppblog.com/beautykingdom/aggbug/120546.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-07-16 14:14 <a href="http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>epoll 精髓</title><link>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114627.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 06 May 2010 07:12:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114627.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/114627.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114627.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/114627.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/114627.html</trackback:ping><description><![CDATA[在linux的网络编程中，很长的时间都在使用select来做事件触发。在linux新的内核中，有了一种替换它的机制，就是epoll。<br>相比
于select，epoll最大的好处在于它不会随着监听fd数目的增长而降低效率。因为在内核中的select实现中，它是采用轮询来处理的，轮询的
fd数目越多，自然耗时越多。并且，在linux/posix_types.h头文件有这样的声明：<br><span style="color: #ff0102;">#define __FD_SETSIZE&nbsp;&nbsp;&nbsp; 1024</span><br>表示select最多同时监听
1024个fd，当然，可以通过修改头文件再重编译内核来扩大这个数目，但这似乎并不治本。<br><br>epoll的接口非常简单，一共就三个函数：<br><span style="color: #ff0102;">1. int epoll_create(int size);</span><br>创
建一个epoll的句柄，size用来告诉内核这个监听的数目一共有多大。这个参数不同于select()中的第一个参数，给出最大监听的fd+1的值。
需要注意的是，当创建好epoll句柄后，它就是会占用一个fd值，在linux下如果查看/proc/进程id/fd/，是能够看到这个fd的，所以在
使用完epoll后，必须调用close()关闭，否则可能导致fd被耗尽。<br><br><br><span style="color: #ff0102;">2. int epoll_ctl(int epfd, int op, int fd, struct
epoll_event *event);</span><br>epoll的事件注册函数，它不同与select()是在监听事件时告诉内核要监听什么
类型的事件，而是在这里先注册要监听的事件类型。第一个参数是epoll_create()的返回值，第二个参数表示动作，用三个宏来表示：<br>EPOLL_CTL_ADD：
注册新的fd到epfd中；<br>EPOLL_CTL_MOD：修改已经注册的fd的监听事件；<br>EPOLL_CTL_DEL：从epfd中删除
一个fd；<br>第三个参数是需要监听的fd，第四个参数是告诉内核需要监听什么事，struct epoll_event结构如下：<br>struct
epoll_event {<br>&nbsp; __uint32_t events;&nbsp; /* Epoll events */<br>&nbsp;
epoll_data_t data;&nbsp; /* User data variable */<br>};<br><br>events可以是以下几个宏
的集合：<br>EPOLLIN ：表示对应的文件描述符可以读（包括对端SOCKET正常关闭）；<br>EPOLLOUT：表示对应的文件描述符可以
写；<br>EPOLLPRI：表示对应的文件描述符有紧急的数据可读（这里应该表示有带外数据到来）；<br>EPOLLERR：表示对应的文件描述符
发生错误；<br>EPOLLHUP：表示对应的文件描述符被挂断；<br>EPOLLET： 将EPOLL设为边缘触发(Edge
Triggered)模式，这是相对于水平触发(Level Triggered)来说的。<br>EPOLLONESHOT：只监听一次事件，当监听完
这次事件之后，如果还需要继续监听这个socket的话，需要再次把这个socket加入到EPOLL队列里<br><br><br><span style="color: #ff0102;">3. int epoll_wait(int epfd, struct
epoll_event * events, int maxevents, int timeout);</span><br>等待事件的产生，类似于
select()调用。参数events用来从内核得到事件的集合，maxevents告诉内核这个events有多大，这个maxevents的值不能
大于创建epoll_create()时的size，参数timeout是超时时间（毫秒，0会立即返回，-1表示永久阻塞，直到有事件发生）。该函数
返回需要处理的事件数目，如返回0表示已超时。<br><br>--------------------------------------------------------------------------------------------<br><br>从
man手册中，得到ET和LT的具体描述如下<br><br>EPOLL事件有两种模型：<br>Edge Triggered (ET)<br>Level
Triggered (LT)<br><br>假如有这样一个例子：<br>1.
我们已经把一个用来从管道中读取数据的文件句柄(RFD)添加到epoll描述符<br>2. 这个时候从管道的另一端被写入了2KB的数据<br>3.
调用epoll_wait(2)，并且它会返回RFD，说明它已经准备好读取操作<br>4. 然后我们读取了1KB的数据<br>5.
调用epoll_wait(2)......<br><br>Edge Triggered 工作模式：<br>如果我们在第1步将RFD添加到
epoll描述符的时候使用了EPOLLET标志，那么在第5步调用epoll_wait(2)之后将有可能会挂起，因为剩余的数据还存在于文件的输入缓
冲区内，而且数据发出端还在等待一个针对已经发出数据的反馈信息。只有在监视的文件句柄上发生了某个事件的时候 ET
工作模式才会汇报事件。因此在第5步的时候，调用者可能会放弃等待仍在存在于文件输入缓冲区内的剩余数据。在上面的例子中，会有一个事件产生在RFD句柄
上，因为在第2步执行了一个写操作，然后，事件将会在第3步被销毁。因为第4步的读取操作没有读空文件输入缓冲区内的数据，因此我们在第5步调用
epoll_wait(2)完成后，是否挂起是不确定的。epoll工作在ET模式的时候，必须使用非阻塞套接口，以避免由于一个文件句柄的阻塞读/阻塞
写操作把处理多个文件描述符的任务饿死。最好以下面的方式调用ET模式的epoll接口，在后面会介绍避免可能的缺陷。<br>&nbsp;&nbsp; i&nbsp;&nbsp;&nbsp;
基于非阻塞文件句柄<br>&nbsp;&nbsp; ii&nbsp;&nbsp; 只有当read(2)或者write(2)返回EAGAIN时才需要挂起，等待。<span style="font-weight: bold; color: #0001ff;">但这并不是说每次read()时都需要循环读，
直到读到产生一个EAGAIN才认为此次事件处理完成，当read()返回的读到的数据长度小于请求的数据长度时，就可以确定此时缓冲中已没有数据了，也
就可以认为此次读事件已处理完成。</span><br><br>Level Triggered 工作模式<br>相反的，以LT方式调用epoll接
口的时候，它就相当于一个速度比较快的poll(2)：只要fd上还有未处理的数据，无论是否被读取，内核就会持续通知，因此二者具有同样的职能。因为即使使用ET模式的epoll，在收
到多个chunk的数据的时候仍然会产生多个事件。调用者可以设定EPOLLONESHOT标志，在
epoll_wait(2)收到事件后epoll会将与事件关联的文件句柄从epoll描述符中禁止掉。因此当EPOLLONESHOT设定后，使用带有
EPOLL_CTL_MOD标志的epoll_ctl(2)处理文件句柄就成为调用者必须作的事情。<br><br><br>然后详细解释ET, LT:<br><br>LT(level
triggered)是缺省的工作方式，并且同时支持block和no-block
socket.在这种做法中，内核告诉你一个文件描述符是否就绪了，然后你可以对这个就绪的fd进行IO操作。如果你不作任何操作，内核还是会继续通知你
的，所以，这种模式编程出错误可能性要小一点。传统的select/poll都是这种模型的代表．<br><br>ET(edge-triggered)
是高速工作方式，只支持no-block
socket。在这种模式下，当描述符从未就绪变为就绪时，内核通过epoll告诉你。然后它会假设你知道文件描述符已经就绪，并且不会再为那个文件描述
符发送更多的就绪通知，直到你做了某些操作导致那个文件描述符不再为就绪状态了(比如，你在发送，接收或者接收请求，或者发送接收的数据少于一定量时导致
了一个EWOULDBLOCK 错误）。但是请注意，如果一直不对这个fd作IO操作(从而导致它再次变成未就绪)，内核不会发送更多的通知(only
once),<span style="font-weight: bold; font-style: italic;">不过在TCP协议中，ET模
式的加速效用仍需要更多的benchmark确认（这句话不理解）。</span><br><br><span style="font-style: italic;">在许多测试中我们会看到如果没有大量的idle
-connection或者dead-connection，epoll的效率并不会比select/poll高很多，但是当我们遇到大量的idle-
connection(例如WAN环境中存在大量的慢速连接)，就会发现epoll的效率大大高于select/poll。（未测试）</span><br><br><br><br>另
外，当使用epoll的ET模型来工作时，当产生了一个EPOLLIN事件后，<br><span style="color: #ff0102;">读数据的时候需要考虑的是当recv()返回的大小如果等于请求的大小，那么很有可能是缓冲区还有数据未读完，也意味着该次事件还没有处理
完，所以还需要再次读取</span>：<br>while(rs)<br>{<br>&nbsp; buflen =
recv(activeevents[i].data.fd, buf, sizeof(buf), 0);<br>&nbsp; if(buflen &lt;
0)<br>&nbsp; {<br>&nbsp;&nbsp;&nbsp; // 由于是非阻塞的模式,所以当errno为EAGAIN时,表示当前缓冲区已无数据可读<br>&nbsp;&nbsp;&nbsp; //
在这里就当作是该次事件已处理完.<br>&nbsp;&nbsp;&nbsp; if(errno == EAGAIN)<br>&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp; else<br>&nbsp;&nbsp;&nbsp;&nbsp;
return;<br>&nbsp;&nbsp; }<br>&nbsp;&nbsp; else if(buflen == 0)<br>&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp;&nbsp; //
这里表示对端的socket已正常关闭.<br>&nbsp;&nbsp; }<br><span style="color: #ff0102;">&nbsp;&nbsp;
if(buflen == sizeof(buf))</span><br style="color: #ff0102;"><span style="color: #ff0102;">&nbsp;&nbsp;&nbsp;&nbsp; rs = 1;&nbsp;&nbsp; // 需要再次读取</span><br>&nbsp;&nbsp;
else<br>&nbsp;&nbsp;&nbsp;&nbsp; rs = 0;<br>}<br><br><br><span style="font-weight: bold; color: #ff0102;">还有，假如发送端流量大于接收端的流量(意思是epoll所在的程序读比转发的socket要快),由
于是非阻塞的socket,那么send()函数虽然返回,但实际缓冲区的数据并未真正发给接收端,这样不断的读和发，当缓冲区满后会产生EAGAIN错
误(参考man
send),同时,不理会这次请求发送的数据.所以,需要封装socket_send()的函数用来处理这种情况,该函数会尽量将数据写完再返回，返回
-1表示出错。在socket_send()内部,当写缓冲已满(send()返回-1,且errno为EAGAIN),那么会等待后再重试.这种方式并
不很完美,在理论上可能会长时间的阻塞在socket_send()内部,但暂没有更好的办法.</span><br><br>ssize_t
socket_send(int sockfd, const char* buffer, size_t buflen)<br>{<br>&nbsp;
ssize_t tmp;<br>&nbsp; size_t total = buflen;<br>&nbsp; const char *p = buffer;<br><br>&nbsp;
while(1)<br>&nbsp; {<br>&nbsp;&nbsp;&nbsp; tmp = send(sockfd, p, total, 0);<br>&nbsp;&nbsp;&nbsp; if(tmp
&lt; 0)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // send() was interrupted by a signal; writing could continue, but we return -1 here.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(errno == EINTR)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // On a non-blocking socket this error means the write buffer queue is full,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // so delay briefly and then retry.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(errno
== EAGAIN)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; usleep(1000);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
}<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; if((size_t)tmp == total)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
return buflen;<br><br>&nbsp;&nbsp;&nbsp; total -= tmp;<br>&nbsp;&nbsp;&nbsp; p += tmp;<br>&nbsp; }<br><br>&nbsp;
return tmp;<br>}<br><br>from:<br><a  href="http://www.cnblogs.com/OnlyXP/archive/2007/08/10/851222.html">http://www.cnblogs.com/OnlyXP/archive/2007/08/10/851222.html</a><br><img src ="http://www.cppblog.com/beautykingdom/aggbug/114627.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-05-06 15:12 <a href="http://www.cppblog.com/beautykingdom/archive/2010/05/06/114627.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>epoll用法说明</title><link>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114589.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 06 May 2010 04:03:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114589.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/114589.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/05/06/114589.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/114589.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/114589.html</trackback:ping><description><![CDATA[<table style="border-collapse: collapse; word-wrap: break-word;" width="100%" border="0" cellpadding="0" cellspacing="0">
    <tbody>
        <tr>
            <td align="center" height="25"><font style="font-size: 14pt;" color="#02368d"><strong>[Repost] How to use epoll (with source code) </strong></font><br>
            </td>
        </tr>
        <tr>
            <td bgcolor="#d2dee2" height="1"><br></td>
        </tr>
        <tr>
            <td bgcolor="#ffffff" height="1"><br></td>
        </tr>
        <tr>
            <td align="center">
            <table style="border-collapse: collapse; word-wrap: break-word;" width="100%" border="0" cellpadding="0" cellspacing="0">
                <tbody>
                    <tr>
                        <td width="100%">
                        <div id="art" width="100%" style="margin: 15px;">
                        <div>
                        <table style="border-collapse: collapse; word-wrap: break-word;" width="100%" border="0" cellpadding="0" cellspacing="0">
                            <tbody>
                                <tr>
                                    <td align="middle">
                                    <table style="border-collapse: collapse; word-wrap: break-word;" width="100%" border="0" cellpadding="0" cellspacing="0">
                                        <tbody>
                                            <tr>
                                                <td width="100%">
                                                <div id="art" style="margin: 15px;" width="100%">
                                                <p><span id="ArticleContent1_ArticleContent1_lblContent">All of the functions epoll uses are declared in the header sys/epoll.h. The data structures and functions involved are briefly described below.<br>
Data structures:<br>
typedef union epoll_data {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; void *ptr;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int fd;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t u32;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint64_t u64;<br>} epoll_data_t;<br><br>
struct epoll_event {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t events;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Epoll events */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_data_t data;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* User data variable */<br>};<br>
The epoll_event structure is used both to register the events you are interested in and to return the pending events that have occurred. The epoll_data union holds data associated with the file descriptor that triggered an event: for example, when a client connects, the server gets the client's socket descriptor from accept() and can store it in epoll_data's fd field so that subsequent reads and writes are performed on that descriptor. The events field of epoll_event holds the events of interest, or the triggered events on return; its possible values are:<br>
<strong>EPOLLIN</strong>: the associated file descriptor is readable;<br><strong>EPOLLOUT:</strong> the associated file descriptor is writable;<br><strong>EPOLLPRI:</strong> the associated file descriptor has urgent (out-of-band) data to read;<br><strong>EPOLLERR:</strong> an error occurred on the associated file descriptor;<br><strong>EPOLLHUP:</strong> the associated file descriptor was hung up;<br><strong>EPOLLET:</strong> requests edge-triggered (ET) notification for the associated file descriptor, rather than the default level-triggered behavior;<br>
Functions:<br>
1. The epoll_create function<br>&nbsp;&nbsp;&nbsp;&nbsp; Declaration: <strong>int epoll_create(int</strong> size<strong>)</strong> <br>&nbsp;&nbsp;&nbsp; Creates a file descriptor dedicated to epoll. The size argument is only a hint to the kernel about how many descriptors will be monitored; it does not limit the set (and since Linux 2.6.8 it is ignored entirely, as long as it is greater than zero).<br>
2. The epoll_ctl function<br>&nbsp;&nbsp;&nbsp;&nbsp; Declaration: <strong>int epoll_ctl(int</strong> epfd<strong>, int</strong> op<strong>, int</strong> fd<strong>, struct epoll_event *</strong>event<strong>)<br>&nbsp;&nbsp;&nbsp;&nbsp; </strong>Controls the events on a given file descriptor: events can be registered, modified, or deleted.<br>&nbsp;&nbsp;&nbsp; Parameters:<br>&nbsp;&nbsp;&nbsp; epfd: the epoll descriptor returned by <strong>epoll_create</strong>;<br>&nbsp;&nbsp;&nbsp; op: the operation to perform, e.g. registering an event; possible values are <strong>EPOLL_CTL_ADD</strong> to register, <strong>EPOLL_CTL_MOD</strong> to modify, and <strong>EPOLL_CTL_DEL</strong> to delete;<br>&nbsp;&nbsp;&nbsp; fd: the file descriptor concerned;<br>&nbsp;&nbsp;&nbsp; event: pointer to an epoll_event;<br>&nbsp;&nbsp;&nbsp; Returns 0 on success and -1 on failure.<br>
3. The epoll_wait function<br>Declaration: <strong>int epoll_wait(int</strong> epfd,<strong> struct epoll_event *</strong>events,<strong> int</strong> maxevents,<strong> int</strong> timeout)<br>Waits for I/O events to occur.<br>Parameters:<br>epfd: the epoll descriptor returned by <strong>epoll_create</strong>;<br>events: array used to return the events awaiting handling;<br>maxevents: the maximum number of events returned per call;<br>timeout: timeout for waiting for I/O events, in milliseconds;<br>Returns the number of events that occurred.<br>Example:</span></p>
                                                <p><span><br></span></p>
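The full example below relies on a setnonblocking() helper. Its fcntl pattern is worth getting right in isolation: the status flags must be read with F_GETFL, whereas F_GETFD reads the close-on-exec flag, an easy mix-up. A minimal standalone sketch (editor's illustration, not from the original article):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch a descriptor to non-blocking mode.
   F_GETFL reads the file *status* flags; F_GETFD would read the
   close-on-exec *descriptor* flag, so OR-ing O_NONBLOCK into the
   latter is a common bug. */
int setnonblocking(int sockfd)
{
    int flags = fcntl(sockfd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    if (fcntl(sockfd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

After this call, recv()/send() on the descriptor return -1 with errno == EAGAIN instead of blocking, which is what the ET read loops in these articles depend on.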
                                                <div>
                                                <div align="left"><span style="font-size: 9pt;">&nbsp;#include
                                                &lt;stdio.h&gt;<br>#include &lt;stdlib.h&gt;<br>#include &lt;errno.h&gt;<br>#include
                                                &lt;string.h&gt;<br>#include &lt;sys/types.h&gt;<br>#include
                                                &lt;netinet/in.h&gt;<br>#include &lt;sys/socket.h&gt;<br>#include
                                                &lt;sys/wait.h&gt;<br>#include &lt;unistd.h&gt;<br>#include
                                                &lt;arpa/inet.h&gt;<br>#include &lt;openssl/ssl.h&gt;<br>#include
                                                &lt;openssl/err.h&gt;<br>#include &lt;fcntl.h&gt;<br>#include
                                                &lt;sys/epoll.h&gt;<br>#include &lt;sys/time.h&gt;<br>#include
                                                &lt;sys/resource.h&gt;<br><br><br>#define MAXBUF 1024<br>#define
                                                MAXEPOLLSIZE 10000<br><br>/*<br>setnonblocking - 设置句柄为非阻塞方式<br>*/<br>int
                                                setnonblocking(int sockfd)<br>{<br>&nbsp;&nbsp;&nbsp; if (fcntl(sockfd, F_SETFL,
                                                fcntl(sockfd, F_GETFL, 0)|O_NONBLOCK) == -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp;
                                                }<br>&nbsp;&nbsp;&nbsp; return 0;<br>}<br><br>
/*<br>handle_message - handle message send/receive on each socket<br>*/<br>
int handle_message(int new_fd)<br>{<br>&nbsp;&nbsp;&nbsp; char buf[MAXBUF + 1];<br>&nbsp;&nbsp;&nbsp; int len;<br>&nbsp;&nbsp;&nbsp; /* start handling data on this new connection */<br>&nbsp;&nbsp;&nbsp; bzero(buf, MAXBUF + 1);<br>&nbsp;&nbsp;&nbsp; /* receive the client's message */<br>&nbsp;&nbsp;&nbsp; len = recv(new_fd, buf, MAXBUF, 0);<br>&nbsp;&nbsp;&nbsp; if (len &gt; 0)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ("%d received message successfully: '%s', %d bytes in total\n",<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new_fd, buf, len);<br>&nbsp;&nbsp;&nbsp; else {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (len &lt; 0)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ("message receive failed! error code %d, error message '%s'\n",<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; errno, strerror(errno));<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; close(new_fd);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; /* done handling data on this connection */<br>&nbsp;&nbsp;&nbsp; return len;<br>}<br>
/************ About this document ********************************************<br>*filename: epoll-server.c<br>*purpose: demonstrates how epoll can handle a huge number of socket connections<br>*wrote by: zhoulifa(zhoulifa@163.com) 周立发(http://zhoulifa.bokee.com)<br>Linux enthusiast, Linux knowledge evangelist, SOHO worker, developer, most proficient in C<br>*date time:2007-01-31 21:00<br>*Note: anyone may copy and use this code and these documents freely, including for commercial purposes,<br>* but please follow the GPL<br>*Thanks to:Google<br>*Hope: may ever more people contribute their strength to the advance of science and technology;<br>* standing on the shoulders of giants, technology moves faster! Thanks to the open-source pioneers!<br>*********************************************************************/<br>int
                                                main(int argc, char **argv)<br>{<br>&nbsp;&nbsp;&nbsp; int listener, new_fd, kdpfd,
                                                nfds, n, ret, curfds;<br>&nbsp;&nbsp;&nbsp; socklen_t len;<br>&nbsp;&nbsp;&nbsp; struct sockaddr_in
                                                my_addr, their_addr;<br>&nbsp;&nbsp;&nbsp; unsigned int myport, lisnum;<br>&nbsp;&nbsp;&nbsp; struct
                                                epoll_event ev;<br>&nbsp;&nbsp;&nbsp; struct epoll_event events[MAXEPOLLSIZE];<br>&nbsp;&nbsp;&nbsp;
                                                struct rlimit rt;<br><br>&nbsp;&nbsp;&nbsp; if (argv[1])<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; myport =
                                                atoi(argv[1]);<br>&nbsp;&nbsp;&nbsp; else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; myport = 7838;<br><br>&nbsp;&nbsp;&nbsp; if
                                                (argv[2])<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; lisnum = atoi(argv[2]);<br>&nbsp;&nbsp;&nbsp; else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                lisnum = 2;<br><br>&nbsp;&nbsp;&nbsp; /* set the maximum number of files each process may open */<br>&nbsp;&nbsp;&nbsp; rt.rlim_max =
                                                rt.rlim_cur = MAXEPOLLSIZE;<br>&nbsp;&nbsp;&nbsp; if (setrlimit(RLIMIT_NOFILE, &amp;rt)
                                                == -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; perror("setrlimit");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit(1);<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;
                                                else printf("system resource limits set successfully!\n");<br><br>&nbsp;&nbsp;&nbsp; /* open the listening socket */<br>&nbsp;&nbsp;&nbsp; if
                                                ((listener = socket(PF_INET, SOCK_STREAM, 0)) == -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                perror("socket");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit(1);<br>&nbsp;&nbsp;&nbsp; } else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                printf("socket created successfully!\n");<br><br>&nbsp;&nbsp;&nbsp; setnonblocking(listener);<br><br>&nbsp;&nbsp;&nbsp;
                                                bzero(&amp;my_addr, sizeof(my_addr));<br>&nbsp;&nbsp;&nbsp; my_addr.sin_family =
                                                PF_INET;<br>&nbsp;&nbsp;&nbsp; my_addr.sin_port = htons(myport);<br>&nbsp;&nbsp;&nbsp; if (argv[3])<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                my_addr.sin_addr.s_addr = inet_addr(argv[3]);<br>&nbsp;&nbsp;&nbsp; else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                my_addr.sin_addr.s_addr = INADDR_ANY;<br><br>&nbsp;&nbsp;&nbsp; if (bind<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                (listener, (struct sockaddr *) &amp;my_addr, sizeof(struct sockaddr))<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                == -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; perror("bind");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit(1);<br>&nbsp;&nbsp;&nbsp; } else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                printf("IP address and port bound successfully\n");<br><br>&nbsp;&nbsp;&nbsp; if (listen(listener, lisnum) ==
                                                -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; perror("listen");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; exit(1);<br>&nbsp;&nbsp;&nbsp; } else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                printf("service started successfully!\n");<br><br>&nbsp;&nbsp;&nbsp; /* create the epoll handle and add the listening socket to the epoll set
                                                */<br>&nbsp;&nbsp;&nbsp; kdpfd = epoll_create(MAXEPOLLSIZE);<br>&nbsp;&nbsp;&nbsp; len =
                                                sizeof(struct sockaddr_in);<br>&nbsp;&nbsp;&nbsp; ev.events = EPOLLIN | EPOLLET;<br>&nbsp;&nbsp;&nbsp;
                                                ev.data.fd = listener;<br>&nbsp;&nbsp;&nbsp; if (epoll_ctl(kdpfd, EPOLL_CTL_ADD,
                                                listener, &amp;ev) &lt; 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fprintf(stderr, "epoll set
                                                insertion error: fd=%d\n", listener);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp; }
                                                else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("监听 socket 加入 epoll 成功！\n");<br>&nbsp;&nbsp;&nbsp; curfds = 1;<br>&nbsp;&nbsp;&nbsp;
                                                while (1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* wait for events to occur */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nfds =
                                                epoll_wait(kdpfd, events, curfds, -1);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (nfds == -1) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                perror("epoll_wait");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* handle all events */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (n = 0; n &lt; nfds; ++n) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if
                                                (events[n].data.fd == listener) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; new_fd =
                                                accept(listener, (struct sockaddr *) &amp;their_addr,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                &amp;len);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (new_fd &lt; 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                perror("accept");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }
                                                else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("connection from %s:%d, assigned socket: %d\n",
                                                inet_ntoa(their_addr.sin_addr), ntohs(their_addr.sin_port), new_fd);<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                setnonblocking(new_fd);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.events = EPOLLIN |
                                                EPOLLET;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.data.fd = new_fd;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if
                                                (epoll_ctl(kdpfd, EPOLL_CTL_ADD, new_fd, &amp;ev) &lt; 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                fprintf(stderr, "failed to add socket '%d' to epoll! %s\n",<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                new_fd, strerror(errno));<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curfds++;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } else {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                ret = handle_message(events[n].data.fd);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (ret
                                                &lt; 1 &amp;&amp; errno != EAGAIN) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl(kdpfd,
                                                EPOLL_CTL_DEL, events[n].data.fd,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                &amp;ev);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; curfds--;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                                                }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; close(listener);<br>&nbsp;&nbsp;&nbsp; return 0;<br>}<br><br>
Compile this program with:<br>gcc -Wall epoll-server.c -o server<br><br>Running it requires administrator privileges (to raise the per-process file-descriptor limit)!<br><br>sudo ./server 7838 1<br><br>Testing shows that this one server can handle 10000 - 3 = 9997 connections simultaneously!<br><br>If this were an online service, it could support 9997 users online at once, for example a game or chat server.</span></div>
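The reason the server must run as root is that it calls setrlimit() with both rlim_cur and rlim_max set to MAXEPOLLSIZE; raising the hard limit needs privileges. An unprivileged variant only adjusts the soft limit and clamps to the existing hard limit. A minimal sketch (the helper name set_fd_limit is ours, not from the article):

```c
#include <sys/resource.h>

/* Raise (or keep) the soft RLIMIT_NOFILE limit without privileges.
   Only increasing rlim_max beyond the hard limit requires
   CAP_SYS_RESOURCE, so we clamp the request to the hard limit. */
int set_fd_limit(rlim_t want)
{
    struct rlimit rt;
    if (getrlimit(RLIMIT_NOFILE, &rt) == -1)
        return -1;
    if (want > rt.rlim_max)      /* going higher would need root */
        want = rt.rlim_max;
    rt.rlim_cur = want;          /* rlim_max left unchanged */
    return setrlimit(RLIMIT_NOFILE, &rt);
}
```

With this, the server could degrade gracefully (fewer simultaneous connections) instead of exiting when started without sudo.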
                                                </div>
                                                </div>
                                                </td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    </td>
                                </tr>
                                <tr>
                                    <td height="25">&nbsp;<font color="#000099"><strong>原文地址</strong></font> <a href="http://blog.chinaunix.net/u/8818/showart_440623.html" target="_blank"><font color="#0000ff">http://blog.chinaunix.net/u/8818/showart_440623.html</font></a></td>
                                </tr>
                            </tbody>
                        </table>
                        </div>
                        </div>
                        </td>
                    </tr>
                </tbody>
            </table>
            </td>
        </tr>
    </tbody>
</table><img src ="http://www.cppblog.com/beautykingdom/aggbug/114589.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-05-06 12:03 <a href="http://www.cppblog.com/beautykingdom/archive/2010/05/06/114589.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>EPoll Mechanism</title><link>http://www.cppblog.com/beautykingdom/archive/2010/04/24/100420.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sat, 24 Apr 2010 14:41:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/04/24/100420.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/100420.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/04/24/100420.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/100420.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/100420.html</trackback:ping><description><![CDATA[<div class="cnt" id="blog_text">
<pre class="text"><font size="3">1 Overview<br>&nbsp;&nbsp;&nbsp;&nbsp; One way epoll differs from select/poll is that it is made up of a group of system calls:<br>&nbsp;&nbsp;&nbsp;&nbsp; int epoll_create(int size);<br>&nbsp;&nbsp;&nbsp;&nbsp; int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);<br>&nbsp;&nbsp;&nbsp;&nbsp; int epoll_wait(int epfd, struct epoll_event *events,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int maxevents, int timeout);<br>&nbsp;&nbsp;&nbsp;&nbsp; These system calls were introduced starting with Linux 2.5.44, and their design reworks a great deal in response to the weaknesses of the traditional select/poll calls. The drawbacks of select/poll are:<br>&nbsp;&nbsp;&nbsp;&nbsp; 1. Every call re-reads the parameters from user space.<br>&nbsp;&nbsp;&nbsp;&nbsp; 2. Every call rescans all the file descriptors.<br>&nbsp;&nbsp;&nbsp;&nbsp; 3. At the start of every call, the current process is placed on the wait queue of each file descriptor; when the call ends, the process is removed from every one of those queues again.<br>&nbsp;&nbsp;&nbsp;&nbsp; In practice, select/poll may monitor a very large number of descriptors; if each call returns only a small fraction of them, select/poll is clearly inefficient in this situation. epoll's design splits the single select/poll operation into one epoll_create plus several epoll_ctl calls plus one wait. In addition, the kernel adds a file system, "eventpollfs", for epoll: each group of monitored descriptors corresponds to an inode of the eventpollfs file system whose main state is kept in the eventpoll structure, while the important state of each monitored file is kept in an epitem structure, so the relationship is one-to-many.<br>&nbsp;&nbsp;&nbsp;&nbsp; Because epoll_create and epoll_ctl already save the user-space information into the kernel, subsequent calls to epoll_wait, however often repeated, do not copy the parameters again, do not rescan the descriptors, and do not repeatedly move the current process on and off the wait queues. This avoids all three drawbacks above.<br>&nbsp;&nbsp;&nbsp;&nbsp; Now for the implementation:<br>
2 Key structures<br>/* Wrapper struct used by poll queueing */<br>struct ep_pqueue {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; poll_table pt;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epitem *epi;<br>};<br>&nbsp;&nbsp;&nbsp;&nbsp; This structure is analogous to struct poll_wqueues in select/poll. Because epoll keeps a large amount of state in kernel space, a single callback function pointer is no longer enough, so a new structure, struct epitem, is introduced.<br>
/*<br> * Each file descriptor added to the eventpoll interface will<br> * have an entry of this type linked to the hash.<br> */<br>struct epitem {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* RB-Tree node used to link this structure to the eventpoll rb-tree */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct rb_node rbn;<br>Red-black tree node linking this item into the eventpoll's tree<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List header used to link this structure to the eventpoll ready list */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head rdllink;<br>Doubly linked list node for the list of completed (ready) events<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* The file descriptor information this item refers to */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epoll_filefd ffd;<br>Information about the monitored file descriptor this item corresponds to<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Number of active wait queue attached to poll operations */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int nwait;<br>Number of wait-queue entries attached by poll operations<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List containing poll wait queues */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head pwqlist;<br>Doubly linked list holding the monitored file's wait queues, similar in role to select/poll's poll_table<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* The "container" of this item */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct eventpoll *ep;<br>Pointer back to the owning eventpoll; several epitems correspond to one eventpoll<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* The structure that describe the interested events and the source fd */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epoll_event event;<br>Records the events that occurred and the corresponding fd<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * Used to keep track of the usage count of the structure. This avoids<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * that the structure will disappear from underneath our processing.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; atomic_t usecnt;<br>Reference count<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List header used to link this item to the "struct file" items list */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head fllink;<br>Doubly linked list linking this item to the monitored descriptor's struct file; the file keeps an f_ep_links list recording every epoll node that watches it<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List header used to link the item to the transfer list */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head txlink;<br>Doubly linked list for the transfer queue<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * This is used during the collection/transfer of events to userspace<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * to pin items empty events set.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned int revents;<br>The descriptor's event state, used to pin empty event sets while events are collected and transferred<br>};<br>&nbsp;&nbsp;&nbsp;&nbsp; These structures record the many file descriptors associated with one epoll node, stored as a hash implemented with a red-black tree (why they must be stored at all is explained in detail below). Each epitem corresponds one-to-one to a monitored file descriptor.<br>
struct eventpoll {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Protect the this structure access */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; rwlock_t lock;<br>Read-write lock<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * This semaphore is used to ensure that files are not removed<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * while epoll is using them. This is read-held during the event<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * collection loop and it is write-held during the file cleanup<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * path, the epoll file exit code and the ctl operations.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct rw_semaphore sem;<br>Read-write semaphore<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Wait queue used by sys_epoll_wait() */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_head_t wq;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Wait queue used by file-&gt;poll() */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_head_t poll_wait;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List of ready file descriptors */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head rdllist;<br>Queue of operations (events) that have already completed<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* RB-Tree root used to store monitored fd structs */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct rb_root rbr;<br>Root of the red-black tree holding the descriptors this epoll monitors<br>};<br>&nbsp;&nbsp;&nbsp;&nbsp; This structure holds the extended state of an epoll file descriptor and is stored in the file structure's private_data field. It corresponds one-to-one to an epoll file node; since one epoll file node usually corresponds to many monitored descriptors, one eventpoll structure corresponds to many epitem structures.<br>&nbsp;&nbsp;&nbsp;&nbsp; Where, then, does epoll keep its wait-queue entries? See below:<br>
/* Wait structure used by the poll hooks */<br>struct eppoll_entry {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* List header used to link this structure to the "struct epitem" */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct list_head llink;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* The "base" pointer is set to the container "struct epitem" */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; void *base;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * Wait queue item that will be linked to the target file wait<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; * queue head.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_t wait;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* The wait queue head that linked the "wait" wait queue item */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_head_t *whead;<br>};<br>&nbsp;&nbsp;&nbsp;&nbsp; Compared with select/poll's struct poll_table_entry, epoll's wait-queue node is only slightly different:<br>struct poll_table_entry {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct file * filp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_t wait;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wait_queue_head_t * wait_address;<br>};<br>&nbsp;&nbsp;&nbsp;&nbsp; Since an epitem corresponds to one monitored file, the base pointer conveniently leads back to that file's state; and because one file may have multiple events pending, llink chains those events together.<br>
3 Implementation of epoll_create<br>&nbsp;&nbsp;&nbsp;&nbsp; epoll_create() creates an inode of the eventpollfs file system. The actual work is done by ep_getfd(): it first calls ep_eventpoll_inode() to create the inode, then d_alloc() to allocate a dentry for it, and finally links file, dentry, and inode together. After ep_getfd(), ep_file_init() is called, which allocates the eventpoll structure and stores its pointer in the file structure, so that the eventpoll becomes associated with the file.<br>&nbsp;&nbsp;&nbsp;&nbsp; Note that epoll_create()'s size parameter is really only advisory: as long as it is greater than zero, it does not limit the number of file descriptors associated with this epoll inode.<br>
4 Implementation of epoll_ctl<br>&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl implements a family of operations, such as associating a file with the eventpollfs inode. Here the eventpoll structure matters: stored in file-&gt;private_data, it records the key state of the eventpollfs inode, and its member rbr holds every file descriptor this epoll node monitors, organized as a red-black tree, a structure that makes node lookup very efficient.<br>&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl first calls ep_find() to look up the epitem in eventpoll's red-black tree, then branches on the op parameter. If op is EPOLL_CTL_ADD, the epitem normally will not be found in the tree, so ep_insert is called to create an epitem structure and insert it into the corresponding tree.<br>&nbsp;&nbsp;&nbsp;&nbsp; ep_insert() first allocates an epitem object, initializes it, and puts it into the tree. It also performs one more task: putting the current process on the wait queue of the corresponding file operation. This is done by the following code:<br>&nbsp;&nbsp;&nbsp;&nbsp; init_poll_funcptr(&amp;epq.pt, ep_ptable_queue_proc);<br>&nbsp;&nbsp;&nbsp;&nbsp; ...<br>&nbsp;&nbsp;&nbsp;&nbsp; revents = tfile-&gt;f_op-&gt;poll(tfile, &amp;epq.pt);<br>&nbsp;&nbsp;&nbsp;&nbsp; The function first uses init_poll_funcptr to register a callback, ep_ptable_queue_proc, which runs when f_op-&gt;poll is called. That callback allocates an epoll wait-queue node, eppoll_entry, hooking it on one side into the file operation's wait queue and on the other into the epitem's queue; it also registers a wait-queue callback, ep_poll_callback. When a file operation completes, just before the current process is woken, ep_poll_callback() puts the epitem onto the eventpoll's ready queue and wakes the waiting process.<br>&nbsp;&nbsp;&nbsp;&nbsp; If, after f_op-&gt;poll runs, the monitored operation turns out to have completed already, the item is placed on the ready queue immediately and the processes waiting on it are woken at once.<br>
5 Implementation of epoll_wait<br>&nbsp;&nbsp;&nbsp;&nbsp; epoll_wait's job is to wait for file operations to complete and return. Its core is ep_poll(), which checks in a for loop whether any epitem has completed events; if so, it returns the results. If not, it calls schedule_timeout() to sleep until the process is woken again or the timeout expires.<br>
6 Performance analysis<br>&nbsp;&nbsp;&nbsp;&nbsp; The epoll mechanism was designed around the defects of select/poll. Through the newly introduced eventpollfs file system, epoll copies the parameters into kernel space once and does not copy them again on each poll. By splitting the operation into epoll_create, epoll_ctl, and epoll_wait, it avoids repeatedly traversing the monitored file descriptors. Furthermore, when the process calling epoll is woken, it only has to pick the completed events directly off the epitem ready queue, so the complexity of finding completed events drops from O(N) to O(1).<br>&nbsp;&nbsp;&nbsp;&nbsp; epoll's performance gain does have a precondition, however: very many monitored file descriptors, with only a few completing an operation each time. Whether epoll noticeably improves efficiency therefore depends on the actual application scenario, and needs further testing.</font></pre>
<p><font size="3">Source: </font><a  href="http://www.freecity.cn/agent/thread.do?id=LinuxDev-48b24eba-c4e53e6f2d89ff3cb039f2c4ed4102e9&amp;page=0&amp;bd=LinuxDev&amp;bp=0&amp;m=2"><font color="#0000ff" size="3">http://www.freecity.cn/agent/thread.do?id=LinuxDev-48b24eba-c4e53e6f2d89ff3cb039f2c4ed4102e9</font></a></p>
<p><a  href="http://www.freecity.cn/agent/thread.do?id=LinuxDev-48b24eba-c4e53e6f2d89ff3cb039f2c4ed4102e9&amp;page=0&amp;bd=LinuxDev&amp;bp=0&amp;m=2"><font color="#0000ff" size="3">from: http://blog.chinaunix.net/u2/67780/showart_2064403.html</font></a></p>
</div><img src ="http://www.cppblog.com/beautykingdom/aggbug/100420.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-04-24 22:41 <a href="http://www.cppblog.com/beautykingdom/archive/2010/04/24/100420.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>An epoll usage example</title><link>http://www.cppblog.com/beautykingdom/archive/2010/01/10/105365.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 10 Jan 2010 14:46:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/01/10/105365.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/105365.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/01/10/105365.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/105365.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/105365.html</trackback:ping><description><![CDATA[<span style="border-collapse: collapse; font-family: song,Verdana; font-size: 12px;">
<p minmax_bound="true" style="font: 12px song,Verdana;"><strong minmax_bound="true">Terminology:</strong> running man epoll gives the following:</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">NAME<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll - I/O event notification facility</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">SYNOPSIS<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #include &lt;sys/epoll.h&gt;</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">DESCRIPTION<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll is a variant of poll(2) that can be used either as Edge or Level<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Triggered interface and scales well to large numbers of&nbsp; watched&nbsp; fds.<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Three&nbsp; system&nbsp; calls&nbsp; are provided to set up and control an epoll set:<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_create(2), epoll_ctl(2), epoll_wait(2).</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; An epoll set is connected to a file descriptor created&nbsp; by&nbsp; epoll_cre-<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ate(2).&nbsp;&nbsp; Interest for certain file descriptors is then registered via<br minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl(2).&nbsp; Finally, the actual wait is started by epoll_wait(2).</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">Honestly, any further explanation is almost superfluous. As far as I can tell, the epoll model has essentially a single usage pattern, so the code below, together with its inline comments, should be enough to get a feel for epoll:</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">while (TRUE)<br />&nbsp;{<br />&nbsp;&nbsp;int nfds = epoll_wait (m_epoll_fd, m_events, MAX_EVENTS, EPOLL_TIME_OUT);//wait for epoll events to occur; this is effectively the listening step, and the relevant ports must already be bound when the epoll instance is initialized.<br />&nbsp;&nbsp;if (nfds &lt;= 0)<br />&nbsp;&nbsp;&nbsp;continue;<br />&nbsp;&nbsp;m_bOnTimeChecking = FALSE;<br />&nbsp;&nbsp;G_CurTime = time(NULL);<br />&nbsp;&nbsp;for (int i=0; i&lt;nfds; i++)<br />&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;try<br />&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;if (m_events[i].data.fd == m_listen_http_fd)//a new HTTP client connected on the bound HTTP port: accept the connection. (We have since switched to plain socket connections, so this branch is mostly unused.)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OnAcceptHttpEpoll ();<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;else if (m_events[i].data.fd == m_listen_sock_fd)//a new socket client connected on the bound socket port: accept the connection.<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OnAcceptSockEpoll ();<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;else if (m_events[i].events &amp; EPOLLIN)//an already-connected client sent data: read it in.<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;OnReadEpoll (i);<br />&nbsp;&nbsp;&nbsp;&nbsp;}</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;OnWriteEpoll (i);//check whether the current active connection has data waiting to be written out.<br />&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;catch (int)<br />&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;PRINTF ("CATCH caught an error\n");<br />&nbsp;&nbsp;&nbsp;&nbsp;continue;<br />&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;}<br />&nbsp;&nbsp;m_bOnTimeChecking = TRUE;<br />&nbsp;&nbsp;OnTimer ();//periodic housekeeping, mainly removing clients that have disconnected.<br />&nbsp;}</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">　As I currently understand it, the essence of epoll really is just the few short passages of code above. Times have changed: the once-daunting problem of accepting huge numbers of client connections is now solved this easily, which is hard not to marvel at.</p>
<p minmax_bound="true" style="font: 12px song,Verdana;">I spent the whole of today on epoll, trying to build a high-concurrency proxy. At first it was genuinely frustrating: I could not get it to work, and although there are a few articles about epoll on the net, none goes deep or points out the pitfalls, so I took many detours. I am sharing my understanding here so you can avoid them.&nbsp;<br /><br />Every function epoll uses is declared in the header sys/epoll.h; look there whenever something is unclear or you forget a signature.&nbsp;<br />Compared with select, epoll's biggest differences are:&nbsp;<br /><br /><font color="#ff0000">1. When epoll returns, you already know exactly which socket fds had events, so there is no need to check them one by one. This improves efficiency.&nbsp;<br />2. select is limited by FD_SETSIZE, whereas epoll has no limit other than available system resources.</font>&nbsp;<br /><br />1. The epoll_create function<br />Declaration: <font color="#0000ff">int epoll_create(int size)</font><br />This function creates a file descriptor dedicated to epoll. In effect it asks the kernel to reserve space to record which of the socket fds you care about have had events, and which events. size is a hint for the number of socket fds you expect to watch on this epoll fd; pick whatever your resources allow (see difference 2 from select above).&nbsp;<br /><br />2. The epoll_ctl function&nbsp;<br />Declaration: <font color="#0000ff">int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)</font>&nbsp;<br />This function controls the events on an epoll file descriptor: it can register, modify and delete events.&nbsp;<br />Parameters:&nbsp;<br />epfd: the epoll-specific file descriptor produced by epoll_create;&nbsp;<br />op: the operation to perform, one of <font color="#0000ff">EPOLL_CTL_ADD (register), EPOLL_CTL_MOD (modify), EPOLL_CTL_DEL (delete)</font>;&nbsp;<br /><br />fd: the file descriptor concerned;&nbsp;<br />event: a pointer to an epoll_event;&nbsp;<br />Returns 0 on success and -1 on failure.&nbsp;<br /><br />The data structures involved:&nbsp;<br /><font color="#0000ff">typedef union epoll_data {&nbsp;<br />void *ptr;<br />int fd;<br />__uint32_t u32;<br />__uint64_t u64;<br />} epoll_data_t;&nbsp;<br /><br />struct epoll_event {<br />__uint32_t events; /* Epoll events */<br />epoll_data_t data; /* User data variable */<br />};<br /></font><br /><br />For example:&nbsp;<br />struct epoll_event ev;<br />//set the file descriptor associated with the event to be handled<br />ev.data.fd=listenfd;<br />//set the event types to be handled<br />ev.events=EPOLLIN|EPOLLET;<br />//register the epoll event<br />epoll_ctl(epfd,EPOLL_CTL_ADD,listenfd,&amp;ev);<br /><br /><br />Common event types:<br /><font color="#0000ff">EPOLLIN: the file descriptor is readable;<br />EPOLLOUT: the file descriptor is writable;<br />EPOLLPRI: the file descriptor has urgent data to read;<br />EPOLLERR: an error occurred on the file descriptor;<br />EPOLLHUP: the file descriptor was hung up;<br />EPOLLET: watch the file descriptor in edge-triggered mode;<br /></font><br /><br />3. The epoll_wait function<br />Declaration: <font color="#0000ff">int epoll_wait(int epfd,struct epoll_event * events,int maxevents,int timeout)</font><br />This function waits for I/O events to occur;<br />Parameters:<br />epfd: the epoll-specific file descriptor produced by epoll_create;<br />events: the array used to return the events to be handled;<br />maxevents: the maximum number of events handled per call;<br />timeout: how long to wait for I/O events, in milliseconds; -1 blocks indefinitely, 0 returns immediately. -1 is usually what you want.<br />Returns the number of events that occurred.<br /><br /><br /></p>
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span>#include&nbsp;&lt;stdio.h&gt;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>#include&nbsp;&lt;stdlib.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;errno.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;string.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;sys/types.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;netinet/in.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;sys/socket.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;sys/wait.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;unistd.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;arpa/inet.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;openssl/ssl.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;openssl/err.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;fcntl.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;sys/epoll.h&gt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>#include&nbsp;&lt;sys/time.h&gt;&nbsp;&nbsp;</span></li>
    <li class=""><span>#include&nbsp;&lt;sys/resource.h&gt;&nbsp;&nbsp;</span></li>
</ol>
</div>
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span>#define&nbsp;MAXBUF&nbsp;</span><span class="number">1024</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>#define&nbsp;MAXEPOLLSIZE&nbsp;<span class="number">10000</span><span>&nbsp;&nbsp;</span></span></li>
</ol>
</div>
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span class="comment">/*</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">setnonblocking&nbsp;-&nbsp;set&nbsp;the&nbsp;descriptor&nbsp;to&nbsp;non-blocking&nbsp;mode</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span><span class="keyword">int</span><span>&nbsp;setnonblocking(</span><span class="keyword">int</span><span>&nbsp;sockfd)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(fcntl(sockfd,&nbsp;F_SETFL,&nbsp;fcntl(sockfd,&nbsp;F_GETFD,&nbsp;</span><span class="number">0</span><span>)|O_NONBLOCK)&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;-</span><span class="number">1</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;</span><span class="number">0</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>}&nbsp;&nbsp;</span></li>
</ol>
</div>
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span class="comment">/*</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">handle_message&nbsp;-&nbsp;handle&nbsp;message&nbsp;send/receive&nbsp;on&nbsp;each&nbsp;socket</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span><span class="keyword">int</span><span>&nbsp;handle_message(</span><span class="keyword">int</span><span>&nbsp;new_fd)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">char</span><span>&nbsp;buf[MAXBUF&nbsp;+&nbsp;</span><span class="number">1</span><span>];&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">int</span><span>&nbsp;len;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;<span class="comment">/*&nbsp;开始处理每个新连接上的数据收发&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;bzero(buf,&nbsp;MAXBUF&nbsp;+&nbsp;<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;<span class="comment">/*&nbsp;接收客户端的消息&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;len&nbsp;=&nbsp;recv(new_fd,&nbsp;buf,&nbsp;MAXBUF,&nbsp;<span class="number">0</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(len&nbsp;&gt;&nbsp;</span><span class="number">0</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span class="string">"%d&nbsp;received&nbsp;message&nbsp;OK:&nbsp;'%s',&nbsp;%d&nbsp;bytes&nbsp;in&nbsp;total\n"</span><span>,&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;new_fd,&nbsp;buf,&nbsp;len);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(len&nbsp;&lt;&nbsp;</span><span class="number">0</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(<span class="string">"Failed&nbsp;to&nbsp;receive&nbsp;message!&nbsp;Error&nbsp;code&nbsp;%d:&nbsp;'%s'\n"</span><span>,&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;errno,&nbsp;strerror(errno));&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;close(new_fd);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;-</span><span class="number">1</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">/*&nbsp;done&nbsp;handling&nbsp;data&nbsp;exchange&nbsp;on&nbsp;this&nbsp;connection&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;len;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>}&nbsp;&nbsp;</span></li>
    <li class=""><span><span class="comment">/************关于本文档********************************************</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*filename:&nbsp;epoll-server.c</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">*purpose:&nbsp;demonstrate&nbsp;how&nbsp;epoll&nbsp;handles&nbsp;a&nbsp;huge&nbsp;number&nbsp;of&nbsp;socket&nbsp;connections</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*wrote&nbsp;by:&nbsp;zhoulifa&nbsp;(zhoulifa@163.com)&nbsp;周立发&nbsp;(http://zhoulifa.bokee.com)</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">Linux&nbsp;enthusiast,&nbsp;Linux&nbsp;evangelist,&nbsp;SOHO&nbsp;developer,&nbsp;best&nbsp;at&nbsp;C</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*date&nbsp;time:2007-01-31&nbsp;21:00</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">*Note:&nbsp;anyone&nbsp;may&nbsp;copy&nbsp;and&nbsp;use&nbsp;this&nbsp;code,&nbsp;including&nbsp;for&nbsp;commercial&nbsp;purposes,</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*&nbsp;but&nbsp;please&nbsp;follow&nbsp;the&nbsp;GPL</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">*Thanks&nbsp;to:Google</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*Hope:&nbsp;may&nbsp;more&nbsp;and&nbsp;more&nbsp;people&nbsp;contribute&nbsp;their&nbsp;effort&nbsp;to&nbsp;advancing&nbsp;science&nbsp;and&nbsp;technology</span>&nbsp;</span></li>
    <li class=""><span><span class="comment">*&nbsp;standing&nbsp;on&nbsp;the&nbsp;shoulders&nbsp;of&nbsp;giants,&nbsp;technology&nbsp;moves&nbsp;faster;&nbsp;thanks&nbsp;to&nbsp;the&nbsp;open-source&nbsp;pioneers!</span>&nbsp;</span></li>
    <li class="alt"><span><span class="comment">*********************************************************************/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span><span class="keyword">int</span><span>&nbsp;main(</span><span class="keyword">int</span><span>&nbsp;argc,&nbsp;</span><span class="keyword">char</span><span>&nbsp;**argv)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">int</span><span>&nbsp;listener,&nbsp;new_fd,&nbsp;kdpfd,&nbsp;nfds,&nbsp;n,&nbsp;ret,&nbsp;curfds;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;socklen_t&nbsp;len;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;struct&nbsp;sockaddr_in&nbsp;my_addr,&nbsp;their_addr;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;unsigned&nbsp;<span class="keyword">int</span><span>&nbsp;myport,&nbsp;lisnum;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;struct&nbsp;epoll_event&nbsp;ev;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;struct&nbsp;epoll_event&nbsp;events[MAXEPOLLSIZE];&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;struct&nbsp;rlimit&nbsp;rt;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;myport&nbsp;=&nbsp;<span class="number">5000</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;lisnum&nbsp;=&nbsp;<span class="number">2</span><span>;&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;<span class="comment">/*&nbsp;设置每个进程允许打开的最大文件数&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;rt.rlim_max&nbsp;=&nbsp;rt.rlim_cur&nbsp;=&nbsp;MAXEPOLLSIZE;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(setrlimit(RLIMIT_NOFILE,&nbsp;&amp;rt)&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;perror(<span class="string">"setrlimit"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exit(<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(<span class="string">"设置系统资源参数成功！\n"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
</ol>
</div>
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span>&nbsp;&nbsp;&nbsp;&nbsp;</span><span class="comment">/*&nbsp;create&nbsp;the&nbsp;listening&nbsp;socket&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;((listener&nbsp;=&nbsp;socket(PF_INET,&nbsp;SOCK_STREAM,&nbsp;</span><span class="number">0</span><span>))&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;perror(<span class="string">"socket"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exit(<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(<span class="string">"socket&nbsp;created&nbsp;successfully!\n"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;setnonblocking(listener);&nbsp;&nbsp;</span></li>
</ol>
</div>
<div class="dp-highlighter">
<div class="bar">
</div>
<ol class="dp-j" start="1">
    <li class="alt"><span><span>&nbsp;&nbsp;&nbsp;&nbsp;bzero(&amp;my_addr,&nbsp;sizeof(my_addr));&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;my_addr.sin_family&nbsp;=&nbsp;AF_INET;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;my_addr.sin_port&nbsp;=&nbsp;htons(myport);&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;my_addr.sin_addr.s_addr&nbsp;=&nbsp;INADDR_ANY;&nbsp;&nbsp;</span></li>
</ol>
</div>
<div class="dp-highlighter">
<div class="bar">
<div class="tools">
<div class="dp-highlighter">
<ol class="dp-j" start="1">
    <li class="alt"><span><span>&nbsp;&nbsp; &nbsp;</span><span class="keyword">if</span><span>&nbsp;(bind(listener,&nbsp;(struct&nbsp;sockaddr&nbsp;*)&nbsp;&amp;my_addr,&nbsp;sizeof(struct&nbsp;sockaddr))&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;perror(<span class="string">"bind"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exit(<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(<span class="string">"IP&nbsp;address&nbsp;and&nbsp;port&nbsp;bound&nbsp;successfully\n"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(listen(listener,&nbsp;lisnum)&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;perror(<span class="string">"listen"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exit(<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(<span class="string">"listening&nbsp;started&nbsp;successfully!\n"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; }&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;<span class="comment">/*&nbsp;创建&nbsp;epoll&nbsp;句柄，把监听&nbsp;socket&nbsp;加入到&nbsp;epoll&nbsp;集合里&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;kdpfd&nbsp;=&nbsp;epoll_create(MAXEPOLLSIZE);&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;len&nbsp;=&nbsp;sizeof(struct&nbsp;sockaddr_in);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;ev.events&nbsp;=&nbsp;EPOLLIN&nbsp;|&nbsp;EPOLLET;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;ev.data.fd&nbsp;=&nbsp;listener;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(epoll_ctl(kdpfd,&nbsp;EPOLL_CTL_ADD,&nbsp;listener,&nbsp;&amp;ev)&nbsp;&lt;&nbsp;</span><span class="number">0</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; {&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fprintf(stderr,&nbsp;<span class="string">"epoll&nbsp;set&nbsp;insertion&nbsp;error:&nbsp;fd=%d\n"</span><span>,&nbsp;listener);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;-</span><span class="number">1</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;printf(<span class="string">"监听&nbsp;socket&nbsp;加入&nbsp;epoll&nbsp;成功！\n"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;curfds&nbsp;=&nbsp;<span class="number">1</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">while</span><span>&nbsp;(</span><span class="number">1</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">/*&nbsp;wait&nbsp;for&nbsp;events&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;nfds&nbsp;=&nbsp;epoll_wait(kdpfd,&nbsp;events,&nbsp;curfds,&nbsp;-<span class="number">1</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(nfds&nbsp;==&nbsp;-</span><span class="number">1</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;perror(<span class="string">"epoll_wait"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">break</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="comment">/*&nbsp;handle&nbsp;all&nbsp;events&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">for</span><span>&nbsp;(n&nbsp;=&nbsp;</span><span class="number">0</span><span>;&nbsp;n&nbsp;&lt;&nbsp;nfds;&nbsp;++n)&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; {&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(events[n].data.fd&nbsp;==&nbsp;listener)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; new_fd&nbsp;=&nbsp;accept(listener,&nbsp;(struct&nbsp;sockaddr&nbsp;*)&nbsp;&amp;their_addr,&amp;len);&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(new_fd&nbsp;&lt;&nbsp;</span><span class="number">0</span><span>)&nbsp;&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; perror(<span class="string">"accept"</span><span>);&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;<span class="keyword">continue</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}&nbsp;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="Apple-tab-span" style="white-space: pre;">	</span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;printf(<span class="string">"connection&nbsp;from&nbsp;%s:%d,&nbsp;assigned&nbsp;socket:&nbsp;%d\n"</span><span>,&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;inet_ntoa(their_addr.sin_addr),&nbsp;ntohs(their_addr.sin_port),&nbsp;new_fd);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;setnonblocking(new_fd);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ev.events&nbsp;=&nbsp;EPOLLIN&nbsp;|&nbsp;EPOLLET;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;ev.data.fd&nbsp;=&nbsp;new_fd;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="keyword">if</span><span>&nbsp;(epoll_ctl(kdpfd,&nbsp;EPOLL_CTL_ADD,&nbsp;new_fd,&nbsp;&amp;ev)&nbsp;&lt;&nbsp;</span><span class="number">0</span><span>)&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;fprintf(stderr,&nbsp;<span class="string">"failed&nbsp;to&nbsp;add&nbsp;socket&nbsp;'%d'&nbsp;to&nbsp;epoll!&nbsp;%s\n"</span><span>,&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;new_fd,&nbsp;strerror(errno));&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="keyword">return</span><span>&nbsp;-</span><span class="number">1</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;curfds++;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="keyword">else</span><span>&nbsp;&nbsp;</span></span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ret&nbsp;=&nbsp;handle_message(events[n].data.fd);&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;<span class="keyword">if</span><span>&nbsp;(ret&nbsp;&lt;&nbsp;</span><span class="number">1</span><span>&nbsp;&amp;&amp;&nbsp;errno&nbsp;!=&nbsp;EAGAIN)&nbsp;</span><span class="comment">/*&nbsp;EAGAIN:&nbsp;no&nbsp;data&nbsp;right&nbsp;now,&nbsp;keep&nbsp;the&nbsp;fd&nbsp;*/</span><span>&nbsp;&nbsp;</span></span></li>
    <li class="alt"><span>&nbsp;&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;epoll_ctl(kdpfd,&nbsp;EPOLL_CTL_DEL,&nbsp;events[n].data.fd,&amp;ev);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;curfds--;&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;</span></li>
    <li class=""><span>&nbsp;&nbsp;&nbsp;&nbsp;close(listener);&nbsp;&nbsp;</span></li>
    <li class="alt"><span>&nbsp;&nbsp;&nbsp;&nbsp;<span class="keyword">return</span><span>&nbsp;</span><span class="number">0</span><span>;&nbsp;&nbsp;</span></span></li>
    <li class=""><span>}&nbsp;&nbsp;</span></li>
</ol>
</div>
<div>
<p style="font: 12px song,verdana; text-align: left;" align="left">&nbsp;<font color="#ff0000">epoll_wait works by waiting for events on the socket fds registered on epfd; when events occur, it places the affected socket fds and the event types into the events array. If the fd was registered with EPOLLONESHOT, its event mask is cleared after delivery, so to keep watching that socket fd in the next loop you must re-set its event mask with epoll_ctl(epfd,EPOLL_CTL_MOD,listenfd,&amp;ev). Use EPOLL_CTL_MOD here, not EPOLL_CTL_ADD, because the socket fd itself is still registered; only its event mask was cleared. (Without EPOLLONESHOT, a plain EPOLLIN or EPOLLIN | EPOLLET registration stays armed across epoll_wait calls.) This step is very important.&nbsp;</font></p>
<p style="font: 12px song,verdana; text-align: left;" align="left"><font color="#ff0000"><br></font></p>
<p style="font: 12px song,verdana; text-align: left;" align="left"><font color="#ff0000">Reposted from:</font></p>
<p style="font: 12px song,verdana; text-align: left;" align="left"><font color="#ff0000"><a href="http://blog.chinaunix.net/u/17928/showart.php?id=2098011">http://blog.chinaunix.net/u/17928/showart.php?id=2098011</a></font></p>
</div>
</div>
</div>
</div>
</span> <br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-01-10 22:46 <a href="http://www.cppblog.com/beautykingdom/archive/2010/01/10/105365.html#Feedback" target="_blank" style="text-decoration:none;">post a comment</a></div>]]></description></item><item><title>Five Windows Socket I/O Models</title><link>http://www.cppblog.com/beautykingdom/archive/2010/01/09/105215.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 08 Jan 2010 16:15:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/01/09/105215.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/105215.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/01/09/105215.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/105215.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/105215.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Abstract: Winsock I/O operations: 1. Two I/O modes. Blocking mode: an I/O operation waits until it has completed and never hands control back to the program; sockets are in blocking mode by default, which can be handled with multithreading. Non-blocking mode: a Winsock function returns and yields control as soon as the I/O operation is issued; this mode is more complicated to use, because functions return before the work is finished and will repeatedly return WSAEWO...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2010/01/09/105215.html'>read the full article</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-01-09 00:15 <a href="http://www.cppblog.com/beautykingdom/archive/2010/01/09/105215.html#Feedback" target="_blank" 
style="text-decoration:none;">post a comment</a></div>]]></description></item><item><title>Source Code for Various TCP Network Servers on Linux (reposted)</title><link>http://www.cppblog.com/beautykingdom/archive/2010/01/07/105122.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 07 Jan 2010 15:22:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/01/07/105122.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/105122.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/01/07/105122.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/105122.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/105122.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Abstract: Source code for various TCP network servers on Linux. Everyone knows the steps for writing the various kinds of network server program, and that network servers fall into two broad classes: iterative services and concurrent services. Here is the source code by way of a short summary. First, the steps for implementing an iterative network server are as follows: [IMG]http://zhoulifa.bokee.com/inc/directsocket.png[/IMG]&nbsp;This server model is a typical iterative service; without multi-process/multi-thread techniques, its through...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2010/01/07/105122.html'>read the full article</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-01-07 23:22 <a href="http://www.cppblog.com/beautykingdom/archive/2010/01/07/105122.html#Feedback" target="_blank" style="text-decoration:none;">post a comment</a></div>]]></description></item><item><title>socket know-how</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/17/103435.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 17 Dec 2009 15:14:00 
GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/17/103435.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/103435.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/17/103435.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/103435.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/103435.html</trackback:ping><description><![CDATA[1.<a  href="http://www.cppblog.com/prayer/archive/2009/04/14/79900.html" id="viewpost1_TitleUrl">How to tell that a socket has been disconnected</a><br>&nbsp;&nbsp;&nbsp; On the server side, a dedicated thread handles each socket connection. When the socket connection is closed (abnormally or normally), how can the server find out? The server side is entirely passive here: it must not close the connection on its own initiative, there are no keep-alive packets on the link, and the times at which the client sends data are unpredictable. Yet once the socket connection is gone, the server has to notice and release its resources.<br>&nbsp;&nbsp;&nbsp; When select() is used to test whether a socket is readable: if select() returns 1 but a recv() on that socket then reads 0 bytes of data, the socket has been disconnected.<br>&nbsp;&nbsp;&nbsp; For a more reliable check, treat recv() returning a value less than or equal to 0 as a disconnect, but also check whether errno equals EINTR. If errno == EINTR, recv() returned because the program received a signal; the socket connection is still healthy and should not be closed.<br><br>PS: a recv() on a blocking socket returns in the following cases:<br>(1) when data arrives, it returns;<br>(2) when the program receives a signal, it returns -1 with <strong>errno = EINTR</strong> // except for signals masked at program startup; some signals cannot be masked at all;<br>(3) when something goes wrong with the socket, it returns -1; see man recv() for the specific error codes;<br>(4) always read the man page; it is detailed and very helpful.<br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-17 23:14 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/17/103435.html#Feedback" target="_blank" style="text-decoration:none;">post a comment</a></div>]]></description></item><item><title>An Analysis of an HTTP Client Implementation on Linux</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/07/102759.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 07 Dec 2009 15:12:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/07/102759.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/102759.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/07/102759.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/102759.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/102759.html</trackback:ping><description><![CDATA[<p>The program was downloaded from <a href="http://zhoulifa.bokee.com/4640913.html">http://zhoulifa.bokee.com/4640913.html</a> and trimmed a little to make it clearer.<br>#include &lt;stdio.h&gt;<br>#include &lt;stdlib.h&gt;<br>#include &lt;string.h&gt;<br>#include &lt;sys/types.h&gt;<br>#include &lt;sys/socket.h&gt;<br>#include &lt;errno.h&gt;<br>#include &lt;unistd.h&gt;<br>#include &lt;netinet/in.h&gt;<br>#include &lt;limits.h&gt;<br>#include &lt;netdb.h&gt;<br>#include &lt;arpa/inet.h&gt;<br>#include &lt;ctype.h&gt;</p>
<p>int main(int argc, char *argv[])<br>{<br>int sockfd;<br>char buffer[1024];<br>struct sockaddr_in server_addr;<br>struct hostent *host;<br>int portnumber,nbytes;<br>char host_addr[256];<br>char host_file[1024];<br>char local_file[256];<br>FILE * fp;<br>char request[1024];<br>int send, totalsend;<br>int i;<br>char * pt;</p>
<p>if(argc!=2)<br>{<br>&nbsp;&nbsp;&nbsp; fprintf(stderr,"Usage:%s web-address\a\n",argv[0]);<br>&nbsp;&nbsp;&nbsp; exit(1);<br>}<br>portnumber=80;<br>strcpy(host_addr,argv[1]);<br>if((host=gethostbyname(argv[1]))==NULL)/* resolve the host IP address */<br>{<br>&nbsp;&nbsp;&nbsp; fprintf(stderr,"Gethostname error, %s\n", strerror(errno));<br>&nbsp;&nbsp;&nbsp; exit(1);<br>}<br>if((sockfd=socket(AF_INET,SOCK_STREAM,0))==-1)/* create the socket */<br>{<br>&nbsp;&nbsp;&nbsp; fprintf(stderr,"Socket Error:%s\a\n",strerror(errno));<br>&nbsp;&nbsp;&nbsp; exit(1);<br>}<br>/* the client fills in the server's address data */<br>bzero(&amp;server_addr,sizeof(server_addr));<br>server_addr.sin_family=AF_INET;<br>server_addr.sin_port=htons(portnumber);<br>server_addr.sin_addr=*((struct in_addr *)host-&gt;h_addr);</p>
<p>/* the client initiates the connection */<br>if(connect(sockfd,(struct sockaddr *)(&amp;server_addr),sizeof(struct sockaddr))==-1)/* connect to the site */<br>{<br>&nbsp;&nbsp;&nbsp; fprintf(stderr,"Connect Error:%s\a\n",strerror(errno));<br>&nbsp;&nbsp;&nbsp; exit(1);<br>}</p>
<p>sprintf(request, "GET /%s HTTP/1.1\r\nAccept: */*\r\nAccept-Language: zh-cn\r\n\<br>User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n\<br>Host: %s:%d\r\nConnection: Close\r\n\r\n", host_file, host_addr, portnumber);<br>printf("%s", request);/* the request that will be sent to the host */</p>
<p>/* the local file name to save to */<br>strcpy(local_file, "index.html");<br>/* send the http request */<br>send = 0;totalsend = 0;<br>nbytes=strlen(request);<br>while(totalsend &lt; nbytes) {<br>&nbsp;&nbsp;&nbsp; send = write(sockfd, request + totalsend, nbytes - totalsend);<br>&nbsp;&nbsp;&nbsp; if(send==-1) {printf("send error!%s\n", strerror(errno));exit(0);}<br>&nbsp;&nbsp;&nbsp; totalsend+=send;<br>&nbsp;&nbsp;&nbsp; printf("%d bytes send OK!\n", totalsend);<br>}</p>
<p>fp = fopen(local_file, "a");<br>if(!fp) {<br>&nbsp;&nbsp;&nbsp; printf("create file error! %s\n", strerror(errno));<br>&nbsp;&nbsp;&nbsp; return 0;<br>}<br>printf("\nThe following is the response header:\n");<br>i=0;<br>/* connected; now receive the http response */<br>while((nbytes=read(sockfd,buffer,1))==1)<br>{<br>&nbsp;&nbsp;&nbsp; if(i &lt; 4) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(buffer[0] == '\r' || buffer[0] == '\n') i++;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else i = 0;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("%c", buffer[0]);/* print the http header to the screen */<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; else {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fwrite(buffer, 1, 1, fp);/* write the http body to the file */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i++;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(i%1024 == 0) fflush(fp);/* flush to disk every 1 KB */<br>&nbsp;&nbsp;&nbsp; }<br>}<br>fclose(fp);<br>/* end the session */<br>close(sockfd);<br>exit(0);<br>}</p>
<p><br>1&nbsp;&nbsp;&nbsp; struct hostent *gethostbyname(const char *name);</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp; This function translates between a domain name and an IP address; it returns:</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct hostent { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; char *h_name;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* official domain name of the host */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; char **h_aliases;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* NULL-terminated array of host aliases */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int h_addrtype;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* type of address returned; AF_INET in an Internet environment */ <br>&nbsp;&nbsp;&nbsp;&nbsp; int h_length;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* length of each address in bytes */ <br>&nbsp;&nbsp;&nbsp;&nbsp; char **h_addr_list;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* NULL-terminated array containing all addresses of the host */&nbsp; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };&nbsp; <br>&nbsp;&nbsp;&nbsp;&nbsp; #define h_addr h_addr_list[0]&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* the first address in h_addr_list */</p>
<p>&nbsp;&nbsp;&nbsp; In practice we usually only use the first address.</p>
<p>2&nbsp;&nbsp; Next come creating, binding, and connecting the SOCKET. To connect, the raw IP address obtained above is not enough; we need the server's full address, whose data structure is as follows:</p>
<p>&nbsp;&nbsp; struct sockaddr_in { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; short int sin_family;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* address family */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned short int sin_port;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* port number */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct in_addr sin_addr;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* IP address (the one obtained above) */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned char sin_zero[8];&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* zero padding to keep the same size as struct sockaddr */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };</p>
<p>&nbsp;&nbsp; It is also worth mentioning struct sockaddr, which describes generic socket address information; it has the same size as the structure above, and the two can be converted to each other.</p>
<p>&nbsp;&nbsp; struct sockaddr { <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; unsigned short sa_family; /* address family, AF_xxx */ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; char sa_data[14]; /* 14 bytes of protocol address */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p>
<p>3 Once we are connected to the server, we can send it our request: </p>
<p>&nbsp;&nbsp;&nbsp;&nbsp; write(sockfd, char *, size);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the content being: GET /%s HTTP/1.1\r\nAccept: */*\r\nAccept-Language: zh-cn\r\n\User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n\<br>Host: %s:%d\r\nConnection: Close\r\n\r\n</p>
<p>&nbsp;&nbsp; The exact format is defined by the HTTP protocol; I am not entirely clear on the details myself.</p>
<p>4 When the server responds, it sends back a header plus the actual page content, the two being separated by four consecutive characters ("\r" or "\n").<br><br>Reposted from:<br><a href="http://blog.chinaunix.net/u2/76292/showart_1335922.html">http://blog.chinaunix.net/u2/76292/showart_1335922.html</a></p>
<br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-07 23:12 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/07/102759.html#Feedback" target="_blank" style="text-decoration:none;">post a comment</a></div>]]></description></item><item><title>A Brief Introduction to Windows Completion Ports and Linux epoll</title><link>http://www.cppblog.com/beautykingdom/archive/2009/11/24/101814.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Tue, 24 Nov 2009 07:28:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/11/24/101814.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/101814.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/11/24/101814.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/101814.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/101814.html</trackback:ping><description><![CDATA[<span style="border-collapse: collapse; font-family: song,Verdana; font-size: 12px;">
<p style="font: 12px song,Verdana;"><font size="3"><strong>Windows completion port programming<br style="font: 12px song,Verdana;"></strong>1. Basic concepts<br style="font: 12px song,Verdana;">2. Characteristics of Windows completion ports<br style="font: 12px song,Verdana;">3. Completion port (Completion Ports) data structures and creation<br style="font: 12px song,Verdana;">4. How completion port threads work<br style="font: 12px song,Verdana;">5. Example code for Windows completion ports<br style="font: 12px song,Verdana;"><strong>The Linux epoll model<br style="font: 12px song,Verdana;"></strong>1. Why select lags behind<br style="font: 12px song,Verdana;">2. epoll, the kernel's new way of improving I/O performance<br style="font: 12px song,Verdana;">3. Advantages of epoll<br style="font: 12px song,Verdana;">4. epoll's working modes&nbsp;<br style="font: 12px song,Verdana;">5. How to use epoll<br style="font: 12px song,Verdana;">6. An epoll programming example on Linux<br style="font: 12px song,Verdana;"><strong>Summary</strong></font></p>
<p style="font: 12px song,Verdana;"><font size="3"><strong>WINDOWS完成端口编程<br style="font: 12px song,Verdana;"></strong>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 摘要：开发网络程序从来都不是一件容易的事情，尽管只需要遵守很少的一些规则：创建socket、发起连接、接受连接、发送和接收数据。真正的困难在于：让你的程序可以适应从单单一个连接到几千个连接乃至上万个连接。利用Windows平台完成端口进行重叠I/O的技术，和Linux在2.6版本内核中引入的epoll技术，可以很方便地在Windows和Linux平台上开发出支持大量连接的网络服务程序。本文介绍在Windows和Linux平台上使用的完成端口和epoll模型开发的基本原理，同时给出实际的例子。本文主要关注C/S结构的服务器端程序，因为一般来说，开发一个大容量、具可扩展性的winsock程序一般就是指服务程序。<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">1、基本概念<br style="font: 12px song,Verdana;"></strong>&nbsp;&nbsp;&nbsp; 设备---Windows操作系统上允许通信的任何东西，比如文件、目录、串行口、并行口、邮件槽、命名管道、无名管道、套接字、控制台、逻辑磁盘、物理磁盘等。绝大多数与设备打交道的函数都是CreateFile/ReadFile/WriteFile等，所以我们不能看到**File函数就只想到文件设备。与设备通信有两种方式：同步方式和异步方式。同步方式下，当调用ReadFile函数时，函数会等待系统执行完所要求的工作，然后才返回；异步方式下，ReadFile这类函数会直接返回，系统自己去完成对设备的操作，然后以某种方式通知操作完成。<br style="font: 12px song,Verdana;">重叠I/O----顾名思义，当你调用了某个函数（比如ReadFile）就立刻返回做自己的其他动作的时候，系统同时也在对I/O设备进行你要求的操作，在这段时间内你的程序和系统的内部动作是重叠的，因此有更好的性能。所以，重叠I/O是用于异步方式下使用I/O设备的。重叠I/O需要使用一个非常重要的数据结构：OVERLAPPED。<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">2、WINDOWS完成端口的特点<br style="font: 12px song,Verdana;"></strong>&nbsp;&nbsp; Win32重叠I/O(Overlapped I/O)机制允许发起一个操作，然后在操作完成之后接收到通知。对于那种需要很长时间才能完成的操作来说，重叠I/O机制尤其有用，因为发起重叠操作的线程在重叠请求发出后就可以自由地做别的事情了。在WinNT和Win2000上，提供的真正可扩展的I/O模型就是使用完成端口（Completion Port）的重叠I/O。完成端口---是一种WINDOWS内核对象。完成端口用于异步方式的重叠I/O情况下，当然重叠I/O不一定非使用完成端口不可，还可以使用设备内核对象、事件对象、告警I/O等。但是完成端口内部提供了线程池的管理，可以避免反复创建线程的开销，同时可以根据CPU的个数灵活地决定线程个数，而且可以减少线程调度的次数从而提高性能。其实，类似于WSAAsyncSelect和select函数的机制更容易兼容Unix，但是难以实现我们想要的&#8220;扩展性&#8221;；而且Windows的完成端口机制在操作系统内部已经作了优化，提供了更高的效率。所以，我们选择完成端口开始我们的服务器程序的开发。<br style="font: 12px song,Verdana;">1、发起操作不一定马上完成，系统会在完成的时候通知你，通过用户在完成端口上的等待，处理操作的结果。所以要有检查完成端口、取操作结果的线程。系统对在完成端口上守候的线程做了优化：除非正在执行的线程阻塞，否则不会有新的线程被激活，以此来减少线程切换造成的性能代价。所以如果程序中没有太多的阻塞操作，没有必要启动太多的线程，一般按CPU数量的两倍来启动线程。<br style="font: 12px 
song,Verdana;">2、操作与相关数据的绑定方式：在提交数据的时候用户对数据打相应的标记，记录操作的类型，在用户处理操作结果的时候，通过检查自己打的标记和系统的操作结果进行相应的处理。&nbsp;<br style="font: 12px song,Verdana;">3、操作返回的方式:一般操作完成后要通知程序进行后续处理。但写操作可以不通知用户，此时如果用户写操作不能马上完成，写操作的相关数据会被暂存到到非 交换缓冲区中，在操作完成的时候，系统会自动释放缓冲区。此时发起完写操作，使用的内存就可以释放了。此时如果占用非交换缓冲太多会使系统停止响应。<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">3、完成端口（Completion Ports ）相关数据结构和创建<br style="font: 12px song,Verdana;"></strong>&nbsp;&nbsp;&nbsp; 其实可以把完成端口看成系统维护的一个队列，操作系统把重叠IO操作完成的事件通知放到该队列里，由于是暴露 &#8220;操作完成&#8221;的事件通知，所以命名为&#8220;完成端口&#8221;（COmpletion Ports）。一个socket被创建后，可以在任何时刻和一个完成端口联系起来。<br style="font: 12px song,Verdana;">完成端口相关最重要的是OVERLAPPED数据结构<br style="font: 12px song,Verdana;">typedef struct _OVERLAPPED {&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; ULONG_PTR Internal;//被系统内部赋值，用来表示系统状态&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; ULONG_PTR InternalHigh;// 被系统内部赋值，传输的字节数&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; union {&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct {&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DWORD Offset;//和OffsetHigh合成一个64位的整数，用来表示从文件头部的多少字节开始&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DWORD OffsetHigh;//操作，如果不是对文件I/O来操作，则必须设定为0&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; PVOID Pointer;&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; };&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; HANDLE hEvent;//如果不使用，就务必设为0,否则请赋一个有效的Event句柄&nbsp;<br style="font: 12px song,Verdana;">} OVERLAPPED, *LPOVERLAPPED;&nbsp;<br style="font: 12px song,Verdana;"><br style="font: 12px song,Verdana;">下面是异步方式使用ReadFile的一个例子&nbsp;<br style="font: 12px 
song,Verdana;">OVERLAPPED Overlapped;&nbsp;<br style="font: 12px song,Verdana;">Overlapped.Offset=345;&nbsp;<br style="font: 12px song,Verdana;">Overlapped.OffsetHigh=0;&nbsp;<br style="font: 12px song,Verdana;">Overlapped.hEvent=0;&nbsp;<br style="font: 12px song,Verdana;">//假定其他参数都已经被初始化&nbsp;<br style="font: 12px song,Verdana;">ReadFile(hFile,buffer,sizeof(buffer),&amp;dwNumBytesRead,&amp;Overlapped);&nbsp;<br style="font: 12px song,Verdana;">这样就完成了异步方式读文件的操作，然后ReadFile函数返回，由操作系统做自己的事情，下面介绍几个与OVERLAPPED结构相关的函数&nbsp;<br style="font: 12px song,Verdana;">等待重叠I/0操作完成的函数&nbsp;<br style="font: 12px song,Verdana;">BOOL GetOverlappedResult (<br style="font: 12px song,Verdana;">HANDLE hFile,<br style="font: 12px song,Verdana;">LPOVERLAPPED lpOverlapped,//接受返回的重叠I/0结构<br style="font: 12px song,Verdana;">LPDWORD lpcbTransfer,//成功传输了多少字节数<br style="font: 12px song,Verdana;">BOOL fWait //TRUE只有当操作完成才返回，FALSE直接返回，如果操作没有完成，通过调//用GetLastError ( )函数会返回ERROR_IO_INCOMPLETE&nbsp;<br style="font: 12px song,Verdana;">);<br style="font: 12px song,Verdana;">宏HasOverlappedIoCompleted可以帮助我们测试重叠I/0操作是否完成，该宏对OVERLAPPED结构的Internal成员进行了测试，查看是否等于STATUS_PENDING值。</font></p>
<p style="font: 12px song,Verdana;"><font size="3">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 一般来说，一个应用程序可以创建多个工作线程来处理完成端口上的通知事件。工作线程的数量依赖于程序的具体需要。但是在理想的情况下，应该对应一个CPU 创建一个线程。因为在完成端口理想模型中，每个线程都可以从系统获得一个&#8220;原子&#8221;性的时间片，轮番运行并检查完成端口，线程的切换是额外的开销。在实际开 发的时候，还要考虑这些线程是否牵涉到其他堵塞操作的情况。如果某线程进行堵塞操作，系统则将其挂起，让别的线程获得运行时间。因此，如果有这样的情况， 可以多创建几个线程来尽量利用时间。<br style="font: 12px song,Verdana;">应用完成端口：<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; 创建完成端口：完成端口是一个内核对象，使用时他总是要和至少一个有效的设备句柄进行关联，完成端口是一个复杂的内核对象，创建它的函数是：<br style="font: 12px song,Verdana;">HANDLE CreateIoCompletionPort(&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN HANDLE FileHandle,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN HANDLE ExistingCompletionPort,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN ULONG_PTR CompletionKey,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN DWORD NumberOfConcurrentThreads&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; );&nbsp;<br style="font: 12px song,Verdana;"><br style="font: 12px song,Verdana;">通常创建工作分两步：<br style="font: 12px song,Verdana;">第一步，创建一个新的完成端口内核对象，可以使用下面的函数：<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HANDLE CreateNewCompletionPort(DWORD dwNumberOfThreads)&nbsp;<br style="font: 12px song,Verdana;">{&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; return CreateIoCompletionPort(INVALID_HANDLE_VALUE,NULL,NULL,dwNumberOfThreads);<br style="font: 12px song,Verdana;">};<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br style="font: 12px song,Verdana;">第二步，将刚创建的完成端口和一个有效的设备句柄关联起来，可以使用下面的函数：<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; bool AssicoateDeviceWithCompletionPort(HANDLE hCompPort,HANDLE hDevice,DWORD dwCompKey)&nbsp;<br style="font: 12px song,Verdana;">{&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; 
HANDLE h=CreateIoCompletionPort(hDevice,hCompPort,dwCompKey,0);&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; return h==hCompPort;&nbsp;<br style="font: 12px song,Verdana;">};&nbsp;<br style="font: 12px song,Verdana;">说明&nbsp;<br style="font: 12px song,Verdana;">1） CreateIoCompletionPort函数也可以一次性的既创建完成端口对象，又关联到一个有效的设备句柄&nbsp;<br style="font: 12px song,Verdana;">2） CompletionKey是一个可以自己定义的参数，我们可以把一个结构的地址赋给它，然后在合适的时候取出来使用，最好要保证结构里面的内存不是分配在栈上，除非你有十分的把握内存会保留到你要使用的那一刻。<br style="font: 12px song,Verdana;">3） NumberOfConcurrentThreads通常用来指定要允许同时运行的的线程的最大个数。通常我们指定为0，这样系统会根据CPU的个数来自 动确定。创建和关联的动作完成后，系统会将完成端口关联的设备句柄、完成键作为一条纪录加入到这个完成端口的设备列表中。如果你有多个完成端口，就会有多 个对应的设备列表。如果设备句柄被关闭，则表中自动删除该纪录。<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">4、完成端口线程的工作原理</strong><br style="font: 12px song,Verdana;">完成端口可以帮助我们管理线程池，但是线程池中的线程需要我们使用_beginthreadex来创建，凭什么通知完成端口管理我们的新线程呢？答案在函数GetQueuedCompletionStatus。该函数原型：&nbsp;<br style="font: 12px song,Verdana;">BOOL GetQueuedCompletionStatus(&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN HANDLE CompletionPort,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; OUT LPDWORD lpNumberOfBytesTransferred,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; OUT PULONG_PTR lpCompletionKey,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; OUT LPOVERLAPPED *lpOverlapped,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; IN DWORD dwMilliseconds&nbsp;<br style="font: 12px song,Verdana;">);&nbsp;<br style="font: 12px song,Verdana;">这个函数试图从指定的完成端口的I/0完成队列中抽取纪录。只有当重叠I/O动作完成的时候，完成队列中才有纪录。凡是调用这个函数的线程将被放入到完成 端口的等待线程队列中，因此完成端口就可以在自己的线程池中帮助我们维护这个线程。完成端口的I/0完成队列中存放了当重叠I/0完成的结果---- 一条纪录，该纪录拥有四个字段，前三项就对应GetQueuedCompletionStatus函数的2、3、4参数，最后一个字段是错误信息 dwError。我们也可以通过调用PostQueudCompletionStatus模拟完成了一个重叠I/0操作。&nbsp;<br style="font: 12px song,Verdana;">当I/0完成队列中出现了纪录，完成端口将会检查等待线程队列，该队列中的线程都是通过调用GetQueuedCompletionStatus函数使自 
己加入队列的。等待线程队列很简单，只是保存了这些线程的ID。完成端口会按照后进先出的原则将一个线程队列的ID放入到释放线程列表中，同时该线程将从 等待GetQueuedCompletionStatus函数返回的睡眠状态中变为可调度状态等待CPU的调度。所以我们的线程要想成为完成端口管理的线 程，就必须要调用GetQueuedCompletionStatus函数。出于性能的优化，实际上完成端口还维护了一个暂停线程列表，具体细节可以参考 《Windows高级编程指南》，我们现在知道的知识，已经足够了。 完成端口线程间数据传递线程间传递数据最常用的办法是在_beginthreadex函数中将参数传递给线程函数，或者使用全局变量。但是完成端口还有自 己的传递数据的方法，答案就在于CompletionKey和OVERLAPPED参数。<br style="font: 12px song,Verdana;">CompletionKey被保存在完成端口的设备表中，是和设备句柄一一对应的，我们可以将与设备句柄相关的数据保存到CompletionKey中， 或者将CompletionKey表示为结构指针，这样就可以传递更加丰富的内容。这些内容只能在一开始关联完成端口和设备句柄的时候做，因此不能在以后 动态改变。<br style="font: 12px song,Verdana;">OVERLAPPED参数是在每次调用ReadFile这样的支持重叠I/0的函数时传递给完成端口的。我们可以看到，如果我们不是对文件设备做操作，该 结构的成员变量就对我们几乎毫无作用。我们需要附加信息，可以创建自己的结构，然后将OVERLAPPED结构变量作为我们结构变量的第一个成员，然后传 递第一个成员变量的地址给ReadFile函数。因为类型匹配，当然可以通过编译。当GetQueuedCompletionStatus函数返回时，我 们可以获取到第一个成员变量的地址，然后一个简单的强制转换，我们就可以把它当作完整的自定义结构的指针使用，这样就可以传递很多附加的数据了。太好了！ 只有一点要注意，如果跨线程传递，请注意将数据分配到堆上，并且接收端应该将数据用完后释放。我们通常需要将ReadFile这样的异步函数的所需要的缓 冲区放到我们自定义的结构中，这样当GetQueuedCompletionStatus被返回时，我们的自定义结构的缓冲区变量中就存放了I/0操作的 数据。CompletionKey和OVERLAPPED参数，都可以通过GetQueuedCompletionStatus函数获得。<br style="font: 12px song,Verdana;">线程的安全退出<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 很多线程为了不止一次的执行异步数据处理，需要使用如下语句<br style="font: 12px song,Verdana;">while (true)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ......<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; GetQueuedCompletionStatus(...);&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ......<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">那么如何退出呢，答案就在于上面曾提到的PostQueudCompletionStatus函数，我们可以用它发送一个自定义的包含了OVERLAPPED成员变量的结构地址，里面包含一个状态变量，当状态变量为退出标志时，线程就执行清除动作然后退出。<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">5、Windows完成端口的实例代码：<br style="font: 12px song,Verdana;"></strong>DWORD WINAPI WorkerThread(LPVOID lpParam)<br 
style="font: 12px song,Verdana;">{&nbsp;<br style="font: 12px song,Verdana;">ULONG_PTR *PerHandleKey;<br style="font: 12px song,Verdana;">OVERLAPPED *Overlap;<br style="font: 12px song,Verdana;">OVERLAPPEDPLUS *OverlapPlus,<br style="font: 12px song,Verdana;">*newolp;<br style="font: 12px song,Verdana;">DWORD dwBytesXfered;<br style="font: 12px song,Verdana;">while (1)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">ret = GetQueuedCompletionStatus(<br style="font: 12px song,Verdana;">hIocp,<br style="font: 12px song,Verdana;">&amp;dwBytesXfered,<br style="font: 12px song,Verdana;">(PULONG_PTR)&amp;PerHandleKey,<br style="font: 12px song,Verdana;">&amp;Overlap,<br style="font: 12px song,Verdana;">INFINITE);<br style="font: 12px song,Verdana;">if (ret == 0)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">// Operation failed<br style="font: 12px song,Verdana;">continue;<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">OverlapPlus = CONTAINING_RECORD(Overlap, OVERLAPPEDPLUS, ol);<br style="font: 12px song,Verdana;">switch (OverlapPlus-&gt;OpCode)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">case OP_ACCEPT:<br style="font: 12px song,Verdana;">// Client socket is contained in OverlapPlus.sclient<br style="font: 12px song,Verdana;">// Add client to completion port<br style="font: 12px song,Verdana;">CreateIoCompletionPort(<br style="font: 12px song,Verdana;">(HANDLE)OverlapPlus-&gt;sclient,<br style="font: 12px song,Verdana;">hIocp,<br style="font: 12px song,Verdana;">(ULONG_PTR)0,<br style="font: 12px song,Verdana;">0);<br style="font: 12px song,Verdana;">// Need a new OVERLAPPEDPLUS structure<br style="font: 12px song,Verdana;">// for the newly accepted socket. 
Perhaps<br style="font: 12px song,Verdana;">// keep a look aside list of free structures.<br style="font: 12px song,Verdana;">newolp = AllocateOverlappedPlus();<br style="font: 12px song,Verdana;">if (!newolp)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">// Error<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">newolp-&gt;s = OverlapPlus-&gt;sclient;<br style="font: 12px song,Verdana;">newolp-&gt;OpCode = OP_READ;<br style="font: 12px song,Verdana;">// This function prepares the data to be sent<br style="font: 12px song,Verdana;">PrepareSendBuffer(&amp;newolp-&gt;wbuf);<br style="font: 12px song,Verdana;">ret = WSASend(<br style="font: 12px song,Verdana;">newolp-&gt;s,<br style="font: 12px song,Verdana;">&amp;newolp-&gt;wbuf,<br style="font: 12px song,Verdana;">1,<br style="font: 12px song,Verdana;">&amp;newolp-&gt;dwBytes,<br style="font: 12px song,Verdana;">0,<br style="font: 12px song,Verdana;">&amp;newolp-&gt;ol,<br style="font: 12px song,Verdana;">NULL);<br style="font: 12px song,Verdana;">if (ret == SOCKET_ERROR)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">if (WSAGetLastError() != WSA_IO_PENDING)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">// Error<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">// Put structure in look aside list for later use<br style="font: 12px song,Verdana;">FreeOverlappedPlus(OverlapPlus);<br style="font: 12px song,Verdana;">// Signal accept thread to issue another AcceptEx<br style="font: 12px song,Verdana;">SetEvent(hAcceptThread);<br style="font: 12px song,Verdana;">break;<br style="font: 12px song,Verdana;">case OP_READ:<br style="font: 12px song,Verdana;">// Process the data read&nbsp;<br style="font: 12px song,Verdana;">// Repost the read if necessary, reusing the same<br style="font: 12px song,Verdana;">// receive buffer as before<br style="font: 12px 
song,Verdana;">memset(&amp;OverlapPlus-&gt;ol, 0, sizeof(OVERLAPPED));<br style="font: 12px song,Verdana;">ret = WSARecv(<br style="font: 12px song,Verdana;">OverlapPlus-&gt;s,<br style="font: 12px song,Verdana;">&amp;OverlapPlus-&gt;wbuf,<br style="font: 12px song,Verdana;">1,<br style="font: 12px song,Verdana;">&amp;OverlapPlus-&gt;dwBytes,<br style="font: 12px song,Verdana;">&amp;OverlapPlus-&gt;dwFlags,<br style="font: 12px song,Verdana;">&amp;OverlapPlus-&gt;ol,<br style="font: 12px song,Verdana;">NULL);<br style="font: 12px song,Verdana;">if (ret == SOCKET_ERROR)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">if (WSAGetLastError() != WSA_IO_PENDING)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">// Error<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">break;<br style="font: 12px song,Verdana;">case OP_WRITE:<br style="font: 12px song,Verdana;">// Process the data sent, etc.<br style="font: 12px song,Verdana;">break;<br style="font: 12px song,Verdana;">} // switch<br style="font: 12px song,Verdana;">} // while<br style="font: 12px song,Verdana;">} // WorkerThread<br style="font: 12px song,Verdana;">　</font></p>
<p style="font: 12px song,Verdana;"><font size="3">查看以上代码，注意如果Overlapped操作立刻失败（比如，返回SOCKET_ERROR或其他非WSA_IO_PENDING的错误），则不会有任何完成通知事件被放到完成端口队列里；反之，则一定有相应的完成通知事件被放到完成端口队列。关于Winsock完成端口机制更完整的资料，可以参考MSDN的Microsoft Platform SDK，那里有完成端口的例子。访问<a style="color: #0044b6; text-decoration: underline;" href="http://msdn.microsoft.com/library/techart/msdn_servrapp.htm">http://msdn.microsoft.com/library/techart/msdn_servrapp.htm</a>可以获得更多信息。</font></p>
<p style="font: 12px song,Verdana;"><font size="3"><strong>Linux的EPoll模型<br style="font: 12px song,Verdana;"></strong>Linux 2.6内核中提高网络I/O性能的新方法-epoll I/O多路复用技术在比较多的TCP网络服务器中有使用，即比较多的用到select函数。<br style="font: 12px song,Verdana;"><br style="font: 12px song,Verdana;"><strong>1、为什么select落后<br style="font: 12px song,Verdana;"></strong>首先，在Linux内核中，select所用到的FD_SET是有限的，即内核中有个参数__FD_SETSIZE定义了每个FD_SET的句柄个数，在我用的2.6.15-25-386内核中，该值是1024，搜索内核源代码得到：<br style="font: 12px song,Verdana;">include/linux/posix_types.h:#define __FD_SETSIZE&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1024<br style="font: 12px song,Verdana;">也就是说，如果想要同时检测1025个句柄的可读状态是不可能用select实现的。或者同时检测1025个句柄的可写状态也是不可能的。其次，内核中实 现select是用轮询方法，即每次检测都会遍历所有FD_SET中的句柄，显然，select函数执行时间与FD_SET中的句柄个数有一个比例关系， 即select要检测的句柄数越多就会越费时。当然，在前文中我并没有提及poll方法，事实上用select的朋友一定也试过poll，我个人觉得 select和poll大同小异，个人偏好于用select而已。</font></p>
<p style="font: 12px song,Verdana;"><font size="3"><strong>2、内核中提高I/O性能的新方法epoll</strong><br style="font: 12px song,Verdana;">epoll是什么？按照man手册的说法：是为处理大批量句柄而作了改进的poll。要使用epoll只需要这三个系统调用：epoll_create(2)， epoll_ctl(2)， epoll_wait(2)。<br style="font: 12px song,Verdana;">当然，这不是2.6内核才有的，它是在2.5.44内核中被引进的(epoll(4) is a new API introduced in Linux kernel 2.5.44)</font></p>
<p style="font: 12px song,Verdana;"><font size="3">Linux2.6内核epoll介绍<br style="font: 12px song,Verdana;">先介绍2本书《The Linux Networking Architecture--Design and Implementation of Network Protocols in the Linux Kernel》，以2.4内核讲解Linux TCP/IP实现，相当不错.作为一个现实世界中的实现，很多时候你必须作很多权衡，这时候参考一个久经考验的系统更有实际意义。举个例子,linux内 核中sk_buff结构为了追求速度和安全，牺牲了部分内存，所以在发送TCP包的时候，无论应用层数据多大,sk_buff最小也有272的字节.其实 对于socket应用层程序来说，另外一本书《UNIX Network Programming Volume 1》意义更大一点.2003年的时候，这本书出了最新的第3版本，不过主要还是修订第2版本。其中第6章《I/O Multiplexing》是最重要的。Stevens给出了网络IO的基本模型。在这里最重要的莫过于select模型和Asynchronous I/O模型.从理论上说，AIO似乎是最高效的，你的IO操作可以立即返回，然后等待os告诉你IO操作完成。但是一直以来，如何实现就没有一个完美的方 案。最著名的windows完成端口实现的AIO,实际上也是内部用线程池实现的罢了，最后的结果是IO有个线程池，你应用也需要一个线程池...... 很多文档其实已经指出了这带来的线程context-switch带来的代价。在linux 平台上，关于网络AIO一直是改动最多的地方，2.4的年代就有很多AIO内核patch,最著名的应该算是SGI那个。但是一直到2.6内核发布，网络 模块的AIO一直没有进入稳定内核版本(大部分都是使用用户线程模拟方法，在使用了NPTL的linux上面其实和windows的完成端口基本上差不多 了)。2.6内核所支持的AIO特指磁盘的AIO---支持io_submit(),io_getevents()以及对Direct IO的支持(就是绕过VFS系统buffer直接写硬盘，对于流服务器在内存平稳性上有相当帮助)。<br style="font: 12px song,Verdana;">所以，剩下的select模型基本上就是我们在linux上面的唯一选择，其实，如果加上no-block socket的配置，可以完成一个"伪"AIO的实现，只不过推动力在于你而不是os而已。不过传统的select/poll函数有着一些无法忍受的缺 点，所以改进一直是2.4-2.5开发版本内核的任务，包括/dev/poll，realtime signal等等。最终，Davide Libenzi开发的epoll进入2.6内核成为正式的解决方案<br style="font: 12px song,Verdana;"><strong><br style="font: 12px song,Verdana;">3、epoll的优点</strong><br style="font: 12px song,Verdana;">&lt;1&gt;支持一个进程打开大数目的socket描述符(FD)<br style="font: 12px song,Verdana;">select 最不能忍受的是一个进程所打开的FD是有一定限制的，由FD_SETSIZE设置，默认值是2048。对于那些需要支持的上万连接数目的IM服务器来说显 然太少了。这时候你一是可以选择修改这个宏然后重新编译内核，不过资料也同时指出这样会带来网络效率的下降，二是可以选择多进程的解决方案(传统的 Apache方案)，不过虽然linux上面创建进程的代价比较小，但仍旧是不可忽视的，加上进程间数据同步远比不上线程间同步的高效，所以也不是一种完 美的方案。不过 epoll则没有这个限制，它所支持的FD上限是最大可以打开文件的数目，这个数字一般远大于2048,举个例子,在1GB内存的机器上大约是10万左 右，具体数目可以cat /proc/sys/fs/file-max察看,一般来说这个数目和系统内存关系很大。<br style="font: 12px song,Verdana;">&lt;2&gt;IO效率不随FD数目增加而线性下降<br style="font: 12px song,Verdana;">传统的select/poll另一个致命弱点就是当你拥有一个很大的socket集合，不过由于网络延时，任一时间只有部分的socket是"活跃"的， 
但是select/poll每次调用都会线性扫描全部的集合，导致效率呈现线性下降。但是epoll不存在这个问题，它只会对"活跃"的socket进行 操作---这是因为在内核实现中epoll是根据每个fd上面的callback函数实现的。那么，只有"活跃"的socket才会主动的去调用 callback函数，其他idle状态socket则不会，在这点上，epoll实现了一个"伪"AIO，因为这时候推动力在os内核。在一些 benchmark中，如果所有的socket基本上都是活跃的---比如一个高速LAN环境，epoll并不比select/poll有什么效率，相 反，如果过多使用epoll_ctl,效率相比还有稍微的下降。但是一旦使用idle connections模拟WAN环境,epoll的效率就远在select/poll之上了。<br style="font: 12px song,Verdana;">&lt;3&gt;使用mmap加速内核与用户空间的消息传递。<br style="font: 12px song,Verdana;">这点实际上涉及到epoll的具体实现了。无论是select,poll还是epoll都需要内核把FD消息通知给用户空间，如何避免不必要的内存拷贝就 很重要，在这点上，epoll是通过内核于用户空间mmap同一块内存实现的。而如果你想我一样从2.5内核就关注epoll的话，一定不会忘记手工 mmap这一步的。<br style="font: 12px song,Verdana;">&lt;4&gt;内核微调<br style="font: 12px song,Verdana;">这一点其实不算epoll的优点了，而是整个linux平台的优点。也许你可以怀疑linux平台，但是你无法回避linux平台赋予你微调内核的能力。 比如，内核TCP/IP协议栈使用内存池管理sk_buff结构，那么可以在运行时期动态调整这个内存pool(skb_head_pool)的大小 --- 通过echo XXXX&gt;/proc/sys/net/core/hot_list_length完成。再比如listen函数的第2个参数(TCP完成3次握手 的数据包队列长度)，也可以根据你平台内存大小动态调整。更甚至在一个数据包面数目巨大但同时每个数据包本身大小却很小的特殊系统上尝试最新的NAPI网 卡驱动架构。<br style="font: 12px song,Verdana;">4、epoll的工作模式<br style="font: 12px song,Verdana;">令人高兴的是，2.6内核的epoll比其2.5开发版本的/dev/epoll简洁了许多，所以，大部分情况下，强大的东西往往是简单的。唯一有点麻烦是epoll有2种工作方式:LT和ET。<br style="font: 12px song,Verdana;">LT(level triggered)是缺省的工作方式，并且同时支持block和no-block socket.在这种做法中，内核告诉你一个文件描述符是否就绪了，然后你可以对这个就绪的fd进行IO操作。如果你不作任何操作，内核还是会继续通知你 的，所以，这种模式编程出错误可能性要小一点。传统的select/poll都是这种模型的代表．<br style="font: 12px song,Verdana;">ET (edge-triggered)是高速工作方式，只支持no-block socket。在这种模式下，当描述符从未就绪变为就绪时，内核通过epoll告诉你。然后它会假设你知道文件描述符已经就绪，并且不会再为那个文件描述 符发送更多的就绪通知，直到你做了某些操作导致那个文件描述符不再为就绪状态了(比如，你在发送，接收或者接收请求，或者发送接收的数据少于一定量时导致 了一个EWOULDBLOCK 错误）。但是请注意，如果一直不对这个fd作IO操作(从而导致它再次变成未就绪)，内核不会发送更多的通知(only once),不过在TCP协议中，ET模式的加速效用仍需要更多的benchmark确认。<br style="font: 12px song,Verdana;">epoll只有epoll_create,epoll_ctl,epoll_wait 3个系统调用，具体用法请参考<a style="color: #0044b6; text-decoration: underline;" 
href="http://www.xmailserver.org/linux-patches/nio-improve.html">http://www.xmailserver.org/linux-patches/nio-improve.html</a>&nbsp;，在<a style="color: #0044b6; text-decoration: underline;" href="http://www.kegel.com/rn/">http://www.kegel.com/rn/</a>也有一个完整的例子（Leader/Follower模式的线程池实现，以及和epoll的配合），大家一看就知道如何使用了。<br style="font: 12px song,Verdana;"><br style="font: 12px song,Verdana;"><strong>5、epoll的使用方法</strong><br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp; 首先通过epoll_create(int maxfds)来创建一个epoll的句柄，其中maxfds为你epoll所支持的最大句柄数。这个函数会返回一个新的epoll句柄，之后的所有操作将通过这个句柄来进行。在用完之后，记得用close()来关闭这个创建出来的epoll句柄。之后在你的网络主循环里面，每一帧都调用epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout)来查询所有的网络接口，看哪一个可以读，哪一个可以写了。基本的语法为：&nbsp;<br style="font: 12px song,Verdana;">nfds = epoll_wait(kdpfd, events, maxevents, -1);&nbsp;<br style="font: 12px song,Verdana;">其中kdpfd为用epoll_create创建之后的句柄，events是一个指向struct epoll_event的指针，当epoll_wait这个函数操作成功之后，events数组里面将储存所有就绪的读写事件。maxevents是本次调用最多返回的事件个数。最后一个timeout是epoll_wait的超时，为0的时候表示马上返回，为-1的时候表示一直等下去，直到有事件发生；为任意正整数的时候表示等这么长的时间（毫秒），如果一直没有事件，则超时返回。一般如果网络主循环是单独的线程的话，可以用-1来等，这样可以保证一些效率；如果是和主逻辑在同一个线程的话，则可以用0来保证主循环的效率。</font></p>
<p style="font: 12px song,Verdana;"><font size="3">epoll_wait返回之后应该是一个循环，遍历所有的事件：&nbsp;<br style="font: 12px song,Verdana;">for(n = 0; n &lt; nfds; ++n) {&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(events[n].data.fd == listener) { //如果是主socket的事件的话，则表示有新连接进入了，进行新连接的处理。&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; client = accept(listener, (struct sockaddr *) &amp;local,&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;addrlen);&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(client &lt; 0){&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; perror("accept");&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; setnonblocking(client); // 将新连接置于非阻塞模式&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.events = EPOLLIN | EPOLLET; // 并且将新连接也加入EPOLL的监听队列。&nbsp;<br
style="font: 12px song,Verdana;">注意，这里的参数EPOLLIN | EPOLLET并没有设置对写socket的监听，如果有写操作的话，这个时候epoll是不会返回事件的；如果要对写操作也监听的话，应该是EPOLLIN | EPOLLOUT | EPOLLET。&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.data.fd = client;&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &amp;ev) &lt; 0) {&nbsp;<br style="font: 12px song,Verdana;">// 设置好event之后，将这个新的event通过epoll_ctl加入到epoll的监听队列里面，这里用EPOLL_CTL_ADD来加一个新的epoll事件，通过EPOLL_CTL_DEL来减少一个epoll事件，通过EPOLL_CTL_MOD来改变一个事件的监听方式。&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fprintf(stderr, "epoll set insertion error: fd=%d\n",&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; client);&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }&nbsp;<br style="font: 12px song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else // 如果不是主socket的事件的话，则代表是一个用户socket的事件，则来处理这个用户socket的事情，比如说read(fd,xxx)之类的，或者一些其他的处理。&nbsp;<br style="font: 12px 
song,Verdana;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; do_use_fd(events[n].data.fd);&nbsp;<br style="font: 12px song,Verdana;">}</font></p>
<p style="font: 12px song,Verdana;"><font size="3">对，epoll的操作就这么简单，总共不过4个API：epoll_create, epoll_ctl, epoll_wait和close。&nbsp;<br style="font: 12px song,Verdana;">如果您对epoll的效率还不太了解，请参考我之前关于网络游戏的网络编程等相关的文章。</font></p>
<p style="font: 12px song,Verdana;"><font size="3"><br style="font: 12px song,Verdana;">以前公司的服务器都是使用HTTP连接，但是这样的话，在手机目前的网络情况下不但显得速度较慢，而且不稳定。因此大家一致同意用SOCKET来进行连 接。虽然使用SOCKET之后，对于用户的费用可能会增加(由于是用了CMNET而非CMWAP)，但是，秉着用户体验至上的原则，相信大家还是能够接受 的(希望那些玩家月末收到帐单不后能够保持克制...)。<br style="font: 12px song,Verdana;">这次的服务器设计中，最重要的一个突破，是使用了EPOLL模型，虽然对之也是一知半解，但是既然在各大PC网游中已经经过了如此严酷的考验，相信他不会让我们失望，使用后的结果，确实也是表现相当不错。在这里，我还是主要大致介绍一下这个模型的结构。<br style="font: 12px song,Verdana;">6、Linux下EPOll编程实例<br style="font: 12px song,Verdana;">EPOLL模型似乎只有一种格式，所以大家只要参考我下面的代码，就能够对EPOLL有所了解了，代码的解释都已经在注释中：</font></p>
<p style="font: 12px song,Verdana;"><font size="3">while (TRUE)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">int nfds = epoll_wait (m_epoll_fd, m_events, MAX_EVENTS, EPOLL_TIME_OUT);//等待EPOLL事件的发生，相当于监听，至于相关的端口，需要在初始化EPOLL的时候绑定。<br style="font: 12px song,Verdana;">if (nfds &lt;= 0)<br style="font: 12px song,Verdana;">continue;<br style="font: 12px song,Verdana;">m_bOnTimeChecking = FALSE;<br style="font: 12px song,Verdana;">G_CurTime = time(NULL);<br style="font: 12px song,Verdana;">for (int i=0; i&lt;nfds; i++)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">try<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">if (m_events[i].data.fd == m_listen_http_fd)//如果新监测到一个HTTP用户连接到绑定的HTTP端口，建立新的连接。由于我们新采用了SOCKET连接，所以基本没用。<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">OnAcceptHttpEpoll ();<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">else if (m_events[i].data.fd == m_listen_sock_fd)//如果新监测到一个SOCKET用户连接到了绑定的SOCKET端口，建立新的连接。<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">OnAcceptSockEpoll ();<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">else if (m_events[i].events &amp; EPOLLIN)//如果是已经连接的用户，并且收到数据，那么进行读入。<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">OnReadEpoll (i);<br style="font: 12px song,Verdana;">}</font></p>
<p style="font: 12px song,Verdana;"><font size="3">OnWriteEpoll (i);//查看当前的活动连接是否有需要写出的数据。<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">catch (int)<br style="font: 12px song,Verdana;">{<br style="font: 12px song,Verdana;">PRINTF ("CATCH捕获错误\n");<br style="font: 12px song,Verdana;">continue;<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">m_bOnTimeChecking = TRUE;<br style="font: 12px song,Verdana;">OnTimer ();//进行一些定时的操作，主要就是删除一些掉线用户等。<br style="font: 12px song,Verdana;">}<br style="font: 12px song,Verdana;">　其实EPOLL的精华，也就是上述的几段短短的代码。看来时代真的不同了，以前如何接受大量用户连接的问题，现在却被如此轻松地搞定，真是让人不得不感叹啊。</font></p>
<p style="font: 12px song,Verdana;"><font size="3"><br style="font: 12px song,Verdana;"><strong>总结<br style="font: 12px song,Verdana;"></strong>Windows完成端口与Linux epoll是在这两个平台上实现异步I/O、开发大容量且具可扩展性的网络服务程序的很好的选择。本文对这两种技术的实现原理和实际的使用方法做了一个详细的介绍。</font></p>
<p style="font: 12px song,Verdana;"><span style="font-size: medium;">转自：</span></p>
<p style="font: 12px song,Verdana;"><span style="font-size: medium;"><a href="http://blog.chinaunix.net/u2/67780/showart_2057153.html">http://blog.chinaunix.net/u2/67780/showart_2057153.html</a></span></p>
</span><img src ="http://www.cppblog.com/beautykingdom/aggbug/101814.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-11-24 15:28 <a href="http://www.cppblog.com/beautykingdom/archive/2009/11/24/101814.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Linux网络socket编程指南</title><link>http://www.cppblog.com/beautykingdom/archive/2009/11/15/101015.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 15 Nov 2009 13:40:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/11/15/101015.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/101015.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/11/15/101015.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/101015.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/101015.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: Linux Socket Introduction&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2009/11/15/101015.html'>阅读全文</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/101015.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-11-15 21:40 <a href="http://www.cppblog.com/beautykingdom/archive/2009/11/15/101015.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>epoll mechanism from linux man -2.6.16</title><link>http://www.cppblog.com/beautykingdom/archive/2009/11/08/100400.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 08 Nov 2009 05:13:00 
GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/11/08/100400.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/100400.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/11/08/100400.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/100400.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/100400.html</trackback:ping><description><![CDATA[<p>1.NAME<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll - I/O event notification facility</p>
<p>SYNOPSIS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #include &lt;sys/epoll.h&gt;</p>
<p>DESCRIPTION<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll&nbsp; is&nbsp; a&nbsp; variant&nbsp; of&nbsp; poll(2) that can be used either as Edge or Level Triggered interface and scales well to<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; large numbers of watched fds. Three system calls are provided to set up and control an epoll set: epoll_create(2),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl(2), epoll_wait(2).</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; An&nbsp; epoll set is connected to a file descriptor created by epoll_create(2).&nbsp; Interest for certain file descriptors<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; is then registered via epoll_ctl(2).&nbsp; Finally, the actual wait is started by epoll_wait(2).</p>
<p><br>NOTES<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The epoll event distribution interface is able to behave both as Edge Triggered ( ET ) and Level Triggered ( LT ).<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The difference between the ET and LT event distribution mechanisms can be described as follows. Suppose that this<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; scenario happens :</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The file descriptor that represents the read side of a pipe ( RFD ) is added inside the epoll device.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Pipe writer writes 2Kb of data on the write side of the pipe.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A call to epoll_wait(2) is done that will return RFD as ready file descriptor.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The pipe reader reads 1Kb of data from RFD.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A call to epoll_wait(2) is done.</p>
<p><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If the RFD file descriptor has been added to the epoll interface using the EPOLLET flag, the call to epoll_wait(2)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; done in step 5 will probably hang despite the available data still present in the file input buffers, while the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; remote peer might be expecting a response based on the data it already sent. The reason for this is that Edge<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Triggered event distribution delivers events only when events happen on the monitored file.&nbsp; So, in step 5 the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; caller might end up waiting for some data that is already present inside the input buffer. In the above example,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; an event on RFD will be generated because of the write done in 2 and the event is consumed in 3.&nbsp; Since the read<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; operation done in 4 does not consume the whole buffer data, the call to epoll_wait(2) done in step 5 might block<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; indefinitely. The epoll interface, when used with the EPOLLET flag ( Edge Triggered ), should use non-blocking file<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; descriptors to avoid having a blocking read or write starve the task that is handling multiple file descriptors.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The suggested way to use epoll as an Edge Triggered (EPOLLET) interface is below, and possible pitfalls to avoid<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; follow.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; with non-blocking file descriptors</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ii&nbsp;&nbsp;&nbsp;&nbsp; by going to wait for an event only after read(2) or write(2) return EAGAIN</p>
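<p>The two rules above can be sketched as a single drain loop: keep calling read(2) on the non-blocking descriptor and return to epoll_wait(2) only once EAGAIN is seen. This is an illustrative sketch, not part of the man page; the function name drain_fd and the buffer size are assumptions.</p>

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Drain a non-blocking fd until the kernel buffer is empty.
   Returns 0 once read(2) reports EAGAIN (safe to wait again),
   -1 on EOF or a real error. */
static int drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            continue;                  /* data consumed, keep reading */
        if (n == 0)
            return -1;                 /* peer closed the stream */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                  /* drained: rule ii says wait now */
        if (errno == EINTR)
            continue;                  /* interrupted, retry the read */
        return -1;                     /* genuine error */
    }
}
```

<p>In a real server the consumed bytes would be handed to the protocol state machine instead of being discarded.</p>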
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; On the contrary, when used as a Level Triggered interface, epoll is by all means a faster poll(2), and can be used<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; wherever the latter is used since it shares the same semantics. Since even with the Edge Triggered epoll multiple<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; events can be generated upon receipt of multiple chunks of data, the caller has the option to specify the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLONESHOT flag, to tell epoll to disable the associated file descriptor after the receipt of an event with<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_wait(2).&nbsp; When the EPOLLONESHOT flag is specified, it is the caller's responsibility to rearm the file<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; descriptor using epoll_ctl(2) with EPOLL_CTL_MOD.</p>
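<p>Rearming after EPOLLONESHOT is a single epoll_ctl(2) call with EPOLL_CTL_MOD. The helper below is a sketch under the assumption that epfd and fd come from the surrounding event loop; the name rearm_oneshot and the chosen event mask are not from the man page.</p>

```c
#include <sys/epoll.h>

/* Re-enable a descriptor that was registered with EPOLLONESHOT and
   has just delivered an event. Returns 0 on success, -1 on error. */
static int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev;
    /* Assumed mask: same flags the fd was originally added with. */
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```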
<p><br>EXAMPLE FOR SUGGESTED USAGE<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; While the usage of epoll when employed like a Level Triggered interface does have the same semantics as poll(2),<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; an Edge Triggered usage requires more clarification to avoid stalls in the application event loop. In this<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; example, listener is a non-blocking socket on which listen(2) has been called. The function do_use_fd() uses the new<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ready file descriptor until EAGAIN is returned by either read(2) or write(2). An event driven state machine<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; application should, after having received EAGAIN, record its current state so that at the next call to do_use_fd()<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; it will continue to read(2) or write(2) from where it stopped before.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epoll_event ev, *events;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for(;;) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; nfds = epoll_wait(kdpfd, events, maxevents, -1);</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for(n = 0; n &lt; nfds; ++n) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(events[n].data.fd == listener) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; client = accept(listener, (struct sockaddr *) &amp;local,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;addrlen);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(client &lt; 0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; perror("accept");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continue;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; setnonblocking(client);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.events = EPOLLIN | EPOLLET;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ev.data.fd = client;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (epoll_ctl(kdpfd, EPOLL_CTL_ADD, client, &amp;ev) &lt; 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fprintf(stderr, "epoll set insertion error: 
fd=%d\n",<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; client);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return -1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; do_use_fd(events[n].data.fd);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When used as an Edge triggered interface, for performance reasons, it is&nbsp; possible&nbsp; to&nbsp; add&nbsp; the&nbsp; file&nbsp; descriptor<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; inside&nbsp; the&nbsp; epoll&nbsp; interface&nbsp; ( EPOLL_CTL_ADD ) once by specifying ( EPOLLIN|EPOLLOUT ). This allows you to avoid<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; continuously switching between EPOLLIN and EPOLLOUT calling epoll_ctl(2) with EPOLL_CTL_MOD.</p>
<p>QUESTIONS AND ANSWERS (from linux-kernel)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q1&nbsp;&nbsp;&nbsp;&nbsp; What happens if you add the same fd to an epoll_set twice?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A1&nbsp;&nbsp;&nbsp;&nbsp; You will probably get EEXIST. However, it is possible that two threads may add the same fd twice. This is a<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; harmless condition.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q2&nbsp;&nbsp;&nbsp;&nbsp; Can two epoll sets wait for the same fd? If so, are events reported to both epoll sets fds?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A2&nbsp;&nbsp;&nbsp;&nbsp; Yes. However, it is not recommended. Yes it would be reported to both.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q3&nbsp;&nbsp;&nbsp;&nbsp; Is the epoll fd itself poll/epoll/selectable?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A3&nbsp;&nbsp;&nbsp;&nbsp; Yes.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q4&nbsp;&nbsp;&nbsp;&nbsp; What happens if the epoll fd is put into its own fd set?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A4&nbsp;&nbsp;&nbsp;&nbsp; It will fail. However, you can add an epoll fd inside another epoll fd set.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q5&nbsp;&nbsp;&nbsp;&nbsp; Can I send the epoll fd over a unix-socket to another process?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A5&nbsp;&nbsp;&nbsp;&nbsp; No.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q6&nbsp;&nbsp;&nbsp;&nbsp; Will the close of an fd cause it to be removed from all epoll sets automatically?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A6&nbsp;&nbsp;&nbsp;&nbsp; Yes.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q7&nbsp;&nbsp;&nbsp;&nbsp; If more than one event comes in between epoll_wait(2) calls, are they combined or reported separately?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A7&nbsp;&nbsp;&nbsp;&nbsp; They will be combined.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q8&nbsp;&nbsp;&nbsp;&nbsp; Does an operation on an fd affect the already collected but not yet reported events?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A8&nbsp;&nbsp;&nbsp;&nbsp; You can do two operations on an existing fd. Remove would be meaningless for this case. Modify will re-read<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; available I/O.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Q9&nbsp;&nbsp;&nbsp;&nbsp; Do I need to continuously read/write an fd until EAGAIN when&nbsp; using&nbsp; the&nbsp; EPOLLET&nbsp; flag&nbsp; (&nbsp; Edge&nbsp; Triggered<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; behaviour ) ?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A9&nbsp;&nbsp;&nbsp;&nbsp; No you don't. Receiving an event from epoll_wait(2) should suggest to you that such file descriptor is<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ready for the requested I/O operation. You simply have to consider it ready until you receive the next<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EAGAIN. When and how you will use such file descriptor is entirely up to you. Also, the condition that the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; read/write I/O space is exhausted can be detected by checking the amount of data read/written from/to the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; target file descriptor. For example, if you call read(2) asking to read a certain amount of data and<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; read(2) returns a lower number of bytes, you can be sure to have exhausted the read I/O space for that file<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; descriptor. The same is valid when writing using the write(2) function.</p>
<p><br>POSSIBLE PITFALLS AND WAYS TO AVOID THEM<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; o Starvation ( Edge Triggered )</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If&nbsp; there&nbsp; is&nbsp; a large amount of I/O space, it is possible that by trying to drain it the other files will not get<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; processed causing starvation. This is not specific to epoll.</p>
<p><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The solution is to maintain a ready list and mark the file descriptor as ready in its associated&nbsp; data&nbsp; structure,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; thereby&nbsp; allowing&nbsp; the&nbsp; application to remember which files need to be processed but still round robin amongst all<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the ready files. This also supports ignoring subsequent events you receive for fd's that are already ready.</p>
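<p>The ready-list idea above can be sketched with a per-fd flag plus a round-robin pass that does a bounded amount of work per descriptor. Everything below (MAX_FDS, the flag array, the work callback) is illustrative, not part of the man page.</p>

```c
#define MAX_FDS 1024

static unsigned char fd_ready[MAX_FDS];    /* 1 if the fd has pending work */

/* Called when epoll_wait(2) reports an event; duplicate events for an
   already-ready fd are simply ignored. */
static void mark_ready(int fd)
{
    if (fd >= 0 && fd < MAX_FDS)
        fd_ready[fd] = 1;
}

/* Sample work callback: pretend the fd was fully drained. */
static int consume_all(int fd)
{
    (void)fd;
    return 1;                              /* 1 means "drained, clear flag" */
}

/* One round-robin pass: a bounded unit of work per ready fd, so a
   single busy descriptor cannot starve the others.
   Returns the number of fds serviced. */
static int service_round(int (*work)(int fd))
{
    int serviced = 0;
    for (int fd = 0; fd < MAX_FDS; fd++) {
        if (!fd_ready[fd])
            continue;
        serviced++;
        if (work(fd))
            fd_ready[fd] = 0;              /* drained until the next event */
    }
    return serviced;
}
```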
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; o If using an event cache...</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If you use an event cache or store all the fd's returned from epoll_wait(2), then make sure to provide&nbsp; a&nbsp; way&nbsp; to<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; mark&nbsp; its&nbsp; closure&nbsp; dynamically (ie- caused by a previous event's processing). Suppose you receive 100 events from<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_wait(2), and in event #47 a condition causes event #13 to be&nbsp; closed.&nbsp;&nbsp; If&nbsp; you&nbsp; remove&nbsp; the&nbsp; structure&nbsp; and<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; close()&nbsp; the&nbsp; fd for event #13, then your event cache might still say there are events waiting for that fd causing<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; confusion.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; One solution for this is to call, during the processing of event 47, epoll_ctl(EPOLL_CTL_DEL) to delete fd 13&nbsp; and<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; close(),&nbsp; then&nbsp; mark&nbsp; its&nbsp; associated data structure as removed and link it to a cleanup list. If you find another<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; event for fd 13 in your batch processing, you will discover the fd had been previously removed and there&nbsp; will&nbsp; be<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; no confusion.</p>
<p><br>CONFORMING TO<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll(7)&nbsp; is&nbsp; a&nbsp; new&nbsp; API&nbsp; introduced&nbsp; in&nbsp; Linux kernel 2.5.44.&nbsp; Its interface should be finalized in Linux kernel<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5.66.<br><br>2.NAME<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_wait - wait for an I/O event on an epoll file descriptor</p>
<p>SYNOPSIS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #include &lt;sys/epoll.h&gt;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int epoll_wait(int epfd, struct epoll_event * events, int maxevents, int timeout)</p>
<p>DESCRIPTION<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Wait for events on the epoll file descriptor epfd for a maximum time of timeout milliseconds. The memory area<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; pointed to by events will contain the events that will be available for the caller.&nbsp; Up to maxevents are returned<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; by epoll_wait(2).&nbsp; The maxevents parameter must be greater than zero.&nbsp; Specifying a timeout of -1 makes<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_wait(2) wait indefinitely, while specifying a timeout equal to zero makes epoll_wait(2) return immediately<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; even if no events are available (return code equal to zero).&nbsp; The struct epoll_event is defined as :</p>
<p><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; typedef union epoll_data {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; void *ptr;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int fd;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t u32;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint64_t u64;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } epoll_data_t;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epoll_event {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t events;&nbsp; /* Epoll events */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_data_t data;&nbsp; /* User data variable */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };</p>
<p><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The&nbsp;&nbsp; data&nbsp;&nbsp; of&nbsp;&nbsp; each&nbsp; returned&nbsp; structure&nbsp; will&nbsp; contain&nbsp; the&nbsp; same&nbsp; data&nbsp; the&nbsp; user&nbsp; set&nbsp; with&nbsp; a&nbsp; epoll_ctl(2)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (EPOLL_CTL_ADD,EPOLL_CTL_MOD) while the events member will contain the returned event bit field.</p>
<p>RETURN VALUE<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When successful, epoll_wait(2) returns the number of file descriptors ready for the requested I/O, or zero&nbsp; if&nbsp; no<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; file&nbsp; descriptor&nbsp; became&nbsp; ready&nbsp; during&nbsp; the&nbsp; requested timeout milliseconds.&nbsp; When an error occurs, epoll_wait(2)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; returns -1 and errno is set appropriately.</p>
<p>ERRORS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EBADF&nbsp; epfd is not a valid file descriptor.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EFAULT The memory area pointed to by events is not accessible with write permissions.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EINTR&nbsp; The call was interrupted by a signal handler before any of the requested events&nbsp; occurred&nbsp; or&nbsp; the&nbsp; timeout<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; expired.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EINVAL epfd is not an epoll file descriptor, or maxevents is less than or equal to zero.</p>
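<p>A zero timeout turns epoll_wait(2) into a non-blocking poll: with nothing ready it returns 0 immediately. Minimal sketch; the name poll_once and the event-array size are assumptions, not part of the man page.</p>

```c
#include <sys/epoll.h>

/* Check for ready descriptors without blocking.
   Returns the number of ready fds, 0 if none, -1 on error. */
static int poll_once(int epfd)
{
    struct epoll_event evs[8];
    return epoll_wait(epfd, evs, 8, 0);    /* timeout 0: return immediately */
}
```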
<p><br>3.NAME<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl - control interface for an epoll descriptor</p>
<p>SYNOPSIS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #include &lt;sys/epoll.h&gt;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)</p>
<p>DESCRIPTION<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Control an epoll descriptor, epfd, by requesting that the operation op be performed on the target file descriptor,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; fd.&nbsp; The event describes the object linked to the file descriptor fd.&nbsp; The struct epoll_event is defined as :</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; typedef union epoll_data {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; void *ptr;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int fd;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t u32;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint64_t u64;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } epoll_data_t;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct epoll_event {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; __uint32_t events;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* Epoll events */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_data_t data;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /* User data variable */<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; };</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The events member is a bit set composed using the following available event types :</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLIN<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The associated file is available for read(2) operations.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLOUT<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The associated file is available for write(2) operations.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLPRI<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; There is urgent data available for read(2) operations.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLERR<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Error condition happened on the associated file descriptor.&nbsp; epoll_wait(2) will always wait for this event;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; it is not necessary to set it in events.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLHUP<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Hang up happened on the associated file descriptor.&nbsp; epoll_wait(2) will always wait for this event; it is<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; not necessary to set it in events.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLET<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sets the Edge Triggered behaviour for the associated file descriptor.&nbsp; The default behaviour for epoll is<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Level Triggered. See epoll(7) for more detailed information about Edge and Level Triggered event<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; distribution architectures.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLLONESHOT (since kernel 2.6.2)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sets the one-shot behaviour for the associated file descriptor.&nbsp; This means that after an event&nbsp; is&nbsp; pulled<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; out&nbsp; with&nbsp; epoll_wait(2)&nbsp; the associated file descriptor is internally disabled and no other events will be<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; reported by the epoll interface. The user must call epoll_ctl(2) with EPOLL_CTL_MOD to re-enable&nbsp; the&nbsp; file<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; descriptor with a new event mask.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The epoll interface supports all file descriptors that support poll(2).&nbsp; Valid values for the op parameter are :</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLL_CTL_ADD<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Add&nbsp; the&nbsp; target&nbsp; file descriptor fd to the epoll descriptor epfd and associate the event event with<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; the internal file linked to fd.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLL_CTL_MOD<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Change the event event associated with the target file descriptor fd.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPOLL_CTL_DEL<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Remove the target file descriptor fd from the epoll file descriptor, epfd.&nbsp; The event is ignored and<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; can be NULL (but see BUGS below).</p>
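<p>A descriptor's life cycle against one epoll set walks through the three ops in order: ADD to register, MOD to change the mask, DEL to remove. The sketch below drives them on an assumed fd; the function name is illustrative.</p>

```c
#include <sys/epoll.h>

/* Exercise EPOLL_CTL_ADD, EPOLL_CTL_MOD and EPOLL_CTL_DEL on fd.
   Returns 0 if every call succeeds, -1 otherwise. */
static int epoll_ctl_demo(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events = EPOLLIN;                              /* watch for reads */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0)
        return -1;
    ev.events = EPOLLIN | EPOLLOUT;                   /* now also writes */
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0)
        return -1;
    /* A non-NULL event keeps DEL portable to kernels before 2.6.9 (see BUGS) */
    if (epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev) < 0)
        return -1;
    return 0;
}
```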
<p>RETURN VALUE<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When successful, epoll_ctl(2) returns zero. When an error occurs, epoll_ctl(2) returns -1 and errno is set<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; appropriately.</p>
<p>ERRORS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EBADF&nbsp; epfd is not a valid file descriptor.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd is already in epfd.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EINVAL epfd is not an epoll file descriptor, or fd is the same as epfd, or the requested operation op is not<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; supported by this interface.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not in epfd.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ENOMEM There was insufficient memory to handle the requested op control operation.<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; EPERM&nbsp; The target file fd does not support epoll.</p>
<p>CONFORMING TO<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; epoll_ctl(2)&nbsp; is&nbsp; a&nbsp; new API introduced in Linux kernel 2.5.44.&nbsp; The interface should be finalized by Linux kernel<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5.66.</p>
<p>BUGS<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In kernel versions before 2.6.9, the EPOLL_CTL_DEL operation required a non-NULL pointer&nbsp; in&nbsp; event,&nbsp; even&nbsp; though<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; this argument is ignored.&nbsp; Since kernel 2.6.9, event can be specified as NULL when using EPOLL_CTL_DEL.</p>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/100400.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-11-08 13:13 <a href="http://www.cppblog.com/beautykingdom/archive/2009/11/08/100400.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>