﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-beautykingdom-随笔分类-Network</title><link>http://www.cppblog.com/beautykingdom/category/12134.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 22 May 2012 05:14:40 GMT</lastBuildDate><pubDate>Tue, 22 May 2012 05:14:40 GMT</pubDate><ttl>60</ttl><item><title>Comparing Two High-Performance I/O Design Patterns&lt;forward&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 21 May 2012 03:24:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/175576.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/175576.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/175576.html</trackback:ping><description><![CDATA[<span style="font-size: 18px; color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; ">by Alexander Libman with Vladimir Gilbourd</span><br style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; " /><span style="font-size: 15px; color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; ">November 25, 2005</span>&nbsp;<br /><br /><div style="padding-left: 38px; padding-right: 38px; padding-bottom: 1em; color: #212324; font-family: Arial, Helvetica, 
sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; "><div style="font-size: 1.25em; ">Summary</div>This article investigates and compares different design patterns of high performance TCP-based servers. In addition to existing approaches, it proposes a scalable single-codebase, multi-platform solution (with code examples) and describes its fine-tuning on different platforms. It also compares performance of Java, C# and C++ implementations of proposed and existing solutions.</div><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">System I/O can be blocking, or non-blocking synchronous, or non-blocking asynchronous [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; ">2</a>]. Blocking I/O means that the calling system does not return control to the caller until the operation is finished. As a result, the caller is blocked and cannot perform other activities during that time. Most important, the caller thread cannot be reused for other request processing while waiting for the I/O to complete, and becomes a wasted resource during that time. For example, a<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in blocking mode will not return control if the socket buffer is empty until some data becomes available.</p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">By contrast, a non-blocking synchronous call returns control to the caller immediately. 
The caller is not made to wait, and the invoked system immediately returns one of two responses: If the call was executed and the results are ready, then the caller is told of that. Alternatively, the invoked system can tell the caller that the system has no resources (no data in the socket) to perform the requested action. In that case, it is the responsibility of the caller to repeat the call until it succeeds. For example, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in non-blocking mode may return the number of read bytes or a special return code -1 with errno set to&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">EWOULDBLOCK/EAGAIN</code>, meaning "not ready; try again later."</p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; ">In a non-blocking asynchronous call, the calling function returns control to the caller immediately, reporting that the requested action was started. The calling system will execute the caller's request using additional system resources/threads and will notify the caller (by callback, for example) when the result is ready for processing. For example, a Windows&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">ReadFile()</code>&nbsp;or POSIX&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>&nbsp;API returns immediately and initiates an internal system read operation. 
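The "not ready; try again later" behaviour is easy to observe. The following minimal sketch (ours, not from the article) uses a Java NIO pipe, whose non-blocking read() returns 0 when no data is available rather than blocking the caller:

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class NonBlockingReadDemo {
    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);      // switch the read end to non-blocking mode
        ByteBuffer buf = ByteBuffer.allocate(64);

        // Nothing has been written yet: the call returns 0 ("not ready; try again later")
        // instead of blocking the caller, mirroring read() with EWOULDBLOCK/EAGAIN in C.
        System.out.println("empty read: " + pipe.source().read(buf));

        pipe.sink().write(ByteBuffer.wrap("hi".getBytes()));
        int n = 0;
        while (n == 0) {                             // poll until the data becomes visible
            n = pipe.source().read(buf);
        }
        System.out.println("bytes read: " + n);
    }
}
```

In C the same experiment would use fcntl() with O_NONBLOCK on the descriptor and check for a -1 return with errno set to EWOULDBLOCK/EAGAIN.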
Of the three approaches, this non-blocking asynchronous<span style="font-family: Arial; font-size: 14pt; "> </span><span style="font-family: Arial; font-size: 12pt; ">approach offers the best scalability and performance.</span></p><p style="color: #212324; font-family: Arial, Helvetica, sans-serif; line-height: normal; background-color: #ffffff; font-size: medium; "><span style="font-family: Arial; font-size: 12pt; ">This article investigates different non-blocking I/O multiplexing mechanisms and proposes a single multi-platform design pattern/solution. We hope that this article will help developers of high-performance TCP-based servers choose an optimal design solution. We also compare the performance of Java, C# and C++ implementations of proposed and existing solutions. We will exclude the blocking approach from further discussion and </span>comparison, as it is the least effective approach for scalability and performance.<br /></p><h1>Reactor and Proactor: two I/O multiplexing approaches</h1><p><span style="font-size: 12pt; ">In general, I/O</span><span style="font-size: 12pt; "> multiplexing mechanisms rely on an event demultiplexor [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">3</span></a><span style="font-size: 12pt; ">], an object that dispatches I/O events from a limited number of sources to the appropriate read/write event handlers. 
The developer registers interest in specific events and provides event handlers, or callbacks.</span><span style="font-size: 12pt; "> The event demultiplexor delivers the requested events to the event handlers.</span></p><p><span style="font-size: 12pt; ">Two patterns that involve event demultiplexors are called Reactor and Proactor [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">]. The Reactor pattern involves synchronous I/O, whereas the Proactor pattern involves asynchronous I/O. In Reactor, the event demultiplexor waits for events that indicate when a file descriptor or socket is ready for a read or write operation. The demultiplexor passes this event to the appropriate handler, which is responsible for performing the actual read or write.</span></p><p><span style="font-size: 12pt; ">In the Proactor pattern, by contrast, the handler&#8212;or the event demultiplexor on behalf of the handler&#8212;initiates asynchronous read and write operations. The I/O operation itself is performed by the operating system (OS). The parameters passed to the OS include the addresses of user-defined data buffers from which the OS gets data to write, or to which the OS puts data read. The event demultiplexor waits for events that indicate the completion of the I/O operation, and forwards those events to the appropriate handlers. For example, on Windows a handler could initiate async I/O (overlapped in Microsoft terminology) operations, and the event demultiplexor could wait for IOCompletion events [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">]. 
The implementation of this classic asynchronous pattern is based on an asynchronous OS-level API, and we will call this implementation the "system-level" or "true" async, because the application fully relies on the OS to execute actual I/O.</span></p><p><span style="font-size: 12pt; ">An example will help you understand the difference between Reactor and Proactor. We will focus on the read operation here, as the write implementation is similar. Here's a read in Reactor:</span></p><ul><li><span style="font-size: 12pt; ">An event handler declares interest in I/O events that indicate readiness for read on a particular socket</span></li><li><span style="font-size: 12pt; ">The event demultiplexor waits for events</span></li><li><span style="font-size: 12pt; ">An event comes in and wakes-up the demultiplexor, and the demultiplexor calls the appropriate handler</span></li><li><span style="font-size: 12pt; ">The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher</span></li></ul><p><span style="font-size: 12pt; ">By comparison, here is a read operation in Proactor (true async):</span></p><ul><li><span style="font-size: 12pt; ">A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). 
In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.</span></li><li><span style="font-size: 12pt; ">The event demultiplexor waits until the operation is completed</span></li><li><span style="font-size: 12pt; ">While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete</span></li><li><span style="font-size: 12pt; ">The event demultiplexor calls the appropriate handler</span></li><li><span style="font-size: 12pt; ">The event handler handles the data from the user-defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.</span></li></ul><h1>Current practice</h1><p><span style="font-size: 12pt; ">The open-source C++ development framework ACE [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">1</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">3</span></a><span style="font-size: 12pt; ">] developed by Douglas Schmidt, et al., offers a wide range of platform-independent, low-level concurrency support classes (threading, mutexes, etc.). At the top level it provides two separate groups of classes: implementations of the ACE Reactor and ACE Proactor. 
Although both of them are based on platform-independent primitives, these tools offer different interfaces.</span></p><p><span style="font-size: 12pt; ">The ACE Proactor gives much better performance and robustness on MS-Windows, as Windows provides a very efficient async API, based on operating-system-level support [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">4</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">5</span></a><span style="font-size: 12pt; ">].</span></p><p><span style="font-size: 12pt; ">Unfortunately, not all operating systems provide full robust async OS-level support. For instance, many Unix systems do not. Therefore, ACE Reactor is a preferable solution in UNIX (currently UNIX does not have robust async facilities for sockets). As a result, to achieve the best performance on each system, developers of networked applications need to maintain two separate code-bases: an ACE Proactor based solution on Windows and an ACE Reactor based solution for Unix-based systems.</span></p><p><span style="font-size: 12pt; ">As we mentioned, the true async Proactor pattern requires operating-system-level support. Due to the differing nature of event handler and operating-system interaction, it is difficult to create common, unified external interfaces for both Reactor and Proactor patterns. That, in turn, makes it hard to create a fully portable development framework and encapsulate the interface and OS- related differences.</span></p><h1>Proposed solution</h1><p><span style="font-size: 12pt; ">In this section, we will propose a solution to the challenge of designing a portable framework for the Proactor and Reactor I/O patterns. 
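As a baseline, the classic Reactor read sequence described earlier can be sketched with Java NIO's Selector, a select()-style demultiplexor; the loopback setup and names below are our own illustration, not part of ACE or TProactor:

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class ReactorSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();                 // the event demultiplexor
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);   // declare interest in events

        // a plain blocking client, used only to drive the demo over loopback
        SocketChannel client = SocketChannel.open(
                new InetSocketAddress("127.0.0.1", server.socket().getLocalPort()));
        client.write(ByteBuffer.wrap("ping".getBytes()));

        ByteBuffer buf = ByteBuffer.allocate(64);
        String got = null;
        while (got == null) {
            selector.select();                               // demultiplexor waits for events
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel ch = server.accept();
                    ch.configureBlocking(false);
                    ch.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {                 // event dispatched to the handler,
                    ((SocketChannel) key.channel()).read(buf); // which performs the actual read
                    if (buf.position() == 4) {
                        got = new String(buf.array(), 0, buf.position());
                    }
                }
            }
            selector.selectedKeys().clear();
        }
        System.out.println("handler read: " + got);
        client.close();
        server.close();
        selector.close();
    }
}
```

Note that the handler code, not the demultiplexor, performs the read; that division of labor is exactly what the emulated-async transformation below changes.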
To demonstrate this solution, we will transform a Reactor demultiplexor I/O solution to an emulated async I/O by moving the read/write operations from the event handlers into the demultiplexor (this is the "emulated async" approach). The following example illustrates that conversion for a read operation:</span></p><blockquote style="color: #212324; background-color: #ffffff; "><ul><li><span style="font-size: 12pt; ">An event handler declares interest in I/O events (readiness for read) and provides the demultiplexor with information such as the address of a data buffer, or the number of bytes to read.</span></li><li><span style="font-size: 12pt; ">The dispatcher waits for events (for example, on&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">);</span></li><li><span style="font-size: 12pt; ">When an event arrives, it wakes up the dispatcher. The dispatcher performs a non-blocking read operation (it has all the necessary information to perform this operation) and on completion calls the appropriate handler.</span></li><li><span style="font-size: 12pt; ">The event handler handles the data from the user-defined buffer, declares renewed interest in I/O events, along with information about where to put the data and the number of bytes to read. The event handler then returns control to the dispatcher.</span></li></ul></blockquote><p><span style="font-size: 12pt; ">As we can see, by adding functionality to the demultiplexor I/O pattern, we were able to convert the Reactor pattern to a Proactor pattern. In terms of the amount of work performed, this approach is exactly the same as the Reactor pattern; we simply shifted responsibilities between different actors. There is no performance degradation, because the amount of work performed is still the same; the work is simply performed by different actors. 
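The conversion just described can be reduced to a few lines. In this sketch (our own illustration; the ReadHandler interface only mirrors, and is not, TProactor's API), the dispatcher owns the non-blocking read and hands the handler completed data rather than a readiness event:

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class EmulatedAsyncSketch {
    // hypothetical completion-style callback, named for illustration only
    interface ReadHandler { void onReadCompleted(ByteBuffer data); }

    public static void main(String[] args) throws Exception {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        Selector selector = Selector.open();
        pipe.source().register(selector, SelectionKey.OP_READ);

        ByteBuffer buf = ByteBuffer.allocate(64);    // handler-supplied buffer, given to the dispatcher
        ReadHandler handler = data -> System.out.println(
                "completed: " + new String(data.array(), 0, data.limit()));

        pipe.sink().write(ByteBuffer.wrap("done".getBytes()));

        selector.select();              // the dispatcher waits for readiness...
        pipe.source().read(buf);        // ...performs the non-blocking read itself...
        buf.flip();
        handler.onReadCompleted(buf);   // ...and delivers a completion event, Proactor-style
    }
}
```

The handler's contract changes from "you may now read" to "your buffer has been filled", which is the proactive facade the emulated approach exposes.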
The following lists of steps demonstrate that each approach performs an equal amount of work:</span></p><p><span style="font-size: 12pt; ">Standard/classic Reactor:</span></p><ul><li><span style="font-size: 12pt; ">Step 1) wait for event (Reactor job)</span></li><li><span style="font-size: 12pt; ">Step 2) dispatch "Ready-to-Read" event to user handler ( Reactor job)</span></li><li><span style="font-size: 12pt; ">Step 3) read data (user handler job)</span></li><li><span style="font-size: 12pt; ">Step 4) process data ( user handler job)</span></li></ul><p><span style="font-size: 12pt; ">Proposed emulated Proactor:</span></p><ul><li><span style="font-size: 12pt; ">Step 1) wait for event (Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 2) read data (now Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 3) dispatch "Read-Completed" event to user handler (Proactor job)</span></li><li><span style="font-size: 12pt; ">Step 4) process data (user handler job)</span></li></ul><p><span style="font-size: 12pt; ">With an operating system that does not provide an async I/O API, this approach allows us to hide the reactive nature of available socket APIs and to expose a fully proactive async interface. This allows us to create a fully portable platform-independent solution with a common external interface.</span></p><h1>TProactor</h1><p><span style="font-size: 12pt; ">The proposed solution (TProactor) was developed and implemented at Terabit P/L [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">6</span></a><span style="font-size: 12pt; ">]. The solution has two alternative implementations, one in C++ and one in Java. 
The C++ version was built using ACE cross-platform low-level primitives and has a common unified async proactive interface on all platforms.</span></p><p><span style="font-size: 12pt; ">The main TProactor components are the Engine and WaitStrategy interfaces. Engine manages the async operations lifecycle. WaitStrategy manages concurrency strategies. WaitStrategy depends on Engine and the two always work in pairs. Interfaces between Engine and WaitStrategy are strongly defined.</span></p><p><span style="font-size: 12pt; ">Engines and waiting strategies are implemented as pluggable class-drivers (for the full list of all implemented Engines and corresponding WaitStrategies, see Appendix 1). TProactor is a highly configurable solution. It internally implements three engines (POSIX AIO, SUN AIO and Emulated AIO) and hides six different waiting strategies, based on an asynchronous kernel API (for POSIX- this is not efficient right now due to internal POSIX AIO API problems) and synchronous Unix&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">,&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">poll()</span></code><span style="font-size: 12pt; ">, /dev/poll (Solaris 5.8+),&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">port_get</span></code><span style="font-size: 12pt; ">&nbsp;(Solaris 5.10), RealTime (RT) signals (Linux 2.4+), epoll (Linux 2.6), k-queue (FreeBSD) APIs. TProactor conforms to the standard ACE Proactor implementation interface. 
That makes it possible to develop a single cross-platform solution (POSIX/MS-WINDOWS) with a common (ACE Proactor) interface.</span></p><p><span style="font-size: 12pt; ">With a set of mutually interchangeable "lego-style" Engines and WaitStrategies, a developer can choose the appropriate internal mechanism (engine and waiting strategy) at run time by setting configuration parameters. These settings may be specified according to specific requirements, such as the number of connections, scalability, and the targeted OS. If the operating system supports an async API, a developer may use the true async approach; otherwise the user can opt for an emulated async solution built on different sync waiting strategies. All of those strategies are hidden behind an emulated async fa&#231;ade.</span></p><p><span style="font-size: 12pt; ">For an HTTP server running on Sun Solaris, for example, the /dev/poll or&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">port_get()</span></code><span style="font-size: 12pt; ">-based engines are the most suitable choice, able to serve a huge number of connections, but for another UNIX solution with a limited number of connections but high throughput requirements, a </span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-based engine may be a better approach. Such flexibility cannot be achieved with a standard ACE Reactor/Proactor, due to inherent algorithmic problems of different wait strategies (see Appendix 2).</span></p><p><span style="font-size: 12pt; ">In terms of performance, our tests show that emulating proactive behavior over a reactive API does not impose any overhead&#8212;it can be faster, but not slower. 
According to our test results, TProactor gives on average up to 10-35% better performance (measured in terms of both throughput and response times) than the reactive model in the standard ACE Reactor implementation on various UNIX/Linux platforms. On Windows it gives the same performance as the standard ACE Proactor.</span></p><h1>Performance comparison (Java versus C++ versus C#)</h1><p><span style="font-size: 12pt; ">In addition to C++, we also implemented TProactor in Java. As of JDK version 1.4, Java provides only the sync-based approach that is logically similar to C&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">&nbsp;[</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">7</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">8</span></a><span style="font-size: 12pt; ">]. Java TProactor is based on Java's non-blocking facilities (java.nio packages) and is logically similar to C++ TProactor with a waiting strategy based on&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">.</span></p><p><span style="font-size: 12pt; ">Figures 1 and 2 chart the transfer rate in bits/sec versus the number of connections. 
These charts represent comparison results for a simple echo-server built on standard ACE Reactor, using RedHat Linux 9.0, TProactor C++ and Java (IBM 1.4JVM) on Microsoft's Windows and RedHat Linux9.0, and a C# echo-server running on the Windows operating system. Performance of native AIO APIs is represented by "Async"-marked curves; by emulated AIO (TProactor)&#8212;AsyncE curves; and by TP_Reactor&#8212;Synch curves. All implementations were bombarded by the same client application&#8212;a continuous stream of arbitrary fixed sized messages via N connections.</span></p><p><span style="font-size: 12pt; ">The full set of tests was performed on the same hardware. Tests on different machines proved that relative results are consistent.</span></p><div style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_1.gif" alt="" /></div><div style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; "><span style="font-size: 12pt; ">Figure 1. Windows XP/P4 2.6GHz HyperThreading/512 MB RAM.</span></div><div style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_2.gif" alt="" /></div><div style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; "><span style="font-size: 12pt; ">Figure 2. Linux RedHat 2.4.20-smp/P4 2.6GHz HyperThreading/512 MB RAM.</span></div><h1>User code example</h1><p><span style="font-size: 12pt; ">The following is the skeleton of a simple TProactor-based Java echo-server. 
In a nutshell, the developer only has to implement the two interfaces:&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">OpRead</span></code><span style="font-size: 12pt; ">&nbsp;with a buffer where TProactor puts its read results, and&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">OpWrite</span></code><span style="font-size: 12pt; ">&nbsp;with a buffer from which TProactor takes data. The developer will also need to implement protocol-specific logic by providing the callbacks&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">onReadCompleted()</span></code><span style="font-size: 12pt; ">&nbsp;and&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">onWriteCompleted()</span></code><span style="font-size: 12pt; ">&nbsp;in the&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">AsynchHandler</span></code><span style="font-size: 12pt; ">&nbsp;interface implementation. 
Those callbacks will be asynchronously called by TProactor on completion of read/write operations and executed on a thread pool provided by TProactor (the developer doesn't need to write his own pool). Note that the skeleton below fixes two slips in the published listing: the read buffer is declared as a field (its size here is illustrative), and a missing closing parenthesis is restored.</span></p><pre style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; padding-left: 38px; "><div><span style="font-size: 12px;">class EchoServerProtocol implements AsynchHandler</span></div><div><span style="font-size: 12px;">{</span></div><div><span style="font-size: 12px;">    AsynchChannel achannel = null;</span></div><div><span style="font-size: 12px;">    ByteBuffer buffer = ByteBuffer.allocate( 4096 ); // read buffer; size is illustrative</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    EchoServerProtocol(Demultiplexor m, SelectableChannel channel) throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">        this.achannel = new AsynchChannel( m, this, channel );</span></div><div><span style="font-size: 12px;">    }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    public void start() throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">        // called after construction</span></div><div><span style="font-size: 12px;">        System.out.println( Thread.currentThread().getName() + ": EchoServer protocol started" );</span></div><div><span style="font-size: 12px;">        achannel.read( buffer );</span></div><div><span style="font-size: 12px;">    }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    public void onReadCompleted( OpRead opRead ) throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">        if (opRead.getError() != null)</span></div><div><span style="font-size: 12px;">        {</span></div><div><span style="font-size: 12px;">            // handle error, do clean-up if needed</span></div><div><span style="font-size: 12px;">            System.out.println( "EchoServer::readCompleted: " + opRead.getError().toString());</span></div><div><span style="font-size: 12px;">            achannel.close();</span></div><div><span style="font-size: 12px;">            return;</span></div><div><span style="font-size: 12px;">        }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">        if (opRead.getBytesCompleted() &lt;= 0)</span></div><div><span style="font-size: 12px;">        {</span></div><div><span style="font-size: 12px;">            System.out.println( "EchoServer::readCompleted: Peer closed " + opRead.getBytesCompleted() );</span></div><div><span style="font-size: 12px;">            achannel.close();</span></div><div><span style="font-size: 12px;">            return;</span></div><div><span style="font-size: 12px;">        }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">        ByteBuffer buffer = opRead.getBuffer();</span></div><div><span style="font-size: 12px;">        achannel.write(buffer);</span></div><div><span style="font-size: 12px;">    }</span></div><div><span style="font-size: 12px;"><br /></span></div><div><span style="font-size: 12px;">    public void onWriteCompleted(OpWrite opWrite) throws Exception</span></div><div><span style="font-size: 12px;">    {</span></div><div><span style="font-size: 12px;">        // logically similar to onReadCompleted ... 
&nbsp; &nbsp;&nbsp;</span></div><div><span style="font-size: 12px;">}</span></div><div><span style="font-size: 12px;">};</span></div></pre><p><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">IOHandler</span></code><span style="font-size: 12pt; ">&nbsp;is a TProactor base class.&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">AsynchHandler</span></code><span style="font-size: 12pt; ">&nbsp;and Multiplexor, among other things, internally execute the wait strategy chosen by the developer.</span></p><h1>Conclusion</h1><p><span style="font-size: 12pt; ">TProactor provides a common, flexible, and configurable solution for multi-platform high-performance communications development. All of the problems and complexities mentioned in Appendix 2 are hidden from the developer.</span></p><p><span style="font-size: 12pt; ">It is clear from the charts that C++ is still the preferable approach for high performance communication solutions, but Java on Linux comes quite close. However, the overall Java performance was weakened by poor results on Windows. One reason for that may be that the Java 1.4 nio package is based on a&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style API. Indeed, the Java NIO package is a kind of Reactor pattern based on a&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style API (see [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">7</span></a><span style="font-size: 12pt; ">,&nbsp;</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">8</span></a><span style="font-size: 12pt; ">]). Java NIO allows you to write your own&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">-style provider (the equivalent of TProactor waiting strategies). Looking at the Java NIO implementation for Windows (it is enough to examine the import symbols in jdk1.5.0\jre\bin\nio.dll), we can conclude that Java NIO 1.4.2 and 1.5.0 for Windows are based on the WSAEventSelect() API. That is better than&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">, but slower than I/O completion ports for a significant number of connections. Should Java's nio be based on IOCompletionPorts, that should improve performance. If Java NIO used IOCompletionPorts, the conversion from the Proactor pattern to the Reactor pattern would have to be made inside nio.dll. Although such a conversion is more complicated than a Reactor-to-Proactor conversion, it can be implemented within the Java NIO interfaces 
(this the topic of next arcticle, but we can provide algorithm). At this time, no TProactor performance tests were done on JDK 1.5.</span></p><p><span style="font-size: 12pt; ">Note. All tests for Java are performed on "raw" buffers (java.nio.ByteBuffer) without data processing.</span></p><p><span style="font-size: 12pt; ">Taking into account the latest activities to develop robust AIO on Linux [</span><a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">9</span></a><span style="font-size: 12pt; ">], we can conclude that Linux Kernel API (io_xxxx set of system calls) should be more scalable in comparison with POSIX standard, but still not portable. In this case, TProactor with new Engine/Wait Strategy pair, based on native LINUX AIO can be easily implemented to overcome portability issues and to cover Linux native AIO with standard ACE Proactor interface.</span></p><h1>Appendix I</h1><p><span style="font-size: 12pt; ">Engines and waiting strategies implemented in TProactor</span></p><p>&nbsp;</p><center><table border="1"><tbody><tr bgcolor="#CCCCFF"><th>Engine Type</th><th>Wait Strategies</th><th>Operating System</th></tr><tr></tr><tr valign="top"><td>POSIX_AIO (true async)<br /><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_suspend()<br />Waiting for RT signal<br />Callback function</code></td><td>POSIX complained UNIX (not robust)<br />POSIX (not robust)<br />SGI IRIX, LINUX (not robust)</td></tr><tr valign="top"><td>SUN_AIO (true async)<br /><code style="font-family: 'Lucida Console', 'American 
Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_wait()</code></td><td>SUN (not robust)</td></tr><tr valign="top"><td>Emulated Async<br />Non-blocking&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code><br /><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code><br />/dev/poll<br />Linux RT signals<br />Kqueue</td><td>generic POSIX<br />Almost all POSIX implementations<br />SUN<br />Linux<br />FreeBSD<br /></td></tr></tbody></table></center><h1>Appendix II</h1><p><span style="font-size: 12pt; ">All sync waiting strategies can be divided into two groups:</span></p><ul><li><span style="font-size: 12pt; ">edge-triggered (e.g. 
Linux RT signals)&#8212;signals readiness only when the socket becomes ready (changes state);</span></li><li><span style="font-size: 12pt; ">level-triggered (e.g.&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">select()</span></code><span style="font-size: 12pt; ">,&nbsp;</span><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; "><span style="font-size: 12pt; ">poll()</span></code><span style="font-size: 12pt; ">, /dev/poll)&#8212;signals readiness at any time while the socket remains ready.</span></li></ul><p><span style="font-size: 12pt; ">Let us describe some common logical problems for those groups:</span></p><ul><li><span style="font-size: 12pt; ">edge-triggered group: after executing an I/O operation, the demultiplexing loop can lose the socket's readiness state. Example: if the "read" handler does not read the whole chunk of data, the socket remains ready for read, but the demultiplexor loop will not receive the next notification.</span></li><li><span style="font-size: 12pt; ">level-triggered group: when the demultiplexor loop detects readiness, it starts the user-defined read/write handler. But before the start, it should remove the socket descriptor from the set of monitored descriptors. Otherwise, the same event can be dispatched twice.</span></li><li><span style="font-size: 12pt; ">Obviously, solving these problems adds extra complexity to development. All of these problems are resolved internally within TProactor, so the developer does not need to worry about those details, while in the sync approach one needs to apply extra effort to resolve them.</span></li></ul><a name="resources"><h1>Resources</h1></a><p><span style="font-size: 12pt; ">[1] Douglas C. Schmidt, Stephen D. Huston, "C++ Network Programming," 2002, Addison-Wesley, ISBN 0-201-60464-7</span><br /></p><p><span style="font-size: 12pt; ">[2] W. 
Richard Stevens, "UNIX Network Programming," vol. 1 and 2, 1999, Prentice Hall, ISBN 0-13-490012-X&nbsp;</span><br /></p><p><span style="font-size: 12pt; ">[3] Douglas C. Schmidt, Michael Stal, Hans Rohnert, Frank Buschmann, "Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2," Wiley &amp; Sons, NY 2000</span><br /></p><p><span style="font-size: 12pt; ">[4] INFO: Socket Overlapped I/O Versus Blocking/Non-blocking Mode. Q181611. Microsoft Knowledge Base Articles.</span><br /></p><p><span style="font-size: 12pt; ">[5] Microsoft MSDN. I/O Completion Ports.</span><br /><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp</span></a></p><p><span style="font-size: 12pt; ">[6] TProactor (ACE compatible Proactor).</span><br /><a href="http://www.artima.com/articles/www.terabit.com.au" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">www.terabit.com.au</span></a></p><p><span style="font-size: 12pt; ">[7] JavaDoc java.nio.channels</span><br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html</span></a></p><p><span style="font-size: 12pt; ">[8] JavaDoc java.nio.channels.spi Class SelectorProvider&nbsp;</span><br /><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html</span></a></p><p><span style="font-size: 12pt; ">[9] Linux AIO development&nbsp;</span><br /><a 
href="http://lse.sourceforge.net/io/aio.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://lse.sourceforge.net/io/aio.html</span></a><span style="font-size: 12pt; ">, and</span><br /><a href="http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf</span></a></p><p><span style="font-size: 12pt; ">See Also:</span></p><p><span style="font-size: 12pt; ">Ian Barile, "I/O Multiplexing &amp; Scalable Socket Servers," February 2004, DDJ&nbsp;</span><br /></p><p><span style="font-size: 12pt; ">Further reading on event handling</span><br /><a href="http://www.cs.wustl.edu/~schmidt/ACE-papers.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://www.cs.wustl.edu/~schmidt/ACE-papers.html</span></a></p><p><span style="font-size: 12pt; ">The Adaptive Communication Environment</span><br /><a href="http://www.cs.wustl.edu/~schmidt/ACE.html" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://www.cs.wustl.edu/~schmidt/ACE.html</span></a></p><p><span style="font-size: 12pt; ">Terabit Solutions</span><br /><a href="http://terabit.com.au/solutions.php" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">http://terabit.com.au/solutions.php</span></a></p><h1>About the authors</h1><p><span style="font-size: 12pt; ">Alex Libman has been programming for 15 years. For the past five years, his main area of interest has been pattern-oriented multi-platform networked programming using C++ and Java. 
He is a big fan of, and a contributor to, ACE.</span></p><p><span style="font-size: 12pt; ">Vlad Gilbourd works as a computer consultant, but wishes to spend more time listening to jazz :) As a hobby, he started and runs the&nbsp;</span><a href="http://www.corporatenews.com.au/" style="color: #800080; text-decoration: none; "><span style="font-size: 12pt; ">www.corporatenews.com.au</span></a><span style="font-size: 12pt; ">&nbsp;website.</span></p><br /><br />from:<br /><a href="http://www.artima.com/articles/io_design_patterns.html">http://www.artima.com/articles/io_design_patterns.html</a> <p>&nbsp;</p><img src ="http://www.cppblog.com/beautykingdom/aggbug/175576.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2012-05-21 11:24 <a href="http://www.cppblog.com/beautykingdom/archive/2012/05/21/175576.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Comparing Two High-Performance I/O Design Patterns</title><link>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 08 Sep 2010 09:20:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/126175.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/126175.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/126175.html</trackback:ping><description><![CDATA[<div class="summary" style="padding-left: 39px; padding-right: 39px; padding-bottom: 1em; color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; "><div class="summarytitle" style="font-weight: normal; 
font-size: 1.25em; ">Summary</div>This article investigates and compares different design patterns of high-performance TCP-based servers. In addition to existing approaches, it proposes a scalable single-codebase, multi-platform solution (with code examples) and describes its fine-tuning on different platforms. It also compares the performance of Java, C# and C++ implementations of proposed and existing solutions.</div><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">System I/O can be blocking, or non-blocking synchronous, or non-blocking asynchronous [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">2</a>]. Blocking I/O means that the calling system does not return control to the caller until the operation is finished. As a result, the caller is blocked and cannot perform other activities during that time. Most importantly, the caller thread cannot be reused for other request processing while waiting for the I/O to complete, and becomes a wasted resource during that time. For example, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in blocking mode will not return control if the socket buffer is empty until some data becomes available.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">By contrast, a non-blocking synchronous call returns control to the caller immediately. The caller is not made to wait, and the invoked system immediately returns one of two responses: if the call was executed and the results are ready, then the caller is told of that. 
Alternatively, the invoked system can tell the caller that the system has no resources (no data in the socket) to perform the requested action. In that case, it is the responsibility of the caller to repeat the call until it succeeds. For example, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>&nbsp;operation on a socket in non-blocking mode may return the number of bytes read, or a special return code of -1 with errno set to&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">EWOULDBLOCK/EAGAIN</code>, meaning "not ready; try again later."</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">In a non-blocking asynchronous call, the calling function returns control to the caller immediately, reporting that the requested action was started. The calling system will execute the caller's request using additional system resources/threads and will notify the caller (by callback for example), when the result is ready for processing. For example, a Windows&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">ReadFile()</code>&nbsp;or POSIX&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>&nbsp;API returns immediately and initiates an internal system read operation. Of the three approaches, this non-blocking asynchronous approach offers the best scalability and performance.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; ">This article investigates different non-blocking I/O multiplexing mechanisms and proposes a single multi-platform design pattern/solution. We hope that this article will help developers of high-performance TCP-based servers to choose an optimal design solution. 
We also compare the performance of Java, C# and C++ implementations of proposed and existing solutions. We will exclude the blocking approach from further discussion and comparison, as it is the least effective approach in terms of scalability and performance.</p><p style="color: rgb(33, 35, 36); font-family: Arial, Helvetica, sans-serif; "><p>In general, I/O multiplexing mechanisms rely on an event demultiplexor [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">3</a>], an object that dispatches I/O events from a limited number of sources to the appropriate read/write event handlers. The developer registers interest in specific events and provides event handlers, or callbacks. The event demultiplexor delivers the requested events to the event handlers.</p><p>Two patterns that involve event demultiplexors are called Reactor and Proactor [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>]. The Reactor pattern involves synchronous I/O, whereas the Proactor pattern involves asynchronous I/O. In Reactor, the event demultiplexor waits for events that indicate when a file descriptor or socket is ready for a read or write operation. The demultiplexor passes this event to the appropriate handler, which is responsible for performing the actual read or write.</p><p>In the Proactor pattern, by contrast, the handler—or the event demultiplexor on behalf of the handler—initiates asynchronous read and write operations. The I/O operation itself is performed by the operating system (OS). The parameters passed to the OS include the addresses of user-defined data buffers from which the OS gets data to write, or to which the OS puts data read. 
The event demultiplexor waits for events that indicate the completion of the I/O operation, and forwards those events to the appropriate handlers. For example, on Windows a handler could initiate async I/O (overlapped in Microsoft terminology) operations, and the event demultiplexor could wait for IOCompletion events [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>]. The implementation of this classic asynchronous pattern is based on an asynchronous OS-level API, and we will call this implementation the "system-level" or "true" async, because the application fully relies on the OS to execute actual I/O.</p><p>An example will help you understand the difference between Reactor and Proactor. We will focus on the read operation here, as the write implementation is similar. Here's a read in Reactor:</p><ul><li>An event handler declares interest in I/O events that indicate readiness for read on a particular socket</li><li>The event demultiplexor waits for events</li><li>An event comes in and wakes up the demultiplexor, and the demultiplexor calls the appropriate handler</li><li>The event handler performs the actual read operation, handles the data read, declares renewed interest in I/O events, and returns control to the dispatcher</li></ul><p>By comparison, here is a read operation in Proactor (true async):</p><ul><li>A handler initiates an asynchronous read operation (note: the OS must support asynchronous I/O). 
In this case, the handler does not care about I/O readiness events, but instead registers interest in receiving completion events.</li><li>The event demultiplexor waits until the operation is completed</li><li>While the event demultiplexor waits, the OS executes the read operation in a parallel kernel thread, puts data into a user-defined buffer, and notifies the event demultiplexor that the read is complete</li><li>The event demultiplexor calls the appropriate handler</li><li>The event handler handles the data from the user-defined buffer, starts a new asynchronous operation, and returns control to the event demultiplexor.</li></ul><h1 style="font-weight: normal; font-size: 28px; ">Current practice</h1><p>The open-source C++ development framework ACE [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">1</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">3</a>] developed by Douglas Schmidt, et al., offers a wide range of platform-independent, low-level concurrency support classes (threading, mutexes, etc). At the top level it provides two separate groups of classes: implementations of the ACE Reactor and ACE Proactor. Although both of them are based on platform-independent primitives, these tools offer different interfaces.</p><p>The ACE Proactor gives much better performance and robustness on MS-Windows, as Windows provides a very efficient async API, based on operating-system-level support [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">4</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">5</a>].</p><p>Unfortunately, not all operating systems provide full robust async OS-level support. 
For instance, many Unix systems do not. Therefore, the ACE Reactor is the preferable solution on UNIX (currently UNIX does not have robust async facilities for sockets). As a result, to achieve the best performance on each system, developers of networked applications need to maintain two separate code-bases: an ACE Proactor based solution on Windows and an ACE Reactor based solution for Unix-based systems.</p><p>As we mentioned, the true async Proactor pattern requires operating-system-level support. Due to the differing nature of event handler and operating-system interaction, it is difficult to create common, unified external interfaces for both Reactor and Proactor patterns. That, in turn, makes it hard to create a fully portable development framework and encapsulate the interface and OS-related differences.</p><h1 style="font-weight: normal; font-size: 28px; ">Proposed solution</h1><p>In this section, we will propose a solution to the challenge of designing a portable framework for the Proactor and Reactor I/O patterns. To demonstrate this solution, we will transform a Reactor demultiplexor I/O solution to an emulated async I/O by moving the read/write operations from the event handlers into the demultiplexor (this is the "emulated async" approach). The following example illustrates that conversion for a read operation:</p><blockquote><ul><li>An event handler declares interest in I/O events (readiness for read) and provides the demultiplexor with information such as the address of a data buffer, or the number of bytes to read.</li><li>The dispatcher waits for events (for example, on&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>);</li><li>When an event arrives, it wakes up the dispatcher. 
The dispatcher performs a non-blocking read operation (it has all the necessary information to perform this operation) and on completion calls the appropriate handler.</li><li>The event handler handles data from the user-defined buffer, then declares renewed interest in I/O events, along with information about where to put the data buffer and the number of bytes to read. The event handler then returns control to the dispatcher.</li></ul></blockquote><p>As we can see, by adding functionality to the demultiplexor I/O pattern, we were able to convert the Reactor pattern to a Proactor pattern. In terms of the amount of work performed, this approach is exactly the same as the Reactor pattern. There is no performance degradation, because the amount of work performed is still the same; we simply shifted responsibilities between different actors. The following lists of steps demonstrate that each approach performs an equal amount of work:</p><p>Standard/classic Reactor:</p><ul><li>Step 1) wait for event (Reactor job)</li><li>Step 2) dispatch "Ready-to-Read" event to user handler (Reactor job)</li><li>Step 3) read data (user handler job)</li><li>Step 4) process data (user handler job)</li></ul><p>Proposed emulated Proactor:</p><ul><li>Step 1) wait for event (Proactor job)</li><li>Step 2) read data (now Proactor job)</li><li>Step 3) dispatch "Read-Completed" event to user handler (Proactor job)</li><li>Step 4) process data (user handler job)</li></ul><p>With an operating system that does not provide an async I/O API, this approach allows us to hide the reactive nature of available socket APIs and to expose a fully proactive async interface. 
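</p><p>The conversion steps above can be sketched in code. The following is a minimal, illustrative Java sketch of such an emulated-async dispatcher built on java.nio; the ReadCompletionHandler interface and the class name are our own assumptions for illustration, not TProactor's actual API:</p>

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical completion callback; the name is ours, not TProactor's.
interface ReadCompletionHandler {
    void onReadCompleted(ByteBuffer buffer, int bytesRead) throws IOException;
}

// A dispatcher that emulates async reads on top of a readiness API:
// it owns the read operation and dispatches "Read-Completed" events.
class EmulatedProactorDispatcher {
    private final Selector selector;

    EmulatedProactorDispatcher(Selector selector) {
        this.selector = selector;
    }

    // Step 1 of the conversion: the handler declares interest and hands
    // over the destination buffer up front.
    void asyncRead(SocketChannel channel, ByteBuffer buffer,
                   ReadCompletionHandler handler) throws IOException {
        channel.configureBlocking(false);
        channel.register(selector, SelectionKey.OP_READ,
                         new Object[] { buffer, handler });
    }

    // Steps 2 and 3: wait for readiness, perform the non-blocking read
    // ourselves, then dispatch a completion (not readiness) event.
    void dispatchOnce() throws IOException {
        selector.select();
        for (SelectionKey key : selector.selectedKeys()) {
            if (key.isReadable()) {
                Object[] ctx = (Object[]) key.attachment();
                ByteBuffer buffer = (ByteBuffer) ctx[0];
                ReadCompletionHandler handler = (ReadCompletionHandler) ctx[1];
                key.interestOps(0); // avoid dispatching the same event twice
                int n = ((SocketChannel) key.channel()).read(buffer);
                handler.onReadCompleted(buffer, n);
            }
        }
        selector.selectedKeys().clear();
    }
}
```

Note how the user-supplied callback only ever sees completed reads, which is exactly the proactive facade described above.<p>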
This allows us to create a fully portable platform-independent solution with a common external interface.</p><h1 style="font-weight: normal; font-size: 28px; ">TProactor</h1><p>The proposed solution (TProactor) was developed and implemented at Terabit P/L [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">6</a>]. The solution has two alternative implementations, one in C++ and one in Java. The C++ version was built using ACE cross-platform low-level primitives and has a common unified async proactive interface on all platforms.</p><p>The main TProactor components are the Engine and WaitStrategy interfaces. Engine manages the async operations lifecycle. WaitStrategy manages concurrency strategies. WaitStrategy depends on Engine and the two always work in pairs. Interfaces between Engine and WaitStrategy are strongly defined.</p><p>Engines and waiting strategies are implemented as pluggable class-drivers (for the full list of all implemented Engines and corresponding WaitStrategies, see Appendix 1). TProactor is a highly configurable solution. It internally implements three engines (POSIX AIO, SUN AIO and Emulated AIO) and hides six different waiting strategies, based on an asynchronous kernel API (for POSIX, this is currently not efficient due to internal POSIX AIO API problems) and synchronous Unix&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>,&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code>, /dev/poll (Solaris 5.8+),&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">port_get</code>&nbsp;(Solaris 5.10), RealTime (RT) signals (Linux 2.4+), epoll (Linux 2.6), k-queue (FreeBSD) APIs. 
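</p><p>As a rough sketch of how the Engine/WaitStrategy pairing described above fits together (the interface and method names here are illustrative guesses for exposition, not TProactor's real signatures):</p>

```java
// Illustrative sketch only: TProactor's real Engine/WaitStrategy
// interfaces are richer and strongly defined; these names are ours.
interface Engine {
    // Begin an async (or emulated-async) operation lifecycle for a handle.
    void startRead(int handle);
    // Called back by the paired wait strategy when an event fires.
    void onEvent(int handle);
}

interface WaitStrategy {
    // Engine and WaitStrategy always work in pairs.
    void attach(Engine engine);
    // Block in select()/poll()/aio_suspend()/... and forward events.
    void waitForEvents();
}

// A trivial pair showing the control flow between the two halves.
class LoggingEngine implements Engine {
    final StringBuilder log = new StringBuilder();
    public void startRead(int handle) { log.append("start:").append(handle).append(' '); }
    public void onEvent(int handle)  { log.append("event:").append(handle).append(' '); }
}

class ImmediateWaitStrategy implements WaitStrategy {
    private Engine engine;
    public void attach(Engine engine) { this.engine = engine; }
    // Pretend handle 7 became ready immediately.
    public void waitForEvents() { engine.onEvent(7); }
}
```

Swapping in a different WaitStrategy implementation (e.g. one built on epoll rather than select) leaves the Engine untouched, which is the "lego-style" interchangeability the article describes next.<p>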
TProactor conforms to the standard ACE Proactor implementation interface. That makes it possible to develop a single cross-platform solution (POSIX/MS-WINDOWS) with a common (ACE Proactor) interface.</p><p>With a set of mutually interchangeable "lego-style" Engines and WaitStrategies, a developer can choose the appropriate internal mechanism (engine and waiting strategy) at run time by setting appropriate configuration parameters. These settings may be specified according to specific requirements, such as the number of connections, scalability, and the targeted OS. If the operating system supports an async API, a developer may use the true async approach; otherwise the user can opt for an emulated async solution built on different sync waiting strategies. All of those strategies are hidden behind an emulated async fa&#231;ade.</p><p>For an HTTP server running on Sun Solaris, for example, the /dev/poll or&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">port_get()</code>-based engine is the most suitable choice, able to serve a huge number of connections, but for another UNIX solution with a limited number of connections but high throughput requirements, a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-based engine may be a better approach. Such flexibility cannot be achieved with a standard ACE Reactor/Proactor, due to inherent algorithmic problems of different wait strategies (see Appendix 2).</p><p>In terms of performance, our tests show that emulating from reactive to proactive does not impose any overhead—it can be faster, but not slower. According to our test results, the TProactor gives on average up to 10-35% better performance (measured in terms of both throughput and response times) than the reactive model in the standard ACE Reactor implementation on various UNIX/Linux platforms. 
On Windows it gives the same performance as the standard ACE Proactor.</p><h1 style="font-weight: normal; font-size: 28px; ">Performance comparison (Java versus C++ versus C#)</h1><p>In addition to C++, we also implemented TProactor in Java. As of JDK version 1.4, Java provides only the sync-based approach that is logically similar to C&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>&nbsp;[<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">7</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">8</a>]. Java TProactor is based on Java's non-blocking facilities (java.nio packages) and is logically similar to C++ TProactor with a waiting strategy based on&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>.</p><p>Figures 1 and 2 chart the transfer rate in bits/sec versus the number of connections. These charts represent comparison results for a simple echo-server built on the standard ACE Reactor, using RedHat Linux 9.0, TProactor C++ and Java (IBM 1.4 JVM) on Microsoft's Windows and RedHat Linux 9.0, and a C# echo-server running on the Windows operating system. Performance of native AIO APIs is represented by "Async"-marked curves; by emulated AIO (TProactor)—AsyncE curves; and by TP_Reactor—Synch curves. All implementations were bombarded by the same client application—a continuous stream of arbitrary fixed-sized messages via N connections.</p><p>The full set of tests was performed on the same hardware. 
Tests on different machines proved that relative results are consistent.</p><div class="figure" style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_1.gif"></div><div class="figurecaption" style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; ">Figure 1. Windows XP/P4 2.6GHz HyperThreading/512 MB RAM.</div><div class="figure" style="margin-top: 0.75em; text-align: center; "><img src="http://www.artima.com/articles/images/io_dp_fig_2.gif"></div><div class="figurecaption" style="margin-top: 0.75em; margin-bottom: 0.75em; text-align: center; font-weight: bold; ">Figure 2. Linux RedHat 2.4.20-smp/P4 2.6GHz HyperThreading/512 MB RAM.</div><h1 style="font-weight: normal; font-size: 28px; ">User code example</h1><p>The following is the skeleton of a simple TProactor-based Java echo-server. In a nutshell, the developer only has to implement the two interfaces: <code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">OpRead</code>&nbsp;with a buffer where TProactor puts its read results, and&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">OpWrite</code>&nbsp;with a buffer from which TProactor takes data. The developer will also need to implement protocol-specific logic by providing the callbacks&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">onReadCompleted()</code>&nbsp;and&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">onWriteCompleted()</code>&nbsp;in the&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">AsynchHandler</code>&nbsp;interface implementation. 
Those callbacks are called asynchronously by TProactor on completion of read/write operations and are executed on a thread pool provided by TProactor (the developer doesn't need to write his own pool).</p><pre class="indent" style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.88em; padding-left: 39px; ">class EchoServerProtocol implements AsynchHandler
{

    AsynchChannel achannel = null;
    ByteBuffer buffer = ByteBuffer.allocate( 4096 ); // read buffer used in start(); size is illustrative

    EchoServerProtocol( Demultiplexor m,  SelectableChannel channel ) throws Exception
    {
        this.achannel = new AsynchChannel( m, this, channel );
    }

    public void start() throws Exception
    {
        // called after construction
        System.out.println( Thread.currentThread().getName() + ": EchoServer protocol started" );
        achannel.read( buffer);
    }

    public void onReadCompleted( OpRead opRead ) throws Exception
    {
        if ( opRead.getError() != null )
        {
            // handle error, do clean-up if needed
            System.out.println( "EchoServer::readCompleted: " + opRead.getError().toString() );
            achannel.close();
            return;
        }

        if ( opRead.getBytesCompleted() &lt;= 0 )
        {
            System.out.println( "EchoServer::readCompleted: Peer closed " + opRead.getBytesCompleted() );
            achannel.close();
            return;
        }

        ByteBuffer buffer = opRead.getBuffer();

        achannel.write(buffer);
    }

    public void onWriteCompleted(OpWrite opWrite) throws Exception
    {
        // logically similar to onReadCompleted
        ...
    }
}
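
// A hypothetical bootstrap, for orientation only (TProactor's real
// start-up API may differ from this sketch):
//
//     Demultiplexor m = ...;                         // runs the chosen wait strategy
//     SelectableChannel channel = ...;               // an accepted connection
//     new EchoServerProtocol( m, channel ).start();  // begins the read/echo chain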
</pre><p><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">IOHandler</code>&nbsp;is a TProactor base class.&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">AsynchHandler</code>&nbsp;and Multiplexor, among other things, internally execute the wait strategy chosen by the developer.</p><h1 style="font-weight: normal; font-size: 28px; ">Conclusion</h1><p>TProactor provides a common, flexible, and configurable solution for multi-platform high-performance communications development. All of the problems and complexities mentioned in Appendix II are hidden from the developer.</p><p>It is clear from the charts that C++ is still the preferable approach for high-performance communication solutions, but Java on Linux comes quite close. However, the overall Java performance was weakened by poor results on Windows. One reason for that may be that the Java 1.4 nio package is based on a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style API. Indeed, the Java NIO package is a kind of Reactor pattern built on a&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style API (see [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">7</a>,&nbsp;<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">8</a>]). Java NIO allows you to write your own&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>-style provider (the equivalent of TProactor's waiting strategies). 
Looking at the Java NIO implementation for Windows (it is enough to examine the import symbols in jdk1.5.0\jre\bin\nio.dll), we can conclude that Java NIO 1.4.2 and 1.5.0 for Windows are based on the WSAEventSelect() API. That is better than&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>, but slower than I/O completion ports for a significant number of connections. Should a later version of Java's nio be based on I/O completion ports, performance on Windows should improve; the conversion of the Proactor pattern to the Reactor pattern would then have to be made inside nio.dll. Although such a conversion is more complicated than the Reactor-to-Proactor conversion, it can be implemented within the frame of the Java NIO interfaces (that is a topic for a later article, but we can provide the algorithm). At this time, no TProactor performance tests have been done on JDK 1.5.</p><p>Note: all tests for Java were performed on "raw" buffers (java.nio.ByteBuffer) without data processing.</p><p>Taking into account the latest activities to develop robust AIO on Linux [<a href="http://www.artima.com/articles/io_design_patterns3.html#resources" style="color: rgb(0, 51, 153); text-decoration: none; ">9</a>], we can conclude that the Linux kernel API (the io_xxxx set of system calls) should be more scalable than the POSIX-standard AIO, but it is still not portable. 
In this case, a TProactor with a new Engine/Wait Strategy pair based on native Linux AIO can easily be implemented to overcome the portability issue and to wrap Linux native AIO in the standard ACE Proactor interface.</p><h1 style="font-weight: normal; font-size: 28px; ">Appendix I</h1><p>Engines and waiting strategies implemented in TProactor</p><p>&#160;</p><center><table border="1"><tbody><tr bgcolor="#CCCCFF"><th>Engine Type</th><th>Wait Strategies</th><th>Operating System</th></tr><tr valign="top"><td>POSIX_AIO (true async)<br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_suspend()</code><br>Waiting for RT signal<br>Callback function</td><td>POSIX-compliant UNIX (not robust)<br>POSIX (not robust)<br>SGI IRIX, Linux (not robust)</td></tr><tr valign="top"><td>SUN_AIO (true async)<br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_write()</code></td><td><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">aio_wait()</code></td><td>SUN (not robust)</td></tr><tr valign="top"><td>Emulated Async<br>Non-blocking&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">read()</code>/<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">write()</code></td><td><code 
style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code><br><code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code><br>/dev/poll<br>Linux RT signals<br>Kqueue</td><td>generic POSIX<br>Mostly all POSIX implementations<br>SUN<br>Linux<br>FreeBSD<br></td></tr></tbody></table></center><h1 style="font-weight: normal; font-size: 28px; ">Appendix II</h1><p>All synchronous waiting strategies can be divided into two groups:</p><ul><li>edge-triggered (e.g., Linux RT signals)—signals readiness only when the socket changes state (becomes ready);</li><li>level-triggered (e.g.,&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">select()</code>,&nbsp;<code style="font-family: 'Lucida Console', 'American Typewriter', 'Courier New', Courier, monospace; font-size: 0.95em; ">poll()</code>, /dev/poll)—signals readiness at any time while the socket is ready.</li></ul><p>Let us describe some common logical problems for these groups:</p><ul><li>edge-triggered group: after executing an I/O operation, the demultiplexing loop can lose the socket's readiness state. Example: the "read" handler did not read the whole chunk of data, so the socket is still ready for reading, but the demultiplexing loop will not receive the next notification.</li><li>level-triggered group: when the demultiplexing loop detects readiness, it starts the user-defined read/write handler; before doing so, it should remove the socket descriptor from the set of monitored descriptors. Otherwise, the same event can be dispatched twice.</li><li>Obviously, solving these problems adds extra complexity to development. 
All of these problems are resolved internally within TProactor; the developer does not have to worry about those details, whereas in the synchronous approach one must apply extra effort to resolve them.</li></ul><a name="resources"><h1 style="font-weight: normal; font-size: 28px; ">Resources</h1></a><p>[1] Douglas C. Schmidt and Stephen D. Huston, "C++ Network Programming," Addison-Wesley, 2002, ISBN 0-201-60464-7<br></p><p>[2] W. Richard Stevens, "UNIX Network Programming," vols. 1 and 2, Prentice Hall, 1999, ISBN 0-13-490012-X<br></p><p>[3] Douglas C. Schmidt, Michael Stal, Hans Rohnert, and Frank Buschmann, "Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2," Wiley &amp; Sons, NY, 2000<br></p><p>[4] INFO: Socket Overlapped I/O Versus Blocking/Non-blocking Mode, Q181611, Microsoft Knowledge Base Articles<br></p><p>[5] Microsoft MSDN, I/O Completion Ports<br><a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp" style="color: rgb(0, 51, 153); text-decoration: none; ">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp</a></p><p>[6] TProactor (ACE-compatible Proactor)<br><a href="http://www.terabit.com.au" style="color: rgb(0, 51, 153); text-decoration: none; ">www.terabit.com.au</a></p><p>[7] JavaDoc, java.nio.channels<br><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html</a></p><p>[8] JavaDoc, java.nio.channels.spi, class SelectorProvider<br><a href="http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html</a></p><p>[9] Linux AIO development<br><a 
href="http://lse.sourceforge.net/io/aio.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://lse.sourceforge.net/io/aio.html</a>, and<br><a href="http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf" style="color: rgb(0, 51, 153); text-decoration: none; ">http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf</a></p><p>See also:</p><p>Ian Barile, "I/O Multiplexing &amp; Scalable Socket Servers," DDJ, February 2004<br></p><p>Further reading on event handling<br><a href="http://www.cs.wustl.edu/~schmidt/ACE-papers.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://www.cs.wustl.edu/~schmidt/ACE-papers.html</a></p><p>The Adaptive Communication Environment<br><a href="http://www.cs.wustl.edu/~schmidt/ACE.html" style="color: rgb(0, 51, 153); text-decoration: none; ">http://www.cs.wustl.edu/~schmidt/ACE.html</a></p><p>Terabit Solutions<br><a href="http://terabit.com.au/solutions.php" style="color: rgb(0, 51, 153); text-decoration: none; ">http://terabit.com.au/solutions.php</a></p><h1 style="font-weight: normal; font-size: 28px; ">About the authors</h1><p>Alex Libman has been programming for 15 years. During the past five years his main area of interest has been pattern-oriented, multi-platform networked programming in C++ and Java. He is a big fan of, and a contributor to, ACE.</p><p>Vlad Gilbourd works as a computer consultant, but wishes he could spend more time listening to jazz :) As a hobby, he started and runs the <a href="http://www.corporatenews.com.au/" style="color: rgb(0, 51, 153); text-decoration: none; ">www.corporatenews.com.au</a>&nbsp;website.</p></p><p>from:</p><p><a href="http://www.artima.com/articles/io_design_patterns.html">http://www.artima.com/articles/io_design_patterns.html</a></p></p>
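Since the conclusion turns on Java NIO being a select()-style Reactor, here is a minimal, self-contained sketch of that readiness-dispatch loop. The class name MiniReactor and the loopback round-trip are ours for illustration only; TProactor's real wait strategies and handler interfaces are far more elaborate.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class MiniReactor {

    // Run one accept/read/echo cycle through a Selector and return the echoed reply.
    static String echoOnce(String msg) throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
            try (SocketChannel client = SocketChannel.open(
                    new InetSocketAddress("127.0.0.1", port))) {
                client.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));

                boolean echoed = false;
                while (!echoed) {
                    selector.select(); // block until some channel is ready
                    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                    while (it.hasNext()) {
                        SelectionKey key = it.next();
                        it.remove(); // consume the readiness event
                        if (key.isAcceptable()) {
                            SocketChannel ch = server.accept();
                            ch.configureBlocking(false);
                            ch.register(selector, SelectionKey.OP_READ);
                        } else if (key.isReadable()) {
                            SocketChannel ch = (SocketChannel) key.channel();
                            ByteBuffer buf = ByteBuffer.allocate(512);
                            ch.read(buf);
                            buf.flip();
                            ch.write(buf); // echo back to the client
                            ch.close();
                            echoed = true;
                        }
                    }
                }
                ByteBuffer reply = ByteBuffer.allocate(512);
                client.read(reply); // blocking read on the client side
                reply.flip();
                return StandardCharsets.UTF_8.decode(reply).toString();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(echoOnce("ping"));
    }
}
```

Note how the loop must re-check readiness on every pass: this is exactly the level-triggered behavior discussed in Appendix II, and what a select()-style provider must map onto a completion-style interface to emulate a Proactor.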
<img src ="http://www.cppblog.com/beautykingdom/aggbug/126175.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-09-08 17:20 <a href="http://www.cppblog.com/beautykingdom/archive/2010/09/08/126175.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>一个基于完成端口的TCP Server Framework,浅析IOCP</title><link>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 25 Aug 2010 12:42:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/124731.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/124731.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/124731.html</trackback:ping><description><![CDATA[<span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">如果你不投递（POST）Overlapped&nbsp;I/O，那么I/O Completion&nbsp;Ports&nbsp;只能为你提供一个Queue.&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp; CreateIoCompletionPort的NumberOfConcurrentThreads：</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 
19px; ">1. It takes effect only when the second parameter, ExistingCompletionPort, is NULL; it is a limit on the maximum number of concurrently running threads.</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">2. Has anyone set it to a value beyond the number of CPUs? Not just twice the CPU count, but the MAX_THREADS of 100 below, or even larger.</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">As for choosing this value, MSDN never says it must be twice the number of CPUs, nor does it bring in arguments about reducing context switches between threads. From the MSDN "I/O Completion Ports" page: "If your transaction required a</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>lengthy computation</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">, a</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>larger</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>concurrency value</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">will allow more threads to run. Each completion packet may take longer to finish,</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>but more completion packets will be processed at the same time</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">. "。</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp; 对于struct OVERLAPPED，我们常会如下扩展，</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct {</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WSAOVERLAPPED</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>overlapped</strong></span><span  style="color: rgb(75, 75, 75); font-family: 
verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">; //</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><span style="line-height: 18px; color: rgb(255, 0, 0); ">must be first member</span>?</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;是的，必须是第一个。如果你不肯定，你可以试试。</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; SOCKET client_s;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; SOCKADDR_IN client_addr;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WORD</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>optCode</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">;//</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
"><strong>1--read,2--send.</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; 有人常会定义这个数据成员，但也有人不用，争议在send/WSASend,此时的同步和异步是否有必要？&nbsp;至少我下面的server更本就没用它。</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; char buf[MAX_BUF_SIZE];</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; WSABUF</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>wsaBuf</strong></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">;//</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><span style="line-height: 18px; color: rgb(255, 0, 0); "><strong>inited ?</strong></span></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; 这个不要忘了！</span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; DWORD numberOfBytesTransferred;</span><span  
style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><br></span><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp; DWORD flags;&nbsp;&nbsp;&nbsp;</span><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">}QSSOverlapped;//<strong>for per connection<br></strong>我下面的server框架的基本思想是:<br>One connection&nbsp;VS one thread in worker thread pool&nbsp;,worker thread performs completionWorkerRoutine.<br>A&nbsp;Acceptor thread 专门用来accept socket,关联至IOCP,并WSARecv:post Recv Completion Packet to IOCP.<br>在completionWorkerRoutine中有以下的职责:<br>1.handle request,当忙时增加completionWorkerThread数量但不超过maxThreads,post Recv Completion Packet to IOCP.<br>2.timeout时检查是否空闲和当前completionWorkerThread数量,当空闲时保持或减少至minThreads数量.<br>3.对所有Accepted-socket管理生命周期,这里利用系统的keepalive probes,若想实现业务层"心跳探测"只需将QSS_SIO_KEEPALIVE_VALS_TIMEOUT 改回系统默认的2小时.<br><strong>下面结合源代码,浅析一下IOCP</strong>:<br><strong>socketserver.h<br></strong>#ifndef __Q_SOCKET_SERVER__<br>#define __Q_SOCKET_SERVER__<br>#include &lt;winsock2.h&gt;<br>#include &lt;mstcpip.h&gt;<br>#define QSS_SIO_KEEPALIVE_VALS_TIMEOUT 30*60*1000<br>#define QSS_SIO_KEEPALIVE_VALS_INTERVAL 5*1000</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#define MAX_THREADS 100<br>#define MAX_THREADS_MIN&nbsp; 10<br>#define MIN_WORKER_WAIT_TIMEOUT&nbsp; 20*1000<br>#define MAX_WORKER_WAIT_TIMEOUT&nbsp; 60*MIN_WORKER_WAIT_TIMEOUT</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#define MAX_BUF_SIZE 1024<br><br>/*当Accepted socket和socket关闭或发生异常时回调CSocketLifecycleCallback*/<br>typedef void (*CSocketLifecycleCallback)(SOCKET cs,int 
lifecycle);//lifecycle:0:OnAccepted,-1:OnClose//注意OnClose此时的socket未必可用,可能已经被非正常关闭或其他异常.<br><br>/*协议处理回调*/<br>typedef int (*InternalProtocolHandler)(LPWSAOVERLAPPED overlapped);//return -1:SOCKET_ERROR</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct Q_SOCKET_SERVER SocketServer;<br>DWORD initializeSocketServer(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long workerWaitTimeout);<br>DWORD startSocketServer(SocketServer *ss);<br>DWORD shutdownSocketServer(SocketServer *ss);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">#endif<br>&nbsp;<strong>qsocketserver.c&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 简称 qss,相应的OVERLAPPED简称qssOl.<br></strong>#include "socketserver.h"<br>#include "stdio.h"<br>typedef struct {&nbsp;&nbsp;<br>&nbsp; WORD&nbsp;<strong>passive</strong>;//<strong>daemon</strong><br>&nbsp; WORD port;<br>&nbsp; WORD minThreads;<br>&nbsp; WORD maxThreads;<br>&nbsp; volatile long&nbsp;<strong>lifecycleStatus</strong>;//0-created,1-starting, 2-running,3-stopping,4-exitKeyPosted,5-stopped&nbsp;<br>&nbsp; long&nbsp; workerWaitTimeout;//wait timeout&nbsp;&nbsp;<br>&nbsp; CRITICAL_SECTION QSS_LOCK;<br>&nbsp; volatile long&nbsp;<strong>workerCounter</strong>;<br>&nbsp; volatile long&nbsp;<strong>currentBusyWorkers</strong>;<br>&nbsp; volatile long&nbsp;<strong>CSocketsCounter</strong>;//<strong>Accepted-socket引用计数<br></strong>&nbsp; CSocketLifecycleCallback cslifecb;<br>&nbsp; InternalProtocolHandler protoHandler;<br>&nbsp; WORD wsaVersion;//=MAKEWORD(2,0);<br>&nbsp; WSADATA wsData;<br>&nbsp; SOCKET server_s;<br>&nbsp; SOCKADDR_IN serv_addr;<br>&nbsp; HANDLE iocpHandle;<br>}QSocketServer;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 
19px; ">typedef struct {<br>&nbsp; WSAOVERLAPPED overlapped;&nbsp;&nbsp;<br>&nbsp; SOCKET client_s;<br>&nbsp; SOCKADDR_IN client_addr;<br>&nbsp; WORD optCode;<br>&nbsp; char buf[MAX_BUF_SIZE];<br>&nbsp; WSABUF wsaBuf;<br>&nbsp; DWORD numberOfBytesTransferred;<br>&nbsp; DWORD flags;<br>}QSSOverlapped;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>acceptorRoutine</strong>(LPVOID);<br>DWORD&nbsp;&nbsp;<strong>completionWorkerRoutine</strong>(LPVOID);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static void adjustQSSWorkerLimits(QSocketServer *qss){<br>&nbsp;&nbsp;/*adjust size and timeout.*/<br>&nbsp;&nbsp;/*if(qss-&gt;maxThreads &lt;= 0) {<br>&nbsp;&nbsp;&nbsp;qss-&gt;maxThreads = MAX_THREADS;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } else if (qss-&gt;maxThreads &lt; MAX_THREADS_MIN) {&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;maxThreads = MAX_THREADS_MIN;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;minThreads &gt;&nbsp; qss-&gt;maxThreads) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads =&nbsp; qss-&gt;maxThreads;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;minThreads &lt;= 0) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(1 == qss-&gt;maxThreads) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads = 1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; } else {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;minThreads = qss-&gt;maxThreads/2;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
}<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;workerWaitTimeout&lt;MIN_WORKER_WAIT_TIMEOUT)&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;workerWaitTimeout=MIN_WORKER_WAIT_TIMEOUT;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(qss-&gt;workerWaitTimeout&gt;MAX_WORKER_WAIT_TIMEOUT)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;qss-&gt;workerWaitTimeout=MAX_WORKER_WAIT_TIMEOUT;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; */<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">typedef struct{<br>&nbsp;QSocketServer * qss;<br>&nbsp;HANDLE th;<br>}QSSWORKER_PARAM;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static WORD addQSSWorker(QSocketServer *qss,WORD addCounter){<br>&nbsp;WORD res=0;<br>&nbsp;if(qss-&gt;workerCounter&lt;qss-&gt;minThreads||(qss-&gt;currentBusyWorkers==qss-&gt;workerCounter&amp;&amp;qss-&gt;workerCounter&lt;qss-&gt;maxThreads)){<br>&nbsp;&nbsp;DWORD threadId;<br>&nbsp;&nbsp;QSSWORKER_PARAM * pParam=NULL;<br>&nbsp;&nbsp;int 
i=0;&nbsp;&nbsp;<br>&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(qss-&gt;workerCounter+addCounter&lt;=qss-&gt;maxThreads)<br>&nbsp;&nbsp;&nbsp;for(;i&lt;addCounter;i++)<br>&nbsp;&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;pParam=malloc(sizeof(QSSWORKER_PARAM));<br>&nbsp;&nbsp;&nbsp;&nbsp;if(pParam){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pParam-&gt;th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)completionWorkerRoutine,pParam,CREATE_SUSPENDED,&amp;threadId);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pParam-&gt;qss=qss;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ResumeThread(pParam-&gt;th);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;workerCounter++,res++;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;}&nbsp;&nbsp;<br>&nbsp;return res;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static void SOlogger(const char * msg,SOCKET s,int clearup){<br>&nbsp;perror(msg);<br>&nbsp;if(s&gt;0)<br>&nbsp;closesocket(s);<br>&nbsp;if(clearup)<br>&nbsp;WSACleanup();<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static int _InternalEchoProtocolHandler(LPWSAOVERLAPPED overlapped){<br>&nbsp;QSSOverlapped *qssOl=(QSSOverlapped *)overlapped;<br>&nbsp;<br>&nbsp;printf("numOfT:%d,WSARecvd:%s,\n",qssOl-&gt;numberOfBytesTransferred,qssOl-&gt;buf);<br>&nbsp;//Sleep(500);&nbsp;<br>&nbsp;return send(qssOl-&gt;client_s,qssOl-&gt;buf,qssOl-&gt;numberOfBytesTransferred,0);<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>initializeSocketServer</strong>(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long 
workerWaitTimeout){<br>&nbsp;QSocketServer * qss=malloc(sizeof(QSocketServer));<br>&nbsp;qss-&gt;passive=passive&gt;0?1:0;<br>&nbsp;qss-&gt;port=port;<br>&nbsp;qss-&gt;minThreads=minThreads;<br>&nbsp;qss-&gt;maxThreads=maxThreads;<br>&nbsp;qss-&gt;workerWaitTimeout=workerWaitTimeout;<br>&nbsp;qss-&gt;wsaVersion=MAKEWORD(2,0);&nbsp;<br>&nbsp;qss-&gt;lifecycleStatus=0;<br>&nbsp;InitializeCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;qss-&gt;workerCounter=0;<br>&nbsp;qss-&gt;currentBusyWorkers=0;<br>&nbsp;qss-&gt;CSocketsCounter=0;<br>&nbsp;qss-&gt;cslifecb=cslifecb,qss-&gt;protoHandler=protoHandler;<br>&nbsp;if(!qss-&gt;protoHandler)<br>&nbsp;&nbsp;qss-&gt;protoHandler=_InternalEchoProtocolHandler;&nbsp;<br>&nbsp;adjustQSSWorkerLimits(qss);<br>&nbsp;*ssp=(SocketServer *)qss;<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>startSocketServer</strong>(SocketServer *ss){&nbsp;<br>&nbsp;QSocketServer * qss=(QSocketServer *)ss;<br>&nbsp;if(qss==NULL||InterlockedCompareExchange(&amp;qss-&gt;lifecycleStatus,1,0))<br>&nbsp;&nbsp;return 0;&nbsp;<br>&nbsp;qss-&gt;serv_addr.sin_family=AF_INET;<br>&nbsp;qss-&gt;serv_addr.sin_port=htons(qss-&gt;port);<br>&nbsp;qss-&gt;serv_addr.sin_addr.s_addr=INADDR_ANY;//inet_addr("127.0.0.1");<br>&nbsp;if(WSAStartup(qss-&gt;wsaVersion,&amp;qss-&gt;wsData)){&nbsp;&nbsp;<br>&nbsp; /*<strong>这里还有个插曲就是这个WSAStartup被调用的时候,它居然会启动一条额外的线程,当然稍后这条线程会自动退出的</strong>.不知<strong>WSAClearup</strong>又会如何?......*/</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;SOlogger("WSAStartup failed.\n",0,0);<br>&nbsp;&nbsp;return 0;<br>&nbsp;}<br>&nbsp;qss-&gt;server_s=socket(AF_INET,SOCK_STREAM,IPPROTO_IP);<br>&nbsp;if(qss-&gt;server_s==INVALID_SOCKET){&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("socket failed.\n",0,1);<br>&nbsp;&nbsp;return 
0;<br>&nbsp;}<br>&nbsp;if(bind(qss-&gt;server_s,(LPSOCKADDR)&amp;qss-&gt;serv_addr,sizeof(SOCKADDR_IN))==SOCKET_ERROR){&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("bind failed.\n",qss-&gt;server_s,1);<br>&nbsp;&nbsp;return 0;<br>&nbsp;}<br>&nbsp;if(<strong>listen</strong>(qss-&gt;server_s,<strong>SOMAXCONN</strong>)==SOCKET_ERROR)/*这里来谈谈<strong>backlog</strong>,很多人不知道设成何值,我见到过1,5,50,100的,有人说设定的越大越耗资源,的确,这里设成SOMAXCONN不代表windows会真的使用SOMAXCONN,而是" If set to SOMAXCONN, the underlying service provider responsible for socket&nbsp;<em>s</em>&nbsp;will set the backlog to a maximum&nbsp;<strong>reasonable</strong>&nbsp;value. "，同时在现实环境中，不同操作系统支持TCP缓冲队列有所不同，所以还不如让操作系统来决定它的值。像Apache这种服务器：<br>#ifndef DEFAULT_LISTENBACKLOG<br>#define DEFAULT_LISTENBACKLOG 511<br>#endif<br>*/<br>&nbsp;&nbsp;&nbsp; {&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;SOlogger("listen failed.\n",qss-&gt;server_s,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return 0;<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;qss-&gt;iocpHandle=<strong>CreateIoCompletionPort</strong>(<strong>INVALID_HANDLE_VALUE</strong>,NULL,0,<strong>/*NumberOfConcurrentThreads--&gt;*/qss-&gt;maxThreads</strong>);<br>&nbsp;//initialize worker for completion routine.<br>&nbsp;addQSSWorker(qss,qss-&gt;minThreads);&nbsp;&nbsp;<br>&nbsp;qss-&gt;lifecycleStatus=2;<br>&nbsp;{<br>&nbsp;&nbsp;QSSWORKER_PARAM * pParam=malloc(sizeof(QSSWORKER_PARAM));<br>&nbsp;&nbsp;pParam-&gt;qss=qss;<br>&nbsp;&nbsp;pParam-&gt;th=NULL;<br>&nbsp;&nbsp;if(qss-&gt;<strong>passive</strong>){<br>&nbsp;&nbsp;&nbsp;DWORD threadId;<br>&nbsp;&nbsp;&nbsp;pParam-&gt;th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)acceptorRoutine,pParam,0,&amp;threadId);&nbsp;<br>&nbsp;&nbsp;}else<br>&nbsp;&nbsp;&nbsp;return acceptorRoutine(pParam);<br>&nbsp;}<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;<strong>shutdownSocketServer</strong>(SocketServer 
*ss){<br>&nbsp;QSocketServer * qss=(QSocketServer *)ss;<br>&nbsp;if(qss==NULL||InterlockedCompareExchange(&amp;qss-&gt;lifecycleStatus,3,2)!=2)<br>&nbsp;&nbsp;return 0;&nbsp;<br>&nbsp;closesocket(qss-&gt;server_s/*<strong>listen-socket</strong>*/);//<strong>..other accepted-sockets associated with the listen-socket will not be closed,except WSACleanup is called..</strong>&nbsp;<br>&nbsp;if(qss-&gt;CSocketsCounter==0)<br>&nbsp;&nbsp;qss-&gt;lifecycleStatus=4,PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);<br>&nbsp;WSACleanup();&nbsp;&nbsp;<br>&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>acceptorRoutine</strong>(LPVOID ss){<br>&nbsp;QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;<br>&nbsp;QSocketServer * qss=pParam-&gt;qss;<br>&nbsp;HANDLE curThread=pParam-&gt;th;<br>&nbsp;QSSOverlapped *qssOl=NULL;<br>&nbsp;SOCKADDR_IN client_addr;<br>&nbsp;int client_addr_leng=sizeof(SOCKADDR_IN);<br>&nbsp;SOCKET cs;&nbsp;<br>&nbsp;free(pParam);<br>&nbsp;while(1){&nbsp;&nbsp;<br>&nbsp;&nbsp;printf("accept starting.....\n");<br>&nbsp;&nbsp;<strong>cs/*Accepted-socket*/</strong>=<strong>accept</strong>(qss-&gt;server_s,(LPSOCKADDR)&amp;client_addr,&amp;client_addr_leng);<br>&nbsp;&nbsp;if(cs==INVALID_SOCKET)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp;printf("accept failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{//<strong>SO_KEEPALIVE,SIO_KEEPALIVE_VALS</strong>&nbsp;这里是利用系统的"<strong>心跳探测</strong>",keepalive probes.linux:setsockopt,SOL_TCP:TCP_KEEPIDLE,TCP_KEEPINTVL,TCP_KEEPCNT<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; struct tcp_keepalive alive,aliveOut;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; int 
so_keepalive_opt=1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DWORD outDW;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(!setsockopt(cs,SOL_SOCKET,SO_KEEPALIVE,(char *)&amp;so_keepalive_opt,sizeof(so_keepalive_opt))){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.onoff=TRUE;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.keepalivetime=QSS_SIO_KEEPALIVE_VALS_TIMEOUT;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; alive.keepaliveinterval=QSS_SIO_KEEPALIVE_VALS_INTERVAL;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(WSAIoctl(cs,SIO_KEEPALIVE_VALS,&amp;alive,sizeof(alive),&amp;aliveOut,sizeof(aliveOut),&amp;outDW,NULL,NULL)==SOCKET_ERROR){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("WSAIoctl SIO_KEEPALIVE_VALS failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; printf("setsockopt SO_KEEPALIVE failed:%d\n",GetLastError());&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;CreateIoCompletionPort((HANDLE)cs,qss-&gt;iocpHandle,cs,0);<br>&nbsp;&nbsp;if(qssOl==NULL){<br>&nbsp;&nbsp;&nbsp;qssOl=malloc(sizeof(QSSOverlapped));&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;qssOl-&gt;client_s=cs;<br>&nbsp;&nbsp;qssOl-&gt;wsaBuf.len=MAX_BUF_SIZE,qssOl-&gt;wsaBuf.buf=qssOl-&gt;buf,qssOl-&gt;numberOfBytesTransferred=0,qssOl-&gt;flags=0;//initialize WSABuf.<br>&nbsp;&nbsp;memset(&amp;qssOl-&gt;overlapped,0,sizeof(WSAOVERLAPPED));&nbsp;&nbsp;<br>&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;DWORD lastErr=GetLastError();<br>&nbsp;&nbsp;&nbsp;int ret=0;<br>&nbsp;&nbsp;&nbsp;SetLastError(0);<br>&nbsp;&nbsp;&nbsp;ret=WSARecv(cs,&amp;qssOl-&gt;wsaBuf,1,&amp;qssOl-&gt;numberOfBytesTransferred,&amp;qssOl-&gt;flags,&amp;qssOl-&gt;overlapped,NULL);<br>&nbsp;&nbsp;&nbsp;if(ret==0||(ret==SOCKET_ERROR&amp;&amp;GetLastError()==WSA_IO_PENDING)){<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedIncrement(&amp;qss-&gt;<strong>CSocketsCounter</strong>);//<strong>Accepted-socket计数递增.</strong><br>&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;cslifecb)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;cslifecb(cs,0);<br>&nbsp;&nbsp;&nbsp;&nbsp;qssOl=NULL;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(!GetLastError())<br>&nbsp;&nbsp;&nbsp;&nbsp;SetLastError(lastErr);<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;printf("accept flags:%d ,cs:%d.\n",GetLastError(),cs);<br>&nbsp;}//end while.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;if(qssOl)<br>&nbsp;&nbsp;free(qssOl);<br>&nbsp;if(qss)<br>&nbsp;&nbsp;shutdownSocketServer((SocketServer *)qss);<br>&nbsp;if(curThread)<br>&nbsp;&nbsp;CloseHandle(curThread);</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;return 1;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">static int postRecvCompletionPacket(QSSOverlapped * qssOl,int SOErrOccurredCode){&nbsp;<br>&nbsp;int SOErrOccurred=0;&nbsp;<br>&nbsp;DWORD lastErr=GetLastError();<br>&nbsp;SetLastError(0);<br>&nbsp;//SOCKET_ERROR:-1,WSA_IO_PENDING:997<br>&nbsp;if(WSARecv(qssOl-&gt;client_s,&amp;qssOl-&gt;wsaBuf,1,&amp;qssOl-&gt;numberOfBytesTransferred,&amp;qssOl-&gt;flags,&amp;qssOl-&gt;overlapped,NULL)==SOCKET_ERROR<br>&nbsp;&nbsp;&amp;&amp;GetLastError()!=WSA_IO_PENDING)//this case lastError maybe 64, 10054&nbsp;<br>&nbsp;{<br>&nbsp;&nbsp;SOErrOccurred=SOErrOccurredCode;&nbsp;&nbsp;<br>&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;if(!GetLastError())<br>&nbsp;&nbsp;SetLastError(lastErr);&nbsp;<br>&nbsp;if(SOErrOccurred)<br>&nbsp;&nbsp;printf("worker[%d] postRecvCompletionPacket SOErrOccurred=%d,preErr:%d,postedErr:%d\n",GetCurrentThreadId(),SOErrOccurred,lastErr,GetLastError());<br>&nbsp;return SOErrOccurred;<br>}</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">DWORD&nbsp;&nbsp;<strong>completionWorkerRoutine</strong>(LPVOID ss){<br>&nbsp;QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;<br>&nbsp;QSocketServer * qss=pParam-&gt;qss;<br>&nbsp;HANDLE curThread=pParam-&gt;th;<br>&nbsp;QSSOverlapped * qssOl=NULL;<br>&nbsp;DWORD numberOfBytesTransferred=0;<br>&nbsp;ULONG_PTR completionKey=0;<br>&nbsp;int postRes=0,handleCode=0,exitCode=0,SOErrOccurred=0;&nbsp;<br>&nbsp;free(pParam);<br>&nbsp;while(!exitCode){<br>&nbsp;&nbsp;SetLastError(0);<br>&nbsp;&nbsp;if(GetQueuedCompletionStatus(qss-&gt;iocpHandle,&amp;numberOfBytesTransferred,&amp;completionKey,(LPOVERLAPPED 
*)&amp;qssOl,qss-&gt;workerWaitTimeout)){<br>&nbsp;&nbsp;&nbsp;if(<strong>completionKey==-1</strong>&amp;&amp;qss-&gt;lifecycleStatus&gt;=4)<br>&nbsp;&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] completionKey -1:%d \n",GetCurrentThreadId(),GetLastError());<br>&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;workerCounter&gt;1)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);<br>&nbsp;&nbsp;&nbsp;&nbsp;exitCode=1;<br>&nbsp;&nbsp;&nbsp;&nbsp;<strong>break;<br></strong>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;if(numberOfBytesTransferred&gt;0){&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedIncrement(&amp;qss-&gt;currentBusyWorkers);<br>&nbsp;&nbsp;&nbsp;&nbsp;addQSSWorker(qss,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;handleCode=qss-&gt;protoHandler((LPWSAOVERLAPPED)qssOl);&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;InterlockedDecrement(&amp;qss-&gt;currentBusyWorkers);&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;if(handleCode&gt;=0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=postRecvCompletionPacket(qssOl,1);<br>&nbsp;&nbsp;&nbsp;&nbsp;}else<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=2;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}else{<br>&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] numberOfBytesTransferred==0 ***** closesocket servS or cs *****,%d,%d ,ol is:%d\n",GetCurrentThreadId(),GetLastError(),completionKey,qssOl==NULL?0:1);<br>&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=3;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;<br>&nbsp;&nbsp;}else{ //GetQueuedCompletionStatus rtn FALSE, lastError 64 ,<strong>995</strong>[<strong>timeout worker thread exit</strong>.] 
,WAIT_TIMEOUT:258&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(qssOl){<br>&nbsp;&nbsp;&nbsp;&nbsp;SOErrOccurred=postRecvCompletionPacket(qssOl,4);<br>&nbsp;&nbsp;&nbsp;}else {&nbsp;&nbsp;&nbsp;&nbsp;</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;&nbsp;&nbsp;&nbsp;printf("worker[%d] GetQueuedCompletionStatus F:%d \n",GetCurrentThreadId(),GetLastError());<br>&nbsp;&nbsp;&nbsp;&nbsp;if(GetLastError()!=WAIT_TIMEOUT){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exitCode=2;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;}else{//wait timeout&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;lifecycleStatus!=4&amp;&amp;qss-&gt;currentBusyWorkers==0&amp;&amp;qss-&gt;workerCounter&gt;qss-&gt;minThreads){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(qss-&gt;lifecycleStatus!=4&amp;&amp;qss-&gt;currentBusyWorkers==0&amp;&amp;qss-&gt;workerCounter&gt;qss-&gt;minThreads){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;workerCounter--;//until qss-&gt;workerCounter decrease to qss-&gt;minThreads<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;exitCode=3;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;}//end GetQueuedCompletionStatus.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; 
">&nbsp;&nbsp;if(<strong>SOErrOccurred</strong>){&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;if(qss-&gt;cslifecb)<br>&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;cslifecb(qssOl-&gt;client_s,-1);<br>&nbsp;&nbsp;&nbsp;/*if(qssOl)*/{<br>&nbsp;&nbsp;&nbsp;&nbsp;closesocket(qssOl-&gt;client_s);<br>&nbsp;&nbsp;&nbsp;&nbsp;free(qssOl);<br>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;if(InterlockedDecrement(&amp;qss-&gt;<strong>CSocketsCounter</strong>)==0&amp;&amp;qss-&gt;lifecycleStatus&gt;=3){&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;//for qss workerSize,PostQueuedCompletionStatus -1<br>&nbsp;&nbsp;&nbsp;&nbsp;qss-&gt;lifecycleStatus=4,PostQueuedCompletionStatus(qss-&gt;iocpHandle,0,-1,NULL);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;exitCode=4;<br>&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<strong>qssOl=NULL,numberOfBytesTransferred=0,completionKey=0,SOErrOccurred=0;//for net while.<br></strong>&nbsp;}//end while.</p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; ">&nbsp;//last to do&nbsp;<br>&nbsp;if(exitCode!=3){&nbsp;<br>&nbsp;&nbsp;int clearup=0;<br>&nbsp;&nbsp;EnterCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(!--qss-&gt;workerCounter&amp;&amp;qss-&gt;lifecycleStatus&gt;=4){//clearup QSS<br>&nbsp;&nbsp;&nbsp;&nbsp;clearup=1;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;LeaveCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;if(clearup){<br>&nbsp;&nbsp;&nbsp;DeleteCriticalSection(&amp;qss-&gt;QSS_LOCK);<br>&nbsp;&nbsp;&nbsp;CloseHandle(qss-&gt;iocpHandle);<br>&nbsp;&nbsp;&nbsp;free(qss);&nbsp;<br>&nbsp;&nbsp;}<br>&nbsp;}<br>&nbsp;CloseHandle(curThread);<br>&nbsp;return 1;<br>}<br>------------------------------------------------------------------------------------------------------------------------<br>&nbsp; &nbsp; 对于IOCP的LastError的辨别和处理是个难点,所以请注意我的<strong>completionWorkerRoutine的while结构</strong>,<br>结构如下:<br>while(!exitCode){<br>&nbsp;&nbsp;&nbsp; 
if(<strong>completionKey==-1</strong>){...<strong>break</strong>;}<br>&nbsp;&nbsp;&nbsp; if(<strong>GetQueuedCompletionStatus</strong>){/*Inside this if-body, as long as the OVERLAPPED you posted was not NULL, what you get back here is <strong>exactly that OVERLAPPED</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(numberOfBytesTransferred&gt;0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*Handle the request here, and <strong>remember to keep posting your OVERLAPPED!</strong>&nbsp;*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*Here the client or the server may have called closesocket(the socket), <strong>but the OVERLAPPED is not NULL, as long as what you posted was not NULL!</strong>*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }else{/*In this if-body, although GetQueuedCompletionStatus returned&nbsp;<strong>FALSE</strong>, that does not mean the OVERLAPPED is necessarily NULL. <strong>In particular, when the OVERLAPPED is not NULL, do not assume that a LastError means the current socket is useless or has hit a fatal error; with lastError 995, for example, the socket may still be perfectly normal and usable, and you should not close it</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if(OVERLAPPED is not NULL){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; /*<strong>In this case, repost it no matter what, and only check for errors after the repost</strong>.*/<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }else{&nbsp;<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp; if(<strong>socket error occurred</strong>){<br><br>&nbsp; }<br>&nbsp; prepare for next while.<br>}&nbsp;<br></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong>&nbsp;&nbsp;&nbsp; This was written in haste, so errors and omissions are inevitable; corrections and comments are very welcome. Thank you!<br></strong></p><span  style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong></strong></span><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong>&nbsp;&nbsp;&nbsp;
This model still has room for performance improvements!</strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong><br></strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong>from:</strong></strong></p><p style="color: rgb(75, 75, 75); font-family: verdana, Arial, helvetica, sans-seriff; font-size: 12px; line-height: 19px; "><strong><strong><a href="http://www.cppblog.com/adapterofcoms/archive/2010/06/26/118781.aspx">http://www.cppblog.com/adapterofcoms/archive/2010/06/26/118781.aspx</a></strong></strong></p>
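The decision structure of the completionWorkerRoutine above can be distilled into one small portable function. The sketch below models the control flow only; the enum names and simplified parameters are mine, not part of the original WinSock code, which calls GetQueuedCompletionStatus and WSARecv directly:

```c
/* A distilled, portable model of the completionWorkerRoutine decision
   structure. All names and codes here are illustrative stand-ins. */
enum gqcs_action {
    ACT_EXIT_SHUTDOWN,     /* completionKey == -1: server shutting down     */
    ACT_HANDLE_AND_REPOST, /* bytes > 0: run protocol handler, repost recv  */
    ACT_CLOSE_SOCKET,      /* bytes == 0: peer or server closed the socket  */
    ACT_REPOST_THEN_CHECK, /* GQCS failed but OVERLAPPED != NULL: repost    */
    ACT_TIMEOUT_OR_FATAL   /* GQCS failed and OVERLAPPED == NULL            */
};

enum gqcs_action classify_completion(int gqcs_ok, long completion_key,
                                     unsigned long bytes,
                                     int overlapped_nonnull)
{
    if (gqcs_ok) {
        if (completion_key == -1)
            return ACT_EXIT_SHUTDOWN;
        return bytes > 0 ? ACT_HANDLE_AND_REPOST : ACT_CLOSE_SOCKET;
    }
    /* FALSE from GQCS does not mean the socket is dead (e.g. error 995):
       if we got our OVERLAPPED back, repost first, check errors after. */
    return overlapped_nonnull ? ACT_REPOST_THEN_CHECK : ACT_TIMEOUT_OR_FATAL;
}
```

Every branch of the worker loop maps onto exactly one of these actions, which makes the error handling testable in isolation from the I/O calls.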
<img src ="http://www.cppblog.com/beautykingdom/aggbug/124731.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-08-25 20:42 <a href="http://www.cppblog.com/beautykingdom/archive/2010/08/25/124731.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>一个基于Event Poll(epoll)的TCP Server Framework,浅析epoll</title><link>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 25 Aug 2010 12:41:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/124730.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/124730.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/124730.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: epoll,event poll,on linux kernel 2.6.x.pthread,nptl-2.12&nbsp;&nbsp;&nbsp;LT/ET:ET也会多次发送event,当然频率远低于LT,但是epoll one shot才是真正的对"one connection&nbsp;VS one thread in worker thread pool,不依赖于任何connection-...&nbsp;&nbsp;<a href='http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html'>阅读全文</a><img src ="http://www.cppblog.com/beautykingdom/aggbug/124730.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-08-25 20:41 <a href="http://www.cppblog.com/beautykingdom/archive/2010/08/25/124730.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>TCP: SYN ACK FIN RST PSH URG 
详解</title><link>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 16 Jul 2010 06:14:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/120546.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/120546.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/120546.html</trackback:ping><description><![CDATA[
<p class="cc-lisence" style="line-height: 180%;">
<a href="http://creativecommons.org/licenses/by/3.0/deed.zh" target="_blank">Copyright notice</a>: when reposting, please credit the original source and author with a hyperlink, together with <a href="http://bangzhuzhongxin.blogbus.com/logs/11205960.html" target="_blank">this notice</a><br><a href="http://xufish.blogbus.com/logs/40536553.html">http://xufish.blogbus.com/logs/40536553.html</a><br><br>
</p>
<span style="font-family: Arial; font-size: 12px; line-height: normal; color: #333333;"><span style="color: #000000; font-family: Georgia; font-size: 14px; line-height: 20px;">
<p style="line-height: normal;"><strong style="line-height: normal;">How the TCP three-way handshake works</strong>: the initiator sends the receiver a packet with SYN=1, ACK=0 to request a connection; that is the first handshake. If the receiver accepts, it replies with a packet with SYN=1, ACK=1, telling the initiator that communication may begin and asking it to send a confirmation packet; that is the second handshake. Finally, the initiator sends a packet with SYN=0, ACK=1, confirming that the connection is established; that is the third handshake. A TCP connection now exists and communication begins.</p>
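Applications never send these handshake packets themselves: the kernel performs the three-way handshake inside connect() and accept(). The following minimal POSIX C sketch (the function name is mine; WinSock differs mainly in startup and header names) triggers a complete handshake over loopback:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if a loopback TCP connection, and therefore a full
   three-way handshake performed by the kernel, succeeded. */
int demo_three_way_handshake(void)
{
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int cli = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    if (srv < 0 || cli < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the kernel pick a port */
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    if (getsockname(srv, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    if (listen(srv, 1) < 0)
        return -1;
    /* connect() sends the SYN; the kernel completes SYN/ACK and ACK. */
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    int acc = accept(srv, NULL, NULL);       /* handshake already finished */
    if (acc < 0)
        return -1;
    close(acc); close(cli); close(srv);
    return 0;
}
```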
<p style="line-height: normal;">*SYN: synchronize flag<br style="line-height: normal;">Indicates that the Synchronize Sequence Numbers field is valid. This flag is set only while the three-way handshake is establishing the TCP connection. It tells the server side of the connection to check the sequence number, which is the initial sequence number of the initiating end (usually the client). A TCP sequence number can be viewed as a 32-bit counter ranging from 0 to 4,294,967,295. Every byte of data exchanged over a TCP connection is sequence-numbered, and the sequence-number field in the TCP header holds the sequence number of the first byte in the segment.</p>
<p style="line-height: normal;">*ACK: acknowledgment flag<br style="line-height: normal;">Indicates that the Acknowledgement Number field is valid. This bit is set in most packets. The acknowledgment number in the TCP header (w+1 in Figure-1) is the next expected sequence number, and it tells the remote system that all data up to that point has been received successfully.</p>
<p style="line-height: normal;">*RST: reset flag<br style="line-height: normal;">Indicates that the reset flag is valid; it is used to reset the corresponding TCP connection.</p>
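One common way an application provokes an RST deliberately is an abortive close: enabling SO_LINGER with a zero timeout makes close() reset the connection instead of performing the normal FIN exchange. A small sketch under POSIX (the helper name is mine):

```c
#include <string.h>
#include <sys/socket.h>

/* Configure a socket so that close() aborts the connection with RST
   instead of the normal FIN sequence: SO_LINGER with l_linger == 0. */
int make_close_send_rst(int fd)
{
    struct linger lg;
    memset(&lg, 0, sizeof(lg));
    lg.l_onoff = 1;   /* linger enabled...                             */
    lg.l_linger = 0;  /* ...with zero timeout => abortive close (RST)  */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```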
<p style="line-height: normal;">*URG: urgent flag<br style="line-height: normal;">Indicates that the urgent pointer field is valid. When the URG flag is set, the urgent pointer marks urgent data within the segment.</p>
<p style="line-height: normal;">*PSH: push flag<br style="line-height: normal;">When this flag is set, the receiving end does not queue the data but hands it to the application as quickly as possible. It is typically set on every segment of interactive connections such as telnet or rlogin.</p>
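The sockets API offers no direct way to set PSH on outgoing segments; for interactive traffic the closest practical knob is disabling Nagle's algorithm with TCP_NODELAY, so that small writes are transmitted immediately rather than coalesced. A short sketch (the helper name is mine):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Applications cannot set the PSH bit directly through the sockets API,
   but disabling Nagle's algorithm gives similar "send it now" behaviour
   for interactive traffic such as telnet or rlogin. */
int enable_immediate_send(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```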
<p style="line-height: normal;">*FIN: finish flag<br style="line-height: normal;">A packet with this flag set is used to end a TCP session, while the corresponding port remains open, ready to receive subsequent data.</p>
<p style="line-height: normal;">=============================================================<br></p>
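Because the flags above occupy individual bits of the TCP header, the illegal combinations discussed later (SYN/FIN, FIN without ACK, the "NULL" packet) can be detected with simple bit tests. A small C sketch using the RFC 793 bit layout (the helper names are mine):

```c
#include <stdint.h>

/* TCP flag bits, low byte of the flags field (RFC 793 layout). */
#define TH_FIN 0x01
#define TH_SYN 0x02
#define TH_RST 0x04
#define TH_PSH 0x08
#define TH_ACK 0x10
#define TH_URG 0x20

/* A "NULL" packet carries no flags at all. */
static int is_null_scan(uint8_t flags) { return flags == 0; }

/* SYN must never be combined with FIN or RST. */
static int is_illegal_syn(uint8_t flags)
{
    return (flags & TH_SYN) && (flags & (TH_FIN | TH_RST));
}

/* A normal FIN always carries ACK; a bare FIN is suspicious. */
static int is_bare_fin(uint8_t flags)
{
    return (flags & TH_FIN) && !(flags & TH_ACK);
}

/* Returns 1 if the combination should be treated as hostile. */
int tcp_flags_suspicious(uint8_t flags)
{
    return is_null_scan(flags) || is_illegal_syn(flags) || is_bare_fin(flags);
}
```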
</span>Three-way Handshake<br style="line-height: normal;"><br style="line-height: normal;">A virtual connection is established through a three-way handshake.<br style="line-height: normal;"><br style="line-height: normal;">1. (B) --&gt; [SYN] --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">Suppose server A and client B want to communicate. When B wants to talk to A, B first sends A a SYN (Synchronize) packet, asking to establish a connection.<br style="line-height: normal;"><br style="line-height: normal;">Note: a SYN packet is a TCP packet with only the SYN flag set to 1 (see the TCP header in Resources). It is important to realize that a connection can only be established after A receives B's SYN packet; there is no other way. Therefore, if your firewall drops all SYN packets arriving at the external interface, no external host can actively open a connection to you.<br style="line-height: normal;"><br style="line-height: normal;">2. (B) &lt;-- [SYN/ACK] &lt;--(A)<br style="line-height: normal;"><br style="line-height: normal;">Next, having received the SYN, A sends back an acknowledgment of it (SYN/ACK), confirming the first SYN packet and continuing the handshake.<br style="line-height: normal;"><br style="line-height: normal;">Note: a SYN/ACK packet is a packet with only the SYN and ACK flags set to 1.<br style="line-height: normal;"><br style="line-height: normal;">3. (B) --&gt; [ACK] --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">On receiving the SYN/ACK, B sends an acknowledgment packet (ACK), telling A that the connection is established. With that, the three-way handshake is complete and a TCP connection exists.<br style="line-height: normal;"><br style="line-height: normal;">Note: an ACK packet is a TCP packet with only the ACK flag set to 1. Be aware that once the handshake completes and the connection is up, every packet of the TCP connection has the ACK bit set.<br style="line-height: normal;"><br style="line-height: normal;">This is why connection tracking is so important. Without it, a firewall cannot tell whether an incoming ACK packet belongs to an established connection. An ordinary packet filter (ipchains) simply lets ACK packets through (which is definitely not a good idea), whereas a stateful firewall first looks the packet up in its connection table and drops it if it matches no established connection.<br style="line-height: normal;"><br style="line-height: normal;">Four-way Handshake<br style="line-height: normal;"><br style="line-height: normal;">The four-way handshake is used to close an established TCP connection.<br style="line-height: normal;"><br style="line-height: normal;">1. (B) --&gt; ACK/FIN --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">2. (B) &lt;-- ACK &lt;-- (A)<br style="line-height: normal;"><br style="line-height: normal;">3. (B) &lt;-- ACK/FIN &lt;-- (A)<br style="line-height: normal;"><br style="line-height: normal;">4. (B) --&gt; ACK --&gt; (A)<br style="line-height: normal;"><br style="line-height: normal;">Note: because a TCP connection is bidirectional, closing it must be done in both directions. An ACK/FIN packet (ACK and FIN flags set to 1) is usually regarded as a FIN (finish) packet; however, since the connection is not yet closed, the FIN packet always carries the ACK flag. A packet with only the FIN flag and no ACK flag is not a legitimate packet and is usually considered malicious.<br style="line-height: normal;"><br style="line-height: normal;">Resetting a connection<br style="line-height: normal;"><br style="line-height: normal;">The four-way handshake is not the only way to close a TCP connection. Sometimes, when a host needs to close a connection as quickly as possible (or the connection times out, or a port or host is unreachable), an RST (Reset) packet is sent. Note that because the RST packet is not a required part of a TCP conversation, it can be sent by itself, without the ACK flag, although within a normal TCP connection an RST packet may carry the ACK acknowledgment flag. Note also that an RST packet needs no acknowledgment from the receiver.<br style="line-height: normal;"><br style="line-height: normal;">Invalid TCP Flags<br style="line-height: normal;"><br style="line-height: normal;">So far you have seen the SYN, ACK, FIN, and RST flags; in addition there are the PSH (Push) and URG (Urgent) flags.<br style="line-height: normal;"><br style="line-height: normal;">The most common illegal combination is SYN/FIN. Note: since the SYN packet is used to initiate a connection, it must never appear together with the FIN or RST flags; such a packet is always a malicious attack. Because most firewalls now know about SYN/FIN packets, other combinations show up instead, such as SYN/FIN/PSH, SYN/FIN/RST, and SYN/FIN/RST/PSH. Clearly, when packets like these appear on your network, it is almost certainly under attack.<br style="line-height: normal;"><br style="line-height: normal;">Other known illegal packets are FIN without the ACK flag, and the "NULL" packet. As discussed earlier, because the ACK/FIN packet exists to close a TCP connection, a normal FIN packet always carries the ACK flag. A "NULL" packet is one with no TCP flags set at all (URG, ACK, PSH, RST, SYN, and FIN are all 0).<br style="line-height: normal;"><br style="line-height: normal;">Under normal network activity, no TCP stack produces a TCP packet with any of the flag combinations mentioned above. When you see such abnormal packets, someone certainly has bad intentions toward your network.<br style="line-height: normal;"><br style="line-height: normal;">UDP (User Datagram Protocol)<br style="line-height: normal;">TCP is connection-oriented, whereas UDP is a connectionless protocol. UDP has no flag or mechanism for acknowledging receipt; dealing with packet loss (or accidental arrival) is left to the application layer.<br style="line-height: normal;"><br style="line-height: normal;">The key thing to note here: normally, when a UDP packet arrives at a closed port, an ICMP port-unreachable message is returned. Because UDP is connectionless, there is no acknowledgment that a packet correctly reached its destination, so if your firewall drops these ICMP replies, all UDP ports will appear to a scanner to be open.<br style="line-height: normal;"><br style="line-height: normal;">Because, under normal Internet conditions, some packets are simply dropped, and even some UDP packets sent to closed (non-firewalled) ports never reach their destination or never have their ICMP reply returned, UDP port scanning is always imprecise and unreliable.<br style="line-height: normal;"><br style="line-height: normal;">Fragments of large UDP packets appear to be a common form of DoS (Denial of Service) attack (here is an example of a DoS attack: <a href="http://grc.com/dos/grcdos.htm" style="line-height: normal;" target="_blank">http://grc.com/dos/grcdos.htm</a>&nbsp;).<br style="line-height: normal;"><br style="line-height: normal;">ICMP (Internet Control Message Protocol)<br style="line-height: normal;">As the name says, ICMP is the protocol used to pass control messages between hosts and routers. ICMP packets may carry diagnostic information (ping, traceroute; note that the traceroute on current Unix systems uses UDP packets rather than ICMP), error messages (network/host/port unreachable), information (timestamp, address mask request, etc.), or control messages (source quench, redirect, etc.).<br style="line-height: normal;"><br style="line-height: normal;">You can find the ICMP packet types at <a href="http://www.iana.org/assignments/icmp-parameters" style="line-height: normal;" target="_blank">http://www.iana.org/assignments/icmp-parameters</a>.<br style="line-height: normal;"><br style="line-height: normal;">Although ICMP is usually harmless, some types of ICMP messages should be dropped.<br style="line-height: normal;"><br style="line-height: normal;">Redirect (5), Alternate Host Address (6), and Router Advertisement (9) can be used to redirect traffic.<br style="line-height: normal;"><br style="line-height: normal;">Echo (8), Timestamp (13), and Address Mask Request (17) can be used to determine, respectively, whether a host is up, its local time, and its address mask. Note that this concerns the category of information they return: they cannot be exploited by themselves, but the information they leak is useful to an attacker.<br style="line-height: normal;"><br style="line-height: normal;">ICMP messages are also sometimes used as part of DoS attacks (for example flood ping and the ping of death).<br style="line-height: normal;"><br style="line-height: normal;">A Note About Packet Fragmentation<br style="line-height: normal;"><br style="line-height: normal;">If a packet is larger than the TCP Maximum Segment Size (MSS) or the Maximum Transmission Unit (MTU), the only way to get it to its destination is to fragment it. Since fragmentation is normal, it can be exploited for malicious attacks.<br style="line-height: normal;"><br style="line-height: normal;">Because only the first fragment of a fragmented packet contains the full header, a packet filter without fragment reassembly cannot inspect the subsequent fragments. Typical attacks involve overlapping the packet data, in which the packet header is normal until it is overwritten with a different destination IP (or port), thereby bypassing firewall rules. Fragmentation can also be used as part of a DoS attack, crashing older IP stacks or exhausting CPU connection capacity.<br style="line-height: normal;"><br style="line-height: normal;">The connection-tracking code in Netfilter/iptables reassembles fragments automatically. It still has a weakness: it can be saturated by such attacks until CPU resources are exhausted.<br style="line-height: normal;"><br style="line-height: normal;">Handshake phase:<br style="line-height: normal;">No. direction seq ack<br style="line-height: normal;">1 A-&gt;B 10000 0<br style="line-height: normal;">2 B-&gt;A 20000 10000+1=10001<br style="line-height: normal;">3 A-&gt;B 10001 20000+1=20001<br style="line-height: normal;">Explanation:<br style="line-height: normal;">1: A initiates a connection request to B and initializes A's seq with a random number, assumed here to be 10000; at this point ACK=0.<br style="line-height: normal;"><br style="line-height: normal;">2: After B receives A's connection request, it likewise initializes B's seq with a random number, assumed here to be 20000, meaning: "I received your request; my data stream starts from this number." B's ACK is A's seq plus 1, i.e. 10000+1=10001.<br style="line-height: normal;"><br style="line-height: normal;">3: After A receives B's reply, its seq is the seq of its previous request plus 1, i.e. 10000+1=10001, likewise meaning: "I received your reply; my data stream starts from this number." A's ACK is now B's seq plus 1, i.e. 20000+1=20001.<br style="line-height: normal;"><br style="line-height: normal;"><br style="line-height: normal;">Data-transfer phase:<br style="line-height: normal;">No. direction seq ack size<br style="line-height: normal;">23 A-&gt;B 40000 70000 1514<br style="line-height: normal;">24 B-&gt;A 70000 40000+1514-54=41460 54<br style="line-height: normal;">25 A-&gt;B 41460 70000+54-54=70000 1514<br style="line-height: normal;">26 B-&gt;A 70000 41460+1514-54=42920 54<br style="line-height: normal;">Explanation:<br style="line-height: normal;">23: B receives a packet from A with seq=40000, ack=70000, size=1514.<br style="line-height: normal;">24: B then sends a packet back to A, telling A that its last packet arrived. B fills its seq with the ACK of the packet it received, and its ACK with the SEQ of the received packet plus the size of that packet's data (excluding the Ethernet header, IP header, and TCP header), confirming that all the data A sent was received.<br style="line-height: normal;">25: When A receives B's packet with ack=41460, it sees that 41460 is exactly its previous packet's seq plus that packet's payload size, so it knows the previous packet arrived safely. It then sends the next packet to B. This outgoing packet's seq is again filled with the ACK of the packet just received, and its ACK with the received packet's seq (70000) plus that packet's size (54), i.e. ack=70000+54-54 (all headers, no payload).<br style="line-height: normal;"><br style="line-height: normal;">In fact, during the handshake and teardown the acknowledgment number should be the peer's sequence number plus 1, while during data transfer it is the peer's sequence number plus the length of the application-layer data the peer carried; computing that length back from the Ethernet frame is the roundabout way to do it.<br style="line-height: normal;">Also, if the peer sends no data, your acknowledgment number stays unchanged, and your sequence number is your previous sequence number plus the length of the application-layer data you sent this time</span><img src ="http://www.cppblog.com/beautykingdom/aggbug/120546.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-07-16 14:14 <a href="http://www.cppblog.com/beautykingdom/archive/2010/07/16/120546.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The Drawbacks of NAT</title><link>http://www.cppblog.com/beautykingdom/archive/2010/07/13/120225.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Tue, 13 Jul 2010 07:28:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/07/13/120225.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/120225.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/07/13/120225.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/120225.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/120225.html</trackback:ping><description><![CDATA[The advantages of NAT need little introduction: it provides a family of techniques that let multiple internal users communicate with the outside world through a single public IP, effectively easing the shortage of IPv4 addresses. But is a user behind NAT, on a private IP, really in the same position as a user with a public IP? Has NAT solved every problem related to address translation?<br>Below we look at some things NAT does not support: the so-called "flaws" of NAT.<br><br>Some application-layer protocols (such as FTP and SIP) need to carry public IP addresses inside their application data. Take FTP: as is well known, FTP transfers control messages and data over two separate connections. When a file is transferred, the FTP server obtains, through the control connection, the network-layer and transport-layer address (IP/PORT) of the upcoming data connection. If the client host is behind NAT, the IP/PORT the server receives is the private, pre-translation IP address, which makes the file transfer fail.<br>SIP (Session Initiation Protocol), which mainly controls audio transmission, faces the same problem: when SIP sets up a session, it uses several ports to carry the audio streams over RTP, and these ports and IP addresses are encoded in the payload and sent to the other end to enable the subsequent communication.<br>Without special techniques (such as STUN), NAT does not support these protocols, and they are bound to fail across NAT.<br><span style="color: #000166;">Some
Application Layer protocols (such as FTP and SIP) send explicit network
addresses within their application data. FTP in active mode, for
example, uses separate connections for control traffic (commands) and
for data traffic (file contents). When requesting a file transfer, the
host making the request identifies the corresponding data connection by
its network layer and transport layer addresses. If the host making the
request lies behind a simple NAT firewall, the translation of the IP
address and/or TCP port number makes the information received by the
server invalid. The Session Initiation Protocol (SIP) controls Voice
over IP (VoIP) communications and suffers the same problem. SIP may use
multiple ports to set up a connection and transmit voice stream via RTP.
IP addresses and port numbers are encoded in the payload data and must
be known prior to the traversal of NATs. Without special techniques,
such as STUN, NAT behavior is unpredictable and communications may fail.</span><br><br>Below are some special techniques that let NAT support these protocols.<br><br>The most direct idea: since NAT rewrites the IP/port, rewrite the corresponding IP/port in the application-layer data as well. That is exactly what an Application Layer Gateway (ALG), in hardware or software, does. Running on the NAT firewall device, it updates the IP/port carried in the transmitted data. An ALG therefore has to be able to parse the application-layer protocol, and each protocol may need its own ALG.<br><span style="color: #000166;">Application Layer Gateway (ALG) software
or hardware may correct these problems. An ALG software module running
on a NAT firewall device updates any payload data made invalid by
address translation. ALGs obviously need to understand the higher-layer
protocol that they need to fix, and so each protocol with this problem
requires a separate ALG.</span><br><br>Another way to solve the problem is NAT traversal, using protocols such as STUN or ICE, or proprietary approaches in a session border controller. In principle NAT traversal can serve both TCP- and UDP-based applications, but the UDP-based technique is simpler, more widely understood, and compatible with more kinds of NAT. Either way, the application-layer protocol must be designed with NAT traversal in mind, and traversal still fails on some NAT types (symmetric NAT, for example).<br><span style="color: #000166;">Another
possible solution to this problem is to use NAT traversal techniques
using protocols such as STUN or ICE or proprietary approaches in a
session border controller. NAT traversal is possible in both TCP- and
UDP-based applications, but the UDP-based technique is simpler, more
widely understood, and more compatible with legacy NATs. In either case,
the high level protocol must be designed with NAT traversal in mind,
and it does not work reliably across symmetric NATs or other
poorly-behaved legacy NATs.</span><br><br><br>Other approaches include UPnP (Universal Plug and Play) and Bonjour (NAT-PMP), but both require the cooperation of the NAT device itself.<br><span style="color: #000166;">Other possibilities are UPnP (Universal
Plug and Play) or Bonjour (NAT-PMP), but these require the cooperation
of the NAT device.</span><br><br><br>Most traditional client-server protocols (FTP being the main exception) do not carry layer-3 addressing in their payload, so they work through traditional NAT as-is. Indeed, when designing an application-layer protocol today, avoiding layer-3 information in the payload is practically a requirement, since carrying it complicates NAT compatibility.<br><span style="color: #000166;">Most traditional client-server protocols
(FTP being the main exception), however, do not send layer 3 contact
information and therefore do not require any special treatment by NATs.
In fact, avoiding NAT complications is practically a requirement when
designing new higher-layer protocols today.</span><br style="color: #000166;"><br><br>NAT can also conflict with applications that encrypt with IPsec. Take SIP phones: if many SIP phone devices sit behind a NA(P)T and encrypt their signaling with IPsec, the port information is sealed inside the IPsec packet, so the NAPT cannot translate ports and can only translate the IP. All returning packets then get NATed to the same client, and communication fails (I don't fully understand this part). The problem has several solutions, though. One is TLS, which runs at layer 4 of the OSI model and therefore does not mask the port information; another is to encapsulate IPsec inside UDP, which is the method TISPAN uses to achieve secure NAT traversal.<br><span style="color: #000166;">NATs can also cause problems where IPsec
encryption is applied and in cases where multiple devices such as SIP
phones are located behind a NAT. Phones which encrypt their signaling
with IPsec encapsulate the port information within the IPsec packet
meaning that NA(P)T devices cannot access and translate the port. In
these cases the NA(P)T devices revert to simple NAT operation. This
means that all traffic returning to the NAT will be mapped onto one
client causing the service to fail. There are a couple of solutions to
this problem, one is to use TLS which operates at level 4 in the OSI
Reference Model and therefore does not mask the port number, or to
Encapsulate the IPsec within UDP - the latter being the solution chosen
by TISPAN to achieve secure NAT traversal.</span><br><br><br>Dan Kaminsky pointed out in 2008 that NAPT also indirectly affects the robustness of the DNS protocol. To avoid DNS cache poisoning, it is best for a NA(P)T firewall not to translate the source ports of outgoing DNS requests (UDP) from a DNS server behind it. The countermeasure against cache-poisoning attacks is to have every caching DNS server receive requests on randomized source ports; if the NA(P)T undoes that randomization of the DNS requests' source ports, the DNS server behind the firewall becomes vulnerable again.<br><span style="color: #000166;">The DNS protocol vulnerability announced by Dan
Kaminsky on 2008 July 8 is indirectly affected by NAT port mapping. To
avoid DNS server cache poisoning, it is highly desirable to not
translate UDP source port numbers of outgoing DNS requests from any DNS
server which is behind a firewall which implements NAT. The recommended
work-around for the DNS vulnerability is to make all caching DNS servers
use randomized UDP source ports. If the NAT function de-randomizes the
UDP source ports, the DNS server will be made vulnerable.</span><br><br>Hosts behind NAT cannot achieve true end-to-end communication and cannot use Internet protocols that conflict with NAT. TCP connections initiated from the outside network, and some stateless protocols (higher-layer protocols over UDP), also cannot proceed normally unless the NAT device supports them through the relevant techniques. Some protocols can, with an application-layer gateway or other techniques, still let two parties communicate when only one of them is behind NAT, but they fail when both are. NAT also conflicts with tunneling protocols such as IPsec, because rewriting the IP or port makes the protocol's integrity checks fail.<br><span style="color: #000166;">Hosts behind NAT-enabled
routers do not have end-to-end connectivity and cannot participate in
some Internet protocols. Services that require the initiation of TCP
connections from the outside network, or stateless protocols such as
those using UDP, can be disrupted. Unless the NAT router makes a
specific effort to support such protocols, incoming packets cannot reach
their destination. Some protocols can accommodate one instance of NAT
between participating hosts ("passive mode" FTP, for example), sometimes
with the assistance of an application-level gateway (see below), but
fail when both systems are separated from the Internet by NAT. Use of
NAT also complicates tunneling protocols such as IPsec because NAT
modifies values in the headers which interfere with the integrity checks
done by IPsec and other tunneling protocols.</span><br><br><br>End-to-end connectivity was an important core principle of the Internet's design. NAT violates that principle, though NAT can still play a valid role in a careful design. NAT for IPv6 is now receiving wide attention, but many IPv6 architects believe IPv6 should do away with NAT.<br><span style="color: #000166;">End-to-end connectivity has been a core principle of the
Internet, supported for example by the Internet Architecture Board.
Current Internet architectural documents observe that NAT is a violation
of the End-to-End Principle, but that NAT does have a valid role in
careful design. There is considerably more concern with the use of IPv6
NAT, and many IPv6 architects believe IPv6 was intended to remove the
need for NAT.</span><br><br><br>Because NAT's connection tracking is short-lived, a given address translation mapping expires after a short while unless the internal host honors NAT's keep-alive regime and contacts outside hosts from time to time. At a minimum this causes needless overhead, such as draining the battery of hand-held devices.<br><span style="color: #000166;">Because of the short-lived nature of the
stateful translation tables in NAT routers, devices on the internal
network lose IP connectivity typically within a very short period of
time unless they implement NAT keep-alive mechanisms by frequently
accessing outside hosts. This dramatically shortens the power reserves
on battery-operated hand-held devices and has thwarted more widespread
deployment of such IP-native Internet-enabled devices.</span><br style="color: #000166;"><br><br>Some ISPs hand their users private IP addresses directly, so those users must go through the ISP's NAT to reach the outside Internet. Such users never get true end-to-end communication; the ISP's NAT always sits in the middle, contrary to the core Internet principles laid out by the Internet Architecture Board.<br><span style="color: #000166;">Some Internet service providers (ISPs) provide their customers
only with "local" IP addresses. Thus, these customers
must access services external to the ISP's network through NAT. As a
result, the customers cannot achieve true end-to-end connectivity, in
violation of the core principles of the Internet as laid out by the
Internet Architecture Board.</span><br style="color: #000166;"><br>One last "defect" of NAT: its spread and use relieved the IPv4 address shortage and, in doing so, greatly delayed the development of IPv6.<br>(Is that an advantage or a defect?)<br><span style="color: #000166;">it is possible that its
[NAT] widespread use will significantly delay the need to deploy IPv6</span><br><br>Reference:<br><a  href="http://en.wikipedia.org/wiki/Network_address_translation#Types_of_NAT" target="_blank">Network address translation</a><br><br>from:<br>http://blog.chinaunix.net/u2/86590/showart.php?id=2208148<br><img src ="http://www.cppblog.com/beautykingdom/aggbug/120225.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-07-13 15:28 <a href="http://www.cppblog.com/beautykingdom/archive/2010/07/13/120225.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A Study of Non-blocking TCP in Linux Socket Programming</title><link>http://www.cppblog.com/beautykingdom/archive/2010/07/07/119615.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 07 Jul 2010 09:14:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/07/07/119615.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/119615.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/07/07/119615.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/119615.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/119615.html</trackback:ping><description><![CDATA[<div id="art" width="100%" style="margin: 15px;">
<p><span style="font-family: Georgia; line-height: normal;"><span style="line-height: normal;"><strong style="line-height: normal;">That the TCP protocol itself is reliable does not mean an application sending data over TCP is automatically reliable. Blocking or not, the size send() reports is not the amount the peer has recv()ed.</strong></span><br style="line-height: normal;"><br style="line-height: normal;"><span style="line-height: normal; font-size: x-small;"><span style="line-height: normal; color: #016600;">In <span style="line-height: normal; color: #ff0000;">blocking mode</span>, send() copies the data being sent into the send buffer, sends it, and returns once it is acknowledged. Because of the send buffer, the observed behavior is: if the send buffer is larger than the amount requested, send() returns immediately while the data goes out on the network; otherwise, send() transmits the part the buffer cannot hold and waits for the peer's acknowledgment before returning (the receiver acknowledges as soon as the data lands in its receive buffer; it does not wait for the application to call recv()).<br style="line-height: normal;"><br style="line-height: normal;"></span><span style="line-height: normal; color: #006666;">In <span style="line-height: normal; color: #ff0102;">non-blocking mode</span>, send() merely copies data into the protocol stack's buffer: if the free space is insufficient, it copies as much as it can and returns the number of bytes copied; if the free space is 0, it returns -1 and sets errno to EAGAIN.</span><br style="line-height: normal;"><br style="line-height: normal;"><br style="line-height: normal;"><span style="line-height: normal;">On Linux, <span style="line-height: normal; color: #cc3333;">sysctl -a | grep net.ipv4.tcp_wmem</span> shows the system's default send buffer sizes:</span><br style="line-height: normal;"><span style="line-height: normal;">net.ipv4.tcp_wmem = 4096 16384 81920</span><br style="line-height: normal;"><span style="line-height: normal;">The three values are: the minimum number of bytes allocated for a socket's send buffer; the default (overridden by net.core.wmem_default), which the buffer may grow to when the system is not under heavy load; and the maximum size of the send buffer (overridden by net.core.wmem_max).</span><br style="line-height: normal;"><span style="line-height: normal;">In practice, if you change net.ipv4.tcp_wmem by hand, the changed values are used; otherwise the stack normally allocates memory according to net.core.wmem_default and net.core.wmem_max.</span><br style="line-height: normal;"><br style="line-height: normal;"><span style="line-height: normal;">An application should adjust the send buffer size in code to match its workload:</span><br style="line-height: normal;"><br style="line-height: normal;"></span>
<table style="table-layout: auto; line-height: normal; border-collapse: collapse;" bgcolor="#f1f1f1" border="1" bordercolor="#999999" cellpadding="0" cellspacing="0" width="95%">
    <tbody style="line-height: normal;">
        <tr style="line-height: normal;">
            <td style="font-family: Arial; visibility: visible ! important; filter: none; font-size: 12px; line-height: normal;">
            <p style="margin: 5px; line-height: 18px;"><span style="line-height: normal; font-size: x-small;"><span style="line-height: normal;">socklen_t
            sendbuflen = 0;</span><br style="line-height: normal;"><span style="line-height: normal;">socklen_t len = sizeof(sendbuflen);</span><br style="line-height: normal;"><span style="line-height: normal;">getsockopt(clientSocket,
            SOL_SOCKET, SO_SNDBUF, (void*)&amp;sendbuflen, &amp;len);</span><br style="line-height: normal;"><span style="line-height: normal;">printf("default,sendbuf:%d\n",
            sendbuflen);</span><br style="line-height: normal;"><br style="line-height: normal;"><span style="line-height: normal;">sendbuflen
            = 10240;</span><br style="line-height: normal;"><span style="line-height: normal;">setsockopt(clientSocket, SOL_SOCKET,
            SO_SNDBUF, (void*)&amp;sendbuflen, len);</span><br style="line-height: normal;"><span style="line-height: normal;">getsockopt(clientSocket,
            SOL_SOCKET, SO_SNDBUF, (void*)&amp;sendbuflen, &amp;len);</span><br style="line-height: normal;"><span style="line-height: normal;">printf("now,sendbuf:%d\n",
            sendbuflen);</span></span></p>
            </td>
        </tr>
    </tbody>
</table>
<br style="line-height: normal;"><span style="line-height: normal; font-size: x-small;"><span style="line-height: normal;">Note that although the send buffer was set to 10k above, the protocol stack doubles the value and actually uses 20k.</span><br><br><span style="line-height: normal; font-size: large;"><br>------------------- Case analysis ----------------------</span><br><br><span style="line-height: normal;">In practice, when the sender is non-blocking, network congestion or a slow receiver commonly produces this situation: the sending application appears to have sent 10k, but only 2k has reached the peer's buffer; the other 8k is still in the local buffer (unsent, or sent but unacknowledged). The receiving application can therefore read 2k. Suppose it has fetched 1k with recv and is processing it, and at that instant one of the following happens; the two sides then behave as follows:</span><br><br><span style="line-height: normal; font-weight: bold;">A. The sending application, believing the 10k has been sent, closes the socket:</span><br><span style="line-height: normal;">The sending host, as the active closer of the TCP connection, enters the half-closed FIN_WAIT1 state (waiting for the peer's ack). The 8k in the send buffer is not discarded and will still be delivered to the peer. If the receiving application keeps calling recv, it receives the remaining 8k (provided it does so before the sender's FIN_WAIT1 state times out), then gets notice that the peer's socket has closed (recv returns 0). At that point it should close as well.<br></span><br><span style="line-height: normal; font-weight: bold;">B. The sending application calls send again with the remaining 8k:</span><br><span style="line-height: normal;">If the send buffer is 20k, the free space is 20-8=12k, more than the requested 8k, so send copies the data and immediately returns 8192;<br><br>if the send buffer is 12k, the free space is 12-8=4k, and send() returns 4096. Seeing a return value smaller than the requested size, the application can conclude the buffer is full and must block (or wait via select for the next writable event on the socket). If it ignores this and immediately calls send again, it gets -1, shown on Linux as errno=EAGAIN.</span><br><br><span style="line-height: normal; font-weight: bold;">C. The receiving application closes the socket after processing its 1k:</span><br><span style="line-height: normal;">The receiving host, as the active closer, enters the half-closed FIN_WAIT1 state (waiting for the peer's ack). The sending application then receives a readable event on the socket (usually select returning readable), but recv returns 0; it should now call close to shut the socket (sending the ack to the peer);<br><br>if the sending application ignores the readable event and keeps calling send, two cases must be distinguished. If send is called after the sender has received the RST flag, send returns -1 with errno set to ECONNRESET, indicating the peer connection is gone; </span><span style="line-height: normal; font-style: italic;">there is also the view that the process receives a SIGPIPE signal, whose default action is to terminate the process, and that if the signal is ignored, send returns -1 with errno EPIPE (unverified)</span><span style="line-height: normal;">; if send is called before the RST arrives, it works as usual;<br><br>the above describes non-blocking send. If send is a blocking call and happens to be blocked (for example, sending one huge buf that exceeds the send buffer) when the peer's socket closes, send returns the number of bytes successfully sent, and a further send behaves as above.<br></span><br><span style="line-height: normal; font-weight: bold;">D. The network is cut at a switch or router:</span><br><span style="line-height: normal;">The receiving application, after processing the 1k it has, continues to read the remaining 1k from the buffer, and then sees no more data to read. The application must handle this timeout itself; the usual approach is to set a maximum select wait, and if no data is readable within it, consider the socket unusable.<br><br></span><span style="line-height: normal;">The sending application keeps putting the remaining data on the wire but never gets an acknowledgment, so the buffer's free space stays at 0; the application must handle this case too.<br><br></span><span style="line-height: normal;">If you prefer not to handle such timeouts in the application, TCP itself can, via these sysctl settings:</span><br><span style="line-height: normal;">net.ipv4.tcp_keepalive_intvl</span><br><span style="line-height: normal;">net.ipv4.tcp_keepalive_probes</span><br><span style="line-height: normal;">net.ipv4.tcp_keepalive_time</span></span></span></p>
</div>
<p style="line-height: 150%; margin: 5px;">
</p>
&nbsp;<font color="#000099"><strong>Original article</strong></font>
<a  href="http://xufish.blogbus.com/logs/40537344.html" target="_blank">http://xufish.blogbus.com/logs/40537344.html</a><br><br>from:<br><a  href="http://blog.chinaunix.net/u2/67780/showart_2056353.html">http://blog.chinaunix.net/u2/67780/showart_2056353.html</a><br><img src ="http://www.cppblog.com/beautykingdom/aggbug/119615.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-07-07 17:14 <a href="http://www.cppblog.com/beautykingdom/archive/2010/07/07/119615.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Implementing the HTTP Protocol in C</title><link>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118839.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 27 Jun 2010 15:16:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118839.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/118839.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118839.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/118839.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/118839.html</trackback:ping><description><![CDATA[<p style="margin: 5px; line-height: 150%;"><code>Everyone is familiar with HTTP from daily browsing, and everyone knows it is a fairly simple protocol. Still, whenever a downloader like Thunder offers "download all links with Thunder", it feels a little magical.<br>
On reflection, implementing such download features is not hard: send a request according to the HTTP protocol, then parse the data you receive; if the page contains link markers such as href, you can follow them and download one level deeper. The most used version of HTTP today is 1.1; to understand it thoroughly, consult RFC 2616. I am not keen on reading RFCs myself; go look if you like ^_^<br>
The source code follows:<br>
/******* HTTP client program httpclient.c ************/<br>
#include &lt;stdio.h&gt;<br>
#include &lt;stdlib.h&gt;<br>
#include &lt;string.h&gt;<br>
#include &lt;sys/types.h&gt;<br>
#include &lt;sys/socket.h&gt;<br>
#include &lt;errno.h&gt;<br>
#include &lt;unistd.h&gt;<br>
#include &lt;netinet/in.h&gt;<br>
#include &lt;limits.h&gt;<br>
#include &lt;netdb.h&gt;<br>
#include &lt;arpa/inet.h&gt;<br>
#include &lt;ctype.h&gt;<br>
<br>
//////////////////////////////httpclient.c
开始///////////////////////////////////////////<br>
<br>
<br>
/********************************************<br>
Purpose: find the first matching character searching from the right end of the string<br>
********************************************/<br>
char * Rstrchr(char * s, char x)  {<br>
&nbsp;&nbsp;int i = strlen(s);<br>
&nbsp;&nbsp;if(!(*s))  return 0;<br>
&nbsp;&nbsp;while(i &gt; 0) if(strchr(s + (i - 1), x))  return (s + (i - 1));  else i--;  /* stop at i==0 to avoid reading s[-1] */<br>
&nbsp;&nbsp;return 0;<br>
}<br>
<br>
/********************************************<br>
Purpose: convert a string to all lowercase<br>
********************************************/<br>
void ToLowerCase(char * s)  {<br>
&nbsp;&nbsp;while(s &amp;&amp; *s) {*s=tolower(*s);s++;}<br>
}<br>
<br>
/**************************************************************<br>
Purpose: parse the host address and port out of src, and get the file the user wants to download<br>
***************************************************************/<br>
void GetHost(char * src, char * web, char * file, int * port)  {<br>
&nbsp;&nbsp;char * pA;<br>
&nbsp;&nbsp;char * pB;<br>
&nbsp;&nbsp;web[0] = 0;   /* web and file are pointers here, so memset(..., sizeof(web)) would clear only pointer-sized chunks */<br>
&nbsp;&nbsp;file[0] = 0;<br>
&nbsp;&nbsp;*port = 0;<br>
&nbsp;&nbsp;if(!(*src))  return;<br>
&nbsp;&nbsp;pA = src;<br>
&nbsp;&nbsp;if(!strncmp(pA, "http://", strlen("http://")))  pA = src+strlen("http://");<br>
&nbsp;&nbsp;else if(!strncmp(pA, "https://", strlen("https://")))  pA = src+strlen("https://");<br>
&nbsp;&nbsp;pB = strchr(pA, '/');<br>
&nbsp;&nbsp;if(pB)
{<br>
&nbsp;&nbsp;&nbsp;&nbsp;memcpy(web, pA, strlen(pA) - strlen(pB));<br>
&nbsp;&nbsp;&nbsp;&nbsp;if(*(pB + 1))  {  /* pB+1 itself is never NULL; test the character it points at */<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;memcpy(file, pB + 1, strlen(pB) - 1);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;file[strlen(pB) - 1] = 0;<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;else  memcpy(web, pA, strlen(pA));<br>
&nbsp;&nbsp;if(pB)
web[strlen(pA) - strlen(pB)] = 0;<br>
&nbsp;&nbsp;else  web[strlen(pA)] = 0;<br>
&nbsp;&nbsp;pA = strchr(web, ':');<br>
&nbsp;&nbsp;if(pA)
*port = atoi(pA + 1);<br>
&nbsp;&nbsp;else *port =
80;<br>
}<br>
<br>
<br>
int main(int
argc, char *argv[])<br>
{<br>
&nbsp;&nbsp;int sockfd;<br>
&nbsp;&nbsp;char buffer[1024];<br>
&nbsp;&nbsp;struct sockaddr_in server_addr;<br>
&nbsp;&nbsp;struct hostent *host;<br>
&nbsp;&nbsp;int portnumber,nbytes;<br>
&nbsp;&nbsp;char host_addr[256];<br>
&nbsp;&nbsp;char host_file[1024];<br>
&nbsp;&nbsp;char local_file[256];<br>
&nbsp;&nbsp;FILE * fp;<br>
&nbsp;&nbsp;char request[1024];<br>
&nbsp;&nbsp;int send,
totalsend;<br>
&nbsp;&nbsp;int i;<br>
&nbsp;&nbsp;char * pt;<br>
<br>
&nbsp;&nbsp;if(argc!=2)<br>
&nbsp;&nbsp;{<br>
&nbsp;&nbsp;&nbsp;&nbsp;fprintf(stderr,"Usage:%s web-address\a\n",argv[0]);<br>
&nbsp;&nbsp;&nbsp;&nbsp;exit(1);<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;printf("parameter.1
is: %s\n", argv[1]);<br>
&nbsp;&nbsp;ToLowerCase(argv[1]);/*convert the argument to all lowercase*/<br>
&nbsp;&nbsp;printf("lowercase
parameter.1 is: %s\n",
argv[1]);<br>
<br>
&nbsp;&nbsp;GetHost(argv[1], host_addr, host_file, &amp;portnumber);/*parse the URL, port, file name, etc.*/<br>
&nbsp;&nbsp;printf("webhost:%s\n", host_addr);<br>
&nbsp;&nbsp;printf("hostfile:%s\n", host_file);<br>
&nbsp;&nbsp;printf("portnumber:%d\n\n", portnumber);<br>
<br>
&nbsp;&nbsp;if((host=gethostbyname(host_addr))==NULL)/*resolve the host IP address*/<br>
&nbsp;&nbsp;{<br>
&nbsp;&nbsp;&nbsp;&nbsp;fprintf(stderr,"Gethostname error, %s\n", strerror(errno));<br>
&nbsp;&nbsp;&nbsp;&nbsp;exit(1);<br>
&nbsp;&nbsp;}<br>
<br>
&nbsp;&nbsp;/* create the client's sockfd descriptor */<br>
&nbsp;&nbsp;if((sockfd=socket(AF_INET,SOCK_STREAM,0))==-1)/*create the socket*/<br>
&nbsp;&nbsp;{<br>
&nbsp;&nbsp;&nbsp;&nbsp;fprintf(stderr,"Socket Error:%s\a\n",strerror(errno));<br>
&nbsp;&nbsp;&nbsp;&nbsp;exit(1);<br>
&nbsp;&nbsp;}<br>
<br>
&nbsp;&nbsp;/* fill in the server's address information */<br>
&nbsp;&nbsp;bzero(&amp;server_addr,sizeof(server_addr));<br>
&nbsp;&nbsp;server_addr.sin_family=AF_INET;<br>
&nbsp;&nbsp;server_addr.sin_port=htons(portnumber);<br>
&nbsp;&nbsp;server_addr.sin_addr=*((struct in_addr
*)host-&gt;h_addr);<br>
<br>
&nbsp;&nbsp;/* initiate the connection request */<br>
&nbsp;&nbsp;if(connect(sockfd,(struct sockaddr *)(&amp;server_addr),sizeof(struct sockaddr))==-1)/*connect to the site*/<br>
&nbsp;&nbsp;{<br>
&nbsp;&nbsp;&nbsp;&nbsp;fprintf(stderr,"Connect Error:%s\a\n",strerror(errno));<br>
&nbsp;&nbsp;&nbsp;&nbsp;exit(1);<br>
&nbsp;&nbsp;}<br>
<br>
&nbsp;&nbsp;sprintf(request,
"GET /%s HTTP/1.1\r\nAccept:
*/*\r\nAccept-Language: zh-cn\r\n\<br>
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n\<br>
Host: %s:%d\r\nConnection: Close\r\n\r\n", host_file,
host_addr, portnumber);<br>
&nbsp;&nbsp;printf("%s", request);/*the request prepared for sending to the host*/<br>
<br>
&nbsp;&nbsp;/*derive the real file name*/<br>
&nbsp;&nbsp;if(host_file &amp;&amp; *host_file)
pt = Rstrchr(host_file, '/');<br>
&nbsp;&nbsp;else pt = 0;<br>
<br>
&nbsp;&nbsp;memset(local_file,
0, sizeof(local_file));<br>
&nbsp;&nbsp;if(pt &amp;&amp; *pt)  {<br>
&nbsp;&nbsp;&nbsp;&nbsp;if((pt
+ 1) &amp;&amp; *(pt+1))  strcpy(local_file,
pt + 1);<br>
&nbsp;&nbsp;&nbsp;&nbsp;else  memcpy(local_file,
host_file, strlen(host_file)
- 1);<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;else if(host_file
&amp;&amp; *host_file)  strcpy(local_file, host_file);<br>
&nbsp;&nbsp;else  strcpy(local_file, "index.html");<br>
&nbsp;&nbsp;printf("local
filename to write:%s\n\n",
local_file);<br>
<br>
&nbsp;&nbsp;/*send the HTTP request*/<br>
&nbsp;&nbsp;send = 0;totalsend
= 0;<br>
&nbsp;&nbsp;nbytes=strlen(request);<br>
&nbsp;&nbsp;while(totalsend &lt;
nbytes) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;send = write(sockfd, request +
totalsend, nbytes - totalsend);<br>
&nbsp;&nbsp;&nbsp;&nbsp;if(send==-1)  {printf("send error!%s\n",
strerror(errno));exit(0);}<br>
&nbsp;&nbsp;&nbsp;&nbsp;totalsend+=send;<br>
&nbsp;&nbsp;&nbsp;&nbsp;printf("%d bytes send OK!\n",
totalsend);<br>
&nbsp;&nbsp;}<br>
<br>
&nbsp;&nbsp;fp = fopen(local_file, "w");  /* "w" rather than "a": re-running should overwrite, not append */<br>
&nbsp;&nbsp;if(!fp)  {<br>
&nbsp;&nbsp;&nbsp;&nbsp;printf("create file error! %s\n", strerror(errno));<br>
&nbsp;&nbsp;&nbsp;&nbsp;return 0;<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;printf("\nThe
following is the response header:\n");<br>
&nbsp;&nbsp;i=0;<br>
&nbsp;&nbsp;/* connected; receive the HTTP response */<br>
&nbsp;&nbsp;while((nbytes=read(sockfd,buffer,1))==1)<br>
&nbsp;&nbsp;{<br>
&nbsp;&nbsp;&nbsp;&nbsp;if(i &lt;
4)  {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(buffer[0] == '\r' || buffer[0] == '\n')  i++;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;else i = 0;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf("%c", buffer[0]);/*print the HTTP header to the screen*/<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;else  {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fwrite(buffer, 1, 1, fp);/*write the HTTP body to the file*/<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;i++;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if(i%1024
== 0)  fflush(fp);/*flush to disk every 1K*/<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;}<br>
&nbsp;&nbsp;fclose(fp);<br>
&nbsp;&nbsp;/* end the communication */<br>
&nbsp;&nbsp;close(sockfd);<br>
&nbsp;&nbsp;exit(0);<br>
}</code></p>
<p style="margin: 5px; line-height: 150%;"><br><code></code></p>
<p style="margin: 5px; line-height: 150%;"><code>zj@zj:~/C_pram/practice/http_client$
ls<br>httpclient&nbsp; httpclient.c<br>zj@zj:~/C_pram/practice/http_client$
./httpclient http://www.baidu.com/<br>parameter.1 is:
http://www.baidu.com/<br>lowercase parameter.1 is: http://www.baidu.com/<br>webhost:www.baidu.com<br>hostfile:<br>portnumber:80<br><br>GET
/ HTTP/1.1<br>Accept: */*<br>Accept-Language: zh-cn<br>User-Agent:
Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)<br>Host:
www.baidu.com:80<br>Connection: Close<br><br>local filename to
write:index.html<br><br>163 bytes send OK!<br><br>The following is the
response header:<br>HTTP/1.1 200 OK<br>Date: Wed, 29 Oct 2008 10:41:40
GMT<br>Server: BWS/1.0<br>Content-Length: 4216<br>Content-Type:
text/html<br>Cache-Control: private<br>Expires: Wed, 29 Oct 2008
10:41:40 GMT<br>Set-Cookie:
BAIDUID=A93059C8DDF7F1BC47C10CAF9779030E:FG=1; expires=Wed, 29-Oct-38
10:41:40 GMT; path=/; domain=.baidu.com<br>P3P: CP=" OTI DSP COR IVA OUR
IND COM "<br><br>zj@zj:~/C_pram/practice/http_client$ ls<br>httpclient&nbsp;
httpclient.c&nbsp; index.html<br></code></p>
<code>If you don't specify a file name, it simply downloads the site's default front page ^_^.</code>
<br><br>from:<br><a  href="http://blog.chinaunix.net/u2/76292/showart_1353805.html">http://blog.chinaunix.net/u2/76292/showart_1353805.html
</a><br><br><img src ="http://www.cppblog.com/beautykingdom/aggbug/118839.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-06-27 23:16 <a href="http://www.cppblog.com/beautykingdom/archive/2010/06/27/118839.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Fetching Web Page Data in C</title><link>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118838.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Sun, 27 Jun 2010 15:13:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118838.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/118838.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/06/27/118838.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/118838.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/118838.html</trackback:ping><description><![CDATA[<p style="margin: 5px; line-height: 150%;"><code>#include &lt;stdio.h&gt;<br>#include
&lt;stdlib.h&gt;<br>#include &lt;string.h&gt;<br>#include &lt;sys/socket.h&gt;<br>#include &lt;netinet/in.h&gt;<br>#include &lt;netdb.h&gt;<br><br>#define HTTPPORT 80<br><br><br>char* head =<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"GET /u2/76292/ HTTP/1.1\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Accept: */*\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Accept-Language: zh-cn\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Accept-Encoding: gzip, deflate\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"User-Agent: Mozilla/4.0 (compatible;
MSIE 6.0; Windows NT 5.1; SV1; CIBA; TheWorld)\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Host: blog.chinaunix.net\r\n"<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"Connection: Close\r\n\r\n";  /* "Close" lets the recv loop below end when the server finishes */<br><br>int connect_URL(char *domain,int port) <br>{<br>&nbsp;&nbsp;&nbsp;&nbsp;int
sock;<br>&nbsp;&nbsp;&nbsp;&nbsp;struct hostent *
host; <br>&nbsp;&nbsp;&nbsp;&nbsp;struct sockaddr_in server; <br>&nbsp;&nbsp;&nbsp;&nbsp;host =
gethostbyname(domain); <br>&nbsp;&nbsp;&nbsp;&nbsp;if (host == NULL) <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{ <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf("gethostbyname error\n");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return
-2;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;// printf("HostName:
%s\n",host-&gt;h_name);<br><br>&nbsp;&nbsp;&nbsp;// printf("IP Address: %s\n",inet_ntoa(*((struct in_addr
*)host-&gt;h_addr)));<br><br>&nbsp;&nbsp;&nbsp;&nbsp;sock = socket(AF_INET,SOCK_STREAM,0);<br>&nbsp;&nbsp;&nbsp;&nbsp;if (sock
&lt; 0) <br>&nbsp;&nbsp;&nbsp;&nbsp;{
<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf("invalid socket\n"); <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return -1;<br>&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;memset(&amp;server,0,sizeof(struct sockaddr_in));<br>&nbsp;&nbsp;&nbsp;&nbsp;memcpy(&amp;server.sin_addr,host-&gt;h_addr_list[0],host-&gt;h_length);<br>&nbsp;&nbsp;&nbsp;&nbsp;server.sin_family = AF_INET;<br>&nbsp;&nbsp;&nbsp;&nbsp;server.sin_port = htons(port);<br>&nbsp;&nbsp;&nbsp;&nbsp;return (connect(sock,(struct sockaddr *)&amp;server,sizeof(struct sockaddr)) &lt;0) ? -1 : sock;<br>}<br><br><br>int main()<br>{<br>&nbsp;&nbsp;int
sock;<br>&nbsp;&nbsp;char buf[100];<br>&nbsp;&nbsp;char *domain
= "blog.chinaunix.net";<br>&nbsp;&nbsp;FILE *fp;  /* was missing from the original listing */<br><br>&nbsp;&nbsp;fp = fopen("test.txt","wb");  /* "wb": this file is written to, not read */<br>&nbsp;&nbsp;if(NULL == fp){<br>&nbsp;&nbsp;&nbsp;&nbsp;printf("can't open output file!\n");<br>&nbsp;&nbsp;&nbsp;&nbsp;return
-1;<br>&nbsp;&nbsp;}<br>&nbsp;&nbsp;<br><br>&nbsp;&nbsp;&nbsp;&nbsp;sock
= connect_URL(domain,HTTPPORT);<br>&nbsp;&nbsp;&nbsp;&nbsp;if (sock
&lt;0){<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf("connetc err\n");<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return
-1;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br><br>&nbsp;&nbsp;&nbsp;&nbsp;send(sock,head,strlen(head),0);<br><br>&nbsp;&nbsp;&nbsp;&nbsp;while(1)<br>&nbsp;&nbsp;&nbsp;&nbsp;{<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if((recv(sock,buf,100,0))&lt;1)<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;break;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;fprintf(fp,"%s",bufp); //save http data<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>&nbsp;&nbsp;&nbsp;&nbsp;<br>&nbsp;&nbsp;&nbsp;&nbsp;fclose(fp);<br>&nbsp;&nbsp;&nbsp;&nbsp;close(sock);<br>&nbsp;&nbsp;<br>&nbsp;&nbsp;printf("bye!\n");<br>&nbsp;&nbsp;return 0;<br>} </code></p>
<p>&nbsp;</p>
Here I save the received data to the local disk; you can modify the program starting from this. To see what to put in the head request string, capture a real request yourself with wireshark.
<br><br>from:<br><a href="http://blog.chinaunix.net/u2/76292/showart.php?id=2123108">http://blog.chinaunix.net/u2/76292/showart.php?id=2123108</a>
<br><br> <img src ="http://www.cppblog.com/beautykingdom/aggbug/118838.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-06-27 23:13 <a href="http://www.cppblog.com/beautykingdom/archive/2010/06/27/118838.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>TCP Flow Control</title><link>http://www.cppblog.com/beautykingdom/archive/2010/01/08/105213.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Fri, 08 Jan 2010 15:34:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/01/08/105213.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/105213.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/01/08/105213.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/105213.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/105213.html</trackback:ping><description><![CDATA[<div>1. Introduction</div>
<div>&nbsp;</div>
<div>TCP is a protocol that provides flow control and reliable connections. To prevent congestion and to improve transfer efficiency, a number of TCP flow-control and optimization algorithms were proposed early in the development of networking, and RFC 2581 makes them part of every TCP implementation.</div>
<div>&nbsp;</div>
<div>For convenience, this article refers to a "TCP segment" simply as a "packet".</div>
<div>&nbsp;</div>
<div>2. Slow Start and Congestion Avoidance</div>
<div>&nbsp;</div>
<div>Slow start and congestion avoidance MUST be implemented by a TCP sender; they keep the sender from injecting large bursts of data into the network and congesting it.</div>
<div><br>First, a few relevant parameters. Each endpoint maintains them, but they never appear in the TCP packets themselves:</div>
<div><br><strong>The congestion window (cwnd) is the amount of data the sender may send into the network before receiving the peer's ACK. The congestion window shrinks as data is sent and grows correspondingly when ACKs arrive; the larger it is, the more data can be sent.</strong> RFC 2581 limits the initial congestion window to at most twice the sender's MSS and no more than two TCP packets; RFC 3390 updates how the initial window size is set.</div>
<div><br><strong>The advertised window (rwnd) is the amount of data the receiver can accept that it has not yet ACKed. The advertised window shrinks as data is received and grows again as ACKs are sent.</strong></div>
<div><br>The slow start threshold (ssthresh) is the parameter used to decide whether slow start or congestion avoidance should govern the flow; it too changes over the life of the connection.</div>
<div><br>When cwnd &lt; ssthresh the congestion window is still small, indicating that the amount of unacknowledged data is growing, and slow start is used; when cwnd &gt; ssthresh a large amount of data may be sent, and congestion avoidance is used.</div>
<div><br><strong>The congestion window cwnd shrinks automatically as data is sent, but growing it depends on how the peer is receiving; slow start and congestion avoidance both describe how cwnd is grown.</strong></div>
<div><br>Under slow start, each ACK the sender receives grows the congestion window by at most one sender-MSS worth of bytes; the algorithm ends once the congestion window exceeds ssthresh or congestion is observed.</div>
<div><br>Under congestion avoidance, the congestion window grows by one maximum-sized TCP packet per round-trip time (RTT); implementations typically use the formula:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; cwnd += max(SMSS*SMSS/cwnd, 1)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (2.1)</div>
<div>where SMSS is the sender's MSS.</div>
<div><br>When the TCP sender detects packet loss it must adjust ssthresh, typically as:</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ssthresh = max (FlightSize / 2, 2*SMSS)&nbsp;&nbsp;&nbsp; (2.2)</div>
<div>where FlightSize is the amount of data that has been sent but not yet acknowledged.</div>
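Integer arithmetic makes formulas 2.1 and 2.2 easy to get wrong, so here is a minimal sketch of both, assuming byte-counted windows; the function names are ours, and this illustrates the formulas rather than reproducing any real TCP stack:

```c
/* Congestion-avoidance growth (2.1) and loss response (2.2),
   with all window sizes counted in bytes. */

/* One ACK arrives during congestion avoidance: grow cwnd by
   SMSS*SMSS/cwnd, i.e. roughly one SMSS per RTT. */
unsigned cwnd_after_ack(unsigned cwnd, unsigned smss)
{
    unsigned inc = smss * smss / cwnd;
    return cwnd + (inc > 1 ? inc : 1);   /* grow by at least 1 byte */
}

/* Packet loss detected: ssthresh = max(FlightSize/2, 2*SMSS). */
unsigned ssthresh_after_loss(unsigned flight_size, unsigned smss)
{
    unsigned half = flight_size / 2;
    return half > 2 * smss ? half : 2 * smss;
}
```

For example, with SMSS = 1460 and cwnd = 5840 (four segments), one ACK grows cwnd by 365 bytes, so about four ACKs, one RTT's worth, grow cwnd by roughly one SMSS.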
<div>&nbsp;</div>
<div>3. Fast Retransmit and Fast Recovery</div>
<div><br>When a TCP receiver gets an out-of-order packet it replies with a duplicate ACK, hinting to the sender that the network may have dropped a packet. After three duplicate ACKs in a row, the sender runs fast retransmit: using the acknowledgment number it immediately retransmits the probably-lost packet rather than waiting for the retransmission timer to expire, as an ordinary retransmission must. A TCP sender SHOULD implement this algorithm, but it is not mandatory. After the fast retransmit the sender enters fast recovery, which lasts until duplicate ACKs stop arriving.</div>
<div><br>Concretely, fast retransmit and fast recovery proceed as follows:<br>1. On the third duplicate ACK, ssthresh is recomputed using formula 2.2;</div>
<div>2. After the lost packet is retransmitted, the congestion window cwnd is set to ssthresh + 3*SMSS, artificially inflating it;</div>
<div>3. For each additional duplicate ACK received, cwnd grows by SMSS, inflating the window further;</div>
<div>4. If the new cwnd and the receiver's advertised window permit, new packets may be sent;</div>
<div>5. When the next ACK acknowledging new data arrives, cwnd is deflated back to ssthresh; the receiver sends that ACK, confirming the data received so far, as soon as it gets the retransmitted packet.</div>
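The five steps above amount to simple window arithmetic. The following sketch illustrates it under the same byte-counted assumption; the struct and function names are ours, not taken from any real implementation:

```c
/* Minimal model of the fast-recovery window adjustments. */
typedef struct { unsigned cwnd, ssthresh; } cc_state;

/* Steps 1 and 2: on the third duplicate ACK, recompute ssthresh by
   formula 2.2 and inflate cwnd to ssthresh + 3*SMSS. */
void on_third_dup_ack(cc_state *c, unsigned flight_size, unsigned smss)
{
    unsigned half = flight_size / 2;
    c->ssthresh = half > 2 * smss ? half : 2 * smss;
    c->cwnd = c->ssthresh + 3 * smss;
}

/* Step 3: each further duplicate ACK inflates cwnd by one SMSS. */
void on_extra_dup_ack(cc_state *c, unsigned smss)
{
    c->cwnd += smss;
}

/* Step 5: the ACK that finally covers new data deflates cwnd to ssthresh. */
void on_new_data_ack(cc_state *c)
{
    c->cwnd = c->ssthresh;
}
```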
<div>&nbsp;</div>
<div>4. Conclusion</div>
<div>These algorithms aim at keeping the network reliable and available and at preventing the collapse that congestion would cause; they are relatively conservative.</div>
<div><br>5. Appendix: a discussion</div>
<div><br>A: These algorithms all concern the two communicating endpoints. But from the angle of building middleboxes such as firewalls,<br>&nbsp;&nbsp;&nbsp;&nbsp; does a middle device need to consider them at all?<br>Duanmu: Hmm... I can't see the necessity either, since the algorithms' parameters live inside the two endpoints and never show up in the TCP packets...<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; but they should make life easier for middle devices. It's like driving on a road: these algorithms are the traffic rules</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; that make you drive properly. The traffic police only care how you drive, not what car you drive; when everyone drives well the police</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; have an easy time too. A good car makes it easy to drive well, but a bad car can be driven well too.</div>
<div><br>A: These algorithms were proposed long ago - the earliest in 1988, when networks were in their infancy. Back then a</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp; 9600 bps modem made you somebody, and computers were weak, so applying these algorithms had some use; but</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp; almost 20 years later, 100 Mbit is nearly obsolete, gigabit and 10-gigabit networks are spreading, and even a PC's memory</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp; runs to gigabytes. Is it still meaningful to discipline a few KB of data? Jet fighters are in their</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp; 4th generation - is there any point in still studying propeller fighters?<br>Duanmu: Well... it's like a virus database: it holds countless DOS-era viruses you will probably never see again in your life,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; yet no antivirus vendor removes them - the database only grows. This knowledge is the same: precisely because it is useless day to day and nobody pays attention to it, knowing it lets you show off a little,</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; and it is remarkably effective for intimidating people!</div>
<div><br>A: You really have too much time on your hands!<br>Duanmu: You got it! Why else would I blog!</div>
<div><br>Duanmu: Doing technology is sometimes a sad business: you must drag along piles of old baggage - backward<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; compatibility - and past a certain point that becomes the biggest obstacle to further progress. Here is a story from smth that is not really</div>
<div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a joke:</div>
<div><br><em>&nbsp;&nbsp;&nbsp; The gauge of a modern railway is 4 feet 8.5 inches. The rails took their spacing from the tram standard, and the tram standard in turn followed the wheel spacing of horse-drawn carriages. <br>&nbsp;&nbsp;&nbsp; Why is a carriage's wheel spacing 4 feet 8.5 inches? Because that was the width of the wheel ruts on England's roads; a carriage with any other wheel spacing would quickly smash its wheels on the old English roads. <br>&nbsp;&nbsp;&nbsp; And where did the English rut width come from? That goes back to ancient Rome. The old roads across Europe (England included) were laid by the Romans for their armies, and 4 feet 8.5 inches was exactly the width of a Roman war chariot. <br>&nbsp;&nbsp;&nbsp; And how was the chariot's width chosen? Simple: it is the combined width of the rear ends of the two horses that pulled it. <br>&nbsp;&nbsp;&nbsp; The story does not end there. The rocket boosters of the American Space Shuttle cannot escape the horses' rears either: after being built, the boosters travel by rail, the railway inevitably passes through tunnels, and the tunnels' width follows the track gauge. So the width of the rocket booster, a piece of cutting-edge technology, was determined by the combined width of two horses' rear ends.<br>From:<br>http://www.cppblog.com/prayer/archive/2009/04/20/80527.html<br></em></div><img src ="http://www.cppblog.com/beautykingdom/aggbug/105213.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-01-08 23:34 <a href="http://www.cppblog.com/beautykingdom/archive/2010/01/08/105213.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>Flow Control and Congestion Control</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/30/104460.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 30 Dec 2009 08:54:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/30/104460.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/104460.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/30/104460.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/104460.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/104460.html</trackback:ping><description><![CDATA[Congestion is the degradation of transmission performance in a packet-switched network that occurs when too many packets are being carried while the store-and-forward nodes have only limited resources. An extreme form of congestion is deadlock, and recovering from deadlock often requires resetting the network. <br>Flow control limits the amount and rate of data the sender puts on a channel so that it does not exceed what the receiver can bear - chiefly the rate at which the receiver can accept data and the size of its receive buffer. Flow is usually controlled with stop-and-wait or a sliding window. <br>Flow control addresses limited resources in the end systems; congestion control addresses limited resources in the intermediate nodes.<br><img src 
="http://www.cppblog.com/beautykingdom/aggbug/104460.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-30 16:54 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/30/104460.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>用wget下载文件或目录或者是整个网站</title><link>http://www.cppblog.com/beautykingdom/archive/2009/12/22/103663.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Mon, 21 Dec 2009 17:04:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/12/22/103663.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/103663.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/12/22/103663.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/103663.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/103663.html</trackback:ping><description><![CDATA[wget -m -nH -b -q -P /home/web http://domain
<br>具体参数的含义还没有man，等man过之后再添加进来哈<br><br><img src ="http://www.cppblog.com/beautykingdom/aggbug/103663.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-12-22 01:04 <a href="http://www.cppblog.com/beautykingdom/archive/2009/12/22/103663.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>http请求的详细过程---理解计算机网络&lt;转&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2009/10/21/99142.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Wed, 21 Oct 2009 15:05:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/10/21/99142.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/99142.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/10/21/99142.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/99142.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/99142.html</trackback:ping><description><![CDATA[<span style="FONT-FAMILY: '微软雅黑','黑体',Arial,Helvetica,Sans-Serif">&nbsp;
<p><strong><font size=3>A detailed walk through an HTTP request</font></strong></p>
<p><font size=3>Let us look at everything that happens behind the scenes when we type </font><a href="http://www.mycompany.com:8080/mydir/index.html"><font color=#189a0f size=3><u>http://www.mycompany.com:8080/mydir/index.html</u></font></a><font size=3> into a browser.</font></p>
<p><font size=3>First, HTTP is an application-layer protocol. A protocol at this layer is only a communication convention: for the two sides to talk, they must agree on a set of rules beforehand.</font></p>
<p><font size=3>1. Connect. When we issue such a request, a socket connection must first be established. A socket is built from an IP and a port, so a DNS resolution step comes first to turn </font><a href="http://www.mycompany.com/"><font color=#189a0f size=3><u>www.mycompany.com</u></font></a><font size=3> into an IP address; if the URL carries no port number, the protocol's default port is used.</font></p>
<p><font size=3>DNS resolution works like this. Our local machine has DNS servers configured in its network settings, so the host sends the name to that configured DNS server; if the server can resolve the name it returns the IP, otherwise it forwards the query to its upstream DNS server. The whole DNS system can be viewed as a tree, and the query travels toward the root until an answer is found. With the target IP and port in hand, we can open the socket connection.</font></p>
<p><font size=3>2. Request. Once the connection is established, the browser sends the web server a request, normally a GET or POST command (POST is used to pass FORM parameters). A GET command has the form:　　GET path/filename HTTP/1.0<br>The filename names the file being requested, and HTTP/1.0 is the HTTP version the browser speaks. We can now send the GET command:</font></p>
<p><font size=3>GET /mydir/index.html HTTP/1.0</font></p>
<p><font size=3>3. Response. The web server receives the request and handles it, searching its document space for the file index.html in subdirectory mydir. If the file is found, the server sends its contents to the requesting browser.</font></p>
<p><font size=3>To inform the browser, the web server first sends some HTTP header information, then the actual content (the HTTP body); headers and body are separated by a single blank line.<br>Common header lines include:<br>　　① HTTP 1.0 200 OK 　The first line of the server's reply, giving the HTTP version the server runs and the response code. "200 OK" means the request completed.<br>　　② MIME_Version:1.0　The MIME version in use.<br>　　③ content_type:　A very important header giving the MIME type of the body; for example content_type:text/html says the data being sent is an HTML document.<br>　　④ content_length:　The length of the body in bytes.</font></p>
<p><br><font size=3>4. Close. When the response has been delivered, browser and web server must disconnect, so that other browsers can connect to the server.</font></p>
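The blank line that separates headers from body can be located mechanically. The C fragment below is a minimal illustration of that split, not a full HTTP parser; the function name http_body is ours:

```c
#include <string.h>

/* Return a pointer to the body of an HTTP response, i.e. the text after
   the first blank line (CRLF CRLF), or NULL if no blank line is present. */
const char *http_body(const char *response)
{
    const char *sep = strstr(response, "\r\n\r\n");
    return sep ? sep + 4 : NULL;
}
```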
<p><br><strong><font size=3>Now let us analyze the packet's journey through the network in detail</font></strong></p>
<p><font size=3>In a layered network architecture, each layer depends strictly and one-way on the layer below. A "service" is the abstraction describing the relation between layers: the set of operations a layer offers to the layer immediately above it. The lower layer is the service provider; the upper layer is the user requesting service. Services take the form of primitives, such as system calls or library functions. A system call is a service primitive the operating-system kernel offers to network applications and higher-level protocols. Layer n must always give layer n+1 a more complete service than layer n-1 provides, or layer n has no reason to exist.</font></p>
<p><font size=3>The transport layer implements "end-to-end" communication and introduces inter-process communication across the network; it must also solve error control, flow control, data ordering (message ordering) and connection management, and offers different service modes for them. Transport-layer services are usually provided through system calls, in the form of sockets. For a client, setting up a socket connection means calling functions such as socket(), bind() and connect(), after which data can be sent with send().</font></p>
<p><font color=#008000 size=3><strong>Now watch the packet make its way through the network:</strong></font></p>
<p><font color=#008000 size=3>Application layer</font></p>
<p><font size=3>At the application layer, the current need and action, combined with the application protocol, determine the data we send. We place that data in a buffer, forming the application-layer message, <strong>data</strong>.</font></p>
<p><font color=#008000 size=3>Transport layer</font></p>
<p><font size=3>The data is handed to the transport layer - say TCP. There the message gets a transport header carrying the port numbers and TCP's various control information, known directly because the interface requires the port to be specified. This yields TCP's transfer unit, the <strong>segment</strong>. TCP is an end-to-end protocol. Using header fields such as the sequence and acknowledgment numbers, the sender keeps sending and waiting for confirmation: after sending a segment it starts a timer and only sends the next one once the ACK arrives; if the timer expires without an ACK, it retransmits. If the receiver gets corrupted data it discards it, which eventually makes the sender time out and retransmit. Through TCP the generation of the send sequence is controlled and continually adjusted, achieving flow control and data integrity.</font></p>
<p><font color=#008000 size=3>Network layer</font></p>
<p><font size=3>The segments to be sent then descend to the network layer, where they are wrapped with a network-layer header containing the source and destination IP addresses; this layer's transfer unit is called a <strong>packet</strong>. The network layer takes responsibility for carrying such packets across the network, through routers, to the destination address. Here, given the destination IP, the address of the next-hop router must be looked up. First on the local host: consult the local routing table (on Windows, run route print to display the current one), which contains sections such as:<br>Active Routes, Default Route, Persistent Route.</font></p>
<p><font size=3>The lookup goes like this:<br>(1) From the destination address, obtain the destination network number; if it is on the same internal network, send directly.<br>(2) Otherwise, search the routing table for a matching route.<br>(3) If no explicit route is found, the routing table still contains a default gateway; IP sends the data toward the default gateway address, which passes it to the next router. The gateway may thus itself be a router, or merely the gateway through which the internal network reaches a particular router.<br>(4) A router that receives the data again looks up a route for the remote host or network; if none is found, the packet goes to that router's default gateway address. The packet carries a maximum hop count; if it is exceeded the packet is dropped, which prevents endless forwarding. A router receiving a packet only inspects the network-layer wrapping, the destination IP - which is why it is said to work at the network layer; transport-layer data is transparent to it.</font></p>
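Steps (1) through (4) above condense into a single table scan: mask the destination with each entry's netmask, compare against the entry's network number, and prefer the most specific match, with the all-zero default route matching everything. A minimal sketch under those assumptions (the type and function names are ours; real stacks also weigh metrics):

```c
#include <stdint.h>

typedef struct {
    uint32_t dest;      /* Network Destination, host byte order */
    uint32_t mask;      /* Netmask */
    uint32_t gateway;   /* next-hop address */
} route_entry;

/* Longest-prefix match over the table; returns the next-hop gateway,
   or 0 when no entry (not even a default route) matches. */
uint32_t next_hop(const route_entry *tbl, int n, uint32_t dst)
{
    uint32_t best_mask = 0, gw = 0;
    int found = 0;
    for (int i = 0; i < n; i++) {
        /* a longer prefix has a numerically larger mask, so '>' prefers it */
        if ((dst & tbl[i].mask) == tbl[i].dest &&
            (!found || tbl[i].mask > best_mask)) {
            found = 1;
            best_mask = tbl[i].mask;
            gw = tbl[i].gateway;
        }
    }
    return found ? gw : 0;
}
```

With two entries, a default route via 192.168.1.2 and a directly connected 192.168.1.0/24, a destination of 192.168.1.5 matches the /24 entry, while 8.8.8.8 falls through to the default gateway.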
<p><font size=3>If none of these steps succeeds, the datagram cannot be delivered. If the undeliverable datagram originated on the local host, a "host unreachable" or "network unreachable" error is generally returned to the application that generated it.</font></p>
<p><font size=3></font>&nbsp;</p>
<p><font color=#339966 size=3>以windows下主机的路由表为例，看路由的查找过程<br>======================================================================<br>Active Routes:<br>Network Destination&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Netmask&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Gateway&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Interface&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Metric<br>0.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 10<br>127.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;255.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;127.0.0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 127.0.0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>192.168.1.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 255.255.255.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
10<br>192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;255.255.255.255&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;127.0.0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 127.0.0.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;10<br>192.168.1.255&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;255.255.255.255&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 10<br>&nbsp;224.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;240.0.0.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 10<br>255.255.255.255&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;255.255.255.255&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1<br>Default Gateway:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.2</font></p>
<p><font color=#339966 size=3>Network Destination&nbsp; the destination network&nbsp; <br>Netmask&nbsp; the subnet mask&nbsp; <br>Gateway&nbsp; the IP of the next-hop router's entry interface; through interface and gateway the router defines a link to the next router. Normally interface and gateway are in the same subnet.<br>Interface&nbsp; the IP of this router's own exit toward that destination (for a personal PC this is usually the machine's network card, identified by the card's IP address; a PC can of course have several cards).</font></p>
<p><font color=#339966 size=3>The gateway concept mainly concerns traffic between different subnets. When hosts A and B in two subnets want to communicate, A first sends the data to its local gateway, that gateway sends it on to the gateway of B's subnet, and that gateway delivers it to B.<br>The default gateway: when a packet's destination network matches none of your routing entries, where should the router send that packet? The gateway of the default route is determined by the default gateway of your connection - the value we normally fill in when configuring a network connection.</font></p>
<p><font color=#339966 size=3>Usually interface and gateway sit in one subnet. A router may have several interfaces; when a packet arrives, entries are matched by Network Destination. On a match, interface says through which port of the router the packet should leave, and gateway is the gateway address of that subnet.</font></p>
<p><font color=#339966 size=3>Entry 1:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 0.0.0.0&nbsp;&nbsp; 0.0.0.0&nbsp;&nbsp; 192.168.1.2&nbsp;&nbsp;&nbsp; 192.168.1.101&nbsp;&nbsp; 10<br>0.0.0.0 stands for the default route. The entry says: when I receive a packet whose destination network is not in my routing table, I send it out through interface 192.168.1.101 to 192.168.1.2, an interface of the next router, handing the packet over to that router; it is no longer my concern. The route's metric is 10. When several entries match, the one with the smaller Metric is chosen.</font></p>
<p><font color=#339966 size=3>Entry 3:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 192.168.1.0&nbsp;&nbsp; 255.255.255.0&nbsp; 192.168.1.101&nbsp;&nbsp; 192.168.1.101&nbsp; 10<br>The route for a directly connected network - how the router handles packets addressed to a network attached to it. In this case the entry's interface and gateway are the same. When I receive a packet whose destination network is 192.168.1.0, I send it straight out through interface 192.168.1.101, because that port is attached directly to network 192.168.1.0. The route's metric is 10. (Interface and gateway being identical means the packet is delivered straight to the destination address without being handed to another router.)</font></p>
<p><font color=#339966 size=3>In general there are just these two cases: the destination address either is or is not in the same subnet as the current router interface. If it is, send directly and no further router is needed; otherwise the packet must be forwarded to the next router for further handling.</font></p>
<p><font size=3></font>&nbsp;</p>
<p><font size=3>Having found the next-hop IP, we still need its MAC address, which goes into the link-layer header as link-layer data. This is ARP's job. Concretely: look in the ARP cache (on Windows, run arp -a to see its current contents). If it holds the MAC for that IP, return it directly. Otherwise an ARP request must be sent; it carries the source's IP and MAC plus the destination's IP, and is broadcast on the network segment. Every host checks whether its own IP matches the requested one; the host that matches replies with its MAC address and also stores the requester's IP and MAC. Thus the MAC address for the target IP is obtained.</font></p>
<p><font color=#008000 size=3>Link layer</font></p>
<p><font size=3>The MAC address and the link-layer control information are added to the packet, forming a <strong>Frame</strong>. At the link layer, under the link-layer protocol, frames accomplish the data transfer between adjacent nodes: establishing the link, controlling the transfer rate, preserving data integrity.</font></p>
<p><font color=#008000 size=3>Physical layer</font></p>
<p><font size=3>The physical line is responsible only for moving the data, <strong>bit</strong> by bit, from this host to the next stop.</font></p>
<p><font size=3>When the next stop receives the data, it takes it from the physical layer and unwraps it layer by layer, link layer then network layer, performs the processing described above, then wraps it again through the network, link and physical layers and passes it on toward the next address.</font></p>
<p><font size=3>Notice the routing-table lookup in the process above; that table is built and maintained by routing algorithms. In other words, routing algorithms are only used between routers to update and maintain routing tables; the actual data transfer never executes the algorithm, it merely consults the table. This point matters, and the common routing algorithms are worth understanding. The TCP protocol as a whole is fairly complex and resembles the link-layer protocols in places; some of its mechanisms and concepts deserve careful study, for example numbering and acknowledgment, flow control, retransmission, and the send and receive windows.</font></p>
<p><font size=3></font>&nbsp;</p>
<p><strong><font size=3>The basic TCP/IP model and concepts</font></strong></p>
<p><br><strong><font size=3>Physical layer</font></strong></p>
<p><font size=3>Devices: repeaters and hubs. For this layer, data received on one port is forwarded to every port.</font></p>
<p><br><strong><font size=3>Link layer</font></strong></p>
<p><font size=3>Protocols: SDLC (Synchronous Data Link Control), HDLC (High-level Data Link Control), PPP. Among standalone link-layer devices the network card is the most common; bridges are link-layer products too. Some people place certain hub and modem functions at the link layer; others dispute this and consider them physical-layer devices. Beyond that, every switch must work at the data link layer, but only a layer-2 switch works solely there. Layer-3, layer-4 and layer-7 switches can operate at OSI layers 3, 4 and 7 respectively, yet layer-2 capability remains their basic function.</font></p>
<p><font size=3>It is the MAC address table that truly avoids collisions: a switch knows from the destination MAC to which port the data should be forwarded, instead of forwarding to every port the way a hub does. That is why a switch can partition collision domains.</font></p>
<p><br><strong><font size=3>Network layer</font></strong></p>
<p><font size=3>Four main protocols:&nbsp;&nbsp; <br>Internet Protocol (IP): addresses and routes packets between hosts and networks.&nbsp;&nbsp;&nbsp;&nbsp; <br>Address Resolution Protocol (ARP): obtains the hardware addresses of hosts on the same physical network.&nbsp;&nbsp;&nbsp;&nbsp; <br>Internet Control Message Protocol (ICMP): sends messages and reports packet-delivery errors.&nbsp;&nbsp;&nbsp;&nbsp; <br>Internet Group Management Protocol (IGMP): used by IP hosts to report host-group membership to local multicast routers.</font></p>
<p><font size=3>Devices at this layer: layer-3 switches and routers.</font></p>
<p><br><strong><font size=3>Transport layer</font></strong></p>
<p><font size=3>Two important protocols: TCP and UDP.</font></p>
<p><font size=3>Ports: TCP/UDP use IP addresses to identify hosts on the network and port numbers to identify application processes; that is, TCP/UDP identify an application process by the host's IP address plus the port number assigned to the process. A port number is a 16-bit unsigned integer, and TCP port numbers and UDP port numbers form two independent sequences. Although independent, when TCP and UDP both provide some well-known service, the two protocols usually choose the same port number - purely for convenience, not a requirement of the protocols. With port numbers, many processes on one host can use the transport services of TCP/UDP simultaneously, and the communication is end to end: its data is carried by IP, but is independent of the path the IP datagrams take. A triple uniquely identifies an application process globally: (protocol, local address, local port number).</font></p>
<p><font size=3>In other words, TCP and UDP may use the same port number.</font></p>
<p><font size=3>It follows that (protocol, source port, source IP, destination port, destination IP) completely identifies one network connection.</font></p>
<p><strong><font size=3>Application layer</font></strong></p>
<p><font size=3>Over TCP: Telnet, FTP, SMTP, DNS, HTTP. <br>Over UDP: RIP, NTP (Network Time Protocol) and DNS (DNS also uses TCP), SNMP, TFTP.</font></p>
<p><font size=3></font>&nbsp;</p>
<p><font size=3>References:</font></p>
<p><font size=3>Understanding the local routing table </font><a href="http://hi.baidu.com/thusness/blog/item/9c18e5bf33725f0818d81f52.html"><font color=#189a0f size=3><u>http://hi.baidu.com/thusness/blog/item/9c18e5bf33725f0818d81f52.html</u></font></a></p>
<p><font size=3>Internet transport-layer protocols </font><a href="http://www.cic.tsinghua.edu.cn/jdx/book6/3.htm"><font color=#189a0f size=3><u>http://www.cic.tsinghua.edu.cn/jdx/book6/3.htm</u></font></a><font size=3>&nbsp;Computer Networks, Xie Xiren</font></p>
<br>From:<br><a href="http://blog.chinaunix.net/u2/67780/showart_2065190.html">http://blog.chinaunix.net/u2/67780/showart_2065190.html</a></span>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/99142.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-10-21 23:05 <a href="http://www.cppblog.com/beautykingdom/archive/2009/10/21/99142.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>TCP三次握手/四次挥手详解&lt;转&gt;</title><link>http://www.cppblog.com/beautykingdom/archive/2009/10/20/99062.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Tue, 20 Oct 2009 13:15:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2009/10/20/99062.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/99062.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2009/10/20/99062.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/99062.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/99062.html</trackback:ping><description><![CDATA[
<span style="border-collapse: collapse; font-family: song, Verdana; font-size: 12px; "><font size="3">1</font><div style="font-size: 10pt; position: relative; "><wbr><font size="3">、建立连接协议（三次握手）</font><wbr><font size="3"><br style="font: normal normal normal 12px/normal song, Verdana; ">（1）客户端发送一个带SYN标志的TCP报文到服务器。这是三次握手过程中的报文1。<br style="font: normal normal normal 12px/normal song, Verdana; ">（2） 服务器端回应客户端的，这是三次握手中的第2个报文，这个报文同时带ACK标志和SYN标志。因此它表示对刚才客户端SYN报文的回应；同时又标志SYN给客户端，询问客户端是否准备好进行数据通讯。<br style="font: normal normal normal 12px/normal song, Verdana; ">（3） 客户必须再次回应服务段一个ACK报文，这是报文段3。<br style="font: normal normal normal 12px/normal song, Verdana; ">2</font><wbr><font size="3">、连接终止协议（四次挥手）</font><wbr><font size="3"><br style="font: normal normal normal 12px/normal song, Verdana; ">　 　由于TCP连接是全双工的，因此每个方向都必须单独进行关闭。这原则是当一方完成它的数据发送任务后就能发送一个FIN来终止这个方向的连接。收到一个 FIN只意味着这一方向上没有数据流动，一个TCP连接在收到一个FIN后仍能发送数据。首先进行关闭的一方将执行主动关闭，而另一方执行被动关闭。<br style="font: normal normal normal 12px/normal song, Verdana; ">　（1） TCP客户端发送一个FIN，用来关闭客户到服务器的数据传送（报文段4）。<br style="font: normal normal normal 12px/normal song, Verdana; ">　（2） 服务器收到这个FIN，它发回一个ACK，确认序号为收到的序号加1（报文段5）。和SYN一样，一个FIN将占用一个序号。<br style="font: normal normal normal 12px/normal song, Verdana; ">　（3） 服务器关闭客户端的连接，发送一个FIN给客户端（报文段6）。<br style="font: normal normal normal 12px/normal song, Verdana; ">　（4） 客户段发回ACK报文确认，并将确认序号设置为收到序号加1（报文段7）。<br style="font: normal normal normal 12px/normal song, Verdana; ">CLOSED: 这个没什么好说的了，表示初始状态。<br style="font: normal normal normal 12px/normal song, Verdana; ">LISTEN: 这个也是非常容易理解的一个状态，表示服务器端的某个SOCKET处于监听状态，可以接受连接了。<br style="font: normal normal normal 12px/normal song, Verdana; ">SYN_RCVD: 这个状态表示接受到了SYN报文，在正常情况下，这个状态是服务器端的SOCKET在建立TCP连接时的三次握手会话过程中的一个中间状态，很短暂，基本 上用netstat你是很难看到这种状态的，除非你特意写了一个客户端测试程序，故意将三次TCP握手过程中最后一个ACK报文不予发送。因此这种状态 时，当收到客户端的ACK报文后，它会进入到ESTABLISHED状态。<br style="font: normal normal normal 12px/normal song, Verdana; ">SYN_SENT: 
这个状态与SYN_RCVD遥想呼应，当客户端SOCKET执行CONNECT连接时，它首先发送SYN报文，因此也随即它会进入到了SYN_SENT状 态，并等待服务端的发送三次握手中的第2个报文。SYN_SENT状态表示客户端已发送SYN报文。<br style="font: normal normal normal 12px/normal song, Verdana; ">ESTABLISHED：这个容易理解了，表示连接已经建立了。<br style="font: normal normal normal 12px/normal song, Verdana; ">FIN_WAIT_1: 这个状态要好好解释一下，其实FIN_WAIT_1和FIN_WAIT_2状态的真正含义都是表示等待对方的FIN报文。而这两种状态的区别 是：FIN_WAIT_1状态实际上是当SOCKET在ESTABLISHED状态时，它想主动关闭连接，向对方发送了FIN报文，此时该SOCKET即 进入到FIN_WAIT_1状态。而当对方回应ACK报文后，则进入到FIN_WAIT_2状态，当然在实际的正常情况下，无论对方何种情况下，都应该马 上回应ACK报文，所以FIN_WAIT_1状态一般是比较难见到的，而FIN_WAIT_2状态还有时常常可以用netstat看到。<br style="font: normal normal normal 12px/normal song, Verdana; ">FIN_WAIT_2：上面已经详细解释了这种状态，实际上FIN_WAIT_2状态下的SOCKET，表示半连接，也即有一方要求close连接，但另外还告诉对方，我暂时还有点数据需要传送给你，稍后再关闭连接。<br style="font: normal normal normal 12px/normal song, Verdana; ">TIME_WAIT: 表示收到了对方的FIN报文，并发送出了ACK报文，就等2MSL后即可回到CLOSED可用状态了。如果FIN_WAIT_1状态下，收到了对方同时带 FIN标志和ACK标志的报文时，可以直接进入到TIME_WAIT状态，而无须经过FIN_WAIT_2状态。<br style="font: normal normal normal 12px/normal song, Verdana; ">CLOSING: 这种状态比较特殊，实际情况中应该是很少见，属于一种比较罕见的例外状态。正常情况下，当你发送FIN报文后，按理来说是应该先收到（或同时收到）对方的 ACK报文，再收到对方的FIN报文。但是CLOSING状态表示你发送FIN报文后，并没有收到对方的ACK报文，反而却也收到了对方的FIN报文。什 么情况下会出现此种情况呢？其实细想一下，也不难得出结论：那就是如果双方几乎在同时close一个SOCKET的话，那么就出现了双方同时发送FIN报 文的情况，也即会出现CLOSING状态，表示双方都正在关闭SOCKET连接。<br style="font: normal normal normal 12px/normal song, Verdana; ">CLOSE_WAIT: 这种状态的含义其实是表示在等待关闭。怎么理解呢？当对方close一个SOCKET后发送FIN报文给自己，你系统毫无疑问地会回应一个ACK报文给对 方，此时则进入到CLOSE_WAIT状态。接下来呢，实际上你真正需要考虑的事情是察看你是否还有数据发送给对方，如果没有的话，那么你也就可以 close这个SOCKET，发送FIN报文给对方，也即关闭连接。所以你在CLOSE_WAIT状态下，需要完成的事情是等待你去关闭连接。<br style="font: normal normal normal 12px/normal song, Verdana; ">LAST_ACK: 这个状态还是比较容易好理解的，它是被动关闭一方在发送FIN报文后，最后等待对方的ACK报文。当收到ACK报文后，也即可以进入到CLOSED可用状态了。<br style="font: normal normal normal 12px/normal song, Verdana; ">最后有2个问题的回答，我自己分析后的结论（不一定保证100%正确）<br style="font: normal normal normal 12px/normal song, Verdana; ">1、 为什么建立连接协议是三次握手，而关闭连接却是四次握手呢？<br style="font: normal normal normal 12px/normal song, Verdana; ">这 
是因为服务端的LISTEN状态下的SOCKET当收到SYN报文的建连请求后，它可以把ACK和SYN（ACK起应答作用，而SYN起同步作用）放在一 个报文里来发送。但关闭连接时，当收到对方的FIN报文通知时，它仅仅表示对方没有数据发送给你了；但未必你所有的数据都全部发送给对方了，所以你可以未 必会马上会关闭SOCKET,也即你可能还需要发送一些数据给对方之后，再发送FIN报文给对方来表示你同意现在可以关闭连接了，所以它这里的ACK报文 和FIN报文多数情况下都是分开发送的。<br style="font: normal normal normal 12px/normal song, Verdana; ">2、 为什么TIME_WAIT状态还需要等2MSL后才能返回到CLOSED状态？<br style="font: normal normal normal 12px/normal song, Verdana; ">这是因为： 虽然双方都同意关闭连接了，而且握手的4个报文也都协调和发送完毕，按理可以直接回到CLOSED状态（就好比从SYN_SEND状态到 ESTABLISH状态那样）；但是因为我们必须要假想网络是不可靠的，你无法保证你最后发送的ACK报文会一定被对方收到，因此对方处于 LAST_ACK状态下的SOCKET可能会因为超时未收到ACK报文，而重发FIN报文，所以这个TIME_WAIT状态的作用就是用来重发可能丢失的 ACK报文。</font></div><div style="font-size: 10pt; position: relative; "><span  style="font-size: medium;">转自：</span></div><div style="font-size: 10pt; position: relative; "><span  style="font-size: medium;"><a href="http://blog.chinaunix.net/u2/67780/showart.php?id=2071265">http://blog.chinaunix.net/u2/67780/showart.php?id=2071265</a></span></div></span><img src ="http://www.cppblog.com/beautykingdom/aggbug/99062.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2009-10-20 21:15 <a href="http://www.cppblog.com/beautykingdom/archive/2009/10/20/99062.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>