xiaoxiaoling


最近在研究DPDK,这是SIGCOMM 2014的论文,记录在此备忘

Ps:  文中关键词的概念:

segment : 对应于TCP的PDU(协议数据单元),这里指TCP层的包,如果数据太大,TCP负责将它拆分成多个segment(这个概念对理解后文有帮助)

根据unix网络编程卷1 第8页注解2:packet是IP层传递给链路层、并由链路层封装在帧中的数据(不包括帧头);IP层的包(不包括IP头)应该叫datagram,链路层的包叫帧(frame)。不过这里没有特意区分,packet只是数据包的意思

DRAM: 动态随机访问存储器,系统的主要内存

SRAM: 静态随机访问存储器,cpu 的cache

 

Abstract

Contemporary network stacks are masterpieces of generality, supporting many edge-node and middle-node functions. Generality comes at a high performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required so that individual systems could perform many functions. However, as providers have scaled services to support millions of users, they have transitioned toward thousands (or millions) of dedicated servers, each performing a few functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.


现在的网络堆栈在通用性上表现很好,支持许多边缘节点和中间节点功能。通用性伴随着高昂的性能成本:当前的API、内存模型和实现极大地限制了日益强大的硬件的效能。过去,为了让各个系统能够执行多种功能,通用性是必需的。

然而,随着提供商将服务扩展到数百万用户,他们已经转向数千(或数百万)台专用服务器,每台只执行少数几个功能(ps:垂直细分)。我们认为通用性的开销现在是有效扩展的关键障碍,专用化不仅可行,而且是必要的。

We present Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our novel approach merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features. Simultaneously, the servers retain use of conventional programming frameworks. We compare our approach with the FreeBSD and Linux stacks using the nginx web server and NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

我们提出了Sandstorm和Namestorm:web和DNS服务器,它们采用全新设计的用户空间网络堆栈,并利用应用程序特定工作负载的知识。基于netmap框架,我们的新方法合并了应用程序和网络堆栈的内存模型,根据应用层知识积极地摊销协议层成本,与NIC事件模型紧密耦合,并利用微架构特性。同时,这些服务器仍然可以使用常规的编程框架。我们将我们的方法与FreeBSD和Linux协议栈(分别运行nginx web服务器和NSD name server)进行比较,展示了Web服务器和DNS吞吐量分别提升2-10倍和9倍、CPU使用率更低、多核线性扩展以及跑满NIC硬件的效果。

INTRODUCTION

Conventional network stacks were designed in an era where individual systems had to perform multiple diverse functions. In the last decade, the advent of cloud computing and the ubiquity of networking has changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems, they are forced to employ many thousands of servers, with each providing only a single network service. Yet most content is still served with conventional general-purpose network stacks.

 

介绍

传统网络堆栈是在各个系统必须执行多种不同功能的时代设计的。 在过去十年中,云计算的出现和网络的普及改变了这种模式;今天,大型内容提供商为数亿客户提供服务。 为了扩展他们的系统,他们被迫使用成千上万的服务器,每个服务器仅提供单个网络服务。 然而,大多数内容仍然由传统的通用网络栈提供服务。

 

These general-purpose stacks have not stood still, but today’s stacks are the result of numerous incremental updates on top of codebases that were originally developed in the early 1990s. Arguably, these network stacks have proved to be quite efficient, flexible, and reliable, and this is the reason that they still form the core of contemporary networked systems. They also provide a stable programming API, simplifying software development. But this generality comes with significant costs, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.

 

这些通用栈并没有停滞不前,但今天的栈是在最初于20世纪90年代初开发的代码库之上进行大量增量更新的结果。 可以说,这些网络堆栈已经被证明是相当高效、灵活和可靠的,这就是它们仍然构成当代网络系统核心的原因。 它们还提供稳定的编程API,简化软件开发。 但这种通用性带来了巨大的成本,我们认为,通用性的开销现在是有效扩展的关键障碍,专用化不仅可行,而且是必要的。

In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high performance DNS server. More importantly, however, our approach does not simply shift the network stack to userspace: we also promote tight integration and specialization of application and stack functionality, achieving cross-layer optimizations antithetical to current design practices.

 

在本文中,我们重新审视了专用网络堆栈的想法。 特别是,我们开发了Sandstorm(用于提供静态Web内容的专用用户空间堆栈)和Namestorm(一个实现高性能DNS服务器的专用堆栈)。 更重要的是,我们的方法不是简单地将网络栈转移到用户空间:我们还促进应用程序和堆栈功能的紧密集成和专用化,实现与当前设计实践相悖的跨层优化。

 

Servers such as Sandstorm could be used for serving images such as the Facebook logo, as OCSP [20] responders for certificate revocations, or as front end caches to popular dynamic content. This is a role that conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand over serving static content to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload data itself, using scatter-gather to directly DMA the payload from the disk buffer cache to the NIC. They also utilize the features of smart network hardware, such as TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) to further improve performance. With such optimizations, nginx does perform well, but as we will demonstrate, a specialized stack can outperform it by a large margin.

 

像Sandstorm这样的服务器可以用于提供诸如Facebook徽标的图像,作为用于证书撤销的OCSP [20]响应端,或者作为热门动态内容的前端缓存。 这是常规堆栈应该擅长的角色:nginx [6]使用sendfile()系统调用将静态内容的发送移交给操作系统。 FreeBSD和Linux随后实现了零拷贝堆栈(至少对于有效载荷数据本身),使用scatter-gather(分散-聚集)直接将有效载荷从磁盘缓冲区缓存DMA到NIC。 它们还利用智能网络硬件的特性,如TCP分段卸载(TSO)(ps:segment是tcp层的包,这里Segmentation 指将大tcp包分段的功能放在硬件中完成)和大接收卸载(LRO),以进一步提高性能。有了这样的优化,nginx的表现良好,但正如我们将证明的,一个专用堆栈可以大幅度超越它。
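As a concrete illustration of the conventional path described above, the sketch below shows roughly how a server hands a static file to the kernel with Linux's sendfile(2); FreeBSD's sendfile() has a different signature but plays the same role. This is not code from the paper: the descriptor names are placeholders, and HTTP headers would still be written separately, which is exactly the remaining copy that the specialized stacks avoid.

```c
/* Minimal sketch (not from the paper): serving a static file over an
 * already-connected socket with Linux sendfile(2), the zero-copy path
 * that nginx relies on. "client_fd" and "path" are placeholders and
 * error handling is abbreviated. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_static_file(int client_fd, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel DMAs pages from the buffer cache toward the NIC;
         * the payload itself never crosses into userspace. */
        ssize_t n = sendfile(client_fd, fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;
    }
    close(fd);
    return (offset == st.st_size) ? 0 : -1;
}
```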

 

Namestorm is aimed at handling extreme DNS loads, such as might be seen at the root nameservers, or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.

Namestorm旨在处理极端的DNS负载,例如根名称服务器可能遇到的负载,或者服务器受到高速率DDoS攻击时的负载。 这里开源的最高水平是NSD [5],它与在发送和接收UDP packet时尽量减少数据复制的现代操作系统相结合,性能良好。然而,Namestorm的性能可以达到它的九倍。

 

Our userspace web server and DNS server are built upon FreeBSD's netmap [31] framework, which directly maps the NIC buffer rings to userspace. We will show that not only is it possible for a specialized stack to beat nginx, but on data-center-style networks when serving small files typical of many web pages, it can achieve three times the throughput on older hardware, and more than six times the throughput on modern hardware supporting DDIO.

 

我们的用户空间Web服务器和DNS服务器是基于FreeBSD的netmap [31]框架构建的,它直接将NIC缓冲环映射到用户空间。我们将展示一个专用堆栈不仅可以击败nginx:在数据中心式网络上提供许多网页典型的小文件时,它在旧硬件上可以达到三倍的吞吐量,在支持DDIO的现代硬件上可以达到六倍以上的吞吐量。

The demonstrated performance improvements come from four places. First, we implement a complete zero-copy stack, not only for payload but also for all packet headers, so sending data is very efficient. Second, we allow aggressive amortization that spans traditionally stiff boundaries – e.g., application-layer code can request pre-segmentation of data intended to be sent multiple times, and extensive batching is used to mitigate system-call overhead from userspace. Third, our implementation is synchronous, clocked from received packets; this improves cache locality and minimizes the latency of sending the first packet of the response. Finally, on recent systems, Intel’s DDIO provides substantial benefits, but only if packets to be sent are already in the L3 cache and received packets are processed to completion immediately. It is hard to ensure this on conventional stacks, but a special-purpose stack can get much closer to this ideal.

所展示的性能改进来自四个方面。 首先,我们实现了一个完整的零拷贝堆栈,不仅针对有效负载,而且针对所有packet header(ps: 应该指包含ip头的包),因此发送数据非常高效。 第二,我们允许跨越传统上僵硬边界的激进摊销,例如,应用层代码可以对要被发送多次的数据预先分段(ps:segmentation 指大包分小包),并且广泛使用批处理来降低来自用户空间的系统调用开销。 第三,我们的实现是同步的,由接收到的数据包驱动;这改进了缓存局部性,并使发送响应第一个分组的等待时间最小化。 最后,在最近的系统上,英特尔的DDIO提供了巨大的好处,但前提是要发送的数据包已经在L3缓存中,并且接收到的数据包立即被处理完毕。 在常规堆栈中很难确保这一点,但是专用堆栈可以更接近这个理想。

Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach, and demonstrated significant performance gains over the NCSA web server in 1994. Despite this, the concept has never really taken off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without needing to rewrite all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe that the time has come to re-evaluate special-purpose stacks on today's hardware.

The key contributions of our work are:

 

当然,用户空间堆栈不是一个新颖的概念。 事实上,用于MIT XOK Exokernel [19]操作系统的Cheetah Web服务器就采用了类似的方法,并且在1994年相对NCSA Web服务器展示出显著的性能提升。尽管如此,这一概念从未真正流行起来,而在此期间,常规堆栈已经有了巨大的改进。 与XOK不同,我们的专用用户空间堆栈是建立在常规FreeBSD操作系统之上的。 我们将展示,有可能获得专用堆栈的所有性能增益,而不需要重写成熟操作系统提供的所有辅助支持功能(例如文件系统)。 再加上扩展服务器集群的需求,我们认为,现在是在当今硬件上重新评估专用堆栈的时候了。

我们工作的主要贡献是:

We discuss many of the issues that affect performance in conventional stacks, even though they use APIs aimed at high performance such as sendfile() and recvmmsg().

我们讨论了许多影响传统堆栈性能的问题,尽管它们使用旨在实现高性能的API,如sendfile()和recvmmsg()。

We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we demonstrate that it is possible to utilize both conventional and specialized stacks in a single system. This allows us to deploy specialization selectively, optimizing networking while continuing to utilize generic OS components such as filesystems without disruption.

 

我们描述了在通用操作系统之上构建的多个模块化、高度专用化、面向特定应用的堆栈的设计和实现,同时避免了这些陷阱。 与以前的工作不同,我们证明有可能在单个系统中同时使用常规堆栈和专用堆栈。 这使我们能够有选择地部署专用化,优化网络部分,同时不受干扰地继续使用文件系统等通用操作系统组件。

We demonstrate that specialized network stacks designed for aggressive cross-layer optimizations create opportunities for new and at times counter-intuitive hardware-sensitive optimizations. For example, we find that violating the long-held tenet of data-copy minimization can increase DMA performance for certain workloads on recent CPUs.

我们展示专为积极的跨层优化设计的专用的网络堆栈为新的和偶尔反直觉的硬件敏感的优化创造机会。 例如,我们发现违反数据拷贝尽量小的原则可以提高近代CPU上某些工作负载的DMA性能。

We present hardware-grounded performance analyses of our specialized network stacks side-by-side with highly optimized conventional network stacks. We evaluate our optimizations over multiple generations of hardware, suggesting portability despite rapid hardware evolution.

我们提供基于硬件的性能分析,将我们的专用网络堆栈与高度优化的常规网络堆栈并排比较。 我们在多代硬件上评估我们的优化,表明尽管硬件快速演进,这些优化仍具有可移植性。

We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous network stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and DMA engines, yielding as much as 2–10× conventional stack performance.

我们探索将同步网络堆栈与异步应用程序结构相结合的潜力,这与支持同步应用程序的常规异步网络堆栈形成鲜明对比。 这种方法优化了CPU和DMA引擎的缓存利用率,带来高达常规堆栈2-10倍的性能。

2. SPECIAL-PURPOSE ARCHITECTURE

What is the minimum amount of work that a web server can perform to serve static content at high speed? It must implement a MAC protocol, IP, TCP (including congestion control), and HTTP.

However, their implementations do not need to conform to the conventional socket model, split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same functions are repeated again and again. We wish to explore just how far it is possible to go to improve performance. In particular, we seek to answer the following questions:

Web服务器可以执行的高速服务静态内容的最少工作量是多少? 它必须实现MAC协议(ps:ARP),IP,TCP(包括拥塞控制)和HTTP。

然而,它们的实现不需要遵循常规的套接字模型、在用户空间和内核之间切分,甚至不需要实现诸如动态TCP分段这样的特性。 对于向大量客户端提供相同静态内容(例如,Facebook徽标或GMail JavaScript)的web服务器来说,基本上是一次又一次地重复相同的工作。 我们希望探索性能到底可以提升到什么程度。 特别是,我们寻求回答以下问题:

Conventional network stacks support zero copy for OS-maintained data – e.g., filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, in which received packet buffers are passed from the NIC all the way to the application, and application packets to be sent are DMAed to the NIC for transmission without even the headers being copied?

传统的网络栈支持对操作系统维护的数据(例如缓冲区高速缓存中的文件系统块)进行零拷贝,但不支持应用程序提供的HTTP报头或TCP包头。 我们能否将零拷贝概念推向其逻辑极端:接收到的数据包缓冲区从NIC一直传递到应用程序,而要发送的应用程序数据包直接DMA到NIC进行传输,甚至连包头都不复制?(ps:接收到的包直接修改成回复并发送)

Conventional stacks make extensive use of queuing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of substantially increased cache footprint and latency. Can we adopt a bufferless event model that reimposes synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as available space in the transmit descriptor ring, to prevent buffer bloat and reduce wasted work constructing packets that will only be dropped?

 

传统堆栈广泛使用排队和缓冲来减少上下文切换并保持CPU和NIC忙碌,其代价是显著增加的高速缓存占用和延迟。 我们能否采用无缓冲的事件模型,重新引入同步性,并避免超过缓存大小的大队列? 我们能否暴露链路层缓冲区信息(例如传输描述符环中的可用空间),以防止缓冲区膨胀,并减少构建最终只会被丢弃的数据包所浪费的工作?

 

Conventional stacks amortize expenses internally, but cannot amortize repetitive costs spanning application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO) but they cannot amortize the cost of repeated TCP segmentation of the same data transmitted multiple times. Can we design a network-stack API that allows cross-layer amortizations to be accomplished such that after the first client is served, no work is ever repeated when serving subsequent clients?

 

传统堆栈在内部摊销开销,但不能摊销跨应用层和网络层的重复成本。 例如,它们使用大接收卸载(LRO)来摊销TCP连接查找,但是不能摊销对多次传输的相同数据重复进行TCP分段(ps: 多次对同一份数据重复拆小包)的成本。 我们能否设计一个网络堆栈API,允许完成跨层摊销,使得在服务完第一个客户端之后,服务后续客户端时不再重复任何工作?
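To make the pre-segmentation idea concrete, here is a hedged sketch of what "segmenting once, at startup" could look like: a file is sliced into MSS-sized chunks, each stored behind a template Ethernet/IP/TCP header, so per-client sends only patch a few header fields later. The struct layout, the MSS constant and the commented-out build_headers() helper are illustrative assumptions, not the paper's API.

```c
/* Sketch of pre-segmentation (illustrative, not the paper's code): at
 * startup, slice a file into MSS-sized TCP segments with prebuilt
 * Ethernet/IP/TCP headers so no per-client segmentation is repeated. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MSS 1460

struct prebuilt_pkt {
    uint8_t  frame[14 + 20 + 20 + MSS]; /* Ethernet + IP + TCP + payload */
    uint16_t payload_len;
    uint32_t rel_seq;                   /* payload offset within the file */
};

struct prebuilt_pkt *presegment(const uint8_t *file, size_t len, size_t *npkts)
{
    size_t n = (len + MSS - 1) / MSS;
    struct prebuilt_pkt *pkts = calloc(n, sizeof(*pkts));
    if (pkts == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        size_t off = i * MSS;
        size_t chunk = (len - off > MSS) ? MSS : len - off;
        /* build_headers() stands in for filling template Ethernet/IP/TCP
         * headers and a partial checksum; per-connection fields (addresses,
         * ports, sequence numbers) are patched at send time. */
        /* build_headers(pkts[i].frame, chunk, off); */
        memcpy(pkts[i].frame + 54, file + off, chunk);
        pkts[i].payload_len = (uint16_t)chunk;
        pkts[i].rel_seq = (uint32_t)off;
    }
    *npkts = n;
    return pkts;
}
```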

 

Conventional stacks embed the majority of network code in the kernel to avoid the cost of domain transitions, limiting two-way communication flow through the stack. Can we make heavy use of batching to allow device drivers to remain in the kernel while colocating stack code with the application and avoiding significant latency overhead?

 

传统堆栈将大部分网络代码嵌入内核中,以避免域切换的成本,这限制了通过堆栈的双向通信。 我们能否大量使用批处理,使设备驱动程序保留在内核中,同时将协议栈代码与应用程序放在一起(colocate,即运行在同一个用户进程中),并避免显著的延迟开销?

 

Can we avoid any data-structure locking, and even cache-line contention, when dealing with multi-core applications that do not require it?

 

在处理不需要锁的多核应用程序时,我们能否避免任何数据结构锁定,甚至避免高速缓存行争用?

 

Finally, while performing all the above, is there a suitable programming abstraction that allows these components to be reused for other applications that may benefit from server specialization?

 

最后,在执行上述所有操作时,是否有合适的编程抽象,允许这些组件重用于可能受益于服务器专业化的其他应用程序?

 

2.1 Network-stack Modularization

Although monolithic kernels are the de facto standard for networked systems, concerns with robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:

网络堆栈模块化

虽然宏内核(monolithic kernel)是网络系统事实上的标准,但是对健壮性和灵活性的关注继续推动对类似微内核方法的探索。 Sandstorm和Namestorm都具有几个类似微内核的特性:

 

Rapid deployment & reusability: Our prototype stack is highly modular, and synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service that exposes a well-defined API. Our specialized network stacks are built by combining four basic components:

 

快速部署和可重用性:我们的原型栈是高度模块化的,并从下往上使用传统的动态库作为构建块(组件)来构建一个专用系统。 每个组件对应于公开明确定义的API的独立服务。 我们的专业网络堆栈是由四个基本组件组合而成:

 

The netmap I/O (libnmio) library that abstracts traditional data-movement and event-notification primitives needed by higher levels of the stack.

 

netmap I/O(libnmio)库,抽象了堆栈更高层所需的传统数据移动和事件通知原语。

 

libeth component, a lightweight Ethernet-layer implementation.

 

libeth组件,轻量级以太网层实现。

 

libtcpip that implements our optimized TCP/IP layer.

 

libtcpip实现我们优化的TCP / IP层。

 

libudpip that implements a UDP/IP layer.

 

libudpip实现一个UDP / IP层。

 

Figure 1 depicts how some of these components are used with a simple application layer to form Sandstorm, the optimized web server.

Splitting functionality into reusable components does not require us to sacrifice the benefits of exploiting cross-layer knowledge to optimize performance, as memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.

Developer-friendly: Despite seeking inspiration from microkernel design, our approach maintains most of the benefits of conventional monolithic systems:

 

图1描述了这些组件如何与简单的应用层一起使用来形成Sandstorm,优化的web服务器。

将功能分解为可重用组件不需要我们牺牲利用跨层知识来优化性能的优势,因为内存和控制流可以轻松跨越API边界。 例如,Sandstorm直接与libnmio交互以预加载并将segments 推入相应的包缓冲池。 这保留了以服务为中心的方法。

开发者友好:尽管从微内核设计中获得灵感,我们的方法保持了传统宏内核系统的大部分优势:

 

Debugging is at least as easy (if not easier) compared to conventional systems, as application-specific, performancecentric code shifts from the kernel to more accessible userspace.

 

与传统系统相比,调试至少同样容易(如果不是更容易的话),因为特定于应用、以性能为中心的代码从内核转移到了更易访问的用户空间。

 

Our approach integrates well with the general-purpose operating systems: rewriting basic components such as device drivers or filesystems is not required. We also have direct access to conventional debugging, tracing, and profiling tools, and can also use the conventional network stack for remote access (e.g., via SSH).

 

我们的方法与通用操作系统完美集成:不需要重写基本组件,如设备驱动程序或文件系统。 我们还可以直接访问常规调试,跟踪和分析工具,并且还可以使用常规网络栈来远程访问(例如,通过SSH)。

 

Instrumentation in Sandstorm is a simple and straightforward task that allows us to explore potential bottlenecks as well as necessary and sufficient costs in network processing across application and stack. In addition, off-the-shelf performance monitoring and profiling tools “just work”, and a synchronous design makes them easier to use.

 

在Sandstorm中加入测量代码(instrumentation)是一件简单直接的事情,使我们能够探索潜在的瓶颈,以及应用程序和堆栈的网络处理中必要且足够的成本。 此外,现成的性能监控和分析工具可以直接使用(“just work”),同步的设计也使它们更容易使用。

 

2.2 Sandstorm web server design

Rizzo’s netmap framework provides a general-purpose API that allows received packets to be mapped directly to userspace, and packets to be transmitted to be sent directly from userspace to the NIC’s DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but one that handles TCP streams.

 

Sandstorm  web服务器的设计

Rizzo的netmap框架提供了通用API,允许接收的数据包直接映射到用户空间,要发送的数据包将直接从用户空间发送到NIC的DMA环。 结合批处理以减少系统调用,这提供了一个高性能框架,用于构建数据包处理应用程序。 然而,Web服务器通常不被认为是包处理应用,而是处理TCP流的应用。

 

To serve a static file, we load it into memory, and a priori generate all the packets that will be sent, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP-protocol control block (TCB) to keep track of the connection's state, but the packets to be sent have already been created for each file on the server.

 

要提供一个静态文件,我们将它加载到内存,并预先生成所有要发送的数据包,包括TCP、IP和链路层头。 当对该文件的HTTP请求到达时,服务器必须分配一个TCP协议控制块(TCB)来跟踪连接的状态,但要发送的数据包已经为服务器上的每个文件预先创建好了。

 

The majority of the work is performed during inbound TCP ACK processing. The IP header is checked, and if it is acceptable, a hash table is used to locate the TCB. The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send, and if permitted by the congestion and receive windows, subsequent packets. To send these packets, the destination address and port must be rewritten, and the TCP and IP checksums incrementally updated. The packet can then be directly fetched by the NIC using netmap. All reads of the ACK header and modifications to the transmitted packets are performed in a single pass, ensuring that both the headers and the TCB remain in the CPU’s L1 cache.

 

大多数工作是在入站TCP ACK处理期间完成的。 先检查IP报头,如果合法,则使用哈希表来定位TCB(ps: 让这个TCB来处理相同连接的包)。 ACK号相对于连接起始处的偏移量用于定位下一个要发送的预打包数据包,并且在拥塞窗口和接收窗口允许的情况下,定位后续的数据包。 要发送这些数据包,必须重写目标地址和端口,并增量更新TCP和IP校验和。 然后,NIC可以通过netmap直接取走该数据包。 对ACK报头的所有读取和对待发送数据包的修改都在一次遍历中完成,确保报头和TCB都保留在CPU的L1高速缓存中。
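The "incrementally updated" checksums mentioned above can be done with the standard RFC 1624 one's-complement trick; the helper below is a generic sketch of that arithmetic (it is not taken from Sandstorm's source). Patching the checksum per changed 16-bit field is what lets the header rewrite stay a single cache-resident pass.

```c
/* Incremental Internet checksum update (RFC 1624, eqn. 3): when a 16-bit
 * field in a prepackaged header changes from old_val to new_val, the
 * checksum can be patched without re-summing the whole packet. */
#include <stdint.h>

static uint16_t cksum_update16(uint16_t cksum, uint16_t old_val, uint16_t new_val)
{
    /* HC' = ~(~HC + ~m + m'), all arithmetic in ones' complement */
    uint32_t sum = (uint16_t)~cksum;
    sum += (uint16_t)~old_val;
    sum += new_val;
    /* fold carries back into the low 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```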

 

Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. However, under high load, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first client. The same packet buffer cannot be in the TX ring twice, with different destination address and port. This presents us with two design options:

 

一旦数据包已经被DMA到NIC,该数据包缓冲区就被归还给Sandstorm,准备再次被增量修改并发送给不同的客户端。 然而,在高负载下,同一个数据包可能在尚未发送完给第一个客户端之前,就需要为第二个客户端再次排入TX环。 同一个数据包缓冲区不能以不同的目的地址和端口在TX环中出现两次。 这给我们带来了两个设计选项:

 

We can maintain more than one copy of each packet in memory to cope with this eventuality. The extra copy could be created at startup, but a more efficient solution would create extra copies on demand whenever a high-water mark is reached, and then retained for future use.

 

我们可以在内存中保存每个数据包的多个副本,以应对这种可能性。 可以在启动时创建额外的副本,但是更高效的解决方案可以在达到高水位标记时根据需要创建额外副本,然后保留以供将来使用。

 

We can maintain only one long-term copy of each packet,creating ephemeral copies each time it needs to be sent

 

我们给每个包长期维护一个副本。在每次需要时创建临时拷贝

 

We call the former a pre-copy stack (it is an extreme form of zero-copy stack because in the steady state it never copies, but differs from the common use of the term “zero copy”), and the latter a memcpy stack. A pre-copy stack performs less per-packet work than a memcpy stack, but requires more memory; because of this, it has the potential to thrash the CPU's L3 cache. With the memcpy stack, it is more likely for the original version of a packet to be in the L3 cache, but more work is done. We will evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses in modern processors.

 

我们称前者为预拷贝(pre-copy)堆栈(它是零拷贝堆栈的一种极端形式,因为在稳定状态下它从不复制,但不同于通常所说的“零拷贝”),称后者为memcpy堆栈。 预拷贝堆栈对每个数据包执行的工作比memcpy堆栈少,但需要更多的内存;因此,它有可能冲刷(thrash)CPU的L3缓存。 使用memcpy堆栈时,数据包的原始版本更可能位于L3缓存中,但要做更多的工作。 我们将评估这两种方法,因为在现代处理器上,CPU周期与缓存未命中之间如何权衡远非显而易见。
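The two designs differ only in how a transmit buffer for a segment is obtained. The sketch below restates that choice with illustrative types (struct seg_copy and its fields are assumptions, not the paper's data structures): the pre-copy path searches for an idle replica, while the memcpy path always clones the single master copy.

```c
/* Illustrative sketch of the pre-copy vs. memcpy choice. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct seg_copy {
    unsigned char *buf;     /* prepackaged frame (headers + payload) */
    size_t         len;
    bool           in_tx;   /* currently referenced by the TX ring? */
};

/* Pre-copy: reuse an idle replica if one exists (no data copy). */
static struct seg_copy *precopy_get(struct seg_copy *copies, size_t ncopies)
{
    for (size_t i = 0; i < ncopies; i++)
        if (!copies[i].in_tx)
            return &copies[i];
    return NULL;            /* caller creates one more replica on demand */
}

/* Memcpy: always clone the master frame into a free TX buffer. */
static void memcpy_get(const struct seg_copy *master, unsigned char *txbuf)
{
    memcpy(txbuf, master->buf, master->len);
}
```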

 

Figure 2 illustrates tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy (but unsaturated). On the one hand, a batched design measurably increases TCP roundtrip time with a relatively idle CPU. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load, the benefits are pronounced; at saturation, the effect is even more significant.

 

图2通过在忙碌(但未饱和)的nginx/Linux和预拷贝Sandstorm服务器上采集的跟踪说明了这些权衡。 一方面,在CPU相对空闲时,批处理的设计会明显增加TCP往返时间。 另一方面,Sandstorm通过更高效的架构摊销或消除了每个请求处理中的大部分工作。 在轻负载下,收益已经很明显;在饱和时,效果甚至更显著。

 

Although most work is synchronous within the ACK processing code path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it is accessible from userspace at the cost of a single CPU instruction (on modern hardware). The TCP slow timer routine is invoked periodically (every ~500ms) and traverses the list of active TCBs: on RTO expiration, the congestion window and slow-start threshold are adjusted accordingly, and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT state for longer than 2*MSL. There is no buffering whatsoever required for retransmissions: we identify the segment that needs to be retransmitted using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet and adjust its headers accordingly, as with regular transmissions. Sandstorm currently implements TCP Reno for congestion control.

 

虽然大多数工作在ACK处理代码路径内同步完成,但是TCP的某些操作仍然需要定时器。 Sandstorm的定时器通过轮询时间戳计数器(TSC)来调度:尽管不如其他时钟源精确,但在现代硬件上只需一条CPU指令的开销即可从用户空间访问。TCP慢速定时器例程被周期性地调用(每约500ms一次)并遍历活动TCB的列表:在RTO到期时,相应地调整拥塞窗口和慢启动阈值,并且重传任何未确认的segments。同一例程还释放已处于TIME_WAIT状态超过2*MSL的TCB。重传不需要任何缓冲:我们使用最旧的未确认序号作为偏移来识别需要重传的段,取出下一个可用的预打包数据包,并像常规发送一样相应地调整其头部。 Sandstorm目前实现TCP Reno拥塞控制。
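A rough sketch of the TSC-polled slow timer described here is shown below. The calibration of the TSC frequency and the tcp_slowtimo_all() routine are assumed to exist elsewhere; only the polling pattern is illustrated, not Sandstorm's actual timer code.

```c
/* Sketch: driving a coarse TCP slow timer by polling the Time Stamp
 * Counter from userspace instead of arming kernel timers. The TSC
 * frequency ("tsc_hz") is assumed to be calibrated at startup. */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

static uint64_t tsc_hz;          /* cycles per second, calibrated elsewhere */
static uint64_t next_slow_tick;  /* TSC value of the next ~500ms tick */

static void maybe_run_slow_timer(void)
{
    uint64_t now = __rdtsc();
    if (now < next_slow_tick)
        return;
    next_slow_tick = now + tsc_hz / 2;   /* ~500ms from now */
    /* walk the list of active TCBs: handle RTO expiry, retransmit
     * unacknowledged segments, release TIME_WAIT TCBs older than 2*MSL */
    /* tcp_slowtimo_all(); */
}
```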

 

2.3 The Namestorm DNS server

The same principles applied in the Sandstorm web server, also apply to a wide range of servers returning the same content to multiple users. Authoritative DNS servers are often targets of DDoS attacks – they represent a potential single point of failure, and because DNS traditionally uses UDP, lacks TCP's three way handshake to protect against attackers using spoofed IP addresses. Thus, high performance DNS servers are of significant interest.

 

在Sandstorm Web服务器中应用的相同原理,也适用于向多个用户返回相同内容的一大类服务器。 权威DNS服务器通常是DDoS攻击的目标:它们是一个潜在的单点故障,并且由于DNS传统上使用UDP,缺少TCP的三次握手来防范使用伪造IP地址的攻击者。 因此,高性能DNS服务器备受关注。

 

Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and store response data in memory. Is there still an advantage running a specialized stack?

 

与TCP不同,常规UDP堆栈实际上相当轻量级,DNS服务器已经预处理zone并将响应数据存储在内存中。 运行一个专用的堆栈还有优势吗?

 

Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the name and record type being requested to find the response, and sends that data. We can preprocess the responses so that they are already stored as a prepackaged UDP packet. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing is performed in one pass, allowing processing of DNS response headers to be performed from the L1 cache. As with Sandstorm, we can use pre-copy or memcpy approaches so that more than one response for the same name can be placed in the DMA ring at a time.

 

大多数DNS请求处理很简单。 当请求到达时,服务器执行合法性检查,对所请求的名称和记录类型的拼接进行哈希以找到响应,然后发送该数据。 我们可以预处理这些响应,使它们已经存储为预打包的UDP数据包。 与HTTP一样,必须重写目标地址和端口,必须更新标识符,并且必须增量更新UDP和IP校验和。在初始哈希之后,所有剩余的处理都在一次遍历中完成,使得DNS响应头的处理可以在L1高速缓存中进行。 与Sandstorm一样,我们可以使用预拷贝或memcpy方法,以便同一名称的多个响应可以同时放置在DMA环中。
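Concretely, patching a prepackaged DNS response touches only a handful of 16-bit fields, so the checksums can be fixed up incrementally rather than recomputed. The sketch below assumes a plain Ethernet + IPv4 + UDP frame with no IP options and ignores the UDP 0x0000/0xFFFF checksum corner case; the offsets and helper are illustrative, not Namestorm's actual code.

```c
/* Illustrative sketch (not Namestorm's code): patch a prepackaged DNS
 * response for a new client. Only the destination IP, destination port,
 * DNS ID and the checksums change. */
#include <stdint.h>
#include <string.h>

#define IP_OFF   14            /* IPv4 header offset within the frame  */
#define UDP_OFF  (14 + 20)     /* UDP header offset                    */
#define DNS_OFF  (14 + 20 + 8) /* DNS header (ID is its first 2 bytes) */

/* RFC 1624 incremental checksum update, as in the earlier sketch. */
static uint16_t cksum_update16(uint16_t cksum, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~cksum;
    sum += (uint16_t)~old_val;
    sum += new_val;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

static void patch_response(uint8_t *frame, const uint8_t dst_ip[4],
                           uint16_t dst_port_be, uint16_t dns_id_be)
{
    uint16_t *ipsum  = (uint16_t *)(frame + IP_OFF + 10);
    uint16_t *dip    = (uint16_t *)(frame + IP_OFF + 16);
    uint16_t *udpsum = (uint16_t *)(frame + UDP_OFF + 6);
    uint16_t *dport  = (uint16_t *)(frame + UDP_OFF + 2);
    uint16_t *dnsid  = (uint16_t *)(frame + DNS_OFF);

    /* The destination IP is covered by the IPv4 header checksum and,
     * via the pseudo-header, by the UDP checksum as well. */
    for (int i = 0; i < 2; i++) {
        uint16_t new_half;
        memcpy(&new_half, dst_ip + 2 * i, 2);
        *ipsum = cksum_update16(*ipsum, dip[i], new_half);
        if (*udpsum != 0)
            *udpsum = cksum_update16(*udpsum, dip[i], new_half);
        dip[i] = new_half;
    }
    /* Destination port and DNS ID are covered by the UDP checksum only. */
    if (*udpsum != 0) {
        *udpsum = cksum_update16(*udpsum, *dport, dst_port_be);
        *udpsum = cksum_update16(*udpsum, *dnsid, dns_id_be);
    }
    *dport = dst_port_be;
    *dnsid = dns_id_be;
}
```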

 

Our specialized userspace DNS server stack is composed of three reusable components, libnmio, libeth, libudpip, and a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD’s netmap API, implementing the entire stack in userspace, and uses netmap’s batching to amortize system call overhead. libnmio and libeth are the same as used by Sandstorm, whereas libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.

 

我们的专用用户空间DNS服务器堆栈由三个可重用的组件libnmio,libeth,libudpip和DNS特定的应用程序层组成。 与Sandstorm一样,Namestorm使用FreeBSD的netmap API,在用户空间实现整个堆栈,并使用netmap的批处理来摊销系统调用开销。 libnmio和libeth与Sandstorm使用的相同,而libudpip包含与IP层紧密集成的UDP特定代码。 Namestorm是一个权威的名称服务器,因此它不需要处理递归查找。

 

Namestorm preprocesses the zone file upon startup, creating DNS response packets for all the entries in the zone, including the answer section and any glue records needed. In addition to type-specific queries for A, NS,MX and similar records, DNS also allows queries for ANY. A full implementation would need to create additional response packets to satisfy these queries; our implementation does not yet do so, but the only effect this would have is to increase the overall memory footprint. In practice, ANY requests prove comparatively rare.

 

Namestorm在启动时预处理zone文件,为zone中的所有条目创建DNS响应数据包,包括answer部分和所需的任何glue记录。 除了针对A、NS、MX和类似记录的特定类型查询之外,DNS还允许ANY查询。 完整的实现将需要创建额外的响应数据包来满足这些查询;我们的实现还没有这样做,但其唯一影响是增加总体内存占用。 在实践中,ANY请求相对罕见。

 

Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:

 

Namestorm使用hash表索引预先打包的DNS响应数据包。 有两种方法可以做到这一点:

 


 Index by concatenation of request type (e.g., A, NS, etc) and fully-qualified domain name (FQDN); for example “www.example.com”.

 

通过请求类型(例如,A,NS等)和完全限定域名(FQDN)的合并索引; 例如“www.example.com”.

 

Index by concatenation of request type and the wire-format FQDN as this appears in an actual query; for example,“[3]www[7]example[3]com[0]” where [3] is a single byte containing the numeric value 3.

 

通过请求类型和实际查询中出现的wire格式FQDN的拼接来索引;例如“[3]www[7]example[3]com[0]”,其中[3]是包含数值3的单个字节。(ps: dns包中域名使用3www5baidu3com格式,前面的数字表示后面标签的长度,方便解析)

 

Using the wire request format is obviously faster, but DNS permits compression of names. Compression is common in DNS answers, where the same domain name occurs more than once, but proves rare in requests. If we implement wire-format hash keys, we must first perform a check for compression; these requests are decompressed and then reencoded to uncompressed wire-format for hashing.The choice is therefore between optimizing for the common case, using wire-format hash keys, or optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust to a DDoS attack by an attacker taking advantage of compression. We evaluate both approaches, as they illustrate different performance tradeoffs.

 

使用wire请求格式显然更快,但DNS允许对名称进行压缩。 压缩在DNS应答中很常见(其中相同的域名会出现多次),但在请求中很罕见。 如果我们使用wire格式的哈希键,就必须首先检查是否存在压缩;这些请求被解压缩,然后重新编码为未压缩的wire格式用于哈希。因此需要选择:要么针对常见情况优化,使用wire格式哈希键;要么针对最坏情况优化(假设压缩很常见),使用FQDN哈希键。 前者更快,但后者在面对利用压缩的攻击者发起的DDoS攻击时更健壮。 我们评估这两种方法,因为它们体现了不同的性能权衡。
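To make the two hash-key formats concrete, the sketch below converts a wire-format QNAME into a lowercase dotted FQDN suitable for the FQDN-keyed table; compression pointers are simply rejected here, whereas a full implementation would first decompress the name as described above. The function is an illustration under those assumptions, not Namestorm's parser.

```c
/* Sketch: turn a wire-format QNAME ("[3]www[7]example[3]com[0]") into a
 * lowercase dotted FQDN ("www.example.com") for FQDN-keyed hashing.
 * Compression pointers (top two bits of a label length set) are rejected. */
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the FQDN length, or -1 on malformed or compressed names. */
static int wire_to_fqdn(const uint8_t *wire, size_t wirelen,
                        char *out, size_t outlen)
{
    size_t i = 0, o = 0;

    while (i < wirelen && wire[i] != 0) {
        uint8_t lab = wire[i++];
        if ((lab & 0xC0) != 0)              /* compression pointer */
            return -1;
        if (i + lab > wirelen || o + lab + 2 > outlen)
            return -1;
        if (o > 0)
            out[o++] = '.';
        for (uint8_t j = 0; j < lab; j++)
            out[o++] = (char)tolower(wire[i++]);
    }
    if (i >= wirelen)                       /* missing root label */
        return -1;
    out[o] = '\0';
    return (int)o;
}
```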

 

Our implementation does not currently handle referrals, so it can handle only zones for which it is authoritative for all the sub-zones.It could not, for example, handle the .com zone, because it would receive queries for www.example.com, but only have hash table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were to be used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.

 

我们的实现目前不处理委派(referral),因此它只能处理对其所有子区域都具有权威的区域。例如,它无法处理.com区域,因为它会收到对www.example.com的查询,但只有example.com的哈希表条目。 在转换为FQDN的过程中顺便截断哈希键是很容易做到的,所以如果Namestorm要用于像.com这样的域,FQDN版本的哈希将是一个合理的方法。

 

 

Outline of the main Sandstorm event loop

1. Call RX poll to receive a batch of received packets that have been stored in the NIC’s RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet and IP input sanity checks.
4. Locate the TCB for the connection.
5. Update the acknowledged sequence numbers in TCB; update receive window and congestion window.
6. For each new TCP data packet that can now be sent, or each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (or clone one if needed).
8. Correct the destination IP address, destination port, sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC’s TX ring.
10. Check if dt has passed since last TX poll. If it has, call TX poll to send all queued packets.

 

Sandstorm 主事件循环概述

1.调用RX轮询,接收已存放在NIC的RX环中的一批数据包;如果没有则阻塞等待。

2.处理每个ACK数据包:

3.执行链路层和IP层完整性检查。

4.找到处理该连接的TCB。

5.更新TCB中已确认的序列号; 更新接收窗口和拥塞窗口。

6.对于可以立即发送的每个新TCP数据包,或每个需要重传的丢失数据包:

7.查找TCP数据包的空闲拷贝(如果需要,请clone一个)。

8.更正目标IP地址,目标端口,序列号,并逐步更新TCP校验和。

9.将数据包添加到NIC的TX环。

10.检查自上次TX轮询以来是否已经过了dt。 如果是,调用TX poll发送所有排队的数据包。
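For readers unfamiliar with netmap, the fragment below sketches what step 1 and the per-batch walk look like against the public netmap userspace API (net/netmap_user.h). The interface name "netmap:ix0" and the empty process_frame() stub are assumptions; the TCP/HTTP handling of steps 3-9 and the dt-driven TX poll are deliberately left out.

```c
/* Minimal sketch of the receive side of the main loop using the netmap
 * userspace API: poll() blocks until a batch of frames is available, then
 * every slot in the RX rings is processed to completion. */
#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>

static void process_frame(const char *buf, uint16_t len) { (void)buf; (void)len; }

int main(void)
{
    struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
    if (d == NULL)
        return 1;

    struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
    for (;;) {
        poll(&pfd, 1, -1);                    /* block until packets arrive */

        for (int r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, r);
            while (!nm_ring_empty(ring)) {    /* walk the whole batch */
                struct netmap_slot *slot = &ring->slot[ring->cur];
                process_frame(NETMAP_BUF(ring, slot->buf_idx), slot->len);
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
        /* TX side omitted: queued responses are flushed by a periodic
         * TX poll (the dt interval described in Section 2.4). */
    }
}
```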

 

2.4 Main event loop

To understand how the pieces fit together and the nature of interaction between Sandstorm, Namestorm, and netmap, we consider the timeline for processing ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline, but also take place. However, most work is performed in the ACK processing code.

 

2.4主事件循环

为了理解这些部分是如何组合在一起的,以及Sandstorm、Namestorm和netmap之间交互的性质,我们更详细地考虑处理ACK数据包的时间线。 图3总结了Sandstorm的主循环。 SYN/FIN处理、HTTP和定时器在此大纲中被省略,但它们同样会执行。 然而,大多数工作是在ACK处理代码中完成的。

 

One important consequence of this architecture is that the NIC’s TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in the same way as normal data packets. As Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, data structures are also lock-free.

 

这种架构的一个重要结果是,NIC的TX ring用作唯一的输出队列,取代了传统的套接字缓冲区和软件网卡队列。 这是可能的,因为重传的TCP包以与正常数据包相同的方式生成。 由于Sandstorm足够快,可以在一个核上使用单个线程来饱和两个10Gb / s网卡,数据结构也是无锁的

 

When the workload is heavy enough to saturate the CPU, the system-call rate decreases. The receive batch size increases as calls to RX poll become less frequent, improving efficiency at the expense of increased latency. Under extreme load, the RX ring will fill, dropping packets. At this point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.

 

当工作负载大到足以使CPU饱和时,系统调用(ps: rx tx的poll轮询)速率会降低。 随着RX轮询的调用变得不那么频繁,接收批次的大小会增大,以增加的延迟为代价提高效率。 在极端负载下,RX环会被填满,开始丢包。 此时系统已经饱和,与任何Web服务器一样,它必须通过丢弃部分新到的SYN来限制打开连接的数量。

 

Under heavier load, the TX-poll system call happens in the RX handler. In our current design, dt, the interval between calls to TX poll in the steady state, is a constant set to 80us. The system-call rate under extreme load could likely be decreased by further increasing dt, but as the pre-copy version of Sandstorm can easily saturate all six 10Gb/s NICs in our systems for all file sizes, we have thus far not needed to examine this. Under lighter load, incoming packets might arrive too rarely to provide acceptable latency for transmitted packets; a 5ms timer will trigger transmission of straggling packets in the NIC’s TX ring.

 

在较重的负载下,TX-poll系统调用发生在RX处理程序中。 在我们当前的设计中,dt(稳定状态下两次TX poll调用之间的间隔)是一个设置为80us的常数。 在极端负载下,通过进一步增加dt或许可以降低系统调用率,但是由于预拷贝版本的Sandstorm可以轻松地在所有文件大小下饱和我们系统中的全部六个10Gb/s网卡,我们迄今为止还不需要研究这一点。 在较轻负载下,传入的数据包可能到达得太稀疏,无法为待发送的数据包提供可接受的延迟;此时一个5ms的定时器会触发发送NIC TX环中滞留的数据包。

 

The difference between the pre-copy version and the memcpy version of Sandstorm is purely in step 7, where the memcpy version will simply clone the single original packet rather than search for an unused existing copy.

 

预拷贝版本和memcpy版本之间的差异纯粹是在步骤7中,其中memcpy版本将简单地克隆单个原始数据包,而不是搜索未使用的现有副本。

 

Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated Direct Memory Access (DMA) over PCIe to access DRAM through the processor’s Last-Level Cache (LLC). For network transmit, DDIO is able to pull data from the cache without a detour through system memory; likewise, for receive, DMA can place data in the processor cache. DDIO implements administrative limits on LLC utilization intended to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth.

 

当代英特尔服务器处理器支持直接数据I / O(DDIO)。 DDIO允许通过PCIe的NIC发起的DMA直接内存访问,通过处理器的最后级缓存(LLC)访问DRAM。 对于网络传输,DDIO能够从缓存中提取数据,而不必通过系统内存; 同样,对于接收,DMA可以将数据放置在处理器高速缓存中。 DDIO实现对LLC利用率的管理限制,旨在防止DMA频繁刷缓存。 此设计具有显着减少延迟和增加I / O带宽的潜力

 

Memcpy Sandstorm forces the payload of the copy to be in the CPU cache from which DDIO can DMA it to the NIC without needing to load it from memory again. With pre-copy, the CPU only touches the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially impacting performance. These interactions are subtle, and we will look at them in detail.

 

memcpy版本的Sandstorm强制使副本的有效载荷位于CPU缓存中,DDIO可以直接将其DMA到NIC,而无需再次从内存加载。 使用预拷贝时,CPU只接触数据包头,因此如果有效载荷不在CPU缓存中,DDIO必须去加载它,这可能会影响性能。 这些相互作用很微妙,我们将详细研究它们。(ps:照理减少拷贝、使用现成的数据更快,但是这里用了DDIO实现cpu缓存到网卡的直接传输,而预拷贝版本的现成数据不一定在cache中,反而还多了一次加载,关于这点后面还会讨论)

 

Namestorm follows the same basic outline, but is simpler as DNS is stateless: it does not need a TCB, and sends a single response packet to each request.

 

Namestorm遵循相同的基本概要,但是更简单,因为DNS是无状态的:它不需要TCB,并且向每个请求发送单个响应包。

 

2.5 API

As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of the APIs.

 

如上所述,我们所有的堆栈组件都提供了定义良好的API以提高可重用性。 表1列出了libnmio和libtcpip所暴露的部分API函数。 在本节中,我们描述这些API的一些最有趣的特性。

 

libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callback functions to receive raw incoming data as well as set timers for periodic events (e.g., the TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet to the transmission ring either by memory or zero copying and also implements an adaptive batching algorithm.

Since there is no socket layer, the application must directly interface with the network stack. With TCP as the transport layer, it acquires a TCB (TCP Control Block), binds it to a specific IPv4 address and port, and sets it to LISTEN state using API functions. The application must also register callback functions to accept connections, receive and process data from active connections, as well as act on successful delivery of sent data (e.g., to close the connection or send more data).

 

libnmio是最低层的组件:它处理与netmap的所有交互并抽象出主事件循环。 更高的层(例如libeth)注册回调函数来接收原始输入数据,并为周期性事件(例如TCP慢定时器)设置定时器。函数netmap_output()是主发送例程:它通过内存拷贝或零拷贝把数据包排入发送环,并且还实现了自适应批处理算法。

由于没有套接字层,应用程序必须直接与网络堆栈对接。 使用TCP作为传输层时,它获取一个TCB(TCP控制块),将其绑定到特定的IPv4地址和端口,并使用API函数将其设置为LISTEN状态。 应用程序还必须注册回调函数来接受连接、从活动连接接收和处理数据,以及在已发送数据成功送达时进行处理(例如,关闭连接或发送更多数据)。
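Since Table 1 is not reproduced in this post, the fragment below only suggests the shape of the callback-driven, socket-less model the text describes. Every identifier in it (tcb_alloc, tcb_bind_listen, tcb_register_callbacks, libnmio_run, the callback typedefs) is a hypothetical stand-in, not the actual API.

```c
/* Hypothetical sketch of the programming model only; none of these names
 * are taken from the paper's Table 1. */
#include <stdint.h>
#include <stddef.h>

struct tcb;                              /* opaque TCP control block */

typedef void (*accept_cb)(struct tcb *c);
typedef void (*recv_cb)(struct tcb *c, const uint8_t *data, size_t len);
typedef void (*sent_cb)(struct tcb *c, size_t acked_bytes);

/* The application supplies callbacks instead of calling
 * accept()/read()/write() on sockets. */
struct app_callbacks {
    accept_cb on_accept;      /* new connection reached ESTABLISHED */
    recv_cb   on_data;        /* in-order payload delivered by the TCP layer */
    sent_cb   on_delivered;   /* previously sent data was ACKed */
};

/* A web-server style setup might look roughly like this:
 *
 *   struct tcb *listener = tcb_alloc();
 *   tcb_bind_listen(listener, server_ipv4_addr, server_port);
 *   tcb_register_callbacks(listener, &my_callbacks);
 *   libnmio_run();   // hands control to the event loop
 */
```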

 

3. EVALUATION

To explore Sandstorm and Namestorm's performance and behavior, we evaluated using both older and more recent hardware. On older hardware, we employed Linux 3.6.7 and FreeBSD 9-STABLE. On newer hardware, we used Linux 3.12.5 and FreeBSD 10-STABLE. We ran Sandstorm and Namestorm on FreeBSD.

 

评估

为了探索Sandstorm和Namestorm的性能和行为,我们使用旧的和更新的硬件进行评估。在旧的硬件上,我们使用Linux 3.6.7和FreeBSD 9-STABLE。 在较新的硬件上,我们使用Linux 3.12.5和FreeBSD 10-STABLE。 我们在FreeBSD上运行Sandstorm和Namestorm。

 

For the old hardware, we use three systems: two clients and one server, connected via a 10GbE crossbar switch. All test systems are equipped with an Intel 82598EB dual port 10GbE NIC, 8GB RAM,and two quad-core 2.66 GHz Intel Xeon X5355 CPUs. In 2006,these were high-end servers. For the new hardware, we use seven systems; six clients and one server, all directly connected via dedicated 10GbE links. The server has three dual-port Intel 82599EB 10GbE NICs, 128GB RAM and a quad-core Intel Xeon E5-2643 CPU. In 2014, these are well-equipped contemporary servers.

 

对于旧硬件,我们使用三个系统:两个客户端和一个服务器,通过10GbE交换机连接。 所有测试系统都配备了一个Intel 82598EB双端口10GbE NIC,8GB RAM和两个四核2.66 GHz Intel Xeon X5355 CPU。 2006年,这些都是高端服务器。 对于新硬件,我们使用七个系统; 六个客户端和一个服务器,都通过专用的10GbE链路直接连接。 该服务器有三个双端口Intel 82599EB 10GbE NIC,128GB RAM和四核Intel Xeon E5-2643 CPU。 在2014年,这些是设备齐全的现代服务器。

 

The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, as with all recent Intel server processors,support Data Direct I/O (DDIO), so whether data to be sent is in the cache can have a significant impact on performance.


这些硬件代之间最有趣的改进是在存储器子系统中。 较老的Xeons有一个传统的架构,单个1,333MHz内存总线为两个CPU服务。 较新的机器(如最近的所有英特尔服务器处理器)都支持数据直接I / O(DDIO),因此要发送的数据是否在缓存中会对性能产生重大影响。

 

Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms; however, the reasons for this may differ. Experience [18] has shown that the older systems often bottleneck on memory latency, and as Sandstorm is not CPU-intensive, we would expect this to be the case. A zero-copy stack should thus be a big win. In addition, as cores contend for memory, we would expect that adding more cores does not help greatly.

 

我们的假设是,Sandstorm将在两个平台上明显快于nginx; 然而,原因可能不同。 经验[18]表明,较旧的系统通常会对内存延迟造成瓶颈,而且由于Sandstorm不是CPU密集型的,我们预期会出现这种情况。 零拷贝堆栈应该是一个大胜利。 此外,随着核争夺内存,我们预计添加更多核并不会有很大的帮助。

 

On the other hand, with DDIO, the new systems are less likely to bottleneck on memory. The concern, however, would be that DDIO could thrash at least part of the CPU cache. On these systems, we expect that adding more cores would help performance, but that in doing so, we may experience scalability bottlenecks such as lock contention in conventional stacks. Sandstorm’s lock-free stack can simply be replicated onto multiple 10GbE NICs, with one core per two NICs to scale performance. In addition, as load increases, the number of packets to be sent or received per system call will increase due to application-level batching. Thus, under heavy load, we would hope that the number of system calls per second to still be acceptable despite shifting almost all network-stack processing to userspace.

 

另一方面,使用DDIO,新系统不太可能在内存上造成瓶颈。 然而,关注的是,DDIO可能至少刷掉部分的CPU缓存。 在这些系统上,我们期望添加更多的核将有助于提高性能,但在这样做时,我们可能会遇到可伸缩性瓶颈,例如传统堆栈中的锁争用。 Sandstorm的无锁堆栈可以简单地用到多个10GbE NIC上,每两个NIC一个核心可以扩展性能。 此外,随着负载的增加,每个系统调用发送或接收的数据包数量将由于应用程序级别的批处理而增加。 因此,在重负载下尽管将几乎所有的网络栈处理转移到用户空间,我们希望每秒的系统调用的数量仍然是可以接受的

 

The question, of course, is how well do these design choices play out in practice?

当然,问题是这些设计选择在实践中表现得如何?

 

3.1 Experiment Design: Sandstorm

We evaluated the performance of Sandstorm through a set of experiments and compare our results against the nginx web server running on both FreeBSD and Linux. Nginx is a high-performance,low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness event notifications, it uses sendfile() to send HTTP payload directly from the kernel, and it asynchronously processes requests.

 

我们通过一组实验评估Sandstorm的性能,并将结果与在FreeBSD和Linux上运行的nginx Web服务器进行比较。 Nginx是一个高性能、低占用的web服务器,遵循非阻塞、事件驱动模型:它依赖诸如kqueue()之类的操作系统原语获取就绪事件通知,使用sendfile()直接从内核发送HTTP有效载荷,并异步处理请求。

 

Contemporary web pages are immensely content-rich, but they mainly consist of smaller web objects such as images and scripts. The distribution of requested object sizes for Yahoo! CDN, reveals that 90% of the content is smaller than 25KB [11]. The conventional network stack and web-server application perform well when delivering large files by utilizing OS primitives and NIC hardware features. Conversely, multiple simultaneous short-lived HTTP connections are considered a heavy workload that stresses the kerneluserspace interface and reveals performance bottlenecks: even with sendfile() to send the payload, the size of the transmitted data is not quite enough to compensate for the system cost.

 

当代网页内容极其丰富,但它们主要由较小的web对象(如图像和脚本)组成。 Yahoo! CDN所请求对象大小的分布表明,90%的内容小于25KB [11]。 在传送大文件时,传统的网络栈和web服务器应用通过利用OS原语和NIC硬件特性表现良好。 相反,大量并发的短生命周期HTTP连接被认为是一种沉重的工作负载,它给内核-用户空间接口带来压力并暴露性能瓶颈:即使使用sendfile()发送有效载荷,传输数据的大小也不足以抵消系统开销。

 

For all the benchmarks, we configured nginx to serve content from a RAM disk to eliminate disk-related I/O bottlenecks. Similarly,Sandstorm preloads the data to be sent and performs its pre-segmentation phase before the experiments begin. We use weighttp [9] to generate load with multiple concurrent clients. Each client generates a series of HTTP requests, with a new connection being initiated immediately after the previous one terminates. For each experiment we measure throughput, and we vary the size of the file served, exploring possible tradeoffs between throughput and system load. Finally, we run experiments with a realistic workload by using a trace of files with sizes that follow the distribution of requested HTTP objects of the Yahoo! CDN.

 

对于所有的基准测试,我们配置了nginx来从RAM磁盘提供内容,以消除磁盘相关的I / O瓶颈。 类似地,Sandstorm预加载要发送的数据,并在实验开始之前执行其预分割阶段。 我们使用weighttp [9]来生成多个并发客户端的负载。 每个客户端生成一系列HTTP请求,在前一个终止后立即启动新的连接。 对于每个实验,我们测量吞吐量,并且我们改变所服务的文件的大小,探索吞吐量和系统负载之间可能的折衷。 最后,我们使用跟踪文件的实际工作量进行实验,这些文件的大小遵循Yahoo! CDN所请求的HTTP对象的分布。

 

3.2 Sandstorm Results

First, we wish to explore how file size affects performance. Sandstorm is designed with small files in mind, and batching to reduce overheads, whereas the conventional sendfile() ought to be better for larger files.

 

首先,我们希望了解文件大小如何影响性能。Sandstorm的设计考虑的是小文件,并通过批处理来减少开销,而传统的sendfile()对较大的文件应该更有利。

 

Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm and nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Fig. 4a and 4d), Sandstorm outperforms nginx for smaller files by ~23–240%. For larger files, all three configurations saturate the link. Both conventional stacks are more CPU-hungry for the whole range of file sizes tested, despite potential advantages such as TSO on bulk transfers.

 

图4显示了性能随内容大小变化的情况,比较了在FreeBSD和Linux上运行的nginx与预拷贝Sandstorm。 使用单个10GbE NIC(图4a和4d)时,对于较小的文件,Sandstorm的性能比nginx高约23-240%。 对于较大的文件,三种配置都能打满链路。 尽管有批量传输时的TSO等潜在优势,在测试的整个文件大小范围内,两种常规堆栈都消耗更多CPU。

 

To scale to higher bandwidths, we added more 10GbE NICs and client machines. Figure 4b shows aggregate throughput with four 10GbE NICs. Sandstorm saturates all four NICs using just two CPU cores, but neither Linux nor FreeBSD can saturate the NICs with files smaller than 128KB, even though they use four CPU cores.

 

为了扩展到更高的带宽,我们增加了更多的10GbE网卡和客户端机器。 图4b显示了四个10GbE NIC的聚合吞吐量。 Sandstorm只用两个CPU核心就能饱和全部四个网卡,但即使使用四个CPU核心,Linux和FreeBSD在文件小于128KB时也无法饱和这些网卡。

 

As we add yet more NICs, shown in Figure 4c, the difference in performance gets larger for a wider range of file sizes. With 6×10GbE NICs Sandstorm gives between 10% and 10× more throughput than FreeBSD for file sizes in the range of 4–256KB. Linux fares worse, experiencing a performance drop (see Figure 4c) compared to FreeBSD with smaller file sizes and 5–6 NICs. Low CPU utilization is normally good, but here (Figures 4f, 5b), idle time is undesirable since the NICs are not yet saturated. We have not identified any single obvious cause for this degradation. Packet traces show the delay to occur between the connection being accepted and the response being sent. There is no single kernel lock being held for especially long, and although locking is not negligible, it does not dominate, either. The system suffers one soft page fault for every two connections on average, but no hard faults, so data is already in the disk buffer cache, and TCB recycling is enabled. This is an example of how hard it can be to find performance problems with conventional stacks. Interestingly, this was an application-specific behavior triggered only on Linux: in benchmarks we conducted with other web servers (e.g., lighttpd [3], OpenLiteSpeed [7]) we did not experience a similar performance collapse on Linux with more than four NICs. We have chosen, however, to present the nginx datasets as it offered the greatest overall scalability in both operating systems.

 

当我们添加更多的NIC时(如图4c所示),在更大范围的文件大小上性能差距变大。在4-256KB的文件大小范围内,使用6x10GbE NIC时,Sandstorm的吞吐量比FreeBSD高10%到10倍。Linux表现更差,与FreeBSD相比,在较小文件和5-6个NIC时出现性能下降(见图4c)。低CPU利用率通常是好事,但在这里(图4f、5b),空闲是不希望看到的,因为NIC尚未饱和。我们没有找到造成这种性能下降的任何单一明显原因。数据包跟踪显示延迟发生在连接被接受与响应被发送之间。没有哪个内核锁被持有特别长的时间,虽然锁开销不可忽略,但也不占主导地位。系统平均每两个连接出现一次软页错误,但没有硬错误,因此数据已经在磁盘缓冲区缓存中,并且启用了TCB回收。这是一个说明在常规堆栈上定位性能问题有多困难的例子。 有趣的是,这是仅在Linux上触发的应用程序特定行为:在我们用其他Web服务器(例如lighttpd [3]、OpenLiteSpeed [7])进行的基准测试中,我们没有在超过四个NIC的Linux上遇到类似的性能崩溃。然而,我们选择呈现nginx的数据集,因为它在两个操作系统中都提供了最大的整体可伸缩性。

 

It is clear that Sandstorm dramatically improves network performance when it serves small web objects, but somewhat surprisingly, it performs better for larger files too. For completeness, we evaluate Sandstorm using a realistic workload: following the distribution of requested HTTP object sizes of the Yahoo! CDN [11], we generated a trace of 1000 files ranging from a few KB up to ~20MB which were served from both Sandstorm and nginx. On the clients, we modified weighttp to benchmark the server by concurrently requesting files in a random order. Figures 5a and 5b highlight the achieved network throughput and the CPU utilization of the server as a function of the number of the network adapters. The network performance improvement is more than 2× while CPU utilization is reduced.

 

很明显,Sandstorm在提供小型Web对象时可以显著提高网络性能,但有些令人惊讶的是,它对较大文件的性能也更好。 为了完整性,我们使用现实的工作负载评估Sandstorm:按照Yahoo! CDN [11]所请求HTTP对象大小的分布,我们生成了一个包含1000个文件的trace,大小从几KB到约20MB,分别由Sandstorm和nginx提供服务。 在客户端上,我们修改weighttp,通过以随机顺序并发请求文件来对服务器进行基准测试。 图5a和5b给出了网络吞吐量和服务器CPU利用率随网络适配器数量变化的情况。 网络性能提升超过2倍,同时CPU利用率反而降低。

 

Finally, we evaluated whether Sandstorm handles high packet loss correctly. With 80 simultaneous clients and 1% packet loss, as expected, throughput plummets. FreeBSD achieves approximately 640Mb/s and Sandstorm roughly 25% less. This is not fundamental, but due to FreeBSD's more fine-grained retransmit timer and its use of NewReno congestion control rather than Reno, which could also be implemented in Sandstorm. Neither network nor server is stressed in this experiment – if there had been a real congested link causing the loss, both stacks would have filled it.

 

最后,我们评估了Sandstorm是否能正确处理高丢包率。 在80个并发客户端和1%丢包率下,正如预期,吞吐量大幅下降。 FreeBSD达到约640Mb/s,Sandstorm比它低约25%。 这不是根本性的差距,而是因为FreeBSD有更细粒度的重传定时器,并且使用NewReno而不是Reno拥塞控制,这些同样可以在Sandstorm中实现。在这个实验中,网络和服务器都没有承受压力:如果丢包是由真实的拥塞链路造成的,两个协议栈都能填满该链路。

 

Throughout, we have invested considerable effort in profiling and optimizing conventional network stacks, both to understand their design choices and bottlenecks, and to provide the fairest possible comparison. We applied conventional performance tuning to Linux and FreeBSD, such as increasing hash-table sizes, manually tuning CPU work placement for multiqueue NICs, and adjusting NIC parameters such as interrupt mitigation. In collaboration with Netflix,we also developed a number of TCP and virtual-memory subsystem performance optimizations for FreeBSD, reducing lock contention under high packet loads. One important optimization is related to sendfile(), in which contention within the VM subsystem occurred while TCP-layer socket-buffer locks were held, triggering a cascade to the system as a whole. These changes have been upstreamed to FreeBSD for inclusion in a future release.

 

在整个过程中,我们投入了相当大的努力来分析和优化常规网络堆栈,既为了理解它们的设计选择和瓶颈,也为了提供尽可能公平的比较。 我们对Linux和FreeBSD进行了常规的性能调优,例如增加哈希表大小、手动调整多队列NIC的CPU工作分配,以及调整NIC参数(如中断缓和)。 与Netflix合作,我们还为FreeBSD开发了许多TCP和虚拟内存子系统的性能优化,减少了高包负载下的锁争用。 一个重要的优化与sendfile()有关:在持有TCP层套接字缓冲区锁的同时发生了VM子系统内的争用,引发了波及整个系统的连锁反应。 这些更改已上游合并到FreeBSD,将包含在未来的版本中。

 

To copy or not to copy

The pre-copy variant of Sandstorm maintains more than one copy of each segment in memory so that it can send the same segment to multiple clients simultaneously. This requires more memory than nginx serving files from RAM. The memcpy variant only enqueues copies, requiring a single long-lived version of each packet, and uses a similar amount of memory to nginx. How does this memcpy affect performance? Figure 6 explores network throughput, CPU utilization, and system-call rate for two- and six-NIC configurations.

 

Sandstorm的预拷贝版在内存中保存了每个segment的多个副本,以便它可以同时将同一个segment发送到多个客户端。 这需要比从RAM中提供文件的nginx更多的内存。 memcpy版仅排列副本,需要每个数据包的单个长期版本,并使用与nginx类似的内存量。 这个memcpy如何影响性能? 图6探讨了两个和六个NIC配置的网络吞吐量,CPU利用率和系统调用率。

 

With six NICs, the additional memcpy() marginally reduces performance (Figure 6b) while exhibiting slightly higher CPU load (Figure 6d). In this experiment, Sandstorm only uses three cores to simplify the comparison, so around 75% utilization saturates those cores. The memcpy variant saturates the CPU for files smaller than 32KB, whereas the pre-copy variant does not. Nginx, using sendfile() and all four cores, only catches up for file sizes of 512KB and above, and even then exhibits higher CPU load.

 

使用六个NIC时,额外的memcpy()会略微降低性能(图6b),同时表现出稍高的CPU负载(图6d)。 在这个实验中,Sandstorm只使用三个核以简化比较,因此大约75%的利用率就意味着这些核已饱和。 对于小于32KB的文件,memcpy版会使CPU饱和,而预拷贝版不会。 Nginx使用sendfile()和全部四个核,只有在文件大小达到512KB及以上时才追上来,即便如此CPU负载也更高。

 

As file size decreases, the expense of SYN/FIN and HTTP-request processing becomes measurable for both variants, but the pre-copy version has more headroom so is affected less. It is interesting to observe the effects of batching under overload with the memcpy stack in Figure 6f. With large file sizes, pre-copy and memcpy make the same number of system calls per second. With small files, however, the memcpy stack makes substantially fewer system calls per second. This illustrates the efficacy of batching: memcpy has saturated the CPU, and consequently no longer polls the RX queue as often. As the batch size increases, the system-call cost decreases, helping the server weather the storm. The pre-copy variant is not stressed here and continues to poll frequently, but would behave the same way under overload. In the end, the cost of the additional memcpy is measurable, but still performs quite well.

 

随着文件大小减小,SYN/FIN和HTTP请求处理的开销对于这两种变体都变得可观,但是预拷贝版本有更多的余量,因此受影响较小。 有趣的是在图6f中观察过载下memcpy堆栈的批处理效果。对于大文件,预拷贝和memcpy每秒进行相同数量的系统调用。 然而对于小文件,memcpy堆栈每秒的系统调用要少得多。 这说明了批处理的效果:memcpy已使CPU饱和,因此不再那么频繁地轮询RX队列。 随着批大小增加,系统调用成本下降,帮助服务器扛住压力。 预拷贝版在这里没有承受压力,仍然频繁轮询,但在过载下也会有同样的行为。 最终,额外的memcpy的成本是可以测量到的,但整体表现仍然相当不错。

 

Results on contemporary hardware are significantly different from those run on older pre-DDIO hardware. Figure 7 shows the results obtained on our 2006-era servers. On the older machines,Sandstorm outperforms nginx by a factor of three, but the memcpy variant suffers a 30% decrease in throughput compared to pre-copy Sandstorm as a result of adding a single memcpy to the code. It is clear that on these older systems,memory bandwidth is the main performance bottleneck.

 

当代硬件上的结果与旧的、前DDIO时代硬件上的结果有显著不同。 图7显示了我们2006年那批服务器上的结果。 在旧机器上,Sandstorm的性能是nginx的三倍,但由于在代码中增加了一次memcpy,memcpy版的吞吐量比预拷贝版低30%。 很明显,在这些旧系统上,内存带宽是主要的性能瓶颈。

 

With DDIO, memory bandwidth is not such a limiting factor. Figure 9 in Section 3.5 shows the corresponding memory read throughput, as measured using CPU performance counters, for the network-throughput graphs in Figure 6b. With small file sizes, the pre-copy variant of Sandstorm appears to do more work: the L3 cache cannot hold all of the data, so there are many more L3 misses than with memcpy. Memory-read throughput for both pre-copy and nginx are closely correlated with their network throughput, indicating that DDIO is not helping on transmit: DMA comes from memory rather than the cache. The memcpy variant, however, has higher network throughput than memory throughput, indicating that DDIO is transmitting from the cache. Unfortunately, this is offset by much higher memory write throughput. Still, this only causes a small reduction in service throughput. Larger files no longer fit in the L3 cache, even with memcpy. Memory-read throughput starts to rise with files above 64KB. Despite this, performance remains high and CPU load decreases, indicating these systems are not limited by memory bandwidth for this workload.

 

使用DDIO,内存带宽不是这样的限制因素。第3.5节中的图9显示了使用CPU性能计数器测量的图6b中网络吞吐量图的相应内存读取吞吐量。对于小文件大小,Sandstorm的预拷贝版似乎做了更多的工作:L3缓存不能保存所有数据,因此比memcpy 版的有更多的L3缺失。预拷贝和nginx的内存读取吞吐量与它们的网络吞吐量密切相关,表明DDIO对传输没帮助:DMA来自内存,而不是缓存。 Memcpy版本网络吞吐量反而比内存吞吐量更高,表明DDIO正在从缓存传输。不幸的是,这被更高的内存写入吞吐量所抵消。但是,这只会导致服务吞吐量的小幅下降。较大的文件不再适合L3缓存,即使使用memcpy。随着高于64KB的文件,内存读取吞吐量开始上升。尽管如此,性能仍然很高,CPU负载降低,表明这些系统不受此工作负载的内存带宽的限制。

 

3.3 Experiment Design: Namestorm

We use the same clients and server systems to evaluate Namestorm as we used for Sandstorm. Namestorm is expected to be significantly more CPU-intensive than Sandstorm, mostly due to fundamental DNS protocol properties: high packet rate and small packets. Based on this observation, we have changed the network topology of our experiment: we use only one NIC on the server connected to the client systems via a 10GbE cut-through switch. In order to balance the load on the server to all available CPU cores we use four dedicated NIC queues and four Namestorm instances.

We ran Nominum’s dnsperf [2] DNS profiling software on the clients. We created zone files of varying sizes, loaded them onto the DNS servers, and configured dnsperf to query the zone repeatedly.

 

我们使用与评估Sandstorm时相同的客户端和服务器系统来评估Namestorm。 预计Namestorm会比Sandstorm明显更加CPU密集,这主要是由DNS协议的基本特性决定的:高包速率和小数据包。 基于这一观察,我们改变了实验的网络拓扑:服务器上只使用一个NIC,通过一台10GbE直通交换机连接到客户端系统。 为了把服务器上的负载均衡到所有可用的CPU核,我们使用四个专用的NIC队列和四个Namestorm实例。

我们在客户端运行Nominum的dnsperf [2] DNS性能测试软件。 我们创建了不同大小的zone文件,将它们加载到DNS服务器上,并配置dnsperf重复查询该zone。

 

3.4 Namestorm Results

Figure 8a shows the performance of Namestorm and NSD running on Linux and FreeBSD when using a single 10GbE NIC. Performance results of NSD are similar with both FreeBSD and Linux. Neither operating system can saturate the 10GbE NIC, however, and both show some performance drop as the zone file grows. On Linux, NSD's performance drops by ~14% (from ~689,000 to ~590,000 Queries/sec) as the zone file grows from 1 to 10,000 entries, and on FreeBSD, it drops by ~20% (from ~720,000 to ~574,000 Qps). For these benchmarks, NSD saturates all CPU cores on both systems.

 

图8a显示了使用单个10GbE NIC时，Namestorm与运行在Linux和FreeBSD上的NSD的性能。NSD在FreeBSD和Linux上的结果相近，但两种操作系统都无法使10GbE NIC饱和，而且随着区域文件增大，两者的性能都有所下降。在Linux上，当区域文件从1个条目增长到10,000个条目时，NSD的性能下降约14%（从约689,000降到约590,000查询/秒）；在FreeBSD上下降约20%（从约720,000降到约574,000 Qps）。在这些基准测试中，NSD在两个系统上都跑满了所有CPU核心。

 

For Namestorm, we utilized two datasets, one where the hash keys are in wire-format (w/o compr.), and one where they are in FQDN format (compr.). The latter requires copying the search term before hashing it to handle possible compressed requests.

 

对于Namestorm，我们使用了两个数据集：一个的哈希键采用wire格式（无压缩，w/o compr.），另一个采用FQDN格式（压缩，compr.）。后者为了处理可能被压缩的请求，需要在哈希之前先复制搜索项。
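(ps: 这里补一个示意性的例子（非论文代码），说明为什么处理压缩请求需要先复制：DNS报文里的域名可能含有RFC 1035的压缩指针（最高两位为1的字节指向报文中其他偏移），必须先把各段label拷贝拼接成连续的名字才能作哈希键；而未压缩的wire格式名字本身就是连续的，可以原地哈希。函数名expand_name等均为假设。)

#include <stdint.h>
#include <stddef.h>

/* 示意代码：把报文msg中偏移off处（可能被压缩的）域名展开成连续的wire格式，
 * 写入out，返回写入的字节数，出错返回-1。 */
static int expand_name(const uint8_t *msg, size_t len, size_t off,
                       uint8_t *out, size_t outlen)
{
    size_t o = 0;
    int hops = 0;                                /* 防止压缩指针成环 */

    while (off < len) {
        uint8_t c = msg[off];
        if ((c & 0xC0) == 0xC0) {                /* 压缩指针：跳到新偏移继续 */
            if (off + 1 >= len || ++hops > 16)
                return -1;
            off = (size_t)(((c & 0x3F) << 8) | msg[off + 1]);
            continue;
        }
        if (o + c + 1 > outlen || off + c + 1 > len)
            return -1;
        out[o++] = c;                            /* label长度字节 */
        for (int i = 1; i <= c; i++)
            out[o++] = msg[off + i];
        if (c == 0)
            return (int)o;                       /* 0长度label表示名字结束 */
        off += (size_t)c + 1;
    }
    return -1;
}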

 

With wire-format hashing, Namestorm memcpy performance is ~11–13× better, depending on the zone size, when compared to the best results from NSD with either Linux or FreeBSD. Namestorm's throughput drops by ~30% as the zone file grows from 1 to 10,000 entries (from ~9,310,000 to ~6,410,000 Qps). The reason for this decrease is mainly the LLC miss rate, which more than doubles. Dnsperf does not report throughput in Gbps, but given the typical DNS response size for our zones we can calculate ~8.4Gbps and ~5.9Gbps for the smallest and largest zone respectively.

 

使用wire格式哈希时，与NSD在Linux或FreeBSD上的最佳结果相比，Namestorm memcpy版的性能要好约11~13倍（取决于区域大小）。当区域文件从1个条目增长到10,000个条目时，Namestorm的吞吐量下降约30%（从约9,310,000降到约6,410,000 Qps）。下降的主要原因是LLC未命中率增加了一倍以上。dnsperf不报告以Gbps为单位的吞吐量，但根据我们区域的典型DNS响应大小，可以算出最小和最大区域分别约为8.4Gbps和5.9Gbps。

 

With FQDN-format hashing, Namestorm memcpy performance is worse than with wire-format hashing, but is still ~9–13× better compared to NSD. The extra processing with FQDN-format hashing costs ~10–20% in throughput, depending on the zone size.

Finally, in Figure 8a we observe a noticeable performance overhead with the pre-copy stack, which we explore in Section 3.5.

 

使用FQDN格式哈希时，Namestorm memcpy版的性能比wire格式哈希差，但仍比NSD好约9~13倍。FQDN格式哈希的额外处理会损失约10~20%的吞吐量，取决于区域大小。

最后，在图8a中我们观察到预拷贝堆栈有明显的性能开销，我们将在3.5节中探讨这一点。

 

3.4.1 Effectiveness of batching

One of the biggest performance benefits for Namestorm is that netmap provides an API that facilitates batching across the system-call interface. To explore the effects of batching, we configured a single Namestorm instance and one hardware queue, and reran our benchmark with varying batch sizes. Figure 8b illustrates the results:

a more than 2× performance gain when growing the batch size from 1 packet (no batching) to 32 packets. Interestingly, the performance of a single-core Namestorm without any batching remains more than 2× better than NSD.

 

Batching的效果

Namestorm最大的性能优势之一，是netmap提供了便于在系统调用接口上做批处理的API。为了探究批处理的效果，我们配置了单个Namestorm实例和一个硬件队列，并用不同的批大小重新运行了基准测试。图8b给出了结果：

把批大小从1个包（无批处理）增加到32个包时，性能提升超过2倍。有趣的是，即使完全不做批处理，单核Namestorm的性能仍比NSD好2倍以上。
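(ps: 下面用一段示意代码（非论文实现，论文没有给出libnmio的细节）说明批处理为什么能摊销系统调用开销：一次poll()唤醒后，把接收环里积累的所有包都在用户态循环里处理完，再进入下一次系统调用；批越大，平均到每个包上的系统调用开销就越小。网卡名ix0为假设值。)

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <poll.h>
#include <stdio.h>

/* 示意代码：一次poll()后处理接收环中现有的一整批数据包 */
int main(void)
{
    struct nm_desc *d = nm_open("netmap:ix0-0", NULL, 0, NULL);
    if (d == NULL) { fprintf(stderr, "nm_open failed\n"); return 1; }

    struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };

    for (;;) {
        poll(&pfd, 1, -1);                        /* 一次系统调用唤醒 */
        for (int r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
            struct netmap_ring *ring = NETMAP_RXRING(d->nifp, r);
            while (!nm_ring_empty(ring)) {        /* 把这一批包全部处理完 */
                struct netmap_slot *slot = &ring->slot[ring->cur];
                char *buf = NETMAP_BUF(ring, slot->buf_idx);
                /* ... 在buf上解析请求并构造响应，有效长度为slot->len ... */
                (void)buf;
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
        /* 下一次poll()时，netmap会把head之前的slot一并归还给内核 */
    }
}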

 

At a minimum, NSD has to make one system call to receive each request and one to send a response. Recently Linux added the new recvmmsg() and sendmmsg() system calls to receive and send multiple UDP messages with a single call. These may go some way to improving NSD's performance compared to Namestorm. They are, however, UDP-specific, and sendmmsg() requires the application to manage its own transmit-queue batching. When we implemented Namestorm, we already had libnmio, which abstracts and handles all the batching interactions with netmap, so there is no application-specific batching code in Namestorm.

 

NSD每接收一个请求至少需要一次系统调用，发送一个响应又需要一次。最近Linux加入了新的recvmmsg()和sendmmsg()系统调用，可以用一次调用收发多个UDP消息。这可能会在一定程度上缩小NSD与Namestorm的性能差距。不过它们只适用于UDP，而且sendmmsg()要求应用程序自己管理发送队列的批处理。我们实现Namestorm时已经有了libnmio，它抽象并处理了与netmap的全部批处理交互，因此Namestorm中没有任何应用特定的批处理代码。
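(ps: recvmmsg()的大致用法如下（示意代码，并非NSD的实现）：把多个msghdr装进一个数组，一次系统调用收一批UDP报文，返回值是实际收到的条数，每条报文的长度在msg_len里。BATCH、BUFSZ等参数是假设值。)

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <stdio.h>

#define BATCH 32
#define BUFSZ 512

/* 示意代码：一次recvmmsg()最多收BATCH个UDP报文 */
int recv_batch(int sockfd)
{
    static char bufs[BATCH][BUFSZ];
    struct iovec iovs[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len  = BUFSZ;
        msgs[i].msg_hdr.msg_iov    = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_WAITFORONE：至少收到一条就返回 */
    int n = recvmmsg(sockfd, msgs, BATCH, MSG_WAITFORONE, NULL);
    if (n < 0) { perror("recvmmsg"); return -1; }

    for (int i = 0; i < n; i++) {
        (void)msgs[i].msg_len;   /* 第i条报文的字节数，内容在bufs[i] */
    }
    return n;
}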

 

3.5 DDIO

With DDIO, incoming packets are DMAed directly to the CPU’s L3 cache, and outgoing packets are DMAed directly from the L3 cache, avoiding round trips from the CPU to the memory subsystem. For lightly loaded servers in which the working set is smaller than the L3 cache, or in which data is accessed with temporal locality by the processor and DMA engine (e.g., touched and immediately sent, or received and immediately accessed), DDIO can dramatically reduce latency by avoiding memory traffic. Thus DDIO is ideal for RPC-like mechanisms in which processing latency is low and data will be used immediately before or after DMA. On heavily loaded systems, it is far from clear whether DDIO will be a win or not. For applications with a larger cache footprint, or in which communication occurs at some delay from CPU generation or use of packet data, DDIO could unnecessarily pollute the cache and trigger additional memory traffic, damaging performance.

 

使用DDIO时，传入的数据包被直接DMA到CPU的L3缓存，传出的数据包也直接从L3缓存DMA出去，避免了CPU与内存子系统之间的往返。对于负载较轻、工作集小于L3缓存，或者处理器与DMA引擎以时间局部性方式访问数据（例如刚写好就发送、刚接收就访问）的服务器，DDIO可以通过避免内存流量来显著降低延迟。因此DDIO非常适合类RPC的机制：处理延迟低，数据在DMA前后会被立即使用。但在负载很重的系统上，DDIO是否有利就很难说了。对于缓存足迹较大的应用，或者通信发生在CPU生成或使用包数据之后一段时间的应用，DDIO可能会不必要地污染缓存并引发额外的内存流量，从而损害性能。

 

Intuitively, one might reasonably assume that Sandstorm's pre-copy mode might interact best with DDIO: as with sendfile()-based designs, only packet headers enter the L1/L2 caches, with payload content rarely touched by the CPU. Figure 9 therefore illustrates a surprising effect when operating on small file sizes: overall memory throughput from the CPU package, as measured using performance counters situated on the DRAM-facing interface of the LLC, sees significantly less traffic for the memcpy implementation relative to the pre-copy one, which shows a constant rate roughly equal to network throughput.

 

直观上，人们可能会合理地认为Sandstorm的预拷贝模式与DDIO配合得最好：与基于sendfile()的设计一样，只有包头进入L1/L2缓存，有效载荷内容几乎不被CPU触碰。因此图9展示了小文件下一个令人惊讶的现象：用位于LLC面向DRAM一侧接口上的性能计数器测得的CPU封装整体内存流量，memcpy实现明显低于预拷贝实现，而预拷贝实现的内存流量保持在与网络吞吐量大致相等的恒定速率上。

 

We believe this occurs because DDIO is, by policy, limited from occupying most of the LLC: in the pre-copy cases, DDIO is responsible for pulling untouched data into the cache – as the file data cannot fit in this subset of the cache, DMA access thrashes the cache and all network transmit is done from DRAM. In the memcpy case, the CPU loads data into the cache, allowing more complete utilization of the cache for network data. However, as the DRAM memory interface is not a bottleneck in the system as configured, the net result of the additional memcpy, despite better cache utilization, is reduced performance. As file sizes increase, the overall footprint of memory copying rapidly exceeds the LLC size, exceeding network throughput, at which point pre-copy becomes more efficient. Likewise, one might mistakenly believe simply from inspection of CPU memory counters that nginx is somehow benefiting from this same effect: in fact, nginx is experiencing CPU saturation, and it is not until file size reaches 512K that sufficient CPU is available to converge with pre-copy's saturation of the network link.

 

我们认为出现这种情况是因为DDIO在策略上被限制，不能占用LLC的大部分空间：在预拷贝的情况下，由DDIO负责把CPU未触碰过的数据拉进缓存，而文件数据放不进缓存中分给DDIO的这一小部分，DMA访问不断冲刷（thrash）缓存，所有网络发送实际上都来自DRAM。在memcpy的情况下，则是CPU把数据装入缓存，使网络数据能够更充分地利用缓存。然而，由于在当前配置下DRAM内存接口并不是系统瓶颈，额外memcpy的净效果是：尽管缓存利用率更好，性能反而下降。随着文件增大，内存拷贝的总体足迹很快超过LLC大小并超过网络吞吐量，此时预拷贝变得更高效。同样，仅看CPU内存计数器，人们可能会误以为nginx也受益于同样的效应：实际上nginx处于CPU饱和状态，直到文件大小达到512K才有足够的CPU使其吞吐量追上预拷贝版打满网络链路的水平。

 

By contrast, Namestorm sees improved performance using the memcpy implementation, as the cache lines holding packet data must be dirtied due to protocol requirements, in which case performing the memcpy has little CPU overhead yet allows much more efficient use of the cache by DDIO.

 

相比之下，Namestorm使用memcpy实现反而性能更好：由于协议本身的要求，存放分组数据的缓存行无论如何都必须被写脏，此时做一次memcpy几乎不增加CPU开销，却让DDIO可以更高效地利用缓存。

(ps: 这个例子很有意思：memcpy版虽然每个响应都要多做一次内存拷贝，但拷贝让数据进入了cache，DDIO可以直接从cache发送，网络吞吐量反而高于内存读取吞吐量；而预拷贝版虽然省掉了拷贝，但数据不在cache里，每次发送都要走DRAM，内存读取吞吐量大致等于网络吞吐量。不过当文件大到拷贝的足迹超过LLC后，预拷贝又会变得更高效。)

 

4. DISCUSSION

We developed Sandstorm and Namestorm to explore the hypothesis that fundamental architectural change might be required to properly exploit rapidly growing CPU core counts and NIC capacity. Comparisons with Linux and FreeBSD appear to confirm this conclusion far more dramatically than we expected: while there are small-factor differences between Linux and FreeBSD performance curves, we observe that their shapes are fundamentally the same. We believe that this reflects near-identical underlying architectural decisions stemming from common intellectual ancestry (the BSD network stack and sockets API) and largely incremental changes from that original design.

 

我们开发Sandstorm和Namestorm是为了检验这样一个假设：要充分利用快速增长的CPU核心数和NIC容量，可能需要根本性的架构变革。与Linux和FreeBSD的比较似乎比我们预期的更鲜明地证实了这一结论：虽然Linux和FreeBSD的性能曲线之间存在小幅差异，但我们观察到它们的形状基本相同。我们认为，这反映了二者源自共同知识谱系（BSD网络栈和套接字API）的几乎相同的底层架构决策，以及在原始设计之上大体渐进式的改动。

 

Sandstorm and Namestorm adopt fundamentally different architectural approaches, emphasizing transparent memory flow within applications (and not across expensive protection-domain boundaries), process-to-completion, heavy amortization, batching, and application-specific customizations that seem antithetical to general-purpose stack design. The results are dramatic, accomplishing near-linear speedup with increases in core and NIC capacity – completely different curves possible only with a completely different design.

 

Sandstorm和Namestorm采用了根本不同的架构方法，强调在应用程序内部（而不是跨越昂贵的保护域边界）的透明内存流动、运行到完成（process-to-completion）、大力度的开销摊销、批处理，以及针对特定应用的定制，这些看起来都与通用堆栈设计背道而驰。结果是惊人的：随着核心数和NIC容量的增加实现近线性加速，这种完全不同的曲线只有靠完全不同的设计才可能得到。

 

4.1 Current network-stack specialization

Over the years there have been many attempts to add specialized features to general-purpose stacks such as FreeBSD and Linux. Examples include sendfile(), primarily for web servers, recvmmsg(), mostly aimed at DNS servers, and assorted socket options for telnet. In some cases, entire applications have been moved to the kernel [13, 24] because it was too difficult to achieve performance through the existing APIs. The problem with these enhancements is that each serves a narrow role, yet still must fit within a general OS architecture, and thus are constrained in what they can do. Special-purpose userspace stacks do not suffer from these constraints, and free the programmer to solve a narrow problem in an application-specific manner while still having the other advantages of a general-purpose OS stack.

 

多年来，人们多次尝试为FreeBSD、Linux这类通用堆栈添加专门的功能。例子包括主要面向Web服务器的sendfile()、主要面向DNS服务器的recvmmsg()，以及为telnet设计的各种套接字选项。在某些情况下，由于通过现有API很难获得性能，整个应用程序被搬进了内核[13, 24]。这些增强的问题在于：每一项都只服务于一个狭窄的用途，却仍必须适配通用的OS体系结构，因此能做的事情受到限制。专用的用户空间堆栈不受这些约束，让程序员可以用特定于应用的方式去解决一个狭窄的问题，同时仍保留通用OS堆栈的其他优点。

 

4.2 The generality of specialization

Our approach tightly integrates the network stack and application within a single process. This model, together with optimizations aimed at cache locality or pre-packetization, naturally fit a reasonably wide range of performance-critical, event-driven applications such as web servers, key-value stores, RPC-based services and name servers. Even rate-adaptive video streaming may benefit, as developments such as MPEG-DASH and Apple’s HLS have moved intelligence to the client leaving servers as dumb static-content farms.

 

我们的方法把网络堆栈和应用逻辑紧密地整合在单个进程中。这种模型加上面向缓存局部性或预分组化的优化，天然适合相当广泛的一类性能关键、事件驱动的应用，例如Web服务器、键值存储、基于RPC的服务和域名服务器。甚至速率自适应视频流也可能受益，因为MPEG-DASH和苹果HLS等的发展已经把智能移到了客户端，服务器只剩下提供静态内容的简单角色。

 

Not all network services are a natural fit. For example, CGI-based web services and general-purpose databases have inherently different properties and are generally CPU- or filesystem-intensive, deemphasizing networking bottlenecks. In our design, the control loop and transport-protocol correctness depend on the timely execution of application-layer functions; blocking in the application cannot be tolerated. A thread-based approach might be more suitable for such cases. Isolating the network stack and application into different threads still yields benefits: OS-bypass networking costs less, and saved CPU cycles are available for the application. However, such an approach requires synchronization, and so increases complexity and offers less room for cross-layer optimization.

 

并不是所有网络服务都天然适合这种方法。例如，基于CGI的Web服务和通用数据库具有本质上不同的属性，通常是CPU或文件系统密集型，网络瓶颈退居次要位置。在我们的设计中，控制循环和传输协议的正确性依赖于应用层函数的及时执行，应用程序中的阻塞是不能容忍的。对这类情况，基于线程的方法可能更合适。把网络堆栈和应用程序隔离到不同的线程仍然有好处：绕过OS的网络开销更低，省下的CPU周期可以留给应用程序。然而这种方法需要同步，因而增加了复杂性，也压缩了跨层优化的空间。

 

We are neither arguing for the exclusive use of specialized stacks over generalized ones, nor deployment of general-purpose network stacks in userspace. Instead, we propose selectively identifying key scale-out applications where informed but aggressive exploitation of domain-specific knowledge and micro-architectural properties will allow cross-layer optimizations. In such cases, the benefits outweigh the costs of developing and maintaining a specialized stack.

 

我们既不是主张用专用堆栈完全取代通用堆栈，也不是主张把通用网络堆栈搬到用户空间。相反，我们建议有选择地找出关键的横向扩展应用：在这些应用中，有依据而激进地利用领域特定知识和微架构特性可以带来跨层优化。在这种情况下，收益超过了开发和维护专用堆栈的成本。

 

4.3 Tracing, profiling, and measurement

One of our greatest challenges in this work was the root-cause analysis of performance issues in contemporary hardware-software implementations. The amount of time spent analyzing network-stack behavior (often unsuccessfully) dwarfed the amount of time required to implement Sandstorm and Namestorm.

 

这项工作中我们面临的最大挑战之一，是对当代软硬件实现中性能问题做根因分析。分析网络堆栈行为所花费的时间（而且常常没有结果）远远超过了实现Sandstorm和Namestorm所需的时间。

 

An enormous variety of tools exist – OS-specific PMC tools, lock contention measurement tools, tcpdump, Intel vTune, DTrace, and a plethora of application-specific tracing features – but they suffer many significant limitations. Perhaps most problematic is that the tools are not holistic: each captures only a fragment of the analysis space – different configuration models, file formats, and feature sets.

 

现有的工具种类繁多：特定于操作系统的PMC工具、锁争用测量工具、tcpdump、Intel vTune、DTrace，以及大量应用程序自带的跟踪功能，但它们都有明显的局限。也许最大的问题是这些工具都不全面：每个只覆盖分析空间的一个片段，配置模型、文件格式和特性集各不相同。

 

Worse, as we attempted inter-OS analysis (e.g., comparing Linux and FreeBSD lock profiling), we discovered that tools often measure and report results differently, preventing sensible comparison. For example, we found that Linux took packet timestamps at different points than FreeBSD, FreeBSD uses different clocks for DTrace and BPF, and that while FreeBSD exports both per-process and per-core PMC stats, Linux supports only the former. Where supported, DTrace attempts to bridge these gaps by unifying configuration, trace formats, and event namespaces [15]. However, DTrace also experiences high overhead causing bespoke tools to persist, and is unintegrated with packet-level tools preventing side-by-side comparison of packet and execution traces. We feel certain that improvement in the state-of-the-art would benefit not only research, but also the practice of network-stack implementation.

 

更糟的是，当我们尝试跨操作系统分析（例如比较Linux和FreeBSD的锁剖析）时，我们发现这些工具经常以不同方式测量和报告结果，使得合理的比较无从谈起。例如，我们发现Linux采集包时间戳的位置与FreeBSD不同；FreeBSD的DTrace和BPF使用不同的时钟；FreeBSD可以同时导出按进程和按核心的PMC统计，而Linux只支持前者。在受支持的平台上，DTrace试图通过统一配置、跟踪格式和事件命名空间来弥合这些差距[15]。然而DTrace的开销很高，使得各种定制工具继续存在，而且它没有与包级工具集成，无法把包踪迹和执行踪迹并排比较。我们确信，这方面技术水平的提升不仅会让研究受益，也会让网络栈实现的实践受益。

 

Our special-purpose stacks are synchronous; after netmap hands off packets to userspace, the control flow is generally linear, and we process packets to completion. This, combined with lock-free design, means that it is very simple to reason about where time goes when handling a request flow. General-purpose stacks cannot, by their nature, be synchronous. They must be asynchronous to balance all the conflicting demands of hardware and applications, managing queues without application knowledge, allocating processing to threads in order to handle those queues, and ensuring safety via locking. To reason about performance in such systems, we often resort to statistical sampling because it is not possible to directly follow the control flow. Of course, not all network applications are well suited to synchronous models; we argue, however, that imposing the asynchrony of a general-purpose stack on all applications can unnecessarily complicate debugging, performance analysis, and performance optimization.

 

我们的专用堆栈是同步的：netmap把包交给用户空间之后，控制流基本上是线性的，我们把数据包处理到完成。这一点加上无锁设计，意味着在处理请求流时很容易推断时间都花在了哪里。通用堆栈由于其本质不可能是同步的：它们必须用异步来平衡硬件和应用程序的各种相互冲突的需求，在没有应用程序知识的情况下管理队列，把处理分配给线程去消化这些队列，并通过加锁来保证安全。要推断这类系统的性能，我们常常只能借助统计采样，因为无法直接跟随控制流。当然，并不是所有网络应用都适合同步模型；但我们认为，把通用堆栈的异步性强加给所有应用，可能会不必要地使调试、性能分析和性能优化复杂化。


5. RELATED WORK

Web server and network-stack performance optimization is not a new research area. Past studies have come up with many optimization techniques as well as completely different design choices. These designs range from userspace and kernel-based implementations to specialized operating systems.

 

Web服务器和网络堆栈的性能优化并不是一个新的研究领域。过去的研究提出了许多优化技术，也给出了完全不同的设计选择：从用户空间实现、内核实现到专用操作系统，不一而足。

 

With the conventional approaches, userspace applications [1, 6] utilize general-purpose network stacks, relying heavily on operating-system primitives to achieve data movement and event notification [26]. Several proposals [23, 12, 30] focus on reducing the overhead of such primitives (e.g., KQueue, epoll, sendfile()). IO-Lite [27] unifies the data management between OS subsystems and userspace applications by providing page-based mechanisms to safely and concurrently share data. Fbufs [17] utilize techniques such as page remapping and shared memory to provide high-performance cross-domain transfers and buffer management. Pesterev and Wickizer [28, 14] have proposed efficient techniques to improve commodity-stack performance by controlling connection locality and taking advantage of modern multicore systems. Similarly, MegaPipe [21] shows significant performance gain by introducing a bidirectional, per-core pipe to facilitate data exchange and event notification between kernel and userspace applications.

 

在传统方法中，用户空间应用[1, 6]使用通用网络堆栈，严重依赖操作系统原语来完成数据移动和事件通知[26]。若干提案[23, 12, 30]集中于降低这类原语（例如KQueue、epoll、sendfile()）的开销。IO-Lite [27]通过提供基于页面的机制来安全地并发共享数据，从而统一了操作系统子系统与用户空间应用之间的数据管理。Fbufs [17]利用页面重映射和共享内存等技术提供高性能的跨域传输和缓冲区管理。Pesterev和Wickizer [28, 14]提出了通过控制连接局部性并利用现代多核系统来提升通用（commodity）协议栈性能的高效技术。类似地，MegaPipe [21]通过引入双向的、每核一条的管道来促进内核与用户空间应用之间的数据交换和事件通知，获得了显著的性能提升。

 

A significant number of research proposals follow a substantially different approach: they propose partial or full implementation of network applications in kernel, aiming to eliminate the cost of communication between kernel and userspace. Although this design decision improves performance significantly, it comes at the cost of limited security and reliability. A representative example of this category is kHTTPd [13], a kernel-based web server which uses the socket interface. Similar to kHTTPd, TUX [24] is another noteworthy example of in-kernel network applications. TUX achieves greater performance by eliminating the socket layer and pinning the static content it serves in memory. We have adopted several of these ideas in our prototype, although our approach is not kernel based.

 

大量研究采用了一种截然不同的方法：在内核中部分或全部实现网络应用，以消除内核与用户空间之间的通信成本。这种设计决策虽然显著提高了性能，代价却是安全性和可靠性受限。这一类的代表是kHTTPd [13]，一个使用套接字接口的内核态Web服务器。与kHTTPd类似，TUX [24]是内核态网络应用的另一个值得注意的例子：TUX通过去掉套接字层并把所提供的静态内容固定（pin）在内存中获得了更高的性能。我们在原型中借鉴了其中一些想法，尽管我们的方法不是基于内核的。

 

Microkernel designs such as Mach [10] have long appealed to OS designers, pushing core services (such as network stacks) into user processes so that they can be more easily developed, customized, and multiply-instantiated. In this direction, Thekkath et al. [32] have prototyped capability-enabled, library-synthesized userspace network stacks implemented on Mach. The Cheetah web server is built on top of an Exokernel [19] library operating system that provides a filesystem and an optimized TCP/IP implementation. Lightweight libOSes enable application developers to exploit domain-specific knowledge and improve performance. Unikernel designs such as MirageOS [25] likewise blend operating-system and application components at compile-time, trimming unneeded software elements to accomplish extremely small memory footprints – although by static code analysis rather than application-specific specialization.

 

像Mach [10]这样的微内核设计长期以来一直吸引着OS设计者：把核心服务（例如网络堆栈）推入用户进程，使其更容易开发、定制和多实例化。沿着这个方向，Thekkath等人[32]在Mach上原型实现了支持capability、由库合成的用户空间网络栈。Cheetah Web服务器构建在一个Exokernel [19]库操作系统之上，后者提供了文件系统和优化的TCP/IP实现。轻量级的libOS让应用开发者能够利用领域特定知识来提升性能。MirageOS [25]等Unikernel设计同样在编译时把操作系统组件和应用组件融合在一起，裁剪掉不需要的软件元素以获得极小的内存占用，不过它们依靠的是静态代码分析而非面向特定应用的专门化。

 

6. CONCLUSION

In this paper, we have demonstrated that specialized userspace stacks, built on top of the netmap framework, can vastly improve the performance of scale-out applications. These performance gains sacrifice generality by adopting design principles at odds with contemporary stack design: application-specific cross-layer cost amortizations, synchronous and buffering-free protocol implementations, and an extreme focus on interactions between processors, caches, and NICs. This approach reflects a widespread adoption of scale-out computing in data centers, which deemphasizes multifunction hosts in favor of increased large-scale specialization. Our performance results are compelling: a 2–10× improvement for web service, and a roughly 9× improvement for DNS service. Further, these stacks have proven easier to develop and tune than conventional stacks, and their performance improvements are portable over multiple generations of hardware.

 

在本文中，我们证明了构建在netmap框架之上的专用用户空间堆栈可以大幅提升横向扩展应用的性能。这些性能收益以牺牲通用性为代价，采用了与当代堆栈设计相悖的设计原则：特定于应用的跨层成本摊销、同步且无缓冲的协议实现，以及对处理器、缓存和NIC之间交互的极端关注。这种方法契合了数据中心对横向扩展计算的广泛采用：弱化多功能主机，转而追求更大规模的专门化。我们的性能结果很有说服力：Web服务提升2到10倍，DNS服务提升约9倍。此外，事实证明这些堆栈比常规堆栈更容易开发和调优，而且它们的性能改进可以跨多代硬件移植。

 

General-purpose operating system stacks have been around a long time, and have demonstrated the ability to transcend multiple generations of hardware. We believe the same should be true of special-purpose stacks, but that tuning for particular hardware should be easier. We examined performance on servers manufactured seven years apart, and demonstrated that although the performance bottlenecks were now in different places, the same design delivered significant benefits on both platforms.

 

通用操作系统的网络堆栈已经存在了很长时间，并且证明了自己能够跨越多代硬件。我们认为专用堆栈同样可以做到这一点，而且针对特定硬件的调优应当更容易。我们考察了出厂时间相隔七年的服务器上的性能，结果表明虽然性能瓶颈已经出现在不同的地方，但同一套设计在两个平台上都带来了显著的收益。

posted on 2017-01-22 18:01 clcl
