A brief history of Consensus, 2PC and Transaction Commit.

Notes:
*. Time, Clocks and the Ordering of Events in a Distributed System" (1978)
1. The issue is that in a distributed system you cannot tell if event A happened before event B, unless A caused B in some way. Each observer can see events happen in a different order, except for events that cause each other, ie there is only a partial ordering of events in a distributed system.
    2. Lamport defines the "happens before" relationship and operator, and goes on to give an algorithm that provides a total ordering of events in a distributed system, so that each process sees events in the same order as every other process.
    3. Lamport also introduces the concept of a distributed state machine: start a set of deterministic state machines in the same state and then make sure they process the same messages in the same order.
    4. Each machine is now a replica of the others. The key problem is making each replica agree what is the next message to process: a consensus problem.
    5. However, the system is not fault tolerant; if one process fails that others have to wait for it to recover.

*. "Notes on Database Operating Systems" (1979).
    1. 2PC problem: Unfortunately 2PC would block if the TM (Transaction Manager) fails at the wrong time.

*. "NonBlocking Commit Protocols" (1981)
    1. 3PC problem: The problem was coming up with a nice 3PC algorithm, this would only take nearly 25 years!

*. "Impossibility of distributed consensus with one faulty process" (1985)
    1. this famous result is known as the "FLP" result
    2. By this time "consensus" was the name given to the problem of getting a bunch of processors to agree a value.
    3. The kernel of the problem is that you cannot tell the difference between a process that has stopped and one that is running very slowly, making dealing with faults in an asynchronous system almost impossible.
    4. a distributed algorithm has two properties: safety and liveness. 2PC is safe: no bad data is ever written to the databases, but its liveness properties aren't great: if the TM fails at the wrong point the system will block.
    5. The asynchronous case is more general than the synchronous case: an algorithm that works for an asynchronous system will also work for a synchronous system, but not vice versa.

*. "The Byzantine Generals Problem" (1982)
    1. In this form of the consensus problem the processes can lie, and they can actively try to deceive other processes.

*. "A Comparison of the Byzantine Agreement Problem and the Transaction Commit Problem." (1987) .
    1. At the time the best consensus algorithm was the Byzantine Generals, but this was too expensive to use for transactions.

*. "Uniform consensus is harder than consensus" (2000)
    1. With uniform consensus all processes must agree on a value, even the faulty ones - a transaction should only commit if all RMs are prepared to commit.

*. "The Part-Time Parliament" (submitted in 1990, published 1998)
    1. Paxos consensus algorithm

*. "How to Build a Highly Availability System using Consensus" (1996).
1. This paper provides a good introduction to building fault tolerant systems and Paxos.

*. "Paxos Made Simple (2001)
    1. The kernel of Paxos is that given a fixed number of processes, any majority of them must have at least one process in common. For example given three processes A, B and C the possible majorities are: AB, AC, or BC. If a decision is made when one majority is present eg AB, then at any time in the future when another majority is available at least one of the processes can remember what the previous majority decided. If the majority is AB then both processes will remember, if AC is present then A will remember and if BC is present then B will remember.
    2. Paxos can tolerate lost messages, delayed messages, repeated messages, and messages delivered out of order.
    3. It will reach consensus if there is a single leader for long enough that the leader can talk to a majority of processes twice. Any process, including leaders, can fail and restart; in fact all processes can fail at the same time, the algorithm is still safe. There can be more than one leader at a time.
    4. Paxos is an asynchronous algorithm; there are no explicit timeouts. However, it only reaches consensus when the system is behaving in a synchronous way, ie messages are delivered in a bounded period of time; otherwise it is safe. There is a pathological case where Paxos will not reach consensus, in accordance to FLP, but this scenario is relatively easy to avoid in practice.

*.   "Consensus in the presence of partial synchrony" (1988)
    1. There are two versions of partial synchronous system: in one processes run at speeds within a known range and messages are delivered in bounded time but the actual values are not known a priori; in the other version the range of speeds of the processes and the upper bound for message deliver are known a priori, but they will only start holding at some unknown time in the future.
    2. The partial synchronous model is a better model for the real world than either the synchronous or asynchronous model; networks function in a predicatable way most of the time, but occasionally go crazy.

*.   "Consensus on Transaction Commit" (2005).
    1. A third phase is only required if there is a fault, in accordance to the Skeen result. Given 2n+1 TM replicas Paxos Commit will complete with up to n faulty replicas.
    2. Paxos Commit does not use Paxos to solve the transaction commit problem directly, ie it is not used to solve uniform consensus, rather it is used to make the system fault tolerant.
    3. Recently there has been some discussion of the CAP conjecture: Consistency, Availability and Partition. The conjecture asserts that you cannot have all three in a distributed system: a system that is consistent, that can have faulty processes and that can handle a network partition.
    4. Now take a Paxos system with three nodes: A, B and C. We can reach consensus if two nodes are working, ie we can have consistency and availability. Now if C becomes partitioned and C is queried, it cannot respond because it cannot communicate with the other nodes; it doesn't know whether it has been partitioned, or if the other two nodes are down, or if the network is being very slow. The other two nodes can carry on, because they can talk to each other and they form a majority. So for the CAP conjecture, Paxos does not handle a partition because C cannot respond to queries. However, we could engineer our way around this. If we are inside a data center we can use two independent networks (Paxos doesn't mind if messages are repeated). If we are on the internet, then we could have our client query all nodes A, B and C, and if C is partitioned the client can query A or B unless it is partitioned in a similar way to C.
    5. a synchronous network, if C is partitioned it can learn that it is partitioned if it does not receive messages in a fixed period of time, and thus can declare itself down to the client.

*.   "Co-Allocation, Fault Tolerance and Grid Computing" (2006).

[REF] http://betathoughts.blogspot.com/2007/06/brief-history-of-consensus-2pc-and.html

posted on 2010-08-12 23:37 豪阅读(1648) 评论(0) 编辑收藏引用

常用链接

留言簿(19)

随笔分类(81)

文章分类(89)

相册

ACM OJ

My friends

搜索

积分与排名

最新评论

阅读排行榜

评论排行榜

只有注册用户登录后才能发表评论。
【推荐】100%开源！大型工业跨平台软件C++源码提供，建模，组态！



网站导航: 博客园 IT新闻 BlogJava 博问 Chat2DB 管理