High Availability (HA) Systems Become Open Source
You can make clustering easy with TIPC

The convergence between telecom and datacom is a two-way process, with both worlds contributing their experience and best technology to the future network. One area where telecom companies have extensive experience is computer clustering, and more specifically, providing high-availability (HA) systems by using such technology. A well-proven product from this application area has now been released to open source.

Since the mid-nineties, telecom equipment vendor Ericsson has been developing and deploying a tailor-made reliable communication protocol for their cluster-based products. This protocol, called Transparent Inter Process Communication (TIPC), has over the past two years undergone a significant redesign and is now available as a portable source code package of about 12,000 lines of C code. The code implements a kernel driver, a design that has made it possible to boost performance (35% faster than TCP) and minimize code footprint.

The current version is available under a BSD license from http://tipc.sourceforge.net. It runs on Linux 2.4 and 2.6, and several proprietary ports to other operating systems (OSE, Tru64, Dicos) exist, with more planned during this year.

TIPC offers an interesting combination of features, some of them quite unique, to achieve the overall goal: to make the cluster act as one single computer from a communication viewpoint while helping applications keep track of and adapt to topology changes. Figure 1 shows a functional view of TIPC.

Five-Layer Network Topology

From a TIPC viewpoint the network is organized in a five-layer structure (see Figure 2).

The top level is the TIPC network as such. This is the ensemble of all computers (nodes) interconnected via TIPC, i.e., the domain where any node can reach any other node by using a TIPC network address.

The next level in the hierarchy is an entity called zone. This "cluster of clusters" is the maximum scope of location transparency within a network, i.e., the domain where an application does not need to worry about network addresses.

The third level is what we call the cluster. This is a group of nodes interconnected all-to-all via one or two TIPC links.

The fourth level is the individual system node, or just node. There may be up to 2047 system nodes in a cluster.

The lowest level is the slave node. Slave nodes provide the same properties regarding location transparency and availability as system nodes but don't need full physical connectivity to the rest of the cluster. One link to one system node is sufficient, although there may be more for redundancy reasons.

All network entities within a TIPC network are accessed using a TIPC network address, a 32-bit value subdivided into zone, cluster, and node fields. This address is internally mapped to the address type of the communication medium actually used, e.g., an Ethernet address or an IP address/port number pair.
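
The packing of such an address can be sketched in a few lines of Python. The field widths used here (8 bits for the zone, 12 for the cluster, 12 for the node) are an assumption borrowed from the layout of the tipc_addr() helper in the Linux TIPC header; the text above only states that the three fields share 32 bits. The function names are made up for illustration.

```python
# Assumed field widths (not given in the article): 8 + 12 + 12 = 32 bits.
ZONE_BITS, CLUSTER_BITS, NODE_BITS = 8, 12, 12


def pack_addr(zone, cluster, node):
    """Pack a <zone.cluster.node> triple into one 32-bit integer."""
    if not (0 <= zone < (1 << ZONE_BITS)
            and 0 <= cluster < (1 << CLUSTER_BITS)
            and 0 <= node < (1 << NODE_BITS)):
        raise ValueError("address field out of range")
    return (zone << (CLUSTER_BITS + NODE_BITS)) | (cluster << NODE_BITS) | node


def unpack_addr(addr):
    """Split a 32-bit address back into (zone, cluster, node)."""
    zone = (addr >> (CLUSTER_BITS + NODE_BITS)) & ((1 << ZONE_BITS) - 1)
    cluster = (addr >> NODE_BITS) & ((1 << CLUSTER_BITS) - 1)
    node = addr & ((1 << NODE_BITS) - 1)
    return zone, cluster, node
```

With this layout, node 10 in cluster 1 of zone 1 packs to 0x0100100A, and unpack_addr() recovers the triple (1, 1, 10).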

Location-Transparent Functional Addressing

To present a cluster as one computer, the addressing scheme used must hide the physical location of a requested service from its users. To achieve this, TIPC provides a functional address type, called port name, to be used both for connectionless messaging and connection setup calls. Binding a socket to a port name corresponds to binding it to a port number in other protocols, except that the port name is unique and has validity for the whole cluster, not only the local node. A caller wanting to set up a connection needs only to specify this address, and the TIPC internal translation service ensures that the request ends up in the right socket, on the right node.

A port name consists of two 32-bit fields. The first field is called the name type and typically identifies a certain service type or function. The second field is the name instance and is used as a key for accessing a certain instance of the requested service. This address structure gives excellent support for both service partitioning and service load sharing.

Further support for service partitioning is provided by an address type called port name sequence. This is a three-integer structure defining a range of port names, i.e., a name type plus the lower and upper boundary of the instance range. By allowing a socket to bind to a sequence, instead of just an individual port name, it is possible to partition a service's scope of responsibility into sub-ranges, without having to create a vast number of sockets to do so.

There are very few limitations on how name sequences may be bound to sockets. One may bind many different sequences, or many instances of the same sequence, to the same socket, to different sockets on the same node, or to different sockets anywhere in the cluster (see Figure 3).
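
As a rough model of the addressing scheme described above, the following Python sketch keeps a table of name sequence bindings and translates a port name (type plus instance) into the sockets that cover it. It is a toy illustration, not TIPC's actual table. The class names and the node and port fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    name_type: int   # identifies the service type or function
    lower: int       # lower boundary of the instance range
    upper: int       # upper boundary of the instance range
    node: int        # hypothetical identifier of the node owning the socket
    port: int        # hypothetical reference to the socket on that node


class NameTable:
    """Toy translation table from port names to bound sockets."""

    def __init__(self):
        self.bindings = []

    def bind(self, name_type, lower, upper, node, port):
        if lower > upper:
            raise ValueError("lower boundary exceeds upper boundary")
        self.bindings.append(Binding(name_type, lower, upper, node, port))

    def lookup(self, name_type, instance):
        """Return every binding whose sequence contains the port name."""
        return [b for b in self.bindings
                if b.name_type == name_type and b.lower <= instance <= b.upper]
```

Binding type 1000 with instances 0-99 on one node and 100-199 on another partitions the service: a lookup of instance 150 returns only the second binding, while a single socket still covers a whole sub-range.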

Reliable Functional Multicast

Functional addressing is also used to provide a reliable multicast service. If the sender of a message indicates a port name sequence instead of a port name, a replica of the message is sent to all ports bound to a name sequence fully or partially overlapping with the given sequence (see Figure 4).

Only one replica of the message is sent to each identified target port, even if it is bound to more than one matching sequence. Whenever possible, this function will make use of the multicast/broadcast properties of the carrying media. In such cases, reliability is ensured by a special "reliable cluster broadcast" protocol implemented internally in TIPC.
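
The target selection for such a multicast can be sketched as follows. This is a simplified model of the overlap rule and the one-replica-per-port guarantee only; it says nothing about how the replicas travel on the wire. The tuple layout and the function name are assumptions made for the example.

```python
def multicast_targets(bindings, name_type, lower, upper):
    """Return each port bound to a sequence overlapping [lower, upper], once.

    bindings is a list of (name_type, lower, upper, port) tuples, where
    port is any hashable, hypothetical port identifier.
    """
    targets = []
    seen = set()
    for b_type, b_lower, b_upper, port in bindings:
        # Two ranges overlap, fully or partially, when each one starts
        # before the other one ends.
        overlaps = (b_type == name_type
                    and b_lower <= upper and lower <= b_upper)
        if overlaps and port not in seen:
            seen.add(port)          # a port gets one replica at most
            targets.append(port)
    return targets
```

A port bound to two matching sequences still appears only once in the result, which mirrors the rule that each target port receives a single replica.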

Translation from port name to socket addresses is performed transparently via an internal translation table, replicated on each node. When a socket is bound to a port name sequence, a table entry is distributed to all nodes within the binding scope, i.e., the local cluster in most cases.

Binding Scopes and Lookup Domains

Although complete location transparency is desirable and sufficient for most applications, there must be ways to control this property for those who may need to do so. Hence, when binding a name sequence to a socket, it's possible to qualify it with a visibility scope parameter indicating how far the knowledge of the binding should be distributed in the network. The default behavior is to spread it to the nodes in the binder's cluster, but it is possible to extend the scope to the whole zone, or to limit it to the local node.

Similarly, a client may indicate a lookup domain for a message or connection setup request. This is a TIPC network address that indicates not only where the lookup, i.e., the translation from a port name to a socket address, should first be done, but implicitly also the lookup algorithm to be used.

Two such algorithms are available. Round-robin lookup is used when the lookup domain is non-zero and there is more than one matching server. The server is selected from a circular list whose root entry is stepped forward after each lookup. Closest-first lookup is used when the lookup domain is zero. Here the translation is always performed at the client's node and first looks for a matching node-local socket. If none is found, the algorithm successively looks for matches elsewhere in the cluster and finally in the whole zone.
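
The two selection policies can be illustrated with a small Python sketch. Addresses are modelled as (zone, cluster, node) tuples, and the class and function names are hypothetical. The choice among several equally close servers is arbitrary here (first in the list), since the text does not specify it.

```python
class RoundRobinLookup:
    """Circular list of matching servers; the root steps after each lookup."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.root = 0

    def select(self):
        if not self.servers:
            return None
        chosen = self.servers[self.root]
        self.root = (self.root + 1) % len(self.servers)
        return chosen


def closest_first(servers, client):
    """Prefer a server on the client's node, then cluster, then zone.

    servers is a list of (zone, cluster, node) tuples; client is the
    (zone, cluster, node) tuple of the node doing the lookup.
    """
    for fields in (3, 2, 1):   # same node, same cluster, same zone
        for server in servers:
            if server[:fields] == client[:fields]:
                return server
    return None
```

Round-robin spreads successive requests over all matching servers, whereas closest-first keeps traffic as local as possible and only widens the search when nothing nearer matches.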

Topology Services

TIPC also provides a mechanism for inquiring about or subscribing to the availability of port names or ranges of port names. This functional topology service is built on and uses the contents of the local instance of the name translation table (see Figure 5).

To access this service, a user makes a blocking or nonblocking request to TIPC, asking it to indicate when a name sequence within the requested range is bound or unbound. The request is associated with a timer, giving the duration of the subscription. A timer value of zero causes the call to return or issue a subscription event immediately, making it a pure inquiry, while a value of -1 makes the subscription stay forever, indicating every change pertaining to the requested name sequence.

The physical network topology may be considered a special case of the functional topology, and can be tracked in the same way. Hence, to subscribe to the availability or disappearance of a specific node, a group of nodes, or a whole cluster, the user specifies a dedicated port name sequence representing this "function." A special name type, 0, is used for this purpose, while the lower and upper boundaries are given as TIPC network addresses, which, as described earlier, are in reality 32-bit numbers.

In this particular case, TIPC will by itself bind/unbind the corresponding port name as soon as it discovers or loses contact with a node (see Figure 6).
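
The subscription mechanism, including the node-tracking special case, can be modelled roughly as below. This is a sketch under several assumptions: the class and method names are invented, the 8/12/12-bit address layout is not stated in the text, and expiry of positive timer values is not modelled (only 0 for a pure inquiry and -1 for a permanent subscription are distinguished).

```python
NODE_TYPE = 0  # name type 0 denotes the physical topology


def node_addr(zone, cluster, node):
    # Assumed 8/12/12-bit layout; the article only says "32-bit value".
    return (zone << 24) | (cluster << 12) | node


class TopologyService:
    """Toy model of subscriptions on name sequence bindings."""

    def __init__(self):
        self.bound = set()        # (name_type, lower, upper) currently bound
        self.subscriptions = []   # (name_type, lower, upper, callback)

    @staticmethod
    def _overlaps(seq, name_type, lower, upper):
        s_type, s_lower, s_upper = seq
        return s_type == name_type and s_lower <= upper and lower <= s_upper

    def subscribe(self, name_type, lower, upper, timeout, callback):
        # Report what is bound right now; with timeout 0 that is all.
        for seq in sorted(self.bound):
            if self._overlaps(seq, name_type, lower, upper):
                callback("bound", seq)
        if timeout != 0:
            self.subscriptions.append((name_type, lower, upper, callback))

    def _notify(self, event, seq):
        for name_type, lower, upper, callback in self.subscriptions:
            if self._overlaps(seq, name_type, lower, upper):
                callback(event, seq)

    def bind(self, seq):
        self.bound.add(seq)
        self._notify("bound", seq)

    def unbind(self, seq):
        self.bound.discard(seq)
        self._notify("unbound", seq)
```

In this model, a subscriber watching every node of cluster <1.1> would subscribe to name type 0 with the range node_addr(1, 1, 0) to node_addr(1, 1, 4095), and the service itself would call bind() and unbind() as nodes are discovered or lost.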

Lightweight Connections

The number of active user connections within a big cluster may be extremely large, and each cluster node must be able to establish and shut down thousands of such connections per second. To deal with this dynamism, TIPC connections are made very lightweight, in reality leaving the user to decide the setup/shutdown sequence. The protocol as such does not specify how connections are established and shut down, so an application caring about performance is free to use its own scheme, e.g., only exchanging payload-carrying messages.

For convenience, an alternative TCP-style connection type is also provided on Linux, with exchange of hidden protocol messages and stream-oriented data exchange.

TIPC connections are highly reactive and give the users almost immediate failure indication if anything should happen at the endpoints, or to the media between them. This is due to a connection supervision and abortion mechanism, which takes advantage of the properties of the local operating system to detect process crashes, or the status of the concerned links to detect node crashes or carrier failure. When any of this happens, a special connection shutdown message is spontaneously generated by TIPC and sent to the affected endpoint or endpoints.

Link-Level Reliability

Assuming that most clusters are relatively static in size, some of the tasks normally performed at the transport protocol level have been moved down to the signalling link level. Implementing the retransmission protocol at this level has several advantages. First, it gives better resource utilization since all packets, connectionless and connection-oriented, are funneled into one single packet sequence per node pair. Each packet can hence carry the acknowledgement of many received packets, regardless of their origin, and we need not keep transmission buffers longer than strictly necessary. Second, packet losses can be detected and retransmission performed earlier than would otherwise be the case. Third, guaranteeing packet delivery and sequentiality at the link level eliminates any need for per-packet timers at the transport level - a background timer per link is sufficient to ensure those properties. As a result, we obtain a packet flow that is both smoother and more "traffic driven" than with corresponding transport-level protocols, which often rely on timers to keep traffic running.
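
The idea of one packet sequence per node pair, with acknowledgements piggybacked on every packet and gaps reported as soon as they are seen, can be sketched as below. This is a deliberately simplified illustration, not the TIPC link protocol: out-of-sequence packets are simply dropped and re-requested, and all names and the packet layout are assumptions.

```python
from collections import deque


class LinkEndpoint:
    """One end of a link carrying all traffic between two nodes."""

    def __init__(self):
        self.next_seq = 0        # sequence number of the next packet sent
        self.unacked = deque()   # (seq, payload) kept until acknowledged
        self.expected = 0        # next in-sequence number expected from peer
        self.delivered = []      # payloads handed to the users, in order

    def _packet(self, seq, payload):
        # Every packet carries the number of the last packet received.
        return {"seq": seq, "ack": self.expected - 1, "payload": payload}

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.append((seq, payload))
        return self._packet(seq, payload)

    def retransmit_from(self, seq):
        return [self._packet(s, p) for s, p in self.unacked if s >= seq]

    def receive(self, packet):
        """Process a packet; return the sequence number to retransmit
        from if a gap was detected, otherwise None."""
        # Release every buffer covered by the piggybacked acknowledgement.
        while self.unacked and self.unacked[0][0] <= packet["ack"]:
            self.unacked.popleft()
        if packet["seq"] == self.expected:
            self.delivered.append(packet["payload"])
            self.expected += 1
        elif packet["seq"] > self.expected:
            return self.expected   # gap: ask for retransmission at once
        return None                # in sequence, or a duplicate to ignore
```

Because the receiver notices a gap when the next packet arrives, no per-packet timer is needed to trigger the retransmission, and one returning packet releases all buffers up to its acknowledgement number.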

Internode connectivity is also ensured at the link level. First, a background timer for each link endpoint supervises the traffic flow on the link and initiates a probing procedure if the peer is silent too long. Second, if a link is found to have failed after probing, there is a mechanism to steer its traffic over to the remaining link to the same node, if there is one. In fact, having two links and two carriers between each node pair is considered the normal configuration when using TIPC, as it eliminates any single point of failure in the communication service. The failover procedure used on such occasions is completely transparent to the users, and complies with the same QoS as is guaranteed by each individual link: no message losses, no duplicates, and in-sequence delivery. The relationship between dual links is configurable; while full load sharing is the default behavior, an active-standby scheme is also supported.

Detection time for a failed link, and consequently for a crashed node, is configurable and is by default set to 1500 ms in the current implementation.
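
A background supervision timer of this kind can be sketched as a small state machine. Only the 1500 ms default comes from the text; the point at which probing starts (375 ms of silence here) and all names are hypothetical values chosen for the example, and recovery of a failed link is not modelled.

```python
class LinkSupervisor:
    """Tracks peer silence on one link endpoint."""

    def __init__(self, now_ms, tolerance_ms=1500, probe_after_ms=375):
        self.tolerance_ms = tolerance_ms      # silence that means failure
        self.probe_after_ms = probe_after_ms  # assumed probing threshold
        self.last_heard_ms = now_ms
        self.state = "up"

    def traffic_received(self, now_ms):
        """Any packet from the peer, probe reply or data, proves it is alive."""
        self.last_heard_ms = now_ms
        if self.state == "probing":
            self.state = "up"

    def timer_tick(self, now_ms):
        """Called by the background timer; returns the resulting state."""
        if self.state == "failed":
            return self.state
        silent_ms = now_ms - self.last_heard_ms
        if silent_ms >= self.tolerance_ms:
            self.state = "failed"     # time to fail over to the other link
        elif silent_ms >= self.probe_after_ms:
            self.state = "probing"    # peer silent too long: probe it
        return self.state
```

Since ordinary traffic resets the silence counter, a busy link never needs to be probed; the timer only matters when the peer goes quiet.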

Automatic Neighbour Detection

Signalling links may be configured manually, but this is a tedious task if the size of a cluster runs up to dozens or even hundreds of nodes. Therefore, TIPC uses a designated neighbour detection protocol to establish links between nodes. Within a cluster this protocol is very simple. Each starting node uses the multicast or broadcast capability of the carrying media to announce its existence, and expects a corresponding unicast response from all nodes recognizing it as part of the cluster.

Between clusters, both multicast and a unicast "pilot" link may be used, resulting in a link pattern where each node in one cluster has links to a configurable number of nodes (two by default) in the other cluster.

A Useful Toolbox

Within Ericsson, TIPC has proven to be a very useful base for the design of high-availability clusters. It is our hope that this experience will be repeated by others now that the potential of advanced clustering is becoming more widely recognized.
About Jon Maloy
Jon Maloy is a researcher at the Open Systems Lab (Ericsson Corporate Unit of Research) located in Montreal.
His main research domain is cluster computing, with focus on cluster communication.
He received a master's degree in electrical engineering from the Royal Institute of Technology, Stockholm, in 1988.
