VCS ( Veritas Cluster Services ) Beginners lesson – Cluster Membership & IO Fencing

The current members of the cluster are the systems that are actively participating in the cluster. It is critical for HAD to accurately determine current cluster membership in order to take corrective action on system failure and maintain overall cluster topology.

A change in cluster membership is one of the starting points of the logic that determines whether HAD needs to perform any fault handling in the cluster. There are two aspects to cluster membership: the initial joining of the cluster, and how membership is determined once the cluster is up and running.

Before getting to the actual topic, I would like to make one general point that is useful for every learner.

Every day we deal with many different technologies, each built by brilliant minds to solve problems that existed before that technology was introduced.

If, while learning, we focus only on the features of a technology and make no attempt to investigate why those features exist, we will always end up with half knowledge. It is always wise to spend some time understanding the history of a technology once you choose to master it.

 

Initial joining of systems to cluster membership

When the cluster initially boots, LLT determines which systems are sending heartbeat signals, and passes that information to GAB. GAB uses this information in the process of seeding the cluster membership.

Seeding a Cluster

Seeding a new cluster simply means ensuring that the cluster starts up with the correct number of nodes configured in it, so that a single cluster does not start as multiple subclusters.

Cluster seeding happens as follows:

  • When the cluster initially boots, all the nodes are in the unseeded state.
  • GAB on each system reads the total number of systems configured in /etc/gabtab from the entry

/sbin/gabconfig -c -nx  ( x is replaced with the total number of cluster nodes ).

  • When GAB on each system detects that the correct number of systems are running, based on the number declared in /etc/gabtab and input from LLT, it will seed.
  • HAD will start on each seeded system. HAD will only run on a system that has seeded.
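The seeding rule above can be sketched in a few lines of Python. This is only a toy model and not how GAB is implemented: the function names are made up for illustration, and it assumes that the list of nodes reported by LLT includes the local node.

```python
import re

def required_nodes_from_gabtab(gabtab_text):
    """Return N from a '/sbin/gabconfig -c -nN' entry, or None if absent."""
    match = re.search(r"gabconfig\s+-c\s+-n\s*(\d+)", gabtab_text)
    return int(match.group(1)) if match else None

def can_seed(gabtab_text, nodes_heard_by_llt):
    """A node seeds only when LLT sees at least the configured number of nodes."""
    required = required_nodes_from_gabtab(gabtab_text)
    if required is None:
        return False
    return len(set(nodes_heard_by_llt)) >= required

# Hypothetical two-node cluster.
gabtab = "/sbin/gabconfig -c -n2\n"
print(can_seed(gabtab, ["node-1"]))            # only one node heard so far
print(can_seed(gabtab, ["node-1", "node-2"]))  # both nodes heard
```

With only one node heard the check fails and the node stays unseeded, which is why HAD does not start there.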

Manual seeding of a cluster node:

Manual seeding of a cluster node is not recommended unless the system administrator is sure about the consequences. It is required only in rare situations, such as when a cluster node is down for maintenance during the cluster boot.

Before seeding a cluster node manually, make sure that the nodes are able to send and receive cluster heartbeats to and from each other successfully. This is important to avoid a possible cluster network partition caused by the newly joining node.

The command used to seed a cluster node manually is

       # /sbin/gabconfig -c -x

This will seed all the nodes in communication with the node where the command is run.


 

Ongoing cluster membership

Once the cluster is up and running, a system remains an active member of the cluster as long as peer systems receive a heartbeat signal from that system over the cluster interconnect. A change in cluster membership is determined as follows:

  • When LLT on a system no longer receives heartbeat messages from a system on any of the configured LLT interfaces for a predefined time, LLT informs GAB of the heartbeat loss from that specific system.
  • This predefined time is 16 seconds by default, but can be configured. It is set with the set-timer peerinact command as described in the llttab manual page.
  • When LLT informs GAB of a heartbeat loss, the systems that are remaining in the cluster coordinate to agree which systems are still actively participating in the cluster and which are not. This happens during a time period known as GAB Stable Timeout (5 seconds).
  • VCS has specific error handling that takes effect in the case where the systems do not agree.
  • GAB marks the system as DOWN, excludes the system from the cluster membership, and delivers the membership change to the fencing module.
  • The fencing module performs membership arbitration to ensure that there is not a split brain situation and only one functional cohesive cluster continues to run.
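To make the heartbeat-loss rule concrete, here is a minimal Python sketch. It is only an illustration of the rule described above, not LLT code: the data layout (a dictionary of last-heartbeat timestamps per link) and the function name are assumptions made for this example. The 16 second value is the default peerinact time mentioned above.

```python
PEERINACT_SECONDS = 16  # default from the text; configurable via set-timer peerinact

def peers_with_lost_heartbeat(last_heartbeat, now, peerinact=PEERINACT_SECONDS):
    """last_heartbeat maps peer -> {link name: time of last heartbeat}.

    A peer is reported only when EVERY configured link has been silent
    for longer than peerinact seconds.
    """
    lost = []
    for peer, links in last_heartbeat.items():
        if links and all(now - seen > peerinact for seen in links.values()):
            lost.append(peer)
    return sorted(lost)

# Hypothetical timestamps, in seconds.
last_seen = {
    "node-2": {"link1": 100.0, "link2": 83.0},  # link2 silent, link1 still fresh
    "node-3": {"link1": 80.0, "link2": 82.0},   # silent on both links
}
print(peers_with_lost_heartbeat(last_seen, now=100.0))
```

A peer that is silent on only one link is not reported; that case is what the jeopardy membership discussed later is about.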

We will discuss all of the above points in detail in this post.

The diagram below shows how data is accessed from the shared resources during regular functioning of the cluster. Once the cluster is properly seeded and configured with high-priority as well as low-priority cluster interconnects, it starts functioning in the expected manner.

In the diagram you can see two cluster nodes, “node-1” and “node-2”, interconnected with LLT heartbeat connections, each running a copy of HAD (the VCS engine).

HAD, together with GAB and LLT, makes sure that each node accesses the shared resources in a controlled manner so that there is no conflict in access.

Whenever there is a node failure in the cluster, VCS automatically fails over the service groups and resources from the failed node to a working node of the cluster.

Like any other technology, VCS also has to deal with some exceptional situations, such as trouble with the cluster interconnects or with HAD rather than an actual cluster node failure. The problem in these scenarios is that VCS cannot differentiate a cluster node failure from a cluster interconnect or HAD failure unless a logical solution is prepared for it.

The brilliant minds behind VCS initially came up with two solutions to deal with the following two scenarios.

Scenario 1. The cluster interconnects fail one by one, leaving only the last interconnect working

In the normal case, whenever LLT on a system no longer receives heartbeat messages from another system on any of the configured LLT interfaces, GAB reports a change in membership to the VCS engine.

When a cluster node has trouble with the interconnects and has only one interconnect link remaining to the cluster, GAB can no longer reliably discriminate between loss of a system and loss of the network. The reliability of the system’s membership is considered at risk. In this situation, a special membership category called jeopardy membership is assigned to the cluster node with the single cluster interconnect.

When a system is placed in jeopardy membership status, two actions occur if the system then loses its last interconnect link:

  • Service groups running on the system are placed in the autodisabled state. A service group in the autodisabled state may fail over on a resource or group fault, but cannot fail over on a system fault until the autodisabled flag is manually cleared by the administrator.
  • VCS operates the system as a single-system cluster. Other systems in the cluster are partitioned off in a separate cluster membership.
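As a rough illustration of how membership depends on the number of working interconnect links, here is a small Python sketch. The function name and the three labels are chosen for this example only; the real decision is made inside GAB.

```python
def membership_state(working_links):
    """Classify a peer by how many interconnect links still carry its heartbeats."""
    if working_links >= 2:
        return "regular"
    if working_links == 1:
        return "jeopardy"   # GAB cannot tell a node failure from a network failure
    return "down"           # no heartbeat on any configured link

# Hypothetical peer losing its links one by one.
for links in (2, 1, 0):
    print(links, membership_state(links))
```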

 

 

Scenario 2. The HAD daemon fails on one cluster node

Daemon Down Node Alive (DDNA) is a condition in which the VCS high availability daemon (HAD) on a node fails, but the node is running. When HAD fails, the hashadow process tries to bring HAD up again. If the hashadow process succeeds in bringing HAD up, the system leaves the DDNA membership and joins the regular membership.

In a DDNA condition, VCS does not have information about the state of service groups on the node. So, VCS places all service groups that were online on the affected node in the autodisabled state. The service groups that were online on the node cannot fail over.

Manual intervention is required to enable failover of autodisabled service groups. The administrator must release the resources running on the affected node, clear resource faults, and bring the service groups online on another node.
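The DDNA rule can be summarised with a short Python sketch. This is a simplified model built only from the description above; the function names and the service group name are hypothetical.

```python
def node_condition(node_running, had_running):
    """Describe a node from the point of view of the rest of the cluster."""
    if node_running and had_running:
        return "regular membership"
    if node_running:
        return "DDNA"        # daemon down, node alive
    return "node down"

def groups_to_autodisable(condition, online_groups):
    """In a DDNA condition every group that was online on the node is autodisabled."""
    return list(online_groups) if condition == "DDNA" else []

# Hypothetical node where HAD has died but the operating system is still up.
condition = node_condition(node_running=True, had_running=False)
print(condition, groups_to_autodisable(condition, ["websg"]))
```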


The above two solutions help VCS deal with the major part of the problems with the cluster interconnects and HAD, but there is a really challenging scenario where neither solution works, and a more complete solution is needed. Of course, the minds behind VCS offered an effective solution for that as well.

Let us first discuss the problem, and then we will go to the solution.

 

Scenario 3: All cluster interconnects fail at the same time, and the cluster splits into multiple subclusters

As we discussed earlier, HAD (the VCS engine) is the brain of the cluster, and each node of the cluster runs one copy of HAD loaded in its memory. This VCS engine makes all the cluster nodes work together under predefined rules to access the shared resources and provide high availability to the applications.

When a cluster node is disconnected from the main cluster because all the cluster interconnects fail at the same time, it forms a subcluster, and the copy of HAD running in its memory starts acting like a second brain of the cluster. This second brain (HAD on the disconnected node) starts competing with the original brain (HAD in the main cluster) to gain control of the cluster resources.

We know what would happen if a human brain split into two, with each half trying to control the body: the person would become gravely ill. The same applies to the cluster, and this condition leads to data destruction on the shared resources. The VCS designers named this the “SPLIT BRAIN” condition.

In the above diagram, at step (1) all cluster interconnects fail, at step (2) the HAD daemons running on both nodes of the cluster start acting like separate brains, and finally at step (3) both nodes try to access the shared resources forcibly.

So what is the solution for the split brain condition? The answer is “Membership Arbitration”.

 

Membership Arbitration

Membership arbitration is simply a set of rules to be followed whenever a cluster member is completely disconnected from the other cluster members.

Membership arbitration is necessary on a perceived membership change because systems may falsely appear to be down. When LLT on a system no longer receives heartbeat messages from another system on any configured LLT interface, GAB marks the system as DOWN. However, if the cluster interconnect network failed, a system can appear to be failed when it actually is not. In most environments when this happens, it is caused by an insufficient cluster interconnect network infrastructure, usually one that routes all communication links through a single point of failure.

If all the cluster interconnect links fail, it is possible for one cluster to separate into two subclusters, each of which does not know about the other subcluster. The two subclusters could each carry out recovery actions for the departed systems. This is termed split brain.

In a split brain condition, two systems could try to import the same storage and cause data corruption, have an IP address up in two places, or mistakenly run an application in two places at once.
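To see how a cluster falls apart into subclusters when the interconnect links fail, the following Python sketch groups nodes by the links that still work. It is a plain connected-components search, used here only as an illustration; the function name and the way links are represented are assumptions for this example.

```python
def subclusters(nodes, working_links):
    """Group nodes into sets that can still reach each other over heartbeat links."""
    neighbours = {node: set() for node in nodes}
    for a, b in working_links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

nodes = ["node-1", "node-2"]
print(subclusters(nodes, [("node-1", "node-2")]))  # interconnect healthy: one cluster
print(subclusters(nodes, []))                      # all links failed: two subclusters
```

Each group in the second result believes it is the whole cluster, which is exactly the split brain situation.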

Membership arbitration guarantees against such split brain conditions.

There are two components in membership arbitration:

1. Fencing Module

2. Coordinator Disks

The diagram below explains how the fencing module starts during cluster startup.

The fencing module starts up as follows:

The coordinator disks are placed in a disk group. This allows the fencing startup script to use Veritas Volume Manager (VxVM) commands to easily determine which disks are coordinator disks, and what paths exist to those disks. This disk group is never imported, and is not used for any other purpose.

Step 1. The fencing start up script on each system uses VxVM commands to populate the file /etc/vxfentab with the paths available to the coordinator disks.

Step 2. The fencing driver examines GAB port B for membership information.

Step 3. If no other system is up and running, this is the first system up and is considered to have the correct coordinator disk configuration.

Steps 5, 6 and 7. When a new member joins and its fencing module starts, it checks GAB port B for existing nodes and finds that node-1 is already running in the cluster.

Step 8. Node-2 then requests the coordinator disk configuration from node-1. The system with the lowest LLT ID responds with a list of the coordinator disk serial numbers. If there is a match, the new member joins the cluster. If there is not a match, vxfen enters an error state and the new member is not allowed to join. This process ensures that all systems communicate with the same coordinator disks.
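The join check in step 8 can be sketched in Python as follows. This is only a model of the comparison described above: the function names, the LLT IDs and the serial numbers are invented for the example, and the real exchange happens inside the vxfen driver.

```python
def responder(running_nodes):
    """running_nodes maps node name -> LLT ID; the lowest LLT ID answers."""
    return min(running_nodes, key=running_nodes.get)

def may_join(new_member_serials, responder_serials):
    """The new member joins only if both sides see the same coordinator disks."""
    return set(new_member_serials) == set(responder_serials)

# Hypothetical cluster state.
running = {"node-1": 0, "node-3": 2}
cluster_serials = ["SN-100", "SN-200", "SN-300"]

print(responder(running))
print(may_join(["SN-300", "SN-100", "SN-200"], cluster_serials))  # same disks
print(may_join(["SN-100", "SN-200", "SN-999"], cluster_serials))  # mismatch
```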

How does the fencing driver determine whether a preexisting split brain condition exists?

This is done by verifying that any system that has keys on the coordinator disks can also be seen in the current GAB membership. If this verification fails, the fencing driver prints a warning to the console and system log and does not start.
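This verification amounts to a set comparison, which the following Python sketch illustrates. The data layout (a dictionary of keys per coordinator disk) and the names are assumptions made for this example.

```python
def nodes_with_keys_but_not_in_gab(keys_on_coordinator_disks, gab_membership):
    """Return nodes that hold keys on the coordinator disks but are not seen by GAB."""
    registered = set()
    for keys in keys_on_coordinator_disks.values():
        registered |= set(keys)
    return sorted(registered - set(gab_membership))

# Hypothetical state: node-2 still has keys on the disks but is not in GAB membership.
keys = {
    "SN-100": {"node-1", "node-2"},
    "SN-200": {"node-1", "node-2"},
    "SN-300": {"node-1", "node-2"},
}
unseen = nodes_with_keys_but_not_in_gab(keys, ["node-1"])
fencing_may_start = not unseen
print(unseen, fencing_may_start)
```

When the list of unseen nodes is not empty, a possible preexisting split brain exists and the fencing driver does not start.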

Final step: If all verifications pass, the fencing driver on each system registers keys with each coordinator disk. (In the diagram I have marked this task as steps 4 and 9, but it should actually carry the last numbers; sorry for that.)


How does the fencing algorithm deal with cluster interconnect failures?

From the above diagram we can understand the function of the fencing algorithm as follows:

Step 1. When Node-1 appears to have failed (due to the cluster interconnect failure), Node-2 initiates the fencing operation.

Step 2. The GAB module on Node-2 determines that Node-1 has failed, based on the loss of heartbeat signal reported by LLT. GAB passes the membership change to the fencing module on each system in the cluster.

Step 3. Node-2 gains control of the coordinator disks by ejecting the key registered by Node-1 from each coordinator disk. The ejection takes place one disk at a time, in the order of the coordinator disks’ serial numbers. When the fencing module on Node-2 successfully controls the coordinator disks, HAD carries out any policy associated with the membership change.

Step 4. Node-1 is blocked from accessing the shared storage, if that shared storage was configured in a service group that has now been taken over by Node-2 and imported.
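The race for the coordinator disks can be modelled with a short Python sketch. It is a toy model and not the vxfen implementation. Two points in it are assumptions that go beyond the text above: a node whose own key has already been ejected from a disk cannot eject anyone from that disk, and a node wins when it controls a majority of the coordinator disks. The serial numbers are invented.

```python
def race_for_coordinator_disks(disks, racer, victim):
    """disks maps serial number -> set of registered keys (modified in place).

    The racer ejects the victim's key disk by disk, in serial-number order.
    Returns True when the racer ends up controlling a majority of the disks.
    """
    won = 0
    for serial in sorted(disks):       # eject in serial-number order
        keys = disks[serial]
        if racer in keys:              # assumption: only a registered node can eject
            keys.discard(victim)
            won += 1
    return won > len(disks) // 2       # assumption: majority of disks wins

coordinator_disks = {
    "SN-300": {"node-1", "node-2"},
    "SN-100": {"node-1", "node-2"},
    "SN-200": {"node-1", "node-2"},
}
# node-2 reaches the disks first; node-1 tries afterwards and finds its keys gone.
node2_wins = race_for_coordinator_disks(coordinator_disks, "node-2", "node-1")
node1_wins = race_for_coordinator_disks(coordinator_disks, "node-1", "node-2")
print(node2_wins, node1_wins)
```

Because every node walks the disks in the same order, the node that gets to the first disk first tends to win all of them, so only one subcluster survives.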

So far so good: the VCS designers gave us good solutions to deal with the complicated split brain condition. Now the question is, do the difficulties end here? The answer is “No”.

There are some other scenarios where membership arbitration (using the fencing module and coordinator disks) alone cannot provide data protection in the cluster. They are:

  • A system hang causes the kernel to stop processing for a period of time.
  • The system resources were so busy that the heartbeat signal was not sent.
  • A break and resume function is supported by the hardware and executed. Dropping the system to the system controller level with a break command can result in a heartbeat signal timeout.

 In these types of situations, the systems are not actually down, and may return to the cluster after cluster membership has been recalculated. This could result in data corruption as a system could potentially write to disk before it determines it should no longer be in the cluster.

Combining membership arbitration with data protection of the shared storage eliminates all of the above possibilities for data corruption.

Data protection fences off (removes access to) the shared data storage from any system that is not a current and verified member of the cluster. Access is blocked by the use of SCSI-3 persistent reservations.

Membership arbitration combined with data protection is termed I/O Fencing.

From the above I/O fencing diagram you can see that the shared disks are configured with SCSI-3 persistent reservation enabled. Enabling SCSI-3 PR along with membership arbitration guarantees data protection in the rare scenarios mentioned above.

What is SCSI-3 Persistent Reservation?

 SCSI-3 Persistent Reservation (SCSI-3 PR) supports device access from multiple systems, or from multiple paths from a single system. At the same time it blocks access to the device from other systems, or other paths.

VCS logic determines when to online a service group on a particular system. If the service group contains a disk group, the disk group is imported as part of the service group being brought online. When using SCSI-3 PR, importing the disk group puts registration and reservation on the data disks. Only the system that has imported the storage with SCSI-3 reservation can write to the shared storage. This prevents a system that did not participate in membership arbitration from corrupting the shared storage.
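The effect of registration and reservation on a data disk can be pictured with the following Python sketch. It is a deliberately simplified model and not the SCSI-3 command set: the class and method names are made up, and it assumes that once a reservation is in place, only registered nodes may write.

```python
class SharedDisk:
    """Toy model of a data disk protected by persistent reservations."""

    def __init__(self):
        self.registered_keys = set()
        self.reservation_holder = None

    def import_by(self, node):
        """Importing the disk group registers the node and places the reservation."""
        self.registered_keys.add(node)
        self.reservation_holder = node

    def write(self, node):
        """Once the disk is reserved, only a registered node may write."""
        if self.reservation_holder is not None and node not in self.registered_keys:
            raise PermissionError(f"{node} is fenced off from this disk")
        return True

disk = SharedDisk()
disk.import_by("node-2")          # node-2 brings the service group online
print(disk.write("node-2"))       # the importing node can write
try:
    disk.write("node-1")          # a node that resumes after a hang cannot
except PermissionError as error:
    print(error)
```

This is what protects the data in the hang and break scenarios: a node that wakes up late is rejected by the disk itself, before it even learns that it is no longer a cluster member.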

SCSI-3 PR ensures persistent reservations across SCSI bus resets.  

*** This ends the post here. Please drop your comments **** 

Announcement

I believe most of you already know that Symantec is going to release VCS 6.0 very soon. Before releasing the actual product, Symantec is giving users the opportunity to understand and discuss its features directly with the VCS development team through a Symantec Connect group.

Symantec is sharing videos with the group members that present the new features of VCS 6.0. If you want to experience VCS 6.0 and its features, I would recommend that you join the group.

Instructions to join the beta program on the Symantec Connect group:

Please go through the link below:
https://www-secure.symantec.com/connect/groups/storage-foundation-and-veritas-cluster-server-60-beta-program

Note: The primary requirement is that you have an NDA in place, for which there is a document you can fill out.


Posted by Ramdev

18 Comments on “VCS ( Veritas Cluster Services ) Beginners lesson – Cluster Membership & IO Fencing”

  • Ramesh
    24 December, 2011, 13:35

    Ver good article… Is there any way to setup VCS in laptop using vmware?

  • Madhav
    7 August, 2012, 11:55

    Good article … the thing is liked most & total agree with is ..
    “During the phase of learning if our focus is just on the features of the technology without making any attempt to investigate the reason behind the existence of those features, we will always end up with half knowledge. It is always wise to spend some time to understand the history of any technology, once you chose to master in it.”

  • Ramdev
    13 August, 2012, 14:41

    @Madhav – Thanks for the comment.

  • vivek
    10 September, 2012, 22:55

    Great explanation…!!!!! Thanks for all your effort.. The article best explains the technology behind VCS :)

  • vivek
    10 September, 2012, 23:00

    It would be helpful, if you can post an article on Weblogic technologies and Application hosting process. Thanks..!!!

  • Ramdev
    11 September, 2012, 1:16

    Vivek, I will ask our middleware folks for this article.

  • Daniel
    27 October, 2012, 12:53

    Excellent article. Extremely useful.
    What is an NDA?  - you mention it in the instructions to join the Beta Program at the bottom of the article. 

  • Ramdev
    27 October, 2012, 16:01

    Hi Daniel,  NDA – Non Disclosure Agreement.  The VCS 6.0 is already in the market, and the beta program is no longer valid.

  • RamaRao
    21 November, 2012, 8:24

    Step by step explanation  is very good and helpful for new VCS learners.
    Thanks for efforts.

  • Ramdev
    22 November, 2012, 0:21

    Ramarao, thank you.

  • Pavanisastry
    2 January, 2013, 9:21

    Thanks to Ramdev

  • sagar.parit
    30 January, 2013, 7:40

    Hi, I want start learning the VCS; but on net number of docs are avalabel so i am tooo confused between which dos is good to understand VCS

    Pls help me.

  • Ramdev
    30 January, 2013, 8:28

    Hi, I would recommend to go with this document – http://sfdoccentral.symantec.com/sf/5.1/solaris/pdf/vcs_admin.pdf

  • sagar.parit
    30 January, 2013, 18:02

    Hi, Thnks sir quickly reply.But honestly say 720 pages.
    What are topics that must to be read or which 1 i can exclude. BECOZ MAIN REASON IS MY DESKTOP IS 32-BIT SO I UNABLE TO PERFORM VCS ON MY PC. & IN OFFICE IT WONT BE POSSIBLE. so which are topics are must be to be read to understand vcs & also help to face the interview…….

    Again many thanks 4 reply………

  • Ramdev
    31 January, 2013, 3:37

    Sagar, for the initial learning you can focus on below

    Cluster heartbeats – know about GAB + LLT
    Cluster daemons and Cluster startup scripts
    Cluster Service Groups and Resources
    haxx commands related to Cluster oeprations – starting , stopping, switchover, freeze
    Cluster file – main.cf
    Different Cluster States, Service Group states and Resource States

  • sagar.parit
    31 January, 2013, 11:50

    Hi,
    Many many thanks 4 to giving best path to learn & start the VCS….

  • Maheshbabu
    26 April, 2013, 12:44

    nice explaniation about the vcs i/o fencing, spilt brain condition. Sir, Keep  posting good concepts on linux also. It will be very useful for beginners also. PLEASE DO NEED FUL

Copyright © 2009 unixadminschool.com. All rights reserved.