
unixadminschool.com

Day in SA Life: Working Remotely on a Hardware Issue

Below is a sample production environment:

1. The server infrastructure is located in two different data centers (DCs): one for production servers and one for disaster recovery (DR) servers.

2. System administrators and application developers sit in different countries and work remotely.

3. All servers have remote console connectivity, and both DCs have a 24×7 support team for DC operations.

Problem Scenario:

One fine morning, you receive a mail or call from the server monitoring (L1) team saying that alerts are appearing on the server "Prod-Server", and that they have opened a support ticket (incident ticket) and assigned it to your team.

SA Response Procedure, for the incident ticket:

Step 1. Gathering Server Information 

Using the system name, the SA gathers the following information for further diagnosis and troubleshooting:

  • System IP address, if the server name is not in DNS
  • System location: DC name and rack location
  • Contact person in the application/business team
  • Server criticality: whether the server is Prod or DR, and whether any application is currently using it
  • Server serial number, in case a call has to be raised with the hardware vendor for a replacement

Step 2. Confirming the Issue 

Connect to the server using its IP address or host name and investigate the issue. If you cannot connect, or the server is not responding, connect to the server console over the remote console connection. Once you are connected, check the system logs and hardware status:

  • Check /var/adm/messages
  • Use the dmesg command
  • For disk errors, use the format and iostat -En commands to confirm the failed device
  • For other hardware errors, use the prtdiag and prtconf commands
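
On a server with many disks, the per-device header lines of iostat -En are the quickest thing to scan. The sketch below is one possible way to do that in Python. It assumes the usual header layout (device name followed by the Soft, Hard, and Transport error counters). The sample text, device names, serial numbers, and the suspect_devices helper are invented for illustration and do not come from a real server.

```python
import re

# Illustrative text in the layout `iostat -En` prints on Solaris.
# Device names, serial numbers, and counters are invented.
SAMPLE = """\
c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE  Product: ST373307LSUN72G Revision: 0507 Serial No: 0000000001
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
c0t1d0           Soft Errors: 2 Hard Errors: 15 Transport Errors: 3
Vendor: SEAGATE  Product: ST373307LSUN72G Revision: 0507 Serial No: 0000000002
Size: 73.40GB <73400057856 bytes>
Media Error: 12 Device Not Ready: 0 No Device: 3 Recoverable: 2
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# Matches only the first line of each device block.
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)


def suspect_devices(text):
    """Return (device, hard_errors, transport_errors) for devices with errors."""
    suspects = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if not m:
            continue  # detail lines (Vendor, Size, Media Error, ...) are skipped
        hard, transport = int(m["hard"]), int(m["transport"])
        if hard > 0 or transport > 0:
            suspects.append((m["dev"], hard, transport))
    return suspects


for dev, hard, transport in suspect_devices(SAMPLE):
    print(f"{dev}: hard={hard} transport={transport}")
```

Treating any non-zero hard or transport counter as suspicious is an assumption of this sketch. The counters only narrow the search, so confirm the device with format and /var/adm/messages before raising a vendor call.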

Step 3. Collecting Detailed Diagnosis Information to raise a Vendor (Hardware) request for Hardware Maintenance

Gather the information the vendor needs to carry out the hardware replacement or maintenance, using vendor-specific tools. Some example tools:

  • For Sun Solaris / Sun hardware diagnosis: run the 'explorer' utility
  • For Red Hat Linux issues: run the 'sosreport' utility
  • For Fujitsu hardware issues: run the 'fjsnap' utility
  • For EMC-related storage issues: run the 'emcgrab' utility
  • For HP hardware issues: collect the iLO logs, or run the 'hpacucli' or 'hpasmcli' utilities
  • For Veritas issues: run the 'VRTSexplorer' utility
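
A team that supports several platforms may find it convenient to keep this list as a small lookup table, so that whoever is on call can see at once which collectors apply to a given server. The following is only a sketch. The tag names ("solaris", "hp", and so on) and the collectors_for helper are hypothetical, and the tool names are the ones listed above.

```python
# Ordered (tag, collector) pairs; the tags are hypothetical labels
# an SA team might attach to each server in its inventory.
COLLECTORS = [
    ("solaris", "explorer"),
    ("redhat", "sosreport"),
    ("fujitsu", "fjsnap"),
    ("emc", "emcgrab"),
    ("hp", "hpacucli / hpasmcli and iLO logs"),
    ("veritas", "VRTSexplorer"),
]


def collectors_for(tags):
    """Return the collectors that apply to a server with the given tags."""
    wanted = {t.lower() for t in tags}
    return [tool for tag, tool in COLLECTORS if tag in wanted]


# Example: a Sun server running Veritas volumes.
print(collectors_for({"Solaris", "Veritas"}))
```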

Step 4. Vendor Coordination for Further Diagnosis

If the problem is with the hardware and the hardware vendor needs to be involved in troubleshooting, follow a procedure like this:

  • Call the vendor's global customer care number
  • Give them the serial number so they can check the contract and warranty
  • Give them the contact person on your team for this call
  • Ask for the case number and send them the log files you collected
  • Ask them to investigate and advise on the replacement by mail or call

Step 5. Once the investigation is complete, and if the device needs to be replaced, ask the vendor whether the component is hot-swappable or whether the replacement needs server downtime.

Step 6. If the maintenance requires downtime, inform the application team about the situation and ask for a suitable maintenance window.

Step 7. Once the application team gives you the scheduled downtime, call the vendor back and tell them the agreed time for the maintenance.

a. Sometimes the vendor simply couriers the component with instructions and asks your local support people to perform the maintenance, e.g. external power supply, HDD.
b. In some cases the vendor sends an expert field engineer to perform critical hardware maintenance, e.g. memory replacement, motherboard or system board replacement.

Step 8. Once the maintenance schedule is confirmed with both the application team and the vendor, send a mail to the data center support team with the following information and the internal IM number:

  • Vendor engineer details / component courier details
  • Server name, serial number, and location
  • Action to perform: whether to escort the vendor engineer or to perform the replacement on the server themselves
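
Because this mail carries the same fields every time, it is easy to generate from a template so that nothing is forgotten. Below is a minimal sketch using Python's standard email.message module. The build_dc_mail helper, the subject format, and every value in the example (ticket number, serial number, rack, engineer name) are placeholders assumed for illustration.

```python
from email.message import EmailMessage


def build_dc_mail(ticket, server, serial, location, vendor_details, action):
    """Build the notification mail for the data center support team."""
    msg = EmailMessage()
    msg["Subject"] = f"[{ticket}] Hardware maintenance on {server}"
    msg.set_content(
        f"Internal IM number: {ticket}\n"
        f"Vendor engineer / courier details: {vendor_details}\n"
        f"Server name: {server}\n"
        f"Serial number: {serial}\n"
        f"Location: {location}\n"
        f"Action: {action}\n"
    )
    return msg


# All values below are made-up placeholders.
mail = build_dc_mail(
    ticket="IM-0000",
    server="Prod-Server",
    serial="SN-PLACEHOLDER",
    location="Production DC, rack R00",
    vendor_details="Field engineer (name and ID to be confirmed by vendor)",
    action="Escort the vendor engineer to the rack",
)
print(mail["Subject"])
print(mail.get_content())
```

The sender and recipient headers are left out on purpose. They depend on the local mail setup, and sending the message (for example with smtplib) is outside this sketch.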

Step 9. Most of the time, the SA has to perform some pre-maintenance tasks before the actual maintenance work starts, e.g.

  • Sending an information mail to the application and monitoring teams, so that they do not panic over error messages during the maintenance
  • Detaching failed disks from Veritas / SVM / SDS
  • Stopping services
  • Shutting down the machine, in case downtime is required

Step 10. Once the maintenance is complete, the SA performs post-maintenance tasks, e.g.

  • Attaching the disks back to the mirror
  • Starting the server and starting services
  • Informing the application team about the server status and asking them to confirm that the application is running
  • Asking the monitoring team to resume monitoring
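
Most post-maintenance tasks are the undo of a pre-maintenance task, so one way to keep the two lists in sync is to record them as pairs and walk the pairs in reverse after the maintenance. The sketch below illustrates the idea. The exact pairing and the reverse ordering are assumptions of this example, not a fixed procedure.

```python
# Each entry pairs a pre-maintenance task with the task that undoes it.
# The pairing and ordering are illustrative assumptions.
PAIRS = [
    ("Notify application and monitoring teams",
     "Ask application team to confirm status; ask monitoring team to resume"),
    ("Detach failed disk from the mirror",
     "Attach replaced disk to the mirror"),
    ("Stop services",
     "Start services"),
    ("Shut down the server (if downtime is required)",
     "Start the server"),
]


def pre_tasks(pairs):
    """Pre-maintenance tasks, in the order they are carried out."""
    return [pre for pre, _ in pairs]


def post_tasks(pairs):
    """Post-maintenance tasks: the undo steps, last pre-task undone first."""
    return [post for _, post in reversed(pairs)]


print("Before maintenance:")
for task in pre_tasks(PAIRS):
    print("  -", task)
print("After maintenance:")
for task in post_tasks(PAIRS):
    print("  -", task)
```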

Step 11. Finally, the most important task: close the tickets assigned to your team, with appropriate resolution information covering the error and the troubleshooting procedure.

Posted by Ramdev
8 Comments on "Day in SA Life: Working Remotely on a Hardware Issue"

  • pavan
    23 October, 2011, 9:44

    very good real time scenario

  • Raja
    23 October, 2011, 22:57

    Good!!

  • Deepak
    13 November, 2011, 6:16

    Good, Nice Post. Thank you

  • Michael
    13 November, 2011, 18:19

    @Ram you missed out the deadly on-call support :) :)

  • dani
    22 February, 2012, 21:35

    great, very good post for starters.

  • ramakrishna
    19 December, 2012, 5:51

    very very good scenario
    REALLY AWESOME

  • Ramdev
    20 December, 2012, 2:06

    Ramakrishna – welcome to unixadminschool.com

  • Mahesh babu.R
    22 April, 2013, 8:38

    Hi sir,
    The explanation above is very good. I have a question: if the server is running several service groups or applications, is it possible to shift them to another server for the time being, as in Veritas?
    I would request you to post more explanations like this on this kind of administration, since I am a learner.

Copyright © 2009 unixadminschool.com. All rights reserved.