unixadminschool.com

Solaris Troubleshooting: How the System Generates Utilization Statistics (usr, sys, idle) in Solaris 10 and Prior

In Solaris 10, the method used to calculate usr, sys and idle time changed. In addition, the waitio time is now hard-coded to 0 in all instances (because it was never a very meaningful statistic).

Because of the changes, running the same workload on S9 and S10 may produce different statistics even though the workload is really running similarly on both systems. This can lead to confusion, especially with respect to capacity planning. Here we attempt to address the sources of differences.

In addition, there is another effect in S10: interrupt time is not necessarily charged as sys time (as described below). This was addressed in later versions.

How was utilization calculated prior to S10?

Prior to S10, utilization was calculated using a sampling technique, as described in Leffler et al., Design & Implementation of BSD UNIX, pp. 51-52. A simple summary of the technique is:

On every clock tick (in level 10 interrupt context) the CPU that handles the clock interrupt will:

1) Inspect each CPU to see what task is running (usr/sys/idle) and chalk up 1 more tick to the appropriate category.

2) Invoke callout_schedule() to process any timeouts that are due for this clock tick.

This means utilization is a statistical sampling of the CPU activity on the clock tick (100Hz by default, but changeable) and it’s reasonably accurate unless there is significant callout scheduling.
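The per-tick accounting described above can be sketched as a small simulation. This is only an illustration of the idea, not the kernel code: the function names, the two-CPU system and the workload pattern are all hypothetical.

```python
from collections import Counter

HZ = 100  # clock ticks per second (the default mentioned above)

def sample_tick(cpu_states, counters):
    """On one clock tick, charge one tick to whatever each CPU is running."""
    for cpu_id, state in enumerate(cpu_states):
        counters[cpu_id][state] += 1

def utilization(counter):
    """Convert accumulated tick counts into percentages."""
    total = sum(counter.values())
    return {s: 100.0 * counter[s] / total for s in ("usr", "sys", "idle")}

# Hypothetical 2-CPU system observed for one second (100 ticks).
counters = [Counter(), Counter()]
for tick in range(HZ):
    cpu0 = "usr" if tick % 4 else "sys"    # sys on every 4th tick, usr otherwise
    cpu1 = "idle" if tick % 2 else "usr"   # alternates between usr and idle
    sample_tick([cpu0, cpu1], counters)

print(utilization(counters[0]))
print(utilization(counters[1]))
```

The reported percentages are simply the share of ticks on which each state happened to be observed, which is why the result is only as good as the sampling.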

Why was the calculation changed in S10?

Statistical sampling works well when the samples are taken randomly but that’s not the case with this classic implementation. Far from being random, the sampling is actually synchronized with some of the activity it’s meant to measure.

In the case where there is significant callout scheduling, lots of threads may be made runnable by the clock tick, do their work, and block waiting for another timeout just prior to the next clock tick. If the system is less than 100% utilized, then that next clock tick may occur with most CPUs in the idle loop since the callout work finished before the tick. This may tend to misrepresent the usage by over-estimating idle time and under-estimating usr and/or sys time, implying the system has more capacity headroom than is really the case. In addition, if the system is close to 100% utilized, then the sampling error may cause time to be charged against the wrong thread.
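To make the synchronization problem concrete, here is a minimal model of a timer-driven thread. It assumes, purely for illustration, that the thread becomes runnable 1 microsecond after each tick and runs for a fixed time before blocking again; the names and numbers are hypothetical.

```python
TICK_US = 10_000  # 100 Hz clock: one tick every 10 ms

def cpu_state(t_us, busy_us):
    """State of the CPU at time t_us for a timer-driven workload.

    The thread is woken by the tick, so it starts just after the sample
    is taken (1 us later in this model) and runs for busy_us.
    """
    phase = t_us % TICK_US
    return "usr" if 1 <= phase <= busy_us else "idle"

def sampled_busy_pct(busy_us, seconds=1):
    """Busy percentage as seen by sampling exactly on each clock tick."""
    ticks = seconds * 1_000_000 // TICK_US
    busy = sum(cpu_state(k * TICK_US, busy_us) == "usr" for k in range(ticks))
    return 100.0 * busy / ticks

def true_busy_pct(busy_us):
    """Busy percentage from the actual time spent running."""
    return 100.0 * busy_us / TICK_US

print("sampled:", sampled_busy_pct(6_000))
print("true:   ", true_busy_pct(6_000))
```

In this model the CPU is busy for 6 ms out of every 10 ms, yet every sample lands in the idle gap before the next wakeup, so the sampled figure shows no usr time at all.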

Applications that use poll() timeouts intensively are typically the ones where the classic technique misreports most. It’s also open to subversion by clever programmers who can figure out how to get off CPU just prior to the clock interrupt and so avoid the accounting process charging their process.

Benchmark systems can be subject to misreporting as well because the benchmark driver system generating the workload may be pacing the workload using its own 100Hz clock. CPU utilization observed on the system under test (SUT) can vary markedly depending on the phase difference of the 100Hz clocks on the driver and SUT systems.
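The phase dependence can be illustrated with a rough sketch. It assumes a driver that sends one request every 10 ms at a fixed offset from the SUT clock tick, and a SUT that is busy for a fixed time after each request; the function name and parameters are hypothetical, and real clocks would also drift relative to each other.

```python
TICK_US = 10_000  # both the driver and the SUT run a 100 Hz clock

def reported_busy_pct(offset_us, busy_us, ticks=100):
    """Busy percentage the SUT reports when it samples on its own ticks.

    offset_us: how long after each SUT tick the driver's request arrives.
    busy_us:   how long the SUT works on each request.
    """
    busy_samples = 0
    for k in range(ticks):
        t = k * TICK_US
        # Time elapsed since the most recent request arrived.
        since_request = (t - offset_us) % TICK_US
        if since_request < busy_us:
            busy_samples += 1
    return 100.0 * busy_samples / ticks

# The same 30% workload, seen at three different phase offsets.
for offset in (0, 5_000, 8_000):
    print(offset, reported_busy_pct(offset, 3_000))
```

The true utilization is 30% in every case, but the sampled figure swings between 0% and 100% depending only on where the request lands relative to the tick.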

For the vast majority of systems, the work is asynchronous to the system clock and/or generated by external events (e.g. disk and network interrupts), and for these, clock-based sampling provides very reasonable accuracy with very low overhead.

How has utilization calculation changed in S10?

With S10, the decision was made to get rid of statistical sampling and move to microstate accounting, whereby time accounting calculations are made each time a process changes state. This was intended to be more accurate, since each LWP and CPU has microstate accounting data that is updated directly as processes move from state to state. One result is that users may see a decreased (but more correct) reporting of idle time versus Solaris 9 for the same workload, and mistakenly believe that Solaris 10 is less efficient and offers less headroom. However, the vanished headroom never really existed, and users now have a more accurate measure of headroom to use for capacity planning.
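The idea of accounting on state changes, rather than sampling, can be sketched as follows. This is a simplified illustration and not the Solaris implementation: the class, its fields and the timeline are hypothetical.

```python
class MicrostateClock:
    """Accumulate time per state, updated only when the state changes."""

    def __init__(self, start_ns, state="idle"):
        self.state = state
        self.since_ns = start_ns
        self.totals_ns = {"usr": 0, "sys": 0, "idle": 0}

    def transition(self, now_ns, new_state):
        # Charge the elapsed interval to the state being left.
        self.totals_ns[self.state] += now_ns - self.since_ns
        self.state = new_state
        self.since_ns = now_ns

clock = MicrostateClock(start_ns=0)

# Hypothetical timer-driven workload, repeating every 10 ms for one second:
# woken 1 us after the tick, 5 ms in usr, 1 ms in sys, then idle.
for k in range(100):
    base = k * 10_000_000
    clock.transition(base + 1_000, "usr")
    clock.transition(base + 5_001_000, "sys")
    clock.transition(base + 6_001_000, "idle")
clock.transition(1_000_000_000, "idle")  # close the final interval

total_ns = sum(clock.totals_ns.values())
for state, ns in clock.totals_ns.items():
    print(state, 100.0 * ns / total_ns)
```

Because every interval is measured, the timer-synchronized work that tick sampling would miss entirely is charged in full: 50% usr, 10% sys and 40% idle in this timeline.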

What new problem was introduced in S10?

With the spotlight now on microstate accounting (which presumably nobody had looked at closely before), an important shortcoming has become apparent. The microstates are not corrected for the time a thread spends pinned by interrupts while in a given state (this includes the idle thread). So, in the presence of a high interrupt load, the usr/sys/idle breakdown can be very much off, since interrupt time is counted against whatever process was pinned by the interrupt. Thus, sys time is under-estimated, and either usr or idle or both may be over-estimated.

For example, a CPU can spend 50% of its time in pci_intr_wrapper (as per dtrace profile data) but still be reported as 100% idle by mpstat if it is only handling interrupts, since the time will be charged to idle.
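A small model of this effect, assuming a CPU that does nothing but service interrupts half of the time. The interval format and function name are hypothetical; the point is only that time spent in an interrupt is charged to the state of the pinned thread.

```python
def account(intervals):
    """Compare uncorrected microstate totals with what actually happened.

    intervals: list of (duration_us, pinned_thread_state, in_interrupt).
    """
    reported = {"usr": 0, "sys": 0, "idle": 0}
    actual = {"usr": 0, "sys": 0, "idle": 0}
    for duration, state, in_interrupt in intervals:
        # Uncorrected accounting: the interrupt does not change the
        # microstate, so the pinned thread's state is charged.
        reported[state] += duration
        # What the CPU was really doing: interrupt time is kernel work.
        actual["sys" if in_interrupt else state] += duration
    return reported, actual

# Hypothetical CPU: the idle thread is pinned by an interrupt half the time.
timeline = [(500, "idle", True), (500, "idle", False)] * 1000
reported, actual = account(timeline)
print("reported:", reported)
print("actual:  ", actual)
```

The uncorrected totals show the CPU as entirely idle even though half of its time went to interrupt handling.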

There is also a small performance hit, because there is extra code in the path to do the accounting on state changes. Setting aside that cost and the failure to report interrupt time correctly, microstate accounting has another flaw: its probe effect makes it inaccurate, and it tends especially to under-report sys time. Take the typical case of making syscalls. The kernel only gets around to recording that it is in sys state once the trap handler has already run 26 instructions and executed 6 loads, all of which could miss in the E$ (they touch different cache lines), plus a store involving a 7th cache line.

Then we call syscall_mstate(): 28 instructions and 2 or 3 cache lines of text so far. Then we may take a register window spill on the first instruction, and we have accessed one or two cache lines of text before calling gethrtime_unscaled(), yet another cache line of text, where we read the tick register. At least the mstate update on syscall return is closer to the end. However, there is quite a bit of data and text footprint in this call to syscall_mstate() after reading the tick. Again, the net result is that sys time is under-estimated and usr time is over-estimated.

An extreme case of this was demonstrated with a test program doing a lot of semop() calls, where mpstat showed 60/40 usr/sys but the true accounting should have been more like 25/75. Most cases will not see that large an effect. In addition, this source of inaccuracy does not affect idle time, so capacity planning should still be accurate.
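The arithmetic behind the semop() example can be written out as a tiny model. It assumes, as a simplification, that the whole discrepancy is kernel time executed before the entry timestamp (or after the exit timestamp) and therefore charged to usr; the function names are hypothetical.

```python
def implied_misattribution(true_usr, reported_usr):
    """Share of total time that was kernel work but charged to usr."""
    return reported_usr - true_usr

def reported_split(true_usr, true_sys, misattributed):
    """Shift the misattributed share from sys to usr."""
    return true_usr + misattributed, true_sys - misattributed

# Figures from the semop() test: roughly 25/75 true, 60/40 reported.
shift = implied_misattribution(true_usr=25, reported_usr=60)
print("misattributed share:", shift)
print("reported usr/sys:   ", reported_split(25, 75, shift))
```

Since the shift only moves time between usr and sys, the idle figure is untouched, which matches the observation that capacity planning is not affected by this source of error.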

What was fixed in later versions?

With the appropriate patch installed, interrupt time is apportioned to system time in the same manner as in previous versions of Solaris. This should mean fewer differences between S9 and S10 statistics, and more accurate statistics on systems with significant interrupt time. To do this, a second set of microstate stats is kept in the CPU structure, which counts the ticks spent in interrupt mode according to whether it was usr, sys or idle that got interrupted. Then, whenever the CPU microstate is fetched, the interrupt time for which usr and idle were pinned is subtracted from those CPU microstate categories and added to the sys CPU microstate.
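The correction described above amounts to a simple adjustment at read time. The sketch below uses plain dictionaries and made-up tick counts; the function name and data layout are hypothetical and do not reflect the actual kernel structures.

```python
def corrected_microstates(raw, intr):
    """Reassign interrupt time to sys when the CPU microstates are read.

    raw:  ticks per microstate as accumulated (interrupt time included
          in whichever state was pinned).
    intr: ticks spent in interrupt mode, keyed by the state that was
          interrupted.
    """
    fixed = dict(raw)
    for state in ("usr", "idle"):
        fixed[state] -= intr[state]   # remove time the state was pinned
        fixed["sys"] += intr[state]   # and count it as system time
    # Interrupt time that pinned sys is already counted as sys.
    return fixed

raw = {"usr": 300, "sys": 100, "idle": 600}    # hypothetical raw ticks
intr = {"usr": 50, "sys": 20, "idle": 250}     # hypothetical interrupt ticks
print(corrected_microstates(raw, intr))
```

The total number of ticks is unchanged; only the split between the categories moves.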

Here is a table that summarizes the direction of the bias in each statistic for each release:

release    usr      sys      idle
--------------------------------------
S8/S9      under    under    over
S10        over     under    over
S10U1      over     under    accurate

Note that the amount over or under in all cases (especially S10) is usually very small and, with the addition of the fixes in S10U1 to handle interrupts, the statistics should be more accurate than ever.

Posted by Ramdev
Copyright © 2009 unixadminschool.com. All rights reserved.