Power of Control Charts

Igor Trubin, PhD, SunTrust Bank

http://www.itrubin.blogspot.com/

SCMG'09

Introduction: Agenda

o Where and why the Control Chart is used: a review of some systems performance tools on the market that build and use control charts.
o What is the Control Chart? A little bit of theory and history.
o How SEDS (Statistical Exception Detection System) uses it: MASF charts vs. SPC ones.
o A long gallery of charts already published in the CMG papers.
o Plus some new ones, with explanations of how to read them.
o How to build a Control Chart: using Excel for interactive analysis and R to automate control chart generation, with a live demonstration of the technique.


Where the Control Chart is used in IT

BMC software www.bmc.com: MASF technique in Performance Analysis for Servers and Performance Assurance tools; BMC ProactiveNet Analytics

http://documents.bmc.com/products/documents/49/13/84913/84913.pdf

Fujitsu www.fujitsu.com: ACTIVE BASELINING Technique

www.fujitsu.com/downloads/AU/active_baselining_in_passive_data_environments.pdf

McAfee www.mcafee.com: Anomaly-Based Intrusion Detection

www.mcafee.com/us/local_content/white_papers/wp_ddt_anomaly.pdf

BEZ Systems www.bez.com: for Oracle and Teradata performance

www.wmoug.org/bezPresentation.pdf

Integrien Alive™ http://www.integrien.com/

Netuitive http://netuitive.com/

Firescope http://www.firescope.com/default.htm

Managed Objects http://managedobjects.com/

Six Sigma http://www.isixsigma.com/st/control_charts/

SEDS (Statistical Exception Detection System) http://www.itrubin.blogspot.com/


Why Control Chart is used for Capacity Management

o Control Chart has the ability to uncover some trends and patterns showing actual data deviations from the historical baseline.
o Control Chart is a really proactive tool and could capture unusual resource usage before something breaks.
o Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.
o Control Chart provides a dynamic threshold: no need for manual setting.
o Control Chart is the tool to detect a workload pathology (run-away, memory leaks).


Control Chart Definitions

Definitions from the Internet:

o The control chart, also known as the Shewhart chart or process-behavior chart, in statistical process control is a tool used to determine whether a manufacturing or business process is in a state of statistical control or not.
o A graphical tool for monitoring changes that occur within a process, by distinguishing variation that is inherent in the process (common cause) from variation that yields a change to the process (special cause). This change may be a single point or a series of points in time; each is a signal that something is different from what was previously observed and measured.


What the Control Chart is: Chart Details

o Points representing measurements of a quality characteristic in samples taken from the process at different times (the data).
o A centre line, drawn at the process characteristic mean, which is calculated from the data.
o Upper (UCL) and lower (LCL) control limits (sometimes called "natural process limits") that indicate the threshold at which the process output is considered statistically "unlikely".


What the Control Chart is (continued): Choice of limits

o UCL = mean + 3σ; LCL = mean - 3σ (σ is the standard deviation). The reason that 3σ control limits balance the risk of error is that, for normally distributed data, data points will fall inside the 3σ limits 99.7% of the time when a process is in control.

o UCL = 95th percentile; LCL = 5th percentile. (A percentile, or centile, is the value of a variable below which a certain percent of observations fall.)
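The two choices of limits can be sketched in a few lines of Python. This is only an illustration, not the SEDS implementation: the function names and the sample CPU utilization values are made up for the example, and the percentile version assumes the "inclusive" interpolation method of the standard library.

```python
import statistics

def sigma_limits(reference, k=3.0):
    """Return (LCL, mean, UCL) as mean -/+ k sample standard deviations."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return mean - k * sd, mean, mean + k * sd

def percentile_limits(reference, lower_pct=5, upper_pct=95):
    """Return (LCL, UCL) as the lower_pct-th and upper_pct-th percentiles."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    cuts = statistics.quantiles(reference, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

# Made-up reference set of CPU utilization values, in percent.
cpu_util = [42.0, 45.0, 47.0, 44.0, 46.0, 43.0, 48.0, 45.0]

lcl, mean, ucl = sigma_limits(cpu_util)
p5, p95 = percentile_limits(cpu_util)
print(f"3-sigma limits: LCL={lcl:.2f} mean={mean:.2f} UCL={ucl:.2f}")
print(f"percentile limits: LCL={p5:.2f} UCL={p95:.2f}")
```

Percentile limits do not assume a normal distribution, which may make them a better fit for skewed metrics.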


MASF, SPC control and histogram charts comparison

MASF: Reference set vs. Actual data [1]

All three charts demonstrate different views of the exceptions for CPU utilization that occurred at 8 am. As opposed to classical control charts, MASF charts can be most useful for showing a 24-hour (or 7x24) profile of resource usage.

NOTE: Limits might need to be cut at 100% or 0% natural thresholds
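A minimal sketch of that cut, assuming a percentage metric such as CPU utilization. The function name and the default bounds are illustrative only.

```python
def clamp_limits(lcl, ucl, floor=0.0, ceiling=100.0):
    """Cut control limits at the natural bounds of a percentage metric."""
    return max(floor, lcl), min(ceiling, ucl)

# Limits computed as mean -/+ 3 st. dev. can leave the 0..100% range.
print(clamp_limits(-12.5, 108.0))  # (0.0, 100.0)
print(clamp_limits(20.0, 85.0))    # (20.0, 85.0)
```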


How close is the data to normal distribution

... for global CPU utilization on the same Unix server?

Reference set grouped by hours

Example of the 6 month hourly histograms for HP rp7400/550Mhz/6way server global CPU utilization exception


Types of Control Charts against performance data

o Classical SPC type: daily or hourly aggregated data (SEDS, BMC Visualizer) or raw granular data (Integrien, for near-real-time data alerts)
o 24-hour profile for global or application-level data (MASF type) (SEDS, BMC)
o Weekly profile of daily data (SEDS)
o Weekly profile of hourly data (SEDS main chart): the most efficient type of performance data chart
o Monthly profile of daily data


How to read Control Charts

The control chart is one of the existing graphical tools. It is one of the most powerful, but not the only one!

CONTROL CHART

Top Bar Chart

CASE: SEDS detected that a VM server was moved to another host by v-motion


Trend Forecast Charts


How to read weekly profile hourly data control charts

This is the SEDS view that compares the last 7 days (actual) with the last 6 months of baseline (historical) data.

The black curve is the actual hourly data. To the left of the vertical line is THIS WEEK's data up to yesterday; to the right is last week's data.

The green curve is the hourly average (mean) for a particular weekday and hour over the 6-month history.
The red curve is the Upper Control Limit (UCL = mean + 3 st. dev.).
The blue curve is the Lower Control Limit (LCL = mean - 3 st. dev.).
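The green, red and blue curves can be sketched as one (mean, UCL, LCL) triple per weekday-hour slot. The Python below is a hedged illustration rather than the SEDS code: the function name, the 4-week synthetic history and its values are assumptions made for the example.

```python
import statistics
from collections import defaultdict
from datetime import datetime, timedelta

def weekly_profile(history, k=3.0):
    """Group (timestamp, value) pairs by (weekday, hour).

    Returns {(weekday, hour): (lcl, mean, ucl)} with Monday as weekday 0.
    """
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    profile = {}
    for slot, values in buckets.items():
        mean = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        profile[slot] = (mean - k * sd, mean, mean + k * sd)
    return profile

# Synthetic history: 4 weeks of hourly data, busier in weekday office hours.
start = datetime(2009, 1, 5)  # a Monday
history = []
for h in range(4 * 7 * 24):
    ts = start + timedelta(hours=h)
    week = h // (7 * 24)
    base = 60.0 if ts.weekday() < 5 and 8 <= ts.hour < 18 else 20.0
    history.append((ts, base + week))

profile = weekly_profile(history)
lcl, mean, ucl = profile[(0, 9)]  # Monday, 9 am
print(f"Monday 09:00 LCL={lcl:.2f} mean={mean:.2f} UCL={ucl:.2f}")
```

Keeping one slot per weekday and hour is what lets a busy Monday morning and a quiet Sunday night each have their own limits.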


How the weekly profile hourly data control chart is built

Take one week of recent data


How the weekly profile hourly data control chart is built

Take one week of recent data and put that in weekly profile format


How the weekly profile hourly data control chart is built

Take one week of recent data and put it in weekly profile form. Take some representative historical reference data, set it as a baseline, and then compare it with the most recent actual data.

· If the actual data exceeds some statistical thresholds (e.g. Upper (UCL) and Lower (LCL) Control Limits, which are the mean plus/minus 3 standard deviations),
· generate an exception (alert via e-mail) and build a control chart.

NOTE: it predicts what is supposed to happen tomorrow.
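The comparison step can be sketched as follows, assuming the limits per (weekday, hour) slot have already been computed from the reference set. The function name, the sample limits and the printed alert text are hypothetical; a real system would send the e-mail and draw the chart instead of printing.

```python
def find_exceptions(actual, limits):
    """Return (slot, value, kind) for each actual value outside its limits.

    actual: list of (slot, value); limits: {slot: (lcl, ucl)}.
    """
    exceptions = []
    for slot, value in actual:
        if slot not in limits:
            continue  # no history for this slot, nothing to compare with
        lcl, ucl = limits[slot]
        if value > ucl:
            exceptions.append((slot, value, "above UCL"))
        elif value < lcl:
            exceptions.append((slot, value, "below LCL"))
    return exceptions

# Made-up (LCL, UCL) pairs keyed by (weekday, hour), and two actual readings.
limits = {(0, 8): (45.0, 55.0), (0, 9): (55.0, 65.0)}
actual = [((0, 8), 51.0), ((0, 9), 75.0)]

found = find_exceptions(actual, limits)
for (weekday, hour), value, kind in found:
    print(f"ALERT weekday={weekday} hour={hour}: {value} is {kind}")
```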


Why is it so powerful? Forecasting vs. Exception Detecting

In addition to unusual resource usage capture, the Weekly Control Chart has the following features:

o "Summarization": it uses summarized data (6-8 months of hourly history).
o "Correlation": it allows you to see where system performance and/or business driver metrics correlate, simply by analyzing synchronized control charts.
o "Do Not Mix Shifts": the control chart by nature visualizes the separation of work or peak time and off time.
o "Statistical Model Choice": playing with different statistical limits (e.g. 1 st. dev. vs. 3 or more st. dev., or percentiles) to tune the system and reduce the rate of false positives.
o "Significant Events": the chart adjusts itself statistically to some events, because the historical period follows the actual data and every event will eventually be older than the oldest day in the reference set.
o "Outliers detection": all workload pathologies are statistically unusual; they are definitely captured and are then supposed to be removed from the historical data.
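As an illustration of the "Statistical Model Choice" point, the sketch below counts how many points are flagged when the limits are set at 1, 2 or 3 standard deviations. The function name and all values are made up for the example; it is not taken from SEDS.

```python
import statistics

def count_exceptions(reference, actual, k):
    """Count actual values further than k standard deviations from the mean."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return sum(1 for value in actual if abs(value - mean) > k * sd)

# Made-up reference set and recent actual data.
reference = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
actual = [45.0, 58.0, 63.0, 68.0, 50.0]

for k in (1, 2, 3):
    print(f"k={k}: {count_exceptions(reference, actual, k)} exception(s)")
```

Wider limits flag fewer points, so the multiplier is a direct knob for trading sensitivity against false positives.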


Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Paging exceptions)

This metric has the following problem: there is no simple calculated threshold and, as such, it is hard to say if the 2 am spike is big enough to worry about.


Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Paging exceptions )

The control chart shows unusual paging activity. That is confirmed by reviewing the historical paging trend:


Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Weekly control charts )


Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Weekly control charts )

This example shows the weekly scheduled server reboot (to avoid memory leak issues). This kind of graph is also useful since, even if there were no exceptions yesterday, it may show exceptions from previous days.


Why is it so powerful? EXAMPLES

The SEDS and Memory Metrics (Weekly control charts: Memory Leaks )


Why is it so powerful? EXAMPLES

The SEDS and CPU Metrics (24 hour and weekly control charts)

Global exception correlates with some apps

Some Citrix apps defect on VMs

Why is it so powerful? EXAMPLES

The SEDS and Virtual Machine metrics

HOST OS

Runaway VM

Runaway VM

Control Chart detects a runaway VM even though the CPU util. is <80%


Why is it so powerful? EXAMPLES

The SEDS and CPU Run Queue metric

Run Queue is useful for capturing CPU bottlenecks. And it indirectly relates to the system response time.

This is a Sun Fire V880 4-way box


Why is it so powerful? EXAMPLES

The SEDS and CPU Run Queue metric

If a CPU Queue exception is detected, CPU utilization had an exception for the same hour, and CPU utilization was close to 100%, there is a high probability of a CPU capacity issue. But which application caused the exceptions?

This is a Sun Fire V880 4-way box


Why is it so powerful? EXAMPLES

The SEDS and CPU Run Queue metric

When a global exception occurs (CPU Queue), the workload level data can be scanned to identify what particular application on the server was responsible for the exception.


Why is it so powerful? EXAMPLES

The SEDS and CPU Run Queue metric

The scan against the application level data showed that Application5 had a similar exception. CONCLUSION: An unusual number of active processes is the cause of the global CPU Queue exception and indicates a potential application performance problem!
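The scan of workload level data can be sketched like this: once a global exception is known for some hour, each application's own value for that hour is checked against that application's own limits. The function name, the data layout and the process counts are hypothetical; only the Application5 label echoes the example above.

```python
def responsible_workloads(exception_hour, workload_actual, workload_limits):
    """Return names of workloads that exceeded their own UCL in the given hour.

    workload_actual: {name: {hour: value}}
    workload_limits: {name: {hour: (lcl, ucl)}}
    """
    culprits = []
    for name, actual_by_hour in workload_actual.items():
        value = actual_by_hour.get(exception_hour)
        limits = workload_limits.get(name, {}).get(exception_hour)
        if value is None or limits is None:
            continue  # no data or no baseline for this workload and hour
        _lcl, ucl = limits
        if value > ucl:
            culprits.append(name)
    return sorted(culprits)

# Made-up numbers of active processes per application at hour 14.
workload_actual = {
    "Application1": {14: 12},
    "Application5": {14: 95},
    "Application7": {14: 30},
}
workload_limits = {
    "Application1": {14: (5, 20)},
    "Application5": {14: (10, 40)},
    "Application7": {14: (15, 45)},
}

culprits = responsible_workloads(14, workload_actual, workload_limits)
print(culprits)  # ['Application5']
```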


Why is it so powerful? EXAMPLES

The SEDS and response time and some other application metrics

SEDS could capture exceptions of Application Response Time (ART) and of the call volume of particular functions (API calls) within the middleware tier.


Why is it so powerful? EXAMPLES

The SEDS and disk space metrics


Why is it so powerful? EXAMPLES

The SEDS and disk I/O metrics

o SEDS captured a Disk I/O rate exception at about 4:00 PM on ServerB,

o and the Application detector found that the Workload "Appl2" had an exception as well.


Why is it so powerful? EXAMPLES

The SEDS and Unisys and Tandem metrics

The Tandem server, in contrast, had two unusual spikes of CPU utilization that crossed the upper limit.

The Unisys server had unusually low utilization that might indicate disk or database performance problems.


Why is it so powerful? EXAMPLES

Mainframe metrics Control Chart (BMC)

BMC Visualizer was used to find any other exceptions based on other filtering policies. For that, the BMC collector needed to be installed on the server, and BMC Visualizer was used manually to capture any MASF exceptions.

BMC Visualizer example: the System Hierarchy (spectrum) and Control charts


Why is it so powerful? EXAMPLES

The SEDS and Mainframe metrics

Captured Exceptions for One of the Logical Partitions (LPAR)

SEDS shows that Appl1 was responsible for the global maxima in the overall MIPS chart.

Looking at a stacked workload data chart, it is difficult to find the application that is responsible for spikes in overall CPU usage.


Why is it so powerful? EXAMPLES

The SEDS and Mainframe metrics

Hourly SUM of the average response per transaction - RESP
(It shows values consistently higher than average)

Hourly SUM of ended transaction count - TRANS

Hourly SUM of elapsed tasks duration - CPUsec


Why is it so powerful? EXAMPLES

The SEDS and Mainframe metrics

To capture an unusual behavior of a relatively small application that was not big enough to create a global exception:

HEALTH CHECK: To prove the stable behavior of any essential or critical application:


How to build a Control Chart

Using existing statistical tools

o SAS/Base and SAS/Graph
o SAS/QC (Quality Control)
o JMP from SAS
o Minitab and others

Using built-in control chart builders (BMC, BEZ and so on)


How to build a Control Chart - EXCEL

What about just Excel!

o EXAMPLE: CPS Control Chart with moving or static reference set

LINK TO SPREADSHEET

7-day Moving Average =AVERAGE(B:B+10) = F
1 st. dev =STDEV(B:B+10)
UpperLimit =F+M$2*G = H
LowerLimit =F-M$2*G = J

Other limits can be used:
=PERCENTILE(B3:B+10,0.05)
=PERCENTILE(B3:B+10,0.95)

S+ =IF(B-H<0,0,B-H) = I
S- =IF(B-J>0,0,B-J) = K
EV = ExtraValue = I+K (see [1])
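The same spreadsheet logic can be sketched in Python with a moving reference set. Several details are assumptions, because the row numbers are not visible in the formulas above: the window is taken as the 7 points before the current one, the multiplier k stands in for cell M$2, and the data values are made up.

```python
import statistics

def moving_control_chart(values, window=7, k=3.0):
    """Compute limits from the preceding `window` points for each later point."""
    rows = []
    for i in range(window, len(values)):
        ref = values[i - window:i]               # moving reference set
        mean = statistics.fmean(ref)             # spreadsheet column F
        sd = statistics.stdev(ref)               # the "1 st. dev" cell
        upper = mean + k * sd                    # column H: =F+M$2*G
        lower = mean - k * sd                    # column J: =F-M$2*G
        s_plus = max(0.0, values[i] - upper)     # column I: =IF(B-H<0,0,B-H)
        s_minus = min(0.0, values[i] - lower)    # column K: =IF(B-J>0,0,B-J)
        rows.append({
            "index": i,
            "value": values[i],
            "upper": upper,
            "lower": lower,
            "ev": s_plus + s_minus,              # EV = I+K
        })
    return rows

# Made-up daily values with one spike at position 7.
values = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 30.0, 10.0]
rows = moving_control_chart(values)
for row in rows:
    print(f"point {row['index']}: value={row['value']} EV={row['ev']:.2f}")
```

A non-zero EV measures how far a point lies outside the limits, so it can be used to rank exceptions by size. For a static reference set, the same limits would be computed once from a fixed slice instead of from a sliding one.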


How to build a Control Chart - EXCEL

What about just Excel!

o EXAMPLE2: Weekly Health Index (Concord metric) MASF Control Chart Builder LINK TO SPREADSHEET

· For SUNday (Column "B"):
· Mean = AVERAGE(B2:B25)
· Upperlimit = AVERAGE(B2:B25)+3*STDEV(B2:B25)
· Lowerlimit = IF(AVERAGE(B2:B25)-3*STDEV(B2:B25)<0,0,AVERAGE(B2:B25)-3*STDEV(B2:B25))
· StdDeviation = STDEV(B2:B25)
· For other columns, "B" should be replaced with the other column letter (e.g. MONday is "C" and so on)


DATE        HEALTH INDEX   WEEK DAY
6-Dec-05    2.3            3
7-Dec-05    1.5            4
8-Dec-05    0.0            5
9-Dec-05    1.1            6
...         ...            ...
23-May-06   4.4            3
24-May-06   6.0            4
25-May-06   0.3            5
26-May-06   1.0            6
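The per-weekday formulas can be sketched in Python as below. Only the eight rows visible in the table are used (the elided rows are not reconstructed), so the resulting numbers are purely illustrative; the function name is an assumption, and the lower limit is floored at 0 as in the Lowerlimit formula.

```python
import statistics
from collections import defaultdict

def weekday_limits(rows, k=3.0):
    """rows: (week_day, health_index) pairs. Returns per-weekday statistics."""
    by_day = defaultdict(list)
    for week_day, health_index in rows:
        by_day[week_day].append(health_index)
    result = {}
    for week_day, values in sorted(by_day.items()):
        mean = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        result[week_day] = {
            "mean": mean,
            "upper": mean + k * sd,
            "lower": max(0.0, mean - k * sd),  # floor at 0, as in the IF formula
        }
    return result

# (WEEK DAY, HEALTH INDEX) pairs copied from the visible table rows.
rows = [(3, 2.3), (4, 1.5), (5, 0.0), (6, 1.1),
        (3, 4.4), (4, 6.0), (5, 0.3), (6, 1.0)]

limits_by_day = weekday_limits(rows)
for week_day, stats in limits_by_day.items():
    print(week_day, {name: round(value, 2) for name, value in stats.items()})
```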

How to build a Control Chart - SAS vs. EXCEL

What about just Excel!

o EXAMPLE3: SEDS chart

The raw data was captured by a SEDS based on MXG data:

SAS Version

This real performance issue was captured, and the data was provided, by John Shuck (SunTrust).


How to build a Control Chart - SAS vs. EXCEL

What about just Excel!

o EXAMPLE3: SEDS chart LINK TO SPREADSHEET

The raw data was captured by a SEDS based on MXG data:

EXCEL Pivot Table Version


How to build a Control Chart - EXCEL vs. SAS vs. R

What about SAS, Excel or R!

o EXAMPLE4: Monthly Profile vs. Weekly Profile LINK TO SPREADSHEET

The data is Unix File Space Utilization

EXCEL

SAS


How to build a Control Chart - EXCEL vs. SAS vs. R

What about SAS, Excel or R!

o EXAMPLE4: Monthly Profile

R download: http://www.r-project.org/

The data is Unix File Space Utilization: INPUT is CSV

R-script (published on my blog):

Output JPEG


Summary

o Control Chart is a really proactive tool and can help to capture unusual resource usage before something breaks.
o Control Chart is the best base-lining tool and can show how actual data deviate from the historical baseline.
o Control Chart is the tool for workload pathology detection (run-away, memory leaks).
o Control Chart has the ability to uncover some trends and patterns showing actual data deviations from a historical baseline.
o Control Chart could be Classical (SPC) or MASF (Actual vs. Reference set with grouping by hour-weekdays).
o Control Chart provides a dynamic threshold: no need for manual setting.
o Control Chart can be built using just Excel or R.


References

Jeffrey Buzen and Annie Shum: "MASF -- Multivariate Adaptive Statistical Filtering", Proceedings of the Computer Measurement Group, 1995, pp. 1-10.

Igor Trubin: "Global and Application Levels Exception Detection System, Based on MASF Technique", Proceedings of the Computer Measurement Group, 2002. (http://www.cmg.org/measureit/shared/trubin_02.pdf)

Linwood Merritt, Igor Trubin: "Disk Subsystem Capacity Management Based on Business Drivers I/O Performance Metrics and MASF", Proceedings of the Computer Measurement Group, 2003. (http://regions.cmg.org/regions/ncacmg/downloads/june162004_session3.doc)

Linwood Merritt, Igor Trubin: "Mainframe Global and Workload Level Statistical Exception Detection System, Based on MASF", Proceedings of the Computer Measurement Group, 2004. (http://www.cmg.org/membersonly/2004/papers/4179.pdf)

Igor Trubin: "Capturing Workload Pathology by Statistical Exception Detection System", Proceedings of the Computer Measurement Group, 2005. (http://www.cmg.org/membersonly/2005/papers/5016.pdf)

Igor Trubin: "System Management by Exception, Part 6", Proceedings of the Computer Measurement Group, 2006. (http://www.cmg.org/membersonly/2006/papers/6120.pdf)

Igor Trubin: "System Management by Exception, Part Final", Proceedings of the Computer Measurement Group, 2007.

Igor Trubin: "Exception Based Modeling and Forecasting", Proceedings of the Computer Measurement Group, 2008.

Questions?

Thank you!
