
Analyzing and Visualizing Expression Data with Spotfire

Spotfire DecisionSite (http://hc-spotfire.stjude.org/spotfire/support/manuals/manuals.jsp) is a powerful data mining and visualization program with applications in many disciplines. Modules are available in support of gene expression analysis, proteomics, general statistical analysis, chemical lead discovery analysis, and geology, among others. Here the focus is on Spotfire's utility in analyzing gene expression data obtained from DNA microarray experiments. Other units in this manual present a general overview of the Spotfire environment along with the hardware and software requirements for installing it (UNIT 7.7), and explain how to load data into Spotfire for analysis (UNIT 7.8).

This unit presents numerous methods for analyzing microarray data. Specifically, Basic Protocol 1 and Alternate Protocol 1 describe two methods for identifying differentially expressed genes. Basic Protocol 2 discusses how to conduct a profile search. Additional protocols illustrate various clustering methods, such as hierarchical clustering (see Basic Protocol 4 and Alternate Protocol 2), K-means clustering (see Basic Protocol 5), and Principal Components Analysis (see Basic Protocol 6). A protocol explaining coincidence testing (see Basic Protocol 3) allows the reader to compare the results from multiple clustering methods. Additional protocols demonstrate querying the Internet for information based on the microarray data (see Basic Protocol 7), mathematically transforming data within Spotfire to generate new data columns (see Basic Protocol 8), and exporting final Spotfire visualizations (see Basic Protocol 9).

Spotfire (Functional Genomics module) can import data in nearly any format, but the authors have focused here on two popular microarray platforms: the commercial GeneChip microarray data (Affymetrix) and two-color spotted microarray data produced using GenePix software (Axon).
Spotfire facilitates the seamless import of Affymetrix output files (.met) from Affymetrix MAS v4.0 or v5.0 software. The .met file is a tab-delimited text file containing information about attributes such as probe set level, gene expression levels (signal), detection quality controls (p-value and Absence/Presence calls), and so forth. In the examples below, the authors use MAS 5.0 .met files.

Several types of spotted arrays and their corresponding data types exist, including arrays from commercial vendors (e.g., Agilent, Motorola, and Mergen) that supply spotted microarrays for various organisms, as well as arrays from facilities that manufacture their own chips. Several different scanners and scanning software packages are available. One of the more commonly used scanners is the Axon GenePix. GenePix data files are in a tab-delimited text format (.gpr), which can be directly imported into a Spotfire session.

NOTE: This unit assumes the reader is familiar with the Spotfire environment, has successfully installed Spotfire, and has uploaded and prepared data for analysis. For further information regarding these tasks, please see UNITS 7.7 and 7.8.

UNIT 7.9

IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING t-TEST/ANOVA

The treatment comparison tool provides methods for distinguishing between different treatments for an individual record. There are two types of treatment comparison algorithms: t-test/ANOVA (Kerr and Churchill, 2001) and Multiple Distinction (Eisen et al., 1998). Both algorithms seek to identify differentially expressed genes based on their expression values.

BASIC PROTOCOL 1

Analyzing Expression Analysis

Contributed by Deepak Kaushal and Clayton W. Naeve

Current Protocols in Bioinformatics (2004) 7.9.1-7.9.43 Copyright © 2004 by John Wiley & Sons, Inc.


Supplement 7

The t-test is a commonly used method to evaluate the difference between the means of two groups by testing whether the observed difference is statistically significant. Analysis of variance (ANOVA) works on the same principle but can be used to differentiate between more than two groups. ANOVA calculates the variance within each group and compares it to the variance between the groups. The null hypothesis assumes that the mean expression level of a gene does not differ between the groups. For each gene under consideration, the null hypothesis is then either rejected or not rejected. The results are expressed in terms of a p-value, which is the observed significance level, i.e., the probability of obtaining a difference at least as large as the one observed when in fact there is no difference in the mean expression values of the gene. Concluding that a difference exists when there is none is a type I error. If the p-value is below a certain threshold, usually 0.05, a significant difference is considered to exist. The lower the p-value, the stronger the evidence that the difference is real; a low p-value does not by itself indicate a large difference.

The ANOVA algorithm in Spotfire has a one-way layout; therefore it can only be used to discriminate between groups based on one variable. Further, this algorithm assumes the following: (1) the data are normally distributed and (2) the variances of the separate groups are similar. Violating these assumptions can lead to erroneous results. One way to bring the data closer to a normal distribution is to log transform the data (UNIT 7.8).
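
The comparison of within-group and between-group variance described above can be illustrated outside Spotfire. The following is a minimal sketch in Python, not Spotfire's implementation: the function names and the example values are hypothetical, and only the test statistics are computed (converting them to p-values requires the F or t distribution, which is omitted here). It also shows that, for two groups, the one-way ANOVA F statistic equals the square of the equal-variance t statistic.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


def pooled_t(a, b):
    """Return the equal-variance (Student) two-sample t statistic."""
    na, nb = len(a), len(b)
    ss = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
    pooled_var = ss / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


# Hypothetical log-transformed signals for one gene on replicate arrays.
control = [7.9, 8.1, 8.0]
treated = [9.0, 9.2, 9.1]

f_stat = one_way_anova_f([control, treated])
t_stat = pooled_t(control, treated)
print(f"F = {f_stat:.2f}, t = {t_stat:.2f}, t squared = {t_stat ** 2:.2f}")
```

A large F (or a large absolute t) means the group means differ by much more than the scatter within the groups, which is what a small p-value reflects.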

Necessary Resources

Hardware

Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed)
or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software

PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)

Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files

Data files (e.g., .met files, .gpr files)


1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison in the Tools pane of DecisionSite Navigator (Fig. 7.9.1).


Figure 7.9.1 The Treatment Comparison tool is shown.

The Treatment Comparison dialog box is displayed and all available columns are listed in the Value Columns field. (Note that if the tool has been used before, it retains the earlier grouping and the user will have to delete it.) Value Columns are the original data columns that have been uploaded into the Spotfire session. Any data column can be used as a value column as long as it includes integers or real numbers.

2. Use the following procedure to move and organize the desired value columns into the Grouped Value Columns field, which displays columns that the user has defined as being part of a group (e.g., replicate microarrays) on which the calculation is to be performed.

Note that at least two columns should be present in every group for the tool to be able to perform its calculations.

a. Select the desired column. Click the Add button.

The column will end up in the selected group of the Grouped Value Columns field.

b. Click New Group to add a group or Delete Group to remove a group.

If the deleted group contained any value columns, they are moved back to the Value Columns field (Fig. 7.9.2).

c. Click Rename Group to open the edit group name dialog box, which can be used to rename a group.

It is useful to rename the groups to something meaningful because the default names are Group1, Group2, and so on.

3. From the same dialog box, choose whether All Records or Selected Records are to be used.

Choosing All Records causes all records that were initially uploaded into Spotfire to be used for the calculations. If any preprocessing or filtering steps have been performed and the user would like to exclude the filtered-out records from the calculations, the user should choose Selected Records.


Figure 7.9.2 The Treatment Comparison dialog box allows the users to group various Value Columns into different groups on which t-test/ANOVA is to be performed.


Figure 7.9.3 A profile chart is generated to display the results of t-test/ANOVA analysis. The "ttest/ANOVA Query Device" (a range slider) can be manipulated to identify highly significant genes. The profile chart is colored in the Continuous Coloring mode based on the t-test/ANOVA p-values.


4. If there are empty values in the data, select a method to replace empty values from the following choices in the drop-down list:

Choice                    Replaces empty values with
Constant Numeric Value    Specified value
Row Average               Average of all the values in the row
Row Interpolation         Interpolated value of the two neighboring values

5. Select "t-test/ANOVA" from the Comparison Measure list box.

6. Type a new identifier in the Column Name text box or use the default. Check the Overwrite box to replace the values of a previously named column. If the user wishes not to overwrite, make sure that the Overwrite check box is unchecked.

7. Click OK.

This adds a new column containing p-values to the data set and creates a new Profile Chart visualization. The profiles are ordered by the group with the lowest p-value setting (Fig. 7.9.3).
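
The three empty-value replacement choices offered in step 4 can be sketched as follows. This is an illustrative Python approximation only: the function names are hypothetical, empty values are represented by None, each row is assumed to contain at least one value, and the interpolation is assumed to be the mean of the nearest non-empty neighbors (a value at the edge of a row simply takes its single neighbor). Spotfire's exact handling of edges and of runs of empty values is not described in this unit.

```python
def fill_constant(row, value):
    """Replace every empty (None) entry with a user-specified constant."""
    return [value if x is None else x for x in row]


def fill_row_average(row):
    """Replace every empty entry with the average of the non-empty entries."""
    present = [x for x in row if x is not None]
    average = sum(present) / len(present)
    return [average if x is None else x for x in row]


def fill_row_interpolation(row):
    """Replace each empty entry with the mean of its nearest non-empty neighbors."""
    filled = list(row)
    for i, x in enumerate(row):
        if x is not None:
            continue
        # Nearest non-empty value to the left and to the right, if any.
        left = next((row[j] for j in range(i - 1, -1, -1) if row[j] is not None), None)
        right = next((row[j] for j in range(i + 1, len(row)) if row[j] is not None), None)
        neighbors = [v for v in (left, right) if v is not None]
        filled[i] = sum(neighbors) / len(neighbors)
    return filled


# Hypothetical expression row with one missing measurement.
row = [2.0, None, 4.0, 6.0]
print(fill_constant(row, 0.0))
print(fill_row_average(row))
print(fill_row_interpolation(row))
```

The choice matters for downstream statistics: a constant such as 0 can create an artificial outlier within a group, whereas the row average or an interpolated value keeps the replaced entry close to the measured ones.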

IDENTIFICATION OF DIFFERENTIALLY EXPRESSED GENES USING DISTINCTION CALCULATION

The distinction calculation algorithm (Eisen et al., 1998) is slightly different from that of t-test/ANOVA (see Basic Protocol 1). The algorithm divides the variables (columns) within a row into two groups, based on factors such as type of tissue or tumor, and calculates a distinction value for each row from the two groups of values. The distinction value is a measure of how distinct the expression level is between the two parts of the profile (e.g., tumor cells versus normal cells).

The profiles can be compared to an idealized pattern to identify genes closely matching that pattern. One such idealized pattern could be one in which the expression level of a given gene is uniformly high for one group of experiments and uniformly low for another group. The calculated distinction value is a measure of how similar each profile is to this ideal. Profiles that have high expression values in the first group and low expression values in the second are given high positive distinction values. Likewise, profiles that have low expression values in the first group and high expression values in the second are given high negative distinction values.
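
The idea of scoring a profile against an idealized high/low pattern can be illustrated with a small Python sketch. This unit does not give the formula Spotfire uses for the distinction value, so the score below (the Pearson correlation between a profile and a step-shaped ideal) is a stand-in chosen only because it has the sign behavior described above. All function names and data are hypothetical, and a completely flat profile is assumed not to occur (it would cause a division by zero).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ideal_pattern(n_first, n_second):
    """Idealized profile: uniformly high in the first group, low in the second."""
    return [1.0] * n_first + [0.0] * n_second


def distinction_like_score(profile, n_first):
    """Similarity of a profile to the ideal pattern (illustrative stand-in)."""
    return pearson(profile, ideal_pattern(n_first, len(profile) - n_first))


# Hypothetical profiles: three arrays in the first group, three in the second.
high_then_low = [9.1, 8.8, 9.0, 2.1, 1.9, 2.2]
low_then_high = [2.0, 2.3, 1.8, 9.2, 8.9, 9.0]

score_up = distinction_like_score(high_then_low, 3)
score_down = distinction_like_score(low_then_high, 3)
print(f"{score_up:.3f} {score_down:.3f}")
```

Profiles that are high in the first group and low in the second score close to +1, and the reverse pattern scores close to -1, mirroring the positive and negative distinction values described in the text.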

ALTERNATE PROTOCOL 1

Necessary Resources

Hardware, software, and files are identical to those listed for Basic Protocol 1.

1. Click Analysis, followed by Pattern Detection, followed by Treatment Comparison in the Tools pane of DecisionSite Navigator (Fig. 7.9.1).

The Treatment Comparison dialog box is displayed (Fig. 7.9.4) and all available columns are listed in the Value Columns field. (Note that if the tool has been used before, it retains the earlier grouping and the user will have to delete it.) Value Columns are the original data columns that have been uploaded into the Spotfire session. Any data column can be used as a value column provided it includes integers or real numbers.

2. Organize columns, choose records, and fill empty values as described (see Basic Protocol 1, steps 2 to 4).

3. Select Distinction/Multiple Distinction from the Comparison Measure list box and click OK (Fig. 7.9.4).


Figure 7.9.4 The Treatment Comparison dialog box allows the users to group various Value Columns into different groups on which Multiple Distinction is to be performed.


Figure 7.9.5 Results of Multiple Distinction are originally displayed in a profile chart. Users can, however, build a heat map based on these results. (A) A set of genes on the basis of which eight experiments can be distinctly identified using the Multiple Distinction algorithm. (B) A zoomed-in version of the same heat map.

This adds new columns containing distinction values to the data set and creates a new profile visualization. The profiles are ordered by the group with the lowest value (highest distinction).

4. Use the results of the Distinction/Multiple Distinction calculation to order a heat map, which allows better visualization and identification of genes with different profiles in different samples (Fig. 7.9.5).


A heat map is a false-color image of a data set (e.g., microarray data) that allows users to detect the presence of certain patterns in the data. Heat maps resemble a spreadsheet in which each row represents a gene present on the microarray and each column represents a microarray experiment. By coloring the heat map according to signal or log ratio values, trends in the behavior of genes as a function of experiments can be seen.

BASIC PROTOCOL 2

IDENTIFICATION OF GENES SIMILAR TO A GIVEN PROFILE: THE PROFILE SEARCH

In a profile search, the similarity between each profile (i.e., each data point or row) and a master profile is calculated according to one of the available similarity measures, and all profiles are then ranked according to their similarity to the master. Spotfire adds a new data column with the similarity value for each individual profile (index of similarity) and a rank column, which together enable users to identify genes that have profiles similar to the master profile. In order to successfully use this algorithm, the user must specify the following.

A gene to be used as a master profile. A profile search is always based on a master profile. Spotfire allows users to designate an existing and active profile as the master. Alternatively, a new master profile can be constructed by averaging several active profiles. It is possible to edit the designated master profile using the built-in editor function before embarking on a profile search (Support Protocol 1).

A similarity measure to be used. Similarity measures express the similarity between profiles in numeric terms, thus enabling users to rank profiles according to their similarity. Available methods include Euclidean Distance, Correlation, Cosine Correlation, CityBlock Distance, and Tanimoto (Sankoff and Kruskal, 1983).

Whether to include or exclude empty values from the calculation. If a profile contains a missing value and the user opts to exclude empty values, the calculated similarity between the profiles is then based only on the remaining part of the profile.
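
The similarity measures named above, and the ranking of profiles against a master, can be sketched in Python as follows. These are the standard textbook definitions, assumed here for illustration; Spotfire's exact formulas (in particular its Tanimoto variant for real-valued data) are not given in this unit. The function names, gene names, and data are hypothetical, and profiles are assumed to be complete and not constant.

```python
from math import sqrt


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_correlation(a, b):
    return dot(a, b) / (sqrt(dot(a, a)) * sqrt(dot(b, b)))


def correlation(a, b):
    """Pearson correlation: cosine of the mean-centered profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_correlation([x - mean_a for x in a], [y - mean_b for y in b])


def tanimoto(a, b):
    """Tanimoto coefficient extended to real-valued vectors (assumed form)."""
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))


def rank_profiles(profiles, master, measure, higher_is_more_similar):
    """Return (rank, name, score) tuples, most similar profile first."""
    scores = {name: measure(values, master) for name, values in profiles.items()}
    ordered = sorted(scores, key=scores.get, reverse=higher_is_more_similar)
    return [(rank, name, scores[name]) for rank, name in enumerate(ordered, start=1)]


# Hypothetical master profile and three candidate genes.
master = [1.0, 2.0, 3.0, 4.0]
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.5],      # close in shape and magnitude
    "geneB": [4.0, 3.0, 2.0, 1.0],      # opposite shape
    "geneC": [10.0, 20.0, 30.0, 40.0],  # same shape, much larger magnitude
}

by_correlation = rank_profiles(profiles, master, correlation, True)
by_euclidean = rank_profiles(profiles, master, euclidean_distance, False)
print([name for _, name, _ in by_correlation])
print([name for _, name, _ in by_euclidean])
```

In this example the two measures disagree about geneC: correlation ranks it first because its shape matches the master exactly, whereas Euclidean distance ranks it last because its magnitude is very different. This is why the choice of similarity measure should follow from the biological question.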

Necessary Resources

Hardware, software, and files are identical to those listed for Basic Protocol 1.

Figure 7.9.6 The Profile Search dialog box allows users to choose Value Columns to be used for this calculation as well as variables such as Similarity Measure and Calculation Options.

1. Activate the profile to be used as the master in the Profile Chart/Diagram view. Alternatively, mark a number of profiles on which to base the master profile.

2. If changing the master profile is desired, or to create a totally new profile, edit the master profile as described (see Support Protocol 1).

3. Click on Analysis, followed by Pattern Detection, followed by Profile Search in the Tools pane of the DecisionSite Navigator.

A Profile Search dialog box will appear (Fig. 7.9.6).

4. Select the Value Columns on which to perform the profile search. For multiple selections, hold down the Ctrl key while continuing to click the desired columns.

5. Click a radio button to choose to work with All Records or Selected Records (see Basic Protocol 1, step 3).

6. From the drop-down list, select a method to Replace Empty values (see Basic Protocol 1, step 4).

7. If both marked records and an active record exist, select whether to use the profile from the Active Record or the Average from Marked Records.


Only one record can be activated at a time (by clicking on the record in any visualization). An active record appears with a black circle around it. Several or all records present can be marked by clicking and drawing around them in any visualization. Marked data corresponding to these records can then be copied to the clipboard. See UNIT 7.7 for more information. Following this selection, the selected profile is displayed in the profile editor along with its name. At this point, the profile and its name can be edited in any manner desired.

8. Select the Similarity Measure to be used.

For a detailed description of similarity measures, see Sankoff and Kruskal (1983).

9. Type a Column Name for the resulting column or use the default. Check the Overwrite box if appropriate (see Basic Protocol 1, step 6).

10. Click OK.

This will cause the search to be performed and displayed in the editor, and the results to be added to the dataset as a new column. Additionally, a new scatter plot is created which displays rank versus similarity, and annotations containing information about the calculation settings are added to the Visualization. At the end of the profile search, selected profiles in the data are ranked according to their similarity to the selected master profile.

11. If desired, create a scatter plot between Similarity and Similarity Rank.

In such a plot, the record that is most similar to the master profile will be displayed in the lower left corner of the visualization.

SUPPORT PROTOCOL 1

EDITING A MASTER PROFILE

Since the starting profile does not restrict the user in any fashion, one can modify existing values to create a master profile of their choice.

Necessary Resources

Hardware, software, and files are identical to those listed for Basic Protocol 1.

Figure 7.9.7 The Profile Search: Edit dialog box allows users to edit an existing profile to create an imaginary profile upon which to base the search.

1. Activate the profile to be used for creating an edited master profile by simply clicking on the profile in the Profile Chart visualization.

2. Click Analysis, followed by Pattern Detection, followed by Profile Search in the Tools pane of the DecisionSite Navigator.

A profile search dialog box will appear.

3. Select the Value Columns on which to perform the profile search. For multiple selections, hold down the Ctrl key while continuing to click on the desired columns.

4. Click Edit.

This will open the profile search edit dialog box (Fig. 7.9.7).

5. Click directly in the editor to activate the variable to be changed. Drag the value to obtain a suitable look on the profile. Delete any undesirable value(s) using the Delete key on the keyboard.

The new value will be instantaneously displayed in the editor.

6. Type a profile name in the text box or use the default name.

7. Click OK.

This closes the editor and shows the edited profile in the profile search dialog box (Fig. 7.9.6).

8. If desired, revert to the original profile by clicking Use Profile From: Active Record.

The Edited radio button is selected by default.


BASIC PROTOCOL 3

COINCIDENCE TESTING

This tool can be used to compare two columns and determine whether the apparent similarity between the two distributions is a coincidence or not. Essentially, the coincidence testing tool calculates the probability of getting an outcome as extreme as the particular outcome under the null hypothesis (Tavazoie et al., 1999). This tool is particularly useful in comparing the results of several different clustering methods (e.g., see Basic Protocols 4 and 5, and Alternate Protocol 2).
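
The kind of probability involved can be illustrated with a short Python sketch. This unit does not state which distribution Spotfire uses, so the hypergeometric upper tail below is an assumption chosen because it is a common way to ask whether the overlap between two categories (e.g., a cluster from one method and a cluster from another) is larger than expected by chance. The function name and the counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total, in_first, in_second, overlap):
    """P(X >= overlap) for the overlap X of two categories under random assignment.

    total     -- number of records in the data set
    in_first  -- records belonging to the category from the first column
    in_second -- records belonging to the category from the second column
    overlap   -- records observed in both categories
    """
    max_overlap = min(in_first, in_second)
    favorable = sum(
        comb(in_first, k) * comb(total - in_first, in_second - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / comb(total, in_second)


# Hypothetical example: 20 genes, two clusters of 5 genes each, 4 genes shared.
p_value = hypergeom_upper_tail(total=20, in_first=5, in_second=5, overlap=4)
print(f"p = {p_value:.4f}")
```

A small probability indicates that the two category columns agree more than random assignment would explain, which is the sense in which the tool helps compare the results of different clustering methods.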

Necessary Resources

Hardware, software, and files are identical to those listed for Basic Protocol 1.

1. Click on Analysis, followed by Pattern Detection, followed by Coincidence Testing in the Tools pane of the DecisionSite Navigator.

A dialog box will be displayed (Fig. 7.9.8).

2. Select the First Category Column.

For example, in comparing the results of two different clustering methods, select the first one here.

3. Select the Second Category Column.


4. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

5. Type a Column Name for the resulting column or use the default.


Figure 7.9.8 The Coincidence Testing dialog box.

6. Select the Overwrite check box to overwrite a previous column with the same name (see Basic Protocol 1, step 6).

7. Click OK.

A new results column containing p-values is added to the dataset. An annotation may also be added.

HIERARCHICAL CLUSTERING

Hierarchical clustering arranges objects in a hierarchy with a tree-like structure based on the similarity between the objects. The graphical representation of the resulting hierarchy is known as a dendrogram (Eisen et al., 1998). In Spotfire DecisionSite, the vertical axis of the dendrogram consists of the individual records and the horizontal axis represents the clustering level. The individual records in the clustered data set are represented by the right-most nodes in the row dendrogram. Each remaining node in the dendrogram represents a cluster of all records that lie below it to the right in the dendrogram, thus making the left-most node in the dendrogram a cluster that contains all records.

Clustering is a very useful data reduction technique; however, it can easily be misapplied. The clustering results are highly affected by the choice of similarity measure and other input parameters. If possible, the user should replicate the clustering analysis using different methods.

The algorithm used in the Hierarchical Clustering tool is a hierarchical agglomerative method. This means that the cluster analysis begins with each record in a separate cluster, and in subsequent steps the two clusters that are the most similar are combined into a new aggregate cluster. The number of clusters is thereby reduced by one in each iteration step. Eventually, all records are grouped into one large cluster.
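
The agglomerative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not Spotfire's implementation: Euclidean distance and average linkage are assumptions made for the example (the tool offers several similarity measures and clustering methods), and the function names and data are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the records of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(data):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(data))]   # start: one record per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Each iteration reduces the number of clusters by one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical expression profiles for four records (two arrays each).
profiles = [[1.0, 1.0], [1.0, 1.2], [5.0, 5.0], [5.0, 5.5]]
history = agglomerate(profiles)
for left, right, distance in history:
    print(left, right, round(distance, 3))
```

The merge history is the information a dendrogram displays: each merge corresponds to one internal node, and the final merge corresponds to the node that contains all records.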

BASIC PROTOCOL 4

Necessary Resources

Hardware, software, and files are identical to those listed for Basic Protocol 1.

Initiating hierarchical clustering in Spotfire DecisionSite

1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in the Tools pane of the DecisionSite Navigator (Fig. 7.9.9).

The Hierarchical Clustering dialog box is displayed.

2. Select the Value Columns on which to base clustering. For multiple selections, hold down the Ctrl key and click on the desired columns or click on one of the columns and drag to select (Fig. 7.9.10).

3. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a Method to Replace Empty values with from the drop-down list (see Basic Protocol 1, step 4).

5. Select which Clustering Method to use for calculating the similarity between two clusters.

6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal, 1983).

Correlation measures are based on profile shape and are therefore better suited to complex microarray studies than measures such as Euclidean distance, which are based only on numeric similarity.

7. Select which Ordering Function to use while displaying results.


8. Use the default name or type a new column name in the text box. Check the Overwrite box if overwriting a previously added column with the same name (see Basic Protocol 1, step 6).


Figure 7.9.9 The Hierarchical Clustering algorithm can be accessed from the Tools as well as the Guides menu.

Figure 7.9.10 The Hierarchical clustering dialog box allows users to specify Value Columns to be included in the clustering calculation and various other calculation options such as the Clustering Method and Similarity Measure.


Figure 7.9.11 Hierarchical clustering results are displayed as a (default red-green) heat map with an associated dendrogram.

9. Select the Calculate Column Dendrogram check box if creating a column dendrogram is desired.

A column dendrogram arranges the most similar columns (experiments) next to each other.

10. Click OK.

The Hierarchical Clustering dialog box closes and clustering is initiated. The results are displayed according to the user's preferences in the dialog box (Fig. 7.9.11).

11. If desired, add the ordering column to the ordering dataset in order to compare the clustering results with other methods (see Support Protocol 2).

Marking and activating nodes

12. To mark a node in the row-dendrogram to the left of the heat map, click just outside it, drag to enclose the node within the frame that appears, and then release. Alternatively, press Ctrl and click on the node to mark it. To mark more than one node, hold down the Ctrl key and click on all the nodes to be marked. To unmark nodes, click and drag an area outside the dendrogram.

When one or more nodes are marked, that part of the dendrogram is shaded in green. The corresponding parts are also marked in the heat map and the corresponding visualizations.

13. To activate a node, click it in the dendrogram.


A black ring appears around the node. Only one node can be active at a given time. This node remains active until another node is activated. It is possible to zoom in on the active node by selecting Zoom to Active from the hierarchical clustering menu.


Zooming in and resizing a dendrogram

14. Zoom to a subtree in the row-dendrogram by using either the visualization zoom bar or by right clicking in the dendrogram and clicking Zoom to Active in the resulting pop-up menu. Alternatively, double click on a node.

15. To go one step back, double click on an area in the dendrogram not containing any part of a node. To return to the original zoom, click Reset Zoom.

16. If desired, adjust the space occupied by the dendrogram in the visualization by holding down the Ctrl key and using the left/right arrow keys on the keypad to slim or widen it.

Saving a dendrogram

NOTE: Dendrograms are not saved in the Spotfire data file (.sfs) but can be saved as .xml documents.

17. To save, select Save, followed by Row Dendrogram or Column Dendrogram from the Hierarchical Clustering menu.

18. Type the file name and save the file as a .dnd file.

Opening a saved dendrogram

19. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering in the Tools pane of the DecisionSite Navigator to display the Hierarchical Clustering dialog box.

20. Click on Open to display the Dendrogram Import dialog box.

21. Click on the Browse button by the Row Dendrogram field to display an Open File dialog box.

ADDING A COLUMN FROM HIERARCHICAL CLUSTERING

The ordering column that is added to the dataset when hierarchical clustering is performed is used only to display the row dendrogram and connect it to the heat map. In order to compare the results of hierarchical clustering with those of another method, such as K-means clustering (see Basic Protocol 5), a clustering column should be added to the data.

SUPPORT PROTOCOL 2

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

1. Perform Hierarchical Clustering on a dataset as described in Basic Protocol 4 and locate the row dendrogram, which can be found to the left of the heat map (Fig. 7.9.11).

2. If the cluster line is not visible, right click and select View from the resulting pop-up menu, followed by Cluster Scale.

The cluster line, which is the dotted red line in the row dendrogram, enables users to determine the number of clusters being selected.

3. Click on the red circle on the cluster slider above the dendrogram and drag it to control how many clusters should be included in the data column. Alternatively, use the left and right arrow keys on the keyboard to step through the possible numbers of clusters.


Figure 7.9.12 Hierarchical Clustering visualization allows users to zoom in and out of the heat map as well as the dendrogram. An individual cluster or a group of clusters can be marked and a data column added to the Spotfire session.


All clusters for the current position on the cluster slider are shown as red dots within the dendrogram. Positioning the red circle at the right-most position on the cluster slider yields one cluster for every record. Positioning it at the left-most position, on the other hand, places all records in a single cluster.

4. To retain a previously added cluster column, ensure that the Overwrite check box in the hierarchical clustering dialog is unchecked (see Basic Protocol 1, step 6).

5. Select Clustering, followed by Add New Clustering Column from the Hierarchical Clustering menu.

A column indicating the cluster to which each record belongs will be added to the dataset. Note that the records that are not included in the row dendrogram will have empty values in the new clustering column (Fig. 7.9.12).
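The idea behind the cluster slider, cutting one hierarchy at different levels to obtain anywhere from a single cluster to one cluster per record, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the choice of average linkage with Euclidean distance, and the example profiles are all hypothetical and are not taken from Spotfire.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def cluster_column(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain, then
    return one cluster label per record (the 'clustering column')."""
    clusters = [[i] for i in range(len(profiles))]  # one cluster per record
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(profiles)
    for cluster_id, members in enumerate(clusters, start=1):
        for record in members:
            labels[record] = cluster_id
    return labels


# Hypothetical profiles: two low-signal genes and two high-signal genes.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [9.0, 8.0, 7.0],
    [9.2, 8.1, 7.1],
]

two_clusters = cluster_column(profiles, 2)
one_per_record = cluster_column(profiles, len(profiles))
single_cluster = cluster_column(profiles, 1)
print(two_clusters, one_per_record, single_cluster)
```

Requesting as many clusters as there are records corresponds to the right-most slider position, and requesting one cluster corresponds to the left-most position.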

HIERARCHICAL CLUSTERING ON KEYS

A structure key is a string that lists the substructures (for example, various descriptions in the gene ontology tree). Clustering on keys therefore implies grouping genes with a similar set of substructures. Clustering on keys is based solely on the values within the key column, which should contain comma-separated values for some, if not all, records in the dataset. This is a valuable tool for determining whether there is an overlap between the expression data and gene ontology descriptions (UNIT 7.2).

ALTERNATE PROTOCOL 2

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

1. Click on Analysis, followed by Clustering, followed by Hierarchical Clustering on Keys in the Tools pane of the DecisionSite Navigator.

The Hierarchical Clustering dialog box will be displayed.

2. Select the Key Columns on which to base clustering.

The Key Column can be any string column in the data.

3. Select whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).

5. Select which Clustering Method to use for calculating the similarity between two clusters.

6. Select which Similarity Measure to use in the calculations (Sankoff and Kruskal, 1983).

7. Select which Ordering Function to use while displaying results.

8. Type a New Column Name or use the default in the text box. If desired, check the Overwrite check box to overwrite a previously added column with the same name (see Basic Protocol 1, step 6).

9. Select the Calculate Column Dendrogram check box to create a column dendrogram, if desired.

A column dendrogram arranges the most similar columns (experiments) next to each other.

10. Click OK.

The Hierarchical Clustering on Keys dialog box will close and clustering will be initiated. The results are displayed according to the user's preferences in the dialog box. A heat map and a row-dendrogram visualization are displayed and added to the dataset.
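To make the notion of "a similar set of substructures" concrete, the sketch below compares two comma-separated key strings by the fraction of terms they share (the Jaccard, or Tanimoto, coefficient). This is an assumption chosen for illustration; the protocol does not state which similarity measures Spotfire applies to key columns, and the key strings shown are hypothetical.

```python
def parse_key(key):
    """Split a comma-separated key string into a set of terms."""
    return {term.strip() for term in key.split(",") if term.strip()}


def jaccard_similarity(key_a, key_b):
    """Shared terms divided by all distinct terms in the two keys."""
    terms_a, terms_b = parse_key(key_a), parse_key(key_b)
    if not terms_a and not terms_b:
        return 0.0  # assumption: two empty keys are treated as dissimilar
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Hypothetical key column values for three genes.
gene_1 = "kinase activity, ATP binding"
gene_2 = "ATP binding, nucleus"
gene_3 = "kinase activity, ATP binding"

partial_overlap = jaccard_similarity(gene_1, gene_2)
identical_keys = jaccard_similarity(gene_1, gene_3)
print(partial_overlap, identical_keys)
```

Genes whose keys score close to 1 would be grouped early in a hierarchy built on such a measure, whereas genes with no shared terms score 0.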

BASIC PROTOCOL 5

K-MEANS CLUSTERING

K-means clustering is a method for grouping objects into a predetermined number of clusters based on their similarity (MacQueen, 1967). It is a type of nonhierarchical clustering in which the user must specify the number of clusters into which the data will eventually be divided. K-means clustering is an iterative process in which: (1) the user specifies the number of clusters for a data set, (2) a centroid (the center point of a cluster) is chosen for each cluster using one of several methods selected by the user, and (3) each record in the data set is assigned to the cluster whose centroid is closest to that record. Note that the proximity of each record to the centroid is determined on the basis of a user-defined similarity measure. The centroid for each cluster is then recomputed from the current members of the cluster. The assignment and recomputation steps are repeated until a steady state has been reached.
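The iteration described above can be sketched as follows. This is a minimal illustration, not Spotfire's implementation: it assumes Euclidean distance as the similarity measure, takes the initial centroids as an argument, and uses hypothetical two-experiment profiles. Unlike the Spotfire tool, it keeps (rather than discards) a centroid whose cluster becomes empty.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_profile(profiles):
    """Average profile (centroid) of a non-empty list of profiles."""
    return [sum(values) / len(profiles) for values in zip(*profiles)]


def k_means(profiles, initial_centroids, max_iterations=100):
    centroids = [list(c) for c in initial_centroids]  # do not modify input
    assignments = None
    for _ in range(max_iterations):
        # Assign each record to the cluster with the closest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda c: euclidean(profile, centroids[c]))
            for profile in profiles
        ]
        if new_assignments == assignments:
            break  # steady state: no record changed cluster
        assignments = new_assignments
        # Recompute each centroid from the current cluster members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(profiles, assignments) if a == c]
            if members:
                centroids[c] = mean_profile(members)
    return assignments, centroids


# Hypothetical profiles forming two obvious groups.
profiles = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
]
assignments, centroids = k_means(profiles, [profiles[0], profiles[3]])
print(assignments)
print(centroids)
```

Because the result depends on the initial centroids, running the sketch with different starting profiles is a simple way to see why the choice of initialization method matters.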

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

Performing K-means clustering

1. To initiate K-means clustering, click on Analysis, followed by Clustering, followed by K-means Clustering in the Tools pane of the DecisionSite Navigator.

The K-means clustering dialog box will be displayed (Fig. 7.9.13).

Figure 7.9.13 The K-means Clustering Tool dialog box allows the users to specify the number of desired clusters, the method of choice for initiating centroids, the similarity measure, and other variables.


2. Select the Value Columns on which to perform the analysis. For multiple selections, hold down the Ctrl key and click on the desired columns or click on one column at a time and drag.

3. Click on the radio button to specify whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values with from the drop-down list (see Basic Protocol 1, step 4).

5. Enter the Maximum Number of Clusters.

This is the number of clusters that the K-means tool will attempt to generate from the given data set. However, if empty clusters are generated, they will be discarded and the number of clusters displayed may be less than that specified.

6. Select a Cluster Initialization method from the drop-down menu.

The user must specify the number of clusters in which the data should be organized and a method for initializing the cluster centroids. Among the methods available for this purpose are the Data Centroid Based Search, Evenly Spaced Profiles, Randomly Generated Profiles, Randomly Selected Profiles, and Marked Records. These methods are summarized in Table 7.9.1.

7. Select a Similarity Measure to use from the drop-down menu.

Several different similarity measures are available to the K-means clustering tool. These measures express the similarity between different records as numbers, thereby making it possible to rank the records according to their similarity. These include Euclidean distance, Correlation, Cosine Correlation, and City-Block distance (Sankoff and Kruskal, 1983).

Table 7.9.1 Cluster Initiation Methods

Method: Data Centroid Based Search
Description: An average of all profiles in the data set is chosen to be the first centroid in this method. The similarity between the centroid and all members of the cluster is calculated using the defined similarity measure. The profile that is least fit in this group or which is least similar to the centroid is then assigned to be the centroid for the second cluster. The similarity between the second centroid and all the rest of the profiles is then calculated and all those profiles that are more similar to the second centroid than the first one are then assigned to the second cluster. Of the remaining profiles, the least similar profile is then chosen to be the third centroid and the above process is repeated. This process continues until the number of clusters specified by the user is reached.

Method: Evenly Spaced Profiles
Description: This method generates profiles to be used as centroids that are evenly distributed between the minimum and maximum value for each variable in the profiles in the data set. The centroids are calculated as the average values of each part between the minimum and the maximum values.

Method: Randomly Generated Profiles
Description: Centroids are assigned from random values based on the data set. Each value in the centroids is randomly selected as any value between the maximum and minimum for each variable in the profiles in the data set.

Method: Randomly Selected Profiles
Description: Randomly selected existing profiles (and not some derivation) from the data set are chosen to be the centroids of different clusters.

Method: From Marked Records
Description: Currently marked profiles (marked before initiating K-means clustering) are used as centroids of different clusters.


Figure 7.9.14 K-means clustering results are displayed as a group of profile charts. Each group is uniquely colored as specified by the check-box query device.

8. Type a new column name for the resulting column or use the default. Check the Overwrite check box to overwrite any previously existing column with the same name (see Basic Protocol 1, step 6).

9. Click OK.

The K-means dialog box will close and clustering will be initiated. At the end of clustering, the results are added to the data set as new columns and a graphical representation of the results can be visualized (Fig. 7.9.14).
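Two of the cluster initialization methods summarized in Table 7.9.1 can be sketched as follows. These functions are one possible reading of the table's descriptions, not Spotfire's actual code: for Evenly Spaced Profiles it is assumed that the range of each variable is divided into as many equal parts as there are clusters and that each centroid takes the midpoint of one part. The function names, the seed parameter, and the example profiles are hypothetical.

```python
import random


def evenly_spaced_profiles(profiles, n_clusters):
    """One centroid per equal-width part of each variable's range."""
    columns = list(zip(*profiles))  # one tuple of values per variable
    centroids = []
    for k in range(n_clusters):
        centroid = []
        for column in columns:
            low, high = min(column), max(column)
            width = (high - low) / n_clusters
            centroid.append(low + (k + 0.5) * width)  # midpoint of part k
        centroids.append(centroid)
    return centroids


def randomly_selected_profiles(profiles, n_clusters, seed=None):
    """Pick existing profiles, without replacement, as the centroids."""
    rng = random.Random(seed)
    return [list(p) for p in rng.sample(profiles, n_clusters)]


# Hypothetical profiles measured in two experiments.
profiles = [[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]]

even_centroids = evenly_spaced_profiles(profiles, 2)
random_centroids = randomly_selected_profiles(profiles, 2, seed=1)
print(even_centroids)
print(random_centroids)
```

Evenly spaced centroids are reproducible from run to run, whereas randomly selected centroids change unless a seed is fixed, which is one reason repeated runs can give different clusterings.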

PRINCIPAL COMPONENTS ANALYSIS

Principal components analysis (PCA) is a tool to reduce the dimensionality of complex data so that it can be easily interpreted without significant loss of information (Jolliffe, 1986). Often, this reduction in the dimensionality of data enables researchers to identify new, meaningful, underlying variables. PCA involves a mathematical procedure that converts high-dimensional data containing a number of (possibly) correlated variables into a new data set containing fewer uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. The new variables are linear combinations of the original variables, thereby making it possible to ascribe meaning to what they represent. This tool works best with transposed data (see Support Protocol 3).
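The statement that the first principal component captures as much of the variability as possible can be checked on a small example. The sketch below is an illustration only: it finds the first component by power iteration on the covariance matrix, which is a standard textbook technique and not necessarily the procedure Spotfire uses, and the two-variable data are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every variable has mean zero."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    n, dims = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def first_principal_component(cov, iterations=500):
    """Power iteration: converges to the direction of largest variance."""
    vector = [1.0] * len(cov)
    for _ in range(iterations):
        product = [sum(c * v for c, v in zip(row, vector)) for row in cov]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Variance along the component (Rayleigh quotient of a unit vector).
    variance = sum(v * sum(c * w for c, w in zip(row, vector))
                   for v, row in zip(vector, cov))
    return vector, variance


# Hypothetical data: the second variable is roughly twice the first.
rows = [[2.0, 4.1], [3.0, 6.0], [4.0, 7.9], [5.0, 10.1]]

cov = covariance_matrix(center_columns(rows))
component, component_variance = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = component_variance / total_variance
print(component, round(explained, 4))
```

Here a single component, a linear combination weighting the second variable about twice as heavily as the first, accounts for nearly all of the variability, so the two correlated variables can be replaced by one with little loss.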

BASIC PROTOCOL 6

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

Performing PCA

1. To initiate PCA, click on Analysis, followed by Clustering, followed by Principal Components Analysis in the Tools pane of the DecisionSite Navigator.

The PCA dialog box will open (Fig. 7.9.15).

2. Select the Value Columns on which to perform PCA. For multiple selections, hold down the Ctrl key and click on the desired columns or click on one column at a time and drag.

3. Click on the radio button to specify whether to work with All Records or Selected Records (see Basic Protocol 1, step 3).

4. Select a method to Replace Empty Values from the drop-down list (see Basic Protocol 1, step 4).

5. Specify the number of Principal Components.

The number of Principal Components is the total number of dimensions into which the user wishes to reduce the original data.

K-means clustering is an iterative process and is most valuable when it is repeated several times, using different numbers of defined clusters. There is no way to predict a good number of clusters for any data set. A pattern that is obvious with 20 clusters, and which the user might expect to be better defined with 50 clusters, may in fact not appear at all with 50 clusters. It is sometimes helpful to perform a hierarchical clustering prior to K-means clustering. By looking at the heat map and dendrogram generated by hierarchical clustering, the user will get some idea about how many clusters to specify for K-means clustering.


6. Type a new Column Name for the resulting column or use the default name. If desired, check the Overwrite box to overwrite a previously existing column with the same name (see Basic Protocol 1, step 6).


Figure 7.9.15 The PCA dialog box allows the users to specify which Value Columns should be included in the calculation. In addition, it allows users to define variables such as the number of desired components.

Figure 7.9.16 PCA results are displayed as 2-D or 3-D plots according to the user's specifications.


7. Select whether to create a 2D or a 3D scatter plot showing the Principal Components, or to perform the PCA calculations without creating a scatter plot by clearing the Create Scatter Plot check box. The 3D scatter plot can be rotated (Ctrl + right mouse key) or zoomed (Shift + right mouse key) to assist visualization.

8. Check the Generate Report box.

This report is an HTML page that contains information about the calculation. If the user does not wish to generate this report, this box can be left unchecked.

9. Click OK.

The Principal Components are now calculated and the results added to the data set as new columns. A new scatter plot and report are created according to the settings chosen in this protocol (Fig. 7.9.16). Note that the PCA tool in Spotfire is limited to 2000 columns of transposed data (i.e., 2000 records in original data). If more records are present at the time of running this tool, they will be eliminated from the data.

SUPPORT PROTOCOL 3

TRANSPOSING DATA IN SPOTFIRE DECISIONSITE

The Transpose Data tool is used to rotate a dataset so that columns (measurements or experiments) become rows and rows (genes) become columns. Often, transposition is necessary to present data for a certain type of visualization, e.g., Principal Components Analysis (PCA; see Basic Protocol 6), or just to get a good overview of the data. Consider Table 7.9.2 as an example. As more and more genes are added, the table will grow taller. (Most typical microarrays contain thousands to tens of thousands of genes.) While useful during data collection, this may not be the format of choice for certain types of visualizations or calculations. Transposing this table yields the format shown in Table 7.9.3.

Table 7.9.2 Typical Affymetrix or Two-Color Microarray Data [a]

Gene Name    Experiment 1    Experiment 2    Experiment 3
Gene A       250             283             219
Gene B       1937            80              1655
Gene C       71              84              77
Gene D       47358           131             39155
Gene E       28999           24107           24981
Gene F       689             801             750
Gene G       2004            2371            2205

[a] Analyzed microarray data typically consists of several rows, each representing a gene or a probe on the array, and several columns, each corresponding to different experiments (e.g., different tumors or treatments). This is the "tall-skinny" format.

Table 7.9.3 Microarray Data After Transposition [a]

Experiment      Gene A    Gene B    Gene C    Gene D    Gene E    Gene F    Gene G
Experiment 1    250       1937      71        47358     28999     689       2004
Experiment 2    283       80        84        131       24107     801       2371
Experiment 3    219       1635      77        39155     24981     750       2205

[a] After transposition, the data is flipped so that each row now represents an experiment whereas each column now represents the observations for a gene. This "short-wide" data format is suitable for data visualization techniques like PCA.
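The rotation from the "tall-skinny" to the "short-wide" format can be sketched in a few lines. The function and parameter names are hypothetical, and only three of the genes from Table 7.9.2 are used to keep the example short; this illustrates the idea of transposition rather than the behavior of the Transpose Data tool itself.

```python
def transpose_table(header, rows, new_identifier_name="Experiment"):
    """Turn each value column into a row and each record into a column.

    header: column names, with the identifier column first.
    rows:   one list per record, identifier value first.
    """
    identifiers = [row[0] for row in rows]  # become the new column names
    new_header = [new_identifier_name] + identifiers
    new_rows = []
    for column_index, column_name in enumerate(header[1:], start=1):
        new_rows.append([column_name] + [row[column_index] for row in rows])
    return new_header, new_rows


# Three genes taken from Table 7.9.2.
header = ["Gene Name", "Experiment 1", "Experiment 2", "Experiment 3"]
rows = [
    ["Gene A", 250, 283, 219],
    ["Gene C", 71, 84, 77],
    ["Gene D", 47358, 131, 39155],
]

new_header, new_rows = transpose_table(header, rows)
print(new_header)
for row in new_rows:
    print(row)
```

Each original record contributes one column to the result, which is why the number of records in the original data determines the width of the transposed table.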


Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

1. Open Transpose Data Wizard 1 by clicking Analysis, followed by Data Preparation, followed by Transpose Data in the Tools pane of the DecisionSite Navigator.

2. Select an identifier column from the drop-down list.

Each value in this column will become a column name in the transposed dataset.

3. Select whether to create columns from All Records or Selected Records (see Basic Protocol 1, step 3).

The transposed data will have exactly the same number of columns as records in the original data with an upper limit of 2000. The rest of the data will be truncated.

4. Click on Next to open Transpose Data Wizard 2.

5. Select the columns to be included in the transposition and then click Add>>.

Each selected column will become a record in the new dataset.

6. Click on Next to open Transpose Data Wizard 3.

7. If needed, select Annotation Columns.

Each transposed column is annotated with the value of this column.


8. Click Finish.

A message box opens prompting the user to save previous work.


9. Click Yes to save data.

The transposed data now replaces the previous data set. Note that the user should save the previous data set with a different file name to avoid losing that data set.

BASIC PROTOCOL 7

USING WEB LINKS TO QUERY THE INTERNET FOR USEFUL INFORMATION

The Web Links tool enables users to send a query to an external Web site to search for information about marked records. The search results are displayed in a separate Web browser. The Web Links tool is shipped with a number of predefined Web sites that are ready to use, though the user can easily set up new links to Web sites of their choice.

Necessary Resources

Hardware
Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed) or Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software
PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)
Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files
Data files (e.g., .met files, .gpr files)

Sending a query using Web links

In order to send a query, the data must be in Spotfire DecisionSite. The query is sent for the marked records in the visualizations. If more than one record is marked, the records are separated by the Web link delimiter (specified under Web Links Options) in the query.

1a. In a particular visualization, mark those records for which information is desired.


2a. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator.

The Web Links dialog box will be displayed (Fig. 7.9.17).


Figure 7.9.17 The Web Links dialog box allows users to specify the Web site to search and the Identifier column from which to formulate the query.

3a. Click to select the link to the Web site where the query will be sent.

Some Web sites only allow searching for one item at a time.

4a. If there are no hits from a search, mark one record at a time in the visualizations and try again.

5a. Select the Identifier Column to be used as input to the query.

Any column in the data set can be chosen.

6a. Click OK.

The query is sent to the Web site and the results are displayed in a new Web browser (Fig. 7.9.18).

Setting up a new Web link

1b. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator.

The Web Links dialog box will be displayed.

2b. Click on Options to cause the Web Links Options dialog box to be displayed.

3b. Click on New.

A new Web Link will be created and selected in the list of Available Web Links. The Preview shows what the finished query will look like when it is sent.

4b. Edit the name of the new link in the Web Link Name text box.

5b. Edit the URL to the Web link. Use a dollar sign within curly brackets {$} as a placeholder for ID.

Anything entered between the left bracket and the dollar sign will be placed before each ID in the query. In the same way, anything placed between the dollar sign and the right bracket will be placed after each ID in the query.


Figure 7.9.18 Results of a Web Link query are displayed in a new Web browser window. In this particular example, a significant outlier list of genes (Genbank Accession numbers) was queried using a Gene Annotation Database (created at the Hartwell Center for Bioinformatics and Biotechnology) and the results returned included Gene Descriptions and Gene Ontologies (UNIT 7.2) for the queried records.

6b. Enter the Delimiter to separate the IDs in a query.

The identifiers in a query with more than one record are put together in one search string separated by the selected delimiter. The delimiters AND, OR, or ONLY can be used. The ONLY delimiter is useful when specifying genes differentially expressed at one point of time only, or genes that result in classification of a particular kind of tumor only.

7b. Click OK.

The new Web Link will be saved and displayed together with the other available Web Links in the user interface.
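The way the {$} placeholder and the delimiter combine into a finished query can be illustrated with a short sketch. It follows the rules stated in this protocol (text before the dollar sign is placed before each ID, text after it is placed after each ID, and IDs are joined by the delimiter), but the function name, the example.org URL, the "[accn]" suffix, and the accession numbers are hypothetical. Whether Spotfire adds spaces around the delimiter or URL-encodes the result is not specified here, so the sketch does neither beyond what is passed in.

```python
import re

# Matches {prefix$suffix}; prefix and suffix may each be empty.
PLACEHOLDER = re.compile(r"\{([^{}$]*)\$([^{}$]*)\}")


def build_query(url_template, identifiers, delimiter):
    """Expand the {$} placeholder for every marked identifier."""
    match = PLACEHOLDER.search(url_template)
    if match is None:
        raise ValueError("URL template contains no {$} placeholder")
    prefix, suffix = match.group(1), match.group(2)
    joined = delimiter.join(prefix + identifier + suffix
                            for identifier in identifiers)
    return url_template[:match.start()] + joined + url_template[match.end():]


# Hypothetical Web link and identifiers from two marked records.
template = "http://www.example.org/search?term={$[accn]}"
marked_ids = ["U12345", "M60000"]

query = build_query(template, marked_ids, " OR ")
print(query)
```

Comparing the printed string with the Preview field in the Web Links Options dialog box is a quick way to confirm that a new link has been set up as intended.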

Editing a Web link

1c. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator.

The Web Links dialog box will be displayed.

2c. Click on Options to display the Web Links Options dialog box.

3c. Click on the Web Link to be edited in the list of Available Web Links.

The Web Link Name, URL, and Delimiter for the selected Web Link will be displayed and can be edited directly in the corresponding fields. All changes that are made are reflected in Preview, which helps show what the finished query will look like.

4c. Make desired changes to the Web Link and click OK.


The Web Link will be updated according to the changes and the Web Links Options dialog box will close.


Removing a Web link

1d. Click on Access, followed by Web Links in the Tools pane of the DecisionSite Navigator.

The Web Links dialog box is displayed.

2d. Click on Options to display the Web Links Options dialog box.

3d. Click on the Web Link to be removed in the list of Available Web Links.

The Web Link Name, URL, and Delimiter for the selected Web link will be displayed in the corresponding fields.

4d. Click Delete to clear all of the fields.

Many Web Links can be deleted at the same time if several Web Links are selected in the list of Available Web Links and Delete is clicked. Press Ctrl and click on the Web Links in the list to select more than one. If some of the default Web Links are deleted by mistake, they can be retrieved by clicking the Add Defaults button. This adds all of the default links to the Available Web Links list, regardless of whether or not the links already exist.

GENERATING NEW COLUMNS OF DATA IN SPOTFIRE

New columns with numerical values can be computed from the current data set by using mathematical expressions. This protocol describes how to create and evaluate such expressions. Occasionally the columns included in a data set do not allow users to perform all necessary operations, or to create the visualizations needed to fully explore the data set. Still, in many cases, the necessary information can be computed from existing columns. Spotfire provides the option to calculate new columns by applying mathematical operators to existing values. For example, it may be necessary to compute the fold change when dealing with multiple array experiments. It can easily be computed by dividing the normalized signal values of the experimental array by the normalized signal values of the control array for every gene. For a discussion of normalizing data see UNIT 7.8. This protocol discusses dividing two columns as an example. Other calculations can be similarly performed. Spotfire supports the functions listed in Table 7.9.4 in expressions used for calculating new columns.
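The fold-change calculation described above amounts to a row-by-row division of two columns, as in the sketch below. The signal values are hypothetical, and returning an empty value (None) when the control signal is zero or missing is an assumption made for this illustration, not a description of how Spotfire handles such cases. The log2 column is included only because fold changes are commonly viewed on a log scale.

```python
import math


def fold_change(experiment, control):
    """Divide experiment by control for every gene; None marks an empty value."""
    result = []
    for exp_signal, ctrl_signal in zip(experiment, control):
        if exp_signal is None or ctrl_signal is None or ctrl_signal == 0:
            result.append(None)  # assumption: undefined ratios are left empty
        else:
            result.append(exp_signal / ctrl_signal)
    return result


def log2_column(values):
    """Base-2 logarithm of each positive value; other values stay empty."""
    return [math.log2(v) if v is not None and v > 0 else None for v in values]


# Hypothetical normalized signal values for four genes.
experiment_signal = [500.0, 120.0, 80.0, 40.0]
control_signal = [250.0, 120.0, 160.0, 0.0]

ratios = fold_change(experiment_signal, control_signal)
log_ratios = log2_column(ratios)
print(ratios)
print(log_ratios)
```

On the log2 scale a twofold increase and a twofold decrease are symmetric around zero (1 and -1), which makes up- and down-regulated genes easier to compare in a visualization.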

BASIC PROTOCOL 8
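For readers who want to check a fold-change column outside Spotfire, the calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Spotfire's implementation: the function name `fold_change`, the column names `cy5_norm` and `cy3_norm`, and the sample values are all hypothetical, and returning `None` for a zero control value is an assumption chosen to mirror the #NUM result that Table 7.9.4 describes for division by zero.

```python
def fold_change(experimental, control):
    """Divide experimental by control, gene by gene.

    Returns None where the ratio is undefined (missing value or zero
    control), loosely mirroring the #NUM result noted in Table 7.9.4.
    """
    result = []
    for exp_value, ctl_value in zip(experimental, control):
        if exp_value is None or ctl_value is None or ctl_value == 0:
            result.append(None)
        else:
            result.append(exp_value / ctl_value)
    return result


# Hypothetical normalized signals for three genes (invented values).
rows = [
    {"gene": "gene_1", "cy5_norm": 1200.0, "cy3_norm": 600.0},
    {"gene": "gene_2", "cy5_norm": 300.0, "cy3_norm": 600.0},
    {"gene": "gene_3", "cy5_norm": 50.0, "cy3_norm": 0.0},
]

ratios = fold_change(
    [row["cy5_norm"] for row in rows],
    [row["cy3_norm"] for row in rows],
)

# Attach the result as a new column, as the protocol does in Spotfire.
for row, ratio in zip(rows, ratios):
    row["fold_change"] = ratio
```

With these invented values the new column holds 2.0, 0.5, and an undefined entry for the gene whose control signal is zero.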

Necessary Resources

Hardware

Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed)
or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software

PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)


Table 7.9.4 Description of Various Functions Available in Spotfire

Function    Format                            Description
ABS         ABS(Arg1)                         Returns the unsigned value of Arg1
ADD         Arg1 + Arg2                       Adds the two real number arguments and returns a real number result
CEIL        CEIL(Arg1)                        Arg1 rounded up; that is, the smallest integer which is ≥ Arg1
COS         COS(Arg1)                         Returns the cosine of Arg1 (a)
DIVIDE      Arg1 / Arg2                       Divides Arg1 by Arg2 (real numbers) (b)
EXP         EXP(Arg1, Arg2) or Arg1 ^ Arg2    Raises Arg1 to the power of Arg2
FLOOR       FLOOR(Arg1)                       Returns the largest integer which is ≤ Arg1 (i.e., rounds down)
LOG         LOG(Arg1)                         Returns the base 10 logarithm of Arg1
LN          LN(Arg1)                          Returns the natural logarithm of Arg1
MAX         MAX(Arg1, Arg2, . . .)            Returns the largest of the real number arguments (null arguments are ignored)
MIN         MIN(Arg1, Arg2, . . .)            Returns the smallest of the real number arguments (null arguments are ignored)
MOD         MOD(Arg1, Arg2)                   Returns the remainder from integer division
MULTIPLY    Arg1 * Arg2                       Multiplies two real number arguments to yield a real number result
NEG         NEG(Arg1)                         Negates the argument
SQRT        SQRT(Arg1)                        Returns the square root of Arg1 (c)
SUBTRACT    Arg1 - Arg2                       Subtracts Arg2 from Arg1 (real numbers) to yield a real number result
SIN         SIN(Arg1)                         Returns the sine of Arg1 (a)
TAN         TAN(Arg1)                         Returns the tangent of Arg1 (a)

(a) The argument is in radians.
(b) If Arg2 is zero, this function results in an error. Examples: 7/2 yields 3.5, 0/0 yields #NUM, 1/0 yields #NUM.
(c) The result can also be attained by supplying an Arg2 of 0.5 using the EXP function.

Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)

Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files

Data files (e.g., .met files, .gpr files)


Dividing two columns

1. Initiate a new Spotfire session and load data.

For example, load a few data columns from a .gpr file.


Figure 7.9.19 Right clicking in the Query Devices window allows generation of new columns.

Figure 7.9.20 The New Columns dialog box.


2. Right click in the Query Devices window. From the resulting pop-up menu (Fig. 7.9.19), choose New Column, followed by From Expression.

A New Column dialog box will appear (Fig. 7.9.20).

3. From the Operators drop-down list, select "/" (Table 7.9.4).

4. Select the desired columns for Arguments 1 and 2.

For example, select the normalized 635 (Cy-5) signal column as Argument 1 and the normalized 532 (Cy-3) signal column as Argument 2.

5. Click Insert Function.

6. Click Next >.

7. Enter a name for the new column, for example fold change. If the function just created is likely to be used again later, save it as a Favorite by clicking Add To Favorites.

After being saved, it will appear in the list of Favorites, and can be used again by selecting it and clicking Insert Favorite.

8. Click Finish.

A new column of data will be added to the session.

BASIC PROTOCOL 9

EXPORTING SPOTFIRE VISUALIZATIONS

Microarray data analysis techniques usually involve rigorous computation. Most steps can be tracked and understood by novice users through the use of visualizations in two or three dimensions, with a striking use of colors to demonstrate changes or groupings. UNIT 7.7 provides a detailed discussion of modifying and enhancing visualizations. It is often desirable to export these visualizations from Spotfire to other applications. Currently, Spotfire visualizations can be exported in five ways: to Microsoft Word, to Microsoft PowerPoint, as a Web page, by copying to the clipboard, or as an image file from the File menu.

Necessary Resources

Hardware

Workstation with Intel Pentium (100 MHz) processor or equivalent, 64 MB RAM, 20 MB disk space, and VGA or better display with 800 × 600 pixels resolution (user may benefit from much higher RAM and a significantly better processor speed)
or
Apple Macintosh PowerPC with 8 MB available memory, 2 MB free disk space, 256 color (or better) video display, and a network interface card (NIC) for network connections to MetaFrame servers

Software

PC:
Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000
Microsoft Internet Explorer 5.0 through 6.0
Spotfire 6.2 or above
Microsoft Data Access Components (MDAC) versions 2.1 sp2 (2.1.2.4202.3) through version 2.5 (2.50.4403.12)
Web connection to the Spotfire server (http://home.spotfire.net) or local customer-specific Spotfire Server
Microsoft PowerPoint, Word, and Excel (optional for Spotfire features related to export of text results or visualizations)


Figure 7.9.21 The Microsoft Word Presentation dialog box.

Macintosh:
Operating system (OS) 7.5.3 or later
Citrix ICA client (http://www.citrix.com)
Open Transport TCP/IP Version 1.1.1 (or later)

Files

Data files (e.g., .met files, .gpr files)

Exporting visualizations to Word

The Microsoft Word export tool exports the active visualization(s) to a Microsoft Word document. Each visualization is added to a new page in the document along with its annotation, title, and legend. Note that Microsoft Word must be installed on the machine.

1a. Create visualizations in Spotfire (UNIT 7.7) and, if necessary, edit the Titles and Annotations.

2a. Click on Reporting, followed by Microsoft Word in the Tools pane of DecisionSite Navigator.

A dialog box will be displayed listing all the visualizations that can be exported (Fig. 7.9.21).

3a. Click to select the visualizations to be exported. To select all, click on Select All. For multiple selections, hold down the Ctrl key and select the desired visualizations.

4a. Click OK.

An instance of Microsoft Word will be displayed that contains the selected visualizations.

Exporting visualizations to PowerPoint

The Microsoft PowerPoint export tool exports the active visualization(s) to a Microsoft PowerPoint presentation. Each visualization is added to a new slide along with its annotation, title, and legend. Note that Microsoft PowerPoint must be installed on the machine.


1b. Create visualizations in Spotfire and, if necessary, edit the Titles and Annotations.

2b. Click on Reporting, followed by Microsoft PowerPoint in the Tools pane of DecisionSite Navigator.

A dialog box listing all the visualizations that can be exported will be displayed, similar to the one shown in Figure 7.9.21.

3b. Click to select the visualizations to be exported. To select all, click on Select All. For multiple selections, hold down the Ctrl key and select the desired visualizations.

4b. Click OK.

An instance of Microsoft PowerPoint that contains selected visualizations will be displayed.

Exporting visualizations as a Web page

The Export as Web Page tool exports the current visualizations as an HTML file and a set of images. The user can also include annotations, titles, and legends for the visualization.

1c. Create the desired visualizations and set the query devices. If multiple visualizations are to be included, ensure that they are all visible and are in the right proportions.

This is important because unlike the export to Word or PowerPoint features where each visualization is pasted on a new page (or slide) in the document, all visualizations are exported to the same page in this case. Visualizations are included in the report exactly as they are visible on the screen. Multiple visualizations can be tiled by clicking Window, followed by Auto Tile.

2c. Click on Reporting, followed by Export as Web Page in the Tools pane of DecisionSite Navigator.

The Export as Web page dialog box will be displayed (Fig. 7.9.22).

3c. Enter a report header.

This header will appear at the top of the Web Page Report.

4c. Check the options to include in the report.


Figure 7.9.22 The Export as Web Page dialog box.


Figure 7.9.23 Data exported from a Spotfire session to the Web is displayed as a Web page report containing all the images as well as marked records.

These include Legend, Annotations, SQL query (corresponding to the current query devices setting), and a table of currently marked records.

5c. Select a graphic output format for the exported images (.jpg or .png).

6c. Click Save As. Enter a file name and a directory where the report is to be saved.

The HTML report will be saved in the designated directory along with a subfolder containing the exported images.

7c. If desired, select View Report After Saving.

A browser window will be launched, displaying the report (Fig. 7.9.23).

Copying to clipboard

This tool enables users to copy any active visualization to the clipboard and paste it to another application.

1d. Create the desired visualizations and set the Query Devices.

2d. From the menu bar, click on Edit, followed by Copy Special, followed by Visualization (Fig. 7.9.24).

The active visualization will be copied to the clipboard.

3d. Open an instance of the desired application and paste from the clipboard.

Exporting visualizations from the File menu

This option allows users to export the current visualization from the File menu as an image file (e.g., .jpg or .bmp).

1e. Create the desired visualizations and set the Query Devices.


Figure 7.9.24 Exporting the currently active visualization using the Copy Special, Visualization mode.

Figure 7.9.25 The Export Visualization dialog box.

2e. Click on File, followed by Export, followed by Current Visualization.

The Export Visualization dialog box will open.

3e. Select whether to Include Title or use the default for the visualization to be exported.

The title is exported along with the visualization.


4e. Select Preserve Aspect Ratio or change the size of the visualization to be exported by changing the aspect settings.

5e. Click OK (Fig. 7.9.25).


6e. Choose the directory in which to save the visualization from the ensuing window. Also specify the format in which the visualization should be saved.

Available choices include bitmap (.bmp), JPEG image (.jpg), PNG image (.png), and enhanced Windows metafile (.emf).

7e. Click Save.

GUIDELINES FOR UNDERSTANDING RESULTS

The goal of most microarray experiments is to survey patterns of gene expression by assaying the expression levels of thousands to tens of thousands of genes in a single assay. Typically, RNA is first isolated from different tissues, developmental stages, disease states, or samples subjected to appropriate treatments. The RNA is then labeled and hybridized to the microarrays using an experimental strategy that allows expression to be assayed and compared between appropriate sample pairs. Common strategies include the use of a single label and independent arrays for each sample (Affymetrix), or a single array with distinguishable fluorescent dye labels for the individual RNAs (most homemade two-color spotted microarray platforms).

Irrespective of the type of platform chosen, microarray data analysis is a challenge. The hypothesis underlying microarray analysis is that the measured intensities for each arrayed gene represent its relative expression level. Biologically relevant patterns of expression are typically identified by comparing measured expression levels between different states on a gene-by-gene basis. Before the levels can be compared appropriately, a number of transformations must be carried out on the data to eliminate questionable or low-quality measurements, to adjust the measured intensities to facilitate comparisons, and to select genes that are significantly differentially expressed between classes of samples.

Most microarray experiments investigate relationships between related biological samples based on patterns of expression, and the simplest approach looks for genes that are differentially expressed. Although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and down-regulated genes differently. For example, genes up-regulated by a factor of two have an expression ratio of 2, whereas those down-regulated by the same factor have an expression ratio of 0.5.
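The asymmetry of raw ratios can be made concrete with a short Python sketch. It is offered only as an illustration of the arithmetic: the function names and the signal values (100 and 200) are invented for this example, and the base-2 logarithm shown alongside the ratio is the transformation discussed in the next paragraph.

```python
import math


def expression_ratio(experimental, control):
    """Plain expression ratio: experimental signal over control signal."""
    return experimental / control


def log2_ratio(experimental, control):
    """Base-2 logarithm of the expression ratio."""
    return math.log2(experimental / control)


# Hypothetical signals: one gene doubles, another is halved.
up_ratio = expression_ratio(200.0, 100.0)    # twofold up-regulation
down_ratio = expression_ratio(100.0, 200.0)  # twofold down-regulation

# On the ratio scale the two changes sit at different distances from 1;
# on the log2 scale they are equal in size and opposite in sign.
up_log = log2_ratio(200.0, 100.0)
down_log = log2_ratio(100.0, 200.0)
```

Here the ratios are 2 and 0.5, which lie 1.0 and 0.5 away from the "no change" value of 1, whereas the log2 values are +1 and -1, equally far from the "no change" value of 0.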
The most widely used alternative transformation of the ratio is the logarithm base two, which has the advantage of producing a continuous spectrum of values and treating up- and down-regulated genes in a similar fashion.

Normalization adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made. There are a number of reasons why data must be normalized, including unequal quantities of starting RNA, differences in labeling or detection efficiencies between the fluorescent dyes used, and systematic biases in the measured expression levels.

Expression data can be mined efficiently if the problem of similarity is converted into a mathematical one by defining an expression vector for each gene that represents its location in expression space. In this view of gene expression, each experiment represents a separate, distinct axis in space and the log2(ratio) measured for that gene in that experiment represents its geometric coordinate. For example, if there are three experiments, the log2(ratio) for a given gene in experiment 1 is its x coordinate, the log2(ratio) in experiment 2 is its y coordinate, and the log2(ratio) in experiment 3 is its z coordinate. It is then possible to represent all the information obtained about that gene by a point in x-y-z-expression space. A second gene, with nearly the same log2(ratio) values for each experiment, will be represented by a (spatially) nearby point in expression space; a gene with a very different pattern of expression will be far from the original gene. This
model can be generalized to any number of experiments: expression data can be represented in n-dimensional expression space, where n is the number of experiments, and each gene-expression vector is represented as a single point in that space. Having been provided with a means of measuring distance between genes, clustering algorithms sort the data and group genes together on the basis of their separation in expression space.

It should also be noted that if the interest is in clustering experiments, it is possible to represent each experiment as an experiment vector consisting of the expression values for each gene; these define an experiment space, the dimensionality of which is equal to the number of genes assayed in each experiment. Again, by defining distances appropriately, it is possible to apply any of the clustering algorithms defined here to analyze and group experiments.

To interpret the results from any analysis of multiple experiments, it is helpful to have an intuitive visual representation. A commonly used approach relies on the creation of an expression matrix in which each column of the matrix represents a single experiment and each row represents the expression vector for a particular gene. Coloring each of the matrix elements on the basis of its expression value creates a visual representation of gene-expression patterns across the collection of experiments. There are countless ways in which the expression matrix can be colored and presented. The most commonly used method colors genes on the basis of their log2(ratio) in each experiment, with log2(ratio) values close to zero colored black, those with log2(ratio) values greater than zero colored red, and those with negative values colored green. For each element in the matrix, the relative intensity represents the relative expression, with brighter elements being more highly differentially expressed.
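The expression-space picture described above can be sketched in Python as follows. The text does not prescribe a particular distance measure, so Euclidean distance is used here purely as an assumption; the gene names and log2(ratio) values are invented, and the helper `nearest_gene` is hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two expression vectors."""
    if len(u) != len(v):
        raise ValueError("expression vectors must cover the same experiments")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Each gene is a point whose coordinates are its log2(ratio) values
# in three hypothetical experiments (x, y, z).
genes = {
    "gene_a": [1.0, -0.5, 2.0],
    "gene_b": [1.1, -0.4, 1.9],   # nearly the same pattern as gene_a
    "gene_c": [-2.0, 1.5, -1.0],  # a very different pattern
}


def nearest_gene(name, vectors):
    """Return the name of the gene closest to `name` in expression space."""
    others = [other for other in vectors if other != name]
    return min(others, key=lambda other: euclidean_distance(vectors[name], vectors[other]))


closest = nearest_gene("gene_a", genes)
```

In this toy data set gene_b lies close to gene_a while gene_c lies far away, which is the sense in which similar expression patterns correspond to nearby points. The same function applies unchanged to experiment vectors if the rows and columns of the expression matrix are swapped.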
For any particular group of experiments, the expression matrix generally appears without any apparent pattern or order. Programs designed to cluster data generally re-order the rows, columns, or both, such that patterns of expression become visually apparent when presented in this fashion.

Before clustering the data, there are two further questions that need to be considered. First, should the data be adjusted in some way to enhance certain relationships? Second, what distance measure should be used to group related genes together?

In many microarray experiments, the data analysis can be dominated by the variables that have the largest values, obscuring other, important differences. One way to circumvent this problem is to adjust or re-scale the data, and there are several methods in common use with microarray data. For example, each vector can be re-scaled so that the average expression of each gene is zero, a process referred to as mean centering. In this process, the basal expression level of a gene is subtracted from each experimental measurement. This has the effect of enhancing the variation of the expression pattern of each gene across experiments, without regard to whether the gene is primarily up- or down-regulated. This is particularly useful for the analysis of time-course experiments, in which one might like to find genes that show similar variation around their basal expression level. The data can also be adjusted so that the minimum and maximum are one or so that the "length" of each expression vector is one.

Various clustering techniques have been applied to the identification of patterns in gene-expression data. Most cluster analysis techniques are hierarchical; the resultant classification has an increasing number of nested classes and the result resembles a phylogenetic classification.
Nonhierarchical clustering techniques also exist, such as K-means clustering, which simply partition objects into different clusters without trying to specify the relationship between individual elements. Clustering techniques can further be classified as divisive or agglomerative. A divisive method begins with all elements in one cluster that is gradually broken down into smaller and smaller clusters. Agglomerative
techniques start with (usually) single-member clusters and gradually fuse them together. Finally, clustering can be either supervised or unsupervised. Supervised methods use existing biological information about specific genes that are functionally related to guide the clustering algorithm. However, most methods are unsupervised and these are dealt with first. Although cluster analysis techniques are extremely powerful, great care must be taken in applying this family of techniques. Even though the methods used are objective in the sense that the algorithms are well defined and reproducible, they are still subjective in the sense that selecting different algorithms, different normalizations, or different distance metrics, will place different objects into different clusters. Furthermore, clustering unrelated data will still produce clusters, although they might not be biologically meaningful. The challenge is therefore to select the data and to apply the algorithms appropriately so that the classification that arises partitions the data sensibly.
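The sensitivity of cluster membership to preprocessing can be seen in a toy example. The Python sketch below is illustrative only and does not reflect how Spotfire implements clustering: the four genes and their values are invented, Euclidean distance and average linkage are assumed, mean centering is the re-scaling described earlier in this section, and the naive merge loop favors readability over speed.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_center(vector):
    """Subtract the gene's average so that its mean expression is zero."""
    mean = sum(vector) / len(vector)
    return [x - mean for x in vector]


def average_linkage(cluster1, cluster2, data):
    """Average of all pairwise distances between members of two clusters."""
    dists = [euclidean(data[i], data[j]) for i in cluster1 for j in cluster2]
    return sum(dists) / len(dists)


def agglomerate(data, n_clusters):
    """Start with single-member clusters and fuse the closest pair
    repeatedly until only n_clusters remain."""
    clusters = [[name] for name in sorted(data)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented profiles: A and B rise, C and D fall; A and C are low, B and D high.
raw = {
    "A": [1.0, 2.0, 3.0],
    "B": [11.0, 12.0, 13.0],
    "C": [3.0, 2.0, 1.0],
    "D": [13.0, 12.0, 11.0],
}
centered = {name: mean_center(vec) for name, vec in raw.items()}

raw_clusters = agglomerate(raw, 2)
centered_clusters = agglomerate(centered, 2)
```

On the raw values the genes group by overall level (A with C, B with D); after mean centering they group by the shape of their profiles (A with B, C with D). Neither answer is wrong; they answer different questions, which is why the choice of adjustment has to be made deliberately.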

Hierarchical Clustering

Hierarchical clustering is simple and the result can be visualized easily. It is an agglomerative type of clustering in which single expression profiles are joined to form groups, which are further joined until the process has been carried to completion, forming a single hierarchical tree. First, the pairwise distance matrix is calculated for all of the genes to be clustered. Second, the distance matrix is searched for the two most similar genes or clusters; initially each cluster consists of a single gene. This is the first true stage in the clustering process. Third, the two selected clusters are merged to produce a new cluster that now contains at least two objects. Fourth, the distances are calculated between this new cluster and all other clusters. There is no need to calculate all distances as only those involving the new cluster have changed. Last, steps two through four are repeated until all objects are in one cluster.

There are several variations on hierarchical clustering that differ in the rules governing how distances are measured between clusters as they are constructed. Each of these will produce slightly different results, as will any of the algorithms if the distance metric is changed. Typically for gene-expression data, average-linkage clustering gives acceptable results.

K-Means Clustering

If there is advance knowledge about the number of clusters that should be represented in the data, K-means clustering is a good alternative to hierarchical methods. In K-means clustering, objects are partitioned into a fixed number (K) of clusters, such that the clusters are internally similar but externally dissimilar. First, all initial objects are randomly assigned to one of K clusters (where K is specified by the user). Second, an average expression vector is then calculated for each cluster and this is used to compute the distances between clusters.
Third, using an iterative method, objects are moved between clusters and intra- and intercluster distances are measured with each move. Objects are allowed to remain in the new cluster only if they are closer to it than to their previous cluster. Fourth, after each move, the expression vectors for each cluster are recalculated. Last, the shuffling proceeds until moving any more objects would make the clusters more variable, increasing intracluster distances and decreasing intercluster dissimilarity.

Self-Organizing Maps

A self-organizing map (SOM) is a neural-network-based divisive clustering approach that assigns genes to a series of partitions on the basis of the similarity of their expression vectors to reference vectors that are defined for each partition. Before initiating the analysis, the user defines a geometric configuration for the partitions, typically a two-dimensional
rectangular or hexagonal grid. Random vectors are generated for each partition, but before genes can be assigned to partitions, the vectors are first trained using an iterative process that continues until convergence so that the data are most effectively separated. In choosing the geometric configuration for the clusters, the user is, effectively, specifying the number of partitions into which the data is to be divided. As with K-means clustering, the user has to rely on some other source of information, such as PCA, to determine the number of clusters that best represents the available data.
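Because both K-means and SOMs require the number of partitions up front, it can help to see how K enters the computation. The following Python sketch of K-means is a minimal illustration under stated assumptions, not Spotfire's implementation: it uses the common batch variant in which all objects are reassigned to their nearest cluster average in each pass (rather than moving objects one at a time as described above), squared Euclidean distance is assumed, and the function names, seed, and data points are invented.

```python
import random


def centroid(vectors):
    """Average expression vector of a cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, seed=0, max_iter=100):
    """Return one cluster label (0..k-1) per input vector."""
    rng = random.Random(seed)
    order = list(range(len(vectors)))
    rng.shuffle(order)
    # Random initial assignment that leaves no cluster empty.
    labels = [0] * len(vectors)
    for position, index in enumerate(order):
        labels[index] = position % k
    for _ in range(max_iter):
        # Recalculate the average vector of each cluster.
        centroids = []
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            centroids.append(centroid(members) if members else None)
        # Move each object to the cluster whose average is closest;
        # a cluster that has become empty is simply skipped.
        new_labels = []
        for v in vectors:
            candidates = [
                (squared_distance(v, cen), c)
                for c, cen in enumerate(centroids)
                if cen is not None
            ]
            new_labels.append(min(candidates)[1])
        if new_labels == labels:  # no object moved: converged
            break
        labels = new_labels
    return labels


# Two invented, well-separated groups of expression vectors.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = kmeans(points, k=2)
```

With K set to 2 the two groups are recovered, although which group receives label 0 depends on the random start. Choosing K larger or smaller than the number of natural groups still returns exactly K or fewer clusters, which is why an independent estimate of the cluster count is useful.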

Principal Component Analysis

An analysis of microarray data is a search for genes that have similar, correlated patterns of expression. This indicates that some of the data might contain redundant information. For example, if a group of experiments were more closely related than the researcher had expected, it would be possible to ignore some of the redundant experiments, or use some average of the information without loss of information.

Principal component analysis (PCA) is a mathematical technique that reduces the effective dimensionality of gene-expression space without significant loss of information while also allowing the user to pick out patterns in the data. PCA allows the user to identify those views that give the best separation of the data. This technique can be applied to both genes and experiments as a means of classification. PCA is most useful when combined with another classification technique, such as K-means clustering or SOMs, that requires the user to specify the number of clusters.
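The idea of redundant experiments can be illustrated with a small Python sketch that finds only the first principal component. This is a teaching example under several assumptions and is not how Spotfire computes PCA: it uses power iteration on the covariance matrix, the function name and data are invented, and the fixed starting vector would fail in the unlikely case that it is exactly perpendicular to the first component.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, fraction_of_variance) for the first component.

    rows: one list per gene, one value per experiment.
    """
    n = len(rows)
    dims = len(rows[0])
    # Center each experiment (column) on its mean.
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix between experiments.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
        for a in range(dims)
    ]
    # Power iteration: repeated multiplication pulls the vector toward
    # the direction of greatest variance.
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    projected = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
    variance = sum(vec[a] * projected[a] for a in range(dims))
    total = sum(cov[d][d] for d in range(dims))
    fraction = variance / total if total else 0.0
    return vec, fraction


# Invented data: experiment 2 is exactly twice experiment 1,
# and experiment 3 does not vary at all.
rows = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 5.0],
    [3.0, 6.0, 5.0],
    [4.0, 8.0, 5.0],
]
direction, fraction = first_principal_component(rows)
```

In this contrived case a single component accounts for all of the variance, so three experiments collapse to one axis with no loss of information. Real expression data are noisier, and several components are usually needed to retain most of the variance.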

COMMENTARY

Background Information

DNA microarray analysis has become one of the most widely used techniques in modern molecular genetics and protocols have developed in the laboratory in recent years that have led to increasingly robust assays. The application of microarray technologies affords great opportunities for exploring patterns of gene expression and allows users to begin investigating problems ranging from deducing biological pathways to classifying patient populations. As with all assays, the starting point for developing a microarray study is planning the comparisons that will be made. The simplest experimental designs are based on the comparative analysis of two classes of samples, either using a series of paired case-control comparisons or comparisons to a common reference sample, although other approaches have been described; however, the fundamental purpose for using arrays is generally a comparison of samples to find genes that are significantly different in their patterns of expression. Microarrays have led biological and pharmaceutical research to increasingly higher throughput because of the value they bring in measuring the expression of numerous genes in parallel. The generation of all this data, however, loses much of its potential value unless important conclusions can be extracted from large data sets quickly enough to interpret the results and influence the next experimental and/or clinical steps. Generating and understanding robust and efficient tools for data mining, including experimental design, statistical analysis, data visualization, data representation, and database design, is of paramount importance. Obtaining maximal value from experimental data involves a team effort that includes biologists, chemists, pharmacologists, statisticians, and software engineers. In this unit, the authors describe data analysis techniques used in their center for analysis of large volumes of homemade and commercial Affymetrix microarrays. 
An attempt has been made to describe microarray data analysis methods in a language that most biologists can understand. In our view, the knowledge and expertise of biologists, who ensure that the right experiments are carried out, are essential for correct interpretation of microarray data.


Literature Cited

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.

Jolliffe, I.T. 1986. Springer Series in Statistics, 1986: Principal Component Analysis. Springer-Verlag, New York.

Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201.

MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol I. (L.M. Le Cam and J. Neyman, eds.) pp. 281-297. University of California Press, Berkeley, Calif.

Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley Publishing, Reading, Mass.

Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285.

Contributed by Deepak Kaushal and Clayton W. Naeve
St. Jude Children's Research Hospital
Memphis, Tennessee
