
Loading and Preparing Data for Analysis in Spotfire

Microarray data exist in a variety of formats, which often depend on the particular array technology and detection instruments used. These data can easily be loaded into Spotfire DecisionSite (UNIT 7.7) by a number of methods, including copying/pasting from a spreadsheet, direct loading of text or comma-separated (.csv) files, or direct loading of Microsoft Excel files. Data can also be loaded via preconfigured or ad hoc queries of relational databases, and from proprietary databases and export file formats from microarray manufacturers such as Affymetrix (see Alternate Protocol 1) and Agilent, or scanner platforms such as GenePix (see Basic Protocol 1).

Once the data are loaded, they must be filtered and preprocessed prior to analysis (see Support Protocol 1). Data transformation and normalization are then critical for sound microarray data mining. These steps extract or enhance meaningful data characteristics and prepare the data for analysis methods such as statistical significance tests and clustering (UNIT 7.9), most of which require the data to be normally distributed. A typical transformation is calculating the logarithm of raw signal values (see Support Protocol 2). Normalization is a type of transformation that accounts for the systematic biases that abound in microarray data. One may wish to normalize the data within an experiment (see Basic Protocol 2) or between multiple experiments (see Basic Protocol 3). During these processes it may be useful to combine data from multiple rows (see Basic Protocol 4).

NOTE: UNIT 7.7 provides a general introduction to the Spotfire program and environment. This unit focuses strictly on data preparation within Spotfire. Readers unfamiliar with Spotfire are encouraged to read UNIT 7.7.

UNIT 7.8

UPLOADING GenePix DATA INTO SPOTFIRE

Spotfire allows the user to upload multiple spotted microarray data files in GenePix format (.gpr files) using a script that can retrieve the files from a database or from a network drive. While the original script was set up to retrieve version 3.0 .gpr files, modifications can be made to it to allow it to recognize and import data from newer versions of GenePix data files such as 4.0, 4.1, or 5.0. The script reads a .gpr file and ignores the header part based on the information provided in the .gpr file header about the number of rows and columns in the data file. It then allows the user to pick and choose the relevant columns of data from a .gpr file to upload to Spotfire.
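The header-skipping behavior described above can be sketched outside Spotfire. The following Python example is a minimal illustration, not the Spotfire script itself. It assumes the usual tab-delimited layout of a .gpr file, in which the second line gives the number of header records and the number of data columns; the function name read_gpr, its wanted_columns parameter, and the sample text are hypothetical and were made up for this example.

```python
import csv
import io


def read_gpr(text, wanted_columns=None):
    """Parse .gpr-style text and return (column_titles, data_rows).

    Assumed layout: line 1 is a format signature, line 2 holds the number
    of header records and the number of data columns, followed by the
    header records, one row of column titles, and then the data rows.
    """
    lines = text.splitlines()
    n_header, n_columns = (int(x) for x in lines[1].split("\t")[:2])
    # Skip the signature line, the counts line, and the header records.
    table = lines[2 + n_header:]
    reader = csv.reader(io.StringIO("\n".join(table)), delimiter="\t")
    titles = next(reader)
    if len(titles) != n_columns:
        raise ValueError("column count does not match the file header")
    # Keep only the columns the user asked for (all columns by default).
    keep = [i for i, title in enumerate(titles)
            if wanted_columns is None or title in wanted_columns]
    rows = [[row[i] for i in keep] for row in reader if row]
    return [titles[i] for i in keep], rows


# Invented sample text; real files carry many more header records and columns.
sample = "\n".join([
    "ATF\t1.0",
    "2\t4",
    '"Type=GenePix Results 3"',
    '"Wavelengths=635 532"',
    '"Name"\t"F635 Median"\t"B635 Median"\t"Flags"',
    '"geneA"\t1200\t100\t0',
    '"geneB"\t90\t95\t-50',
])
names, rows = read_gpr(sample, wanted_columns={"Name", "F635 Median", "Flags"})
```

Values are returned as text, so numeric columns would still need to be converted before any calculation.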

BASIC PROTOCOL 1

Necessary Resources

Hardware

The recommended minimal hardware requirements are modest. The software will run on an Intel Pentium or equivalent with a 100 MHz processor, 64 MB RAM, and 20 MB disk space; a VGA or better display with 800 × 600 pixel resolution is needed. However, most microarray experiments yield large output files and most experimental designs require several data files to be analyzed simultaneously, so the user will benefit from both much more RAM and a significantly faster processor.

Analyzing Expression Analysis Contributed by Deepak Kaushal and Clayton W. Naeve

Current Protocols in Bioinformatics (2004) 7.8.1-7.8.25 Copyright © 2004 by John Wiley & Sons, Inc.


Supplement 6

Software

Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000

A standard install of Microsoft Internet Explorer; v. 5.0 through 6.0 may be used

MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3) through 2.5 (2.50.4403.12) may be used

A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local customer-specific Spotfire Server. A Web connection is also required to take advantage of Web Links for the purpose of querying databases and Web sites on the Internet using columns of data residing in Spotfire

Microsoft PowerPoint, Word, and Excel are required to take advantage of a number of features available within Spotfire related to export of text results or visualizations (UNIT 7.9)

Spotfire (6.2 or above) is required (see UNIT 7.7)

Files

Spotfire (Functional Genomics module) can import data in nearly any format, but the authors focus here on the two-color spotted microarray data produced using GenePix software (Axon, Inc.). Several types of spotted arrays, scanners, scanning software packages, and their corresponding data types exist, including those from commercial vendors (Agilent, Motorola, and Mergen) that supply spotted microarrays for various organisms, as well as those from facilities that manufacture their own chips. GenePix data files are in a tab-delimited text format (.gpr), which can be directly imported into a Spotfire session.

1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .gpr files from either a network drive or a database.

Depending on the type of setup, it may be necessary to log in to the Spotfire application as well as the data source. Systems and database administrators may be able to provide more information. In this example, a GenePix version 3.0 data file is used.

2. In the Tools pane on the left-hand side of the screen, click on Access, then on Import GenePix files (Fig. 7.8.1). The Import GenePix Files dialog appears (Fig. 7.8.2A).


Figure 7.8.1 Tools pane with the Import GenePix Files tab highlighted.


Figure 7.8.2 (A) The Import GenePix Files dialog allows users to specify files to be uploaded into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from the data set.


3. Click Add. Point to the directory where the files to be analyzed are located, and double-click on the desired file. It is possible to load either a single file or multiple files with the help of the Shift key. The user may upload as many as seven files at one time. Uploading more than seven files will require repeating the process. The filename will appear in the center of the dialog box.

4. Specify the file(s) and click on the Columns button (Fig. 7.8.2A) to specify the data columns (Fig. 7.8.2B) to upload. One can choose to upload the entire file (requiring longer upload times).

The 43 columns listed in Figure 7.8.2B are generated by the GenePix software and are related to the position (Block, Column, Row, X, and Y), identification (Name, ID), and morphology (Diameter) of the spot and its intensity in either the Cy5 or Cy3 channel (all other columns). B represents Background and F represents Fluorescence. 635 and 532 represent the two wavelengths used during scanning (532 for Cy3 and 635 for Cy5). Suggested columns to upload include F635 Median, B635 Median, F532 Median, B532 Median, Ratio of Medians, F635 Median-B635, F532 Median-B532, Flags, Norm Ratio of Medians, and Norm Flags.
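The relationship between the foreground (F) and background (B) columns can be made concrete with a short calculation. The Python sketch below assumes that a background-corrected signal is the foreground median minus the background median for that channel, and that the ratio is the corrected Cy5 (635) value divided by the corrected Cy3 (532) value. The function names and the sample spot are hypothetical, and returning None when the Cy3 signal is not positive is a choice made for this illustration only.

```python
def background_corrected(foreground_median, background_median):
    """Return foreground minus local background for one channel."""
    return foreground_median - background_median


def ratio_of_medians(f635, b635, f532, b532):
    """Cy5/Cy3 ratio of background-corrected medians, or None if undefined."""
    cy5 = background_corrected(f635, b635)
    cy3 = background_corrected(f532, b532)
    if cy3 <= 0:
        # Without a positive Cy3 signal the ratio is not meaningful.
        return None
    return cy5 / cy3


# One invented spot, keyed by the column names suggested in the text.
spot = {"F635 Median": 1300, "B635 Median": 100,
        "F532 Median": 700, "B532 Median": 100}
ratio = ratio_of_medians(spot["F635 Median"], spot["B635 Median"],
                         spot["F532 Median"], spot["B532 Median"])
```

Here the corrected signals are 1200 (Cy5) and 600 (Cy3), so the ratio is 2.0.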

5. Check all columns to import, then click OK. The Import GenePix files window will appear again. Click OK again. Data will begin loading into Spotfire. This could take several minutes depending on the size and number of the data columns being uploaded and RAM/processor speeds.

At the end of the data-upload process, Spotfire will automatically display an initial visualization where each record is represented by a marker, along with a number of query devices for manipulating the visualization. Alternative visualizations (UNIT 7.7) can be opened by clicking on appropriate visualization toolbars, choosing Visualization from the File menu, or using the shortcuts Ctrl-1 through Ctrl-9 on the keyboard for various visualizations.

6. Filter and preprocess the data as described in Support Protocols 1 and 2.

ALTERNATE PROTOCOL 1

UPLOADING AFFYMETRIX TEXT DATA INTO SPOTFIRE

Support for standard microarray platforms, such as Affymetrix, is integrated within DecisionSite for Functional Genomics. Spotfire allows the user to upload multiple Affymetrix data files in the metric text format (.met files) using a script that can retrieve these files from a database or from a network drive. A guide is available to upload data from both MAS 4.0 and MAS 5.0 versions. The MAS 5.0 guide also works with the latest Affymetrix software, GCOS 1.1. The script reads a .met file while largely ignoring the information provided in the header. It then allows the user to pivot the relevant columns of data from the .met file(s) to upload.

Necessary Resources

Hardware

The recommended minimal hardware requirements are modest. The software will run on an Intel Pentium or equivalent with a 100 MHz processor, 64 MB RAM, and 20 MB disk space; a VGA or better display with 800 × 600 pixel resolution is needed. However, most microarray experiments yield large output files and most experimental designs require several data files to be analyzed simultaneously, so the user will benefit from both much more RAM and a significantly faster processor.


Software

Windows 98 or higher, Windows NT with service pack 4.0 or higher, Windows Millennium, or Windows 2000

A standard install of Microsoft Internet Explorer; v. 5.0 through 6.0 may be used

MDAC (Microsoft Data Access Components); versions 2.1 sp2 (2.1.2.4202.3) through 2.5 (2.50.4403.12) may be used

A Web connection to the Spotfire server (http://home.spotfire.net; UNIT 7.7) or a local customer-specific Spotfire Server. A Web connection is also required to take advantage of Web Links for the purpose of querying databases and Web sites on the Internet using columns of data residing in Spotfire

Microsoft PowerPoint, Word, and Excel are required to take advantage of a number of features available within Spotfire related to export of text results or visualizations (UNIT 7.9)

Spotfire (6.2 or above) is required (see UNIT 7.7)

Files

Spotfire (Functional Genomics module) can import data in nearly any format, but the authors focus here on the commercial GeneChip microarray data (Affymetrix, Inc.). Spotfire facilitates the seamless import of Affymetrix output files (.met) from Affymetrix MAS v. 4.0 or v. 5.0 software. The .met file is a tab-delimited text file containing information about attributes such as probe set level, gene expression levels (signal), and detection quality controls (p value and Absence/Presence calls). In the illustration below, MAS 5.0 .met files will be used as an example.

1. Run Spotfire (UNIT 7.7) and ensure that access is available to the .met files from either a network drive or a database.

Depending on the type of setup, it may be necessary to log in to the Spotfire application as well as the data source. Systems and database administrators may be able to provide more information.

2. In the Tools pane on the left-hand side, a plus sign (+) in front of the script Access indicates that it can be expanded to explore other items under this directory. Click on Access, then on Import Affymetrix Data. This reveals all options available for importing Affymetrix data (from version 4.0 or 5.0 MAS files on a network drive, or from a local or remote database). Click on Import Affymetrix V5 Files (Fig. 7.8.3).

Figure 7.8.3 Tools pane with the Import Affymetrix v5 Files tab highlighted.


Figure 7.8.4 (A) The Import Affymetrix Files dialog allows users to specify files to be uploaded into a Spotfire session. (B) The Data Import Options allow users to choose all or any columns from the data set.

3. Clicking on Import Affymetrix V5 Files will open a window for the user to specify the files to upload to Spotfire (Fig. 7.8.4A).

4. Click Add. Point to the directory where the files to be analyzed are located, and double-click on the desired file. It is possible to load either a single file or multiple files with the help of the Shift key. The user may upload as many as seven files at one time. Uploading more than seven files will require that the process be repeated. The filename will appear in the center of the dialog box.


5. Specify the file(s) and click on the Columns button (Fig. 7.8.4A) to specify the data columns (Fig. 7.8.4B) to upload. One can choose to upload the entire file (requiring longer upload times).

6. Check all columns to import, then click OK. The Import Affymetrix Files window will appear again. Click OK again. Data will begin loading into Spotfire. This could take several minutes depending on the size and number of the data columns being uploaded and RAM/processor speeds.

At the end of the data-upload process, Spotfire will automatically display an initial visualization where each record is represented by a marker, along with a number of query devices for manipulating the visualization. Alternative visualizations can be opened by clicking on appropriate visualization toolbars, choosing Visualization from the File menu, or using the shortcuts Ctrl-1 through Ctrl-9 on the keyboard for various visualizations.

7. Filter and preprocess the data as described in Support Protocols 1 and 2.

FILTERING AND PREPROCESSING MICROARRAY DATA

Successfully completing microarray experiments includes assessing the quality of the array design, the experimental design, the experimental execution, the data analysis, and the biological interpretation. At each step, data quality and data integrity should be maintained by minimizing both systematic and random measurement errors. Before embarking on the actual analysis of data, it is important to perform filtering and preprocessing, and other kinds of transformations, to remove systemic biases that are present in microarray data. It is not uncommon for users to overlook the importance of such quality-control measures. Typical filtering operations include removing genes with background levels of expression from the data, as these would likely confound later transformations and cause spurious effects during fold-change calculations and significance analysis. This can be readily achieved by filtering on the basis of absence/presence calls and detection p value. Query devices are assigned to every field of data and allow the user to perform filtering with multiple selection criteria, resulting in updates of all visualizations to display the results of this cumulative filtering. Guides can be used to perform such repetitive tasks quickly or to initiate a series of specific steps in the analysis. Throughout analysis, filtering using any data-field query device can be used to subset data and limit the number of genes that are included in further calculations and visualizations. Genes can be filtered on the basis of detection p value, Affymetrix signal, GenePix signal, GenePix signal-to-noise ratio, fold change, standard deviation, and modulation (frequency crossing a threshold). For example, filtering genes on modulation by setting a 0.05 p value threshold will split genes out by the number of times they fall above the 0.05 limit in the selected experiments.
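The modulation example at the end of the paragraph above can be illustrated with a short calculation. The Python sketch below counts, for each gene, how many detection p values exceed a threshold and then groups genes by that count. It is an illustration of the idea rather than the Spotfire guide, and the function names and p values are hypothetical.

```python
def modulation_count(values, threshold):
    """Count how many values in one gene's row exceed the threshold."""
    return sum(1 for v in values if v is not None and v > threshold)


def group_by_modulation(table, threshold):
    """Map each count to the list of genes that have that count."""
    groups = {}
    for gene, values in table.items():
        count = modulation_count(values, threshold)
        groups.setdefault(count, []).append(gene)
    return groups


# Invented detection p values for three genes across four experiments.
p_values = {
    "geneA": [0.001, 0.010, 0.020, 0.004],
    "geneB": [0.300, 0.040, 0.700, 0.060],
    "geneC": [0.500, 0.900, 0.200, 0.080],
}
groups = group_by_modulation(p_values, 0.05)

# Drop the genes whose p value is above 0.05 in all four experiments.
kept = [gene for count, genes in groups.items() if count < 4 for gene in genes]
```

With these values geneA never exceeds the limit, geneB exceeds it three times, and geneC exceeds it in every experiment, so only geneC is removed.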

SUPPORT PROTOCOL 1

To preprocess Affymetrix text data

1a. Initiate a Spotfire session (UNIT 7.7) and upload Affymetrix text (.met) files as described in Alternate Protocol 1.

2a. Pay careful attention to the query devices as a default visualization is loaded. A query device appears for every column of data that is uploaded and can be used to manipulate data visualization. In the Guides pane on the top-left corner, click on the link for Data Analysis, then on "Analyze Affymetrix absence/presence calls" (Fig. 7.8.5).

3a. This script allows one to choose Detection columns containing Absent (A), Marginal (M), and Present (P) calls. Click on all the detection columns to be considered from the display in the Guides pane, then click on Continue.


Figure 7.8.5 Guides pane with the Analyze Affymetrix absence/presence calls guide highlighted.

Figure 7.8.6 The data are binned on the basis of the number of times a particular probe set was called Absent, Present, or Marginal, and a histogram displays the results.

4a. The frequency of absent, present, and marginal occurrences is then calculated across the selected experiments for each gene. It is possible to filter data using three new query devices: Absent Count, Present Count, and Marginal Count. A histogram may be created to view the distribution of Absent, Present, and Marginal counts using the Histogram Guide (Fig. 7.8.6).


This display allows users to quickly identify those genes that are repeatedly called Absent. In the above example, there are eight metric text files (Fig. 7.8.6). The histogram displays all genes based on how many times they were binned into the P category in these eight experiments. The distribution ranges from 0-1, which identifies genes that are always or almost always Absent, to 7-8, which identifies genes that are almost always called Present.

Figure 7.8.7 The data generated from the use of the Affymetrix absence/presence guide are added to the Spotfire session as a new column, and a corresponding new query device is generated.

5a. Using the above histogram it is possible to exclude genes in one or more groups. Similar results can be obtained by sending "Absent call" results to different bins. When the histogram is displayed, associated data are linked to parent data in the Spotfire session and a new query device is created for this column of data (Fig. 7.8.7).

6a. By default, the query device is in the range-slider format. Right-click on the center of the Query device and choose Check Boxes (Fig. 7.8.8).

7a. Uncheck the check box for category 0-1. Notice how the number of visible records on the activity line changes from 6352 to 4441, reflecting the 1911 genes that were filtered out using this method (Fig. 7.8.9).

Records under the histogram 0-1 pertain to those genes that were called Present either 0 or 1 time out of a total of 8 Affymetrix chips in this particular experiment. This indicates that these genes are not reliably detected under these conditions. Filtering out these genes allows further calculations and transformations to be performed on the rest of the data set without any effect from these genes.
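The counting and filtering performed by the absence/presence guide can be sketched in a few lines of Python. This is an illustration only, not the guide's actual implementation. It assumes one detection call (A, M, or P) per chip for each probe set; the function names and the example calls are hypothetical.

```python
from collections import Counter


def call_counts(calls):
    """Count Absent (A), Marginal (M), and Present (P) calls for one probe set."""
    counts = Counter(calls)
    return {"Absent": counts["A"], "Marginal": counts["M"], "Present": counts["P"]}


def filter_rarely_present(table, max_present=1):
    """Keep probe sets called Present more than max_present times."""
    return {probe: calls for probe, calls in table.items()
            if call_counts(calls)["Present"] > max_present}


# Invented detection calls for four probe sets on eight chips.
calls = {
    "probe_1": list("PPPPPPPP"),
    "probe_2": list("AAAAAPAA"),
    "probe_3": list("AMPPAPPA"),
    "probe_4": list("AAAAAAAA"),
}
kept = filter_rarely_present(calls, max_present=1)
```

Here probe_2 and probe_4 are called Present once and never, respectively, so both fall in the 0-1 group and are removed, mirroring the effect of clearing the 0-1 check box.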

8a. Alternatively, data may be filtered based on criteria (detection p value or raw signal) other than Absence/Presence calls. To do so, click on Data Preparation in the Guides pane, followed by Filter Genes. Users can filter genes by "Standard deviation," "Fold change," or "Modulation." To filter genes by "Standard deviation," it is necessary to normalize data based on Z-score calculations (see Basic Protocol 3 and Background Information). Similarly, genes can only be filtered by "Fold change" when the appropriate normalization has been applied to the data (see Basic Protocol 3 and Background Information). Genes can also be filtered by modulation or frequency of crossing a threshold. In a set of 12 .met files, for example, one can query how many times a certain gene has a detection p value greater than 0.05. This calculation can be carried out for every gene in the dataset and groups of genes can be removed based on a particular frequency.


Figure 7.8.8 Query Device for a particular column of data can be modified from one type to another.

Figure 7.8.9 Clearing the check box corresponding to "Binned Present count 0-1" alters the number of visible records (shown on the Activity Line).

9a. Choose Modulation. Next, choose all the p value columns to be considered from the display in the Guides pane. Hit Continue (Fig. 7.8.10).


10a. Select a modulation threshold. If interested in filtering out genes on the basis of a p value cutoff of 0.05, for example, type 0.05. Click on Filter by Modulation (Fig. 7.8.11).


Figure 7.8.10 The Filter Genes guide helps users to perform data preprocessing in a stepwise fashion.

Figure 7.8.11 The Filter Genes by Modulation guide bins data by the number of times a record (gene) crosses the specified threshold in the given experiments.

11a. The frequency of p value occurrences above 0.05 across the selected experiments is displayed for each gene. It is possible to filter data using the new query device or from the histogram or trellis display (Fig. 7.8.12).

Similar filtering may be performed on raw signal data.

To preprocess spotted array (GenePix) data

1b. Initiate a Spotfire session (UNIT 7.7) and upload appropriate columns from GenePix (.gpr) files as described in Basic Protocol 1.

It is useful to retrieve data from the raw signal columns and background-corrected signal columns. In addition, GenePix data contain indicators of data quality in Signal to Noise Ratio columns for every channel and a Flags column for every slide. It is useful to retrieve these data. In the example below, six cDNA microarray experiments (12 channels of signal data) are uploaded to Spotfire.

Figure 7.8.12 A new data column and a new query device are added to the Spotfire session, based on the Filter Genes>Modulation>p-value selection.

2b. In the Guides pane on the top-left corner, click on the link for Data Preparation and then on Filter Genes (Fig. 7.8.13).

3b. Click on Modulation. Filtering can be performed to remove bad data from GenePix files using data contained in the Flags columns and/or the Signal to Noise Ratio column. Choose Flags columns for any number of arrays to be mined, then click Continue (Fig. 7.8.14).

GenePix software provides the ability to flag individual features with quality indicators such as Good, Bad, Absent, or Not Found. In the text data file, these indicators are converted to numeric data. Features with a Bad flag are designated -100, Good features are flagged as +100, Absent features are flagged as -75, and Not Found features as -50. All other genes are designated as 0 in the .gpr file. By modulating data on the Flags column at a setting of 0, it is possible to identify those genes that are consistently good or bad.
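The numeric flag codes listed above lend themselves to a simple count per gene. The Python sketch below assumes that modulating on the Flags columns at a setting of 0 amounts to counting the arrays in which a feature's flag is below 0 (Bad, Absent, or Not Found); that reading of the threshold direction is an assumption. The names FLAG_CODES and times_flagged and the example data are hypothetical.

```python
# Numeric flag codes as described in the text.
FLAG_CODES = {"Good": 100, "Unflagged": 0, "Not Found": -50,
              "Absent": -75, "Bad": -100}


def times_flagged(flags, threshold=0):
    """Count the arrays in which a feature's flag falls below the threshold."""
    return sum(1 for flag in flags if flag < threshold)


# Invented Flags values for three genes across six arrays.
flags = {
    "geneA": [0, 0, 100, 0, 0, 0],
    "geneB": [-100, -50, -75, -100, -50, -50],
    "geneC": [0, -50, 0, 0, -100, 0],
}
counts = {gene: times_flagged(values) for gene, values in flags.items()}

# Remove genes that were flagged on all six arrays.
kept = [gene for gene, count in counts.items() if count < 6]
```

A gene with a count of 0 was never flagged Bad, Absent, or Not Found, whereas a count of 6 marks a gene that was flagged on every array.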

4b. The frequency of various flagged occurrences is then calculated across the selected experiments for each gene. It is possible to filter the data using query devices for the newly generated columns. A histogram can be created using the Histogram Guide to better view the distribution. When the histogram is created, associated data are linked to parent data in the Spotfire session and a new query device is created for this column of data (Fig. 7.8.15).


This display allows users to quickly identify those genes that are repeatedly called Absent. In the above example, there are six GenePix files. The histogram displays all genes based on how many times they were binned into the Flag category from six columns of data. The distribution ranges from 0, which identifies genes that are never flagged Bad or Not Found or Absent (hence the good genes), to 6, which identifies genes that are most frequently flagged and need to be filtered out of the data set.

Figure 7.8.13 Clicking on the Filter Genes Guide allows users to perform preprocessing on GenePix data.

Figure 7.8.14 Preprocessing can be performed on GenePix data using the Flags or the SNR columns.

5b. Using the above histogram it is possible to exclude genes in one or more groups. By default, the query device is in the range-slider format. Right-click on the center of the Query device and choose Check Boxes. By filtering the "flagged 6 times" group, 2937 genes are filtered out (Fig. 7.8.16).

6b. Users may also filter GenePix data based on criteria other than Flag, such as Signal to Noise Ratio (SNR), raw signal, or Background pixel saturation levels. Click on Data Preparation in the Guides pane, followed by Filter Genes. Users can filter genes by "Standard deviation," "Fold change," or "Modulation." In order to filter genes by "Standard deviation," it is necessary to normalize data based on Z-score calculations (see Basic Protocol 3 and Background Information). Similarly, genes can only be filtered by "Fold change" when the appropriate normalization has been applied to the data (see Basic Protocol 3 and Background Information). Genes can be filtered by modulation or frequency of crossing a threshold. In a set of 12 GenePix files, for example, one can ask how many times a certain gene has an SNR value greater than 1.5. This calculation can be carried out for every gene in the dataset and groups of genes can be removed based on a particular frequency.

Figure 7.8.15 A new data column and a new query device are added to the Spotfire session, based on the Filter Genes>Modulation>Flags selection.

Figure 7.8.16 Clearing the check box corresponding to Modulation by Flags column (category 6) alters the number of visible records (shown on the Activity Line).

LOG TRANSFORMATION OF MICROARRAY DATA

The logarithmic (henceforth referred to as log) function has been used to preprocess microarray data from the very beginning (Yang et al., 2002). The range for raw intensity values in microarray experiments spans a very large interval from zero to tens of thousands. However, only a small fraction of genes have values that high. This generates a long tail in the distribution curve, making it asymmetrical and non-normal. Log transformation provides values that are easily interpretable and more meaningful from a biological standpoint. The log transformation accomplishes the goal of defining directionality and fold change, whereas raw signal numbers only demonstrate relative expression levels. The log transformation also makes the distribution of values symmetrical and almost normal by removing the skew caused by the long tail of high-intensity values.

SUPPORT PROTOCOL 2

1. Open an instance of Spotfire (UNIT 7.7). Upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.

2. In the Guides pane of the DecisionSite Navigator (see UNIT 7.7), click on Data preparation>Transform columns to log scale. A new window is opened within the Guides pane (Fig. 7.8.17).

3. Select the columns on which to perform log transformation. These would typically be the signal columns in Affymetrix data and Cy-3 and Cy-5 signal data in the case of two-color arrays. Hold down the Ctrl key in order to select multiple columns. In order to select all the columns displayed in the guide, select the first column, hold down the Shift key and then select the last column (Fig. 7.8.17).

Figure 7.8.17 The "Transform columns to log scale" guide allows the user to convert any numeric data column to its logarithm counterpart, allowing the user to choose log to base 2 or 10.

4. Click Continue. The user is now presented with the option of transforming log to the base 10 or 2.

5. Click on "log10" or "log2." Most microarray users have a preference for log2. New data columns are generated and added to the data set. Query Devices for these newly generated columns are also added and can be used to manipulate visualizations. Log transformed values for input values less than or equal to zero are not calculated and are left empty.

6. Load the Guides pane again by clicking on Back to Contents.
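The transformation in step 5 can be reproduced on a small scale to see how it behaves. The Python sketch below is an illustration, not Spotfire's implementation. It follows the rule stated above that inputs less than or equal to zero are left empty, represented here by None; the function name log_transform and the sample values are hypothetical.

```python
import math


def log_transform(values, base=2):
    """Return log-transformed values; non-positive or empty inputs become None."""
    if base == 2:
        log = math.log2
    elif base == 10:
        log = math.log10
    else:
        raise ValueError("base must be 2 or 10")
    return [log(v) if v is not None and v > 0 else None for v in values]


# Invented raw signals, including values that cannot be transformed.
signals = [1024, 512, 0, -15, None]
log_signals = log_transform(signals, base=2)

# Ratios for a twofold increase and a twofold decrease.
log_ratios = log_transform([2.0, 0.5], base=2)
```

The last two lines show why log2 is popular for ratios: a twofold increase becomes 1.0 and a twofold decrease becomes -1.0, so direction and magnitude are symmetrical around zero.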

BASIC PROTOCOL 2

NORMALIZATION OF MICROARRAY DATA WITHIN AN EXPERIMENT

Experimental comparisons of expression are only valid if the data are corrected for systematic biases arising from factors such as the technology used, the protocol used, and the investigator. Since these biases are regularly detected in raw microarray data, it is imperative that some sort of normalization procedure be used to address this issue (Smyth and Speed, 2003). At this time, however, there is no consensus way to perform normalization. Several methods are available in the normalization module of Spotfire. These can broadly be divided into two categories: those that make experiments comparable (i.e., within experiments) and those that make the genes comparable (i.e., between experiments, see Basic Protocol 3). "Normalize by mean," "Normalize by trimmed mean," "Normalize by percentile," "Scale between 0 and 1," and "Subtract the mean or median" are all examples of the former category, which is particularly relevant for spotted arrays but rarely needed for Affymetrix chips (see Background Information).

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1) and prefilter (see Support Protocol 1) microarray data.

To normalize by mean (also see Background Information)

2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens (Fig. 7.8.18).

3a. Choose the "Normalize by mean" radio button and then click the Next> button. The normalization dialog box 2(2) opens (Fig. 7.8.19).

4a. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.

5a. Click a radio button to select whether to work with "All records" or "Selected records."

6a. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing values to the interpolated value between the two neighboring values in the row.


7a. Set one of the columns to be used for normalization as a baseline by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.


Figure 7.8.18 The Normalization dialog 1(2) allows the users to choose from several Normalization options.

Figure 7.8.19 The Normalization dialog 2(2) allows the users to choose Value column on which to perform Normalization and other variables.


8a. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

9a. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.

10a. Click Finish. Normalized columns are computed and added to the data set.
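The choices made in steps 6a and 7a can be illustrated with a small calculation. The Python sketch below is not Spotfire's implementation and rests on two assumptions: that "Row interpolation" can be approximated by the midpoint of the nearest non-empty values on either side, and that normalizing by mean divides each value by its column mean, then multiplies by the baseline column's mean when a baseline is chosen. The function names fill_row and normalize_by_mean and the data are hypothetical.

```python
def fill_row(row, method="row_average", constant=0.0):
    """Replace empty (None) entries in one gene's row of values."""
    present = [v for v in row if v is not None]
    filled = list(row)
    for i, value in enumerate(row):
        if value is not None:
            continue
        if method == "constant" or not present:
            filled[i] = constant
        elif method == "row_average":
            filled[i] = sum(present) / len(present)
        elif method == "row_interpolation":
            # Nearest non-empty neighbors to the left and right, if any.
            left = next((row[j] for j in range(i - 1, -1, -1)
                         if row[j] is not None), None)
            right = next((row[j] for j in range(i + 1, len(row))
                          if row[j] is not None), None)
            neighbors = [x for x in (left, right) if x is not None]
            filled[i] = sum(neighbors) / len(neighbors)
        else:
            raise ValueError("unknown method: " + method)
    return filled


def normalize_by_mean(columns, baseline=None):
    """Divide each column by its mean; rescale to the baseline mean if given."""
    means = {name: sum(vals) / len(vals) for name, vals in columns.items()}
    scale = means[baseline] if baseline is not None else 1.0
    return {name: [v * scale / means[name] for v in vals]
            for name, vals in columns.items()}


# Invented data: two genes (rows) measured on three chips (columns).
rows = [[100.0, None, 300.0], [50.0, 100.0, 150.0]]
filled = [fill_row(r, method="row_interpolation") for r in rows]
names = ["chip1", "chip2", "chip3"]
columns = {n: list(col) for n, col in zip(names, zip(*filled))}
normalized = normalize_by_mean(columns, baseline="chip1")
```

After rescaling to chip1, all three columns share the same mean, so the remaining differences between chips reflect individual genes rather than overall intensity.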

To normalize by trimmed mean (also see Background Information)

2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.

3b. Choose the "Normalize by trimmed mean" radio button and then click the Next> button. The normalization dialog box 2(2) opens.

4b. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.

5b. Click a radio button to select whether to work with "All records" or "Selected records."

6b. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing values to the interpolated value between the two neighboring values in the row.

7b. Set one of the columns to be used for normalization as a baseline by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.

8b. Enter a "Trim value." If a trim value of 10% is entered, the highest and the lowest 5% of the values are excluded when calculating the mean.

9b. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.

10b. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.

11b. Click Finish. Normalized columns are computed and added to the data set.
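The effect of the trim value in step 8b can be seen in a short example. The Python sketch below follows the rule stated above: with a trim value of 10%, the lowest 5% and the highest 5% of values are excluded before the mean is taken. How Spotfire rounds the number of excluded values is not stated, so rounding down is an assumption here, and the function name and data are hypothetical.

```python
def trimmed_mean(values, trim_percent=10.0):
    """Mean after excluding trim_percent/2 percent of values at each end."""
    ordered = sorted(values)
    # Number of values to drop at each end (rounded down; an assumption).
    cut = int(len(ordered) * trim_percent / 200)
    kept = ordered[cut:len(ordered) - cut]
    if not kept:
        raise ValueError("trim value leaves no data")
    return sum(kept) / len(kept)


# Twenty invented signals: eighteen ordinary values and two extreme ones.
signals = list(range(1, 19)) + [0, 1000]
plain_mean = sum(signals) / len(signals)
robust_mean = trimmed_mean(signals, trim_percent=10.0)

# Normalizing by the trimmed mean divides every value by it.
normalized = [v / robust_mean for v in signals]
```

The single value of 1000 pulls the ordinary mean up to 58.55, whereas the trimmed mean stays at 9.5, which is why trimming is useful when a few saturated spots are present.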

To normalize by percentile (also see Background Information)

2c. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3c. Choose "Normalize by percentile value" and then click the Next> button. The normalization dialog box 2(2) opens.
4c. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5c. Click a radio button to select whether to work with "All records" or "Selected records."
6c. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing value to the interpolated value between the two neighboring values in the row.
7c. Select one of the columns to be used for normalization as a baseline by selecting from the "Baseline for rescaling" drop-down list. The control channel in a two-color experiment or the control GeneChip in an Affymetrix experiment are obvious examples. Select None if no baseline is needed.
8c. Enter a Percentile. For example, "85-percentile" is the value that 85% of all values in the data set are less than or equal to.
9c. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
10c. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.
11c. Click Finish. Normalized columns are computed and added to the data set.
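The percentile definition in step 8c can be made concrete with the sketch below. It uses the nearest-rank definition of a percentile, which matches the wording "the value that X% of all values are less than or equal to"; whether Spotfire interpolates between ranks is not stated here, so this choice, like the function names, is an assumption.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def normalize_by_percentile(values, p, target):
    """Rescale values so that their p-percentile equals target."""
    factor = target / percentile(values, p)
    return [v * factor for v in values]


# Hypothetical sample column and baseline column (baseline is twice as bright).
sample = [float(v) for v in range(10, 101, 10)]      # 10, 20, ..., 100
baseline = [float(v) for v in range(20, 201, 20)]    # 20, 40, ..., 200
rescaled = normalize_by_percentile(sample, 85, percentile(baseline, 85))
```

Here the 85-percentile of the sample is rescaled to the 85-percentile of the baseline column, which corresponds to choosing that column under "Baseline for rescaling."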

To scale between 0 and 1 (also see Background Information)

2d. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3d. Choose "Scale between 0 and 1" and then click the Next> button. The normalization dialog box 2(2) opens.
4d. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5d. Click a radio button to select whether to work with "All records" or "Selected records."
6d. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing value to the interpolated value between the two neighboring values in the row.
7d. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
8d. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.
9d. Click Finish. Normalized columns are computed and added to the data set.
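Scaling between 0 and 1 (Min-Max normalization) is simple enough to verify by hand. The sketch below applies it to one gene (one row) across experiments; the function name and the handling of a row whose values are all equal are assumptions, since the unit does not say what Spotfire does in that case.

```python
def scale_0_1(row):
    """Min-Max scale one gene's values across experiments to [0, 1]."""
    lo, hi = min(row), max(row)
    if hi == lo:
        # Constant row: no variation to scale (returning zeros is an assumption).
        return [0.0 for _ in row]
    return [(v - lo) / (hi - lo) for v in row]


# Hypothetical expression values of one gene in three experiments.
scaled = scale_0_1([2.0, 4.0, 6.0])
```

The smallest value of the gene becomes 0, the largest becomes 1, and the remaining values fall proportionally in between.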


To subtract the mean (also see Background Information)

2e. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3e. Choose "Subtract the mean" and then click the Next> button. The normalization dialog box 2(2) opens.
4e. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5e. Click a radio button to select whether to work with "All records" or "Selected records."
6e. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing value to the interpolated value between the two neighboring values in the row.
7e. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
8e. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.
9e. Click Finish. Normalized columns are computed and added to the data set.
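The "Replace empty values with" choice recurs in every procedure of this protocol, so it may help to see the three options side by side. The following Python sketch is an illustration only: empty values are represented by None, the function name fill_empty is hypothetical, and "Row interpolation" is taken to mean the midpoint of the nearest non-empty neighbors on each side, falling back to the single available neighbor at the ends of a row. Spotfire's exact behavior in these edge cases is an assumption.

```python
from statistics import mean


def fill_empty(row, method="constant", constant=0.0):
    """Return a copy of row with None entries replaced.

    method -- "constant", "row_average", or "row_interpolation"
    """
    present = [v for v in row if v is not None]
    filled = []
    for i, v in enumerate(row):
        if v is not None:
            filled.append(v)
        elif method == "constant":
            filled.append(constant)
        elif method == "row_average":
            filled.append(mean(present))
        elif method == "row_interpolation":
            # Nearest non-empty value to the left and to the right, if any.
            left = next((row[j] for j in range(i - 1, -1, -1)
                         if row[j] is not None), None)
            right = next((row[j] for j in range(i + 1, len(row))
                          if row[j] is not None), None)
            neighbors = [n for n in (left, right) if n is not None]
            filled.append(mean(neighbors))
        else:
            raise ValueError("unknown method: " + method)
    return filled


# Hypothetical row with one empty value.
row = [1.0, None, 3.0, 8.0]
by_constant = fill_empty(row, "constant", constant=0.0)
by_average = fill_empty(row, "row_average")
by_interpolation = fill_empty(row, "row_interpolation")
```

The three results differ noticeably for the same row, which is worth keeping in mind when the empty-value setting is chosen before normalization.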

To subtract the median (also see Background Information)

2f. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3f. Choose "Subtract the median" and then click the Next> button. The normalization dialog box 2(2) opens.
4f. Select the "Value columns" on which to perform the operation. For multiple selections, hold down the Ctrl key and click on the desired columns.
5f. Click a radio button to select whether to work with "All records" or "Selected records."
6f. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing values to the interpolated value between the two neighboring values in the row.
7f. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
8f. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.
9f. Click Finish. Normalized columns are computed and added to the data set.
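Mean centering and median centering differ only in the statistic that is subtracted from each gene's values. A minimal Python sketch, with a hypothetical function name and the usual assumption that the input is already log-transformed:

```python
from statistics import mean, median


def center(row, how="mean"):
    """Subtract the row mean or row median from every value in the row."""
    if how == "mean":
        c = mean(row)
    elif how == "median":
        c = median(row)
    else:
        raise ValueError("how must be 'mean' or 'median'")
    return [v - c for v in row]


# Hypothetical log-scale values for one gene; 9.0 is an outlier.
row = [1.0, 2.0, 9.0]
mean_centered = center(row, "mean")      # center is 4.0
median_centered = center(row, "median")  # center is 2.0
```

In this example the outlier shifts the mean to 4.0 while the median stays at 2.0, which illustrates why median centering is described as the more robust of the two.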


BASIC PROTOCOL 3

NORMALIZATION OF MICROARRAY DATA BETWEEN EXPERIMENTS

Experimental comparisons of expression are valid only if the data are corrected for systemic biases arising from factors such as the technology used, the protocol used, and the investigator. Since these biases are regularly detected in raw microarray data, it is imperative that some sort of normalization procedure be used to address this issue (Smyth and Speed, 2003). At this time, however, there is no consensus way to perform normalization. Fold change as signed ratio, fold change as log ratio, fold change as log ratio in standard deviation units, and Z-score calculation are all examples of between-experiments normalization that are equally applicable to both spotted and Affymetrix array platforms.

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.

To normalize by calculating fold change (as signed ratio, log ratio, or log ratio in standard deviation units; also see Background Information)

2a. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3a. Select a radio button for "Fold change as signed ratio," "Fold change as log ratio," or "Fold change as log ratio in Standard Deviation units." Click Next. The Normalization dialog box 2(2) opens.
4a. Select the "Value columns" on which to perform the operation.
5a. Click a radio button to select whether to work with "All records" or "Selected records."
6a. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing values to the interpolated value between the two neighboring values in the row.
7a. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
8a. Click a radio button to specify whether to calculate mean from "All genes" or "Genes from Portfolio." If Genes from Portfolio is selected, a portfolio dialog box will open and the user can specify a number of records or list(s) from which to calculate mean. Click OK.
9a. Click Finish. Normalized columns are computed and added to the data set.
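The fold-change formulas given in Background Information can be sketched in a few lines of Python. The function names are hypothetical, base 2 is assumed for the logarithm (the unit does not state the base), and the sample standard deviation is assumed for Std(x). The sketch follows the formulas as written in this unit and does not attempt to reproduce any sign convention Spotfire may apply for ratios below 1.

```python
import math
from statistics import stdev


def fold_change(row, baseline_index=0):
    """Ratio of each value to the baseline value of the same gene: e_i / a_i."""
    a = row[baseline_index]
    return [e / a for e in row]


def fold_change_log(row, baseline_index=0):
    """Log ratio to the baseline: log2(e_i / a_i) (base 2 is an assumption)."""
    return [math.log2(r) for r in fold_change(row, baseline_index)]


def fold_change_log_sd(row, baseline_index=0):
    """Log ratios divided by the standard deviation of the row's log ratios."""
    ratios = fold_change_log(row, baseline_index)
    sd = stdev(ratios)  # sample standard deviation (assumption)
    return [r / sd for r in ratios]


# Hypothetical gene measured under variables A (baseline), B, and C.
row = [100.0, 200.0, 400.0]
ratios = fold_change(row)
log_ratios = fold_change_log(row)
sd_units = fold_change_log_sd(row)
```

For this row the plain ratios are 1, 2, and 4, and the base-2 log ratios are 0, 1, and 2; because the standard deviation of those log ratios is 1, the values in standard deviation units are the same.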

For Z-score calculation (also see Background Information)

2b. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data preparation>Normalization. The normalization dialog box 1(2) opens.
3b. Click Z-score Normalization and then click the Next> button. The Normalization dialog box 2(2) opens.
4b. Select the "Value columns" on which to perform the operation.
5b. Click a radio button to select whether to work with "All records" or "Selected records."
6b. Select a method from the "Replace empty values with" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing value to the interpolated value between the two neighboring values in the row.
7b. Check the "Overwrite existing columns" check box if it is desirable to overwrite the previous column generated by this method. If this check box is deselected, the previous column is retained.
8b. Select the "Add mean column" check box if it is desirable to add a column with the mean of each gene.
9b. Select the "Add standard deviation" check box if it is desirable to add a column with the standard deviation of each gene.
10b. Select the "Add coefficient of variation" check box if it is desirable to add a column with the coefficient of variation of each gene.
11b. Click a radio button to select whether to calculate the Z-scores from "All genes" or "Genes from Portfolio." Selecting the latter option opens a portfolio dialog box where one can choose a number of records or lists from which to calculate Z-score. Choose a list and go back to the Normalization dialog.
12b. Click Finish. Columns containing normalized data are added to the data set.
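A Z-score for one gene, together with the optional mean, standard deviation, and coefficient of variation columns from steps 8b to 10b, can be sketched as follows. The function name and the returned dictionary keys are hypothetical, the mean is used as the center, and the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_score_row(row):
    """Z-scores of one gene across experiments, plus summary statistics."""
    m = mean(row)
    s = stdev(row)  # sample standard deviation (assumption)
    return {
        "z": [(v - m) / s for v in row],
        "mean": m,
        "sd": s,
        "cv": s / m,  # coefficient of variation
    }


# Hypothetical log-scale values for one gene in three experiments.
result = z_score_row([1.0, 2.0, 3.0])
```

Each value is expressed as the number of standard deviations by which it differs from the gene's mean, so genes with very different absolute intensities become directly comparable.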

BASIC PROTOCOL 4

ROW SUMMARIZATION

The row summarization tool allows users to combine values from multiple columns (experiments) into a single column. Measures such as averages, standard deviations, and coefficients of variation of groups of columns can be calculated. Since microarray experiments are typically performed in multiple replicates, this tool serves to summarize those experiments and determine the extent of variability.

1. Open an instance of Spotfire (UNIT 7.7) and upload (see Basic Protocol 1 or Alternate Protocol 1) and prefilter (see Support Protocol 1) microarray data.
2. In the Tools pane of the DecisionSite Navigator, click on Analysis>Data Preparation>Row summarization (Fig. 7.8.20). The "Row summarization" dialog box (Fig. 7.8.21) is displayed.
3. Create the appropriate number of groups using the New Groups tool. Move the desired value columns to suitable groups in the "Grouped value columns" list. To determine the average per row of n columns, create a new group in the "Grouped value columns" list, and then select it. Click to select all of the n columns in the value columns list and then click the Add button. In this manner, several groups can be summarized simultaneously. At least two value columns must be present in any "Grouped value columns" for this tool to work. Clicking on "Delete group" deletes the selected group and its contents (value columns) are transferred to the bottom of the "Value columns" list (Fig. 7.8.21).
4. Select a group and click on Rename Group to edit the group name. This is important because the default column names are names of the original columns followed by the chosen comparison measure in parentheses. When dealing with a number of experiments, this sort of nomenclature can be problematic. Therefore, it is advisable to choose meaningful group names at this stage.
5. Click a radio button to select whether to work with "All records" or "Selected records."


Figure 7.8.20 The Row Summarization Tool is displayed.

Figure 7.8.21 The Row Summarization dialog allows users to choose the value columns on which to perform the summarization, as well as other variables such as which measure (e.g., Average, Standard Deviation) to use.


6. Select a method from the "Replace empty values" drop-down list. "Constant" allows the user to replace empty values with a constant value; "Row average" replaces empty values by the average for the entire row; and "Row interpolation" sets the missing values to the interpolated value between the two neighboring values in the row.
7. Select a "Summarization measure" (e.g., average, standard deviation, variance, min, max, median) from the list box and click on OK.
8. Results are added to the dataset and new query devices created.
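The grouping logic of this protocol can be mirrored in a short Python sketch: each group of replicate columns is reduced to one summary value per row. The function name summarize_rows, the table layout (a list of dictionaries), and the column and group names are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def summarize_rows(table, groups, measure=mean):
    """Reduce each group of columns to one value per row.

    table   -- list of rows, each a dict mapping column name to value
    groups  -- dict mapping a group name to the list of its column names
    measure -- function applied to the values of one group in one row
    """
    return [
        {group: measure([row[col] for col in cols])
         for group, cols in groups.items()}
        for row in table
    ]


# Hypothetical data: two control replicates and two treated replicates.
table = [
    {"ctl_1": 1.0, "ctl_2": 3.0, "trt_1": 10.0, "trt_2": 20.0},
    {"ctl_1": 5.0, "ctl_2": 5.0, "trt_1": 4.0, "trt_2": 8.0},
]
groups = {"control": ["ctl_1", "ctl_2"], "treated": ["trt_1", "trt_2"]}

averages = summarize_rows(table, groups, measure=mean)
spreads = summarize_rows(table, groups, measure=stdev)
```

Passing a different function as the measure corresponds to picking a different "Summarization measure" in step 7; because stdev needs at least two values, the sketch shares the requirement that each group contain at least two value columns.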

COMMENTARY

Background Information

Normalization methods

Normalize by mean. The mean intensity of one variable (in two-color arrays) is adjusted so that it is equal to the mean intensity of the control variable (log R - log G = 0, where R and G are the sums of intensities of each variable). This can be achieved in two ways: rescaling the experimental intensity to a baseline control intensity that remains constant, or rescaling without designating a baseline so that intensity levels in both channels are mutually adjusted.

Normalize by trimmed mean. This method is essentially similar to normalization by mean, except that the trimmed mean for a variable is based on all values except a certain percentage of the lowest and the highest values of that variable. This reduces the influence of outliers during normalization. Setting the trim value to 10%, for example, excludes the top 5% and the bottom 5% of values from the calculation. Once again, the normalization can be performed with or without a baseline.

Normalize by percentile. The X-percentile is the value in a data set that X% of the data are lower than or equal to. One common way to control for systemic bias in microarrays is normalizing to the distribution of all genes, i.e., normalizing by percentile value. Signal strength of all genes in sample X is therefore normalized to a specified percentile of all of the measurements taken in sample X. If the chosen percentile value is very high (e.g., the 85-percentile), the corresponding data point lies sufficiently far away from the origin that a good line can be drawn through all the points. The slope of the line for each variable is then used to rescale that variable. One caveat of this sort of normalization is that it assumes that the median signal of the genes on the chip stays relatively constant throughout the experiment.
If the total number of expressed genes in the experiment changes dramatically due to true biological activity (causing the median of one chip to be much higher than another), then the true expression values have been masked by normalizing to the median of each chip. For such an experiment, it may be desirable to consider normalizing to something other than the median, or one may want to instead normalize to positive controls.

Scale between 0 and 1. If the intent of a microarray experiment is to study the data using clustering, the user may need to put different genes on a single scale of variation. Normalizations that may accomplish this include scaling between 0 and 1. Gene expression values are scaled such that the smallest value for each gene becomes 0 and the largest value becomes 1. This method is also known as Min-Max normalization.

Subtract the mean. This method is generally used in the context of log-transformed data. It replaces each value by [value - mean (expression values of the gene across hybridizations)]. Mean and median centering are useful transformations because they reduce the effect of highly expressed genes on a dataset, thereby allowing the researcher to detect interesting effects in weakly expressed genes.

Subtract the median. This method is also generally used in the context of log-transformed data and has a similar effect to mean centering, but is more robust and less susceptible to the effect of outliers. It replaces each value by [value - median (expression values of the gene across hybridizations)].

Fold change as signed ratio. This is essentially similar to normalization by mean. A fold change for a gene under two different conditions (or chips) is created. If there are n genes and five variables (A, B, C, D, and E), assuming that variable A is considered baseline, the normalized value e_i for the variable E in the ith gene is calculated as Norm e_i = e_i / a_i, where a_i is the value of variable A in the ith gene.
Fold change as log ratio. If there are n genes and five variables (A, B, C, D, and E), assuming that variable A is considered baseline, the normalized value e_i for the variable E in the ith gene is calculated as Norm e_i = log (e_i / a_i), where a_i is the value of variable A in the ith gene.

Fold change as log ratio in standard deviation units. If there are n genes and five variables (A, B, C, D, and E), assuming that variable A is considered baseline, the normalized value e_i for the variable E in the ith gene is calculated as Norm e_i = [1/Std(x)] · log (e_i / a_i), where Std(x) is the standard deviation of a matrix of log ratios of all signal values for the corresponding record.

Z-score calculation. Z-score provides a way of standardizing data across a wide range of experiments and allows the comparison of microarray data independently of the original hybridization intensities. This normalization is also typically performed in log space. Each gene is normalized by subtracting the median (or the mean) across all experiments from the given expression level, and then dividing by the standard deviation. This weights the expression levels in favor of those records that have lesser variance.

Literature Cited

Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R., Cheng, C., Naeve, C.W., Relling, M.V., and Evans, W.E. 2003. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet. 34:85-90.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C., Trent, J.M., Staudt, L.M., Hudson, J. Jr., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87.
Jolliffe, I.T. 1986. Principal Component Analysis. Springer Series in Statistics. Springer-Verlag, New York.
Kerr, M.K. and Churchill, G.A. 2001. Experimental design for gene expression microarrays. Biostatistics 2:183-201.
Kozal, M.J., Shah, N., Shen, N., Yang, R., Fucini, R., Merigan, T.C., Richman, D.D., Morris, D., Hubbell, E., Chee, M., and Gingeras, T.R. 1996. Extensive polymorphisms observed in HIV-1 clade B protease gene using high-density oligonucleotide arrays. Nat. Med. 2:753-759.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., Zeitlinger, J., Jennings, E.G., Murray, H.L., Gordon, D.B., Ren, B., Wyrick, J.J., Tagne, J.B., Volkert, T.L., Fraenkel, E., Gifford, D.K., and Young, R.A. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799-804.
Leung, Y.F. and Cavalieri, D. 2003. Fundamentals of cDNA microarray data analysis. Trends Genet. 19:649-659.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematics, Statistics and Probability 1967:281-297.
Sankoff, D. and Kruskal, J.B. 1983. Time Warps, String Edits, and Macromolecules. The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, Mass.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467-470.
Schena, M., Heller, R.A., Theriault, T.P., Konrad, K., Lachenmeier, E., and Davis, R.W. 1998. Microarrays: Biotechnology's discovery platform for functional genomics. Trends Biotechnol. 16:301-306.
Smyth, G.K. and Speed, T. 2003. Normalization of cDNA microarray data. Methods 31:265-273.
Smyth, G.K., Yang, Y.H., and Speed, T. 2003. Statistical issues in cDNA microarray data analysis. Methods Mol. Biol. 224:111-136.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281-285.
Yang, Y., Buckley, M.J., Dudoit, S., and Speed, T.R. 2002. Comparison of methods for image analysis on cDNA microarray data. J. Comp. Stat. 11:108-136.
Yeoh, E.J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.H., Evans, W.E., Naeve, C., Wong, L., and Downing, J.R. 2002. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1:133-143.

Contributed by Deepak Kaushal and Clayton W. Naeve
Hartwell Center for Bioinformatics and Biotechnology
St. Jude Children's Research Hospital
Memphis, Tennessee
