Read Validating And Updating Your Data Using SAS® Formats text version

Validating And Updating Your Data Using SAS® Formats

Peter Welbrock, Britannia Consulting, Inc., MA

Overview In whatever way you use SAS software, at some point you will have to deal with data. It is unavoidable. The trouble is with data is that it is not always: · · Valid Correct

Data for Examples For each of the examples, we will be using the same data. This will make following each example easier. This data will be rich in `formatable' data so that each of the techniques will be applicable. The data will look as follows: Column Customer_num Age Gender Classification birthdate zip phone Orders value Description Unique Customer id Customer Age Customer Gender Customer Type SAS date value Zip Code Phone Number Number of Orders $ Value of Orders

Fortunately, SAS® has a myriad of tools to deal with such issues. One such tool is formats, which allows to you both view your data in another form, or to transform it based upon pre-determined rules i.e. those established when the format was created. This paper will cover the basic uses of formats within SAS. Rather than just approaching formats from a syntactical perspective, it will be approached using examples of their use. Syntax is easy to reference, but practical uses are harder to find. Both the use and the creation of formats will be used. Formats that are part of SAS software itself will be briefly covered followed by those created by the user. Emphasis will be given to both the creation and the use of the format. What is a Format? A format is a stored set of rules that can be used to restructure the cardinality of a column. This restructuring can take place in either viewing the data, or by recoding the data. These rules can either be determined by the user (a user-defined format), or by SAS software itself. This is a very simplistic definition of a tool that is extremely powerful and useful, but one that will suffice for an introduction to the subject. Rather than a definition of what a format actually is, more useful will be what it can be used for. This paper will be concentrating on the latter.


Table: Customer_data

The actual data values are predictable given the column descriptions, and are rich in potential for main topics of this paper: the validation, recoding (transformation) and lookup of data. Viewing Data In a Different Form (Using inherent SAS Formats) This is one of the most elementary uses of formats. Data is stored in one way, but is required to be viewed in another. For example, in the customer_data file, the birth date is stored as a SAS date, but using SAS date time formats, could be viewed or transformed into almost any form. For example, if the customer was born on January 1, 1960, then because it is a SAS date, the stored value would be 0 (zero). If the birth month is required from this date, then it would not be necessary to programatically work this out, but a SAS format could be used. Many formats are supplied with SAS software and are therefore available to anyone using the software. It is very

worthwhile exploring the different categories of formats SAS supplies to understand the general ways in which they can be used, even if every single available format is not learned. Key categories (with format examples) are as follows (see the online documentation for a complete list): · Numeric Formats ! PERCENTw.d (convert percentages to numeric values) ! COMMAw.d (removes embedded characters) Character Formats ! $UPCASEw. (convert character data to upcase) ! $QUOTEw. (writes data values enclosed in double quotation marks), Date and Time Formats ! DAYw. (writes day of week) ! DOWNAMEw. (writes data value as the name of the day of the week) ! MONYYw. (writes date values in the form mmmyy or mmmyyyy) ! MONNAMEw. (writes a date value as the name of month)

The output from the Print procedure will contain two columns: one containing the stored value of the customer_num column, and the other containing the 9 character month corresponding the date the customer was born.

Note: most SAS procedures will, by default, operate on the formatted value of a column. Make sure that you read and understand how the formatted values are used before applying formats to columns in SAS procedures.



So, up to this point, we have seen the types of inherent formats SAS supplies. We have also seen a simple example of how a format can be applied using a very basic SAS procedure. Of course, use of formats is not limited to procedures, but can also be used within the data step construct, or SQL. The simple example above illustrated the process of viewing data in a different form than it is stored. In this case, we converted a SAS date value to a 9-character month value for display purposes only. We also applied the format on an `as needed' basis, rather than applying it permanently to the column. Transforming (recoding) Data Having seen how a format can be used to view data in a different form than it is stored, the same principle can be used to actually transform data. Rather than using predetermined inherent SAS formats for this, an example of user defined formats will be used. Supposing the column classification in the customer_data table is to be changed into something more meaningful (maybe for analysis purposes). The way the data is stored currently is at too granular a level and it needs to be grouped into more convenient levels. Obviously, SAS itself will not have a convenient format to perform this change. There is, however, a SAS procedure that can be used to create a format. Not surprisingly, this procedure is called PROC FORMAT.

The above list merely illustrates the types of formats that SAS supplies. The entire list is very long, but one that is worthwhile perusing.

SAS formats are used to `write-out' existing data in another form. SAS also has the concept of an informat that is a mirror image of a format. Instead of writing out as a format will do, an informat will `read-in' data in a particular form.

Note the final example in the list above: monname. As the format name suggests, this format will write out the month name from a SAS date value. The following snippet of code will illustrate the how this can be used to display the month instead of the value 0 (zero). proc print data=customer_data noobs; var customer_num birthdate; format birthdate monname9.; run;

There are many different options available for specifying which values are included in a particular formatted group. There are many short cuts available. Although beyond the scope of this paper, it is definitely worth reading the online documentation to understand these optionsi.

formatted groupings would also have to be numeric in nature. The values that make up the format groupings (see line 3) can then be listed. In line 4, an optional technique is to use the `-` syntax that means that any values in the range from XC to XZ will be included within the Magazine group. The format is now created. It can now be used to transform the classification column. In the example below, the format is being used to create a new column (gen_class). data new_cust; set customer_data; length gen_class $9; gen_class=put(classification,$c_type); run; This is a very simple example of using the put() function to recode a column. It would, of course, have been possible to perform the same actions with if-then-else SAS language constructs, but this would be both lengthy to type and difficult to maintain.

A warning about using the put() function is that it always returns a character string. This might not be what is required, so be careful in its useii.

The following code will create a format that can then be used in the same way a SAS supplied format can be applied. proc format library=work; (1) value $ c_type (2) `XA', `XB'=`Books' (3) `XC' ­ `XZ'='Magazines' (4) `YA'-`YZ'='E books'; run; When run, this code will create a format that will be stored in the work library (line 1). This means, as with the rest of SAS software when storing an object in the work library, that it will only be available for the length of the SAS session. Once the session is ended, the ability to use this format without creating it again will be removed. The format is actually stored in a SAS catalog within the work library. This catalog is always named formats and will automatically be created by SAS. If the format were to be needed across SAS sessions, then it should be permanently stored. This would be done by specifying a library that will be recreated in subsequent SAS sessions (e.g. library=mylib). The name of the format (see line 2) is c_type and it is a character format. It is very important to understand the distinction between character and numeric formats. Character formats can only be used on character data. They are always preceded by a $ upon their use and the format name is limited to seven characters (eight including the $). Numeric formats can only be used on numeric data and the format name can be up to eight characters. Note that if a numeric format were being created, the only difference would be that the $ on the value statement would be removed. Obviously, the values that make up the

One of the benefits of using a format in the example outlined above is that it reduces the amount of code that needs to be maintained. Instead, only the format needs to be maintained. This can be done in a couple of ways: · · Update the Proc Format code that created the format and re-run it. Store the format information in a data table, update the table as needed then recreate the data.

If there is choice between these two methods, the last one is preferable. To use it, however, a further feature of Proc Format is required, the use of the cntlout and cntlin options. The cntlin options facilitates the translation of data within a SAS file directly into a

format. The cntlout option does the opposite, it takes a SAS format and translates it into a SAS data file. The following code will take the format $c_type and store it as a SAS table. proc format library=work cntlout=c_type; select $c_type; run; The cntlout option specifies the name of the SAS table where the format information will be stored. In this case, a temporary SAS table called c_type will be created. The select statement specifies the format(s) to be extracted. Note that if more than one format is specified in the select statement, then all of the information will be placed into the single cntlout= SAS file. If the select statement is not included, then all the formats contained within the catalog speicified by the library= option will be extracted to the output SAS file. Note that the $c_type format will still exist. Using the cntlout option will not remove the format. Once the format information is extracted, then it can be updated just like any other SAS table. It is important to remember, however, that the structure of the table must be kept intact if the format is to be recreated from the table. The c_type table will have many columns (run a Proc Contents against the output file to see a full list), but the following are some of the key ones: Column start end fmtname type label Description Starting Value for the the format grouping Ending value for the format grouping Name of the format Type of format (character or numeric) Format Value Label

(specifically, end is not really needed unless a range is present). In our example, using the $c_type format, to add another format grouping, we would edit the outputted c_type SAS table, adding a new row corresponding to the new format group and then submit the following code: proc format library=work cntlin=c_type; run; This will write over the existing $c_type format, incorporating any changes made within the data table. This method of updating or creating formats is incredibly useful, since it allows the SAS user to use existing data to create formats without having to write multiple lines of code. It also means that formats can be easily maintained, since they will be leveraged off any changes made to data files. For example, if a format was created to collapse general ledger accounts into summary accounts, any time a new account was created, the file used to classify the accounts could be used as the basis for the input into a Proc Format with the cntlin option. Using the illustrations above, user defined formats have been used to re-classify data. The cardinality of a column within a table has been changed based upon the format grouping defined within the user created SAS format. In the example above, a new column was created based upon the formatted values of the classification column.

Be careful when transforming a column based upon format because the transformation is one way. It is always safer to create a new column so that the original data remains intact. There is no concept of `de-formatting'!

An Alternative to Look-up Tables These have been selected as the key columns since to perform the opposite function (create a format from a table), these are the columns that are minimally required Another very powerful use of formats is as an alternative to look-up tables. Using our existing customer_data file, suppose we

want to include some basic demographic data based upon the customer_num and zip. There could be two look-up tables, one for the customer demographic information, and one for the zip code information. The example we have is to compare the customer's income with the median income within their zip code. This could be done with the following code: proc sql; create table cust_demographic as select cust.*, zip.median_income, c_dem.median from customer_data cust, zip_demographic zip, customer_demog c_dem where cust.customer_num=zip.customer_id and cust.customer_num=c_dem.customer_id; quit; This could, depending on the size of the files, become very resource intensive. An alternative would be to create formats. In this case, there would be two formats created as follows: · A format from the zip_demographic file with the customer_id as the start value and median_income as the value. The format will be called $zip_inc. A format from the customer_demog file with the customer_id as the start value and median as the value. The format will be called $c_inc.

which takes up resources. The creation of the two new columns median and median_income will also take up resources. There is no hard set of rules as to using a lookup table or a format is more efficient. A great deal depends on the following: · The computing resources available. SAS formats are loaded into memory, so very large formats might not be a good idea. How often the lookup will be performed. The more often, the more sense it makes to create a permanent format. How static the format will be. If the format grouping change very often, it might be more trouble than it is worth to keep a permanent format. If many attributes are required for a single key value. The example above illustrates the use of two distinct formats. What if five, or ten, or fifteen attributes were needed, would using formats then make sense?iii

· ·


There are instances, however, that using formats as lookups can be surprisingly efficient. It is a technique that should be looked into, even though it is a proprietary to SAS.

Note that there are potential gains in efficiency by creating format groupings in a specific order. If you know that your data has certain values with high frequency, then there is a benefit to storing the applicable format group toward the start of the format. Since SAS will store the format groups in order of the label, the notsorted option on the value statement will ensure that the format values will be stored in the order in which they have been placed. Be wary of this option, however, since it will affect the output behavior of several SAS procedures when output is to be sorted by the formatted value of a column, rather than its actual value.


The following code would perform the same task as the SQL above: proc sql; create table cust_demographic as select cust.*, put(customer_num,$zip_inc.) as median_income, put(customer_num,$c_inc.) as median from customer_data cust; quit; The need to actually perform joins is removed completely. Of course, as with everything, one does not get `something for nothing'. The formats have to be created

Data Validation One of the key problems that every SAS programmer comes across is the validity of

data. The temptation is for `full speed ahead' on the analysis and `damn the torpedoes'. This is a dangerous and counter-productive tendency since few of us work in an environment where we can make assumptions about the validity of data. In fact, the only assumption we can safely make is that somewhere in the data, there are problems lurking that will surface only after we hand in the final report. Problems with data necessitate an approach that will help us both get to know the data (still essential, despite all the modern technological tools available) and uncover any problems. Using formats is one method that can help in uncovering unexpected data. The restriction with formats is that they will be based upon some a priori knowledge of that dataiv. One form of data validation is `finite validation'. This can be used when there are a known number of possibilities for a value in a given column. In customer_data, the column gender can only have two values: 0 or 1. This is a very simple validation that will be easy to perform with a format. The format could be created as follows: proc format library=work; value $ chk_gen `0' = `female' `1' = `male' other='error'; run; With this simple format, the data can then be checked. An example of this is illustrated in the following snippet of code: data check(drop=chk_gen); set customer_data; length chk_gen $6; chk_gen=put(gender,$chk_gen.); if chk_gen = `error' then do; put `gender error id: ` customer_num; output; end; run;

Any customers that have an invalid gender will be both listed in the log and the entire row from the customer_data file will be outputted to the check file. This is a very simple example of using a format to test for the validity of data, but one that can be extended to become very complex. An additional trick in using formats to validate data is to use `nested' formatsv. Suppose that we want to validate for very high or low value values. We might, for example be suspicious of any value below or including $100 or above $100,000. We want to see these rows explicitly, but otherwise want the value printed with the $ sign and commas. We could create the following format: proc format library=mylib; value chk_val low-100='too low' 100 <- 100000 =[dollar16.] 100000 <- high = `too high'; run; (1) (2) (3) (4)

There are a few differences from previous examples we have used: · First of all, we are creating a permanent format (see line 1). We are storing the format in a catalog called formats that is referenced by the mylib libname. Secondly, we are creating, for the first time, a numeric format i.e. a format that will be used on data stored within a SAS numeric field or on numeric data itself (see line 2, that has a value statement without a $). We are using the low option in the range. In this case (line 3) this means that any value below and including 100 will be included in this grouping. In line 4, any value that is between 100 and 100000 (excluding the value 100) will have another format applied. In this case, it is the SAS supplied dollar16 format which writes out the numeric value with dollar signs, commas and decimal points




Line 4 really includes the clever piece of this code. The ability to nest formats can save the programmer a vast amount of time, both in the validation of data, and in any other way that the format might be used. The syntax of the nested format, however, is important with square brackets being essential. Sundry Format Topics Picture Statement One aspect of formats that has not been discussed so far is that using the picture statement. This enables the SAS programmer to use a format as a mask. The picture statement will allow, for instance, telephone numbers, or social security numbers stored without dashes, to be displayed with dashes. Note that a picture can only be used when the data is numeric. The picture statement can best be described by example, using the phone column in the customer_data tablevi. proc format library=work; picture phonenum; low-high = `(999)999-9999' (prefix = `(` ); run; Now, whenever the picture is applied to the phone column, the data will be viewed in the form: (555)555-5555 Conclusions · Know your data! It is important to really know your data when creating such a format. It is easy to mistreat data when applying formats, without even knowing you are doing so. When dealing with very large data, make sure that you test to see when a format might or might not be efficient. There are no simple rules. Using formats should be one weapon in the artillery. Think about permanently storing formats if at all possible. There is an overhead to their creation. If the underlying format


groups change continuously, think about creating formats directly from tables (using the cntlin option) rather than through code. Know your data!

Some great examples of this can be found in Pete Lund's SUGI 25 paper, entitled: More than Just Value: A Look Into the Depths of PROC FORMAT. ii See Jack Shoemaker's SUGI 25 paper entitled: Eight PROC FORMAT Gems. Specifically look at Jack's tip number 4. iii For a discussion on multiple attributes, see Jack Shoemaker's : Eight PROC FORMAT Gems. Specifically look at Jack's tip number 8. iv For a discussion on the validation of data, see: Peter R. Welbrock, Strategic Data Warehousing Principles Using SAS® Software, Cary, NC: SAS Institute Inc., 1998 384pp. See Chapter 6. v This trick was inspired by Jack Shoemaker's tip number 7, in his SUGI 25 paper: Eight PROC FORMAT Gems vi The picture statement is well explained in detail Pete Lund's more advanced paper from SUGI 25: More than Just Value: A Look Into the Depths of PROC FORMAT. Contact Information Please feel free to contact the author with comments/suggestions/abuse: Peter Welbrock Britannia Consulting, Inc RR2 #1132 Vineyard Haven, MA 02568 [email protected] Trademark Information SAS is a registered trademark of SAS Institute, Inc., Cary, NC, USA





Validating And Updating Your Data Using SAS® Formats

7 pages

Find more like this

Report File (DMCA)

Our content is added by our users. We aim to remove reported files within 1 working day. Please use this link to notify us:

Report this file as copyright or inappropriate


You might also be interested in

Chapter 10
electrosonic ss.qxp