SAS Optimization Tips

by Mark Hudson on Jan 30, 2013

Although SAS provides a broad and powerful set of tools, it is primarily known as a data analytics leader.  Many SAS users are data analysts who focus on the outcome of their analysis rather than on SAS performance optimization.

If you are new to SAS, you may not know how to optimize performance.  If you only analyze relatively small volumes of data, you may not be concerned with performance at all.  Either way, as a SAS user you should be aware of these optimization tips before you run into problems.

Reduce

After reading data from outside sources, SAS stores the data in its proprietary data set format.  Those data sets may be temporary (dropped when the SAS session ends) or permanent (available to future SAS sessions).
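As a quick illustration, a permanent library is assigned with a LIBNAME statement, while the WORK library is always temporary.  The library name, path and data set names below are hypothetical:

LIBNAME MYLIB 'C:\sasdata';   /* hypothetical permanent library on disk */

DATA WORK.TEMPSET;     /* temporary: dropped when the SAS session ends */
  SET MYLIB.PERMSET;   /* permanent: remains available to future sessions */
RUN;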

Both temporary and permanent data sets accept DROP and KEEP options.  Say you are merging two data sets and you need most of the fields in each: use the DROP option to exclude the fields you do not need.  Conversely, if you need only a few of the fields in either data set, use the KEEP option to name the fields you do need.  The sample code below shows ways you can mix and match DROP and KEEP options.

DATA DATASET3 (DROP = FIELDC);                    /* exclude FIELDC from the output */
  MERGE DATASET1 (DROP = FIELD1 FIELD2 FIELD3)    /* exclude three fields on input */
        DATASET2 (KEEP = FIELDA FIELDB FIELDC);   /* include only three fields on input */
  BY FIELDA;
  NEWFIELD = FIELDC * 2;   /* FIELDC is still available inside the step */
RUN;

Notice the DROP and KEEP options for DATASET1, DATASET2 and DATASET3 are independent of each other.  FIELD1, FIELD2 and FIELD3 are excluded as DATASET1 is read.  Likewise, only FIELDA, FIELDB and FIELDC are included as DATASET2 is read.  The DROP option on DATASET3 excludes FIELDC from the output even though FIELDC was available within the step to create NEWFIELD.

By limiting data sets to only necessary fields, you minimize the computing resources required to complete your tasks.

Reuse

Even temporary SAS data sets are stored for the life of a SAS session.  Setting, merging and even sorting data sets can create new data sets, and over the life of a session those data sets consume space.  Think about the first data set you read into your SAS session (e.g., DATASET1).  If you sorted that data set into another data set (e.g., DATASET1A) and then merged the sorted data set with another data set (e.g., DATASET2) into a new data set (e.g., DATASET3), then DATASET1, DATASET1A and DATASET2 still consume space even though you now need only DATASET3.
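As a sketch of that scenario (assuming FIELDA is the merge key and DATASET2 is already sorted by it):

PROC SORT DATA = DATASET1 OUT = DATASET1A;
  BY FIELDA;
RUN;

DATA DATASET3;
  MERGE DATASET1A DATASET2;
  BY FIELDA;
RUN;

The WORK library now holds DATASET1, DATASET1A, DATASET2 and DATASET3, even if only DATASET3 matters from here on.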

To reuse the space used by DATASET1, DATASET1A, and DATASET2, you simply delete those data sets in your SAS session.  The sample code below shows how to reclaim the space for reuse.

PROC DATASETS LIBRARY = WORK NOLIST;   /* NOLIST suppresses the directory listing */
  DELETE DATASET1 DATASET1A DATASET2;
RUN;
QUIT;

Keep in mind that once you delete those data sets, they are no longer available.

If you are already familiar with the PROC DELETE method of deleting data sets, remember that SAS has threatened to deprecate that syntax.
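For comparison, the older syntax looks like the sketch below; PROC DATASETS remains the safer choice:

PROC DELETE DATA = WORK.DATASET1 WORK.DATASET1A WORK.DATASET2;
RUN;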

Rethink

A recent project presented me with a reasonably simple SAS objective: read in a file of business data, apply prescribed business rules and output the original data along with the results of those rules.

My initial design was to read the entire input row, apply the business rules and write the completed output row.  Once I better understood the data and the rules, I recognized that most of the fields on the very wide input rows were not involved in the business rules.  Large fields like names and addresses consumed computing resources unnecessarily when I processed all of the data together.

Using the DROP and KEEP options to separate the fields required by the business rules from the fields that were not, I was able to create two data sets, as sketched below.  One data set (e.g., DATASET1A) holds the static data not used in the business rules.  The other data set (e.g., DATASET1B) holds the significantly smaller volume of data actually used in the business rules logic.
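A minimal sketch of that split, assuming hypothetical field names (NAME and ADDRESS as static fields, AMOUNT and STATUS as rule fields) and a hypothetical ROWNUM variable:

DATA DATASET1A (KEEP = ROWNUM NAME ADDRESS)    /* static fields */
     DATASET1B (KEEP = ROWNUM AMOUNT STATUS);  /* fields the business rules use */
  SET DATASET1;
  ROWNUM = _N_;   /* row number ties the two data sets back together */
RUN;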

By including the row number from the original input file in both DATASET1A and DATASET1B, I am able to rejoin the static DATASET1A with the manipulated DATASET1B to produce a final result set that meets the original requirements.  By separating the static and volatile data, I minimize computing resource consumption and better prepare the solution to scale to larger data volumes.
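Rejoining is then a simple merge on the row number (FINALSET is a hypothetical name; both data sets are already in ROWNUM order from the split, so no sort is needed):

DATA FINALSET;
  MERGE DATASET1A DATASET1B;
  BY ROWNUM;
RUN;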

Of course, the results are more important than performance optimization.  Ad hoc analysis of relatively small data volumes may not warrant the considerations described above.  But understanding these optimization tips and factoring them into your designs builds good habits for the future.