5 About the QA Package

This documentation describes QA Package methods and capabilities and provides the information required to get started with running the tool.

5.1 Control Flow Module

All Level 1 checks for all tables are performed first. If any major issues with the data are detected the package will abort (see Figure 5.1)

Control Flow: Level 1 Module Abort Logic

Figure 5.1: Control Flow: Level 1 Module Abort Logic

All Level 2 checks are then performed in a logical sequence and abort at each step if AbortYN = Y for any flag in that step (see Figure 5.2). Level 2 checks include:

  1. Perform critical intra-table checks that would cause downstream data integrity issues
  2. Perform critical cross-table checks for the same reasoning as above. Note: this is the final place where the package may abort
  3. Continue to remaining Level 2 checks and all remaining modules, regardless of the resulting data flags. Note: Prior to this step, all resulting datasets are output to the “dplocal” folder. Only log, metadata, and signature files will be in the “msoc” subfolder, until all abort checks have been successfully executed. At that time, all appropriate datasets will be moved to the “msoc” subfolder.
Control Flow: Level 2 Module Abort Logic

Figure 5.2: Control Flow: Level 2 Module Abort Logic

After successful completion of the Level 2 module, each individual module following will then be executed, in the order specified in the inputfiles/control_flow.csv file.

5.2 Definition Of Enrollment Span Comparisons

Definitions and Examples of Enrollment Date Range Relationships by PatID

Figure 5.3: Definitions and Examples of Enrollment Date Range Relationships by PatID

5.3 Minimum And Maximum Dates Of Data Completeness

Minimum and Maximum dates of data completeness are created by this package for all SCDM tables4 containing at least one date variable as defined in the input file lkp_all_minmax.sas7bdat.5

Definition 5.1 (Minimum date of data completeness) The Minimum date of data completeness (mindate) is calculated by determining the earliest year-month—e.g., 2010-01—with a record count within an 80% threshold of the record count of the next month—e.g., 2010-02—and then assigning the first day of that month to create a SAS date, formatted as YYYY-MM-DD—e.g., 2010-01-01.

Definition 5.2 (Maximum date of data completeness) The Maximum date of data completeness (maxdate) is calculated by determining the latest year-month—e.g., 2017-10—with a record count within an 80% threshold of the record count of the prior month—e.g., 2017-09—and then assigning the last day of that month to create a SAS date, formatted as YYYY-MM-DD—e.g., 2017-10-31.

Overall minimum are then calculated thus:

Definition 5.3 (Overall Minimum date of data completeness (DP Min)) The overall dp_mindate is calculated by determining the latest mindate—i.e. the maximum of the minimum—from the SCDM Enrollment, Dispensing, Encounter, Diagnosis, and Procedure Table

Definition 5.4 (Overall Maximum date of data completeness (DP Max)) The overall dp_maxdate is calculated by determining the earliest maxdate—i.e. the minimum of the maximum—from the SCDM Enrollment, Dispensing, Encounter, Diagnosis, and Procedure Table

These dates are stored in a SAS dataset minmax_dates. The DPMin and DPMax associated with the latest production ETL at each Data Partner site will be used by Common Components (CC) to populate the global macro variables &mindate and &maxdate for distributed request packages.

Example of Maximum date of data completeness algorithm

Figure 5.4: Example of Maximum date of data completeness algorithm

5.4 Age Calculation

Age in years (age_years) is calculated using the date of birth variable found in the SCDM Demographic Table and the overall maxdate calculated for the ETL under review.

Definition 5.5 (Kreuter method for age calculation) The following equation measures age in whole years. It counts the months between the two dates, subtracts one month if the day boundary has not been crossed for the last month, and then converts months to years and reports it as an integer.1

'Age_years' = floor((intck('month','birth_date',&DP_MaxDate.)-(day(&DP_MaxDate.)\<day('birth_date')))/12)

Age in years is summarized based on the following categories:

00. Missing
00. Negative
01. 0-1 yrs
02. 2-4 yrs
03. 5-9 yrs
04. 10-14 yrs
05. 15-18 yrs
06. 19-21 yrs
07. 22-44 yrs
08. 45-64 yrs
09. 65-74 yrs
10. 75+ yrs

5.5 Enforcing the ICD-10 switchover for Inpatient and Institutional Encounters

The Quality Assurance (QA) Package includes two warn flags, DIA_2_07_00-0_223 and PRO_2_07_00-0_223. These flags are designed to signal when an ICD-10 code is detected prior to October 1, 2015, or when an ICD-9 code is identified on or after this date in either the Diagnosis or Procedure tables.

For the majority of encounter types, the triggering of these flags is dependent on the ADate variable in either the Diagnosis or Procedure table. However, there is an exception for inpatient (IP) and institutional stays (IS). During the transition from the ICD-9 to ICD-10 era, the switchover was based on the discharge date rather than the admission date for these two types of encounters. Consequently, when the EncType is either IP or IS, the QA Package bases this validation check on the Ddate variable from the Encounter Table. The mechanics of the QA Package dictates that these inpatient and institutional stay flags be treated as cross-table checks and are thus implemented as DIA-ENC_2_07_00-0_223 and PRO-ENC_2_07_00-0_223.

5.6 L3 Comparison Module

The L3 Compare module compares each ETL run against the previous one. It checks for notable changes in record counts, patient counts, null rates, value distributions, and temporal patterns across the SCDM tables, and flags instances where differences exceed SOC defined thresholds.

5.6.1 Prerequisites

5.6.1.1 Required Inputs

Running the L3 Compare module requires the previous ETL’s L3 output files alongside the current ETL outputs. Two files are required for the module to start at all:

  • all_l1_record_counts — used to calculate proportional differences
  • minmax_dates — used to determine how many months of new data were added

If either file is missing or empty, the entire module is skipped and a warning is written to the log. The current ETL run and existing L3 outputs are unaffected.

5.6.1.2 Where to Put Historical Files

Historical L3 results are placed in the inputfiles directory of the QA Package by SOC analysts.

5.6.1.3 Configuration

The module is driven by two lookup tables:

lkp_l3_auto — defines which checks run, what datasets they read, and how results are categorised. This table does not require modification under normal operation.

lkp_l3_threshold — defines the flagging thresholds by data volume tier. Default values are set at release and apply unless site-specific overrides are in place by SOC analysts

5.6.2 How Flags Are Generated

5.6.2.1 Thresholds Scale With Data Volume

The module calculates how many months of new data were added between ETL runs by comparing date ranges in minmax_dates:

DIFFMIN = new.DP_MinDate - old.DP_MinDate
DIFFMAX = new.DP_MaxDate - old.DP_MaxDate
MonthsAdded = DIFFMIN + DIFFMAX

It uses this to select the appropriate threshold tier from lkp_l3_threshold.

Months Added overall_count_threshold_low overall_count_threshold_high overall_count_threshold_neg ym_threshold propdiff_threshold
≤ 4 1% 5% -5% 1% 1%
5 – 12 5% 10% -5% 1% 1%
13 – 18 10% 15% -5% 1% 1%
19 – 24 15% 20% -5% 1% 1%
≥ 25 20% 25% -5% 1% 1%

The overall count thresholds (overall_count_threshold_low and overall_count_threshold_high) scale with refresh size. When only a small number of months were added, even a modest percentage change is notable. When a large refresh was processed, the thresholds are wider to avoid noise from expected data growth. These two columns drive WARN and CRIT severity respectively for overall count checks.

The negative threshold (overall_count_threshold_neg) is fixed at -5% regardless of refresh size. Data loss is treated as higher-sensitivity regardless of how much new data came in.

The year-month threshold (ym_threshold) and proportional difference threshold (propdiff_threshold) are both fixed at 1% across all tiers. The ym_threshold applies to year-month checks only, which never upgrade to CRIT. The propdiff_threshold applies to overall count checks as the second metric in the dual-threshold evaluation.

5.6.2.2 Two Metrics, Two Severity Levels

For most checks, the module calculates two metrics:

  • Percent change(new_count - old_count) / old_count × 100
  • Proportional difference(new_count / variable_denom_new) - (old_count / variable_denom_old), where variable_denom is the total record count for the table being checked, sourced from all_l1_record_counts

Each metric is compared against its threshold independently. A CRIT can be reached two ways:

  • If one metric exceeds its threshold: WARN
  • If percent change exceeds overall_count_threshold_high: CRIT
  • If both percent change and proportional difference exceed their thresholds: CRIT

The flag description indicates which metric(s) triggered the flag — see Reading Flag Descriptions below.

Year-month checks are an exception: they only use percent change and never upgrade to CRIT, since month-over-month variation in temporal data is expected.

5.6.2.3 Edge Cases

  • When the prior ETL count is zero, percent change is set to missing and the record is not flagged on that metric.
  • Checks skip the most recent 4 months of year-month data. Trailing months often have incomplete data due to claims processing lag, so comparing them would produce misleading flags.
  • If a dataset is missing, empty, or lacks a required variable, that specific check is skipped. The log records why. Other checks continue normally.

5.6.3 Outputs

5.6.3.1 Primary Output: all_l3_flags

This table is a consolidated summary of everything the module flagged, saved to both dplocal.all_l3_flags and msoc.all_l3_flags.

Column Description
FlagID Unique flag identifier, tokenised by table and check type
FlagType NOTE, WARN, or CRIT
Flag_Descr Plain-text description of what was flagged and why
Count Number of records that triggered this flag

Records are sorted by FlagID. Only flagged records are included — if a check ran clean, it won’t appear here.

5.6.3.2 Intermediate Outputs

Each check type also writes its own detailed dataset to DPLOCAL and MSOC. These contain the full comparison results (both flagged and unflagged records), useful for investigating results that did not flag or for checking proximity to threshold.

Dataset What It Contains
l3_checkid_300 Total record counts by table
l3_checkid_310 Record counts by one stratifying variable
l3_checkid_312 Record counts by year-month
l3_checkid_315 Unique patient counts by table
l3_checkid_316 Cross-table record discrepancy counts
l3_checkid_320 Record counts by two stratifying variables (overall thresholds)
l3_checkid_322 Record counts by two stratifying variables (year-month thresholds)
l3_checkid_323 Unique patient counts by year-month
l3_checkid_350 Null counts by variable
l3_checkid_354 Unexpected zero value counts
l3_checkid_357 Special missing value counts by variable
l3_checkid_360 Conditional null counts
l3_checkid_361 Conditional non-missing value counts
l3_checkid_362 Dual-conditional non-missing value counts

5.6.4 Interpreting Results

5.6.4.1 Severity Levels

NOTE

Informational. Worth knowing about, but doesn’t indicate a likely data quality issue on its own.

WARN

One threshold was exceeded. Investigate to determine whether the change is expected (e.g., a known data refresh characteristic) or warrants escalation.

CRIT

Severity is upgraded when either: percent change exceeds overall_count_threshold_high, or both percent change and proportional difference exceed their respective thresholds simultaneously. CRITs should be reviewed before sign-off.

5.6.4.2 Reading Flag Descriptions

The Flag_Descr field indicates what triggered the flag:

  • “Percent change between ETLs exceeds threshold (overall_count_threshold)” — only the percent change metric fired
  • “Proportion change between ETLs exceeds threshold (propdiff_threshold)” — only the proportional difference metric fired
  • “Percent change exceeds overall_count_threshold and proportion change exceeds propdiff_threshold” — both fired (this is a CRIT)

The [threshold_type] token in the description is replaced with the actual threshold column name from lkp_l3_threshold, making it possible to trace exactly which threshold applied.

5.6.4.3 Example

Setup: Previous ETL covered 2020-01 to 2022-06. Current ETL covers 2020-01 to 2022-12.

Months added is calculated as:

DIFFMIN = new.DP_MinDate - old.DP_MinDate = 0 months
DIFFMAX = new.DP_MaxDate - old.DP_MaxDate = 6 months
MonthsAdded = DIFFMIN + DIFFMAX = 6

6 months added → threshold tier 5–12: overall_count_threshold_low = 5%, overall_count_threshold_high = 10%, propdiff_threshold = 1%.

Proportional difference is calculated as:

PROP_DIFF = (new_count / variable_denom_new) - (old_count / variable_denom_old)

Where variable_denom is the total record count for the table being checked, sourced from all_l1_record_counts.


Scenario A — WARN (percent change above low threshold, proportional difference below threshold):

DTH table: old = 9,600 records, new = 10,272 records, table denominator = 500,000.

  • PCT_CHANGE = (10,272 - 9,600) / 9,600 × 100 = 7% → exceeds overall_count_threshold_low (5%) ✓, below overall_count_threshold_high (10%) ✗
  • PROP_DIFF = (10,272 / 500,000) - (9,600 / 500,000) = 0.02054 - 0.01920 = 0.00134 (0.13%) → below propdiff_threshold (1%) ✗

One metric exceeded → WARN


Scenario B — CRIT via dual threshold (both metrics exceeded):

DTH table: old = 9,600 records, new = 10,272 records, table denominator = 50,000.

  • PCT_CHANGE = 7% → exceeds overall_count_threshold_low (5%) ✓, below overall_count_threshold_high (10%) ✗
  • PROP_DIFF = (10,272 / 50,000) - (9,600 / 50,000) = 0.20544 - 0.19200 = 0.01344 (1.34%) → exceeds propdiff_threshold (1%) ✓

Both metrics exceeded → CRIT


Scenario C — CRIT via overall_count_threshold_high alone:

DTH table: old = 9,600 records, new = 10,656 records, table denominator = 500,000.

  • PCT_CHANGE = (10,656 - 9,600) / 9,600 × 100 = 11% → exceeds overall_count_threshold_high (10%) ✓

Percent change alone exceeds the high threshold → CRIT (proportional difference is not evaluated)

5.6.5 Check Reference

5.7 Check Reference

CheckID What It Compares Threshold Type Can Upgrade to CRIT?
300 Total record counts by table Overall count Yes
310 Counts by one stratifying variable Overall count Yes
312 Counts by year-month Year-month No
315 Unique patient counts by table Overall count Yes
316 Cross-table record discrepancies Overall count Yes
320 Counts by two stratifying variables Overall count Yes
322 Counts by two stratifying variables (temporal) Year-month No
323 Unique patient counts by year-month Year-month No
350 Null counts by variable Overall count Yes
354 Unexpected zero value counts Overall count Yes
357 Special missing value counts Overall count Yes
360 Conditional null counts Overall count Yes
361 Conditional non-missing value counts Overall count Yes
362 Dual-conditional non-missing value counts Overall count Yes

5.7.1 Troubleshooting

5.7.1.1 The entire module was skipped

The log will contain a WARN message indicating which required file was missing or empty (all_l1_record_counts or minmax_dates). Verify that the previous ETL’s L3 outputs were provided as input and that both files contain observations.

5.7.1.2 A specific check is missing from all_l3_flags but was expected to run

Two possibilities:

  1. The check ran and produced no flags — clean results don’t appear in all_l3_flags.
  2. The check was skipped due to a dataset issue, such as the previous ETL L3 data is missing from the inputfiles folder, look for a NOTE in the log with the dataset name and failure reason.

References

1.
William Kreuter. Age calculation and when does the sun rise and fall code. In: SAS Conference Proceedings: Pacific Northwest SAS Users Group 1996.; 1996:3. https://www.lexjansen.com/pnwsug/1996/PNWSUG96025.pdf

  1. While the r01_mother_deliveries and / or r02_infants tables, created by the separate Mother-Infant Identification Program contain dates, these are derived dates from the core tables and are thus not used for setting minimum and maximum dates.↩︎

  2. It should be noted that the min/maxdate algorithm may not work well with all types of date distributions—e.g., a distribution with a large drop proceeded or followed by a long, flat tail of many months. When there are at least two consecutive months at the tail end of the distribution with relatively low counts, the algorithm may sometimes pick a month with incomplete data. For example, if the Year-Month 2017-06 in Figure 4 had a count of “600” instead of “400”, it would meet the 80% threshold and be incorrectly chosen as the max date.↩︎