Setting
up Completeness Reports
There are 3 tables which are used to run a completeness
report. The first is ETL_QA_SQL. This table creates our measures. There has to
be a source measure and a target measure. So for CUSTOMER_DIM there 2 records
in this table.
|
ETL_QA_SQL |
|
|
|
QA_CATEGORY |
QA_STAT_NAME |
SQL_TO_RUN |
|
CUSTOMER_DIM |
CUSTOMER_DIM_S |
select count(*) from SOURCE.CUSTOMERS; |
|
CUSTOMER_DIM |
CUSTOMER_DIM_T |
select count(*) from CUSTOMER_DIM Where ActiveFlag = 1; |
|
INVOICE_FACT_AmountPaid |
INVOICE_FACT_AmountPaid_S |
select sum(AMT_PD) from SOURCE.INVOICES; |
|
INVOICE_FACT_AmountPaid |
INVOICE_FACT_AmountPaid_T |
select sum(AmountPaid) from INVOICE_FACT; |
|
INVOICE_FACT_CustomerId |
INVOICE_FACT_CustomerId_S |
select count(*) from INVOICE_FACT Where CustomerId <> 0
And LastUpdate = current_date; |
|
INVOICE_FACT_CustomerId |
INVOICE_FACT_CustomerId_T |
select count(*) from INVOICE_FACT Where LastUpdate =
current_date; |
Now when you calculate these measures the results are put in
the ETL_QA_RESULT. Here is a sample.
|
ETL_QA_RESULT |
|
|
|
|
ETL_QA_RESULT_ID |
CREATE_TS |
QA_STAT |
SQL_TO_RUN |
|
CUSTOMER_DIM_S |
2007-03-05 10:55 |
975 |
select count(*) from SOURCE.CUSTOMERS; |
|
CUSTOMER_DIM_T |
2007-03-05 10:56 |
1000 |
select count(*) from CUSTOMER_DIM Where ActiveFlag = 1; |
|
INVOICE_FACT_AmountPaid_S |
2007-03-05 11:05 |
1000 |
select sum(AMT_PD) from SOURCE.INVOICES; |
|
INVOICE_FACT_AmountPaid_T |
2007-03-05 11:06 |
1000 |
select sum(AmountPaid) from INVOICE_FACT; |
|
INVOICE_FACT_CustomerId_S |
2007-03-05 10:35 |
960 |
select count(*) from INVOICE_FACT Where CustomerId <> 0
And LastUpdate = current_date; |
|
INVOICE_FACT_CustomerId_T |
2007-03-05 10:36 |
1000 |
select count(*) from INVOICE_FACT Where LastUpdate =
current_date; |
Now all we need to do is setup the report. The report has to
know the source measure and the target measure to figure out the completeness.
This is controlled through the ETL_QA_RESULT_MAP table. Here is how it would
look.
|
ETL_QA_RESULT_MAP |
|
|
|
TABLE_NAME |
ETL_QA_RESULT_ID_S |
ETL_QA_RESULT_ID_T |
|
CUSTOMER_DIM |
CUSTOMER_DIM_S |
CUSTOMER_DIM_T |
|
INVOICE_FACT |
INVOICE_FACT_AmountPaid_S |
INVOICE_FACT_AmountPaid_T |
|
INVOICE_FACT |
INVOICE_FACT_CustomerId_S |
INVOICE_FACT_CustomerId_T |
The report is created from joining the last 2 tables together
where you only use the last run of results. Here is the SQL.
/* -----------------------------------------------------------------
EtlResultCompleteness.sql
Decription: Reports
completeness of dimensions and fact tables.
Written by: Kim
Duke
Notes:
Need to setup
tables
1. ETL_QA_SQL
2.
ETL_QA_RESULT_MAP
Then run sequence
SeqQaCompleteness
----------------------------------------------------------------- */
Select
ETL_QA_RESULT_MAP.TABLE_NAME AS QA_CATEGORY,
convert(decimal(18,0),AllRuns_T.QA_STAT) AS TARGET_COUNT,
convert(decimal(18,0),AllRuns_S.QA_STAT) AS SOURCE_COUNT,
/* Oracle
ROUND((100 * (AllRuns_T.QA_STAT / AllRuns_S.QA_STAT)), 2) AS COMPLETENESS */
convert(decimal(18,2),ROUND((100 *
(convert(decimal(18,2),AllRuns_T.QA_STAT) /
convert(decimal(18,2),AllRuns_S.QA_STAT))), 2)) AS COMPLETENESS
From
ETL_QA_RESULT_MAP,
ETL_QA_RESULT
AllRuns_S,
(
Select
ETL_QA_RESULT.ETL_QA_RESULT_ID,
Max(ETL_QA_RESULT.CREATE_TS) AS MAX_CREATE_TS
From
ETL_QA_RESULT
Group By
ETL_QA_RESULT.ETL_QA_RESULT_ID
) LastRun_S,
ETL_QA_RESULT
AllRuns_T,
(
Select
ETL_QA_RESULT.ETL_QA_RESULT_ID,
Max(ETL_QA_RESULT.CREATE_TS) AS MAX_CREATE_TS
From
ETL_QA_RESULT
Group By
ETL_QA_RESULT.ETL_QA_RESULT_ID
) LastRun_T
Where
AllRuns_S.ETL_QA_RESULT_ID = LastRun_S.ETL_QA_RESULT_ID
AND
AllRuns_S.CREATE_TS = LastRun_S.MAX_CREATE_TS
AND
AllRuns_S.ETL_QA_RESULT_ID = ETL_QA_RESULT_MAP.ETL_QA_RESULT_ID_S
AND
AllRuns_T.ETL_QA_RESULT_ID = LastRun_T.ETL_QA_RESULT_ID
AND
AllRuns_T.CREATE_TS = LastRun_T.MAX_CREATE_TS
AND
AllRuns_T.ETL_QA_RESULT_ID = ETL_QA_RESULT_MAP.ETL_QA_RESULT_ID_T
Order By
ETL_QA_RESULT_MAP.TABLE_NAME
;
|
Completeness Report |
|
|
|
|
QA_CATEGORY
|
TARGET_COUNT |
SOURCE_COUNT |
COMPLETENESS |
|
CUSTOMER_DIM |
975 |
1000 |
97.50% |
|
INVOICE_FACT_AmountPaid |
1000 |
1000 |
100.00% |
|
INVOICE_FACT_CustomerId |
960 |
1000 |
96.00% |
This is a simple report. The html page is a quick and dirty
way to report straight out of DataStage. It is generated from a DataStage job I
wrote. Any SQL statement can be sent in this format to any user or group of
users. This whole process is automatically ran after our daily loads are done.
This is what we call a sanity check. All the real SQL has only the records
loaded today. So the WHERE clause has LastUpdate = current_date. Some
dimensions calculate the whole table because it does not take that long.