Compare tables and extract their differences with standard SQL
Comparing tables in BigQuery is a crucial task when testing the results of data pipelines and queries before productionizing them. Being able to compare tables lets you detect changes or discrepancies in the data, which helps keep it accurate and consistent.
In this article we will demonstrate how to compare two (or more) tables in BigQuery and extract the records that differ (if any). More specifically, we will show how to compare tables with identical columns as well as tables whose columns differ.
First, let's create two tables with some dummy values that we will reference throughout this tutorial to demonstrate a few different concepts.
-- Create the first table
CREATE TABLE `temp.tableA` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
);

INSERT `temp.tableA` (first_name, last_name, is_active, no_of_purchases)
VALUES
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', False, 0),
  ('Andrew', 'White', True, 4);

-- Create the second table
CREATE TABLE `temp.tableB` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
);

INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)
VALUES
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', False, 0),
  ('Andrew', 'White', True, 6),
  ('John', 'Down', False, 0);
Comparing records of tables with the same columns
Now that we have created our two example tables, you may have noticed that there are a couple of differences between them.
SELECT * FROM `temp.tableA`;
+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 4               |
| Maria      | Brown     | false     | 0               |
+------------+-----------+-----------+-----------------+
SELECT * FROM `temp.tableB`;
+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 6               |
| Maria      | Brown     | false     | 0               |
| John       | Down      | false     | 0               |
+------------+-----------+-----------+-----------------+
Now assume that table temp.tableB is the latest version of some dataset while temp.tableA is an older one, and that we want to see the actual differences (in terms of records) between the two tables. All we need is the following query:
WITH table_a AS (
  SELECT * FROM `temp.tableA`
),
table_b AS (
  SELECT * FROM `temp.tableB`
),
rows_mismatched AS (
  -- Rows that exist in tableA but not in tableB
  SELECT
    'tableA' AS table_name,
    *
  FROM (
    SELECT * FROM table_a
    EXCEPT DISTINCT
    SELECT * FROM table_b
  )

  UNION ALL

  -- Rows that exist in tableB but not in tableA
  SELECT
    'tableB' AS table_name,
    *
  FROM (
    SELECT * FROM table_b
    EXCEPT DISTINCT
    SELECT * FROM table_a
  )
)

SELECT * FROM rows_mismatched
The result contains all the differences observed between the tables, along with the name of the table in which each record was found.
In our specific example, tables A and B differ in two records. The first is the record for Andrew White, which has a different value in the no_of_purchases field. Additionally, tableB has one extra record (John Down) that is not present in tableA at all.
+------------+------------+-----------+-----------+-----------------+
| table_name | first_name | last_name | is_active | no_of_purchases |
+------------+------------+-----------+-----------+-----------------+
| tableB     | John       | Down      | false     | 0               |
| tableB     | Andrew     | White     | true      | 6               |
| tableA     | Andrew     | White     | true      | 4               |
+------------+------------+-----------+-----------+-----------------+
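When the tables are large, a per-table count of mismatched rows can be a useful first check before reading the individual records. The query below is a minimal sketch of that idea. It reuses the same EXCEPT DISTINCT pattern, and the column alias mismatched_rows is just an illustrative name.

```sql
-- Sketch: count how many rows of each table have no exact match in the other.
WITH rows_mismatched AS (
  SELECT 'tableA' AS table_name, *
  FROM (
    SELECT * FROM `temp.tableA`
    EXCEPT DISTINCT
    SELECT * FROM `temp.tableB`
  )

  UNION ALL

  SELECT 'tableB' AS table_name, *
  FROM (
    SELECT * FROM `temp.tableB`
    EXCEPT DISTINCT
    SELECT * FROM `temp.tableA`
  )
)

SELECT
  table_name,
  COUNT(*) AS mismatched_rows  -- illustrative alias
FROM rows_mismatched
GROUP BY table_name
ORDER BY table_name;
```

With the example tables from this section, this should report one mismatched row for tableA and two for tableB. If the query returns no rows at all, the two tables contain the same set of distinct records.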
Note: The query above relies on the WITH clause and Common Table Expressions (CTEs). If you are not familiar with them, it is worth reading up on CTEs in SQL before moving on.
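One caveat of this approach is that EXCEPT DISTINCT works on distinct rows, so two tables that differ only in how many times a row is repeated will look identical. If duplicates matter in your case, one possible workaround (a sketch, not the only option) is to count the occurrences of each row first and compare those counts as well. The row_count column and the counts_a and counts_b names are introduced here purely for illustration.

```sql
-- Sketch: make the comparison sensitive to duplicate rows by comparing
-- each distinct row together with the number of times it occurs.
WITH counts_a AS (
  SELECT first_name, last_name, is_active, no_of_purchases,
         COUNT(*) AS row_count
  FROM `temp.tableA`
  GROUP BY first_name, last_name, is_active, no_of_purchases
),
counts_b AS (
  SELECT first_name, last_name, is_active, no_of_purchases,
         COUNT(*) AS row_count
  FROM `temp.tableB`
  GROUP BY first_name, last_name, is_active, no_of_purchases
)

SELECT 'tableA' AS table_name, *
FROM (
  SELECT * FROM counts_a
  EXCEPT DISTINCT
  SELECT * FROM counts_b
)

UNION ALL

SELECT 'tableB' AS table_name, *
FROM (
  SELECT * FROM counts_b
  EXCEPT DISTINCT
  SELECT * FROM counts_a
);
```

A row that appears twice in one table and once in the other now shows up in the output with a different row_count for each table, whereas the plain EXCEPT DISTINCT comparison would not report it.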
Comparing records of tables with different columns
Now let's suppose you want to compare the records of two tables whose columns differ. We need an apples-to-apples comparison, which means we have to extract only the common fields from both tables in order to compare them meaningfully.
Let's re-create our tables with some mismatching columns so that we can demonstrate how to deal with these cases:
-- Re-create the first table (OR REPLACE avoids an error if it already exists)
CREATE OR REPLACE TABLE `temp.tableA` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `dob` STRING
);

INSERT `temp.tableA` (first_name, last_name, is_active, dob)
VALUES
  ('Bob', 'Anderson', True, '12/02/1993'),
  ('Maria', 'Brown', False, '10/05/2000'),
  ('Andrew', 'White', True, '14/12/1997');

-- Re-create the second table
CREATE OR REPLACE TABLE `temp.tableB` (
  `first_name` STRING,
  `last_name` STRING,
  `is_active` BOOL,
  `no_of_purchases` INT
);

INSERT `temp.tableB` (first_name, last_name, is_active, no_of_purchases)
VALUES
  ('Bob', 'Anderson', True, 12),
  ('Maria', 'Brown', True, 0),
  ('Andrew', 'White', True, 6),
  ('John', 'Down', False, 0);
Our new tables now have only three columns in common, namely first_name, last_name and is_active.
SELECT * FROM `temp.tableA`;
+------------+-----------+-----------+------------+
| first_name | last_name | is_active | dob        |
+------------+-----------+-----------+------------+
| Bob        | Anderson  | true      | 12/02/1993 |
| Andrew     | White     | true      | 14/12/1997 |
| Maria      | Brown     | false     | 10/05/2000 |
+------------+-----------+-----------+------------+
SELECT * FROM `temp.tableB`;
+------------+-----------+-----------+-----------------+
| first_name | last_name | is_active | no_of_purchases |
+------------+-----------+-----------+-----------------+
| Bob        | Anderson  | true      | 12              |
| Andrew     | White     | true      | 6               |
| Maria      | Brown     | true      | 0               |
| John       | Down      | false     | 0               |
+------------+-----------+-----------+-----------------+
If we now try to run the query from the previous section, where the two tables had the same columns, we will end up with this error:
Column 4 in EXCEPT DISTINCT has incompatible types: STRING, INT64 at [13:7]
This is expected, given that our tables no longer have matching columns: the fourth column is dob (STRING) in tableA and no_of_purchases (INT64) in tableB. We need to slightly amend our initial query so that the first two CTEs select only the columns that both tables share. The query then looks as follows:
WITH table_a AS (
  SELECT
    first_name,
    last_name,
    is_active
  FROM `temp.tableA`
),
table_b AS (
  SELECT
    first_name,
    last_name,
    is_active
  FROM `temp.tableB`
),
rows_mismatched AS (
  -- Rows that exist in tableA but not in tableB
  SELECT
    'tableA' AS table_name,
    *
  FROM (
    SELECT * FROM table_a
    EXCEPT DISTINCT
    SELECT * FROM table_b
  )

  UNION ALL

  -- Rows that exist in tableB but not in tableA
  SELECT
    'tableB' AS table_name,
    *
  FROM (
    SELECT * FROM table_b
    EXCEPT DISTINCT
    SELECT * FROM table_a
  )
)

SELECT * FROM rows_mismatched
The tables created in this section have the following mismatches (when considering only their shared columns):
- The record for Maria Brown has a different value in column is_active
- Table tableB has one additional record (John Down) which is not present in tableA
These differences can be observed in the query results shown below:
+------------+------------+-----------+-----------+
| table_name | first_name | last_name | is_active |
+------------+------------+-----------+-----------+
| tableB     | Maria      | Brown     | true      |
| tableB     | John       | Down      | false     |
| tableA     | Maria      | Brown     | false     |
+------------+------------+-----------+-----------+
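With only a handful of columns, the shared ones are easy to list by hand. For wider tables, one option is to look them up in the dataset's INFORMATION_SCHEMA.COLUMNS view. The query below is a sketch that assumes both tables live in the temp dataset. It keeps only the columns that appear in both tables with the same data type, since a type mismatch would trigger the same kind of error as before.

```sql
-- Sketch: list the columns that tableA and tableB have in common,
-- restricted to columns whose data type is the same in both tables.
SELECT
  column_name
FROM temp.INFORMATION_SCHEMA.COLUMNS
WHERE table_name IN ('tableA', 'tableB')
GROUP BY column_name
HAVING COUNT(DISTINCT table_name) = 2
   AND COUNT(DISTINCT data_type) = 1
ORDER BY column_name;
```

For the example tables in this section, this should return first_name, is_active and last_name, which are the columns to select in the table_a and table_b CTEs.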
Final Thoughts
In this article, we walked through how to compare tables in BigQuery. We highlighted the importance of this task in ensuring the accuracy and consistency of data, and demonstrated how to compare tables with identical columns as well as tables whose columns differ. We also showed how to extract the records that differ between the tables (if any).
Overall, this article aimed to equip readers with the tools and knowledge needed to compare tables in BigQuery effectively and efficiently. I hope you found it useful!