Tuesday, November 26, 2019

ETL in the cloud using SQL Query and Db2

From COS to Db2 using SQL
The SQL Query service on IBM Cloud allows to process data stored on Cloud Object Storage (COS) by writing SQL queries. So far, results were either shown in the console and / or written back to files on COS. Thus, I was happy to notice a new feature: Query results can now be written back to Db2 tables (on cloud). From my tests both Db2 on Cloud and Db2 Warehouse on Cloud are supported.

Setup

To perform some tests of the new feature, I need three services in my IBM Cloud account: Cloud Object Storage (COS) to store the input data, SQL Query to perform the data processing, and Db2 on Cloud or Db2 Warehouse on Cloud for accepting the result data.

From Cloud Object Storage into Db2

Select statements in SQL Query have the usual structure with the exception that the tables in the FROM clause are URIs for files stored in COS. For my tests, I used a file with precipitation data for countries worldwide with columns made up of the country name and the rain totals by individual years. With that data on COS, I tried to get all the country names:

select row_Number() over (order by CountryorArea) as id,
       CountryorArea as name
from (cos://eu-de/henrik-input/precipitation2.csv)
into crn:v1:bluemix:public:dashdb-for-transactions:us-south:s/a228853-....-034bc:cf-service-instance:/COUNTRIES


The above query reads from the file on COS identified by a short URI. It extracts the country names from CountryorArea and presents them in the column name. And it adds a column id with computed values, enumerating all result rows using the row_number() function. The result is stored in the table COUNTRIES in Db2 identified by an IBM Cloud-internal resource name. Once I figured out all the parts, it worked as described.

Caveats

It is documented for the dbResult clause, the target table is created before the data is stored in Db2. If the table exists, it is droppped first and recreated with the columns and data types as determined by SQL Query. String columns are stored as CLOBs by design. Thus, I had to use CAST in my Db2 environment to turn the values back into VARCHARs.

Overall, the new feature provides an easy way to collect all kinds of data on Cloud Object Storage, perform the necessary transformation and aggregation and bring it into a database environment (Db2). My next tests will be to analyze log files.


If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.