Saturday, 18 July 2015

Talend : running multiple SQL statements using a database row component (example: AWS Redshift tRedshiftRow)


Problem - You might need to run more than one statement in tRedshiftRow (or any other database row component in Talend).

Solution - You can do this by writing your statements in a single string, separated by the statement separator of the DB you are using. For Redshift it is ; and the same is true for most databases.

I mainly issue VACUUM and ANALYZE commands on my DB tables as the last job of my load.
Example below - just put the SQL below in the tRedshiftRow component and it will execute it as a multi-statement SQL.

"vacuum table1;
vacuum table2;
vacuum table3;
"

Insert/update strategy when using Talend with the AWS Redshift database



In this post I want to highlight an insert and update strategy which I found works well with the AWS Redshift database. There are other strategies which would also work well, but I will share my findings from using Talend with Redshift.

As of TOS v5.6.1 there is no bulk output component for Redshift, so if you use the regular output component for Redshift it is slow for inserts and updates. Redshift works best with its data-load utility, COPY from a flat file in S3, or when you insert data using custom SQL such as "insert into targettable select * from sourcetable";

Similarly, for updates it works best if you update from a table, as below (note that in Redshift the target column in the SET clause is not prefixed with the table name):
update targettable
set column = sourcetable.column
from sourcetable
where targettable.columnkey = sourcetable.columnkey;

So in my work with Talend and Redshift I mostly end up using the tRedshiftRow component, where you can write your custom SQL for insert, update, or the COPY command, and this is processed by Redshift very quickly. There is a magnitude of difference in performance when you use custom SQL for insert, update, or the COPY command.

So when working with a Redshift DB via Talend, try to design your jobs so that jobs which process bulk data generate a flat file from the source, push it to S3, and use tRedshiftRow to run a COPY command to load it. If you want to transform data across tables within your database, then write your own insert and update statements for better performance.
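For reference, a COPY statement placed in tRedshiftRow might look like the sketch below. The bucket, file path, and credentials are placeholders, not real values; in the component the whole statement goes inside a double-quoted Talend string, as with the earlier examples.

```sql
-- Placeholder bucket, path, and credentials; adjust delimiter to your file format
copy targettable
from 's3://your-bucket/path/datafile.csv'
credentials 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>'
delimiter ',';
```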

Tuesday, 18 November 2014

Using tJavaRow - input_row and output_row (for row by row transformation)



In this post I would like to explain a very simple and basic usage of tJavaRow.
Let's say you need to do some transformation on every row of your input data. You can do this by using tMap, tJavaFlex, tJavaRow, and others.
For now I will focus on tJavaRow. It is good practice to use input_row and output_row as the objects from which you fetch the values of your columns.

using tJavaRow

The tFixedFlowInput returns two rows; we pass each row through tJavaRow, append the text "-XYZ" to name, and finally pass it as output to tLogRow.
Code of tJavaRow - notice the usage of input_row and output_row here:

output_row.id = input_row.id;
output_row.name = input_row.name+"-XYZ";

Tuesday, 2 September 2014

Implement insert (for new) and update (for existing) strategy in Talend (tMap update/insert/reject)



Most of the time in DWH/ETL/data-integration loads we have incoming source data feeds, and we have to merge (update/insert) them with the existing data in the target table.
There are multiple ways to do this; I will explain one of the methods I have used.

You have a source table - say srcemployee.
You want to merge this data, after doing a lookup, with the target - tgtemployee.
The key for the lookup is, say, empid.

All we need is the source input, the target input (lookup), a tMap, and the outputs.
In tMap we join the source input with the target input on the key column and set the join options to "All matches" and an inner join. The tMap output will have two groups: one for handling matching records (these are the updates) and another for handling rejects - records which did not satisfy the join condition (these are the inserts).
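The split that tMap performs can be sketched in plain Java. The in-memory key sets below are hypothetical stand-ins for the srcemployee and tgtemployee tables: source rows whose empid matches the lookup go to the update flow, and inner-join rejects go to the insert flow.

```java
import java.util.*;

public class MergeSplitDemo {
    // Split source keys into "update" (key exists in target) and
    // "insert" (inner-join reject: key not found in target).
    static Map<String, List<Integer>> split(Set<Integer> targetKeys,
                                            List<Integer> sourceKeys) {
        Map<String, List<Integer>> flows = new HashMap<>();
        flows.put("update", new ArrayList<>());
        flows.put("insert", new ArrayList<>());
        for (Integer empid : sourceKeys) {
            if (targetKeys.contains(empid)) {
                flows.get("update").add(empid); // matched the lookup join
            } else {
                flows.get("insert").add(empid); // join reject -> new record
            }
        }
        return flows;
    }

    public static void main(String[] args) {
        // Hypothetical data: target already holds empid 1 and 2.
        Set<Integer> tgtemployee = new HashSet<>(Arrays.asList(1, 2));
        List<Integer> srcemployee = Arrays.asList(1, 2, 3);
        // empid 1 and 2 land in the update flow, empid 3 in the insert flow.
        System.out.println(MergeSplitDemo.split(tgtemployee, srcemployee));
    }
}
```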

Overall job (tMysqlOutput_1 is for insert, tMysqlOutput_2 is for update)



schema of input srcemployee



tMap join condition

Mention the key for update in the schema of tMysqlOutput_2