Thursday, 11 October 2012

Talend-tmap left outer join and only get rows from left (main) table which does not exists in right table or how to do something like select * from where not exists(select...)



Problem - How to do left outer join in tmap and only get rows from left (main) table which does not exists in right table or how to do something like select * from where not exists(select...)

Solution - There are many - but one which we have used is given below
Say you have tables as below

Current_Snapshot -
name_skey
name_desc

Previous_Snapshot -
name_skey
name_desc

What we need is all rows from current_snapshot which does not exists in previous_snapshot
- Take main flow from current_snapshot - as main row input to Tmap
- in the input component of previous_snapshot select a dummy_flag with some constant value
example - select name_skey, name_desc, "XX" dummy_flag from previous_snapshot.
Now take flow from previous_snapshot  - as row lookup flow to above tmap.
In tmap - join this look up to main on name_skey and name_desc - use left outer join and all rows.
Now in output table in tmap - use expression and there put lookuprow.dummy_flag==null
What happens is that tmap will join all rows from current_snapshot to previous_snapshot and for those rows which exists in current_snapshot and not in previous_snapshot it will have dummy_flat==null and our expression filter in tmap helps us in identifying these row.





Wednesday, 3 October 2012

Talend - tjdbcoutput/other database output components - dynamically controlling enable parallel execution parameter at runtime

Issue - how to control enable parallel execution parameter under advance settings for tjdbcoutput component

Solution - use context variable for example "no_of_parallel_thread" in your context file and then use this context.no_of_parallel_thread variable in your enable Parallel Execution.


Talend - tjdbcoutput (other database output components) performance improvement - use of Batchsize / enable parallel execution parameter under Advance settings


Issue - While working with tjdbcoutput parameter - one of our source table was big and when we were inserting data into the target we found slow performance with tjdbcoutput component. Later we took help of batchsize and enable parallel execution features under advance settings to get better performance



Solution – Use of batchsize/commit frequence/enable parallel execution in Advance settings helps in getting good performance - usage of batchsize parameter will be more clear if you look at code generated by talend. What talend does is as below in the code

Read data from database in a rs_input (record set) by executing input component SQL.
Next it loops through each record till it reaches the batchsize counter and in each iteration of loop it prepares the INSERT /delete/update statement as asked by output component in our case tjdbcoutput. Next when this batchsize counter is reached talend executes this batch of sql statements against DB.

So if you increase the batchsize – you will get more throughput or rather more rows will flow to your tjdbdoutput component in one batch and you will get better performance at DB level as to and fro between talend and DB will reduce as you will have more rows going in one batch and less number of batch trips between talend and DB.
Another point where you can receive good performance gains is by adjusting your commit frequency (number of rows to commit) – try to play with different options – but we found that the closest or equal it is to batchsize it helps.

Enable parallel execution – is very useful feature if you want to do tjdbcoutput operation in parallel threads. It again depends on your database settings – but its worth trying to set it 2 or 4 and see whether you gain benefit as what it does is that – if it is set to 4 it will fire 4 batches of sql statements (insert/delete/update/ as the case be) and if you have not used EXISTING connections then you can also see 4 new connections in this case.
There is no hard and fast setting for these - you have arrive at what works best for your case.