Saturday, 18 July 2015

Talend : running multiple SQL statements using a database row component (example: AWS Redshift tRedshiftRow)


Problem - You might need to run more than one statement in tRedshiftRow (or any other database row component in Talend).

Solution - You can do this by writing your statements in a single string, separated by the statement separator of the DB you are using. For Redshift it is ; and the same is true for most databases.

I mainly issue VACUUM and ANALYZE commands on my DB tables as the last job of my load.
Example below - just put the SQL below in the tRedshiftRow component and it will execute it as a multi-statement SQL.

"vacuum table1;
vacuum table2;
vacuum table3;
"

Insert/update strategy when using Talend with the AWS Redshift database



In this post I want to highlight an insert and update strategy which I found works well with the AWS Redshift database. There are other strategies which would also work well, but I will share my findings from using Talend with Redshift.

As of TOS v5.6.1 there is no bulk output component for Redshift, so if you use the regular output component for Redshift it is slow for inserts and updates. Redshift works best with its data-load utility, COPY from a flat file in S3, or when you insert data using custom SQL such as "insert into targettable select * from sourcetable";

Similarly, for updates it works best if you update from a table, as below (note that in Redshift the target column in the SET clause is not prefixed with the table name):
update targettable
set column = sourcetable.column
from sourcetable
where targettable.columnkey = sourcetable.columnkey;

So in my work with Talend and Redshift I mostly end up using the tRedshiftRow component, where you can write your custom SQL for insert, update, or the COPY command, and this is processed by Redshift very quickly. There is a magnitude of difference in performance when you use custom SQL for insert, update, or the COPY command.

So when working with a Redshift DB via Talend, try to design your jobs so that jobs which process bulk data generate a flat file from the source, push it to S3, and use tRedshiftRow to run a COPY command to load it. If you want to transform data across tables within your database, then write your own insert and update statements for better performance.
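For reference, a COPY statement placed in tRedshiftRow might look like the sketch below. The bucket, file path, and credentials are placeholders, not real values; in the component the whole statement goes inside a double-quoted Talend string, as with the earlier examples.

```sql
-- Placeholder bucket, path, and credentials; adjust delimiter to your file format
copy targettable
from 's3://your-bucket/path/datafile.csv'
credentials 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>'
delimiter ',';
```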

Tuesday, 18 November 2014

Using tJavaRow - input_row and output_row (for row by row transformation)



In this post I would like to explain a very simple and basic usage of tJavaRow.
Let's say you need to do some transformation on every row of your input data. You can do this by using tMap, tJavaFlex, tJavaRow, and others.
For now I will focus on tJavaRow. It is good practice to use input_row and output_row as the objects from which you fetch the values of your columns.

using tJavaRow

The tFixedFlowInput returns two rows; we pass each row through tJavaRow, append the text "-XYZ" to name, and finally pass it as output to tLogRow.
Code of tJavaRow - notice the usage of input_row and output_row here:

output_row.id = input_row.id;
output_row.name = input_row.name+"-XYZ";

Tuesday, 2 September 2014

Implement insert (for new) and update (for existing) strategy in Talend (tMap update/insert/reject)



Most of the time in DWH/ETL/data-integration loads we have incoming source data feeds, and we have to merge (update/insert) them with the existing data in the target table.
There are multiple ways to do this; I will explain one of the methods I have used.

You have a source table - say srcemployee.
You want to merge this data, after doing a lookup, with the target - tgtemployee.
The key for the lookup is, say, empid.

All we need is the source input, the target input (lookup), a tMap, and the outputs.
In tMap we join the source input with the target input on the key column and set the join options to "All matches" and an inner join. The tMap output will have two groups: one for handling matching records (these are the updates) and another for handling rejects - records which did not satisfy the join condition (these are the inserts).
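The split that tMap performs can be sketched in plain Java. The in-memory key sets below are hypothetical stand-ins for the srcemployee and tgtemployee tables: source rows whose empid matches the lookup go to the update flow, and inner-join rejects go to the insert flow.

```java
import java.util.*;

public class MergeSplitDemo {
    // Split source keys into "update" (key exists in target) and
    // "insert" (inner-join reject: key not found in target).
    static Map<String, List<Integer>> split(Set<Integer> targetKeys,
                                            List<Integer> sourceKeys) {
        Map<String, List<Integer>> flows = new HashMap<>();
        flows.put("update", new ArrayList<>());
        flows.put("insert", new ArrayList<>());
        for (Integer empid : sourceKeys) {
            if (targetKeys.contains(empid)) {
                flows.get("update").add(empid); // matched the lookup join
            } else {
                flows.get("insert").add(empid); // join reject -> new record
            }
        }
        return flows;
    }

    public static void main(String[] args) {
        // Hypothetical data: target already holds empid 1 and 2.
        Set<Integer> tgtemployee = new HashSet<>(Arrays.asList(1, 2));
        List<Integer> srcemployee = Arrays.asList(1, 2, 3);
        // empid 1 and 2 land in the update flow, empid 3 in the insert flow.
        System.out.println(MergeSplitDemo.split(tgtemployee, srcemployee));
    }
}
```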

Overall job (tMysqlOutput_1 is for insert, tMysqlOutput_2 is for update)



schema of input srcemployee



tMap join condition

Mention the key for update in the schema of tMysqlOutput_2