Friday, 4 April 2014

Talend - Insert data from S3 to aws redshift DB using copy from command via Talend

Talend - Insert data from S3 to aws redshift DB using copy from command via Talend


Lets say there is a need to move data from one of your source DB to aws redshift db via talend. This could be done via using t(youdbspecific)input component and tRedshiftOutput component. But this might be slow when compared to using COPY command in aws redshift for copy from S3.
We can implement COPY from S3 file in talend as below.

Below screenshot of job contains tpostgressqlInput component - which is my source database from where i want to read data.
I write this data into a flat file on my local machine/server using tFileinputDelimited component.
once written - i need to transfer this flat file to amazon aws S3. We can use tS3Put component for this. This component would need bucket/acess key/secret key details for your AWS account.
Once file has been uploaded to S3, we can use tRedshiftRow to execute copy from command and load data to aws redshift table.






27 comments:

  1. Hi Arpit
    I'm very new to Talend. If I want to write data from S3 to a Redshift DB, would I only have to use tS3Connection and tRedshiftRow components?
    Also, would I use the same components if I want to load multiple files from S3 in the same folder to Redshift ?

    Thanks,
    Shantal

    ReplyDelete
  2. not very sure what you want. if you already have files in S3, then all you need to do to get this data loaded into redshift is use COPY From command..search for this command and you will get information on how to use it. To use this command from talend you have to use tredshiftrow component. for multiple files you can run a loop from talend and execute command via tredshiftRow.

    ReplyDelete
    Replies
    1. Hi,
      I'm new to Talend and I used COPY cmd to transfer csv file from S3 to Redshift.
      The reporting I wrote works well in SQLWorkbench but when I copied it to tRedshiftRow Talend, it works no longer and ended immeteately with an error msg : insufficient data
      What's the possible problem can this be ?
      Thanks a lot for your help.

      Delete
    2. PS :

      and this error appeared when I want to transfer a big table.
      A small one without Data data type works well.

      Delete
    3. can you try to put any other command in the tredshiftRow instead of your copy command, just to be sure that you are not missing any other settings in the tredshiftrow component itself..

      Delete
    4. I tried it. I put 2 tRedshiftRow components, the first one I put a create table command and it works well, when I come to the second one, the COPY stop immediately without doing anything and told me that : insufficient data.
      In fact I have to create the table myself cause I don't have the table in Redshift, can't it create the table by itself ?
      Perhaps it's the schema has made a probleme but if it works in SQLWorkbench, I can see no reason why it blocks in Talend...

      Delete
    5. This comment has been removed by the author.

      Delete
    6. I wandered whether it can caused by so many data, so I select only one row ( but still has many colonne) it doens't work neither

      Delete
    7. its difficult to see what is happening this way..can you post the text or command you are putting inside the tredshiftrow...also table has to exist in DB so that copy command can load it..and also the awskeys, bucket etc should be there and proper entry should be made

      Delete
    8. yes !
      But it seems that I can't add a picture ...
      Can we contact by mail or sth ? I can give you my email adresse

      Delete
    9. or you can see my uploaded pictures in TalendForum
      https://www.talendforge.org/forum/viewtopic.php?id=46609

      Delete
    10. you need to enclose the sql statement inside double quotes. "copy from......."
      or better you can create a global variable and store your command in it and then use that global variable in tRedshiftRow using (String)globalMap.get("variablename");

      Delete
    11. In Talend it is enclosed by ""
      I found the problem may caused by newline characters (\n).
      I checked when I use "\n" as newline characters, in my csv file I can not see the newline symbol ("...CATG19|CATG2080189D.01|...")so when I do COPY, Redshift take "CATG2080189D.01" as the component of the same column.
      So I tried "\\n", I can see "...CATG19|CATG20\n80189D.01|..." in my csv file but caused "CATG20n80189D.01 " error in Redshift which results in "String length exceeds DDL length"
      I do added ESCAPE option into COPY cmd as recommanded here :
      http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_command_examples.html#r_COPY_preparing_data
      but well, I don't know in which component should I write the sed command

      Delete
    12. I solved it finally ! with the ESCAPE
      Thank you for your help !

      Delete
  3. This comment has been removed by the author.

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Is it possible to load data from s3 to amazon RDS Mysql using the same script?
    btw: could you please write full copy script (of course without detailed credenctials)?

    BR

    ReplyDelete
  6. for moving data from S3 to mysql you can use below options
    1) using talend aws components awsget you can get the file from S3 to your talend server or your machine where talend job is running and then you can read this .csv/json/other file and insert into mysql using talend rds mysql components.
    2) copy command works for redshift..below is syntax for same..
    copy target_table_name(col1, col2) from 's3://aws_S3_bucket_name/S3_bucket_inbound_folder_name/filename_with_extn' CREDENTIALS 'aws_access_key_id=youraws_S3_access_key;aws_secret_access_key=youraws_S3_access_key' gzip delimiter ';' timeformat 'DD-MM-YYYY HH24:MI:SS' dateformat 'DD-MM-YYYY'

    ReplyDelete
  7. Hi Arpit,

    I'm new to Talend and would like to know how the tS3put component works. I am familiar with redshift copy command and when we use an unload command to upload a file to S3 , we can specify it as a manifest. My question is how tS3put command upload the file into S3 , is it a manifest file or a single .csv file ?

    Thanks,
    Parvathy

    ReplyDelete
  8. its a single file it puts to S3. it takes file from the system where talend is running and pushes it to S3

    ReplyDelete