Talend integration with Hive on Hadoop – Part#2 (Read data from Hive)
In my previous post, Talend integration with Hive on Hadoop – Part#1,
we created the external table customers_ext in the Hive database and loaded data into it.
In this example we will read data from that table; once the data is in the Talend server's memory, we can transform or move it as needed using other Talend components.
Pre-requisites –
1) Talend integration with Hive on Hadoop – Part#1
The job below uses the tHiveInput component to run the following SQL against the Hive database:
"select country,count(1) from arpitdb.customers_ext group by country"
The output flow from tHiveInput is then printed with tLogRow.
See the screenshots below for more details.
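Under the hood a Talend job compiles to Java, and tHiveInput essentially opens a JDBC connection to Hive and executes the configured query. For reference, here is a minimal sketch of roughly the same read done in plain Java against HiveServer2; the class name, host, port, user and password below are assumptions, so substitute your own cluster's values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveReadSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc and its dependencies must be on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connection details are placeholders -- replace with your own host/port/database
        String url = "jdbc:hive2://hadoop-host:10000/arpitdb";

        try (Connection con = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "select country, count(1) as countofrows "
                   + "from arpitdb.customers_ext group by country")) {
            // Print each aggregated row -- roughly what tLogRow does with the incoming flow
            while (rs.next()) {
                System.out.println(rs.getString(1) + " | " + rs.getLong(2));
            }
        }
    }
}

In the Talend job itself none of this code is written by hand; tHiveInput generates the equivalent logic from the connection settings and the query entered in its Basic settings.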
The output from executing the job is shown below. Hive first runs the SQL, internally launching a MapReduce job, and finally returns the results to Talend. (The extra "country | 1" row at the bottom is most likely the header line of the source file, which the external table reads as ordinary data.)
Starting job job_for_blog at 18:06 30/12/2013.
[statistics] connecting to socket on port 3803
[statistics] connected
.------------+-----------.
|       tLogRow_1        |
|=-----------+----------=|
|country     |countofrows|
|=-----------+----------=|
|Australia   |1034       |
|Canada      |1004       |
|Chile       |1047       |
|China       |1002       |
|France      |971        |
|Germany     |1004       |
|Japan       |989        |
|Russia      |1012       |
|South Africa|935        |
|UK          |1002       |
|US          |10000      |
|country     |1          |
'------------+-----------'
[statistics] disconnected
Job job_for_blog ended at 18:07 30/12/2013. [exit code=0]