Saturday 19 January 2013

Talend tHash (in-memory) data storage vs flat file data storage


Problem - One of the challenges you might face is deciding when to use tHash and when to use flat files in your job as a medium to store intermediate data that will later be used for lookups, processing, or other steps.

Solution - There is no exact solution for this, but I would like to share some observations:
  1. Firstly, there is no hard rule about when to use a flat file or tHash; it mainly depends on the size of the data your job is processing and how it processes it - do you intend to process the entire data set in one shot, or can you break it into chunks?
  2. To expand on the point above: if you have a table whose total size with data is 5 GB, and your Talend job runs with JVM parameters of 1024 MB to 4096 MB, then no matter what you do, loading that entire data set into tHash is going to throw an out-of-memory exception.
  3. Similarly, if you plan to run multiple jobs in parallel on your Talend server and each job stores a considerable amount of data in tHash (in memory), you face the same out-of-memory risk: even though you allowed the JVM to reach a maximum of 4096 MB, other running Java processes may prevent this job from actually getting the full 4096 MB.
  4. You can avoid the challenges in #2 and #3 by using a flat file - but if you then read this generated flat file in your Talend job and join it via a tMap lookup, there is a chance tMap will load the entire lookup data into memory (depending on the batch size of your t*Output component, assuming tMap is connected to a t*Output component that inserts the data into a DB) - and with a high batch size you might end up with an out-of-memory exception again.
  5. So what is the solution? There is no one-word answer:
    • First, work out how much data you actually want to store in tHash.
    • Second, consider the design of your job - can it be structured so that it processes data in chunks instead of doing one big lookup?
    • tMap has an option to use a temp directory for joining - I have never used it, so I will not be able to comment on that.
    • Flat files are a safe bet, since you do not have to worry about memory consumption hitting its maximum when multiple jobs are running in parallel, and Talend provides convenient flat file handlers - touch file, delete file, deriving the file name from a context variable, etc.
    • tHash is good for small lookups - though one challenge is that so far I have not found a way to flush/clean up data once it is loaded into tHash, so whatever is loaded into tHash memory stays there for the entire job execution.
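The "process data in chunks" idea above can be sketched in plain Java (Talend jobs compile down to Java under the hood). This is a hypothetical illustration, not a Talend component - the `ChunkedFileProcessor` class and chunk size are my own names - showing how a flat file can be consumed a fixed number of rows at a time, so the whole file never has to sit in memory the way a tHash lookup would:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class ChunkedFileProcessor {

    // Reads the flat file and hands `handler` one chunk of up to
    // `chunkSize` lines at a time; returns the number of chunks processed.
    // Only one chunk is ever held in memory.
    static int processInChunks(Path file, int chunkSize,
                               Consumer<List<String>> handler) throws IOException {
        int chunks = 0;
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    handler.accept(chunk);
                    chunk = new ArrayList<>(chunkSize);
                    chunks++;
                }
            }
            if (!chunk.isEmpty()) {      // flush the final partial chunk
                handler.accept(chunk);
                chunks++;
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Demo: 10 rows processed 4 at a time -> chunks of 4, 4 and 2.
        Path tmp = Files.createTempFile("demo", ".txt");
        List<String> rows = new ArrayList<>();
        for (int i = 1; i <= 10; i++) rows.add("row-" + i);
        Files.write(tmp, rows);
        int chunks = processInChunks(tmp, 4,
                c -> System.out.println("chunk of " + c.size()));
        System.out.println("total chunks: " + chunks);
        Files.deleteIfExists(tmp);
    }
}
```

In a real Talend job the equivalent would be a loop around a tFileInputDelimited with a row limit, but the principle is the same: the memory footprint is bounded by the chunk size, not by the file size.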
To summarize - choosing between tHash and a flat file depends on how you want to design your job, the size of your data, and the server memory available.
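As a rough aid to the "find the amount of data" step, you can compare the data size against the JVM's maximum heap before committing to tHash. The sizing factors below (2x object overhead, half the heap reserved for the rest of the job) are assumptions I am using for illustration, not measured Talend behavior - tune them for your own jobs:

```java
public class HeapBudgetCheck {

    // Rough go/no-go check: does `rawDataBytes` of lookup data plausibly
    // fit in a heap of `maxHeapBytes`? Assumes Java object overhead
    // roughly doubles the raw size, and reserves half the heap for the
    // rest of the job (buffers, other lookups, the framework itself).
    static boolean fitsInMemory(long rawDataBytes, long maxHeapBytes) {
        long estimatedFootprint = rawDataBytes * 2;  // object-overhead assumption
        long budget = maxHeapBytes / 2;              // keep headroom for the job
        return estimatedFootprint <= budget;
    }

    public static void main(String[] args) {
        long fourGbHeap = 4096L * 1024 * 1024;       // e.g. running with -Xmx4096m
        long fiveGbTable = 5L * 1024 * 1024 * 1024;  // the 5 GB table from above
        long fiftyMbLookup = 50L * 1024 * 1024;      // a small reference lookup

        System.out.println("5 GB table fits in tHash: "
                + fitsInMemory(fiveGbTable, fourGbHeap));    // false -> use a flat file
        System.out.println("50 MB lookup fits in tHash: "
                + fitsInMemory(fiftyMbLookup, fourGbHeap));  // true -> tHash is fine

        // The actual ceiling for the current JVM:
        System.out.println("this JVM max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

Remember from point #3 that when several jobs run in parallel, the effective budget per job is smaller than the configured -Xmx, so err on the side of flat files for anything borderline.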

