Best Practices Software engineering: Talend ETL Best Practices

Hello World Talend ETL Data Integration

objectives:

Talend Open Studio is used for Data Integration , extraction and

processing data among different sources.

1. Get it:

http://www.talend.com/download/data-integration

Direct Link :

>For Linux, Mac and Windows.

http://talend.dreamhosters.com/tos/release/V5.2.0/TOS_DI-r92826-V5.2.0.zip

2.Installation and Getting Started :

(1) : Panel for Project Structure

(2) : Panel for Jobs Structure and Code

(3) :Tabs containing the properties of components, execution console, problems ...

(4) :Panel of available components.

Best Practices : Manipulation of Documents :

1. Preparation of data sources :

we will manipulate multiple data sources (CSV, text files and database) to extract the data, transform them and save them in other formats. The first step to do is to define the data sources in order to generate the Repository patterns and their use in the following activities.

Note : we will add the following files (client.csv & state.txt) .
client.csv
state.txt

In panel (1) representing the Repository, expand the section (Métadonnées)

STEP1:

To define sources as fields separated by delimiters (such as text or csv files), select: Create a delimited file.
Enter the file name in the window that appears: client (in our case, we will add the file client.csv)

STEP2:

Then select the file you want to add. Navigate to the file for that client.csv that was provided. The file viewer allows you to have an idea about the contents of this file.
Click Next.

STEP3:

In the next window, click on the checkbox Define the header lines as column name. Then click Refresh preview. The preview of the extracted file will be updated, so that the first line of the file represents the field names.
Click Next.

STEP4:
Modify the schema name delimited file (client), and observe the composition of the fields in the Description panel of the diagram. You can then modify the data schema to your liking.

STEP5:

In this case, do not forget to check the key to the id field.
You can also modify the lengths of fields (default values were calculated according to the Talend data already in the file).
Click on finish.
And you added a source file whose schema can be used throughout the application.

STEP6:Follow the previous steps to Generate the schema of file (state.txt)
state.txt

STEP7:In the DBMS(ن.ت.ق.ب/SGBD) of your choice(MySQL, Postgress, oracle, ..), create a database client_bd containing a table called client_bd. The structure of this table does not matter, it will be overwritten later.

I will use MySQL as DBMS

455.png

STEP8: Add the database as a source in the metadata, and add the client_bd table to table schemas.

2.Sorting documents(Tri):

we propose to sort the contents of client.csv automatically, using Talend components. To do this, follow these steps:

STEP1: Create New Job (job_sort as name).

STEP2: Drag the delimited files client 0.1 that you created earlier in the panel (2) . a

STEP3: In panel (4), representing the palette, select the component in the category tSortRow Transformation. This component allows, as its name suggests, to sort a set of data in a particular column.

Drag this component in the main window

STEP4: To represent the output file, drag the tFileOutputDelimited in the main window. It is located under the File>Writing(Fichier>écriture) category.

STEP5: Connect the three elements to represent the execution chain. To do this, right-click the client component, keep it pressed, and drag the component sort.
Do the same sort between the component and the output file.

STEP6: We will now configure three components. We first define the client's name to sort by alphabetical order of the source file.
Click on the component sort. Tab Component panel (3), click (+). Change the value of inserted fields to sort by client name, ascending alphabetical order.

STEP7: Then click on the component output.
Choose where you want to save the output file.
Check the box: Include the header for the header columns will appear in the output file.
Front of the box Schema, change the schema type to Repository(Référentiel ) , then click [...] next to Edit Schema. This will define the field structure of the output file.
- In the window that appears, choose the schema customer delimited file that you created.

STEP8: Execute the process, click on the Run button panel (3), and then click Run. Or hit F6. At the end of the execution, the following trace is displayed on the main window:

==> So, out.csv has the same data than client.csv, but , out.csv is sorted by nameClient

.3.Data integration with Talend:

STEP1: Duplicate job job_sort and name it sort_file_in_db .

STEP2: Copy the data generated in the output file defined in the database client_bd you created in the previous activity (instead of a CSV file). Upon creation, the target table will be overwritten and replaced by the table containing the sorted data.

Configuration of output :

Execute & Check your Database (for Example by phpMyAdmin)

Best Practices Software engineering

WELCOME Abdennour : Software engineer

Nov 24, 2012

Talend ETL Best Practices

No comments:

Post a Comment

Followers

Welcome World