Talend Open Studio Cookbook

Filtering input rows

Often, rows can be filtered out of a flow because they do not fulfill the required criteria for processing. This example shows how to do this within the tMap component, so that unwanted rows are discarded before any costly join logic runs.

Tip

Note that you should not concern yourself too much with the complexity of tMap in this recipe; rather you should concentrate on the filters. Joining is covered in later recipes in this chapter.

Getting ready

Open the job jo_cook_ch04_0050_tMapInputFilter.

How to do it...

  1. Run the job. You will see that there are many records read from orderItemFile and all are being output.
  2. Kill the job and view the output. You will see many order items being displayed, all of which are duplicates. These are the ones we will need to remove.
  3. Open tMap and click the Activate/unactivate expression button for the customer input table.
  4. Add the filter expression customer.customerId == 2 || customer.customerId == 3 into the input expression filter.
  5. Run the job and you will see that only two records have been output.

How it works…

The filter expression is evaluated for each row of the customer input, and only rows for which it is true pass into tMap. This reduces the number of customers to two: the customer with an ID of 2 and the customer with an ID of 3.
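
Because the filter is an ordinary Java boolean expression, its effect can be reproduced outside Talend. The following is a minimal sketch rather than the code Talend generates: the Customer class, its name field, and the sample rows are hypothetical stand-ins, since the recipe does not show the real schema.

```java
import java.util.List;
import java.util.stream.Collectors;

public class InputFilterSketch {

    // Hypothetical stand-in for one row of the customer input table.
    static final class Customer {
        final int customerId;
        final String name;

        Customer(int customerId, String name) {
            this.customerId = customerId;
            this.name = name;
        }
    }

    // The same boolean expression that the recipe enters in the input filter.
    static boolean passesFilter(Customer customer) {
        return customer.customerId == 2 || customer.customerId == 3;
    }

    // Rows that fail the expression are dropped and are not kept anywhere.
    static List<Customer> filter(List<Customer> rows) {
        return rows.stream()
                   .filter(InputFilterSketch::passesFilter)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<Customer> customers = List.of(
            new Customer(1, "Customer 1"),
            new Customer(2, "Customer 2"),
            new Customer(3, "Customer 3"),
            new Customer(4, "Customer 4"));

        for (Customer kept : filter(customers)) {
            System.out.println(kept.customerId + " " + kept.name);
        }
    }
}
```

Only the rows with IDs 2 and 3 survive the call to filter; the others are discarded without trace, which mirrors the behavior of the input filter.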

There's more…

Talend also provides a separate component for filtering (tFilterRow). Which method you use to filter data before it is processed in tMap is generally a matter of personal style or development standards.

Note that when input filtering is used, the rows are simply discarded. Whether the rows should be discarded is a design decision, and the developer should be certain that it is acceptable to lose them.

Tip

If the requirement states that rejects must be recorded, then do not use an input filter in tMap. Instead, use tFilterRow prior to tMap so that the rejected rows can be captured. If tFilterRow cannot be used on the input, then the rows will have to be processed by tMap and filtered at the output.
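
The difference between discarding and capturing rejects can be sketched in plain Java. This is an illustration of the idea only, not Talend's implementation: the Split class and split method are hypothetical names, and the rows are reduced to bare customer IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RejectCaptureSketch {

    // Hypothetical holder for the two flows: rows that pass and rows that fail.
    static final class Split<T> {
        final List<T> accepted = new ArrayList<>();
        final List<T> rejected = new ArrayList<>();
    }

    // Routes every row to exactly one of the two lists, so nothing is lost.
    static <T> Split<T> split(List<T> rows, Predicate<T> condition) {
        Split<T> result = new Split<>();
        for (T row : rows) {
            if (condition.test(row)) {
                result.accepted.add(row);
            } else {
                result.rejected.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up customer IDs, for illustration only.
        List<Integer> customerIds = List.of(1, 2, 3, 4);
        Split<Integer> result = split(customerIds, id -> id == 2 || id == 3);

        System.out.println("accepted: " + result.accepted);
        System.out.println("rejected: " + result.rejected);
    }
}
```

Here the rejected list plays the role of a reject flow that can be written to a file or table, whereas an input filter in tMap keeps only the accepted rows.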

When using database inputs, it is usually better and more efficient to filter within the SQL query, rather than within the Talend job.
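
As a rough sketch of pushing the same filter into the query, the following builds a SELECT statement with the condition in its WHERE clause. The table name customer and the columns customerId and name are assumptions made for illustration, and buildQuery is a hypothetical helper, not part of Talend.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SqlFilterSketch {

    // Builds a query that lets the database do the filtering.
    // Table and column names are hypothetical; adjust them to the real schema.
    static String buildQuery(List<Integer> customerIds) {
        if (customerIds.isEmpty()) {
            // An empty IN () list is not valid SQL, so refuse to build it.
            throw new IllegalArgumentException("at least one customer ID is required");
        }
        String idList = customerIds.stream()
                                   .map(String::valueOf)
                                   .collect(Collectors.joining(", "));
        return "SELECT customerId, name FROM customer WHERE customerId IN (" + idList + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(List.of(2, 3)));
    }
}
```

With the condition in the query, only the matching rows leave the database, so the job never has to read and discard the rest.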