
Filtering input rows
Rows can often be filtered out of a flow because they do not meet the criteria required for processing. This example shows how to do this within the tMap component, so as to avoid costly join logic.
Getting ready
Open the job jo_cook_ch04_0050_tMapInputFilter.
How to do it...
- Run the job. You will see that there are many records read from orderItemFile and all are being output.
- Kill the job and view the output. You will see many order items being displayed, all of which are duplicates. These are the ones we will need to remove.
- Open tMap and click the Activate/unactivate expression button for the customer input table.
- Add the filter expression customer.customerId == 2 || customer.customerId == 3 into the input expression filter, as shown in the following screenshot.
- Run the job and you will see that only two records have been output.
How it works…
Adding the filter reduced the customer input to two rows: the customer with an ID of 2 and the customer with an ID of 3.
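The effect of the input filter can be sketched in plain Java. This is only an illustration under stated assumptions, not the code that Talend generates: the Customer class and the InputFilterSketch name are hypothetical, and customerId is assumed to be a primitive int. If the column were a nullable Integer, comparing it with == 2 would unbox the value and fail on a null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of what an input filter does; not Talend's generated code.
class InputFilterSketch {

    // Illustrative stand-in for the customer row; the real schema is defined in the job.
    static final class Customer {
        final int customerId;

        Customer(int customerId) {
            this.customerId = customerId;
        }
    }

    // The same boolean expression that was entered in the tMap input filter.
    static boolean passesFilter(Customer customer) {
        return customer.customerId == 2 || customer.customerId == 3;
    }

    // Rows that fail the expression are dropped and never reach the rest of the flow.
    static List<Customer> filter(List<Customer> input) {
        List<Customer> kept = new ArrayList<>();
        for (Customer customer : input) {
            if (passesFilter(customer)) {
                kept.add(customer);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Customer> customers = new ArrayList<>();
        for (int id = 1; id <= 5; id++) {
            customers.add(new Customer(id));
        }
        for (Customer customer : filter(customers)) {
            System.out.println("kept customerId " + customer.customerId);
        }
    }
}
```

Because the expression is evaluated once per input row, any row that fails it is gone before the join or the output mappings are considered.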
There's more…
Talend does provide a separate component for filtering (tFilterRow), and it is generally a matter of personal style or development standards as to which method you use for filtering data prior to processing in tMap.
Note that when input filtering is used, the rows are simply discarded. Whether the rows should be discarded is a design decision, and the developer should be certain that it is acceptable to discard them.
Tip
If the requirement states that rejects must be recorded, then do not use an input filter in tMap. Instead, use tFilterRow prior to tMap to enable the rejected rows to be captured or, if tFilterRow cannot be used on the input, then the rows will have to be processed and then filtered at the output.
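The difference between discarding rows and capturing rejects can be sketched as follows. This is a conceptual illustration only, assuming the same customerId condition as in the recipe. The RejectCaptureSketch and Result names are hypothetical and do not correspond to any Talend API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: route every row to either an accepted or a rejected list,
// so that rows failing the condition are recorded rather than silently dropped.
class RejectCaptureSketch {

    static final class Result {
        final List<Integer> accepted = new ArrayList<>();
        final List<Integer> rejected = new ArrayList<>();
    }

    static Result split(List<Integer> customerIds) {
        Result result = new Result();
        for (int id : customerIds) {
            if (id == 2 || id == 3) {
                result.accepted.add(id);   // continues on to further processing
            } else {
                result.rejected.add(id);   // kept so the reject can be recorded
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result result = split(Arrays.asList(1, 2, 3, 4));
        System.out.println("accepted: " + result.accepted);
        System.out.println("rejected: " + result.rejected);
    }
}
```

An input filter in tMap corresponds to keeping only the accepted list; the rejected list is what would be lost.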
When using database inputs, it is usually better and more efficient to filter within the SQL query, rather than within the Talend job.
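As a rough sketch of pushing the filter into the query, the following builds a query string with the condition in the WHERE clause. The table name customer and the column name customerId are assumptions made for illustration, as is the SqlFilterSketch helper; the real names depend on the database schema.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds a query whose WHERE clause does the filtering.
class SqlFilterSketch {

    // Assumed table (customer) and column (customerId); adjust to the real schema.
    // The list must not be empty, otherwise the IN clause would be invalid SQL.
    static String buildQuery(List<Integer> customerIds) {
        String idList = customerIds.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
        return "SELECT customerId FROM customer WHERE customerId IN (" + idList + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(Arrays.asList(2, 3)));
    }
}
```

With the condition in the query, the database returns only the matching rows, so the unwanted rows are never transferred to the job at all.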