Talend Open Studio Cookbook

Introduction

Managing metadata is one of the most important aspects of developing Talend jobs, and the most common form of metadata used within Talend jobs is the schema.

Schema metadata

For successful development of jobs, it is essential that the metadata defined for a data source accurately describes the format of its underlying data. If the data is defined incorrectly, the result is numerous errors and time wasted tracking down data format problems that could otherwise have been avoided.
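To make the cost of an inaccurate schema concrete, here is a minimal sketch in plain Java. It is not Talend code: the `Column` class, the `check` method, and the semicolon-delimited sample rows are all hypothetical, chosen only to show how a row that disagrees with its declared schema surfaces as a field-count or type error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: these classes are not part of Talend.
final class SchemaCheckSketch {

    /** One column of a schema: a name plus the Java type the data should parse to. */
    static final class Column {
        final String name;
        final Class<?> type;

        Column(String name, Class<?> type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Returns a list of human-readable problems found in one delimited row. */
    static List<String> check(List<Column> schema, String row, String delimiterRegex) {
        List<String> problems = new ArrayList<>();
        // The -1 limit keeps trailing empty fields so the count is not understated.
        String[] fields = row.split(delimiterRegex, -1);
        if (fields.length != schema.size()) {
            problems.add("expected " + schema.size() + " fields but found " + fields.length);
            return problems;
        }
        for (int i = 0; i < fields.length; i++) {
            Column column = schema.get(i);
            // Only integer columns are type-checked in this sketch.
            if (column.type == Integer.class) {
                try {
                    Integer.parseInt(fields[i].trim());
                } catch (NumberFormatException e) {
                    problems.add(column.name + ": '" + fields[i] + "' is not an integer");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<Column> schema = Arrays.asList(
                new Column("id", Integer.class),
                new Column("name", String.class),
                new Column("age", Integer.class));

        System.out.println(check(schema, "1;Alice;34", ";"));    // row matches the schema
        System.out.println(check(schema, "2;Bob;unknown", ";")); // age is not an integer
        System.out.println(check(schema, "3;Carol", ";"));       // a field is missing
    }
}
```

In a real job the equivalent failures would appear at run time, often far from the component where the schema was defined, which is why getting the metadata right up front saves so much effort.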

Talend provides a host of wizards for capturing metadata from a variety of data sources, such as database tables, delimited files, and Excel worksheets, and stores the captured metadata in its built-in metadata repository.

Schemas

Talend stores metadata definitions in schemas, which may be built in to individual components or stored in its metadata repository, as shown in the following screenshot:

[Screenshot: Schemas]

In general, it is best practice to define source and target metadata using a Repository schema and mid-flow metadata as a Built-In schema.

The main exception to this rule is one-off generated source data, such as the result of a database query. Although such a query is a data source, it is easier to store its schema as Built-In than to clutter the repository with single-use schemas.

Repository schemas

The benefits of using Repository schemas are:

  1. They can be re-used across multiple jobs, thus reducing the amount of re-keying.
  2. Talend will ensure that changes made to a Repository schema are cascaded to all jobs that use the schema, thus avoiding the need to scan jobs manually for Built-In schemas that need to be changed.
  3. Impact analysis reports can be generated showing where a Repository schema is being used within a project. This enables the impact of changes to be assessed more accurately when planning changes to any underlying data sources.
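The difference between a shared Repository schema and a Built-In copy can be modeled in a few lines of plain Java. This is a conceptual sketch only, not a description of how Talend implements its repository: the `Schema` class, the `jobsUsing` method, and the job names are hypothetical. A job that references the shared object sees every change, a job holding a copy does not, and finding the referencing jobs stands in for an impact analysis report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual model only; none of these types exist in Talend.
final class RepositorySchemaSketch {

    /** A named, mutable list of column names standing in for a schema. */
    static final class Schema {
        final String name;
        final List<String> columns;

        Schema(String name, List<String> columns) {
            this.name = name;
            this.columns = new ArrayList<>(columns);
        }

        /** Built-In style: an independent copy that no longer follows the original. */
        Schema copy() {
            return new Schema(name + " (built-in copy)", columns);
        }
    }

    /** Impact analysis: names of the jobs that reference exactly this schema object. */
    static List<String> jobsUsing(Map<String, Schema> jobs, Schema schema) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Schema> entry : jobs.entrySet()) {
            if (entry.getValue() == schema) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Schema customer = new Schema("customer", Arrays.asList("id", "name"));

        Map<String, Schema> jobs = new LinkedHashMap<>();
        jobs.put("loadCustomers", customer);       // references the shared schema
        jobs.put("exportCustomers", customer);     // references the shared schema
        jobs.put("legacyReport", customer.copy()); // keeps its own copy

        // One change to the shared schema...
        customer.columns.add("email");

        // ...is seen by every job that references it, but not by the copy.
        for (Map.Entry<String, Schema> entry : jobs.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().columns);
        }
        System.out.println("Jobs using the shared schema: " + jobsUsing(jobs, customer));
    }
}
```

The job holding the copy is the one that would have to be found and updated by hand, which is the manual scanning that Repository schemas avoid.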

Generic schemas

Generic schemas aren’t tied to a particular source. They can therefore be used as a shared resource across multiple types of data source, or to define data sources that are generated, such as the output from custom SQL queries.

Shared schemas

Schemas captured from a particular type of data source are stored in the metadata repository in a folder for that data type (for example, CSV file schemas are stored in the directory for delimited files).

There are, however, instances where schemas will be shared across multiple types. For example, a CSV file and an Excel file could both be used to directly load a database table.

If you import the metadata from one of the sources, it will be stored in the folder for that source, which could make it hard to find.

Storing the schema as a Generic schema makes it more obvious that the schema isn’t used for just a single source.

Generated data sources

It is often necessary to perform a query against a database and return the result set to the Talend job. Frequently, the same query is used in many jobs.

Storing the schema for the result set in a generic schema removes the tedious process of manually re-creating the same schema every time the query is used.

Tip

Another very common use for generic schemas is within the tHashInput and tHashOutput components. If you are using the hash components as lookups, then one tHashOutput could be linked to many tHashInput components and all will share the same schema. By exporting the output schema to a generic schema, tHashInputs can be set up much more quickly in comparison to hand-cranking or cutting and pasting schemas from the output. This also has the benefit of ensuring that changes to the format are cascaded to all related components.

Fixed schemas and columns

Some components, such as tLogCatcher, have predefined schemas that are read-only. These are easily recognized because the whole schema is gray.

You may also find that certain flows, for instance reject flows, have fixed columns that have been added to the original schema. This is because Talend adds the errorCode and errorMessage fields to the schema to store the error information. These additional fields are shown in green to distinguish them as Talend fields.
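As a rough picture of what happens to a reject flow's schema, the following plain Java sketch appends the two error columns to a copy of the main schema. Only the column names errorCode and errorMessage come from the text above. The `rejectSchema` method and the sample columns are hypothetical, and the sketch says nothing about how Talend builds the schema internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: models the idea with plain column names, not Talend's own classes.
final class RejectSchemaSketch {

    /** Returns the original columns followed by the two error columns named in the text. */
    static List<String> rejectSchema(List<String> mainSchema) {
        List<String> reject = new ArrayList<>(mainSchema); // leave the main schema untouched
        reject.add("errorCode");
        reject.add("errorMessage");
        return reject;
    }

    public static void main(String[] args) {
        List<String> mainSchema = Arrays.asList("id", "name", "age");
        System.out.println(rejectSchema(mainSchema)); // prints [id, name, age, errorCode, errorMessage]
    }
}
```

The original columns are carried through unchanged, so a rejected row can still be inspected or reprocessed alongside the reason it was rejected.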