Blogs

EMAIL: info@example.com

Agaric Collective: Understanding the syntax of Drupal migrations

In the 31 days of Drupal migrations series, we explained different aspects of the syntax used by the Migrate API. In today’s article, we are going to dive deeper to understand how the API interprets our migration definition files. We will explain how to configure process plugins and set subfields and deltas for multi-value field migrations. We will also talk about process plugin chains, source constants, pseudofields, and the process pipeline. After reading this article, you will better comprehend existing migration definition files and improve your own. Let’s get started.

Understanding the syntax of Drupal migrations.

Field mappings: process plugin configuration

The Migrate API provides syntactic sugar to make migration definition files more readable. The field mappings under the `process` section are a good example of this. To demonstrate the syntax consider a multi-value Link field to store links to online profiles. The field machine name is field_online_profiles and it is configured to accept the URL and the link text. For brevity, only the `process` section will be shown, but it is assumed that the source includes the following columns: `source_drupal_profile`, `source_gitlab_profile`, and `source_github_profile`.


process:
  field_online_profiles: source_drupal_profile

In this case, we are directly assigning the value from `source_drupal_profile` in the `source` to the `field_online_profiles` in the `destination` entity. For now, we are ignoring the fact that the field accepts multiple values. We are setting the link text either, just the URL. Even in this example, the Migrate API is making some assumptions for us. Every field mapping requires at least one `process` plugin to be configured. If none is set, the `get` plugin is assumed. It copies a value from the source to the destination without making any changes. The previous snippet is equivalent to the next one:


process:
  field_online_profiles:
    plugin: get
    source: source_drupal_profile

The process plugin configuration options should be placed as direct children of the field that is being mapped. In the previous snippet, plugin and source are indented one level to the right under field_online_profiles. There are many process plugins provided by Drupal core and contributed modules. Their configuration can be generalized as follows:


process:
  destination_field:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2
    config_3: value_3

Check out the article on using process plugins for data transformation for a working example.

Field mappings: setting sub-fields

Let’s expand the example by setting the a value for the Link text in addition to the URL. To accomplish this, we will migrate data into subfields. Fields can store complex data and in many cases they have multiple components. For example, a rich text field has a subfield to store the text value and another for the text format. Address fields have 13 subfields available. Our example uses Link fields which have three subfields:

  • uri: The URI of the link.
  • title: The link text.
  • options: Serialized array of options for the link.

For now, only the uri and title subfields will be set. This also demonstrates that, depending on the field, it is not necessary to provide values for all the subfields. One more thing we will implement is to include the name of the online profile in the Link text. For example: “Drupal.org profile”.


process:
  field_online_profiles/uri: source_drupal_profile
  field_online_profiles/title:
    plugin: default_value
    default_value: 'Drupal.org profile'

If you want to set a value for a subfield, you use the field_name/subfield syntax. Then, each subfield can define its own mapping. Note that when setting the uri we are taking advantage of the get plugin considered the default to simplify the value assignment. In the case of title, the default_value process plugin is used to set a fixed value to comply with our example requirement.

When setting subfields, it is very important to understand what format is expected. You need to make sure the process plugins return data in the expected format or the migration will fail. In particular, you need to know if they return a scalar value or an array. In the case of scalar values, you need to verify if numbers or strings are expected. In the previous example, the uri subfield of the Link field expects a string containing the URL. On the other hand, File fields have a target_id subfield that expects an integer representing the File ID that is being referenced. Some process plugins might return an array or let you set subfields directly as part of the plugin configuration. For an example of the latter, have a look at the article on migrating images using the image_import plugin. image_import lets you set the alt, title, width, and height subfields for images directly in the plugin configuration. The following snippets shows a generalization for setting subfields:


process:
  destination_field/subfield_1:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2
  destination_field/subfield_2:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2

If a field can have multiple subfields, how can I know which ones are available? For easy reference, our next blog post will include a list of subfields for different types of fields. To find out by yourself, check out this article that covers available subfields. In summary, you need to locate the class that provides the FieldType plugin and inspect its schema method. The latter defines the database columns used by the field to store its data. Because of object oriented practices, sometimes you need to look at the parent class to know all the subfields that are available. When migrating into subfields, you are actually migrating into those particular database columns. Any restriction set by the database schema needs to be respected. Link fields are provided by the LinkItem class whose schema method defines the three subfields we listed before.

If a field can have multiple subfields, how does the Migrate API know which one to set when no one is manually specified? Every Drupal field has at least one subfield. If they have more, the field type itself specifies which one is the default. For easy reference, our next blog post will indicate the default subfield for different types of fields. To find out by yourself, check out this article that covers default subfields. In summary, you need to locate the class that provides the FieldType plugin and inspect its mainPropertyName method. Its return value will be the default subfield used by the Migrate API. Because of object oriented practices, sometimes you need to look at the parent class to find the method that defines the default subfield. Link fields are provided by the LinkItem class whose mainPropertyName returns uri. That is why in the first example there was no need to specify a subfield to set the value for the link URL.

Field mappings: setting deltas for multi-value fields

Once more, let’s expand the example by setting the populating multiple values for the same field. To accomplish this, we will specify field deltas. A delta is a numeric index starting at 0 and incrementing by 1 for each subsequent element in the multi-value field. Remember that our example assumes that the source has the following columns: source_drupal_profile, source_gitlab_profile, and source_github_profile. One way to migrate all of them into the multi-value link field is:


process:
  field_online_profiles/0/uri: source_drupal_profile
  field_online_profiles/0/title:
    plugin: default_value
    default_value: 'Drupal.org profile'
  field_online_profiles/1/uri: source_gitlab_profile
  field_online_profiles/1/title:
    plugin: default_value
    default_value: 'GitLab profile'
  field_online_profiles/2/uri: source_github_profile
  field_online_profiles/2/title:
    plugin: default_value
    default_value: 'GitHub profile'

If you want to set a value for a subfield, you use the field_name/delta/subfield syntax. Then, every combination of delta and subfield can define its own mapping. Both delta and subfield are optional. If no delta is specified, 0 is assumed which corresponds to the first element of a (multi-value) field. If no subfield is specified, the default subfield is assumed as explained before. In the previous example, if there is no need to set the link text the configuration would become:


process:
  field_online_profiles/0: source_drupal_profile
  field_online_profiles/1: source_gitlab_profile
  field_online_profiles/2: source_github_profile

In this example, we wanted to highlight syntax variations that can be used with the Migrate API. Nevertheless, this way of migrating multi-value fields is not very flexible. You are required to know in advance how many deltas you want to migrate. Depending on your particular configurations, you can write complex process pipelines that take into account an unknown number of deltas. Sometimes, writing a custom migration process plugin is easier and/or the only option to accomplish a task. Even if you can write a migration with existing process plugins, that might not be the best solution. When writing migrations, strive for them to be easy to read, understand, and maintain. For reference, the generic configuration for mapping fields with deltas and subfields is:


process:
  destination_field/0/subfield_1:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2
  destination_field/0/subfield_2:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2
  destination_field/1/subfield_1:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2
  destination_field/1/subfield_2:
    plugin: plugin_name
    config_1: value_1
    config_2: value_2

Process plugin chains

So far, for every field_name/delta/subfield combination we only have used one process plugin. The Migrate API does not impose any restrictions to the number of transformations that the source data can undergo before being assigned to a destination property or field. You can have as many as needed. Chaining of process plugins works similarly to Unix pipelines in that the output of one process plugin becomes the input of the next one in the chain. When the last plugin in the chain completes its transformation, the return value is assigned. We have covered this topic in greater detail in the article on using process plugins for data transformation. For now, let’s consider an example chain of two process plugins:


process:
  title:
    - plugin: concat
      source:
        - source_first_name
        - source_last_name
      delimiter: ' '
    - plugin: callback
      callable: strtoupper

In this example, we are using the concat plugin to glue together the source_first_name and source_last_name. A space is placed in between as specified by the delimiter configuration. The result of this is later passed to the callback plugin which executes the strtoupper PHP function on the concatenated value effectively making the string uppercase. Because there are no more process plugins in the chain, the string transformed to uppercase is assigned to the title destination property. If source_first_name is ‘Mauricio’ and source_last_name is ‘Dinarte’, then title would be set to ‘MAURICIO DINARTE’. Refer to the article mentioned before for other things to consider when manipulating strings. The configuration of process plugin chains can be generalized as follows:


process:
  destination_field:
    - plugin: plugin_name
      source: source_column_name
      config_1: value_1
      config_2: value_2
    - plugin: plugin_name
      config_1: value_1
      config_2: value_2
    - plugin: plugin_name
      config_1: value_1
      config_2: value_2

It is very important to note that only the first process plugin in the chain should set a source configuration. Remember that the output of the previous process plugin is the input for the next one. Setting the source configuration in subsequent process plugins is unnecessary and can actually make the chain produce unexpected results or fail altogether.

Source constants, pseudofields, and the process pipeline

We have covered source constants, pseudo-fields, and the process pipeline in the article on using data placeholders in the migration process. This time, we are only going to give an overview to explain their syntax. Constants are arbitrary values that can be used later in the process pipeline. They are set as direct children of  the source section. Let’s consider this example:


source:
  constant:
    DRUPAL_LINK_TITLE: 'Drupal.org profile'
    GITLAB_LINK_TITLE: 'GitLab profile'
    GITHUB_LINK_TITLE: 'GitHub profile'
process:
  field_online_profiles/0/uri: source_drupal_profile
  field_online_profiles/0/title: constant/DRUPAL_LINK_TITLE
  field_online_profiles/1/uri: source_gitlab_profile
  field_online_profiles/1/title: constant/GITLAB_LINK_TITLE
  field_online_profiles/2/uri: source_github_profile
  field_online_profiles/2/title: constant/GITHUB_LINK_TITLE

To define source constants, you write a constants key and set its value to an array of name-value pairs. When you need to refer to them in the process section, you use constant/NAME and they behave like any other column present in the source. Although not required, it is customary to name constants in uppercase. This makes it easier to distinguish them from regular source columns. Notice how their use makes assigning the link titles simpler. Instead of using the default_value plugin, we read the value directly from the source constants.

Pseudofields also store arbitrary values for use later, but they are defined in the process section. Their names can be arbitrary as long as they do not conflict with a property name or field name in the destination. The value can be set to a verbatim copy from the source (a column or a constant) or they can use process plugins for data transformations. For the next example, consider that there is no need for the link text to be different among online profiles. Additionally, there is another Link field that can only store one value. This new field is used to store the URL to the primary profile. The example can be rewritten as follows:


source:
  constant:
    LINK_TITLE: 'Online profile'
process:
  pseudo_link_text:
    - plugin: get
      source: constant/LINK_TITLE
    - plugin: callback
      callable: strtoupper
  field_online_profiles/0/uri: source_drupal_profile
  field_online_profiles/0/title: '@pseudo_link_text'
  field_online_profiles/1/uri: source_gitlab_profile
  field_online_profiles/1/title: '@pseudo_link_text'
  field_online_profiles/2/uri: source_github_profile
  field_online_profiles/2/title: '@pseudo_link_text'
  field_primary_profile: '@field_online_profiles/0'

A psedofield named pseudo_link_text has been created. It has its own process pipeline to provide the link text that will be used for all online profiles. When you want to use the pseudo, you have to enclose it in quotes (‘) and prepend an at sign (@) to the name. The pseudo_ prefix in the name is not required. In this case it is used to make it easier to distinguish among pseudofields and regular property or field names.

The previous snippets is also a good example of how the migrate process pipeline works. When setting field_primary_profile, we are reusing a value stored in another field: the first delta of field_online_profiles. There are many things to note here:

  • The migrate process pipeline lets you reuse anything that has been defined previously in the file. It can be source constants, pseudo fields, or regular destination properties and fields. The only requirement is that whatever you want to use needs to be previously defined in the migration definition file.
  • Source columns are accessed directly by name. Source constants are accessed using the constant/NAME syntax.
  • Any element defined in the process section can be reused later in the process pipeline by enclosing its name in quotes (‘) and prepending an at sign (@). This applies to pseudofields and regular destination properties and fields.

When reusing an element in the process pipeline, its whole structure becomes available. In the previous example, we set field_primary_profile to '@field_online_profiles/0'. This means that all subfields in the first delta of the field_online_profiles field will be assigned to field_primary_profile. Effectively this means both the uri and title properties will be set. Be mindful that when you reuse a field, all its delta and subfields are copied along unless specifically restricted. For example, if you only want to reuse the uri of the first delta you would use  '@field_online_profiles/0/uri'. In none of these scenarios, indicating that you want to reuse something guarantees that it will be stored in the new element assignment. For example, the field_primary_profile field only accepts one value. Even if we used '@field_online_profiles' to reuse all the deltas of the multi-value field, only the first one will be stored per the field’s (cardinality) definition.

The Migrate API is pretty flexible and you can write very complex process pipelines. The examples we have presented today have been exaggerated to demonstrate many syntax variations. Again, when writing migrations, strive for process pipelines that are easy to read, understand, and maintain.

What did you learn in today’s article? Did you know that it is possible to specify deltas and subfields in field mappings? Were you aware that process plugins can be chained for multiple data transformations? How have you used source constants and psuedofield before? Please share your answers in the comments. Also, we would be grateful if you shared this article with your friends and colleagues.

Read more and discuss at agaric.coop.