In the first part of this two-part series, we explore the factors that drive complexity when integrating third-party data sources with large-scale digital platforms
We use the word integration a lot when we’re talking about building large-scale digital platforms. The tools we build don’t stand in isolation: they’re usually the key part of an entire technology stack that has to work together to create a seamless user experience. When we talk about “integrating” with other services, usually we’re talking about moving information between the digital platform we’re building and other services that perform discrete tasks.
Platforms are not encapsulated monoliths. For just about any feature you could imagine for a platform, there may be a third-party service out there that specializes in doing it and you can optimize (for cost, output, functionality, ease of use, or many other reasons) by choosing to strategically integrate with those services. When we architect platforms (both big and small), we’re usually balancing constraints around existing vendors/service providers, existing data sets, and finding cost and functionality optimizations. It can be a difficult balancing act!
Some examples of integrations include:
- On a healthcare website, clicking the “Make an Appointment” button might take you to a third-party booking service.
- On a higher-education website, you might be able to view your current class schedule, which comes from a class management system.
- On a magazine site, you might not even know that you’re able to read the full article without a paywall because the university network you’re browsing from integrates with the publisher’s site to give you full access.
- On a government website, you might be able to see the wait time for your local Department or Registry of Motor Vehicles.
In short: an integration is a connection that allows us to either put or retrieve information from third-party data sources.
What Drives Complexity in Integrations
The main factors that drive complexity in integrations:
- Is it a Read, Write, or Read/Write integration?
- What is the data transportation protocol?
- How well-structured is the data?
- How is the data being used?
Read, Write or Read/Write
When we talk about reading and writing, we’re typically talking about the platform (Drupal) acting on the third party service. In a Read-only integration, Drupal is pulling information from the third-party service and is either processing it or else just displaying it along with other information that Drupal is serving. In a Write-only integration, Drupal is sending information to a third-party service, but isn’t expecting processed data back (the services will often send status messages back to acknowledge getting the data, but that’s baked into the process and isn’t really a driving factor for complexity). The most complex type of integration is a Read/Write integration: where Drupal is both writing information to a third-party service and also getting information back from that third-party service for processing or display.
It is impossible to separate the idea of accessing information from the question: is the data behind some type of access control? When you’re planning an integration, knowing what kind of access controls are in place will help you understand the complexity of the most basic mechanics of an integration. Is the data we’re reading publicly accessible? Or is it accessible because of a transitive property of the request? Or do we have to actively authenticate to read it? Write operations are almost always authenticated. Understanding how the systems will authenticate helps you to understand the complexity of the project.
In thinking about the transportation protocol of the data, we expand this definition beyond the obvious HTTP, REST, and SOAP to include things like files that are FTP’ed to known locations and direct database access where writing our own queries against a data cache. The mechanics of fetching or putting data affect how difficult the task can be. Some methods (like REST) are much easier to use than others (like FTP’ing to a server that is only accessible from within a VPN that your script isn’t in).
REST and SOAP are both protocols (in the wider sense of the word) for transferring information between systems over HTTP. As such, they’re usually used in data systems that are meant to make information easy to transport. When they’re part of the information system, that usually implies that the data is going to be easier to access and parse because the information is really designed to be moved. (That certainly doesn’t mean it’s the only way, though!)
Sometimes, because you’re integrating with legacy systems or systems with particular security measures in place, you cannot directly poll the data source. In those cases, we often ask for data caches or data dumps to be made available. These can be structured files (like JSON or XML, which we’ll cover in the next section) that are either made available on the server of origin or are placed on our server by the server or origin. These files can then be read and parsed by the integrating script. Nothing is implied by this transportation method: the data behind it could be extremely well structured and easy to work with, or it could be a mess. Often, when we’re working with this modality, we ask questions like: “how will the file be generated?”, “can we modify the script that generates the file?”, and “how well structured is the data?”. Getting a good understanding of how the data is going to be generated can help you understand how well-designed the underlying system is and how robust it will be to work with.
When data folks use phrases like “data structure,” I think there’s a general assumption that everyone knows exactly what they mean. It is one of those terms that seems mystical until you get a clear definition, and then it seems really simple.
When we talk about the data structures in the context of integrations, we’re really talking about how well-organized and how small the “chunks” of data are. Basically: can you concisely name or describe any given piece of information? Let’s look at an example of a person’s profile page. This could be a faculty member or a doctor or a board member. It doesn’t matter. What we’re interested in finding out is this: when we ask a remote system to give us a profile, it is going to respond with something. Those responses might look like any of the following:
- Composed (“pre-structured”) response:
Name: Jane Doe, EdD. Display Name: Samantha Doe, EdD
- Structured Response:
First Name: Samantha Last Name: Doe Title: EdD.
- Structured Response, with extra processing needed:
First Name: Jane Last Name: Doe Title: EdD. Preferred Name: Samantha
All of these responses are valid and would (presumably) result in the same information being displayed to the end user but each one implies a different level of work (or processing) that we need to do on the backend. In this case: simpler isn’t always better!
If, for example, we’re integrating with this profile storage system (likely an HR system, or something like that) so we can create public profiles for people on a marketing site, we may not actually care what their legal first name for HR purposes is (trust me, I’m one of those folks who goes by my middle name—it’s a thing). Did you catch that in the third example above this person had a preferred name? If the expected result of this integration is a profile for “Samantha Doe, EdD.”, how do we get there with these various data structures? They could each require different levels of processing in order to ensure we’re getting the desired output of the correct record.
The more granularly the information is structured, the easier it is to process for the purpose of changing the information.
At the other end of the spectrum is data that will require no processing or modification in order to be used. This is also acceptable and generally low complexity. If the origin data system is going to do all of the processing for us and all we’re doing is displaying the information, then having data that is not highly granular or structured is great. An example of that might be an integration with a system like Twitter: all you’re doing is dropping in their pre-formatted information into a defined box on your page. You have very little control over what goes in there, though you may have control over how it looks. Even if you can’t change the underlying data model, you can still impact the user’s experience of that information.
The key here, for understanding the complexity of the integration, is that you want to be on one extreme or the other. Being in the middle (partially processed data that isn’t well-structured) really drives up effort and complexity and increases the likelihood of there being errors in the output.
One of the best early indicators of complexity is the answer to the question “how is the data being used?” Data that is consumed by or updated from multiple points on a site is generally going to have more complex workflows to account for than data that is used in only one place. This doesn’t necessarily mean that the data itself is more complex, only that the workflows around it might be.
Take, for example, a magazine site that requires a subscription in order to access content (i.e., a paywall). The user’s status as a subscriber or anonymous user might appear in several different places on the page: a “my account” link in the header, a hidden “subscribe now!” call to action, and the article itself actually being visible. Assuming that the user subscription status is held in an external system, you might now be faced with the architectural decisions: do we make multiple calls to the subscription database or do we cache the response? If the status changes, how do we invalidate that cache throughout the whole system? The complexity starts to grow.
Another factor in the data usage to consider is how close the stored data is to the final displayed data. We often refer to this as “data transformations.” Some types of data transformations are easy and while others push the bounds of machine learning.
If you had data in a remote system that was variable in length (say, items in a shopping cart), then understanding how that data is stored AND how that will be displayed is important. If the system that is providing the data is giving you JSON data, where each item in the cart is its own object, then you can do all kinds of transformations with the data. You can count the objects to get the number of items in the cart; you can display them on their own rows in a table on the frontend system; you can reorder them by a property on the object. But what if the remote system is providing you a comma-separated string? Then the display system will need to first transform that string into objects before you can do meaningful work with them. And chances are, the system will also expect a csv string back, so if someone adds or removes an item from their cart, you’ll need to transform those objects back to a string again.
All of this is rooted in a basic understanding: how will the data I’m integrating be used in my system?
Come back on Thursday for Part 2, where we’ll provide a framework to help you make sure you’re asking the right questions before embarking on a complex integration project.