Our customer regularly receives large volumes of documentation from various subcontractors. Together, we have set up a partially automated import that ensures the documentation lands in the customer’s document management system. This case is from the life-science sector.


The documents in this case are research documentation.

Specifically, it is documentation of clinical trials (controlled testing of drugs on humans). Clinical trials are often conducted by CROs (Contract Research Organisations) acting as subcontractors to the pharmaceutical company that needs to test its product.

The documentation, including that from doctors, is collected by the CRO in its own document management system and, once the trial is completed, the documentation is exported as a single package and delivered to the pharmaceutical company. The package is called a Trial Master File (TMF). The size of the packages can vary enormously, but to give an indication of the order of magnitude, many have had in the region of 30,000 documents.

Our customer receives such packages on a regular basis, and our task is to import the documents into the customer’s own document management system, as automated and as easily as a pharmaceutical setting allows.

This is a case story about our close collaboration with the customer in question: how we established a methodology and have improved it over time.


What the different deliverables have in common is that they are all TMFs, and there are clear expectations about what such a package contains. This is well described in a model (the DIA reference model): each type of documentation has a unique place in the model, given by a so-called artifact number, and the model also specifies what metadata the documents should additionally carry.

Now, it would be convenient if the CROs and our customer firstly all used the model in question, and secondly agreed on its interpretation. But of course, that is not the case.

Our customer has closely followed the model, with some local adaptations. So it is very well described how the documents should look once entered into the customer’s system. But the deliverables vary widely. Some CROs use the model and manage to export the metadata from their system to us, so we receive almost everything and only have to adapt it to local conditions. Others deliver a mess, to be honest. And we get everything in between the two extremes.

So, the task of getting the documents into the customer’s system is to import the delivered documents under full pharmaceutical control, but also to identify the core metadata of each delivered document. We call this classification. The starting point is very different each time – the spreadsheets provided, file paths and file names, and whatever else is given – but the result is consistent every time.
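To make the idea of classification concrete, here is a minimal sketch in plain Python, outside any specific tool. The target schema, field names, and example values are invented for illustration; the real schema follows the customer’s adaptation of the DIA model.

```python
# Illustrative only: the target fields and source shapes are invented for
# this sketch. The point is that very different inputs (a spreadsheet row,
# a bare file path) are mapped to one consistent metadata record.

TARGET_FIELDS = ["artifact_number", "document_type", "country", "site"]

def classify_from_spreadsheet(row):
    """A cooperative CRO delivers most metadata in a spreadsheet row."""
    return {
        "artifact_number": row["Artifact No"],
        "document_type": row["Doc Type"],
        "country": row["Country"],
        "site": row["Site ID"],
    }

def classify_from_path(path):
    """A sparser delivery: derive what we can from the folder path.
    Assumed folder convention: <country>/<site>/<doc type>/<file>."""
    parts = path.strip("/").split("/")
    return {
        "artifact_number": None,   # to be resolved later by a lookup rule
        "document_type": parts[2],
        "country": parts[0],
        "site": parts[1],
    }

row = {"Artifact No": "05.01.01", "Doc Type": "Protocol",
       "Country": "Denmark", "Site ID": "DK-001"}
print(classify_from_spreadsheet(row)["country"])                     # Denmark
print(classify_from_path("Denmark/DK-001/Protocol/p.pdf")["site"])   # DK-001
```

However different the two inputs look, both functions produce the same record shape, which is exactly what “the result is consistent every time” means in practice.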


We chose a migration tool that specialises in document migration. The tool is called migration-center and comes from the German company fme AG. It has been on the market for many years, is well tested, and can meet the specific requirements of the pharmaceutical sector.

Migration-center consists of an engine and a “cockpit” from which the migration/import is controlled. At one end it connects to the source with a scanner, and at the other end it connects to the destination with an importer. Many of the market-standard systems are covered out-of-the-box.

In this case, we chose an importer for Documentum, which is our customer’s technology, and two scanners. The documents start out on a file system, so both scanners read files from the file system. One (the filesystem scanner) specialises in reading the metadata available in the file system itself about each file (mainly file paths and file names). The other (the csv scanner) specialises in reading the metadata about each file from spreadsheets.
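The essence of what a csv-style scanner contributes can be illustrated outside the tool itself. This is a sketch under assumptions, not migration-center’s actual API: we imagine a CRO spreadsheet exported as CSV with one row per document, keyed by a relative path, with invented column names.

```python
import csv
import io

# Hypothetical CSV export from a CRO system: one row per document.
# Column names and values are invented for this illustration.
csv_text = """path,artifact,country
Denmark/DK-001/protocol.pdf,Protocol,Denmark
Germany/DE-002/consent.pdf,Informed Consent,Germany
"""

# What a csv scanner does, in essence: attach the spreadsheet metadata
# to each file path, so that later rules can work with it.
metadata_by_path = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    metadata_by_path[row["path"]] = {
        "artifact": row["artifact"],
        "country": row["country"],
    }

print(metadata_by_path["Germany/DE-002/consent.pdf"]["artifact"])
# Informed Consent
```

The filesystem scanner plays the complementary role: when no spreadsheet exists, the path itself is the only metadata source.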

In addition to installing and setting up the tool (within the customer’s firewall), we developed and validated (tested) the process that we now follow for each delivery. The process became a handbook or manual that we follow every time a new TMF needs to be imported.
The customer has chosen for us to perform each import because they want to save internal resources. But they could have chosen to train a few people to do it themselves and bid us a fond farewell.


This is how the tool and the process work – very briefly.

  1. The data source is scanned for documents. In this case, the relevant folders are scanned on the file drive where our TMF package is located. All available metadata for the documents is retrieved from the source. The documents and metadata are now inside the migration-center database.
  2. The task now is to process the metadata obtained from the scan, so that we end up with the metadata the customer’s system requires of us. That is, the classification work described above in the section on the model.
    In most cases, the information we need is there. It’s a matter of identifying it and extracting it.
    For example, if we need Country, and it is not politely delivered to us in a spreadsheet from the CRO, then we must look at the folder structure. Usually, we find that there is a tidy logic where documents from a country are placed in a folder named after the country. So the information is there, in the folder path. The exercise for us is to discover this and then extract the information we need.
    So in this example we use the file path read by our scanner, and therefore available in the migration-center database, and derive from it a Country value per document. We set up a rule in the tool that makes that derivation; here we would concretely do some string gymnastics, which is one of many, many options we have for building rules in the tool.
  3. Once the rules we need are set up, we can simulate the migration. It doesn’t take that long, because only metadata is processed; the documents themselves don’t have to be moved. Then we can look at the result and see if we’re happy. We can also pull the result of the simulation into spreadsheets and have it checked by internal people with business knowledge.
    We keep tweaking and simulating until we’re happy. At that point we have a configuration in migration-center that fits this data set exactly.
  4. The import is started and the documents are imported.
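The Country example in step 2 boils down to a small piece of string logic. Here is what that “string gymnastics” could look like as a standalone Python sketch; the folder convention and the country list are assumptions for illustration, and the real rule would be configured inside migration-center rather than written as code.

```python
from pathlib import PurePosixPath

# Assumed convention from the example: documents from a country live
# under a folder named after that country, e.g.
#   TMF-4711/Denmark/Site-01/visit-report.pdf
# The country list is illustrative; a real rule would use a maintained list.
KNOWN_COUNTRIES = {"Denmark", "Germany", "Spain"}

def derive_country(path: str):
    """Return the first path segment that matches a known country."""
    for part in PurePosixPath(path).parts:
        if part in KNOWN_COUNTRIES:
            return part
    return None  # no match: flag the document for manual classification

print(derive_country("TMF-4711/Denmark/Site-01/visit-report.pdf"))  # Denmark
print(derive_country("TMF-4711/misc/readme.txt"))                   # None
```

Returning `None` instead of guessing matters here: under pharmaceutical control, a document the rule cannot classify should surface for human review, not be silently assigned a value.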


We skipped over the testing a bit quickly above. Testing is always required regardless of industry, but this section looks at how we meet the pharmaceutical sector’s requirements for validation under GxP.

In the start-up project, where we established the processes and technology, we qualified migration-center. We based this on the fact that it is an established product and that the supplier provided their process for developing and maintaining it. We decided to qualify our delimited use of the tool, as defined by the handbook, and established an IQ (Installation Qualification) and a thorough OQ (Operational Qualification) that covered the corners of our way of using the product according to the handbook. The dataset we used as an example in the qualification was imported, and the customer’s UAT (User Acceptance Test) could then test the tool against the URS (User Requirements Specification) for a TMF import via the imported data.

With the tool qualified once and for all, the validation of each import can now concentrate on the data-specific testing. We have an almost completely reusable OQ, which is repeated for each dataset, and likewise the customer has a largely standard UAT. Each TMF import is a project with its own validation plan.

So how does the data-oriented validation work in practice? Once we have finished making the rules to process a dataset, the processing of the data needs to be validated. We develop the rules and related configuration in a development environment, and conveniently, everything we have prepared can be packaged and exported via a standard function in the tool. We can then import it into a validation environment and perform the import from end to end (possibly on a representative subset of the documents).
Once the validation is over, we again use standard functionality in the tool to move the rules into the production environment, and then we are ready to run the import.


The packages delivered are structurally different each time. That means we can’t just push a button and make it happen. But we’ve now seen so many different variants that there’s more and more to reuse. We’ve automated, generalised and systematised, and now it runs efficiently and well. We have acquired enough knowledge to classify the documents ourselves, with the occasional consultation of an internal specialist. The validation was a lot of work the first few times, but now it runs quite painlessly.

So, while it’s not a push of a button, the import is automated as far as it possibly can be. The alternative, used before we got this up and running, was manual uploads.