If you recall, a few posts ago I told you that there would be several talks on Data Warehousing and a few on Data Repositories at the DIA annual meeting. Well, I was only half right. Why? Because there was a failure of communication of the semantic kind.
Let me explain.
Each time I walked into a session that was described as being about data warehousing, it turned out to be a session on the use of standards (e.g. CDISC) and tools to transform clinical trial data into usable datasets for statistical analysis. As far as I know, no self-respecting statistician would call the resulting set of datasets a warehouse. So why is everyone using the term "data warehouse?"
On reflection, the best explanation I can think of is that (1) most people in Life Sciences don't know the traditional definition of a data warehouse, and (2) lacking this knowledge, they assume that a collection of clean datasets, stored in one place and created according to some standard, can somehow be labeled a data warehouse.
This is simply not the case.
Here are a few traditional definitions of a data warehouse:
"A data warehouse is a database geared towards the business intelligence requirements of an organisation. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals." - Source: Oranz
"A subject-oriented non-volatile collection of data used to support strategic decision making. The warehouse is the central point of data integration for business intelligence. It is the source of data for data marts within an enterprise and delivers a common view of enterprise data." - Source: IBM
"A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems." - Source: Wikipedia
I have put some terms in large bold type to point out some key reasons why the outputs from the standardization and transformation of clinical trial data DO NOT qualify as a data warehouse.
First, when datasets are cleaned or transformed, they are still just files, normally stored on a file system. As the first definition states, a data warehouse is a "database." This means that it is most likely a set of relational tables that store data row by row and whose rows, in turn, are directly retrievable through a native application like Oracle. Bottom line: A collection of datasets stored on a file system does not a data warehouse make!
Second, clinical trial datasets are typically analyzed over and over, and the outputs often turn into inputs for further analysis or transformation. In other words, the data are volatile. As the second definition states, a data warehouse typically contains non-volatile data. One could argue that once the data in a trial are locked, they are also non-volatile from then on. Unfortunately, even these data are typically file-based and thus still break the first rule.
One could argue that standardization with CDISC, or transformation from raw to analysis datasets, is somewhat analogous to the typical data warehousing ETL (i.e. Extract, Transform and Load) process. Unfortunately, the L(oad) step is missing, since there is no attempt to create a database.
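To make the missing step concrete, here is a minimal sketch in Python of what a complete Extract, Transform and Load pass might look like. Everything in it is an assumption made for illustration: the `vitals` table, the column names, the sample values, and the use of an in-memory SQLite database in place of a real warehouse platform. It is not how any particular vendor or sponsor does it.

```python
import csv
import io
import sqlite3

# Hypothetical raw data, standing in for a dataset file on a file system.
RAW_CSV = """subject_id,visit,sbp_mmhg
001,baseline,142
001,week4,135
002,baseline,150
002,week4,
"""


def extract(text):
    """Extract: read rows from the file-like source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing measurement and convert types."""
    return [
        (r["subject_id"], r["visit"], int(r["sbp_mmhg"]))
        for r in rows
        if r["sbp_mmhg"].strip()
    ]


def load(conn, records):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals ("
        "subject_id TEXT, visit TEXT, sbp_mmhg INTEGER)"
    )
    conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

# Once loaded, the data can be queried directly instead of re-read from files.
mean_baseline = conn.execute(
    "SELECT AVG(sbp_mmhg) FROM vitals WHERE visit = 'baseline'"
).fetchone()[0]
print(mean_baseline)
```

What I saw described at DIA stops after `transform`: the cleaned records are written back out as dataset files, and the `load` function never gets called.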
Last, the collection of datasets falls far short of qualifying as a corporate memory. The datasets are typically accessible only by a small group of specialists (say statisticians or programmers). They are also not able to serve as a historical record of what was done by whom at any given time.
So what is being produced here if not a data warehouse?
In short, all of the talks and discussions at DIA were about what I call a Clinical Data Repository or CDR. The key difference between a Repository and a Warehouse is that a Repository does not require that a non-volatile database be created. Rather, you have a standards-based and tightly controlled environment where raw data and analysis data are managed in a centralized or federated store(s) accessible to anyone with a need to know. It is a transactional environment in the sense that the data are retrievable and analyzable with all of the inputs and outputs actively managed by the software rather than procedurally by the user community.
A subset of the CDR is the Statistical Computing Environment (SCE) that allows statisticians and programmers to manage, transform and analyze the data in the traditional sense. In other words, they can perform all of the work called for by the clinical trial Protocol and the Statistical Analysis Plan (SAP) to produce the outputs required for regulatory submission.
Putting the semantic issue aside, if the DIA meeting truly reflects what is going on in the industry, then we are on the road to seeing the wide-scale adoption of CDR. At first, we will most likely see it being used in the biostatistics area (i.e. SCE) and later in other areas such as Pharmacology and all of the omics.
From where I sit, the sooner this happens the better.