DIA Reflections: The Importance of Semantics

If you recall, a few posts ago I told you that there would be several talks on Data Warehousing and a few on Data Repositories at the DIA annual meeting. Well, I was only half right. Why? Because there was a failure of communication of the semantic kind.

Let me explain.

Each time I walked into a session that was described as being about data warehousing, it turned out to be a session on the use of standards (e.g. CDISC) and tools to transform clinical trial data into usable datasets for statistical analysis. As far as I know, no self-respecting statistician would call the resulting set of datasets a warehouse. So why is everyone using the term "data warehouse"?

On reflection, the best explanation I can think of is that (1) most people in Life Sciences don't know what the traditional definition of a data warehouse is and (2) lacking this knowledge, they assume that a collection of clean datasets, stored in one place and created according to some standard, can somehow be labeled a data warehouse.

This is simply not the case.

Here are a few traditional definitions of a data warehouse:

"A data warehouse is a database geared towards the business intelligence requirements of an organisation. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals." - Source: Oranz

"A subject-oriented non-volatile collection of data used to support strategic decision making. The warehouse is the central point of data integration for business intelligence. It is the source of data for data marts within an enterprise and delivers a common view of enterprise data." - Source: IBM

"A data warehouse is the main repository of an organization's historical data, its corporate memory. It contains the raw material for management's decision support system. The critical factor leading to the use of a data warehouse is that a data analyst can perform complex queries and analysis, such as data mining, on the information without slowing down the operational systems." - Source: Wikipedia

Three terms in these definitions ("database," "non-volatile" and "corporate memory") point to the key reasons why the outputs from the standardization and transformation of clinical trial data DO NOT qualify as a data warehouse.

First, when datasets are cleaned or transformed, they are still just files, normally stored on a file system. As the first definition states, a data warehouse is a "database." This means that it is most likely a set of relational tables that store data row by row and that, in turn, are directly retrievable through a native application like Oracle. Bottom line: A collection of datasets stored on a file system does not a data warehouse make!

Second, clinical trial datasets are typically analyzed over and over and the outputs often turn into inputs for further analysis or transformation. In other words, the data are volatile. As the second definition states, a data warehouse typically contains non-volatile data. One could argue that once the data in a trial are locked, they are also non-volatile from there forward. Unfortunately, even these data are typically file based and thus still break the first rule.
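To make "non-volatile" concrete, here is a minimal sketch of a table that refuses changes once a row has been loaded. It uses SQLite from the Python standard library purely for illustration; the table name `locked_results`, its columns and the helper functions are hypothetical, and a real warehouse would enforce this in its own way.

```python
import sqlite3


def make_nonvolatile_store() -> sqlite3.Connection:
    """Create an in-memory table whose rows cannot be changed once loaded."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE locked_results (
            subject_id TEXT NOT NULL,
            visit      TEXT NOT NULL,
            sbp_mmhg   REAL NOT NULL,
            PRIMARY KEY (subject_id, visit)
        );
        -- Reject any attempt to modify or delete a loaded row.
        CREATE TRIGGER no_update BEFORE UPDATE ON locked_results
        BEGIN
            SELECT RAISE(ABORT, 'locked_results is non-volatile');
        END;
        CREATE TRIGGER no_delete BEFORE DELETE ON locked_results
        BEGIN
            SELECT RAISE(ABORT, 'locked_results is non-volatile');
        END;
        """
    )
    return conn


def try_overwrite(conn, subject_id, visit, new_value) -> bool:
    """Return True if the update went through, False if the store refused it."""
    try:
        conn.execute(
            "UPDATE locked_results SET sbp_mmhg = ? "
            "WHERE subject_id = ? AND visit = ?",
            (new_value, subject_id, visit),
        )
        return True
    except sqlite3.DatabaseError:
        # The trigger aborted the statement; the stored row is untouched.
        return False
```

New rows can still be inserted, which matches the idea of a warehouse that is loaded at intervals but never edited in place. Analysis datasets that are rewritten on every run behave in exactly the opposite way.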

One could argue that standardization with CDISC, or the transformation from raw to analysis datasets, is somewhat analogous to the typical data warehousing ETL (i.e. Extract, Transform and Load) process. Unfortunately, the L(oad) step is missing, since there is no attempt to create a database.
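For readers who have not seen ETL spelled out, the following is a rough sketch of all three steps, including the Load step into a database. Everything in it is an assumption made for illustration: the raw CSV content, the column names, the `vitals` table and the use of SQLite. It is not a description of any CDISC tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are illustrative only.
RAW_CSV = """subject,visit,sbp
s001,week 4,128
s002,week 4,
s003,week 4,141
"""


def extract(text):
    """Extract: read the raw rows from a delimited dataset file."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise identifiers and drop rows with no measurement."""
    clean = []
    for row in rows:
        if not row["sbp"].strip():
            continue
        subject = row["subject"].upper()
        visit = row["visit"].upper().replace(" ", "")
        clean.append((subject, visit, float(row["sbp"])))
    return clean


def load(conn, records):
    """Load: write the transformed rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals "
        "(subject_id TEXT, visit TEXT, sbp_mmhg REAL)"
    )
    conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three steps in order and return the loaded database."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract(text)))
    return conn
```

Stop after `transform` and write the result back to a file, and you have what most of the sessions described. Only once `load` has run can the data be queried in place, for example with `SELECT AVG(sbp_mmhg) FROM vitals`.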

Last, the collection of datasets falls far short of qualifying as a corporate memory. The datasets are typically accessible only by a small group of specialists (say statisticians or programmers). They are also not able to serve as a historical record of what was done by whom at any given time.

So what is being produced here if not a data warehouse?

In short, all of the talks and discussions at DIA were about what I call a Clinical Data Repository or CDR. The key difference between a Repository and a Warehouse is that a Repository does not require that a non-volatile database be created. Rather, you have a standards-based and tightly controlled environment where raw data and analysis data are managed in one or more centralized or federated stores accessible to anyone with a need to know. It is a transactional environment in the sense that the data are retrievable and analyzable, with all of the inputs and outputs actively managed by the software rather than procedurally by the user community.
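What "inputs and outputs actively managed by the software" could look like is easiest to show with a toy example. The sketch below is only that: the `Repository` and `Derivation` classes, their fields and the dataset names are all hypothetical, and a real CDR would add access control, versioning and persistent storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Derivation:
    """One managed step: which inputs produced which output, by whom, when."""
    output: str
    inputs: tuple
    user: str
    program: str
    run_at: datetime


@dataclass
class Repository:
    """Toy stand-in for a CDR that records every derivation it performs."""
    datasets: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def add_raw(self, name, rows):
        """Register a raw dataset under a name."""
        self.datasets[name] = rows

    def derive(self, output, inputs, user, program, func):
        """Run func on the named inputs and record the step in the log."""
        # The repository, not the user, runs the step and records its lineage.
        result = func(*(self.datasets[name] for name in inputs))
        self.datasets[output] = result
        self.log.append(
            Derivation(output, tuple(inputs), user, program,
                       datetime.now(timezone.utc))
        )
        return result

    def lineage(self, name):
        """Return every upstream dataset that contributed to `name`."""
        upstream = []
        for step in self.log:
            if step.output == name:
                for parent in step.inputs:
                    upstream.append(parent)
                    upstream.extend(self.lineage(parent))
        return upstream
```

Because every derivation goes through `derive`, the log can answer what was produced from what, by whom and when, without relying on users to follow a procedure. Note that the simple `lineage` walk assumes no dataset is ever derived from itself.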

A subset of the CDR is the Statistical Computing Environment (SCE) that allows statisticians and programmers to manage, transform and analyze the data in the traditional sense. In other words, they can perform all of the work called for by the clinical trial Protocol and the Statistical Analysis Plan (SAP) to produce the outputs required for regulatory submission.

Putting the semantic issue aside, if the DIA meeting truly reflects what is going on in the industry, then we are on the road to seeing the wide-scale adoption of CDR. At first, we will most likely see it being used in the biostatistics area (i.e. SCE) and later in other areas such as Pharmacology and all of the omics.

From where I sit, the sooner this happens the better.

Comments

phoenix web design

The more it expands the better.

George Laszlo

Doug, your comments bring up yet another dimension of a data warehouse, the fact that it can be used to put data from different sources into a common pool. And, as you point out, putting these data into the same pool does not necessarily mean that the proper relationships between them have been defined. Someone has to take that next step either before or after the data have been pulled in.
Your reference to CDISC and what must be done next reminded me once more that this process of standardization is really time consuming. I wish there were a way to speed this up, but I don't think anyone has come up with one. In the meantime, everyone does the best they can and lives with the danger that we create even more divergence. This, by the way, is one of the key reasons why we don't or can't get around to embedding data standards into the daily life of a biopharma company and end up having to transform data at the tail end just to make regulatory submission possible. A very frustrating situation!

Doug Bain

George,

I agree with your assessment of the confusion around the definition of a Data Warehouse. Taking lumps of CDISC (SDM or ODM) based data across studies, and dumping them into a single database will not create a Warehouse.

If I wished to analyze the data, then regardless of the database I was using, I would not necessarily obtain the meta or contextual information needed to determine which data are usable.

Each protocol has a series of 'gates' that data must pass through to be considered 'clean'. Data that arrives into a cross study repository will, in effect, be the lowest common denominator across all the protocols. I suppose some useful information may be extracted, but really effective information requires effective metadata to be available.

I believe the next (complex) phase for CDISC must be a rules standardization exercise. Once we have that, and we are able to examine the rules applied to supposedly 'clean' data in a consolidated data store, then we can call it a warehouse.

George Laszlo

Charles, since it appears that you work for Oracle, I will forgive you for being enthusiastic and calling 10gR2 a "disruptive technology." I have learned the hard way that this industry is pretty good at being skeptical about new technologies and certainly takes its time adopting them. Witness that most statistical analysis is still being done in batch mode using SAS against datasets stored as files. Nothing relational there! So, the concept of being disruptive may be more wishful thinking than reality at the moment.
Putting that aside, your suggestion that "the Database can also perform in-Database comparative statistical functions, and it provides a full range of data mining and text mining algorithms" is a good one. Oracle 10g can enable this and some vendors who specialize in the Biopharma sector (e.g. Waban Software) can make all of this happen including the management and audit trail capabilities.
Once a single application provides these functions, it is difficult to label it either as CDR, DW or Knowledge Warehouse. What's important, however, is that people understand what functions are provided and how they can displace whatever they are used to.

Charles Berger

What if the Database can also perform in-Database comparative statistical functions, and it provides a full range of data mining and text mining algorithms? (Oracle Database 10gR2 has these capabilities included.) Does this "disruptive technology" change all the rules? Manage the data and analyze the data in the same place, where it is safe, secure, has audit trails, etc. Is that a CDR, DW or maybe even a Knowledge Warehouse?
