The role of big data in medicine

Big data can play a vital role in medicine by supporting better predictive models for individual patients, which in turn lead to better diagnosis and treatment of disease.

The last decade has seen huge advances in the amount of data we routinely generate and collect in pretty much everything we do, as well as in our ability to use technology to analyze and understand it. In 2012, worldwide digital healthcare data was estimated at 500 petabytes, and it is expected to reach 25,000 petabytes by 2020. The intersection of these trends is what we call “Big Data”, and it is helping businesses in every industry become more efficient and productive.

If we want to see how Big Data is helping to make the world a better place, there is no better example than the uses being found for it in medicine.

However, one of the major limitations of medicine today is our incomplete understanding of the biology of disease. Big data comes into play by aggregating more and more information across the multiple scales that constitute a disease: from DNA, proteins, and metabolites to cells, tissues, organs, organisms, and ecosystems. Those scales of biology need to be modeled by integrating big data. The models will evolve and grow, and they will become more predictive for a given individual.

Furthermore, big data cannot be adopted in medicine all at once. It is a continuous process, more of an evolution. As we start building these models, they need to be tested by applying them to individuals, assessing the outcomes, refining the models, and so on. Only then will we get answers to our questions. The modeling will become more informed as we start pulling in all of this information. Right now we are at the very beginning of this revolution, but it will probably speed up, because the information sciences have already matured considerably outside medicine.

The life sciences are not the first to encounter big data. Information-driven companies like Google, Amazon and Facebook apply a huge number of algorithms to predict what kind of food you like to buy or what kind of electronic products you like, and they all use similar machine-learning techniques. Similar methods and models can be applied in medicine as well.

Big Data in Healthcare Today

A number of use cases in healthcare are well suited for a big data solution. Many academic, research-focused healthcare institutions are either experimenting with big data or using it in advanced research projects. Those institutions draw upon data scientists, statisticians, graduate students, and the like to wrangle the complexities of big data.

A Step Forward

This is leading to groundbreaking work, often through partnerships between medical and data professionals, with the potential to peer into the future and identify problems before they happen. One recently formed example of such a partnership is the Pittsburgh Health Data Alliance, which aims to take data from various sources (such as medical and insurance records, wearable sensors, genetic data and even social media use) to draw a comprehensive picture of the patient as an individual, in order to offer a tailored healthcare package.

That person’s data won’t be treated in isolation. It will be compared and analyzed alongside the data of thousands of others, highlighting specific threats and issues through patterns that emerge during the comparison. This enables sophisticated predictive modelling: a doctor will be able to assess the likely result of whichever treatment he or she is considering, backed up by data from other patients with the same condition, genetic factors and lifestyle.
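As a rough illustration of comparing one patient against similar patients, here is a minimal nearest-neighbour sketch. It is not the method used by any partnership mentioned here. The records, the two features, and the function names (`distance`, `estimate_response_rate`) are all made up for the example.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimate_response_rate(patient, cohort, treatment, k=3):
    """Share of the k most similar past patients on `treatment` who responded."""
    treated = [r for r in cohort if r["treatment"] == treatment]
    nearest = sorted(treated, key=lambda r: distance(patient, r["features"]))[:k]
    if not nearest:
        return None  # nobody in the cohort received this treatment
    return sum(r["responded"] for r in nearest) / len(nearest)

# Hypothetical records. Features are (age in decades, BMI / 10),
# scaled so that neither feature dominates the distance.
cohort = [
    {"features": (5.0, 2.5), "treatment": "A", "responded": True},
    {"features": (5.2, 2.7), "treatment": "A", "responded": True},
    {"features": (7.5, 3.4), "treatment": "A", "responded": False},
    {"features": (5.1, 2.6), "treatment": "B", "responded": False},
    {"features": (7.4, 3.3), "treatment": "B", "responded": True},
]

new_patient = (5.1, 2.6)
rate_a = estimate_response_rate(new_patient, cohort, "A", k=2)
rate_b = estimate_response_rate(new_patient, cohort, "B", k=1)
print(rate_a, rate_b)
```

A real system would draw on far more patients and features, and would need validation before informing any treatment decision. The sketch only shows the basic idea of borrowing evidence from similar cases.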

Another partnership that has just been announced is between Apple and IBM. The two companies are collaborating on a big data health platform that will allow iPhone and Apple Watch users to share data with IBM’s Watson Health cloud healthcare analytics service. The aim is to discover new medical insights by crunching real-time activity and biometric data from millions of potential users.


Barriers Exist for Using Big Data in Healthcare Today

Several challenges in applying big data to medicine have yet to be addressed.


The value of big data in healthcare today is largely limited to research because using big data requires a very specialized skill set. Hospital IT experts familiar with SQL and traditional relational databases aren’t prepared for the steep learning curve and other complexities surrounding big data.


In healthcare, HIPAA compliance is non-negotiable. Nothing is more important than the privacy and security of patient data. But there aren’t many good, integrated ways to manage security in big data. Although security is improving, it has been an afterthought up to this point. If a hospital only has to grant access to a couple of data scientists, it doesn’t have too much to worry about. But when access is opened up to a large, diverse group of users, security cannot be an afterthought.

Big Data Differs from the Databases Currently Used in Healthcare

Big data differs from a typical relational database in three main ways: it has minimal structure, it is kept raw, and it is less expensive.

Big Data Has Minimal Structure

The biggest difference between big data and relational databases is that big data doesn’t have the traditional table-and-column structure that relational databases have. In classic relational databases, a schema for the data is required (for example, demographic data is housed in one table joined to other tables by a shared identifier like a patient identifier). Every piece of data exists in its well-defined place. In contrast, big data has hardly any structure at all. Data is extracted from source systems in its raw form and stored in a massive, somewhat chaotic distributed file system. The Hadoop Distributed File System (HDFS) stores data across multiple data nodes in a simple hierarchical form of directories of files. Conventionally, each file is split into fixed-size blocks (64 MB by default in early Hadoop releases, 128 MB in later ones) that are spread across the data nodes, often with compression applied.
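To make the chunking concrete, the arithmetic can be sketched as follows. The helper names (`block_count`, `replica_count`) are hypothetical, the 64 MB chunk size is taken from the text above, and the replication factor of 3 is an assumption matching HDFS's usual default.

```python
MIB = 1024 ** 2  # bytes in one mebibyte

def block_count(file_size_bytes, block_size_bytes=64 * MIB):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    if file_size_bytes <= 0:
        return 0
    return -(-file_size_bytes // block_size_bytes)  # ceiling division on integers

def replica_count(file_size_bytes, block_size_bytes=64 * MIB, replication=3):
    """Total block copies stored across the cluster for one file."""
    return block_count(file_size_bytes, block_size_bytes) * replication

one_gib = 1024 * MIB
print(block_count(one_gib))    # 16
print(replica_count(one_gib))  # 48
print(block_count(100 * MIB))  # 2
```

Under these assumptions, a 1 GiB extract is stored as 16 blocks, and the cluster holds 48 block copies of it in total.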

Big Data Is Raw Data

By convention, big data is typically not transformed in any way. Little or no “cleansing” is done and, generally, no business rules are applied. Some people describe this raw data in terms of the “Sushi Principle” (i.e. data is best when it’s raw, fresh, and ready to consume). Interestingly, the Health Catalyst Late-Binding Data Warehouse follows the same principles. This approach doesn’t transform data, apply business rules, or bind the data semantically until the last responsible moment. In other words, it binds as close to the application layer as possible.
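The idea of binding late can be sketched in a few lines. This is not Health Catalyst's implementation. The record layout, the field names, and the glucose cutoff are made up for illustration, and the cutoff is not clinical guidance. Raw records are kept exactly as received, and parsing and business rules are applied only when the data is read.

```python
import json

# Raw records stored exactly as they arrived; note the inconsistent shapes.
RAW_LINES = [
    '{"patient_id": "p1", "glucose_mg_dl": "182", "taken": "2015-03-01"}',
    '{"patient_id": "p2", "glucose_mg_dl": "95", "taken": "2015-03-01"}',
    '{"patient_id": "p3", "glucose": 140}',
]

def read_glucose(raw_lines):
    """Parse and normalise at read time; skip records without a usable value."""
    for line in raw_lines:
        record = json.loads(line)
        value = record.get("glucose_mg_dl", record.get("glucose"))
        if value is None:
            continue
        yield record["patient_id"], float(value)

def flag_high(raw_lines, threshold=126.0):
    """Business rule applied at the application layer (illustrative cutoff)."""
    return [pid for pid, value in read_glucose(raw_lines) if value >= threshold]

flagged = flag_high(RAW_LINES)
print(flagged)  # ['p1', 'p3']
```

Because the rule lives in `flag_high` and not in the stored data, changing the threshold or the field mapping later requires no reload of the raw records.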

Big Data Is Less Expensive

Due to its unstructured nature and open source roots, big data is much less expensive to own and operate than a traditional relational database. A Hadoop cluster is built from inexpensive, commodity hardware, and it typically runs on traditional disk drives in a direct-attached storage (DAS) configuration rather than on an expensive storage area network (SAN). Most relational database engines are proprietary software and require expensive licensing and maintenance agreements. Relational databases also require significant, specialized resources to design, administer, and maintain. In contrast, big data doesn’t need a lot of design work and is fairly simple to maintain. Substantial storage redundancy makes hardware failures easier to tolerate, and Hadoop clusters are designed to simplify the rebuilding of failed nodes.
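A toy simulation shows why redundancy makes failures tolerable. It assumes three copies of each block and a simple round-robin placement. Real HDFS placement is rack-aware and more involved, and the node names and helper functions here are hypothetical.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Round-robin placement: block i is copied to `replication` consecutive nodes."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed):
    """Blocks that still have at least one copy on a healthy node."""
    return [
        block for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(6, nodes)

after_one_failure = readable_blocks(placement, failed={"n2"})
after_three_failures = readable_blocks(placement, failed={"n1", "n2", "n3"})
print(len(after_one_failure))  # 6: every block survives a single node failure
print(after_three_failures)    # [1, 2, 3, 5]: blocks 0 and 4 lost all copies
```

With three copies per block, data becomes unreadable only when every node holding a given block fails at once, which is why a single failed commodity disk is a routine event and not an emergency.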


