Scientists who work with this kind of big data are searching for clues that could prevent or treat human diseases, or shed light on fundamental biological processes, so they need answers that are reliable and reproducible. This is particularly important in the new era of precision medicine, where doctors make decisions about what treatment a patient should receive based on genetic information.
“Biology is getting more and more computational,” says Cedric Notredame, group leader at the CRG. “Twenty years ago it was very expensive to do DNA sequencing, so there was very little data. You could look at a sequence on a piece of paper and analyse it by hand. Now it is so much cheaper and faster – there is much more data so we have to use computers to analyse it.”
But there are different software programmes running on different computers with different operating systems, and they do not always give the same answers from the same data. And because so many operations and data points are involved in these large-scale calculations, it can be almost impossible to pinpoint where a discrepancy has crept in, let alone fix it. What’s more, says Notredame, many people have not yet realised that this so-called computational instability is a problem at all.
“It was an epiphany when we realised there was so much computational instability – this was previously not known at all,” says Notredame. “It is a problem because we are moving to an era where drugs and diagnostics are based on genetic data – a computer will spit out a number and tell you your risk of a disease or which drug to take. But it is all based on ranking and probability, so even a tiny variation in output could have a dramatic impact on patients.”
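The instability Notredame describes can be demonstrated on any machine with a toy example. The sketch below (illustrative Python, not part of the CRG platform) sums the same ten numbers in two different ways; the answers differ only in the last decimal place, yet that is enough to flip a yes/no call against a cutoff:

```python
import math

# Ten identical "measurements": in exact arithmetic they sum to 1.0
vals = [0.1] * 10

naive = sum(vals)              # plain left-to-right floating-point addition
compensated = math.fsum(vals)  # error-compensated summation

print(naive)        # 0.9999999999999999 on standard IEEE-754 hardware
print(compensated)  # 1.0

# A tiny discrepancy is enough to change a threshold-based decision,
# e.g. whether a computed risk score crosses a diagnostic cutoff
threshold = 1.0
print(naive >= threshold, compensated >= threshold)  # False True
```

Real pipelines involve billions of such operations, carried out in orders that can vary between machines and software versions, which is why identical input data can yield subtly different rankings and probabilities.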
Some tech companies have tried to solve this reproducibility problem by building expensive bespoke data pipelines, which lock users into that particular software platform. But Notredame and his team took a simpler approach.
“We built an analysis platform using a technique called virtualisation, which effectively creates a simulated, identical virtual computer inside any machine,” he explains. “It’s exactly the same idea as the old 80s arcade game simulators that you can run on your PC, but on a much bigger scale.”
This ‘computer within a computer’ means that researchers can run any piece of software inside the virtual environment and get the same result, because their data will always be processed in the same way regardless of the physical machine that they are using.
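In practice, this kind of virtualisation is often achieved with software containers; Docker is one widely used tool, though the article does not name the exact stack the CRG group uses. A minimal, illustrative container recipe might pin the operating system and analysis tools so the environment is identical everywhere (the base image and tool names below are assumptions, not the group’s actual configuration):

```dockerfile
# Illustrative sketch only: base image and tool choices are assumptions,
# not the CRG platform's actual configuration.
FROM ubuntu:20.04

# Install the analysis tools from a fixed distribution release so every
# machine running this image processes data with the same toolchain
RUN apt-get update && \
    apt-get install -y --no-install-recommends samtools && \
    rm -rf /var/lib/apt/lists/*

# Any researcher can now run the same command in the same environment
ENTRYPOINT ["samtools"]
```

Building and running the same image tag gives the same toolchain regardless of the host’s operating system, which is the ‘computer within a computer’ idea in action.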
“We’re a small research group, so we needed to build a simple solution that could be easily used by everyone. And we couldn’t redesign all the software tools that we have – we wanted to keep running the programmes we’re used to using,” says Notredame. “Our solution is simple and cost-effective because we did it to solve our own needs.”