OK Computer


A new ‘virtual computer’ makes biological data-crunching more reliable.

Imagine you’re standing in front of a huge box containing hundreds of different calculators. You take the first one and type in “2+2=” and you get the correct answer: 4. You do the same thing with the next machine and get the same answer, and the next. You carry on testing all the devices, getting the same result every time. But then you get an answer you weren’t expecting: 5. With such a simple calculation it’s easy to see that something has gone wrong. If the input was correct, then the processor inside the machine must have made a mistake.

Now imagine you’re doing the same calculation with multiple enormous sets of genomic data, crunching millions of bits of information together with a computer. You get one answer from a huge Linux supercomputer in the basement of a research institute, but a slightly different one from a cloud-based server, and yet another solution from a Mac. So why are they different, and how do you know which is correct?

Big data, big problem

Scientists who work with this kind of big data are searching for clues that could help prevent or treat human diseases or shed light on fundamental biological processes, so they need answers that are reliable and reproducible. This is particularly important in the new era of precision medicine, where doctors make decisions about what treatment a patient should receive based on genetic information.

“Biology is getting more and more computational,” says Cedric Notredame, group leader at the CRG. “Twenty years ago it was very expensive to do DNA sequencing, so there was very little data. You could look at a sequence on a piece of paper and analyse it by hand. Now it is so much cheaper and faster – there is much more data so we have to use computers to analyse it.”

But there are different software programmes running on different computers with different operating systems, and they do not always give the same answers from the same data. And because there are so many operations and data points involved in these large-scale calculations, it’s impossible to figure out what’s gone wrong and how to fix it. What’s more, says Notredame, many people have not realised that this so-called computational instability is a problem at all.
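To make the idea concrete, here is a minimal Python sketch of one well-known way the same numbers can give different totals. The article does not say what causes the discrepancies it describes, so treat this as an assumption chosen for illustration: floating-point addition is sensitive to the order in which values are combined, and different machines or library versions may combine them in different orders. The values and the helper name `sum_in_order` are invented for the example.

```python
import math


def sum_in_order(values):
    """Add the values one by one, in exactly the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


# Illustrative values only: two very large numbers that cancel out,
# plus two small ones that should survive the cancellation.
values = [1e16, 1.0, -1e16, 1.0]

forward = sum_in_order(values)            # original order
reordered = sum_in_order(sorted(values))  # same numbers, sorted first
exact = math.fsum(values)                 # correctly rounded reference sum

print("original order:", forward)
print("sorted order:  ", reordered)
print("math.fsum:     ", exact)
```

The three totals disagree even though the inputs are identical. In a real analysis with millions of operations, a discrepancy like this would be far harder to trace back to its origin.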

“It was an epiphany when we realised there was so much computational instability – this was previously not known at all,” says Notredame. “It is a problem because we are moving to an era where drugs and diagnostics are based on genetic data – a computer will spit out a number and tell you your risk of a disease or which drug to take. But it is all based on ranking and probability, so even a tiny variation in output could have a dramatic impact on patients.”
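Notredame's point about ranking can be sketched in a few lines of Python. The scores below are made up for illustration and do not come from any real analysis or from the paper; the names `machine_a`, `machine_b` and `rank` are hypothetical. The sketch simply shows that when two candidates have almost identical scores, a variation in the seventh decimal place is enough to swap their order.

```python
def rank(scores):
    """Return candidate names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for the same three candidates, as they might be
# reported by two different machines running the same analysis.
machine_a = {"variant_1": 0.7312004, "variant_2": 0.7312001, "variant_3": 0.4100000}
machine_b = {"variant_1": 0.7312001, "variant_2": 0.7312003, "variant_3": 0.4100000}

ranking_a = rank(machine_a)
ranking_b = rank(machine_b)

# Largest disagreement between the two machines for any single candidate.
largest_gap = max(abs(machine_a[name] - machine_b[name]) for name in machine_a)

print("machine A ranking:", ranking_a)
print("machine B ranking:", ranking_b)
print("largest score difference:", largest_gap)
```

The scores differ by less than one part in a million, yet the top-ranked candidate changes. If the ranking decided which drug to recommend, the two machines would give different advice.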

Some tech companies have tried to solve this reproducibility problem by building expensive bespoke data pipelines, which lock users into that particular software platform. But Notredame and his team took a simpler approach.

“We built an analysis platform using a technique called virtualisation, which effectively creates a simulated, identical virtual computer inside any machine,” he explains. “It’s exactly the same idea as the old 80s arcade game simulators that you can run in your PC, but on a much bigger scale.”

This ‘computer within a computer’ means that researchers can run any piece of software inside the virtual environment and get the same result, because their data will always be processed in the same way regardless of the physical machine that they are using.

“We’re a small research group, so we needed to build a simple solution that could be easily used by everyone. And we couldn’t redesign all the software tools that we have – we wanted to keep running the programmes we’re used to using,” says Notredame. “Our solution is simple and cost-effective because we did it to solve our own needs.”

NextFlow to the rescue

Following the publication of the paper describing the new virtual platform, known as NextFlow, Notredame and his team decided to make it freely available for others to use. Thousands of researchers are downloading the system every month and many research organisations have adopted it, including the Pasteur Institute in France, the Sanger Institute in the UK, Sweden’s National Genomics Infrastructure, the Genome Institute of Singapore and the US National Institutes of Health.

A large international online community has also sprung up to pool ideas and share tools, supported by training workshops and hackathons held at the CRG, pushing the boundaries of what NextFlow can do.

“I love this technology because it is useful, but it’s more important that it solves a problem,” says Notredame, reflecting on NextFlow’s success. “Computational instability is a widespread issue, but there was no solution and you can’t correct for it. It’s very exciting to know that we have solved a problem that people had not even realised existed yet, but which could have become huge as we enter the Big Data era.”

Reference work

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C.

“Nextflow enables reproducible computational workflows.”

Nat Biotechnol, 35(4):316-319 (2017). doi: 10.1038/nbt.3820.