Tuesday, April 10, 2012

Building New Systems for Scientists

Three Good Places to Start

In my last post, I wrote about the need to build new data and software systems for scientists – particularly those working in the life sciences. Chemists, biologists, pharmacologists, doctors and many other flavors of scientists working in life sciences/medical research are potentially the most important (and under-served) technology end-user group on the planet. (If you are looking for a great read on this, try Reinventing Discovery: The Age of Networked Science.)

There are many ways to break down what needs to be done. But here are three ways to think about how we can help these folks:
  1. Support their basic productivity by giving them modern tools to move through their information-oriented workdays
  2. Help them improve their collaboration with others by bringing social tools to science/research
  3. Help them solve the hardest, most challenging problems they are facing by using better information/technology/computer-science tools
While all three of these are worth doing, #3 is the most tempting to spend time on, as the problems are hard and interesting intellectually.  

A great example in the life sciences currently is Next Generation Sequencing/Whole Genome Sequencing, where (with all kinds of caveats) one instrument generates about 75TB of data a year.  The cost of these instruments has dropped by an order of magnitude over the past five years, resulting in many decentralized labs just going out and buying them and getting busy with their own sequencing.  

One of the many challenges this poses is that labs often don't realize the cost of the storage and data management required to process and analyze the output of these experiments (my back-of-the-envelope estimate is at least 2X-3X the cost of the instrument itself). The worst-case scenario is when scientists point these instruments at public file shares and find those shares filling up in a matter of days, to the dismay of the other users who depend on them. Don't laugh: it's happening every day in labs all around the world.
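
To make that back-of-the-envelope estimate concrete, here's a toy calculation. Every price in it – instrument cost, replication factor, dollars per TB-year – is an assumed placeholder, not a quote; the only figure taken from this post is the 75TB/year output.

```python
# Toy cost model for NGS data. ALL prices below are illustrative
# assumptions, not vendor quotes; only the 75 TB/year output figure
# comes from the post.

def lifetime_storage_cost(tb_per_year, years, replication, usd_per_tb_year):
    """Cost of keeping every year's output online for the instrument's life.

    Data accumulates: by year y you are storing y years' worth of output,
    so total TB-years = tb_per_year * (1 + 2 + ... + years).
    """
    stored_tb_years = tb_per_year * sum(range(1, years + 1))
    return stored_tb_years * replication * usd_per_tb_year

INSTRUMENT_PRICE = 700_000   # assumed sequencer price, USD

cost = lifetime_storage_cost(tb_per_year=75, years=5,
                             replication=3, usd_per_tb_year=500)
print(cost)                               # 1687500
print(round(cost / INSTRUMENT_PRICE, 1))  # 2.4 -- roughly 2X-3X the instrument
```

Even with generous assumptions, storing the accumulating output with modest replication lands squarely in the 2X-3X range – before you pay anyone to manage it.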


The drop in cost-per-genome makes Moore's Law look incremental.

Anyway, there are hundreds of interesting scientific problems like NGS that require equally advanced computer science and engineering.  One effort towards providing a place where these things can develop and grow in an open environment is OpenScience.

But I'd like to focus more on #1 and #2 – because I believe that these are actually the problems that, when solved, can deliver the most benefit to scientists very quickly. They don't depend on innovation in either the fundamental science or information technology. Rather, they depend primarily on end users and their organizations implementing technologies that already exist and that, if deployed with reasonable thought and discipline, can have a huge positive impact on scientific productivity.

Here are a few examples.

Collaboration Tools

Social networking, wikis, Google Hangouts and other networking technologies make it easy and inexpensive to create more-flexible collaboration experiences for scientists. These aren't your father's collaboration tools – the kind of process-automation gizmos that most technology companies produced for businesses over the past 30 years. Rather, these new tools empower individuals to self-publish experimental designs and results, and to share them with groups of relevant and interested people, early and often. Think radically more-dynamic feedback loops on experimentation, and much more granular publishing and review of results and conclusions.

Much of what scientists need in their systems is collaborative in nature. If you are a researcher working in a commercial, academic or philanthropic organization, how do you know that the experiments you are planning haven't already been done by someone else – or aren't being run right now? If scientists had the ability to "follow" and "share" their research and results as easily as we share social information on the Web, many of these questions would be as easy to answer as "Who has viewed my profile on LinkedIn?"

Part of this depends on the clear definition of scientific "entities": Just as you have social entities on Facebook (Groups, Individuals, etc.), scientists have natural scientific entities. In the life sciences, these are the likes of Compounds, Proteins, Pathways, Targets, People, etc. If you do a decent job of defining these entities and representing them in digital form, with appropriate links to scientific databases (both public and private), you can easily "follow compound X." This would enable a scientist not only to identify who is working on the scientific entities he's interested in, but also to stand on the shoulders of others, avoid reinventing the wheel, and raise the overall level of global scientific knowledge and efficiency by sharing experiments with the community.
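
As a sketch of what such entities and a "follow" relation might look like in code – the entity IDs, user names, and the external database reference below are illustrative placeholders, not a real schema:

```python
from collections import defaultdict

# Minimal sketch of scientific "entities" with follow/share semantics.
# IDs, users and the database reference are illustrative placeholders.

class EntityGraph:
    def __init__(self):
        self.links = defaultdict(set)       # entity id -> external DB references
        self.followers = defaultdict(set)   # entity id -> users following it

    def register(self, entity_id, *db_refs):
        """Define an entity and link it to public/private databases."""
        self.links[entity_id].update(db_refs)

    def follow(self, user, entity_id):
        self.followers[entity_id].add(user)

    def who_follows(self, entity_id):
        """Answer 'who is working on / watching this entity?'"""
        return sorted(self.followers[entity_id])

g = EntityGraph()
g.register("compound:X", "chembl:CHEMBL25")   # hypothetical compound record
g.follow("alice", "compound:X")
g.follow("bob", "compound:X")
print(g.who_follows("compound:X"))   # ['alice', 'bob']
```

The point of the sketch is the query at the end: once entities are first-class and linked to the databases, "who else is on this compound?" becomes one cheap lookup instead of a literature search.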

Two start-ups that I mentioned in my earlier blog post, Wingu and Syapse, are creating Web services that enable this kind of increased collaboration and distribution of research in the pharma industry. Many large commercial organizations are attempting to provide this type of functionality using systems such as SharePoint and MediaWiki. Unfortunately, they are finding that traditional IT organizations and their technology suppliers lack the expertise to engage users in experiences that can compete for attention with consumer Internet and social tools.

I've watched this dynamic as various research groups have begun to adopt Google Apps. It's fascinating, because companies and academic institutions that adopt Google Apps get the benefit of the thousands of engineers working to improve Google's consumer experience, with those same improvements applied to the commercial versions – truly an example of the "consumerization of enterprise IT" (and credit to my friend Dave Girouard for the great job he did building the Enterprise group at Google over the past eight years).

One of the ways that Google might end up playing a large and important role in scientific information is due to the fact that many academic institutions are aggressively adopting Gmail and Google Apps in general as an alternative to their outdated email and productivity systems. They have skipped the Microsoft Exchange stage and gone right to multi-tenant hosted email and apps. The additional benefit of this is that many scientists will get used to using multi-user spreadsheets and editing docs collaboratively in real time instead of editing Microsoft Office documents, sending them via email, saving them, editing them, sending them back via email, blah...blah...blah.

If companies aren't doing the Google Apps thing, they are probably stuck with Microsoft – and locked into the three-to-five-year release cycles of Microsoft technology in order to get any significant improvements to systems like SharePoint. After a while, it becomes obvious that the change to Google Apps is worthwhile relative to the bottleneck of traditional third-party software release cycles – particularly for researchers, for whom these social features can have a transformational effect on their experimental velocity and personal productivity.

Another example of this dynamic is the competition between innovators like Yammer and Jive and Microsoft SharePoint. It's a great illustration of how innovators drive incumbents to change, but ultimately we'll see how the big enterprise tech companies (Microsoft, IBM, etc.) respond to the social networking and Internet companies stepping onto their turf. And we'll see if Microsoft can make Office 365 function like Google Apps. If Azure is any indication, I'd be skeptical.

Open-Source Publishing

First – a thank-you to Tim Gowers for the post on his blog. All I can say is YES!

In my opinion, the current scientific publishing model and ecosystem are broken (yet another topic for another post). But today, new bottom-up publishing tools like ResearchGate let scientists self-publish their experiments without depending on the outdated scientific publishing establishment and its broken peer-review model. Scientists should be able to publish their experiments granularly, instead of being forced to bundle them into long-latency, peer-reviewed papers in journals. Peer review is critical, but it should be a process of gradually opening up experimental results and findings through granular socialization – and it should not necessarily be tied to the profit-motivated and technically antiquated publishing establishment. I love the work that the folks at Creative Commons have done in beginning to drive change in scientific publishing.

One of the most interesting experiments with alternative models has been done by the Alzheimer Research Forum, or Alzforum. Its set of tools, known as SWAN, is a great example of the kind of infrastructure that could easily be reused across eScience. It makes sense that, in the face of a challenge as large as treating Alzheimer's, people – patients, doctors, scientists, caregivers, engineers – would work together to develop the tools required to share information and collaborate, regardless of the limitations of the established infrastructure and organizational boundaries. I know there are lots of other examples and am psyched to see them highlighted and promoted :)

Data As A Service

Switching to infrastructure for a second: One of the things that scientists do every day is create data from their experiments. Traditionally this data lives in lots of random locations – on the hard drives of scientific instruments, on shared drives within a lab, on stacks of external hard drives sitting on lab benches, or perhaps in a database that someone has set up within a lab.

I believe that one of the ways we can begin to accelerate the pace of scientific experimentation and collaboration is to give scientists a rational location for their data – a place for sharing it, conducting collaborative analytics, publishing derived results, and establishing the infrastructure and provenance required to begin producing reproducible results.

And, as we've seen with Next-Generation Sequencing, one of the challenges of science that depends on Big Data is that scientists are conditioned to manage their data and analytics locally. This conditioning creates problems when the data is large-scale (you can't afford to keep it all locally) and when you want to collaborate on its analysis (the data is too big to copy around to many locations). It also makes it harder to capture the provenance required to reproduce results.
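
Here's a small sketch of what "just enough" provenance could look like: a content hash of the inputs, the tool and its parameters, and a hash of the output, so collaborators can verify a re-run without shipping raw data around. The record format, tool name and parameters are invented for illustration; real systems keep far richer metadata.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """Content hash: identifies a data set without copying it."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(raw_data: bytes, tool: str, params: dict,
                      result: bytes) -> dict:
    """Record what went in, what ran, and what came out."""
    return {
        "input_sha256": fingerprint(raw_data),
        "tool": tool,
        "params": params,
        "result_sha256": fingerprint(result),
    }

raw = b"ACGTACGTACGT"     # stand-in for instrument output
result = raw[::-1]        # stand-in for an analysis step
rec = provenance_record(raw, "toy-aligner-0.1", {"seed": 42}, result)
print(json.dumps(rec, indent=2))
```

A collaborator who re-runs the same tool with the same parameters can compare result hashes instead of comparing – or copying – terabytes of data.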

One of the emerging trends in database systems is the "data-as-a-service" model. Data as a service essentially eliminates the complexity and cost of setting up proprietary Big Data systems, replacing them with databases run as services on scale-out infrastructure in multi-tenant and single-tenant modes. The most high-profile recent example is DynamoDB, Amazon's hosted key-value store.

One of the other well-developed data-as-a-service providers is Cloudant, which provides developers an easy-to-use, highly scalable data platform based on Apache CouchDB. Cloudant was founded in 2008 by three physicists from MIT, where they were responsible for managing multi-petabyte data sets for physics experiments on infrastructure such as the Large Hadron Collider. Cloudant's product was born out of its founders' collective frustration with the tools available for serving science. (Yet again, an example of the needs of science driving the development of technology that benefits the rest of us.)

One of the things that attracted me to working with the team at Cloudant was their continued interest in developing large-scale data systems for scientists, and their commitment to keeping the features and functions required by science end-users at the top of the company's priority list.

What other pockets of inspiration and innovation are you seeing in building new systems for scientists?  Please comment.

4 comments:

  1. Great post, Andy. Couple of thoughts it prompted:

    Google Apps: The only time this switch really works in the wild is when you're NOT already using the MS stack. Which means Academia can manage that switch, but most private companies can't/won't. (I know. I just tried to take my researchers and company through it, and we ended up backing away and happily sticking with Microsoft. Which surprised the heck out of all of us. But turns out: we're all really, *really* used to Outlook, and we get grumpy when we don't have it.) There are exceptions, sure, but the answer is either going to have to be for Google to deliver some of those richer features from the client experience, or for MS (and the IT functions) to become more scientist-self-service-savvy. I'm hoping for both results, actually. (I wrote an article for CIO.com on our pilot and decision making that I think is supposed to come out this month).

Secondly – as much as I love the *idea* of ResearchGate, for all the 1.4 million users it trumpets, it's pretty quiet out there (relatively speaking).

    In part, that's because when people say "let's design something social" today, they end up re-designing Facebook. Which is sort of useful, because people are already trained on how to use Facebook.

    The problem is, however, that most people don't WANT to use Facebook for science (or other professional collaborations). True innovation in this space will come from leveraging the data to foster collaboration, not the other way around. For example: Fold.it, or REBASE, or NCBI's genome database. In all of these cases, the data/problem/science is driving the collaboration, and there is active contribution. Even when we try and solve the problems with NewsGator/Google Hangouts/Facebook type environments and we've given people the space to collaborate, we remain disappointed when they don't magically start talking to one another.

Maybe these are more successful because they touch on the "data as a service" theme you mention, where there is a much clearer benefit to the contributor.

It's an interesting issue, and one that I'm thinking a lot about these days, as an opportunity to support both the scientists I work with and the scientists who are our customers and collaborators. I'm also convinced that there are plenty of examples from other industries that we can more actively leverage. (Data as a service? Try Netflix...)

  2. One approach to helping productivity is to use mashups that know scientific terminologies. An example is BioGPS from the Genomics Institute of the Novartis Research Foundation: http://biogps.org/

    By the way, if you combine Lexicon and Metastore you are not too far from something similar for NIBR ;-)

  3. Great post, Andy. Regarding collaboration tools: it's better to use a collaboration tool that has everything you need, so that you can work in one place.

  4. They say that science changes one funeral at a time, and I think one can make an analogous statement about technology adoption – except that in technology adoption, change is driven not by older companies dying (they often don't), but rather by the continuous wellspring of new companies looking to do things faster and cheaper.

    However, in science it is difficult and expensive to start a new venture, relative to most other industries. The "garage" phase of a pharmaceutical startup typically happens inside a large academic institution, and once the product is promising enough to be a company of its own, that company immediately has millions of dollars and a "senior" leadership team that has "their way" of doing things.

    Fair?
