Tuesday, April 3, 2012

It’s Time to Build New Systems for Scientists (Particularly Life Scientists)

Society and the Cambridge Innovation Cluster Will Benefit

Scientists are potentially the most important technology end-users on the planet. They are the people who are conducting research that has the potential to improve and even save lives. Yet, for the most part, scientists have been almost criminally under-served by information technologists and the broader technology community.

Interestingly, life sciences have had a tremendous impact on computer science.  Take, for example, Object-Oriented Programming (OOP), developed by Dr. Alan Kay. A biologist by training, Dr. Kay based the fundamental concepts of OOP on microbiology: “I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages (so messaging came at the very beginning - it took a while to see how to do messaging in a programming language efficiently enough to be useful).”

It’s time to return the favor. 

Pockets of Revolution and Excellence 

Of course, there are others who are trying to advance information technology for scientists, like visionaries Jim Gray, Mike Stonebraker, and Ben Fry. 

Throughout his career, Jim Gray, a brilliant computer scientist and database systems researcher, articulated the critical importance of building great systems for scientists. At the time of his disappearance, Jim was working with the astronomy community to build the WorldWide Telescope. Jim was a vocal proponent of getting all the world’s scientific information online in a format that could easily be shared to advance scientific collaboration, discourse and debate. The Fourth Paradigm is a solid summary of the principles that Jim espoused.

My partner and Jim’s close friend Mike Stonebraker started a company, Paradigm4, based on an open-source project called SciDB, to commercialize an “array-native” database system that is specifically designed for scientific applications and was inspired by Jim’s work. One of my favorite features in P4/SciDB is data provenance, which is essential for many scientific applications. If the business information community would wake up from its 30-year “one-size-fits-all” love affair with the traditional DBMS, it would realize that new database engines with provenance as an inherent feature can deliver better auditability than any of the unnatural acts they currently perform with their old-school RDBMSs.
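
To make the provenance idea concrete, here’s a minimal sketch, in plain Python rather than P4/SciDB’s actual interface (the class and function names are mine, purely illustrative), of what it means for an engine to record, for every derived dataset, the operation and inputs that produced it:

    # A minimal sketch of dataset-level provenance. Plain Python for
    # illustration; the names here are hypothetical, not SciDB's real API.

    class Dataset:
        def __init__(self, name, values, operation=None, inputs=()):
            self.name = name
            self.values = values
            self.operation = operation  # how this dataset was produced
            self.inputs = list(inputs)  # the datasets it was derived from

        def lineage(self, depth=0):
            """Walk back through the derivation chain of this dataset."""
            step = "  " * depth + self.name
            if self.operation:
                step += "  <- " + self.operation
            lines = [step]
            for parent in self.inputs:
                lines.extend(parent.lineage(depth + 1))
            return lines

    def normalize(ds):
        """Derive a new dataset that remembers where it came from."""
        peak = max(ds.values)
        return Dataset(ds.name + "_norm",
                       [v / peak for v in ds.values],
                       operation="normalize(max)",
                       inputs=[ds])

    raw = Dataset("assay_plate_7", [0.12, 0.48, 0.96])
    result = normalize(raw)
    print("\n".join(result.lineage()))
    # assay_plate_7_norm  <- normalize(max)
    #   assay_plate_7

With lineage recorded at the engine level, an auditor or a reviewer can ask “where did this number come from?” without reconstructing the analysis by hand.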

Another fantastic researcher who is working at the intersection of the life sciences and computer science is Ben Fry. Ben is truly a renaissance man working across art, science and technology. He’s a role model for others coming out of the MIT Media Lab and a poster child for why the Lab is so essential. The cross-disciplinary application of information technology to problems in science is perhaps the most value-creating activity that smart young people can undertake over the next 20 years. (At least it’s infinitely better than going to Wall Street and making a ton of dough for a bunch of people who already have too much money.)

Time to Step Up IT for Scientists 

But ambitious and visionary information technology projects that are focused on the needs of scientists are too rare. I think we need to do more to help scientists, and I believe that consumer systems, as well as traditional business applications, would also benefit radically.

As a technologist and entrepreneur working in the life sciences for the past 10+ years, I’ve watched the information technology industry spend hundreds of billions of dollars on building systems for financial people, sales people, marketers, manufacturing staff, and administrators. And, more recently, on systems that enable consumers to consume more advertising and retail products faster. Now the “consumerization of IT” that drove the last round of innovation in information technologies (Google, Twitter, Facebook, and so on) is being integrated into the mainstream of corporate systems. (In my opinion, however, this is taking 5 to 10 times longer than it should because traditional corporate IT folks can’t get their heads around the new tech.)

Meanwhile, scientists have been stuck with information technologies that are ill-suited to their needs, retrofitted from non-science applications and use cases. For many years, scientists have been forced to write their own software, and even develop their own hardware, in order to conduct their research. My buddy Remy Evard faced this problem, primarily in managing large-scale physics information, while he was the CIO at Argonne National Laboratory. Now he and I share the problem in our work together at the Novartis Institutes for Biomedical Research (NIBR).

When I say “systems,” I am talking about systems that capture the data and information generated by scientists' instruments, let scientists analyze that data electronically as they conduct their experiments, and ensure the “repeatability” of those experiments. (Back to the value of provenance.) With the right kind of systems, I believe we could:
  • Radically increase the velocity of experimentation. Make it much easier for scientists to do their experiments quickly and move on to the next experiment sooner 
  • Significantly improve the re-usability of experimentation – and help eliminate redundancy in experiments 
  • Ensure that experiments – both computational and wet lab – can be easily replicated 
  • Radically improve the velocity of scientific publication 
  • Radically improve the ability of peers to review and test the reproducibility of their colleagues’ experiments 
Essentially, with radically better systems, we would vastly improve the productivity and creativity of science. I also believe there would be immeasurably large benefits to society as a whole, not only from more effective and efficient science, but also as these systems improvements are applied to consumer and business information systems.
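
As a rough sketch of what “capturing” an experiment might look like, here is the kind of record such a system could write for every run. The field names are hypothetical, and a real system would populate most of them automatically from the instrument and the protocol database:

    # A sketch of the record a lab system might persist for every run so the
    # experiment can be replicated and audited later. Field names are
    # hypothetical; a real system would capture most of them automatically.

    import datetime
    import hashlib
    import json

    def record_run(protocol_id, instrument, settings, data_path):
        with open(data_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        run = {
            "protocol_id": protocol_id,  # which written protocol was followed
            "instrument": instrument,    # which machine produced the data
            "settings": settings,        # the exact instrument configuration
            "data_sha256": digest,       # fingerprint of the raw output
            "recorded_at": datetime.datetime.utcnow().isoformat() + "Z",
        }
        with open(data_path + ".run.json", "w") as f:
            json.dump(run, f, indent=2)
        return run

    # Example (assuming plate7.csv exists):
    # record_run("PROT-0042", "plate-reader-3",
    #            {"wavelength_nm": 450, "temp_c": 37}, "plate7.csv")

With the raw output fingerprinted and the exact settings recorded, a colleague (or a skeptical reviewer) can later verify that a replication followed the same protocol and produced the same data.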

How We Got Here 

So, what’s holding us back? 

Scientific applications are highly analytical, technical, variable and compute-intensive – making them difficult and expensive to build. 

Scientific processes don’t lend themselves to traditional process automation. Often the underlying ambiguity of the science creates ambiguity in the systems the scientists need, making development a real moving target. Developers need to practice extreme Agile/Scrum methods when building systems for scientists; traditional waterfall methods just won’t work.

Developers need to treat the core data as the primary asset and the applications as disposable or transitory. They must think of content, data and systems as integrated and put huge effort into the management of meta-data and content. 
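
Here’s one way to picture that data-first discipline: a minimal sketch, assuming a simple file layout with a metadata “sidecar” (the layout and keys are hypothetical), in which the dataset describes itself so that any short-lived application can interpret it without baked-in knowledge.

    # A sketch of "data as the primary asset": the dataset carries its own
    # description in a metadata sidecar, so any disposable application can
    # interpret it. The file layout and metadata keys are hypothetical.

    import csv
    import json

    def load_dataset(path):
        """Read a CSV whose meaning lives beside it, not inside app code."""
        with open(path + ".meta.json") as f:
            meta = json.load(f)  # units, column descriptions, ownership
        with open(path) as f:
            rows = list(csv.DictReader(f))
        return meta, rows

    # meta, rows = load_dataset("assay_plate_7.csv")
    # print(meta["columns"]["od450"]["units"])  # e.g. "absorbance units"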

Great systems for scientists also require radical interoperability. But labs often operate as fiercely independent entities. They often don’t want to share. This is a cultural problem that can torpedo development projects. 

These challenges demand high-powered engineers and computer scientists. But the best engineers and computer scientists are attracted to organizations that value their skills. These organizations have traditionally not been in life sciences, where computer scientists usually rank near the bottom of the hierarchy of scientists. 

When drug companies have a few million dollars to invest in their lab programs, most will choose to invest it in several new chemists instead of an application that improves the productivity of their existing chemists. Which would you choose? Some companies have just given up entirely on information systems for scientists.

So, scientists are used to going without – or they resort to hacking things together on their own. Some of them are pretty good hackers [watch]. But clearly, hacking is a diversion from their day jobs – and a tremendous waste of productivity in an industry where productivity is truly a life-and-death matter. 

No More Excuses 

All of this is completely unnecessary, given the changes in technology. We can dramatically lower the cost and complexity of building great systems for scientists by using Web 2.0 technologies, HTML5, cloud computing, software-as-a-service application models, data-as-a-service, big data and analytics technologies, social networking, and more.

For example, with the broad adoption of HTML5, apps that used to require thick clients can now be built with thin clients. So we can build them more quickly and maintain them more easily over time. We can dramatically lower the cost of developing and operating flexible tools that can handle the demanding needs of scientists. 

Using Web technologies, we can make scientific systems and data more interoperable. With more interoperable systems based on the Web, we can capitalize on thousands of scientists sharing their research and riffing off of it – whether inside one company or across many. There’s tremendous benefit to large scale in the scientific community – if you can make it easier for people to work together. 
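
As a sketch of that interoperability, using nothing but Python’s standard library (the endpoint and payload shape are hypothetical), here is a tiny lab service that exposes experiment records over plain HTTP and JSON:

    # A sketch of web-native interoperability using only the standard
    # library. The endpoint and payload shape are hypothetical; the point is
    # that HTTP and JSON, not a proprietary client, form the interface.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    EXPERIMENTS = {
        "exp-001": {"protocol_id": "PROT-0042", "status": "complete"},
    }

    class LabAPI(BaseHTTPRequestHandler):
        def do_GET(self):
            # Serve /experiments/<id> as JSON.
            exp_id = self.path.strip("/").split("/")[-1]
            found = exp_id in EXPERIMENTS
            body = json.dumps(EXPERIMENTS[exp_id] if found
                              else {"error": "not found"})
            self.send_response(200 if found else 404)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())

    # HTTPServer(("", 8080), LabAPI).serve_forever()
    # A collaborator then needs only:
    #   curl http://localhost:8080/experiments/exp-001

Any group, inside the company or out, can consume this with curl or a browser; the web itself becomes the integration layer.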

Many of my friends have been asking me why I’ve been spending so much time at the Novartis Institutes for Biomedical Research (NIBR) over the past three years. The simple answer is that I believe that building systems for scientists is important, and I find it incredibly rewarding.

One of the lessons I learned while we were starting Infinity Pharmaceuticals was that while scientists needed much better systems, small biotech companies didn’t have the resources to build those systems. So, every time we looked at spending $1 million on a software project at Infinity, we decided it was better to hire another five medicinal chemists or biologists. The critical mass required to build great systems for scientists is significant, and arguably even exceeds the resources of a 6,000-plus-person research group like NIBR. The quality of the third-party solutions available to scientists and science organizations such as NIBR is pitiful: there have been only a handful of successful and widely adopted systems, Spotfire being the most notable example. Scientists need better third-party systems, and delivering these systems as web services might finally provide a viable business model for "informatics" companies. Two great new examples of these types of companies are Wingu and Syapse; check them out.

Calling All Cambridge Technologists 

So here’s my challenge to the technology industry: How about investing $3 to $4 billion in systems for scientists? And have the Cambridge Innovation Cluster take the lead?

If you work in Kendall or Harvard Squares, you can’t throw a Flour Bakery & Café sticky-bun without hitting a scientist. It’s one huge research campus with millions of square feet of laboratory space and scientists everywhere. 

Get engaged with scientist-as-the-end-user – they’ll welcome you, trust me. Build something in one of these labs. If you build something compelling in a thoughtful way, it’s going to be noticed, sought and adopted by others. 

Since Cambridge has one of the highest concentrations of scientists in the world, technologists here should focus on building systems for scientists. I’m betting they can do it better than anyone else.

What do you think?

9 comments:

  1. Very nice article Andy - congrats. I think that one of the factors impeding the development of effective systems for the life sciences is the vast complexity of biological systems and their heterogeneity. Biology is "messy" (for want of a better word) and the computational approaches that have worked well in other "cleaner" fields (often founded on traditional mathematical and analytic approaches) have not translated so well to biology. A suspension bridge is much easier to describe in a set of ODEs than a living cell!

  2. Great post, Andy. Years ago I founded a company with a microbiologist to create StrainMan, an inventory application for managing biological strains. I too found it rewarding to work with scientists. Do we really need another narcissistic, curated, mobile check-in app? I say let the west coast work on those kinds of apps.

    Bob Mancarella

  3. Hi Andy,

    Couldn't agree with you more. Research is the apex of knowledge-work!

    We're answering the "Call to Arms", the name of Jim Gray's seminal post on the convergence of data and code processing (http://queue.acm.org/detail.cfm?id=1059805), and we've picked up where Alan Kay left off (http://queue.acm.org/detail.cfm?id=1039523).

    Object Orientation has lost the thread. SOA collapses under middleware bloat and indirection. IT is getting cheaper, but apps are getting more brittle. How did folks forget IT is all about the business?

    The biggest challenge of the 21st century is 'making sense' of distributed and heterogeneous code and data in real-time.

    We've gone 'Back to the Future', taking the strong ideas of early Declarative Programming (Prolog, LISP, Simula) and cross-pollinating them with REST architecture for a "FunctionalWeb" - "A Web of Things that does stuff".

    Uniform treatment of code and data eliminates the artificial barriers between OLAP and OLTP - in our solution (www.ideate.com) an agent dynamically generates a service interface for diverse workloads.

    A constraint-based 5GL - a horizontal app framework for the horizontal cloud. Lightweight, scalable, stable, fully dynamic, extensible and adaptable.

    Get rid of the stovepipes, the future is all about cross-cutting concerns - system-wide optimization so the whole is greater than the sum of the parts.

    Our Lean Startup is already profitable and working with Research Institutions on four continents.

    The Future is already here, it's just not very evenly distributed - but we're working on that ;)

    Best,
    Dave Duggal

  4. Nice post, Andy. It sums up the collective drive we share at NIBR to help scientists discover novel medicines and address unmet medical needs. Financial institutions enjoy mature enabling systems; I wonder why the life sciences are so far behind. It would be interesting to know where the industry went wrong and why there has been such a lack of investment and global tech development in the entire BioIT space.

  5. Great post, I agree fully with the need. I am working on similar problems in Seattle... farther away from many scientists but right at the center of the cloud computing revolution. At Sage Bionetworks we are building an open platform for data sharing and collaborative data analysis called Synapse (http://synapse.sagebase.org). Many of the themes you mention here apply directly to our efforts: leveraging cloud computing, capturing analysis provenance, the centrality of data.

    In my view, we need a GitHub-like platform for biological data scientists. More on my personal blog at http://wp.me/p2faIU-y

  6. Hi Andy: This is what I got out of your post. Please feel free to lambaste me if I am completely off target. Are you proposing that the development community come up with a tool that is to Matlab, for instance, what Tumblr is to HTML/CSS, or what Google is to Map/Reduce?

    I know I am oversimplifying, but unless you guys in charge can change the problem of getting more dollars appropriated for the tools, the solution may be to get the "non-hackers" in the labs to become the pseudo-hackers. Not all PhD candidates, research interns, and younger life science grads will come with a comp sci background, or personal hobby. But if it was intuitive, then everyone becomes the basic "hacker" that you referred to. Scientists need for Matlab/Java what the consumer got for HTML/CSS with Tumblr.

  7. YES - that is one of the key things. Making information tools more accessible, especially stats and modeling tools, is right in line with what I think we need.

    Replies
    1. So you are really asking for the consumerization of Big Data IT: a platform with an intuitive interface for firmware manipulation and configuration, middleware manipulation and configuration, ETL, and analytics. A service where the UI is intuitive enough (in reality) for a "true novice" business user to, for instance, leverage the power of the resources available in a GitHub (or Synapse, as Mike pointed out) and integrate it with the power of a Vertica (which has a community license [thus accessibility isn't the overriding issue]). The way Khan Academy allows ANYONE to become an educator, a platform that allows ANYONE, and I mean ANYONE, to become a basic computer programmer.

      Maybe we need the "Khan Academy" of technology innovation (or the Palmer Academy in this case, as you are one of the people with the know-how and vision to make something like that happen): a set of "how to" tutorials, perfectly organized in an extremely intuitive interface, that can turn ANY basic user, biological intern, etc., into a programmer, without any real background other than functional understanding. Instead of 10-minute tutorials on basic math and other subjects, 10-minute tutorials on basic "hacks" for anything and everything, especially where access to a particular resource is free or open source.

      I mention the Khan Academy specifically because of the tight controls Sal kept to ensure the quality of the service. He is not allowing it to expand out of control (yet). But man, how awesome would it be if we had that kind of repository from the other Sals of the world, in anything and everything, organized into perfectly indexed, organized and searchable libraries, with the quality of Khan Academy.

  8. Do you think Software Carpentry (http://software-carpentry.org) is teaching scientists the skills they need to build tools themselves? Or do you think there are other things they need to know in order to either roll their own, or work more closely with software developers, and if so, what?
    - Greg Wilson (gvwilson@software-carpentry.org)
