challenges in scientific software design

Uncertainty about where your life is going next is tough to deal with. You can sit around and wait for something to happen to you. You can prepare for a particular outcome and then have that be nothing like what you prepared for. Or you can prepare for multiple different plausible outcomes, spread your already-overcommitted energies even thinner, and still have nothing like any of your anticipated alternatives come to pass. I’m starting to better appreciate the extent to which this has driven large parts of my career.

My own history as a programmer

I’m pretty well known within the astronomy community as a software designer — at least, I would say, well enough to be recruited for postdoctoral positions and invited for talks at conferences based on my large-scale data pipeline design expertise, rather than my scientific accomplishments. This is a frustrating thing.

Computers and programming have always been a familiar thing to me: my family got an Atari 800XL when I was around eight years old, which featured programming in BASIC and offline storage on cassette tapes. A succession of newer and more sophisticated home computers, attendance at a high school with a Computer Systems research lab and advanced computer science coursework such as Systems Architecture and Artificial Intelligence, and ongoing exposure through self-study meant I could teach myself pretty much any language and do sophisticated things in it quickly. Together with friends I was designing computer games with simple sprite graphics and exploring object-oriented programming and parallel programming in high school. I was way ahead of the curve as far as computers were concerned; I had every advantage.

Starting in undergrad, I really bent my energies more towards physics and math, and although my computer skills were still coming in handy, I updated them only occasionally or as needed. I learned proper object-oriented C (as C++) in grad school, picked up Perl and SQL in my first postdoc and Python in my second one. Here at ANU I’ve learned object-oriented Python, and gotten some experience with Django and CSS. But it’s mostly been little bits of things here and there. The overriding concern in science is getting papers out. What’s more, in my postdocs so far I have usually been forced by necessity to maintain pre-existing software systems that were never really designed so much as accreted, with the accompanying labor overhead associated with maintaining something that wasn’t written to be maintained.

At first I took it personally and blamed specific people for my poor luck. But I’ve come to realize that I’m probably only marginally better, when it comes to really big systems like the ones that make surveys like SkyMapper go forward. The design for the SkyMapper pipeline isn’t terrible; Fang Yuan and I have put a lot of thought into it, and we’re both personally responsible for the fact that it works, and works to spec. But I wonder how much better it could have been if we had both been trained in software design: good coding habits, best practices for design, project management methodologies like Agile, and so forth.

I’m now worrying about this from a career management point of view: I’ve always imagined that I could transfer to software engineering if physics didn’t work out, but this may not be as easy a transition as I had at first imagined. The fact is that opportunities to learn from real software experts as a scientist are probably not that hard to come by. But at least in my past experience, we were rarely encouraged to consult and learn from those people, and perhaps tacitly discouraged from doing so by the incentives at work in our employment environments.

Competing incentives

Readers who are experienced with software, let me know whether the following claims are reasonable. As a software developer, assuming your workplace is reasonably functional, your task is clear: design a software product that works, on time and on budget. If your workplace is super-functional, you might also concern yourself with the problem of designing a software product that works well and is easily maintainable and extensible into the future. But you’re being paid to do one thing, and doing that thing exceptionally well will presumably also advance your career and allow you to move to positions of greater influence and responsibility.

In science, if your workplace is functional you’ll have to split your energy between writing software for an experiment and publishing results associated with that experiment. If your workplace is dysfunctional you can easily spend your whole time writing software, which is what you were nominally hired to do, but spend virtually no time publishing first-author papers, which is what you need to do to get your next job. If your experiment is small and may never be repeated, you have no accountability to anyone but yourself to make sure that your code is well-documented, reusable, or maintainable. While it is important to large, ongoing projects (such as wide-area sky surveys) that software be designed well, in practice you rarely get the eyes of a real expert under the hood and so the design can be pretty horrendous. You might know how it works, but heaven help the person who comes after you. Any time you spend developing skills to write good software takes precious time away from writing the papers you’ll need to get your next job. Or even just getting the code working so you can show more tangible results to your supervisor — a plot made, a supernova discovered. The breakneck speed both of scientific discovery and technological progress threatens to keep our view short-term, looking only one project or one job ahead.

Moreover, because you’re learning to write and maintain bad software without the domain knowledge typically used in industry, your skills may not be as immediately transferrable as you think. I’ve been reading about the coding interviews used in top software firms such as Microsoft, Google, and Amazon; similar interview practices are in place for “Data Scientist” positions, not just software engineer positions. They involve (among other things) coding basic data structures and algorithms such as linked lists, hash maps, binary search trees, and quicksort, at a whiteboard under time pressure in front of an interviewer; I haven’t had to do things like that since AP Computer Science in high school. I know what all these data structures are, but (haha) isn’t this kind of tedious quiz-work what the standard library implementations are meant to free you from? I worry about the efficiency of my code, but usually only if it’s slower than I need it to be to solve the problem I’m trying to solve, and that’s a pretty loose criterion. I try to comment my code well, and to make sensible class hierarchies, and to only check working versions of things into the revision control system; I would say this is probably more than about 90% of my scientist colleagues do. But I don’t usually give myself time to implement well-known design patterns like Model-View-Controller. Or write unit tests. Or use distutils to make a Python module that can be easily shared or distributed on GitHub.
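
For readers who haven’t seen one of these exercises lately, here is a minimal sketch of what the whiteboard version of quicksort, plus the kind of unit test mentioned above, might look like in Python. The names `quicksort` and `QuicksortTest` and the test cases are illustrative assumptions, not code from any pipeline discussed in this post.

```python
import unittest


def quicksort(items):
    """Return a new sorted list; a simple, non-in-place quicksort."""
    if len(items) <= 1:
        return list(items)
    # Use the middle element as the pivot (one common, simple choice).
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    # Recursively sort the two outer partitions and stitch the result together.
    return quicksort(smaller) + equal + quicksort(larger)


class QuicksortTest(unittest.TestCase):
    def test_matches_builtin_sorted(self):
        # Compare against the standard library on a few small cases.
        for case in ([], [1], [3, 1, 2], [5, 5, 1, 0, -2], [2, 2, 2]):
            self.assertEqual(quicksort(case), sorted(case))

    def test_input_not_modified(self):
        # The function returns a new list and leaves its argument alone.
        data = [3, 1, 2]
        quicksort(data)
        self.assertEqual(data, [3, 1, 2])
```

Saved as a module, the tests can be run with `python -m unittest`. In day-to-day work the built-in `sorted` is the right tool; the sketch is only meant to show the scale of the exercise.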

In short, I know some of the things I’ll need to know but have only ever taken a very short-term view towards maybe picking them up someday. If I’m looking to make a career switch, now’s probably the time, and I can certainly hope that the act of doing so will accelerate the software I have to write and share. But let’s not pretend that’s easy, or guaranteed.

The factors I’ve mentioned above are slowly changing. Senior scientists are beginning to realize that both the needs of modern, computationally intensive science and the career landscape for scientists are very different from when they were students and postdocs. The younger generation is adopting more best practices of software design, and is aggressively promoting open-source science as well as open-source software infrastructure for science. Some groups are clear models for effective software production in astronomy, and the topic is also discussed on popular professional blogs. Young scientists ignore these resources, and the trends they represent, at their peril.

How this is affecting my plans going forward

I’m still holding out hope for an academic career. Fortunately, since much of what I do is closer to software engineering than the typical postdoc, there’s probably time for me to make some adjustments to be more competitive in an industry setting. What’s more, I may be able to chart a course that does double duty — e.g., publishing papers about applications of machine learning to problems faced by next-generation astronomical surveys. It’s important that I get my big science results out as well — that’s mostly going to happen this year.

I also intend to make a big push this year to publish any piece of code that’s worth sharing. This also will benefit both my scientific and software careers:

As a scientist, putting my code out there will encourage other scientists to use it, make my process more transparent, get my work cited and increase its impact. David Hogg argued strongly at the last winter AAS meeting that the benefits of releasing code publicly far outweigh the disadvantages, for these and other reasons. Many people argue that if they release their code to the public, others will scoop them. But in general, putting your name out on it means you’ve established priority — others will see you as the expert and not try to scoop you (in general)! This might not be true for certain very competitive subfields, but in general, people want to scoop each other on Nature-style results, which I would bet generally aren’t the results enabled by releasing code. Regarding my specific code and the problem it was designed to solve, I want to turn e.g. my bolometric light curve fitter into a general-purpose supernova analysis toolkit, and I can leverage the contributions of others if I release my code (along with a methods paper targeted for PASA, which others can cite when they use or extend my code). The real scientific value is no longer in the code itself, but in the way it’s used — the problems to which it’s applied, and (in the Bayesian analysis case) the particular priors used for the modeling. If these problems and assumptions were obvious, we would already know a lot more about supernovae than we do; people out there have good data, but they’re using the same old assumptions all the time, when a little more thought would go a long way.

As a software developer, making my code open-source puts it out there for potential employers to see. It also puts it out there for other software developers to see, from whom I can learn some of the abovementioned finer points of the software design craft through hands-on problem solving, as what would otherwise be pretty ugly code becomes improved. (“If you’re not embarrassed, you took too long to release it!”) If I’m really lucky, some of them might even have future job leads — but let’s not count those unhatched chickens just yet.

About Richard

I'm an American scientist who is building a new life in Australia. This space will contain words about science and math, but also philosophy, policy, literature, my travels, occasional rants, all sorts of things I find strange and awesome. The views expressed in this blog do not necessarily reflect the opinions of my employer at the time (currently University of Sydney), though personally, I think they should.
This entry was posted in Astronomy, Career, Technical.

7 Responses to challenges in scientific software design

  1. “paid to do one thing, and doing that thing exceptionally well will presumably also advance your career” .. yeah, no. All software is a mess of bolt-ons, “how the hell did this ever work”, dirty hacks, and half-baked ideas. In all but a tiny sliver of software development (that is, small web dev shops that actually follow an agile methodology) you spend almost all of your time battling that problem. And career advancement is based on the same BS in software as in any other field. So the skills you’ve honed — minimizing unnecessary work, balancing development time with competing demands, and so on — are precisely the skills you need in the real world.

    The interviews you mentioned are there to see you think about code on your feet. They separate the person who crammed for their masters in CS with Java flashcards from someone who thinks fluidly about computing, code, data, and so on. Someone who writes perfectly optimized code all the time is wasting company resources — in that interview they’ll want a correct solution first, then to hear you talk about how you might optimize when it becomes a bottleneck.

    As for “open-sourcing” — it’s not a verb, any more than you could take a pile of data and “science” it into a Nature article. It’s a way of operating, when you have a group of people with similar interests who can agree to contribute to their common good. So an open-source tool that nobody ever downloads or which can’t be used or adapted successfully isn’t really any good. An open source project needs a little bit of additional work from everyone to make the on-ramp, both for users and developers, a little easier than “here’s a tarball good luck”. That might be as simple as a setup.py file and a DEVELOPING.txt that gives a one-page overview of the code’s organization. And it’s also the sort of thing which coders not familiar with the heart of the algorithm can easily hack on. As an example, I worked on the build system for http://proto.bbn.com/ while knowing very little about the language’s insides or what cool stuff Jake’s doing with it.

    You are *absolutely* right about open-source being visible to employers. Every time I’ve interviewed someone who might spend any time coding, I’ve looked them up on GitHub, Bitbucket, ohloh, etc., and asked for pointers to code they’re particularly proud of. That tells me a lot about their abilities and makes a great place to start interview questions.

    • Richard says:

      A couple of other things I thought of in the car:

      First, maybe I made it sound above as though I thought any code I published would automatically be useful. I don’t actually believe this, but I do believe I have several applications for which there would be a wide user base; I believe this because I’ve been watching what others have been doing in the literature, and publishing my own papers and giving talks that advertise the benefits of my particular approach to problems that crop up pretty frequently in the study of supernovae. I would definitely make the release more than a standard tarball; a working setup.py, good docstrings, a README, and an open-source license which keeps the code free and allows others to use it are (as you say) probably the bare minimum (though the rest need not be overly pretty).

      Second, while I’m sure I’ve been building useful skills doing what I’ve been doing, I might have been arguing against a straw man above when I said programming skills might not transfer easily (though I still think there’s something to that). I’ve often heard the phrase “you won’t need to worry about [not] getting a job”. This probably means “you have a lot of transferrable skills and making a transition to a new career will be easier for you than for many others”, which is more or less what you said, and which may or may not be true depending on what kind of career I want to transition into. But in my head I’ve somehow been translating that as “you don’t need to actively manage your career because your skills will transfer out of the box and employers will beat a path to your door”, which is demonstrably false.

  2. Richard says:

    Did I mention that science-ing helps you transition to new actionability paradigms? :)

    Thanks for the candid feedback. Hopefully “publish” is an appropriate verb in this case… And I’m sure that each employment sector has its own distinctive odor of BS, but I had it in my mind that the skills profiles in other sectors might not be so disparate as in my situation, and that one wouldn’t be encouraged to spend such large amounts of time doing things that were valuable but would eventually result in getting fired anyway. If one or both of those assumptions is wrong, then I guess I haven’t really missed much by working in academia, except maybe more pay and a different (not necessarily better) pressure schedule for switching jobs.

  3. jefflassahn says:

    As someone who works in the software industry, my first reaction to a big chunk of this is “how cute! He thinks we know what we’re doing!” It may be true that there are a few software companies that only hire people who always build exceptionally designed code, and have working development processes in place that result in clean designs etc. I’ve never actually seen one, though. Everything I’ve seen (including a few glimpses into the big guys like Google and Microsoft) is a herd of people in a giant hurry, unsure of their own skills, trying to hack something together that works.

    My experience is that if you can write a piece of software that works without lots of assistance, you’re in the leading half of the pack, and you can get a job without too much trouble. Whether it’s a job you like, one that advances your career in a way you find satisfying, is a much harder question. But it at least means you don’t have to plan around the fear of not being able to live and pay rent.

    • Richard says:

      Thanks Jeff. I’m getting a definite sense from other sources that most people in software are totally just winging it, as you say. Jobwise, it’s a good time to be in software, which is currently eating the world; I’m trying not to think too hard about the post-Singularity future where even the software people are out of jobs. That possibility sounds like the sort of tofu-brain thing you and me and Josh and Alex would all talk about years ago, but seems much more real and disturbing to me these days.

  4. Joshua O'Madadhain says:

    I was going to say that I’m finally getting around to posting a response, but then you and Jeff appear to be posting from THE FUTURE.
    (And then I remembered that since you live in Australia, you are perpetually living in THE FUTURE from my perspective. That’s got to be worth a second look for a resume right there. :) )

    I will reiterate what others have said above. I have the privilege of working with the most highly skilled software engineers I’ve ever worked with, by which I mean that the least capable of them could land a job anywhere else I’ve worked without worrying too much about it (and the most capable are kind of scary). And I don’t know of any of them whose work I know who hasn’t made a mistake, or cut some corners. Heck, this last Friday I spent writing a document whose sole purpose is to make sure that my compatriot and I stop doing things backwards (seriously, literally, backwards in at least one sense of the word) on our current project. Granted, it’s a very complex code base…but then, it’s a complex code base not just because it’s trying to encapsulate a great deal of business logic but because we don’t always make changes in the most future-proof ways.

    I can testify that being able to demonstrate the following:
    * significant contribution to a code base whose source is available for inspection
    * ability to think on my feet about novel problems and come up with workable solutions reasonably quickly
    has been beneficial to my career in this field.

    If you want to interview for a software position, it’s a good idea to be sure that you recall the principles of basic ADTs (list/stack/[priority] queue/tree/graph/hashtable) and some of the more common sorting algorithms, but the kinds of questions that I use in interviews tend to be more off the beaten track: design questions and requests to implement algorithms that may be unfamiliar and that are (I suspect, and by design) almost never asked for implementation by other interviewers.

    As an interviewer, I don’t care if you know how to implement quicksort, I care if you can hack together something that works, deal with your mistakes if you make any, and at least be able to discuss intelligently how you might make it faster/more efficient/more scalable. I don’t care if you use the same design I would, I care if you can explain your design and go about constructing it in a reasonable way, and understand its tradeoffs.

    I think that you’ve likely learned a lot of valuable lessons about how to design code, and systems, that work well. If you can articulate the lessons that you’ve learned, and show that you know how to apply them to new problems, I’d say you’re in good shape.

    • Richard says:

      Thanks Josh. I think my first priority right now is, and has to be, learning more about the landscape and figuring out where I really want to end up if astrophysics doesn’t work out, so I might knock on your virtual door sometime in the next year to just ask questions, the answers to which I hope you won’t have to kill me if you tell me. Hopefully that will then make next actions much clearer, and with clarity of purpose I expect any transitions I may need to make will become that much easier.
