Tuesday, January 7, 2014

Bashing Through Bioinformatics (Part 1)

Every now and then, my brother John and I decide to do an online course together. Recently, it was a refreshingly good course on Bioinformatics with Pavel Pevzner. As a dedicated masochist, I decided: why not try to do this class just by shell scripting in Bash?

What is Bash? That's a question I'm not entirely sure I'm fit to answer. The succinct but potentially ambiguous answer is that Bash is a type of UNIX shell. To be incrementally more transparent, a shell is a language of commands and an interpreter of those commands; another name for a shell is a "command-line interpreter." Your job is to know the command-line language; the shell's job is to tell the operating system what you want it to do.

But Bash is more than a set of commands. It's also a scripting language: for loops, if statements, string operations, basic calculations... Interestingly, some of the commands (perhaps more aptly called "tools" or "utilities") available within the Bash environment are scripting languages themselves, like Awk, R, or Python. For general-purpose programming, languages like Python are often more powerful and flexible than Bash, and so, as a scripting language, one might write Bash off. And yet the power of these other languages can easily be accessed and exploited from the Bash environment in a variety of ways (piping, heredocs, batch processing, etc.). And so the line between what is and is not Bash becomes blurry to me: where does Bash end, and where do the other features of the Bash shell environment begin? Bash is not Python or R, yet Bash can be used to seamlessly glue Python and R commands together.
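As a rough illustration of the piping and heredoc glue mentioned above, here is a minimal sketch. The function name `gc_report` and the three sequences are invented for this example, and Awk stands in for the "other language" being driven from the shell.

```shell
#!/usr/bin/env bash
# Hypothetical example: names and data are made up for illustration.

# Read "name sequence" pairs on stdin, print each name with its GC fraction,
# highest first. Awk does the arithmetic; sort orders the result.
gc_report() {
    awk '{
        s = $2                        # work on a copy of the sequence
        gc = gsub(/[GC]/, "", s)      # gsub returns how many G/C it removed
        printf "%s %.2f\n", $1, gc / length($2)
    }' | sort -k2,2nr
}

# The heredoc supplies inline data, so no temporary file is needed.
gc_report <<'EOF'
seqA ATGC
seqB GGCC
seqC ATAT
EOF
```

Run as written, this prints `seqB 1.00`, `seqA 0.50`, and `seqC 0.00`, one per line. Bash itself only wires the pieces together: the heredoc feeds Awk, and the pipe hands Awk's output to `sort`.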

About two years ago my brother was learning to program and would occasionally ask me a question, sometimes about the shell environment. The questions were simple ones, like how to compress a data file. At that point in time, my familiarity with the shell environment was pretty basic. I wasn't using shells for anything much more advanced than navigating my file system, making and removing folders ("directories" in shell parlance), and beginning sessions in languages like Python or IDL. I used a UNIX shell as a means to an end: open up Terminal or X11, type in "python" or "idl", and--BANG!--the shell had served its purpose. I'm exaggerating a little (e.g., I used FTP and SSH during that time), but overall I didn't demand much of the command line, and I didn't know you could.

My perspective on shells and shell scripting began to change dramatically when, some time last year, John began talking about all these weird things, like Awk and Sed. I kindly ignored him so I didn't have to consider that he knew more than I did... Then he made some shell scripts in Bash to grab stock data from Yahoo! Finance and do some basic analyses.

My mouth hung agape, imaginary cigarette dangling. My cup of coffee was conveniently positioned to fall on my lap at this exact moment, forcing me to spit whatever coffee was in my mouth onto my brother's face.

You see, the purpose of the scripts (what the scripts did) didn't surprise or wow me. This was the same type of stuff I did in MATLAB for my day-to-day research. But these scripts weren't written in MATLAB or Java or C++ or any of the other languages I was aware of... They were written in Bash. That was new to me. It had never occurred to me that I could stay right in the shell and script something useful. To me, the shell seemed clunky and primitive, a relic of times past. This is embarrassing to admit, but it's true.

For the first time in a long time, there was something that John and I could really talk about: programming and, hey what the hell --- finance. I studied physics. John studied bio. Programming and finance seemed like a good intersection: programming because we both liked it and finance because it was an equal-footing domain where neither of us enjoyed much expertise (no ego to get in the way).  Although we have never really discussed finance again, we became interested in exploring ways to exploit my physics and mathematics background for applications in molecular biology and genomics research, which brings us back to the opening of this discussion: that I'm a masochist who wished to script bioinformatics algorithms purely in Bash.

Before writing any further, because of the inherent ambiguity of where Bash begins and ends, here are the basic rules I followed:

  1. Basic Awk usage was allowed, despite Awk itself being considered a scripting language. "Basic" because that's about my pay grade with Awk, but more importantly because Awk has some advanced features that I wanted to keep myself from using, e.g., multi-dimensional arrays. One of my goals was to hack up more complex data structures than those native to Bash (i.e., almost anything more complex than a 1D array).

  2. Other scripting languages that can double as command-line tools, like Python or R, were definitely NOT allowed. Otherwise, what would be the point of this exercise?

Strictly sticking to these rules interested me because I began wondering: at what point while coding do I *need* to resort to a specialized language like Python, MATLAB, or R? In various research projects I've worked on, much of my coding has involved importing some data, divvying it up, cleaning it, organizing it, maybe restructuring or searching it, converting characters, and writing the alterations to a file. That is, while the primary purpose of my programming is mathematically oriented (e.g., computing power spectra of a geomagnetic time series), much of the coding surrounding the mathematical features is non-mathematical. I have found that many of these non-mathematical, textually oriented tasks are easier and more efficient to do at the Bash command line, or maybe in a Bash script, using tools like Awk, Sed, and grep.
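To make the "cleaning and organizing" point concrete, here is a small sketch of that kind of text wrangling done with standard tools. The FASTA-style record and the function names `clean_sequence` and `count_bases` are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical example: the input record and function names are invented.

# Drop header lines (those starting with '>'), uppercase the rest,
# and join everything into a single line.
clean_sequence() {
    sed '/^>/d' | tr 'a-z' 'A-Z' | tr -d '\n'
}

# Put one character on each line, then let Awk tally the occurrences.
count_bases() {
    fold -w1 | awk '{ n[$1]++ } END { for (b in n) print b, n[b] }' | sort
}

printf '>seq1 demo\nacgt\nAACG\n' | clean_sequence | count_bases
```

For this input the cleaned sequence is `ACGTAACG`, so the tally comes out as `A 3`, `C 2`, `G 2`, `T 1`. Each stage does one small job, and none of it needs an explicit loop in Bash.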

Two things I quickly learned while coding bioinformatics algorithms in Bash: (1) Bash for-loops are slow, and (2) the complex data structures that one might take for granted in more general-purpose programming languages are incredibly useful linguistic advancements, and this becomes relentlessly unmistakable as one's scripting demands on Bash progress.
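As one possible illustration of the second point, a two-dimensional table can be faked in Bash by using an associative array with a composite "row,column" key. This is only a sketch of one common workaround, not necessarily the approach taken in the class; the names `grid`, `grid_set`, and `grid_get` are hypothetical, and it assumes Bash 4 or later (older versions, such as the Bash 3.2 shipped with macOS, lack `declare -A`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emulate a 2D array with "row,col" string keys.
# Assumes Bash 4+ for associative arrays.
declare -A grid

# Store a value at (row, col).
grid_set() {
    local key="$1,$2"
    grid[$key]=$3
}

# Print the value stored at (row, col).
grid_get() {
    local key="$1,$2"
    printf '%s\n' "${grid[$key]}"
}

# Fill a 2x3 table where each cell holds row * col.
for ((i = 0; i < 2; i++)); do
    for ((j = 0; j < 3; j++)); do
        grid_set "$i" "$j" $((i * j))
    done
done

grid_get 1 2
```

The last line prints `2`. Every access goes through string-key construction and a function call inside nested Bash loops, which hints at why both the speed and the data-structure problems show up together once tables get large.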

These two issues make a class on bioinformatics dreadfully interesting! In fact, I'm ashamed to admit I had to hang up my masochistic towel. While at first it was fun trying to write Bash functions to mimic the data structures necessary for bioinformatics (e.g., lists or multi-dimensional arrays), this quickly began eating up all my time. I do have to work on my PhD research every now and then! Even if I had continued in this vein, speed seemed to really prevent me from realizing my goal: for each programming task in the class, one's program has to run in under five minutes. Beyond simple tasks, my Bash scripts were not meeting this requirement.

It wasn't all bad, though: I did learn a lot! If you love learning new languages even when it's probably not totally necessary, then I'd recommend repeating this exercise yourself. Learning Bash felt like a history lesson in computer languages. In the coming weeks, I will expound in gory detail on what I learned during my short stint as a strictly-Bash scripter.
