Wednesday, May 28, 2014

The slow death of the term "uniquely mappable" in deep sequencing studies and resisting the "conservative" urge to toss out data





Here starts the little series of illustrations showing possible outcomes, starting with requiring a perfect match (0 mismatches) and gradually allowing more complex alignments (e.g. allowing mismatches). It is not meant to show things in a proportionally correct manner (i.e. not in proportion to all of your reads). Above, 6 reads are shown that do not map to the human genome for various reasons: each comes from a region not in the current assembly (NIA), has a SNP, has a base call error (BCE) with a low base quality score, or comes from a contaminating E. coli source (Con). There are 4 examples of reads that map uniquely: 2 map to the correct positions and 2 map to incorrect positions because they had a SNP and/or a BCE. Finally, there is an example of a multiread that maps to >1 position equally well.
Unireads are reads that align to a single, unique position in the genome.
Multireads are reads that align to multiple positions in the genome.



The slow death of the term "uniquely mappable"
Most genomics studies using next generation sequencing (e.g. ChIP-seq) use what people consider to be standard quality control steps, such as keeping only "uniquely mappable" reads. Even the ENCODE website says one of their quality metrics is using uniquely mappable reads. But what are "uniquely mappable reads"? Why or how are they useful? And do they always ensure high quality data? If you answer those questions too quickly, you will miss a lot of nuance. Anyone new to doing their own NGS bioinformatic analyses who has thought through those questions just a moment too long can attest to how much they have haunted them since. SeqAnswers.com, BioStars, and other forums are filled with questions about what defines a uniquely mappable read and how to extract such reads. However, the answers on different threads are often different. Why?


"Uniquely mappable" is a concept that started when short reads first hit the scene and there was no more sophisticated way to deal with them than to check whether a short read matched 1 site exactly (uniquely mappable), matched >1 site exactly, or did not match any site exactly. However, it was always kept in mind that SNPs and base call errors would require allowing some mismatches once the dust settled from all of the initial excitement. Once mismatches were allowed, "uniquely mappable" became a much murkier concept. The number of uniquely mappable reads decreases as more mismatches are allowed in the alignment. Another way of looking at it: the number of multireads increases with the number of mismatches allowed, since allowing mismatches lets formerly "unique" reads match more sites. So, as more mismatches are allowed, more formerly uniquely mapped reads are tossed out when keeping only "unireads". Thus, in some ways, using only uniquely mapped reads, especially after allowing, for example, 5 mismatches, is absurdly conservative and perhaps even plain wrong given how much data is thrown out. Moreover, the unique reads that remain include more and more false positives as more mismatches are allowed.

Another way to envisage why uniquely mapped reads become multireads when mismatches are allowed is to think of a genome composed of all 4^50 50-mers (every 50-mer made of A, C, G, T), each as its own chromosome so there are no overlapping 50-mers to think about. If you take one 50 bp read (no errors) from this genome and map it back requiring a perfect match, there is only 1 place it can align -- to itself, where it came from. Once you allow 1 mismatch, it could map to 151 different 50-mers. Want to allow 2 mismatches? Now it can map to over 11 thousand 50-mers. If you allow up to 5 mismatches, it can map to over 534 million positions in that genome! In each of these cases, if you were keeping tabs on the number of mismatches somehow, then despite the fact that it mapped to over 534 million positions, you could pick the highest-confidence position -- the one with 0 mismatches in this case. OR you could throw it away because it is a "multiread".

In real life, a genome is composed of only a tiny fraction of all those 50-mers, and within that tiny fraction each 50-mer is present in a different copy number, not just 1. This is why, even when allowing 0 mismatches, some reads map to multiple sites. In the real-life case, when allowing, for example, 5 mismatches, there are theoretically >534 million 50-mers a read could match, but it is possible that none of them occur in the genome, since 534 million 50-mers is far less than 1% of all possible 50-mers. However, it is also possible that one or a couple do exist in the genome, and in that case the read, which was unique when requiring a perfect match, is now considered a multiread and would be tossed out in studies that keep only "unireads" rather than choosing the best position. Thus, there becomes an absurdity to keeping only uniquely mappable reads as more and more mismatches are allowed. Moreover, in real life, this is also why contaminating (false positive) reads begin to align more often when more mismatches are allowed. Again, if you allow 5 mismatches in the 50 bp read, then the contaminating read can map to >534 million 50-mers, 1 of which may be in the reference genome of interest, and in that case such a uniquely mappable read would be kept. However, a more nuanced judgement would notice that the alignment score is poor and toss the read despite the fact that it mapped uniquely.
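To make the counting above concrete, here is a minimal Python sketch (my own illustration, not tied to any aligner) that computes how many 50-mers sit within a given number of mismatches of one read:

from math import comb

def kmers_within(read_len=50, max_mismatches=0, alphabet_size=4):
    # For i mismatched positions there are C(read_len, i) ways to choose them
    # and (alphabet_size - 1) alternative bases at each one.
    alts = alphabet_size - 1
    return sum(comb(read_len, i) * alts**i for i in range(max_mismatches + 1))

for mm in range(6):
    print(mm, "mismatches allowed ->", kmers_within(50, mm), "candidate 50-mers")
# 0 -> 1;  1 -> 151;  2 -> 11,176;  5 -> 534,053,356 (the ">534 million" above)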

After allowing 1 MM, one of the unique reads is now able to map to >1 position and so would no longer be considered unique if you extract only the reads that map to one place. The other correctly mapped uniread stays unaffected. The next read, originally a multiread, increases in multiplicity, though this new position has a lower alignment score than the first 2 and could be ignored by more sophisticated algorithms. The read with the SNP is now able to align to the correct position (an intended result of allowing mismatches). The read with the base call error that could not map before now becomes a multiread -- it can map to the correct position and to a position that agrees with the BCE but needs a 1-mismatch allowance elsewhere in the read to align. Note that in bowtie2, the correct alignment would have a higher alignment score, since its mismatch falls on the BCE (which in this example has a low base quality score) whereas the mismatch at the incorrect position falls on a high quality base call. Thus, the aligner could choose the correct position, but people who take only unireads would toss the read. The read with both a BCE and a SNP still aligns only to the incorrect place -- but uniquely! The next read with a BCE has now become a multiread because it can also align to the correct position. In bowtie2, the incorrect mapping would have the better alignment score, since it requires no mismatch and matches over low quality base calls are not penalized. Thus, this is an example of a read for which the best alignment is not the correct alignment (unless a different scoring scheme that penalizes low quality base calls in general were adopted). The contaminating reads still do not map.



Now 2 mismatches are allowed. The read with both a SNP and a BCE that originally aligned to the incorrect position is now able to align to the correct position. However, the correct position has a lower alignment score since it requires 2 mismatches (and since, at least in bowtie2, low quality base calls are not penalized over matches). The only other new event here is that one of the contaminating reads is now able to align uniquely.



The problem of "uniquely mappable" does not end with mismatch tolerance. Once gaps are also allowed, "uniquely mappable" becomes almost impossible to define. Note that most current aligners allow gaps. Think about it -- it is possible to map reads that were considered "unique" by the perfect-match definition in ways such that none end up being "uniquely mappable" when allowing gaps and mismatches (without constraints). What becomes important is the scoring scheme, the alignment score, and the cutoffs on alignment scores for an alignment to be considered "valid". Then it is important to form some definition of a well-mapped or reliably mapped position for a given read from the alignment score of its best alignment as well as those of its other alignments. This also means that the status of a read as "uniquely mappable" becomes directly dependent on the scoring scheme and cutoffs -- which are tunable parameters -- making the term even more elastic.

"Uniquely mappable reads" have even less practical relevance when end-to-end alignment is not enforced and, instead, positions of the best local alignments (where the alignment to the reference can start and end anywhere inside the read) are searched for. For many reads, the best local alignment may still be the position that was formerly "uniquely mappable" in an end-to-end alignment requiring a perfect match. However, there are likely to be many other possible local alignments (think BLAST results). What again becomes most important is the scoring scheme, the cutoffs for valid alignments, and sophisticated ways to define "confidently mapped" and "best" positions, rather than using the binary criterion of "uniquely mapped" as a proxy. At this point, very few reads will have been uniquely mapped, and many that have been uniquely mapped can be false positives. Taking all "uniquely mapped" reads would blindly include those false positives, whereas an alignment score that reflects mismatches, big gaps, and the use of only a small region of the read would represent the lack of confidence in the mapping, at which point we can choose to ignore it.

Now gaps are allowed in addition to 2 MM. The multiread increases in multiplicity again, but again the new alignment has an even lower alignment score, so it could be ignored. The read with a SNP that was unique when allowing 1 or 2 mismatches has now become a multiread. The 2 incorrect alignments would have lower alignment scores and could be ignored. When keeping only reads that align to a single position in the genome, though, it would be tossed. The next 2 reads, one with a BCE and the next with both a SNP and a BCE, increase in multiplicity (again with lower scores that could be ignored). Finally, another contaminating read is now able to uniquely align.
The first change in this picture is that the read from a region not in the assembly is now able to map, but to an incorrect position -- though uniquely! The 3rd read from the left, which was a uniread under all the conditions until now, has finally lost its uniread status. It can now align to 2 positions, though the incorrect position would have a lower alignment score and could be ignored. If keeping only reads that aligned to a single position, it would be tossed. Now the remaining contaminating read is also able to align (uniquely), and one of the others has become a multiread. All in all, this series of illustrations is not proportionally accurate -- it is just meant to show the absurdity of ignoring alignment scores and judging a read that maps uniquely as good and one that maps multiply as bad, when in many cases an alignment score comparison could tell you where a multiread maps and that a uniread maps horribly.







So why is there a lack of agreement about "uniquely mappable" in the forums?
Practically speaking, "uniquely mappable" has become an outdated and not-so-useful term. Technically speaking, the "uniquely mappable" status of a read can change with the algorithm used, with its heuristics for finding alignments in reasonable amounts of time, and with its scoring schemes and cutoffs. Culturally speaking, "uniquely mappable" has become a term associated with being "rigorous" with your data. The term persists, and it exists simultaneously in two states: some people latch on to the old definition and take it to mean that the read has only a single site in the genome it maps to (but this changes with how the read is mapped, as discussed above!), while others have adopted a newer definition and take it to mean that the read either maps to a single place OR the alignment score of the best alignment position is better than the score of the next best alignment position (e.g. bowtie2). The latter definition is more sophisticated and takes into consideration the fact that more and more reads will be able to map to more and more places once you start allowing mismatches, gaps, and local alignments, all of which may be necessary because of SNPs, base call errors, microdeletions, microinsertions, reads split across break points … in other words, the many issues that arise when mapping reads (often with errors) from a cancer genome back to the reference. The view is that even if a read can map to more than one place, the better its best score is relative to its 2nd best score, the more likely the best-scoring position is the correct one. However, it is not guaranteed that the best alignment score belongs to the correct mapping position, though this might be the case most of the time.

The goal should not be to include only uniquely mappable reads. The goal should be to include any read that is mapped to a position with high confidence. This would include some "multireads" and exclude some "unireads":
Since, on some level, all reads can become multireads (some sooner than others), "uniquely mappable" is not a useful concept anymore. Truth is, it was always just a quick-and-dirty, black-and-white, binary heuristic to classify a mapped read. There always needed to be a more dynamic way to quantify a "well-mapped" read; a spectrum from lowest to highest in belief or confidence that the mapping is correct. Mapping Quality (MAPQ or MQ) scores, which typically range from 0 to 40 or 42, theoretically approach this ideal. MAPQ is supposed to be -10log10(p), where p is the probability the read is mapped wrong. This way, the smaller the probability it is mapped wrong, the higher the score it gets. Thus, if there are 2 places a read can go and it is assigned to 1 uniformly at random, the a priori probability that it is mapped wrong is 0.5 and the MAPQ = -10log10(0.5) ~= 3. Similarly, a MAPQ of 10, 20, 30, or 40 should mean the probability it mapped wrong is 0.1, 0.01, 0.001, or 0.0001, respectively. I do not think most current aligners implement probabilistic models that first estimate the probability a read is mapped wrong and then convert it to a MAPQ, though. For example, Bowtie2 seems to follow a rule-based approach to assigning MAPQ values (which I discuss in another post). In Bowtie2, a read that aligns to more than 1 site equally well is never given higher than a MAPQ of 1 (even if it aligns to only 2 sites equally well, as discussed above).
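As a quick sanity check on those numbers, here is a tiny Python sketch of the conversion itself (just the -10log10 formula, not any particular aligner's MAPQ estimator):

from math import log10

def mapq(p_wrong):
    # MAPQ = -10 * log10(probability the reported position is wrong)
    if p_wrong <= 0:
        return float("inf")  # an estimated probability of exactly 0 never really happens
    return -10 * log10(p_wrong)

print(round(mapq(0.5), 1))           # 2 equally good sites, picked at random -> ~3.0
for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, "->", round(mapq(p)))   # -> 10, 20, 30, 40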

Though MAPQ is related to uniquely mappable (mainly the newer definition), it folds in other factors as well. MAPQ is just a transformation of the probability a read is mapped wrong, which should take into consideration, at least implicitly, the following: the alignment score, the number of times the read mapped, and the alignment scores of the other mappings (Bowtie2 looks only at the minimum alignment score allowed and the alignment scores of the best and second best alignments to estimate the MAPQ). Since the estimated probability that a read is mapped wrong depends on whether the read mapped more than once, it is also heavily influenced by the allowance of mismatches and gaps, scoring schemes, and cutoffs. Moreover, different algorithms approximate MAPQ scores in different ways. So the MAPQ of a mapped read from bowtie2 can differ from that of bwa even if everything else (e.g. finding and scoring alignments) were the same (which it isn't). MAPQ may not be completely standardized yet, but with current read mapping it is a more nuanced, more practical, more accurate way to quantify our belief that a read is correctly placed than is classifying it as "uniquely mapped" or not.

From here on forward, let us refer to "multireads" as only those reads whose best alignment score is the same as their second best alignment score. Reads whose best alignment score is better than their second best, formerly multireads, will now be called "maxireads" to represent that they were placed at the position of maximum score. The term "maxiread" also encompasses reads that would previously be called "uniquely mappable", as they too, by definition, were placed at the position with the maximum alignment score. So the two terms moving forward are "maxireads", those with a single maximum alignment score, and "multireads", those for which there is no single maximum alignment score (i.e. the maximum is shared by >= 2 positions).
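The distinction is simple to state in code. Below is a hypothetical little classifier -- the function name and score inputs are my own illustration, not any aligner's output format -- assuming you already have the best and second-best alignment scores for a read:

def classify_read(best_score, second_best_score=None):
    # "maxiread": a single position holds the maximum alignment score
    #             (this includes the reads formerly called "uniquely mappable").
    # "multiread": the maximum score is shared by >= 2 positions.
    if second_best_score is None or best_score > second_best_score:
        return "maxiread"
    return "multiread"

print(classify_read(-5))         # aligned to only one place        -> maxiread
print(classify_read(-5, -11))    # best score beats the second best -> maxiread
print(classify_read(-5, -5))     # tied best scores                 -> multiread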

The frontier is really in deciding how to find the best positions for all multi-reads. Multireads are still essentially just filtered out. Thinking about mapping all mappable reads like a puzzle, the problem is, we have been assuming we have 2 options:
   1. toss out the puzzle pieces that have shapes that fit in multiple places equally well 
   2. randomly choose one of the multiple places that a puzzle piece's shape fits equally well
Most of us have largely ignored the 3rd option: 
  3. use the picture we can already see formed by the puzzle pieces that fit into 1 spot best to inform us where the puzzle pieces with shapes that fit multiple places go

In other words, use the background already put together by the higher confidence maxiread data to form posterior probabilities of where a multi-read goes. Posterior MAPQs could be assigned and the reads could still be filtered out if they did not quite meet someone's MAPQ cutoff. Some posterior MAPQs would still be 0 if the maxi-read background was uninformative (i.e. balanced across all sites). Other posterior MAPQs could become quite high if the high quality maxi-read background was concentrated at just 1 of the optional sites. 

For example, imagine a read that maps to 10 sites equally well. A priori, the probability it maps to any one of those sites is 10%. Thus, you have a 90% chance of placing it at the wrong site if you assign it uniformly at random. The prior MAPQ would therefore be -10log10(0.9) ~= 0.46, which would be rounded to 0. However, if you look at the background distribution of maxireads and one of those 10 sites has 91 maxireads while each of the others has 1 (for a total of 100 maxireads), the posterior probability might be 91% for the first site and 1% for each of the 9 others. Now, for a read mapped to the first site (91% chance it came from there), the posterior probability that it is mapped wrong (given that all 10 candidate alignments are equally good matches) is 9%. Let's call it 10% to make the math easier. The posterior MAPQ would now be -10*log10(0.1) = 10 instead of 0. However, it is difficult for me to say how a read randomly assigned to one of the other 9 sites should be scored. Since reads mapped to the other 9 sites were assigned under the same posterior distribution as the reads mapped to the first site (at a rate of 1% to each site), it seems like they should be scored as equally correct as the read mapped to the first site -- not necessarily -10log10(0.99) ~= 0.04, which rounds to 0 -- though that would be one way to score them.

Moreover, other factors need to be taken into consideration, such as the mappability of each of those sites. Some regions of the genome might be deserts in terms of places maxireads could align, while others might be chock full of them. There needs to be a way to normalize the number of maxireads at a site by the propensity of maxireads to occur at that site. If one could do that, then one could intelligently place multireads to reconstruct the most probabilistically accurate puzzle picture, rather than leave puzzle pieces out of it or fit them in uniformly at random.
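Here is a rough Python sketch of that idea. Everything in it is illustrative -- the counts, the optional mappability normalization, and the simple proportional posterior are assumptions for the sake of the example, not a description of any existing tool:

from math import log10

def posterior_mapqs(maxiread_counts, mappability=None):
    # Weight each candidate site by its maxiread count, optionally normalized by
    # that site's propensity to collect maxireads (its "mappability").
    if mappability is None:
        mappability = [1.0] * len(maxiread_counts)
    weights = [c / m for c, m in zip(maxiread_counts, mappability)]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    # Posterior MAPQ for assigning the multiread to each candidate site.
    return [round(-10 * log10(1 - p), 2) if p < 1 else 99.0 for p in posteriors]

# The example from the text: 10 candidate sites, 91 maxireads at one site, 1 at each of the others.
print(posterior_mapqs([91] + [1] * 9))
# site 1 gets ~10 (probability of being wrong ~0.09); the other 9 sites stay near 0

A real implementation would also need the mappability normalization hinted at above, so that sites that soak up maxireads merely because they are easy to map to do not dominate the posterior.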

Before mapping all of the high quality "maxireads" (see text for definition), we have a completely uninformed, uniform distribution on how to assign this multiread.


After mapping all of the high quality "maxireads" (see text for definition), i.e. after looking at the data, we have a better-informed posterior distribution on how to assign this multiread.

I have found that some groups have begun trying to implement ideas like this. This needs to catch on!


Slightly related -- others have come up with ways to study repetitive regions (from where many multireads arise) in general:
- TEA: http://compbio.med.harvard.edu/Tea/

----
Note: of course these "goals" change according to the application. However, my hope is to open a discussion about all the data we toss out and the general assumption that keeping only uniquely mapped reads is "rigorous". It could be called "conservative", but it is not rigorous. It is more closely related to intellectual laziness than to intellectual rigor. In another post, I will happily discuss better ways to deal with "redundant reads" as well, instead of just removing them all or all but 1, which of course sets a very low ceiling on how many reads you can align to the genome in the era of increasing throughput.




Tuesday, February 18, 2014

Why it burns when you P.

The smaller the p-value, the better the result. True or false?




If you answered "True", then that is why it burns when you use and interpret p-values. It hurts us all. Stop it.






The p-value seems to have become the judge, jury, and executioner of scientific results. As I go further into my PhD, I see just how in love we are with p-values, as well as all the signs of an abusive relationship. No matter how many horrors the p-value wreaks upon us, we just know it won't do it again. Plus, no matter how much we misuse and manipulate the p-value, it just won't leave us. It's here to stay, and that's love, right?! Love it or hate it, it is pretty hard to imagine a world without p-values, but such a world did exist: p-values have been around for less than a century. And before that: nothing was statistically significant! Yet scientists -- Darwin, Mendel, Newton, Einstein… -- were somehow able to test hypotheses, come to conclusions, and develop theories anyway.



Google Ngram snapshot Feb 17, 2014. 
Side note: It is not strictly true that the world was completely devoid of significance before the p-value. For example, John Arbuthnot seems to have stumbled upon a significant result in 1710, and the term "statistically significant" seems to have come up around the year 1600 according to Google Ngram (above). However, "p-values", "statistical significance", and "statistically significant results" in their modern incarnations are a product of the 1920s.



So who invented the P-value, what was it originally used for, and how is it used in modern data analysis? Could science and data analysis exist without p-values once again? Should it? Many of you (i.e. my readers, i.e. referer spam bots) might have felt a sense of shock thinking that the almighty p-value ever had an inventor. What mere mortal could have created the mighty metric for "significance" itself?  Rather than tell you, I encourage you, Spam Bot, to read, "The Lady Tasting Tea" by David Salsburg. One result of inventing the p-value that is almost certainly statistically significant is the number of scientific papers that have included the p-value since. More interesting is that the p-value has become so omnipresent, so omnipotent, so mythological and magical that few if anyone cites the mere mortal inventor when they use it. If everyone did cite R. A. Fisher, he would quite possibly be the most cited scientist ever by no small margin (okay I told you, Spam Bot ...but I still think you should read that book).

The p-value has been a source of controversy ever since its invention. Unfortunately, the controversy has mostly gone on in the coolest statistics circles, which it is statistically unlikely that you or any of your ancestors were ever in. Good news! Due to increasingly bad science -- perhaps a consequence of the rise, abuse, misuse, and generally poor understanding of p-values -- the controversy has reached the foreground.

There are already plenty of people that have provided prose preaching about the promiscuous rise and predicted plummet of the p-value's popularity and that promote posterior probabilities and other alternatives such as reporting effect sizes and confidence intervals. So I do not want to do my own version of that. See the end for recommended books and articles covering these issues. Rather, I want to ask you a few questions (feel free to answer in the comments section). This is not a comprehensive set of questions, but they are questions that I think any scientist/data analyst should have their own thoughts on. In fact, it would be amazing if average Joe Citizen had thoughts on these questions about the almighty p-value and related topics such as the False Discovery Rate, reproducibility, and validity. All of us are blasted in the face with statistically significant results on a daily basis. Popular views on whether eggs and coffee are good or bad for your health change all the time because of statistical significance. So, which is it: good or bad?!?! Can statistical significance ever answer those health questions definitively?






Take a moment to think about each of the following questions. There are some starting points to explore these topics afterward, but mostly I expect that you know how to use Google. For the uninitiated, remember that p-values get smaller with "higher significance" -- hence the tempting, but horribly wrong conclusion of "the smaller the p-value, the better the result":

1. What is a p-value? 

2. What does a p-value mean to you?

3. What does a p-value mean to your colleagues?

4. Are results better when the p-value is smaller (i.e. "more significant")?

5. What factors make a p-value tiny anyway?

6. When you are scanning a paper or report and see "highly significant" (tiny) p-values, how does that affect how you perceive the results?

7. When should you or anyone else report a p-value? 

8. Is it always necessary to include a p-value?

9. What is an effect size?

10. Does a tiny p-value guarantee a large effect size?

11. Is it possible to get a highly significant (tiny) p-value with a tiny effect size?

12. How does "n" (number of data points) affect the p-value?

13. What is a null hypothesis? or what is the null distribution?

14. Is there more than one possible null distribution to use for a given test? 

15. How do you pick a null distribution?

16. Is there a correct null distribution? or multiple correct null distributions ("nulls")? or mostly incorrect nulls? or do nulls have no inherent correctness at all?

17. What does the null distribution you use say about your data? 

18. If your p-value is "highly significant" and lets you conclude that your data almost certainly did not come from your null distribution, a negative assertion, does that give you any positive assertions to make about your data?

19. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that mean the null hypothesis is therefore true?

20. If your p-value is "not significant" and so you fail to reject the null hypothesis, does that imply that your alternative hypothesis is therefore false?

21. If your p-value is "significant", would it come out statistically significant again if you repeated the experiment? What is the probability it would be significant again? Is that knowable? 

22. Is it possible to get a highly significant (tiny) p-value when your data have indeed come from the null distribution?

23. When you do 100 tests with a cut-off for significant p-values at 0.05, if all the tests are in fact from the null distribution, how many do you expect to be counted as significant anyway?  -- i.e. how many false positive tests will there be out of the 100 when the significance cutoff is 0.05? If you do 2.5 million tests with a significance cutoff of 0.00001, how many false positives would you expect to come through? (See the small simulation sketch after this list.)

24. When doing multiple tests in general, how can you limit the number of false positives that come through? How will that affect the number of true positives that come through?

25. What is the Bonferroni correction?

26. What is the False Discovery Rate (FDR)? …and what are q-values?

27. Does the False Discovery Rate control all types of false discoveries?

28. What types of false positives are not accounted for when controlling the FDR?

29. When reading a paper or report, if the FDR is tiny, how does that affect how you perceive the results?

30. What is Bayes' Rule?

31. What is a prior probability? a likelihood? a posterior probability?

32. What is a prior distribution?

33. What is a posterior distribution?

34. Compare null distributions to prior distributions. 

35. Compare p-values and posterior probabilities.

36. Compare confidence intervals to credible intervals.

37. What is reproducibility? reliability? accuracy? validity?

38. What is the Irreproducible Discovery Rate (IDR)?

39. If a result is highly reproducible, does this ensure it is valid? 


40. If a result is not reproducible, does this guarantee it is not valid?

41. If results are both reproducible and valid, does this guarantee the interpretation is correct?

42. Are the non-reproducible elements across replicates always "noise"? Or could they be signal?

43. Is it possible that the most reproducible elements across replicates are in fact the noise? 

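For question 23 above, a tiny simulation makes the arithmetic concrete. This is a minimal sketch that assumes p-values are uniformly distributed under the null (true for continuous test statistics); the expected count is simply n_tests * alpha:

import random

def count_false_positives(n_tests, alpha, seed=0):
    # Every test here is drawn from the null, so every "significant" result is a false positive.
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))

print(count_false_positives(100, 0.05))          # expected 100 * 0.05   = 5
print(count_false_positives(2_500_000, 1e-5))    # expected 2.5e6 * 1e-5 = 25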

Books:

"The Lady Tasting Tea" -- David Salsburg
-- This was a good read - a brief history of statistics, the birth and rise of statistical ideas, and the characters involved






-- James Schwartz
-- This is also a great history of statistics though not strictly relevant to the p-value topic. Think of it as the pre-P history. I mainly included it here because when I was done reading this, I read "The Lady Tasting Tea" and thought that in some ways it picked up where this book left off. Many of the great statisticians were interested in genetics - that is still true even today.








-- Wackerly, Mendenhall, Scheaffer





Articles (chronicles) concerning the p-value battles:


Significance Questing -- Rothman, 1986


Other articles (some of which concern p-value battles -- e.g. motivation for Bayesian Inference):





Wiki pages:





Other blogs, classes, etc that get into the topic of p-values -- for and against:







Hypothesis Testing -- slides by Vladimir Janis

The Statistical Significance Testing Controversy: A Critical Analysis  <-- R. Chris Fraley's syllabus for his class of that title. At bare minimum, this is a very extensive set of references on these topics, organized into points such as why to abandon significance tests, why to keep them, effect size, statistical power, etc.


IDR - a very new statistical idea:

Tuesday, February 11, 2014

The Oil-and-Water of Website Design

Hear ye, hear ye: I have a meagre-but-workable knowledge of designing and maintaining small websites. 

Over the years I've figured out some helpful website design/management practices that are probably pretty well known and obvious to real website managers, but I think are probably useful to anyone who likes to dabble. The underlying theme is layering and separation. Modularity. For example, the aesthetics of the website---the logo, the masthead---should be developed independently of coding (and preferably beforehand to help guide the coding). The coding itself naturally should separate into content, on-screen actions, and layout --- that is, e.g., your HTML, your javascript, and your CSS. 

I'm going to skip a bunch of personal history and go straight to a case study: my lab's website. A few years ago I thought it would be good for my lab to have a web presence, so I thought: (1) We need a cool name, (2) We need a cool logo, (3) We need a cool website. 

Turns out "cool" is defined such that "Terrestrial Lab" is a cool name and that this is a cool logo:


The definition is also flexible enough such that www.terrestrial-lab.org is a cool website. At the least, these will be our starting assumptions!

Before I made this website, my advisor often referred to our lab as the NJIT Middle and Upper Atmospheric Group. While that name was descriptive, I thought something like "Terrestrial Lab" was catchier. We're not studying rocks, oceans, volcanoes, or a plethora of other "terrestrial" features (hell, most of science fits under the "terrestrial" umbrella), so this name might seem like a stretch. But we do study space weather activity with a focus on the Earth's lower, middle, and upper atmosphere, and our work does make up the "terrestrial" component of our broader group, the "Center for Solar-Terrestrial Research." So 1+1=2, and it's a great name!

With a hip new name, we could also have an easy-to-remember website URL. Our old website was TSMTM.net --- pretty dang easy to remember.....that is, if you're familiar with the -ospheres: troposphere, stratosphere, mesosphere, thermosphere, magnetosphere. But if you're anything other than an atmospheric scientist, I wouldn't feel bad if it just looks like a random string of letters. And, anyway, if you're anything like me, all you see at a glance is Teenage Mutant Ninja Turtles.

What's a hip new name without an eye-catching logo? I wanted something identifiable. Something simple and obvious from afar. My idea was to make a logo that someone could mindlessly doodle in class. Something that would look good as a sticker, or on a T-shirt. Most importantly, it had to do all this while making us look like a serious group of scientists (we're actually just a bunch of clowns, so this is important!).

I started with a triangle: simple, serious, pointy. I played around with it on a piece of paper, adding bits of text until I got things looking how I wanted. Then I went digital with Adobe Illustrator after the idea was already down. I had to play around with fonts for a while, but I finally nailed it --- with printed T-shirts and all:

 


Great, but a logo by itself isn't enough for a website: one also needs a masthead of sorts. Some kind of graphic. Something that is nice looking, professional, and somehow relevant. I had always thought the periodograms I produced with magnetometer data from Antarctica had an aesthetic appeal to them, so I started there. Overall, this should be a fun part. Just play around. This was my final product:



Ok, so we have a name, a logo, a main page masthead design. Now it's time to get your hands dirty and code. I'm going to assume you know a lot of basics (some HTML, a sprinkle of javascript, a dash of CSS) and just list some things that have helped me. For starters, my website looks like this:


It's important to realize that the TL website looked like this even when I had much less efficient coding practices. I did not yet fully appreciate the oil-and-water of website design --- the natural separation of content, on-screen actions, and layout! The focus of my pointers below is on efficiency, which has nothing to do with the final product, but everything to do with maintaining the final product without going insane. 

1. Internal vs External Cascading Style Sheets: Formerly, I used internal cascading style sheets as my primary CSS with manual overrides in-line if necessary. This was fine when I first created the site as a proof-of-concept to show my advisor, but it quickly became awkward to maintain the site and develop new pages. If I wanted to change the CSS of one page, I'd have to manually go around and do it to the other pages as well---or just say "to heck with consistency!" 

Solution: I updated my site's primary CSS to an external style sheet (.css file) and now can do manual over-rides both (locally) in-line, like I formerly could, and (page-globally) in the header using the <style> tag. (Look at the code below and/or Google these things if you want to learn more.)


2. Internal vs External Javascript: 
Same story. I used to have it internal to each page on my site using the <script> tag. Although I rarely modify the javascript on the site, the individual pages look so much more cleaned up without it, which allows me to focus on a page's content. Each page now calls on an external .js file, which I only need to edit once if the need arises, not laboriously for each page.

3. The Navigation Bar: The meat of the TL website is the navigation bar at the top. It needs to remain constant as a user peruses the pages, which means it somehow has to exist within each page's HTML file. One terrible way to do it is to copy and paste the nav bar's HTML into each web page's HTML file. That's fine until you want to modify the nav bar --- to add new sections, delete old links, or edit the associated javascript or CSS. One small change to the nav bar requires manually going into each file and...oh man! How time-consuming. I used to actually do this. It was tiresome and oppressive, and it simply stamped out creativity and made website maintenance a chore.

Solution: Again, nothing new conceptually: separation of dynamic content from static content. It's possible to use an external HTML file for the nav bar and simply have each webpage of the TL site refer to that file. This dramatically changes the web design and management experience. How to do it? The answer is Server-Side Includes, which allows one HTML file to call upon an external HTML file. The only caveat for using SSI is that you must change your webpage extensions from .html (or .htm) to .shtml. 

To learn basic SSI (which is likely all you need), the wikipedia article is good enough. 

The code pasted below shows more than meets the eye. These few lines in an HTML file used to be tens of lines! Maybe more. I used to have both JS scripts and a huge CSS code in the head of every page. Now the website's server simply pastes it there before it gets sent to the user's browser. Same thing for the SSI and my nav bar. That's it. The server does all the hard work before this page ever goes live in someone's browser.

<head>
<title>CSTR - Terrestrial Laboratory</title>
<script type="text/javascript" src="TL_script1.js"></script>
<script type="text/javascript" src="TL_script2.js"></script>
<link rel="stylesheet" type="text/css" href="TL_in_style.css">
</head>


<body>
<!--#include virtual="topLinkBar.html"-->

<!-- all other HTML code for the page -->

</body>

And, so the secret to sanity is the separation of seemingly singular suppositions! (And, of course, the mastery of alliteration.)

Thursday, January 30, 2014

Intro to Vim

If you have 25 minutes to spare, watch Derek Wyatt's video about why coders should learn Vim.

Derek is charismatic and fun to watch, for one --- about as good as any TV show I waste my time on. Better though: he has a whole series of videos teaching the curious coder about Vim. And why not be entertained while learning how to wield this mighty weapon of choice?  

The Vim Screen
I learned Vim very slowly over the years (read: "I didn't really know how to use Vim over the years"), using it only occasionally when I had to do something quick from the command line. I remember the first time I saw its extremely plain, featureless screen and thinking, "This thing can't possibly be as good as everyone says it is." I figured that Vim must be one of those "back in the old days" kind of things... But every time I needed to edit something at the command line real quick I'd think of Vim. I was intrigued and intimidated by all the good things people had to say about it.

Recently I've been working at the command line quite a bit and frequently switching between languages, which usually means between editors / development environments. I wanted some consistency. I thought, "Why not try Vim again?" A few Google searches later I was playing Vim Adventures. Cost 25 bucks, but I figured, hey, this will be a fun way to pick up a new skill on my off time...and I've blown 25 bucks on worse things!

Fun it was, but after learning all it had to teach, there is often still something I need to do that I don't know how to do, or I wonder whether there is a Vim extension for such-and-such. I'm currently learning about various Vim plug-ins (e.g., Vim-LaTeX) and can't help but realize I still know diddly squat about the inherent powers lurking underneath the illusory screen simplicity of Vim.

So -- I'm going to give these videos a go and, hopefully, learn a thing or two.

Tuesday, January 7, 2014

The Craft of Homebrew-ing for Mac OS

Author: John Urban
Original publish date: circa Dec. 2012
Originally appeared: Brown University Genomics Club

Do you want to install a bunch of bioinformatics programs, such as bowtie, bedtools, and samtools, on your MacBook Pro?  Want to do it in a super-fast, super-easy way?

Homebrew is the solution. What is Homebrew? There are webpages in the 'More Help' section at the bottom of the page that may help you understand it better. For now, suffice it to say that it makes installing software packages quite straightforward. All you will have to do to get bowtie, for example, is type 'brew install bowtie' at the command line. Wait a minute or so. Now you have bowtie.

First you need to install Homebrew. Unfortunately, you cannot just do 'brew install homebrew'. Fortunately, the installation is straightforward. Read on!

Instructions:
You will need Xcode on your MacBook Pro. This can be downloaded for free from the Mac App Store.

Download Xcode before starting installation of homebrew.

General tip: It is best to stay with the latest Mac OS -- currently 10.8.2 (at the time of writing). You may want to consider upgrading before anything else; it might be that you have to in order to get the latest version of Xcode. It is cheap, and, to quell any fears, it will not erase anything on your computer. Mostly, things will continue unperturbed after the OS upgrade. Any issues you do run into can usually be overcome with Time Machine backups.



(1) Get Homebrew
Find out how to get Homebrew here.

That link basically says to enter this at the command line:
    ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"

It will then give you a lot of prompts to continue or abort, which will scare you if you're new to all this. 

If you continue through all of the prompts, then proceed to Step (2).



(2) Doctor Homebrew
Homebrew will tell you to do this, but in case you're not the type to pay attention to the silly messages programs print to you and assuming you are paying attention to the words you are reading right now, here is what to do.

First thing to do after installing homebrew is to type at the command line:
    brew doctor

This will result in a self-diagnosis and a printout of all the potential 'illnesses and ailments' that you should 'cure' before doing anything else. I will cover the warnings I took care of here. It may not be comprehensive, but simply pasting your warnings into Google is usually sufficient to find further advice.

I received the following types of warnings. 

(a) I had to update my Xcode
If you just downloaded it you should have the latest version. For me, it was just a matter of updating it through the app store. My computer had been prompting me to update it for months. 


(b) I had to update to the latest version of XQuartz
When I upgraded my Mac OS from 10.6.8 to 10.8.x, I found the new OS no longer came with X11. XQuartz was the answer to this. However, I have been using Terminal instead of X11/XQuartz. Nonetheless, you need the most updated version. If you already have an older version of XQuartz, it is just a matter of opening it; it will ask you to update automatically. Otherwise, download the latest version.


(c) I had to download Xcode command line tools
These do not download automatically with Xcode as they are not necessary for most people. If you have the latest version of Xcode already (do that first), then just open Xcode and click on the following menu bar options:
                  Xcode --> preferences --> downloads --> components. 
 Then choose to install command line tools.


(d) It gave me the suggestion to arrange the order of my PATH variable a certain way. 
This can be done by going into your .bash_profile (and/or .bashrc) file in your home directory and adding something like this:
             export PATH=/usr/local/bin:/usr/local/sbin:~/bin:$PATH

Alternatively, you can find the path file to modify at:
                          /etc/paths
Switch the paths inside it around accordingly.


(e) I had to update python

Verbatim warning:

Warning: "config" scripts exist outside your system or Homebrew directories.
`./configure` scripts often look for *-config scripts to determine if software packages are installed, and what additional flags to use when compiling and linking.

Having additional scripts in your path can confuse software installed via Homebrew if the config script overrides a system or Homebrew provided script of the same name. We found the following "config" scripts:

/Library/Frameworks/Python.framework/Versions/2.7/bin/python-config
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2-config
/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7-config

Solution:
   I had installed python 2.7 at one point (my Mac came with 2.6) and it modified my .bash_profile file in the following way:

#Setting PATH for Python 2.7
#The orginal version is saved in .bash_profile.pysave
PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

    In other words, it put the new python in my PATH variable. However, now homebrew has that covered. The solution was to just silence (or erase) those modifications made by python installation.

#Setting PATH for Python 2.7
#The orginal version is saved in .bash_profile.pysave
#PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
#export PATH

    I chose to use '#' to silence them in case I uninstall homebrew in the future, which I doubt I will do. It all can just be erased though. 



(f) I had to sudo chown something

Verbatim warning:

Warning: Some directories in /usr/local/share/man aren't writable.
This can happen if you "sudo make install" software that isn't managed
by Homebrew. If a brew tries to add locale information to one of these
directories, then the install will fail during the link step.
You should probably `chown` them:

    /usr/local/share/man/de
    /usr/local/share/man/de/man1

The following solution should only be employed if you are using your own machine and you are the only user. For example, I am the only user of my MacBook Pro. This also means you know your computer's password, which will be necessary.

        To be sure of your username, type into the command line:
            whoami
        It spits out your exact username.

        Next use the following commands to chown the files as homebrew wants:
            sudo chown username /usr/local/share/man/de
            ## Had to enter computer pw, then
            sudo chown username /usr/local/share/man/de/man1/

Those were all of my warnings.
After each attempt at a solution, re-run 'brew doctor' to see if the diagnosis has changed (i.e. make sure it does not include what you sought to fix).

When all ailments are fixed, 'brew doctor' will return:
    "Your system is raring to brew."

At this point, you may still need to (or just want to) type:
     brew update 
This will ensure homebrew is completely updated upon all fixes.



(3) Use Homebrew
Homebrew is easy to use, and there are a lot of tutorials on the web.

Try:
brew list
    to see the current list of 'formula' you have

brew search
    to see full list of 'formula' you may want to get

brew info 'formula'
    to find out more about a given formula

brew install 'formula'
    to install a given formula in the search list

brew uninstall 'formula'
    to remove something from your list or something that did not install correctly

brew tap 'keg'
    e.g. brew tap homebrew/science
    --> I had to 'tap' this 'keg' for quite a few things I wanted.
          So you might as well just tap that keg now.
    --> See the full list of formula that may be dependent on tapping this keg here:

An idea of all the 'formulas' I used 'brew install "formula"' for:
    gfortran (needed for octave)
    octave (an open-source MatLab-like program)
    bedtools 
    samtools
    bamtools
    bowtie
    bowtie2
    vcftools 
    blast (type 'brew info blast' first. You may want the smaller 'dynamic version')
    tophat
    cufflinks
    velvet
    abyss
    blat 
    bwa
    clustal-omega
    fastx_toolkit
    phyml
    emboss
    mira
    sga
    gnuplot*

*A note on gnuplot, octave, aquaterm:
Gnuplot is needed for plots in octave, but you may also need/want aquaterm. If so, a slight nuance (read: annoyance) is that you must install aquaterm before gnuplot. So if you installed gnuplot before aquaterm, then trying to plot in octave will not work if the GNUTERM variable in octave is set to aqua (you can try setting GNUTERM to x11 though). There is an easy Homebrew solution. First do 'brew uninstall gnuplot'. If aquaterm was already installed, then just do 'brew install gnuplot' and you are done. If aquaterm has not been installed yet, then install aquaterm (follow their instructions) before doing 'brew install gnuplot' and being done. Note that for an open-source language that feels like MatLab, "Julia" (a quickly developing and impressive language) may be a better option than Octave these days.


If you are unfamiliar with any of the above bioinformatics tools, either:
1- Type the name along with 'bioinformatics' (to narrow down possible interpretations) into google
    e.g. sga bioinformatics
OR try:
2 - brew info formula
    e.g. brew info sga



More Help:
1 - type 'brew help'
2 - type 'man brew'