The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...' --Isaac Asimov
That's Funny…

Open Source Software Notes

I have to make a lot of notes to myself about how to do stuff on the computer.


simple spellcheck:

cat <file> | aspell list | sort | uniq -ic | less

to copy quickly over the network with ssh:

cd /destination/dir/ && ssh SOURCE "cd /source/dir && tar -czvf - *" | tar xzf -
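
The same tar pipe can be tried locally before pointing it at a remote host. This is only a sketch: the function name tar_copy is made up for illustration, and it archives "." instead of "*" so that hidden files come along too, with -C standing in for the cd calls.

```shell
# tar_copy SRC DEST: copy the contents of SRC into DEST through a tar pipe.
# Hypothetical helper, not part of the original notes.
tar_copy() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # "." (rather than "*") also picks up dotfiles
    tar -C "$src" -cf - . | tar -C "$dest" -xf -
}

# over ssh the left side would run on the remote host, e.g.:
#   ssh SOURCE "tar -C /source/dir -czf - ." | tar -C /destination/dir -xzf -
```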

to find the day of year of a particular date:

date --date='27 Nov 2007' +%j
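
Going the other way (day of year back to a calendar date) can be sketched like this, assuming GNU date and its relative-date syntax. The function name doy_to_date is hypothetical.

```shell
# doy_to_date YEAR DOY: print the calendar date for a given day of year.
# Hypothetical helper; relies on GNU date's "+N days" relative dates.
doy_to_date() {
    year=$1
    doy=$2
    # day 1 is January 1, so add (doy - 1) days
    date -u --date="$year-01-01 +$((doy - 1)) days" +%F
}

doy_to_date 2007 331   # prints 2007-11-27
```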

to get the details on an arbitrary list of files:

locate [file] > filelist.txt
while IFS= read -r file; do ls -l "$file"; done < filelist.txt
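
For long lists, one ls call for the whole list is quicker than one per file. A minimal sketch, assuming an xargs that supports -0 (GNU and BSD do); list_details is a made-up name. Note that with an empty list GNU xargs still runs ls once.

```shell
# list_details LISTFILE: run "ls -ld" on every path named in LISTFILE
# (one path per line). Hypothetical helper for illustration.
list_details() {
    # turn newlines into NULs so names with spaces stay in one piece
    tr '\n' '\0' < "$1" | xargs -0 ls -ld --
}
```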

to quickly scan through a text file for a word, then use ‘n’ and ‘N’ to search forward and backward:

cat [file] | less -p [word]

to remove all of the blank lines in a text document:

sed -i '/^$/d' [file]
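
The pattern above only catches completely empty lines. If lines holding nothing but spaces or tabs should go as well, something like the following might do; strip_blank is a hypothetical name, and it writes to stdout rather than editing in place.

```shell
# strip_blank [FILE...]: print the input without empty or whitespace-only lines.
# Hypothetical helper; reads stdin when no file is given.
strip_blank() {
    sed '/^[[:space:]]*$/d' "$@"
}
```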

to add an extension (here .csv) to all files in a directory:

rename 's/$/.csv/' *
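
The expression form above needs the Perl-based rename; the util-linux rename takes different arguments. A plain loop avoids the question altogether. A sketch, with add_ext as a made-up name:

```shell
# add_ext DIR EXT: append EXT (e.g. ".csv") to every regular file in DIR.
# Hypothetical helper, not part of the original notes.
add_ext() {
    dir=$1
    ext=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue   # skip directories and an unmatched glob
        mv -- "$f" "$f$ext"
    done
}
```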

to count the number of each unique item in a list:

cat [file] | sort | uniq -c | sort -nr
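
A quick way to see what this pipeline does, using made-up input. count_unique is a hypothetical wrapper around the same three commands.

```shell
# count_unique [FILE...]: count each distinct line, most frequent first.
# Hypothetical wrapper; reads stdin when no file is given.
count_unique() {
    sort "$@" | uniq -c | sort -nr
}

# made-up list: "b" occurs 3 times, "a" twice, "c" once
printf 'b\na\nb\nc\nb\na\n' | count_unique
```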


to prepend “>filename” to every FASTA file in a directory:

for file in ./*.fasta; do
    bar=$(basename "$file") # assumed: the line that sets $bar is missing from these notes
    sed -i "1i \>$bar" "$file";
    echo "$bar";
done

to fix the bad formatting of CAMERA export files

sed 's/^"//;s/","/\t/g;s/",$//g;s/&nbsp;//g' [infile] > [outfile]
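
To check the expression before running it on a real export, it can be fed a single line. The sample line below is invented only to show the shape of the transformation (it is not taken from an actual CAMERA file), camera_to_tsv is a hypothetical name, and \t in the replacement assumes GNU sed.

```shell
# camera_to_tsv [FILE...]: the same sed expression wrapped in a function.
# Hypothetical helper; reads stdin when no file is given.
camera_to_tsv() {
    sed 's/^"//;s/","/\t/g;s/",$//g;s/&nbsp;//g' "$@"
}

# invented input line; the output is id1, sometext and 42 separated by tabs
printf '%s\n' '"id1","some&nbsp;text","42",' | camera_to_tsv
```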

download complete genome sequences from JGI Integrated Microbial Genomes (IMG) using a list of IMG taxon ids (input.txt)


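for i in $(cat input.txt);
do echo $i
# URL and FILE have to be built from $i here; their definitions are missing from these notes
wget "$URL" -O "$FILE"
done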

to find all of the EC numbers in [file], sort, de-replicate, count, and print them by order of decreasing frequency

grep -o -P 'EC\W*\d+\.\d+\.\d+\.\d+' [file] | sort | uniq -c | sort -rn > output.txt
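
Not every grep is built with -P support. A sketch of the same idea with POSIX extended regular expressions, where [^[:alnum:]_] stands in for \W and [0-9]+ accepts multi-digit fields such as 2.7.11.1. The name ec_counts is made up.

```shell
# ec_counts [FILE...]: count EC numbers, most frequent first, without grep -P.
# Hypothetical helper; reads stdin when no file is given.
ec_counts() {
    grep -ohE 'EC[^[:alnum:]_]*[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$@" |
        sort | uniq -c | sort -rn
}
```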

ARB import filter to read full_name from a FASTA file. Save to $ARBHOME/lib/import/

Format of the FASTA file should be >[name][tab][full_name]

AUTODETECT      ">*"
        #Global settings:
KEYWIDTH        1

BEGIN   ">??*"

MATCH   ">*"
        SRT "* *=*1:*\t*=*1"
        WRITE "name"

MATCH   ">*"
        SRT "*\t*=*2"
        WRITE "full_name"



END     "//"

perl script to translate names in tree files or sequence files

given the file to convert and a 2-column translation table. will probably need to be edited depending on type of file. save as ‘’, make it executable ‘chmod +x’, and run as ‘./ [treefile] [translationfile]’

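#!/usr/bin/perl
use strict;

my $treefile = $ARGV[0]; # newick-like tree
my $translatefile = $ARGV[1]; #names to translate
my %namehash = ();
open(FILE, "< $translatefile") or die "cannot open $translatefile: $!";
while(<FILE>) {
    chomp; #drop the newline so it is not copied into the new name
    my @array = split(/\t/); #split on tab
    next unless defined $array[1]; #skip lines without a second column
    $array[1] =~ s/[ \/\(\)']/_/g; #replace bad chars with underscore
    $namehash{$array[0]} = $array[1];
}
close FILE;
open(FILE, "< $treefile") or die "cannot open $treefile: $!";
LINE: while(<FILE>) {
#   chomp; #uncomment to remove newlines
#   s/^[ \t]*//; #uncomment to remove whitespace at beginning of line
#     s/['"]//g; #uncomment to delete quotation marks
    foreach my $phyname (keys %namehash) {
        #assumed: the substitution itself is missing from these notes;
        #this replaces every literal occurrence of the old name
        s/\Q$phyname\E/$namehash{$phyname}/g;
    }
    print "$_";
}
close FILE;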

script to reverse sort lines by the number of tab characters in each line (for importing into R)

perl -nle 'print $count = ($_ =~ tr/\t//) . "\t$_";' [filename] | sort -rn > [outfile]
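
If perl is not handy, awk can do the counting instead. A rough equivalent, relying on gsub returning the number of replacements it made; sort_by_tabs is a hypothetical name.

```shell
# sort_by_tabs [FILE...]: prefix each line with its tab count, most tabs first.
# Hypothetical helper; reads stdin when no file is given.
sort_by_tabs() {
    # replacing each tab with a tab leaves the line unchanged but counts them
    awk '{ n = gsub(/\t/, "\t"); print n "\t" $0 }' "$@" | sort -rn
}
```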

script to extract accession numbers:

genbank (-f2) or refseq (-f4) or descriptions (-f5) out of genbank amino acids fasta files into a new file

cat *.faa | grep '>gi' | cut -f2 -d\| > [outfile]
cat *.faa | grep '>gi' | cut -f4 -d\| > [outfile]
cat *.faa | grep '>gi' | cut -f5 -d\| > [outfile]
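
The three commands differ only in the field number, so they fold into one small function. A sketch: fasta_field is a made-up name, and unlike the lines above it anchors the match to the start of the line.

```shell
# fasta_field N FILE...: print field N of the "|"-separated ">gi" header lines.
# Hypothetical helper; -h keeps grep from prefixing file names.
fasta_field() {
    n=$1
    shift
    grep -h '^>gi' "$@" | cut -f"$n" -d'|'
}

# usage: fasta_field 4 *.faa > [outfile]
```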


R

or gRRRRR as I sometimes call it.

to print a figure that’s on your screen:


to clean up and remove variables:


to remove NA-only rows by subsetting:



to generate a clean one-page HTML output of a TeX document

latex2html -split 0 -no_navigation -info 0 -address 0 [file.tex]

to convert normal quotes into LaTeX quotes

sed 's/"\([^"]*"\)/``\1/g' [inputfile] > [outputfile]
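
The expression above rewrites only the opening quote and leaves the closing " in place. If the closing quote should become '' as well, a variant along these lines might work, assuming the quotes are balanced within each line. latex_quotes is a hypothetical name.

```shell
# latex_quotes [FILE...]: turn "text" into ``text'' on each line.
# Hypothetical helper; reads stdin when no file is given.
latex_quotes() {
    # after shell unquoting, sed sees: s/"\([^"]*\)"/``\1''/g
    sed "s/\"\([^\"]*\)\"/\`\`\1''/g" "$@"
}
```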

to globally comment out/not run figures in LaTeX, put it at the end of the preamble

Engauge [Graph/Plot] Digitizer

Use this excellent program to convert an image of a graph into usable X/Y data points. It expects plots that do NOT have multiple Y values for the same X, so rotate images (e.g. P vs. depth) by 90 degrees before you import them. Plots with multiple colors are the easiest to digitize: in that case just use the ‘discretize’ options and turn off the ‘grid removal’ options. There are tutorials available at the Engauge site on SourceForge.

sudo apt-get install engauge-digitizer