Installing HTSeq
HTSeq is a tool for "Analysing high-throughput sequencing data with Python". It consists of a Python module to import and two wrapper scripts to run on the command line. Here we show how to get this tool and install it as a VALET package for your research group. The commands to set up, install and create the VALET package are in three bash script files:
- t1_setup - Set shell variables and set up all necessary directories (downloading the source as needed).
- t2_install - Build the Python module and install it in your research software directory.
- t3_valet - Write a VALET package file to set up the environment to use HTSeq.
Once you have Bash source files for these tasks, you can set up, install and create the VALET package file with one compound command:
. t1_setup && . t2_install && . t3_valet
These files serve to document how the package was built, and can be used to build a new version or rebuild the old one.
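As a minimal sketch of why the steps are chained with &&, the following uses two stand-in files (step1 and step2, hypothetical names, not the real task files) in a temporary directory. The first file returns a non-zero status when a required variable is unset, much as t1_setup does for WORKDIR, so the second file is never sourced.

```shell
# Stand-in files for illustration only; they are not the real task scripts.
demo=$(mktemp -d)

cat > "$demo/step1" <<'EOF'
# Stop the chain when the (hypothetical) variable DEMO_WORKDIR is unset.
test -z "$DEMO_WORKDIR" && echo "DEMO_WORKDIR not set" && return 1
step1_done=yes
EOF

cat > "$demo/step2" <<'EOF'
step2_done=yes
EOF

# First attempt: the variable is unset, so step1 returns 1 and step2 is skipped.
unset DEMO_WORKDIR step1_done step2_done
. "$demo/step1" && . "$demo/step2"
first_try="${step1_done:-no} ${step2_done:-no}"

# Second attempt: the variable is set, so both files are sourced in order.
DEMO_WORKDIR=/tmp
. "$demo/step1" && . "$demo/step2"
second_try="${step1_done:-no} ${step2_done:-no}"

echo "first: $first_try, second: $second_try"
rm -r "$demo"
```

Because the files are sourced rather than executed, the variables they set stay in your shell for the next step, and return (not exit) ends a failed step without closing the shell.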
Note: the installation directories are on the /lustre file system, which is not backed up. These three script files are quite small, and will rebuild all directories. Keep these files on a system that is backed up, for example, your home directory on Mills.
Use the vpkg_rollback all
command before any VALET commands in the install script, so that the build starts from a known environment.
Task 1: setup
The bash source file t1_setup
has shell variable assignments and a few directory "setup" commands.
The variable assignments define the scratch, installation and VALET directories. The other values will be used to
write the VALET package file. Two environment variables are used in this source file:
USER
and WORKDIR
. Since WORKDIR
may not be set, you should source these files from a workgroup shell.
When in a workgroup shell the directory ${WORKDIR}/sw/valet
has a special meaning. It is searched by VALET for vpkg files. You may use a different directory by adding it to your VALET_SYSCONFDIR
environment variable.
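One way to add a directory to a colon-separated variable such as VALET_SYSCONFDIR without duplicating entries is sketched below. The helper name path_add and the directory /home/me/valet are hypothetical, and the sketch works on a demo variable rather than touching your real VALET configuration.

```shell
# path_add VALUE DIR: print VALUE with DIR appended, unless DIR is already listed.
# Hypothetical helper for illustration; it is not part of VALET.
path_add() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;       # already present, unchanged
        "::")     printf '%s\n' "$2" ;;       # empty value, no leading colon
        *)        printf '%s\n' "$1:$2" ;;    # append with a separator
    esac
}

demo_confdir=""                               # stands in for VALET_SYSCONFDIR
demo_confdir=$(path_add "$demo_confdir" "/lustre/work/it_css/sw/valet")
demo_confdir=$(path_add "$demo_confdir" "/home/me/valet")
demo_confdir=$(path_add "$demo_confdir" "/home/me/valet")   # no duplicate added
echo "$demo_confdir"
```

In a startup file the same helper could be applied to the real variable, for example export VALET_SYSCONFDIR=$(path_add "$VALET_SYSCONFDIR" "$WORKDIR/sw/valet").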
- "t1_setup"
# package variables:
id=htseq
version=0.5.3p9
description='Analysing high-throughput sequencing data with Python'
binaries="htseq-count htseq-qa"
url_doc='http://www-huber.embl.de/users/anders/HTSeq/doc/'
dir="HTSeq-$version"
url_get="http://pypi.python.org/packages/source/H/HTSeq/$dir.tar.gz"

# scratch variables:
scratch="/lustre/scratch/$USER/$id"
alias clean="lrm -r $scratch"

# install variables:
compiler='gcc/4.6.2'
require='numpy/1.6.1-2.7'
prefix="${WORKDIR}/sw/$id"
installhome="$prefix/$version"
pythondir="$installhome/lib/python"

# VALET variables.
valetdir="${WORKDIR}/sw/valet"

test -z "$WORKDIR" && echo "WORKDIR not set, start a workgroup shell." && return 1

# setup directories
mkdir -p -m 1777 "$valetdir"
mkdir -p "$installhome/bin" "$pythondir"
mkdir -p "$scratch"
test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"
Directory setup notes:
Short-circuit evaluation
Bash evaluates the || and && operators with short-circuit evaluation: the right-hand command runs only when the left-hand result does not already decide the outcome. For example:
test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"
The pipeline after the or operator ||
is only run if the test command fails. If the directory does not exist, the archive is downloaded, uncompressed and extracted into the scratch directory. This creates the "$scratch/$dir" directory, so the next time you source this file it will not download and extract the archive again.
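The behaviour of || can be tried safely with a stand-in for the download. In this sketch fetch_demo is a hypothetical function that only creates a directory (no curl or tar is involved) and counts how often it is called.

```shell
# Count how often the "download" stand-in runs; fetch_demo is hypothetical
# and only creates a directory instead of calling curl and tar.
demo=$(mktemp -d)
fetch_count=0
fetch_demo() {
    fetch_count=$((fetch_count + 1))
    mkdir "$demo/HTSeq-demo"
}

# First pass: the directory is missing, so the right-hand side runs.
test -d "$demo/HTSeq-demo" || fetch_demo
# Second pass: the directory exists, so the right-hand side is skipped.
test -d "$demo/HTSeq-demo" || fetch_demo

echo "fetch_demo ran $fetch_count time(s)"
rm -r "$demo"
```

Note that | binds more tightly than ||, so in the t1_setup line the whole curl-to-tar pipeline is the right-hand side that gets skipped.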
The -p option on mkdir
Normally, the mkdir
command fails, with an error message, when the directory already exists. The -p
(or --parents
) option causes mkdir
to exit normally when the directory exists, otherwise it creates it along with all parent directories as needed. You will get an error if one of the parent directories does not exist and you do not have permission to create it. The error message:
mkdir: cannot create directory `/lustre/work/it_css/sw/htseq/': Permission denied
means that a parent directory does not have the permissions you need. If you do not own that directory, it should be writable by your group and have the sticky bit set.
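A short sketch of the difference, run in a temporary directory (the path names are only examples):

```shell
demo=$(mktemp -d)

# -p creates missing parents and succeeds again when the directory exists.
mkdir -p "$demo/sw/htseq/bin" && first=ok
mkdir -p "$demo/sw/htseq/bin" && second=ok

# Without -p, an existing directory is an error.
if mkdir "$demo/sw/htseq/bin" 2>/dev/null; then plain=ok; else plain=failed; fi

echo "first=$first second=$second plain=$plain"
rm -r "$demo"
```

This is what makes t1_setup safe to source repeatedly: the mkdir -p lines do nothing once the directories are in place.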
The -m 1777 option on mkdir
Normally, any new directory will be readable and searchable by everybody, but writable only by you, as the owner of the directory. This is a good choice for the install directories, since it means your group can use the installed package, but only you can rebuild or maintain the package.
However, directories that are shared and writable should have the sticky bit set. The sticky bit on a directory (also called the restricted deletion flag) prevents an unprivileged user from removing or renaming a file in the directory unless they own the file or the directory. See man chmod
. The -m 1777
option on mkdir
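will make the named directory readable, writable and searchable by everyone (including your group), with restricted deletion. Parent directories created by -p do not get this mode; they are created with your default permissions.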
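To check what mkdir -p -m 1777 actually sets, the following sketch creates a nested directory under a temporary location and reads the permission strings back with ls -ld. The directory names shared and valet are only examples.

```shell
demo=$(mktemp -d)

# Only the last directory gets mode 1777; "shared" is created by -p
# with the default permissions derived from the umask.
mkdir -p -m 1777 "$demo/shared/valet"

# The first ten characters of "ls -ld" output are the type and permission bits.
leaf_mode=$(ls -ld "$demo/shared/valet" | cut -c1-10)
parent_mode=$(ls -ld "$demo/shared" | cut -c1-10)

echo "valet:  $leaf_mode"
echo "shared: $parent_mode"
rm -r "$demo"
```

A trailing t in the permission string (drwxrwxrwt) is how ls shows the sticky bit on a directory that is searchable by others.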
Task 2: install
This bash source file t2_install
assumes the setup task file was sourced.
- "t2_install"
# $scratch/$dir set to the scratch directory for installation
# $require and $compiler set to VALET dependent packages
# $pythondir set to Python install directory
# $installhome set to package install directory
pushd "$scratch/$dir"
vpkg_rollback all
vpkg_require $require $compiler
export PYTHONPATH=$PYTHONPATH:$pythondir
python setup.py install --home="$installhome"
popd
Installation notes:
pushd .. popd
The installation is done in a scratch directory. The pushd
changes to that directory after pushing the current directory onto a stack.
The popd
brings you back to the directory you started from. When you are done, the clean
alias set in t1_setup
removes the scratch directory.
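The pushd and popd pair can be tried on its own, as in this sketch. It assumes bash (both are bash builtins) and uses a temporary directory in place of the scratch directory; the touch command merely stands in for the build.

```shell
start_dir=$PWD
work=$(mktemp -d)

pushd "$work" > /dev/null      # remember $start_dir on the stack, cd to $work
inside_dir=$PWD
touch built.txt                # stands in for "python setup.py install"
popd > /dev/null               # return to the remembered directory

echo "worked in $inside_dir, back in $PWD"
rm -r "$work"
```

Both builtins print the directory stack, which is why the sketch sends their output to /dev/null.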
vpkg commands
These installation commands come from the HTSeq installation instructions. The instructions to download and install a dependent package are skipped. The command vpkg_require
is used instead to add dependent packages that are already installed and do not need installation. Make sure you read the instructions carefully. It is best to install from source, and you cannot install in an area that requires root (or sudo) access.
Task 3: valet
The Bash source file t3_valet
assumes the shell variables have been assigned in the setup source file, and it contains just one cat
command to write out a complete VALET package file with one version.
- "t3_valet"
# $valetdir set to VALET directory (should be in $VALET_SYSCONFDIR)
# $id set to VALET package-id
# $description set to package description
# $binaries set to a list of the package executables
# $prefix set to prefix for install location
# $prefix/$version is set to the install location
# $require and $compiler set to VALET dependent packages
# $pythondir set to the package python directory.
cat > "$valetdir/$id.vpkg" <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.udel.edu/xml/valet/1.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.udel.edu/xml/valet/1.0 http://www.udel.edu/xml/valet/1.0/schema.xsd"
         id="$id">
  <description>$description</description>
  <url>$url_doc</url>
  <prefix>$prefix</prefix>
  <version id="$version">
    <description>$binaries</description>
    <dependencies>
      <package id="$require"/>
      <package id="$compiler"/>
    </dependencies>
    <export variable="PYTHONPATH" action="path-append">$pythondir</export>
  </version>
</package>
EOT
VALET file creation notes:
here document redirect
This script contains just one cat
command with a
here-document redirect. The resulting
htseq.vpkg file will contain the lines in the source (between <<EOT
and EOT
) with the shell variables expanded
to the values set in t1_setup. In many cases, the names of the variables match the place in the vpkg file.
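The expansion depends on the delimiter being unquoted. The sketch below captures a one-line here-document twice, once with EOT and once with 'EOT', to show the difference; the XML line is a shortened example, not the full package file.

```shell
id=htseq
version=0.5.3p9

# Unquoted delimiter: $id and $version are expanded while the text is written.
expanded=$(cat <<EOT
<version id="$version">$id</version>
EOT
)

# Quoted delimiter: the text is copied literally, with no expansion.
literal=$(cat <<'EOT'
<version id="$version">$id</version>
EOT
)

echo "$expanded"
echo "$literal"
```

Since t3_valet uses the unquoted form, any literal $ that should appear in the vpkg file would have to be written as \$.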
XML documents
VALET uses XML to describe the package, which may contain several versions. It also uses an XML schema to describe the structure of an XML document.
More than one version
To install a new version, you should first copy this vpkg file to a safe location. Then change the t1_setup
script to set the variables for the new version. Remember to check for a new URL location and new requirements. Then run all three steps.
Finally, merge the saved vpkg file with the new vpkg file to use both versions. The first version is the default.
Testing HTSeq
VALET versions
When the htseq VALET package is installed and VALET_SYSCONFDIR is set, the two commands
workgroup -g it_css
vpkg_versions htseq
should yield:
Available versions in package (* = default version):
[/lustre/work/it_css/sw/valet/htseq.vpkg]
htseq  HTSeq: Analysing high-throughput sequencing data with Python
* 0.5.3p9  htseq-count htseq-qa
You will get an error message instead if you are not in a workgroup shell, i.e., $WORKDIR is not set, or if you installed the VALET
package in a different directory. You can use any directory, but you must configure VALET to use your directory.
For example, add these lines to your .bash_profile
:
# Set my VALET config directories, to find vpkg files.
export VALET_SYSCONFDIR=/archive/it_css/sw/valet:~trainf/.valet
HTSeq binaries
There are two binaries in HTSeq. To test whether htseq-count
runs:
vpkg_require htseq
htseq-count --help
The binary htseq-qa
needs an additional library, matplotlib (the plots are output as a PDF file). So to test:
vpkg_require htseq matplotlib
htseq-qa --help
Without matplotlib added to your environment, you get an error message. You should add
matplotlib
as a dependency to your htseq
package. It was not mentioned on the htseq web page as a dependency, but it is needed for htseq-qa
(and the tour using python). At this point, it is easy to just add it to the htseq.vpkg
file.
Between <dependencies>
and </dependencies>
add the line:
<package id="matplotlib/1.1.0-2.7.2"/>
It would be better to go back and add it to the install task scripts:
to t1_setup
add:
plotrequire='matplotlib/1.1.0-2.7.2'
to t3_valet
add:
<package id="$plotrequire"/>
and then rebuild the valet package file with the bash commands
. t1_setup && . t3_valet
Clearly, these three steps are more than just a simple change to one file, but this documents what you did, and you will not forget to add this change for the next version.
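If the list of dependencies keeps growing, one possible variation (an assumption on our part, not what t3_valet does) is to build the package lines in a loop and then place the result inside the here-document. The variable dependency_lines is hypothetical; the other values are the ones used on this page.

```shell
# Values as in t1_setup, plus the plotrequire variable suggested above.
compiler='gcc/4.6.2'
require='numpy/1.6.1-2.7'
plotrequire='matplotlib/1.1.0-2.7.2'

# Build one <package/> element per dependency.
# dependency_lines is a hypothetical variable, not used by the original scripts.
dependency_lines=""
for pkg in $require $compiler $plotrequire; do
    dependency_lines="$dependency_lines    <package id=\"$pkg\"/>
"
done

printf '<dependencies>\n%s</dependencies>\n' "$dependency_lines"
```

With this approach a new dependency needs only one new variable in t1_setup and one extra word in the for list.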
A tour through HTSeq
See A tour through HTSeq for a tour, which demonstrates the functionality of HTSeq. To follow along on a Mills compute node, you must download the example data and add the VALET packages htseq
and matplotlib
(the demonstration uses pyplot
in the matplotlib
module).
qlogin
export VALET_SYSCONFDIR=$WORKDIR/sw/valet
vpkg_require htseq matplotlib
curl http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz | tar -zxf -
Following the example from the HTSeq tour site:
[dnairn@n015 ~]$ python
Python 2.7.6 (default, Feb 12 2014, 12:13:46)
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
>>> fastq_file
<FastqReader object, connected to file name 'yeast_RNASeq_excerpt_sequence.txt'>
>>> import itertools
>>> for read in itertools.islice( fastq_file, 10 ):
...    print read
...
CTTACGTTTTCTGTATCAATACTCGATTTATCATCT
AATTGGTTTCCCCGCCGAGACCGTACACTACCAGCC
TTTGGACTTGATTGTTGACGCTATCAAGGCTGCTGG
ATCTCATATACAATGTCTATCCCAGAAACTCAAAAA
AAAGTTCGAATTAGGCCGTCAACCAGCCAACACCAA
GGAGCAAATTGCCAACAAGGAAAGGCAATATAACGA
AGACAAGCTGCTGCTTCTGTTGTTCCATCTGCTTCC
AAGAGGTTTGAGATCTTTGACCACCGTCTGGGCTGA
GTCATCACTATCAGAGAAGGTAGAACATTGGAAGAT
ACTTTTAAAGATTGGCCAAGAATTGGGGATTGAAGA
>>> ...