Installing HTSeq

HTSeq is a tool for "Analysing high-throughput sequencing data with Python". It consists of a Python module to import and two wrapper scripts to run on the command line. Here we show how to get this tool and install it as a VALET package for your research group. The commands to set up, install and create the VALET package are in three Bash source files:

  • t1_setup - Set shell variables and set up all necessary directories (downloading as needed).
  • t2_install - Build the Python module and install it in your research software directory.
  • t3_valet - Write a VALET package file to set up the environment to use HTSeq.

Once you have Bash source files for these tasks, you can set up, install and create the VALET package file with one compound command:

. t1_setup && . t2_install && . t3_valet

These files serve to document how the package was built, and can be used to build a new version or rebuild the old one.

The Mills work directory for your research group is on the /lustre file system and is not backed up. These three script files are quite small and will rebuild all directories. Keep these files on a system that is backed up, for example, your home directory on Mills.
These scripts are sourced. This means that all variables set and all VALET packages added will be in your environment on completion. In fact, this is how the setup variables are passed to the install and valet scripts. This is also why there is the vpkg_rollback all command before any other VALET commands in the install script.
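
The difference between sourcing and executing a file can be seen in a small, self-contained sketch. The file name demo_setup and the variable demo_version are hypothetical stand-ins for t1_setup and its variables; nothing here touches VALET or your work directory.

```shell
# Illustrative only: demo_setup is a hypothetical stand-in for t1_setup.
tmpdir=$(mktemp -d)
printf 'demo_version=0.5.3p9\n' > "$tmpdir/demo_setup"

# Executed in a child shell: the assignment is lost when the child exits.
unset demo_version
bash "$tmpdir/demo_setup"
after_exec=${demo_version:-unset}

# Sourced into the current shell: the assignment persists.
. "$tmpdir/demo_setup"
after_source=${demo_version:-unset}

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -r "$tmpdir"
```

Only the sourced run leaves demo_version set, which is the behavior the three task files rely on to share their variables.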

The Bash source file t1_setup has shell variable assignments and a few directory "setup" commands. The variable assignments define the scratch, installation and VALET directories. The other values will be used to write the VALET package file. There are two environment variables used in this source file: USER and WORKDIR. Since WORKDIR may not be set, you should source these files from a workgroup shell. In a workgroup shell, the directory ${WORKDIR}/sw/valet has a special meaning: it is searched by VALET for vpkg files. You may use a different directory by adding it to your VALET_SYSCONFDIR environment variable.

"t1_setup"
# package variables:
id=htseq
version=0.5.3p9
description='Analysing high-throughput sequencing data with Python'
binaries="htseq-count htseq-qa"
url_doc='http://www-huber.embl.de/users/anders/HTSeq/doc/'
dir="HTSeq-$version"
url_get="http://pypi.python.org/packages/source/H/HTSeq/$dir.tar.gz"
 
# scratch variables:
scratch="/lustre/scratch/$USER/$id"
alias clean="lrm -r $scratch"
 
# install variables:
compiler='gcc/4.6.2'
require='numpy/1.6.1-2.7'
prefix="${WORKDIR}/sw/$id"
installhome="$prefix/$version"
pythondir="$installhome/lib/python"
 
# VALET variables.
valetdir="${WORKDIR}/sw/valet"
 
test -z "$WORKDIR" && echo "WORKDIR not set, start a workgroup shell." && return 1
 
# setup directories
mkdir -p -m 1777 "$valetdir"
mkdir -p "$installhome/bin" "$pythondir"
mkdir -p "$scratch"
test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"

Directory setup notes:

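Short-circuit evaluation

Bash evaluates the || (or) and && (and) operators with short-circuit evaluation: the command on the right runs only when the result of the command on the left requires it. For example: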

test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"

The command after the or operator || runs only if the test command fails. If the directory does not exist, the archive is downloaded, uncompressed and extracted into the scratch directory. This creates the "$scratch/$dir" directory, so the next time you source this file it will not download and extract the archive again.
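
The skip-if-present pattern can be tried without a network connection. In this sketch a small local tarball stands in for the curl download, and the function fetch, the counter extractions and all file names are hypothetical. The counter records how often the right-hand side of || actually runs.

```shell
# Illustrative only: a local tarball stands in for the curl download,
# and every name below is hypothetical.
work=$(mktemp -d)
mkdir -p "$work/src/demo-1.0" "$work/scratch"
echo "hello" > "$work/src/demo-1.0/README"
tar -czf "$work/demo-1.0.tar.gz" -C "$work/src" demo-1.0

extractions=0
fetch() {
    # Count how often the right-hand side of || actually runs.
    extractions=$((extractions + 1))
    tar -xzf "$work/demo-1.0.tar.gz" -C "$work/scratch"
}

# First pass: the directory is missing, so test fails and fetch runs.
test -d "$work/scratch/demo-1.0" || fetch
# Second pass: the directory now exists, so fetch is skipped.
test -d "$work/scratch/demo-1.0" || fetch

echo "extractions: $extractions"
rm -r "$work"
```

The archive is extracted once even though the guarded command appears twice, which is why t1_setup can be sourced repeatedly without downloading again.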

The -p option on mkdir

Normally, the mkdir command fails with an error message when the directory already exists. The -p (or --parents) option causes mkdir to exit normally when the directory exists; otherwise it creates the directory along with any missing parent directories. You will get an error if one of the parent directories does not exist and you do not have permission to create it. The error message:

mkdir: cannot create directory `/lustre/work/it_css/sw/htseq/': Permission denied

refers to a parent directory that mkdir could not create because the directory above it does not grant you write permission. If you do not own that directory, it should have the sticky bit set.
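
The behavior of mkdir with and without -p can be checked in a temporary directory. This is a sketch only; the sw/demo/1.0 path is hypothetical and simply mimics the layout of an install directory.

```shell
# Illustrative only: the directory names are hypothetical.
base=$(mktemp -d)

# Plain mkdir cannot create a directory whose parents are missing.
mkdir "$base/sw/demo/1.0" 2>/dev/null && plain_missing=ok || plain_missing=failed

# -p creates the missing parents as needed.
mkdir -p "$base/sw/demo/1.0" && with_p_first=ok || with_p_first=failed

# Repeating the command is not an error with -p ...
mkdir -p "$base/sw/demo/1.0" && with_p_again=ok || with_p_again=failed

# ... but it is an error without it, because the directory now exists.
mkdir "$base/sw/demo/1.0" 2>/dev/null && plain_existing=ok || plain_existing=failed

echo "mkdir,    parents missing:  $plain_missing"
echo "mkdir -p, parents missing:  $with_p_first"
echo "mkdir -p, directory exists: $with_p_again"
echo "mkdir,    directory exists: $plain_existing"
rm -r "$base"
```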

The -m 1777 option on mkdir

Normally, any new directory will be readable and searchable by everybody, but writable only by you, as the owner of the directory. This is a good choice for the install directories, since it means your group can use the installed package, but only you can rebuild or maintain the package.

However, directories that are shared and writable should have the sticky bit set. The sticky bit on a directory (also called the restricted deletion flag) prevents an unprivileged user from removing or renaming a file in the directory unless they own the file or the directory. See man chmod. The -m 1777 option on mkdir sets the newly created directory to be writable by everyone, including your group, with restricted deletion. When combined with -p, the mode applies only to the final directory; any parent directories created along the way get the default permissions.
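
To see what mode a directory actually receives, you can create one in a temporary location and inspect it. This sketch assumes GNU coreutils (for stat -c), and the sw/valet path under the temporary directory is hypothetical.

```shell
# Illustrative only: assumes GNU coreutils (stat -c); names are hypothetical.
base=$(mktemp -d)

mkdir -p -m 1777 "$base/sw/valet"
shared_mode=$(stat -c '%a' "$base/sw/valet")   # mode requested with -m
parent_mode=$(stat -c '%a' "$base/sw")         # parent created by -p

echo "valet directory mode:  $shared_mode"
echo "parent directory mode: $parent_mode"
ls -ld "$base/sw/valet"    # a trailing t in the permission string marks the sticky bit
rm -r "$base"
```

The parent directory sw does not receive the 1777 mode, so if a parent also needs to be shared, create it in a separate mkdir -m command or adjust it with chmod.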

The Bash source file t2_install assumes the setup file t1_setup was sourced.

"t2_install"
# $scratch/$dir set to the scratch directory for installation
# $require and $compiler set to the VALET dependency packages
# $pythondir set to the Python install directory
# $installhome set to the package install directory
pushd "$scratch/$dir"
vpkg_rollback all
vpkg_require $require $compiler
export PYTHONPATH="$PYTHONPATH:$pythondir"
python setup.py install --home="$installhome"
popd

Installation notes:

pushd .. popd

The installation is done in a scratch directory. The pushd command changes to that directory after pushing the current working directory onto a stack. The popd command brings you back to the original directory. When you are done, the clean alias set in t1_setup can be used to remove the scratch directory.
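
The round trip made by pushd and popd can be verified with a short sketch. It requires Bash, since pushd and popd are Bash builtins, and the variable names start, build, inside and after are hypothetical.

```shell
# Illustrative only: requires Bash (pushd and popd are Bash builtins).
start=$PWD
build=$(mktemp -d)

pushd "$build" > /dev/null    # push $start on the stack, then cd to $build
inside=$PWD
popd > /dev/null              # pop the stack and cd back to $start
after=$PWD

echo "started in:  $start"
echo "worked in:   $inside"
echo "returned to: $after"
rmdir "$build"
```

Both commands print the directory stack by default, so the redirections to /dev/null only keep the output quiet.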

vpkg commands

These installation commands come from the HTSeq installation instructions. The instructions to download and install a dependent package are skipped; instead, the vpkg_require command adds packages that are already available through VALET and do not need installation. Make sure you read the instructions carefully. It is best to install from source, and you cannot install in an area that requires root (or sudo) access.

The Bash source file t3_valet assumes the shell variables have been assigned in the setup source file, and it contains just one cat command to write out a complete VALET package file with one version.

"t3_valet"
# $valetdir set to VALET directory (should be in $VALET_SYSCONFDIR)
# $id set to VALET package-id
# $description set to package description
# $binaries set to a list of the package executables
# $prefix set to prefix for install location
# $prefix/$version is set to the install location
# $require and $compiler set to VALET dependent packages 
# $pythondir set to the package python directory.
cat > "$valetdir/$id.vpkg" <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<package
  xmlns="http://www.udel.edu/xml/valet/1.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.udel.edu/xml/valet/1.0
                      http://www.udel.edu/xml/valet/1.0/schema.xsd"
  id="$id">
  <description>$description</description>
  <url>$url_doc</url>
  <prefix>$prefix</prefix>
  <version id="$version">
    <description>$binaries</description>
    <dependencies>
      <package id="$require"/>
      <package id="$compiler"/>
    </dependencies>
    <export variable="PYTHONPATH" action="path-append">$pythondir</export>
  </version>
</package>
EOT

VALET file creation notes:

here document redirect

This script contains just one cat command with a here-document redirect. The resulting htseq.vpkg file will contain the lines in the source (between <<EOT and EOT) with the shell variables expanded to the values set in t1_setup. In many cases, the names of the variables match their place in the vpkg file.
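
Variable expansion inside a here-document depends on whether the delimiter is quoted. The following sketch writes a tiny, made-up XML fragment twice to show the difference; the values demo and 1.0 and the temporary file are hypothetical.

```shell
# Illustrative only: the file and the values are hypothetical.
out=$(mktemp)
id=demo
version=1.0

# Unquoted delimiter: $id and $version are expanded as the file is written.
cat > "$out" <<EOT
<package id="$id">
  <version id="$version"/>
</package>
EOT
expanded=$(cat "$out")

# Quoted delimiter: the text is copied literally, with no expansion.
cat > "$out" <<'EOT'
<package id="$id">
EOT
literal=$(cat "$out")

echo "$expanded"
echo "$literal"
rm "$out"
```

t3_valet uses the unquoted form, which is why the values from t1_setup end up in the vpkg file.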

XML documents

VALET uses XML to describe the package, which may contain several versions. It also uses an XML schema to describe the structure of an XML document.

More than one version

To install a new version, first copy this vpkg file to a safe location. Then change the t1_setup script to set the variables for the new version. Remember to check for a new URL location and new requirements. Then run all three steps. Finally, merge the saved vpkg file with the new vpkg file to use both versions. The first version listed is the default.
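
A merged package file might look like the sketch below, which reuses the element layout written by t3_valet. The version number 0.5.4, the prefix path, and the assumption that the new version keeps the same dependencies are all hypothetical, and the file is written to a temporary directory rather than to your VALET directory.

```shell
# Illustrative only: 0.5.4 is a hypothetical newer version and the prefix
# is a placeholder. Both versions are assumed to share the same dependencies.
valetdir=$(mktemp -d)
id=htseq
prefix=/path/to/workgroup/sw/$id
require='numpy/1.6.1-2.7'
compiler='gcc/4.6.2'

# The version listed first (0.5.4 here) becomes the default.
cat > "$valetdir/$id.vpkg" <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<package
  xmlns="http://www.udel.edu/xml/valet/1.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.udel.edu/xml/valet/1.0
                      http://www.udel.edu/xml/valet/1.0/schema.xsd"
  id="$id">
  <description>Analysing high-throughput sequencing data with Python</description>
  <url>http://www-huber.embl.de/users/anders/HTSeq/doc/</url>
  <prefix>$prefix</prefix>
  <version id="0.5.4">
    <description>htseq-count htseq-qa</description>
    <dependencies>
      <package id="$require"/>
      <package id="$compiler"/>
    </dependencies>
    <export variable="PYTHONPATH" action="path-append">$prefix/0.5.4/lib/python</export>
  </version>
  <version id="0.5.3p9">
    <description>htseq-count htseq-qa</description>
    <dependencies>
      <package id="$require"/>
      <package id="$compiler"/>
    </dependencies>
    <export variable="PYTHONPATH" action="path-append">$prefix/0.5.3p9/lib/python</export>
  </version>
</package>
EOT

version_count=$(grep -c '<version id=' "$valetdir/$id.vpkg")
echo "versions in merged file: $version_count"
rm -r "$valetdir"
```

Each version element carries its own PYTHONPATH export, because the install location under the prefix includes the version number.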

Testing HTSeq

When the htseq VALET package is installed and VALET_SYSCONFDIR is set, the two commands

workgroup -g it_css
vpkg_versions htseq

should yield:

Available versions in package (* = default version):

[/lustre/work/it_css/sw/valet/htseq.vpkg]
htseq      HTSeq: Analysing high-throughput sequencing data with Python
* 0.5.3p9  htseq-count htseq-qa

If VALET cannot find the package file, the second command reports instead:

ERROR: unknown package: htseq

You will get this error message if you are not in a workgroup shell (that is, $WORKDIR is not set) or if you installed the VALET package in a different directory. You can use any directory, but you must configure VALET to use your directory. For example, add these lines to your .bash_profile:

#Set my VALET config directories, to find vpkg files.
export VALET_SYSCONFDIR=/archive/it_css/sw/valet:~trainf/.valet
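
If you extend VALET_SYSCONFDIR from several places, a small helper can avoid listing the same directory twice. The function add_valet_dir below is a hypothetical convenience, not part of VALET, and /home/demo/.valet is a made-up path used only for the demonstration.

```shell
# Illustrative only: add_valet_dir is a hypothetical helper, not part of VALET.
add_valet_dir() {
    case ":${VALET_SYSCONFDIR}:" in
        *":$1:"*) ;;    # already listed, nothing to do
        *) VALET_SYSCONFDIR="${VALET_SYSCONFDIR:+$VALET_SYSCONFDIR:}$1" ;;
    esac
    export VALET_SYSCONFDIR
}

VALET_SYSCONFDIR=/archive/it_css/sw/valet
add_valet_dir /home/demo/.valet
add_valet_dir /home/demo/.valet    # the second call changes nothing
echo "$VALET_SYSCONFDIR"
```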

There are two binaries in HTSeq. To test whether htseq-count runs:

vpkg_require htseq
htseq-count --help

The binary htseq-qa needs an additional library, matplotlib, because its plots are output as a PDF file. So to test:

vpkg_require htseq matplotlib
htseq-qa --help

Without matplotlib added to your environment, you get this message:

This script needs the 'matplotlib' library, which was not found. Please install it.

You may want to add matplotlib as a dependency to your htseq package. It was not mentioned on the HTSeq web page as a dependency, but it is needed for htseq-qa (and for the tour using Python). At this point, it is easy to just add it to the htseq.vpkg file. Between <dependencies> and </dependencies> add the line:
      <package id="matplotlib/1.1.0-2.7.2"/>

It would be better to go back and add it to the install task scripts:

to t1_setup add:

plotrequire='matplotlib/1.1.0-2.7.2'

to t3_valet add:

      <package id="$plotrequire"/>

and then rebuild the valet package file with the bash commands

. t1_setup && . t3_valet

Clearly, these three steps are more than just a simple change to one file, but this documents what you did, and you will not forget to add this change for the next version.

See "A tour through HTSeq", which demonstrates the functionality of HTSeq. To follow along on a Mills compute node, you must download the example data and add the VALET packages htseq and matplotlib. (The demonstration uses pyplot from the matplotlib module.)

qlogin
export VALET_SYSCONFDIR=$WORKDIR/sw/valet
vpkg_require htseq matplotlib
curl http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz | tar -zxf -

Following the example from the HTSeq tour site:

[dnairn@n015 ~]$ python
Python 2.7.6 (default, Feb 12 2014, 12:13:46) 
[GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import HTSeq
>>> fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
>>> fastq_file
<FastqReader object, connected to file name 'yeast_RNASeq_excerpt_sequence.txt'>
>>> import itertools
>>> for read in itertools.islice( fastq_file, 10 ):
...     print read
... 
CTTACGTTTTCTGTATCAATACTCGATTTATCATCT
AATTGGTTTCCCCGCCGAGACCGTACACTACCAGCC
TTTGGACTTGATTGTTGACGCTATCAAGGCTGCTGG
ATCTCATATACAATGTCTATCCCAGAAACTCAAAAA
AAAGTTCGAATTAGGCCGTCAACCAGCCAACACCAA
GGAGCAAATTGCCAACAAGGAAAGGCAATATAACGA
AGACAAGCTGCTGCTTCTGTTGTTCCATCTGCTTCC
AAGAGGTTTGAGATCTTTGACCACCGTCTGGGCTGA
GTCATCACTATCAGAGAAGGTAGAACATTGGAAGAT
ACTTTTAAAGATTGGCCAAGAATTGGGGATTGAAGA
>>> 

  • Last modified: 2017-10-23 18:03
  • by sraskar