====== Installing HTSeq ======

HTSeq is a tool for "Analysing high-throughput sequencing data with Python". It consists of a Python module to import and two wrapper scripts to run on the command line. Here we show how to get this tool and install it as a VALET package for your research group.

The commands to set up, install and create the VALET package are in three bash script files:

  * **t1_setup** - Set shell variables and set up all necessary directories (downloading as needed).
  * **t2_install** - Build the Python module and install it in your research software directory.
  * **t3_valet** - Write a VALET package file to set up the environment to use HTSeq.

Once you have Bash source files for these tasks, you can set up, install and create the VALET package file with one compound command:

  . t1_setup && . t2_install && . t3_valet

These files document how the package was built, and can be used to build a new version or rebuild the old one. The Mills work directory for your research group is on the ''/lustre'' file system, which is not backed up. These three script files are quite small, and will rebuild all directories. Keep them on a system that is backed up, for example, your home directory on Mills.

These scripts are sourced, not executed. This means that all variables set, and all VALET packages added, will be in your environment on completion. In fact, this is how the setup variables are passed to the install and valet scripts. It is also why the install script runs ''vpkg_rollback all'' before any other VALET commands.

===== Task 1: setup =====

The bash source file ''t1_setup'' has shell variable assignments and a few directory "setup" commands. The variable assignments define the scratch, installation and VALET directories. The other values will be used to write the VALET package file.

There are two environment variables used in this source file: ''USER'' and ''WORKDIR''.
Since ''WORKDIR'' may not be set, you should source these files from a workgroup shell. In a workgroup shell the directory ''${WORKDIR}/sw/valet'' has a special meaning: VALET searches it for vpkg files. You may use a different directory by adding it to your ''VALET_SYSCONFDIR'' environment variable.

  # package variables:
  id=htseq
  version=0.5.3p9
  description='Analysing high-throughput sequencing data with Python'
  binaries="htseq-count htseq-qa"
  url_doc='http://www-huber.embl.de/users/anders/HTSeq/doc/'
  dir="HTSeq-$version"
  url_get="http://pypi.python.org/packages/source/H/HTSeq/$dir.tar.gz"
  # scratch variables:
  scratch="/lustre/scratch/$USER/$id"
  alias clean="lrm -r $scratch"
  # install variables:
  compiler='gcc/4.6.2'
  require='numpy/1.6.1-2.7'
  prefix="${WORKDIR}/sw/$id"
  installhome="$prefix/$version"
  pythondir="$installhome/lib/python"
  # VALET variables.
  valetdir="${WORKDIR}/sw/valet"
  test -z "$WORKDIR" && echo "WORKDIR not set, start a workgroup shell." && return 1
  # setup directories
  mkdir -p -m 1777 "$valetdir"
  mkdir -p "$installhome/bin" "$pythondir"
  mkdir -p "$scratch"
  test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"

Directory setup notes:

=== Call-by-need evaluation ===

Bash evaluates ''||'' and ''&&'' lists with short-circuit evaluation, a simple form of [[http://en.wikipedia.org/wiki/Lazy_evaluation|lazy evaluation]]: the right-hand command runs only when its result is needed. For example:

  test -d "$scratch/$dir" || curl "$url_get" | tar -zxf - -C "$scratch"

The pipeline after the **or** operator ''||'' runs only if the ''test'' command fails. If the directory does not exist, the archive is downloaded, uncompressed and extracted into the scratch directory. This creates the ''$scratch/$dir'' directory, so the next time you source this file nothing is downloaded or extracted.

=== The -p option on mkdir ===

Normally, the ''mkdir'' command fails, with an error message, when the directory already exists.
The ''-p'' (or ''--parents'') option causes ''mkdir'' to exit normally when the directory exists; otherwise it creates the directory along with any missing parent directories. You will still get an error if a parent directory does not exist and you do not have permission to create it. The error message:

  mkdir: cannot create directory `/lustre/work/it_css/sw/htseq/': Permission denied

refers to a parent directory whose permissions do not let you create entries in it. If you do not own that directory, it should be a shared, writable directory with the sticky bit set (see the next note).

=== The -m 1777 option on mkdir ===

Normally, any new directory will be readable and searchable by everybody, but writable only by you, as the owner of the directory. This is a good choice for the install directories, since it means your group can use the installed package, but only you can rebuild or maintain the package. However, directories that are shared and writable should have the sticky bit set. The sticky bit on a directory (also called the restricted deletion flag) prevents an unprivileged user from removing or renaming a file in the directory unless they own the file or the directory. See **''man chmod''**.

The ''-m 1777'' option on ''mkdir'' sets the mode of the directory it creates to readable, writable and searchable by everyone, including your group, with restricted deletion. Parent directories that ''-p'' has to create still get the default mode.

===== Task 2: install =====

The bash source file ''t2_install'' assumes the setup task file was sourced.

  # $scratch/$dir set to scratch directory for installation
  # $require and $compiler set to VALET dependent packages
  # $pythondir set to Python install directory
  # $installhome set to package install directory
  pushd "$scratch/$dir"
  vpkg_rollback all
  vpkg_require $require $compiler
  export PYTHONPATH=$PYTHONPATH:$pythondir
  python setup.py install --home="$installhome"
  popd

Installation notes:

=== pushd .. popd ===

The installation is done in a scratch directory. The ''pushd'' command pushes the current directory on a stack and then changes to the scratch directory. The ''popd'' command brings you back to the directory you started from.
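
The ''pushd'' / ''popd'' pairing can be tried on its own. The following is a minimal sketch, not part of the install scripts: it uses a throwaway directory from ''mktemp -d'' as a stand-in for ''$scratch/$dir'', and the variable names ''start_dir'', ''demo_dir'', ''inside_dir'' and ''end_dir'' are made up for the illustration. It assumes a Bash shell, since ''pushd'' and ''popd'' are Bash builtins.

```bash
# Remember where we start, and make a throwaway directory
# (a hypothetical stand-in for $scratch/$dir).
start_dir=$PWD
demo_dir=$(mktemp -d)

pushd "$demo_dir" > /dev/null   # push the current directory on the stack, then cd
inside_dir=$PWD                 # the build commands would run here
popd > /dev/null                # pop the saved directory and cd back to it

end_dir=$PWD                    # same as $start_dir again
rmdir "$demo_dir"               # remove the throwaway directory
```

Both commands print the directory stack by default, which is why the sketch redirects their output to ''/dev/null''.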
When you are done, the ''clean'' alias set in ''t1_setup'' removes the scratch directory.

=== vpkg commands ===

These installation commands come from the HTSeq installation instructions. The instructions to download and install a dependent package are skipped; instead, ''vpkg_require'' adds the dependencies, which are already available as VALET packages and need no installation. Make sure you read the instructions carefully. It is best to install from source, and you cannot install in an area that requires root (or sudo) access.

===== Task 3: valet =====

The Bash source file ''t3_valet'' assumes the shell variables have been assigned in the setup source file. It contains just one ''cat'' command, which writes out a complete VALET package file with one version.

  # $valetdir set to VALET directory (should be in $VALET_SYSCONFDIR)
  # $id set to VALET package-id
  # $description set to package description
  # $binaries set to a list of the package executables
  # $prefix set to prefix for install location
  # $prefix/$version is set to the install location
  # $require and $compiler set to VALET dependent packages
  # $pythondir set to the package python directory.
  cat > "$valetdir/$id.vpkg" < $description $url_doc $prefix $binaries $pythondir EOT

VALET file creation notes:

=== here document redirect ===

This script contains just one ''cat'' command with a [[http://en.wikipedia.org/wiki/Here_document#Unix_shells|here-document]] redirect. The resulting htseq.vpkg file will contain the lines in the source (between ''<

  workgroup -g it_css
  vpkg_versions htseq

should yield:

  Available versions in package (* = default version):
  [/lustre/work/it_css/sw/valet/htseq.vpkg]
  htseq       HTSeq: Analysing high-throughput sequencing data with Python
  * 0.5.3p9   htseq-count htseq-qa

**ERROR: unknown package: htseq**

You will get this error message if you are not in a workgroup shell, i.e., ''$WORKDIR'' is not set, or if you installed the VALET package in a different directory. You can use any directory, but you must configure VALET to use your directory.
For example, add these lines to your ''.bash_profile'':

  # Set my VALET config directories, to find vpkg files.
  export VALET_SYSCONFDIR=/archive/it_css/sw/valet:~trainf/.valet

===== HTSeq binaries =====

There are two binaries in HTSeq. To check that ''htseq-count'' runs:

  vpkg_require htseq
  htseq-count --help

The binary ''htseq-qa'' needs an additional library (its plots are output as a PDF file). So to test it:

  vpkg_require htseq matplotlib
  htseq-qa --help

**This script needs the 'matplotlib' library, which was not found. Please install it.**

You get this message when ''matplotlib'' has not been added to your environment. You may want to add ''matplotlib'' as a dependency of your ''htseq'' package. It is not mentioned on the HTSeq web page as a dependency, but it is needed for ''htseq-qa'' (and for the tour using Python).

At this point, it is easy to just add it to the ''htseq.vpkg'' file. Between '''' and '''' add the line:

It would be better to go back and add it to the install task scripts. To ''t1_setup'' add:

  plotrequire='matplotlib/1.1.0-2.7.2'

to ''t3_valet'' add:

and then rebuild the VALET package file with the bash commands

  . t1_setup && . t3_valet

Clearly, these three steps are more work than a simple change to one file, but they document what you did, and you will not forget to make this change for the next version.

===== A tour through HTSeq =====

See [[http://www-huber.embl.de/users/anders/HTSeq/doc/tour.html|A tour through HTSeq]] for a tour that demonstrates the functionality of HTSeq. To follow along on a Mills compute node, you must download the example data and add the VALET packages ''htseq'' and ''matplotlib'' (the demonstration uses ''pyplot'' from the ''matplotlib'' module).
  qlogin
  export VALET_SYSCONFDIR=$WORKDIR/sw/valet
  vpkg_require htseq matplotlib
  curl http://www-huber.embl.de/users/anders/HTSeq/HTSeq_example_data.tgz | tar -zxf -

Following the example from the HTSeq tour site:

  [dnairn@n015 ~]$ python
  Python 2.7.6 (default, Feb 12 2014, 12:13:46)
  [GCC 4.4.5 20110214 (Red Hat 4.4.5-6)] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import HTSeq
  >>> fastq_file = HTSeq.FastqReader( "yeast_RNASeq_excerpt_sequence.txt", "solexa" )
  >>> fastq_file
  >>> import itertools
  >>> for read in itertools.islice( fastq_file, 10 ):
  ...    print read
  ...
  CTTACGTTTTCTGTATCAATACTCGATTTATCATCT
  AATTGGTTTCCCCGCCGAGACCGTACACTACCAGCC
  TTTGGACTTGATTGTTGACGCTATCAAGGCTGCTGG
  ATCTCATATACAATGTCTATCCCAGAAACTCAAAAA
  AAAGTTCGAATTAGGCCGTCAACCAGCCAACACCAA
  GGAGCAAATTGCCAACAAGGAAAGGCAATATAACGA
  AGACAAGCTGCTGCTTCTGTTGTTCCATCTGCTTCC
  AAGAGGTTTGAGATCTTTGACCACCGTCTGGGCTGA
  GTCATCACTATCAGAGAAGGTAGAACATTGGAAGAT
  ACTTTTAAAGATTGGCCAAGAATTGGGGATTGAAGA
  >>> ...
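
For a quick look at the same reads without starting Python, the sequence lines can also be pulled out in the shell. This is a minimal sketch, assuming the file is plain four-line FASTQ (header, sequence, ''+'', qualities); the function name ''first_reads'' and the two-record sample are made up for the illustration and are not part of HTSeq.

```bash
# first_reads FILE [N]: print the sequence line of the first N FASTQ records
# (N defaults to 10). Assumes every record is exactly four lines long.
first_reads() {
    awk -v n="${2:-10}" 'NR % 4 == 2 { print; if (++count >= n) exit }' "$1"
}

# Made-up two-record sample, used only to exercise the function.
sample=$(mktemp)
cat > "$sample" <<'EOF'
@read1
ACGT
+
IIII
@read2
TTGA
+
IIII
EOF

first_two=$(first_reads "$sample" 2)   # sequence lines of both sample records
rm -f "$sample"
```

On the tour data the call would be ''first_reads yeast_RNASeq_excerpt_sequence.txt 10''. Unlike ''HTSeq.FastqReader'', this does no parsing or quality decoding; it only selects every fourth line, so it is a sanity check rather than a replacement.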