Essentials of Next Generation Sequencing Workshop 2014



Date: 28.01.2017


University of Kentucky AGTC



Class 1




Essentials of Unix/Linux


Goal: To gain familiarity with, and comfort using, the Unix command line.



1.1 Getting Connected and Reconnected


If you are working from a Windows machine on campus, it will come with a program called "PuTTY", which is a free telnet/SSH client. Think of SSH (secure shell) as a simple, secure way for computers to communicate with each other.


  • After locating PuTTY, just enter the IP address for our machine (128.163.192.150) and click "Open". You will be prompted for your user name and password.




  • You now see a prompt, similar to:

[daniel.harris@agtc01 ~]$


You will see your name instead of mine. Here, agtc01 is the hostname. You are now “located” in your home directory (often abbreviated as ~ or ~/). The dollar sign $ simply marks the end of the prompt.

  • Exit PuTTY.





    • exit


This will close PuTTY. Try to start PuTTY and log in again. We will assume you can do this step in future classes without issue.

1.2 Exploring and Learning Basic Commands

Let’s try to familiarize ourselves with the machine by using a few commands.





  • Display your working directory.





    • pwd


You will see something like this:


[daniel.harris@agtc01 ~]$ pwd

/home/daniel.harris


This is your current working directory, meaning you are virtually located here. Any actions you do will assume you are here.


  • Make a directory.





    • mkdir example

This has created a directory underneath your home directory called example.




  • List your current directory's contents.





    • ls

This lists the contents of your current directory. For example:


[daniel.harris@agtc01 ~]$ ls

example
There will likely be more in your current directory (such as workshop materials), but also notice that a directory named example is there. How do we know it’s a directory?





  • Repeat the previous step with -l. This is the “long” directory listing that shows more details.





    • ls -l

[daniel.harris@agtc01 ~]$ ls -l

total 4

drwxrwxr-x 2 daniel.harris daniel.harris 4096 Jul 1 23:51 example


What does this all mean? Below is an example key.

[Figure: screenshot.png, a key to the fields of the ls -l output]

For file permissions, "r" means "can read", "w" means "can write", and "x" means "can execute". Notice there are nine columns (or three sets) of permissions in the example; these three sets of permissions designate the settings for the user/owner, the group, and others (everyone else on the machine).
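As a rough illustration of reading the permission string, the sketch below builds a scratch directory and looks at the first ten characters of a long listing. The chmod command and the file name notes.txt are not part of this handout; they are assumptions used only to get a predictable result.

```shell
# Work in a throwaway directory so nothing in your home directory is touched.
scratch=$(mktemp -d)
cd "$scratch"

mkdir example          # a directory, as in the step above
touch notes.txt        # an empty regular file (hypothetical name)
chmod 640 notes.txt    # owner: rw-, group: r--, others: --- (chmod is an assumption, not covered above)

# Characters 1-10 of a long listing are the type flag plus the nine permission columns.
file_perms=$(ls -l notes.txt | cut -c1-10)
dir_flag=$(ls -ld example | cut -c1)

echo "notes.txt: $file_perms"
echo "example starts with: $dir_flag"
```

The leading character answers the earlier question: a d marks a directory, while a - marks a regular file.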

So, other than from memory, how do you know what -l does?



  • Show the manual for ls.





    • man ls

Most Linux programs have a manual page installed, accessible via man program-name. Press q to quit the man program.




  • Change your current working directory to the newly created example directory.





    • cd example




Other ways to use cd:

cd /home/daniel.harris/ # change using the full path

cd ~ # change using the ~ home directory shortcut

cd # cd will default to the home directory

cd .. # the ".." short-cut means the "parent" dir


You can view the file system as a tree. Meaning:

cd .. # change to my parent directory

cd ../.. # change to my parent's parent directory
Also, note that "." refers to the current working directory.

cd . # change to where I am (doesn't change anything)
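The tree idea can be tried safely in a scratch directory. This is only a sketch: the directory layout is made up for the example, and mkdir -p (which creates nested directories in one step) is an addition not covered above.

```shell
# Build a small made-up tree: <scratch>/example/source
base=$(mktemp -d)
mkdir -p "$base/example/source"

cd "$base/example/source"
cd ..                              # parent directory: .../example
after_one_up=$(basename "$PWD")

cd source
cd ../..                           # parent's parent: the scratch directory itself
after_two_up=$(basename "$PWD")

cd .                               # "." is where we already are, so nothing changes
after_dot=$(basename "$PWD")

echo "one level up:  $after_one_up"
echo "two levels up: $after_two_up"
echo "after cd .:    $after_dot"
```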



Hints for a productive time:

Your shell supports tab-completion, which means you don't usually have to type the full names of commands or files. For example, if your file's name is FinalProjectDecemeber172010.txt, you could simply type cat Final and press Tab, and the shell would complete the name for you. If more than one file name starts with Final, pressing Tab twice will display a list of all the possibilities.


Alternatively, your shell also supports wildcard characters (the * character), so typing cat Final* would display every file whose name starts with "Final".
I used the cat command above. What does it do? How can you find out? Use man.
A quick way to repeat commands is to scroll through your previous commands by using the up and down arrow keys.
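To see what a wildcard actually expands to, here is a small sketch using made-up file names in a scratch directory. The shell replaces Final* with the matching names before the command runs, so echo is a harmless way to preview the expansion.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Three hypothetical files; only two start with "Final".
echo "draft one" > Final_draft1.txt
echo "draft two" > Final_draft2.txt
echo "unrelated" > notes.txt

matched=$(echo Final*)     # preview which names the pattern expands to
combined=$(cat Final*)     # cat prints the matching files one after another

echo "$matched"
echo "$combined"
```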

1.3 Learning a Text Editor


There are many text editors available for Linux (vim/nano/emacs/joe), but we’ll focus on vim today. Let’s use vim to create and save a text file that simply contains your name. Let’s save it in your example directory.



  • Navigate to your home directory.





    • cd ~





  • Start vim and we will edit a file called name.txt. vim will create the file if it does not exist.





    • vim example/name.txt

vim has modes. Your keyboard input will be treated differently depending on what mode you are in. You start in command-mode (vim's own documentation calls it Normal mode), where keystrokes are interpreted as commands rather than inserted as text.

For example, the command i places you in insert mode (the Insert key works too); this is where you are free to type.

There is a banner at the bottom that indicates what mode you are currently in. When in insert mode, pressing the escape key places you in command-mode. Note that pressing the Insert key again while in insert mode places you in replace mode, which acts much like insert mode but overwrites letters at the cursor.



  • Toggle between modes to get a decent feel for noticing which mode you are in.

  • In insert-mode, type your first name on the first line and your last name on the second line of the file. When done, leave insert-mode and enter command-mode.

There is an additional mode, called last-line-mode. To access it, enter : (a colon). If you press a colon ":", vim's cursor will switch to the last line of the screen and let you enter vim commands to change its internal settings.

For example, you can turn line numbers on and off. vim's on/off options usually come in pairs: option and nooption.

:set number

:set nonumber


Also, you can perform common commands here too. You can save/write this file with ":w". You can quit with ":q" or quit without saving with ":q!".



  • Save your file. (:w)

Commands in vim are case sensitive, so d is different from D.


Examples of keys that can be pressed in command-mode:

x (or the Delete key)    delete the character under the cursor

D    delete text up to the end of the line

dd    delete the current line

u    undo the last delete or change

U    undo all changes on the last edited line

ZZ    save and quit

/string    search for string in the document

n    go to the next search result

y    yank (copy); combine it with a motion, e.g. yw yanks a word

:yN    yank N lines starting at the current line, e.g. :y5 yanks 5 lines (in colon/last-line mode)

Y    yank the entire line

p    put (paste) whatever has been yanked


Copying and pasting is difficult at first, so let’s experiment with that.
You can use vim to view a text file safely, just use :q! to exit without saving any accidental changes.
Alternatively, you can use a command called "cat" (concatenate):
cat some_textfile
If the file is long, "less" is a command that will let you page-up and page-down the file.
less some_textfile

1.4 Downloading from the Internet


  • Change your working directory to the example directory we created earlier.

  • Type the following.





    • wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/yeast.nt.gz




[daniel.harris@agtc01 example]$ wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/yeast.nt.gz

--2012-07-09 01:35:26-- ftp://ftp.ncbi.nih.gov/blast/db/FASTA/yeast.nt.gz

=> `yeast.nt.gz'

Resolving ftp.ncbi.nih.gov... 130.14.250.10

Connecting to ftp.ncbi.nih.gov|130.14.250.10|:21... connected.

Logging in as anonymous ... Logged in!

==> SYST ... done. ==> PWD ... done.

==> TYPE I ... done. ==> CWD /blast/db/FASTA ... done.

==> SIZE yeast.nt.gz ... 3732371

==> PASV ... done. ==> RETR yeast.nt.gz ... done.

Length: 3732371 (3.6M)
100%[=======================================================================================>] 3,732,371 4.19M/s in 0.8s
2012-07-09 01:35:28 (4.19 MB/s) - `yeast.nt.gz' saved [3732371]


  • List the contents of your directory to view the results.



[daniel.harris@agtc01 example]$ ls

name.txt yeast.nt.gz





  • A gzip-compressed file was downloaded. It must be decompressed before use: gzip compresses a file and gunzip reverses the process.





    • gunzip yeast.nt.gz




This replaces yeast.nt.gz with an uncompressed file called yeast.nt.
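If you want to watch the compress/decompress round trip without downloading anything, the sketch below uses a tiny made-up FASTA file (tiny.fa is a hypothetical name). Note that each command replaces its input file rather than leaving a copy behind.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A made-up one-record FASTA file.
printf '>seq1 made-up record\nACGTACGT\n' > tiny.fa

gzip tiny.fa              # tiny.fa is replaced by tiny.fa.gz
after_gzip=$(ls)

gunzip tiny.fa.gz         # tiny.fa.gz is replaced by tiny.fa
after_gunzip=$(ls)

echo "after gzip:   $after_gzip"
echo "after gunzip: $after_gunzip"
```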

  • Suppose we want to remove this file now.





    • rm yeast.nt

List the contents of your directory. It is gone, permanently. There isn’t really a concept of a trashcan when dealing with the terminal/console. To read more about rm, check out its man page.





  • Re-download the file using wget. Remember, a quick way to repeat commands is to scroll through your previous commands by using the up and down keys.



  • Create a directory underneath your example directory called source. Move the .gz file you just downloaded into the directory by the following:





    • mv yeast.nt.gz source/yeast.nt.gz





  • Suppose we want to copy this file to a backup.





    • cp source/yeast.nt.gz ~/example/source/yeast.backup.nt.gz

The first argument of cp (copy) is the file you wish to copy, while the second argument is the destination, which can also include directory information.



  • Remove the backup file you just copied.
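The move, copy, and remove steps from this section can be rehearsed in a scratch directory first. This is a sketch only: the yeast.nt.gz file here is a one-line placeholder rather than a real download, and the scratch layout is an assumption made for the example.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/example"        # -p creates missing parent directories (not covered above)
cd "$scratch/example"

echo "placeholder" > yeast.nt.gz   # stand-in file, not real gzip data

mkdir source
mv yeast.nt.gz source/yeast.nt.gz                  # move: the original location is now empty
cp source/yeast.nt.gz source/yeast.backup.nt.gz    # copy: both files now exist
ls source

rm source/yeast.backup.nt.gz                       # remove: there is no trashcan to recover from
listing_after_rm=$(ls source)
echo "$listing_after_rm"
```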


1.5 Using SCP to Transfer Files


SCP is a command-line tool that can send files to a remote machine or that can download files from a remote machine. We will use agtc02 as our remote machine, so that files are copied strictly through networking even if the source and destination happen to match.


  • Create a directory under your example directory called target.



  • Change to your example directory. List the contents of it. You should see directories called source and target.

  • List the contents of your target directory. It should be empty.



  • Type the following but replace username with your username.





    • scp source/yeast.nt.gz username@agtc02:~/example/target/


This will copy the file source/yeast.nt.gz to the destination of ~/example/target/ on the desired server across the network.



  • List the contents of your target directory. It should contain the file we just copied.



  • Remove the file from the target directory.



  • List the contents of your target directory. It should be empty again.



  • Type:





    • scp username@agtc02:~/example/source/yeast.nt.gz ./target


You just downloaded the file into your local target directory. This transfer could have occurred between two different machines.





  • List the contents of your target directory. It should contain the file we just copied.



  • Remove the .gz located in your target directory. Verify it worked by listing its contents.

1.6 Grep and Re-direction


Grep is a tool that can search for patterns within text files. It is based on regular-expressions, a language for representing patterns. Patterns can get quite complex, but we’ll stay relatively simple for this.


  • Pick a .fasta file to play with. You can also use the yeast file from above; decompress it with gunzip first if it is still a .gz file.

  • Search for the sequence meta-data (the line of information that precedes the sequence):


    • grep \> yeast.nt



[daniel.harris@agtc01 example]$ grep \> yeast.nt

>gi|6226515|ref|NC_001224.1| Saccharomyces cerevisiae mitochondrion, complete genome

>gi|6319247|ref|NC_001133.1| Saccharomyces cerevisiae chromosome I, complete chromosome sequence

>gi|6319354|ref|NC_001134.1| Saccharomyces cerevisiae chromosome II, complete chromosome sequence

>gi|6319780|ref|NC_001135.1| Saccharomyces cerevisiae chromosome III, complete chromosome sequence

>gi|7839148|ref|NC_001136.2| Saccharomyces cerevisiae chromosome IV, complete chromosome sequence

>gi|7276232|ref|NC_001137.2| Saccharomyces cerevisiae chromosome V, complete chromosome sequence

>gi|6321039|ref|NC_001138.1| Saccharomyces cerevisiae chromosome VI, complete chromosome sequence

>gi|6321173|ref|NC_001139.1| Saccharomyces cerevisiae chromosome VII, complete chromosome sequence

>gi|6862570|ref|NC_001140.2| Saccharomyces cerevisiae chromosome VIII, complete chromosome sequence

>gi|6322016|ref|NC_001141.1| Saccharomyces cerevisiae chromosome IX, complete chromosome sequence

>gi|6322236|ref|NC_001142.1| Saccharomyces cerevisiae chromosome X, complete chromosome sequence

>gi|6322623|ref|NC_001143.1| Saccharomyces cerevisiae chromosome XI, complete chromosome sequence

>gi|6322960|ref|NC_001144.1| Saccharomyces cerevisiae chromosome XII, complete chromosome sequence

>gi|6323501|ref|NC_001145.1| Saccharomyces cerevisiae chromosome XIII, complete chromosome sequence

>gi|6323989|ref|NC_001146.1| Saccharomyces cerevisiae chromosome XIV, complete chromosome sequence

>gi|6324406|ref|NC_001147.1| Saccharomyces cerevisiae chromosome XV, complete chromosome sequence

>gi|6324971|ref|NC_001148.1| Saccharomyces cerevisiae chromosome XVI, complete chromosome sequence

Each line represents a line that matches the pattern \> which is really the pattern “>” but because “>” means something special on the command-line, we must escape it by adding a “\” to it. What does “>” mean?

The symbols “<” and “>” are special characters at the command line. The “<” symbol directs input into whatever tool is on the left hand side of it, while the “>” symbol directs output to whatever file is specified on the right hand side of it. For example, we saw that cat can display contents of a file to the screen. It can actually display input from an input stream too.
IMPORTANT: Forgetting to escape the > symbol when using it as a search term in grep can be disastrous. This is because the following command:
grep > input_file
is interpreted as “run grep with no arguments and write its output to a file named input_file”. So instead of printing lines containing the “>” symbol (as intended), this will simply write over your input file and replace your sequences with nothing (i.e. the file will now be empty)!



  • Try this:

    • cat < yeast.nt


This is a bit silly because cat can handle filenames perfectly well without the “<” symbol. More useful in this case is the ability to direct output. By default, output goes to the screen. You can send it to a file with the “>” symbol.


  • Now try this:


    • grep \> yeast.nt > output.txt







  • List the contents of your directory. You should see an output.txt file. cat the file to display its contents.



  • You can also use grep to count the number of matching lines by adding the -c option. Paired with the above pattern, this is a quick way to count how many sequences are in a file.
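The search, redirection, and counting steps can be tried on a small made-up file before touching real data. In this sketch tiny.fa and headers.txt are hypothetical names, and the pattern is written as '>' in single quotes, which protects it from the shell in the same way as the \> form used above.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A made-up FASTA file with two records.
printf '>seqA first record\nACGT\n>seqB second record\nGGCC\n' > tiny.fa

grep '>' tiny.fa > headers.txt         # first > is the (quoted) pattern, second > redirects output
header_count=$(grep -c '>' tiny.fa)    # -c prints the number of matching lines instead of the lines

cat headers.txt
echo "sequences: $header_count"
```

Because the pattern is quoted, tiny.fa is only read, never overwritten.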


1.7 Pipes


It’s possible to build small-scale workflows by sending the output of a tool to be the input of another.


  • Find the yeast.nt .fasta file to play with.



  • Type:





    • wc yeast.nt

You will see three counts for the file: lines, words, and bytes, in that order.




  • To see only a count of lines, you can use:





    • wc -l yeast.nt


  • Type:





    • grep \> yeast.nt | wc -l


The | symbol is a pipe and acts as a middle-man between grep’s output and wc’s input. This returns a count of the lines printed by grep, which is basically what the -c flag does for grep.





  • Programs sort and uniq are common to see in piped workflows. Read their man pages.
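As a small sketch of a piped workflow, the example below counts headers two ways and then uses sort and uniq to count distinct headers. The input file is made up (one header is deliberately repeated), and tr is used only to strip the padding that some versions of wc add around the number.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up FASTA file in which the header ">chrI" appears twice.
printf '>chrI\nACGT\n>chrII\nGGCC\n>chrI\nTTAA\n' > tiny.fa

via_pipe=$(grep '>' tiny.fa | wc -l | tr -d ' ')    # grep output becomes wc input
via_flag=$(grep -c '>' tiny.fa)                     # same count without the pipe

# uniq only collapses adjacent duplicates, so sort comes first.
distinct=$(grep '>' tiny.fa | sort | uniq | wc -l | tr -d ' ')

echo "header lines: $via_pipe (pipe) / $via_flag (-c)"
echo "distinct headers: $distinct"
```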

1.8 awk & sed


awk is a programming language that is typically used for parsing and filtering/selecting text; sed is a stream editor that is typically used for transforming text. They are often used in conjunction with one another in piped command sequences.

  • Locate the simple yeast example file that was downloaded earlier.

  • Perform a grep on the file, locating all sequence headers that begin with a “>” symbol. The output will have lines that appear as this:

    >gi|6319247|ref|NC_001133.1| Saccharomyces cerevisiae chromosome I, complete chromosome sequence




  • What if we wanted to filter these search results so that only the second column is displayed? Let’s create a sequence of piped commands to do this.



    • grep \> yeast.nt | awk -F'|' '{print $2;}'



The -F flag defines the field separator (whitespace is the separator by default); field separators determine the columns. The second part of the command in single quotes is awk code. This example has one line of code that prints a variable called $2. $2 just happens to be the second column as determined by the field separators. You can guess what $1, $3 etc. refer to. awk programming can get quite complicated; we could spend an entire session on it, but for now, just be aware that this kind of selecting/filtering of text is possible.
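To make the column numbering concrete, this sketch feeds a single header line (copied from the grep output shown earlier) through awk with | as the separator. NF is a built-in awk variable holding the number of fields on the current line; it is not used elsewhere in this handout.

```shell
header='>gi|6319247|ref|NC_001133.1| Saccharomyces cerevisiae chromosome I, complete chromosome sequence'

first=$(echo "$header" | awk -F'|' '{print $1}')         # text before the first |
second=$(echo "$header" | awk -F'|' '{print $2}')        # text between the first and second |
field_count=$(echo "$header" | awk -F'|' '{print NF}')   # four separators give five fields

echo "field 1: $first"
echo "field 2: $second"
echo "fields:  $field_count"
```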

  • Use grep and awk to print only the 4th column of the example file. The output should look like this:

NC_001224.1

NC_001133.1

NC_001134.1

NC_001135.1

NC_001136.2

NC_001137.2

NC_001138.1

NC_001139.1

NC_001140.2

NC_001141.1

NC_001142.1

NC_001143.1

NC_001144.1

NC_001145.1

NC_001146.1

NC_001147.1


NC_001148.1



  • Let’s suppose we want to use this list of references as input to a fictional script that only accepts numbers as input. How do we get rid of the NC_ prefix? The answer is sed.



sed relies heavily on regular expressions (much like grep does). It is an incredibly powerful tool and takes much practice to master. We hope to give you basic exposure to it by providing a common task (substitution). To solve the task above, we can simply substitute NC_ with nothing (an empty string).




    • grep \> yeast.nt | awk -F'|' '{print $4;}' | sed 's/NC_//'

The s means substitution: the stuff between the first and second /’s (here, it’s NC_) is the pattern to match, while the stuff between the second and third /’s is what to substitute that matched pattern with (here, it’s empty, which in turn replaces NC_ with nothing).
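The substitution can be tested on single strings before running it on a whole file. In this sketch the input strings other than NC_001133.1 are made up, and the trailing g flag (replace every match on the line, not just the first) is an addition not covered above.

```shell
stripped=$(echo 'NC_001133.1' | sed 's/NC_//')        # pattern NC_ replaced with nothing
first_only=$(echo 'NC_NC_1' | sed 's/NC_//')          # without g, only the first match is replaced
every_match=$(echo 'NC_NC_1' | sed 's/NC_//g')        # with g, all matches on the line are replaced
relabelled=$(echo 'NC_001133.1' | sed 's/NC_/ref-/')  # the replacement does not have to be empty

echo "$stripped"
echo "$first_only"
echo "$every_match"
echo "$relabelled"
```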





  • This is only a taste of what awk/sed can do, but it should be a good start. For example, you should be able to complete the following tasks:

    1. Substitute spaces for underscores in sequence headers.

    2. Retrieve lists of query/subject sequences from tabular-format BLAST reports.
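One possible approach to these two tasks is sketched below; treat it as a starting point rather than the intended solution. It assumes the first task means replacing spaces with underscores, and that the BLAST report uses the usual tabular layout in which the first two tab-separated columns are the query and subject IDs. The header and the hits.tsv rows are made up and cut down to four columns.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Task 1: replace every space in a header with an underscore (g = all matches on the line).
renamed=$(echo '>seq1 Saccharomyces cerevisiae chromosome I' | sed 's/ /_/g')
echo "$renamed"

# Task 2: made-up tabular hits; columns here are query, subject, percent identity, length.
printf 'query1\tsubjA\t98.5\t200\nquery1\tsubjB\t91.0\t180\nquery2\tsubjA\t99.1\t210\n' > hits.tsv

queries=$(awk -F'\t' '{print $1}' hits.tsv | sort | uniq)    # distinct query IDs
subjects=$(awk -F'\t' '{print $2}' hits.tsv | sort | uniq)   # distinct subject IDs

echo "$queries"
echo "$subjects"
```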

1.9 Compiling a Program


Although hundreds of tools are installed on a basic Linux machine, very few bioinformatics tools are provided by default. Some are available as packages and can be installed system-wide with a package manager such as apt-get if you have super-user privileges. Some tools provide binaries that can be executed immediately after downloading them. If no binary exists for your platform and the source code is available, you can compile the tool yourself.

We’ve already installed samtools onto agtc01, but let’s suppose you want to install it:




  • Download samtools. A quick web search will reveal its webpage and download link on SourceForge. There’s also a different method: instead of using wget, you can use git. git is a version control system (what does that mean?) but in practice, it also acts as a method of distributing source code.


    • git clone git://github.com/samtools/samtools.git



You can read the man-page for git to read about other cool features.

Alternatively, you could have downloaded the code from SourceForge. The result is a tar bzip2 file. Tar is a method of archiving multiple files, while bzip2 is a zipping/compression method. To untar and unzip this file:

    • tar -xjvf samtools-0.1.18.tar.bz2

The x means we are extracting (as opposed to c for creating), the j means decompress with bzip2, the v means verbose, which shows you the files as they are unpacked, and the f means the name of the archive file follows. Be aware that other compression options exist (-z handles gzip).
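To try the tar flags without downloading anything, the sketch below creates a small archive and unpacks it again. It uses the gzip variant (-z) rather than bzip2 (-j) simply because gzip was already used earlier in this class; the project directory and its files are made up, and rm -r (remove a directory and its contents) is an addition not covered above.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# A made-up project directory with two small files.
mkdir project
echo "int main(void) { return 0; }" > project/main.c
echo "placeholder notes" > project/README

tar -czf project.tar.gz project    # c = create, z = gzip, f = archive file name follows
rm -r project                      # delete the originals so the extraction is visible

tar -xzvf project.tar.gz           # x = extract, v = list each file as it is unpacked
ls project
```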



  • List the contents of the directory and change into the directory that was created by git.




  • If you list the contents of this new directory, you’ll see many .c and .h files which correspond to the actual source code. Feel free to cat or less them to see what the code actually looks like (but don’t accidentally edit them). You’ll also see a file called Makefile which contains instructions of how to compile the program with the help of a utility called make (check its man page).




    • make

It will give an error “No rule to make target ‘../htslib/htslib.mk’. Stop.” Some searching will reveal that we need to use git to clone the related htslib project. This project is a library: a collection of code designed to be used by other programs (like samtools).




  • Change back to the parent directory “..”




  • Download htslib.


    • git clone git://github.com/samtools/htslib.git



  • List the directory contents again and change into the new directory.



  • We have another directory of source code. This time make works:





    • make



Make reads the project’s Makefile, which consists of instructions telling the computer how to compile the project. In this case, the Makefile tells make to execute the compiler gcc several times with certain options to convert the various source files in htslib into a library that samtools can use. Most source code downloads on Linux use such a Makefile, though for many projects you have to first run a script like ./configure to create the Makefile.


  • Now change back to the samtools directory. It sits next to htslib, so you have to go up one level and then down into samtools; you can still do that in one command:





    • cd ../samtools





  • Run make one more time to build samtools. This time it finds the htslib library that you just built, and proceeds to use gcc as before to compile the samtools source code and link it together with htslib to create an executable program.




  • If you list the directory contents again, you’ll see a program called samtools.



  • If you type samtools, you’ll see the menu of options again – but this samtools is actually the system-installed one. To execute the local one:





    • ./samtools

Recall that ./ is the current directory.






How does the system know where samtools is actually located? By default, when you execute a program without specifying a location, the shell looks in every directory listed in your PATH environment variable. The following will show you what your path is:

    • echo $PATH





  • Try to find samtools.





    • locate samtools

It should display many hits because it’s using string matching, but primarily of interest is /usr/local/bin/samtools.





  • In fact, you can tell which samtools will be run by typing:





    • which samtools

You can also use whereis.


How do you add directories to your path?


  • View your current path:





    • echo $PATH

/home/fpd/fpd/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/daniel.harris/bin





  • Now type:





    • export PATH=$PATH:~/bin2





  • View path again:





    • echo $PATH

/home/fpd/fpd/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/home/daniel.harris/bin:/home/daniel.harris/bin2

It is important to note that this does not change your path forever. If you want to make it permanent, you need to place it in your .bash_profile – which is simply a file that gets read every time you log into the system.
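The effect of extending PATH can be seen with a throwaway script. This is a sketch under assumptions: mytool is a hypothetical program name, the directory is a temporary one, and command -v is used here as a portable way to ask the shell which file it would run. The last comment shows the kind of line you might append to .bash_profile; it is not executed.

```shell
bindir=$(mktemp -d)

# Create a tiny executable script called mytool (hypothetical name).
printf '#!/bin/sh\necho "hello from mytool"\n' > "$bindir/mytool"
chmod +x "$bindir/mytool"          # mark it executable (the x permission)

export PATH="$PATH:$bindir"        # append the directory for this session only

found=$(command -v mytool)         # where does the shell find it?
output=$(mytool)                   # run it without typing a location

echo "found at: $found"
echo "$output"

# To make such a change permanent, a line like this could go in ~/.bash_profile:
#   export PATH=$PATH:~/bin2
```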

There are other environment variables too. You can see them by typing env.




  • Try it
