Biopython: Python tools for computational biology
Brad Chapman and Jeff Chang
August 2000
Contents

1 Abstract
2 Introduction
3 Parsers for Biological Data
  3.1 Design Goals
  3.2 Usage Examples
    3.2.1 Usage Scenario
    3.2.2 Extracting information from a FASTA file
    3.2.3 Parsing Output from Swiss-Prot
    3.2.4 Downloading and Extracting information from PubMed
4 Representing Sequences
  4.1 Design Goals
  4.2 Usage Examples
5 Other Tools
  5.1 Biocorba interface
  5.2 Classification Tools
  5.3 Additional Functionality
6 Future Goals
  6.1 Planned Features
7 Contact Information
8 Conclusions
9 Acknowledgements
1 Abstract

The Biopython project was formed in August 1999 as a collaboration to collect and produce open source bioinformatics tools written in Python, an object-oriented scripting language. It is modeled on the highly successful Bioperl project, but has the goal of making libraries available for people doing computations in Python. The philosophy of all the Bio* projects is that part of a bioinformaticist's work involves software development. To prevent repeated effort, we believe the field can be advanced more quickly if libraries that perform common programming functions are available. Thus, we hope to create a central source of high-quality bioinformatics tools that researchers can use.
As an open source project, Biopython can be downloaded for free from the web site at http://www.biopython.org. Biopython libraries are currently under heavy development. This paper describes the current state of available Biopython tools, shows examples of their use in common bioinformatics problems, and describes plans for future development.
2 Introduction

Development of software tools is one of the most time-consuming aspects of the work done by a bioinformaticist. A successful solution to this problem has been the establishment of common repositories of interworking open-source libraries. This solution not only saves development time for researchers, but also leads to more robust, well-tested code due to the contributions of multiple developers on the project. The ideas behind open source collaborative software have been explored through a series of essays by Eric Raymond (http://www.tuxedo.org/~esr/writings/). One example where the open source methodology has been successfully applied is the Bioperl project (http://www.bioperl.org). Similarly, Biopython seeks to develop and collect biologically oriented code written in Python (http://www.python.org).
Python is an object-oriented scripting language that is well suited for processing text files and automating common tasks. Because it was designed from the ground up with an object-oriented framework, it also scales well and can be utilized for large projects. Python is portable to multiple platforms, including multiple UNIX variants, Windows, and Macintosh. The standard library comes with many high-level data structures, such as dictionaries, and contains numerous built-in modules to accomplish tasks from parsing with regular expressions to implementing an HTTP server.
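Those two features alone, dictionaries and regular expressions, already cover a lot of day-to-day sequence bookkeeping. A small illustrative sketch in plain Python (these helper functions are our own, not part of Biopython):

```python
import re

# count residue frequencies in a sequence string using a dictionary
def residue_counts(sequence):
    counts = {}
    for residue in sequence:
        counts[residue] = counts.get(residue, 0) + 1
    return counts

# use a regular expression to pull the accession out of a FASTA title,
# assuming the NCBI-style '>gi|number|db|accession|...' layout
def accession_from_title(title):
    match = re.match(r'>gi\|\d+\|\w+\|([\w.]+)\|', title)
    if match:
        return match.group(1)
    return None
```

For example, `accession_from_title('>gi|8980811|gb|AF267980.1|AF267980 ...')` returns 'AF267980.1'.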
The native Python libraries interface well with C, so after initial development in Python, computationally expensive operations can be recoded in C to increase program speed. In addition, JPython (http://www.jpython.org), an implementation of Python in pure Java, allows Python to freely interact with Java libraries. These options allow rapid development in Python, followed by deployment in other languages, if desired.

A library of common code for biological analysis is essential to allow bioinformaticists to take advantage of all of the benefits of programming in Python. This paper describes work towards the development of such a library.
3 Parsers for Biological Data

3.1 Design Goals

The most fundamental need of a bioinformaticist is the ability to import biological data into a form usable by computer programs. Thus, much of the initial development of Biopython has focused on writing code that can retrieve data from common biological databases and parse them into Python data structures. Designing parsers for bioinformatics file formats is particularly difficult because of the frequency at which the data formats change. This is partially because of inadequate curation of the structure of the data, and also because of changes in the contents of the databases. Biopython addresses these difficulties through the use of a standard event-oriented parser design.
The event-oriented nature of Biopython parsers is similar to that of the SAX (Simple API for XML) parser interface, which is used for parsing XML data files. They differ from SAX in that they are line-oriented; since nearly all biological data formats use lines as meaningful delimiters, biological parsers can be built on the assumption that line breaks are meaningful. A parser involves two components: a Scanner, whose job is to recognize and identify lines that contain meaningful information, and a Consumer, which extracts the information from those lines.
The Scanner does most of the difficult work in dealing with the parsing. It is required to move through a file and send out "events" whenever an item of interest is encountered in the file. These events are the key pieces of data in the file that a user would be interested in extracting. For instance, let's imagine we were parsing a FASTA formatted file with the following sequence info (cut so the lines fit nicely):

>gi|8980811|gb|AF267980.1|AF267980 Stenocactus crispatus
AAAGAAAAATATACATTAAAAGAAGGGGATGCGGG
...
As the scanner moved through this file, it would fire off four different types of events. Upon reaching a new sequence, a begin_sequence event would be sent. This would be followed by a title event upon reaching the information about the sequence, and a sequence event for every line of sequence information. Finally, everything would wrap up with an end_sequence event when we have no more sequence data in the entry.
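This event flow can be sketched in a few lines of plain Python. This is a simplified illustration of the idea only, not the actual Biopython scanner, and the EventRecorder consumer is purely hypothetical:

```python
class EventRecorder:
    # a trivial consumer that just records the names of the events it sees
    def __init__(self):
        self.events = []
    def begin_sequence(self):
        self.events.append('begin_sequence')
    def title(self, line):
        self.events.append('title')
    def sequence(self, line):
        self.events.append('sequence')
    def end_sequence(self):
        self.events.append('end_sequence')

def scan_fasta(handle, consumer):
    # walk FASTA-formatted lines, firing events at the consumer
    in_sequence = 0
    for line in handle:
        line = line.rstrip()
        if line[:1] == '>':
            # a new title line closes the previous entry and opens a new one
            if in_sequence:
                consumer.end_sequence()
            consumer.begin_sequence()
            consumer.title(line)
            in_sequence = 1
        elif line:
            consumer.sequence(line)
    if in_sequence:
        consumer.end_sequence()
```

Feeding the Stenocactus entry above through scan_fasta would produce a begin_sequence, a title, one sequence event per data line, and a final end_sequence.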
Creating and emitting these events is interesting, but not very useful unless we get some kind of information from the events, which is where the Consumer component comes in. The consumer registers itself with a scanner and lets it know that it wants to hear about all events that occur while scanning through a file. Then, it implements functions which deal with the events it is interested in.
To go back to our FASTA example, a consumer might just be interested in counting the number of sequences in a file. So, it would implement a begin_sequence function that increments a counter by one every time that event occurs. By simply receiving the information it is interested in from the Scanner, the consumer processes and deals with the files according to the programmer's needs.
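A sketch of such a counting consumer, using the event names from the FASTA example (an illustration only, not the real Biopython AbstractConsumer, which supplies the do-nothing methods for you):

```python
class SequenceCounter:
    # counts FASTA entries by reacting only to begin_sequence events
    def __init__(self):
        self.count = 0
    def begin_sequence(self):
        self.count = self.count + 1
    # events we are not interested in are simply ignored
    def title(self, line):
        pass
    def sequence(self, line):
        pass
    def end_sequence(self):
        pass
```

After scanning a file with two entries, such a consumer would hold a count of 2, regardless of how many title or sequence events went by.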
By decoupling scanners from consumers, developers can choose different consumers depending on their information or performance needs, while maintaining the same scanner. It is possible to develop multiple specialized consumer-handlers using the same scanner framework. These parsers can deal with a small subsection of the data relatively easily. For example, it may be desirable to have a parser that only extracts the sequence information from a Genbank file, without having to worry about the rest of the information. This saves time by not processing unnecessary information, and saves memory by not storing it.
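As a sketch of this idea, here is a consumer that keeps only sequence data and throws everything else away (again illustrative, written against the hypothetical FASTA event names used above rather than a real Biopython class):

```python
class SequenceOnlyConsumer:
    # collects just the sequence data, discarding titles and other
    # information, so nothing unnecessary is processed or stored
    def __init__(self):
        self.sequences = []
        self._current = []
    def begin_sequence(self):
        self._current = []
    def sequence(self, line):
        self._current.append(line)
    def end_sequence(self):
        # join the accumulated lines into one sequence string
        self.sequences.append(''.join(self._current))
    def title(self, line):
        pass
```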
3.2 Usage Examples

3.2.1 Usage Scenario

To take a look at the parsers in action, we'll look at some examples based around a common theme, to make things a little more exciting. Let's suddenly become really interested in Taxol, a novel anti-cancer drug (a good quick introduction to Taxol can be found at http://www.bris.ac.uk/Depts/Chemistry/MOTM/taxol/taxol.htm), and use this newfound interest to frame our work.
3.2.2 Extracting information from a FASTA file

To start our search for Taxol information, we first head to NCBI to do an Entrez search over the Genbank nucleotide databases (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide). Just searching for the keyword 'taxol' gives 22 results, so let's say we want to parse through these results and extract the id numbers of those that have to do with humans. To do this, we save the search results as a FASTA file, and then proceed to parse this file.
Based on the scanner-consumer discussion above, what we need to do is implement a consumer which will react every time we reach a title in the FASTA file, and check to see if the title mentions anything to do with humans. The following consumer does this job:
import string
from Bio.ParserSupport import AbstractConsumer

class TitleSearchConsumer(AbstractConsumer):
    def title(self, line):
        # see if the title contains a reference to humans
        location = string.find(line, "Homo sapiens")
        if location != -1:
            # split the string to give us the accession number
            result = string.split(line, '|')
            print 'Accession:', result[3]
The Consumer inherits from a base consumer class which ignores any sections that we are not interested in (like sequences). Now that we've got a consumer, we need to start up a scanner, inform it of which consumer we want to send events to, and then parse the file. The following code accomplishes this:
# set up the scanner, consumer and file to parse
from Bio.Fasta import Fasta
scanner = Fasta._Scanner()
consumer = TitleSearchConsumer()
file = open('taxol.fasta', 'r')

# parse all fasta records in the file
for n in range(22):
    scanner.feed(file, consumer)
Running this example gives the following output:
# python fasta_ex.py
Accession: AW615564.1
Accession: NM_004909.1
Accession: AF157562
...
This example uses the raw scanner-consumer interface we described above. Doing this can be clunky in many ways, since we have to know explicitly how many records we have to parse, and also need to access the scanner, which is marked as an internal class. The reason for this is that there are layers built on top of the raw scanner and consumer classes which help make them more intuitive to use.
3.2.3 Parsing Output from Swiss-Prot

Delving further into Taxol, we next decide to look for further information in Swiss-Prot (http://www.expasy.ch/sprot/sprot-top.htm), a hand-curated database of protein sequences. We search for Taxol or Taxus (Taxol was first isolated from the bark of the Pacific yew, Taxus brevifolia). This yields 15 results, which we save in SwissProt format. Next, we would like to print out a description of each of the proteins found. To do this, we use an iterator to step through each entry, and then use the SwissProt parser to parse each entry into a record containing all of the information in the entry.
First, we set up a parser and an iterator to use:
from Bio.SwissProt import SProt
file = open('taxol.swiss', 'r')
parser = SProt.RecordParser()
my_iterator = SProt.Iterator(file, parser)
The parser is a RecordParser, which converts a SwissProt entry into the record class mentioned above. Now, we can readily step through the file record by record, and print out just the descriptions from the record class:
next_record = my_iterator.next()
while next_record:
    print 'Description:', next_record.description
    next_record = my_iterator.next()
Utilizing the iterator and record class interfaces, the code ends up being shorter and more understandable.
In many cases this approach will be enough to extract the information you want, while the more general
Scanner and Consumer classes are always available if you need more control over how you deal with the
data.
3.2.4 Downloading and Extracting information from PubMed

Finally, in our search for Taxol information, we would like to search PubMed (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=PubMed) and get Medline articles dealing with Taxol. Biopython provides many nice interfaces for doing just this.
First, we would like to do a PubMed search to get a listing of all articles having to do with Taxol. We
can do this with the following two lines of code:
from Bio.Medline import PubMed
taxol_ids = PubMed.search_for('taxol')
Of course, article ids are not much use unless we can get the Medline records themselves. Here, we create a dictionary that can retrieve a PubMed entry by its id, and use a Medline parser to parse the entry into a usable format. A PubMed dictionary is accessible using Python dictionary semantics, in which the keys are PubMed ids and the values are the records in Medlars format.
from Bio import Medline
my_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = my_parser)
Now that we've got what we need, we can walk through and get the information we require:
for id in taxol_ids[0:5]:
    this_record = medline_dict[id]
    print 'Title:', this_record.title
    print 'Authors:', this_record.authors
Running this code will give output like the following:
# python medline_ex.py
Title: PKC412--a protein kinase inhibitor with a broad therapeutic potential [In Process Citation]
Authors: ['Fabbro D', 'Ruetz S', 'Bodis S', 'Pruschy M', 'Csermak K', 'Man A',
'Campochiaro P', 'Wood J', "O'Reilly T", 'Meyer T']
...
In this example, the Biopython classes make it easy to get PubMed information into a format that can be easily manipulated using standard Python tools. For instance, the authors are returned in a Python list, so they could easily be searched with code like:

if 'Monty P' in authors:
    print 'found author Monty P'
In sum, these examples demonstrate how the Biopython classes can be used to automate common biological tasks and deal with bioinformatic data in a Pythonic manner.