Biopython: Python tools for computational biology
Brad Chapman and Jeff Chang
August 2000
Contents

1 Abstract
2 Introduction
3 Parsers for Biological Data
  3.1 Design Goals
  3.2 Usage Examples
    3.2.1 Usage Scenario
    3.2.2 Extracting information from a FASTA file
    3.2.3 Parsing Output from Swiss-Prot
    3.2.4 Downloading and Extracting information from PubMed
4 Representing Sequences
  4.1 Design Goals
  4.2 Usage Examples
5 Other Tools
  5.1 Biocorba interface
  5.2 Classification Tools
  5.3 Additional Functionality
6 Future Goals
  6.1 Planned Features
7 Contact Information
8 Conclusions
9 Acknowledgements
1 Abstract

The Biopython project was formed in August 1999 as a collaboration to collect and produce open source bioinformatics tools written in Python, an object-oriented scripting language. It is modeled on the highly successful Bioperl project, but has the goal of making libraries available for people doing computations in Python. The philosophy of all the Bio* projects is that part of a bioinformaticist's work involves software development. To prevent repeated effort, we believe the field can be advanced more quickly if libraries that perform common programming functions are available. Thus, we hope to create a central source of high-quality bioinformatics tools that researchers can use.
As an open source project, Biopython can be downloaded for free from the web site at http://www.biopython.org. Biopython libraries are currently under heavy development. This paper describes the current state of available Biopython tools, shows examples of their use in common bioinformatics problems, and describes plans for future development.
2 Introduction

Development of software tools is one of the most time-consuming aspects of the work done by a bioinformaticist. A successful solution to this problem has been the establishment of common repositories of interworking open-source libraries. This solution not only saves development time for researchers, but also leads to more robust, well-tested code due to the contributions of multiple developers on the project. The ideas behind open source collaborative software have been explored through a series of essays by Eric Raymond (http://www.tuxedo.org/~esr/writings/). One example where the open source methodology has been successfully applied is the Bioperl project (http://www.bioperl.org). Similarly, Biopython seeks to develop and collect biologically oriented code written in Python (http://www.python.org).
Python is an object-oriented scripting language that is well suited for processing text files and automating common tasks. Because it was designed from the ground up with an object-oriented framework, it also scales well and can be utilized for large projects. Python is portable to multiple platforms, including multiple UNIX variants, Windows, and Macintosh. The standard library comes with many high-level data structures, such as dictionaries, and contains numerous built-in modules to accomplish tasks from parsing with regular expressions to implementing an HTTP server.
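Those two features alone, dictionaries and regular expressions, already cover a lot of day-to-day sequence bookkeeping. A small illustrative sketch in plain Python (these helper functions are our own, not part of Biopython):

```python
import re

# count residue frequencies in a sequence string using a dictionary
def residue_counts(sequence):
    counts = {}
    for residue in sequence:
        counts[residue] = counts.get(residue, 0) + 1
    return counts

# use a regular expression to pull the accession out of a FASTA title,
# assuming the NCBI-style '>gi|number|db|accession|...' layout
def accession_from_title(title):
    match = re.match(r'>gi\|\d+\|\w+\|([\w.]+)\|', title)
    if match:
        return match.group(1)
    return None
```

For example, `accession_from_title('>gi|8980811|gb|AF267980.1|AF267980 ...')` returns 'AF267980.1'.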
The native Python libraries interface well with C, so after initial development in Python, computationally expensive operations can be recoded in C to increase program speed. In addition, JPython (http://www.jpython.org), an implementation of Python in pure Java, allows Python to freely interact with Java libraries. These options allow rapid development in Python, followed by deployment in other languages, if desired.

A library of common code for biological analysis is essential to allow bioinformaticists to take advantage of all of the benefits of programming in Python. This paper describes work towards the development of such a library.
3 Parsers for Biological Data

3.1 Design Goals

The most fundamental need of a bioinformaticist is the ability to import biological data into a form usable by computer programs. Thus, much of the initial development of Biopython has focused on writing code that can retrieve data from common biological databases and parse them into Python data structures. Designing parsers for bioinformatics file formats is particularly difficult because of the frequency at which the data formats change. This is partially because of inadequate curation of the structure of the data, and also because of changes in the contents of the databases. Biopython addresses these difficulties through the use of a standard event-oriented parser design.
The event-oriented nature of Biopython parsers is similar to that of the SAX (Simple API for XML) parser interface, which is used for parsing XML data files. They differ from SAX in that they are line-oriented; since nearly all biological data formats use lines as meaningful delimiters, biological parsers can be built on the assumption that line breaks are meaningful. A parser involves two components: a Scanner, whose job is to recognize and identify lines that contain meaningful information, and a Consumer, which extracts the information from those lines.
The Scanner does most of the difficult work in dealing with the parsing. It is required to move through a file and send out "events" whenever an item of interest is encountered in the file. These events are the key pieces of data in the file that a user would be interested in extracting. For instance, let's imagine we were parsing a FASTA formatted file with the following sequence info (cut so the lines fit nicely):

>gi|8980811|gb|AF267980.1|AF267980 Stenocactus crispatus
AAAGAAAAATATACATTAAAAGAAGGGGATGCGGG
...
As the scanner moved through this file, it would fire off four different types of events. Upon reaching a new sequence, a begin_sequence event would be sent. This would be followed by a title event upon reaching the information about the sequence, and a sequence event for every line of sequence information. Finally, everything would wrap up with an end_sequence event when we have no more sequence data in the entry.
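This event flow can be sketched in a few lines of plain Python. This is a simplified illustration of the idea only, not the actual Biopython scanner, and the EventRecorder consumer is purely hypothetical:

```python
class EventRecorder:
    # a trivial consumer that just records the names of the events it sees
    def __init__(self):
        self.events = []
    def begin_sequence(self):
        self.events.append('begin_sequence')
    def title(self, line):
        self.events.append('title')
    def sequence(self, line):
        self.events.append('sequence')
    def end_sequence(self):
        self.events.append('end_sequence')

def scan_fasta(handle, consumer):
    # walk FASTA-formatted lines, firing events at the consumer
    in_sequence = 0
    for line in handle:
        line = line.rstrip()
        if line[:1] == '>':
            # a new title line closes the previous entry and opens a new one
            if in_sequence:
                consumer.end_sequence()
            consumer.begin_sequence()
            consumer.title(line)
            in_sequence = 1
        elif line:
            consumer.sequence(line)
    if in_sequence:
        consumer.end_sequence()
```

Feeding the Stenocactus entry above through scan_fasta would produce a begin_sequence, a title, one sequence event per data line, and a final end_sequence.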
Creating and emitting these events is interesting, but not very useful unless we get some kind of information from the events, which is where the Consumer component comes in. The consumer registers itself with a scanner and lets it know that it wants to hear about all events that occur while scanning through a file. Then, it implements functions which deal with the events it is interested in.
To go back to our FASTA example, a consumer might just be interested in counting the number of sequences in a file. So, it would implement a begin_sequence function that increments a counter by one every time that event occurs. By simply receiving the information it is interested in from the Scanner, the consumer processes and deals with the files according to the programmer's needs.
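A sketch of such a counting consumer, using the event names from the FASTA example (an illustration only, not the real Biopython AbstractConsumer, which supplies the do-nothing methods for you):

```python
class SequenceCounter:
    # counts FASTA entries by reacting only to begin_sequence events
    def __init__(self):
        self.count = 0
    def begin_sequence(self):
        self.count = self.count + 1
    # events we are not interested in are simply ignored
    def title(self, line):
        pass
    def sequence(self, line):
        pass
    def end_sequence(self):
        pass
```

After scanning a file with two entries, such a consumer would hold a count of 2, regardless of how many title or sequence events went by.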
By decoupling scanners from consumers, developers can choose different consumers depending on their information or performance needs, while maintaining the same scanner. It is possible to develop multiple specialized consumer-handlers using the same scanner framework. These parsers can deal with a small subsection of the data relatively easily. For example, it may be desirable to have a parser that only extracts the sequence information from a Genbank file, without having to worry about the rest of the information. This saves time by not processing unnecessary information, and saves memory by not storing it.
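As a sketch of this idea, here is a consumer that keeps only sequence data and throws everything else away (again illustrative, written against the hypothetical FASTA event names used above rather than a real Biopython class):

```python
class SequenceOnlyConsumer:
    # collects just the sequence data, discarding titles and other
    # information, so nothing unnecessary is processed or stored
    def __init__(self):
        self.sequences = []
        self._current = []
    def begin_sequence(self):
        self._current = []
    def sequence(self, line):
        self._current.append(line)
    def end_sequence(self):
        # join the accumulated lines into one sequence string
        self.sequences.append(''.join(self._current))
    def title(self, line):
        pass
```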
3.2 Usage Examples

3.2.1 Usage Scenario

To take a look at the parsers in action, we'll look at some examples based around a common theme, to make things a little more exciting. Let's suddenly become really interested in Taxol, a novel anti-cancer drug (a good quick introduction to Taxol can be found at http://www.bris.ac.uk/Depts/Chemistry/MOTM/taxol/taxol.htm), and use this newfound interest to frame our work.
3.2.2 Extracting information from a FASTA file

To start our search for Taxol information, we first head to NCBI to do an Entrez search over the Genbank nucleotide databases (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide). Just searching for the keyword 'taxol' gives 22 results, so let's say we want to parse through these results and extract the id numbers of those that have to do with humans. To do this, we save the search results as a FASTA file, and then proceed to parse this file.
Based on the scanner-consumer discussion above, what we need to do is implement a consumer which will react every time we reach a title in the FASTA file, and check to see if the title mentions anything to do with humans. The following consumer does this job:
import string
from Bio.ParserSupport import AbstractConsumer

class TitleSearchConsumer(AbstractConsumer):
    def title(self, line):
        # see if the title contains a reference to humans
        location = string.find(line, "Homo sapiens")
        if location != -1:
            # split the string to give us the accession number
            result = string.split(line, '|')
            print 'Accession:', result[3]
The Consumer inherits from a base consumer class which ignores any sections that we are not interested in (like sequences). Now that we've got a consumer, we need to start up a scanner, inform it of which consumer we want to send events to, and then parse the file. The following code accomplishes this:
# set up the scanner, consumer and file to parse
from Bio.Fasta import Fasta
scanner = Fasta._Scanner()
consumer = TitleSearchConsumer()
file = open('taxol.fasta', 'r')

# parse all fasta records in the file
for n in range(22):
    scanner.feed(file, consumer)
Running this example gives the following output:
# python fasta_ex.py
Accession: AW615564.1
Accession: NM_004909.1
Accession: AF157562
...
This example uses the raw scanner-consumer interface we described above. Doing this can be clunky in many ways, since we have to know explicitly how many records we have to parse, and also need to access the scanner, which is marked as an internal class. The reason for this is that there are layers built on top of the raw scanner and consumer classes which help make them more intuitive to use.
3.2.3 Parsing Output from Swiss-Prot

Delving further into Taxol, we next decide to look for further information in Swiss-Prot (http://www.expasy.ch/sprot/sprot-top.htm), a hand-curated database of protein sequences. We search for Taxol or Taxus (Taxol was first isolated from the bark of the Pacific yew, Taxus brevifolia). This yields 15 results, which we save in SwissProt format. Next, we would like to print out a description of each of the proteins found. To do this, we use an iterator to step through each entry, and then use the SwissProt parser to parse each entry into a record containing all of the information in the entry.
First, we set up a parser and an iterator to use:
from Bio.SwissProt import SProt
file = open('taxol.swiss', 'r')
parser = SProt.RecordParser()
my_iterator = SProt.Iterator(file, parser)
The parser is a RecordParser, which converts a SwissProt entry into the record class mentioned above. Now, we can readily step through the file record by record, and print out just the descriptions from the record class:
next_record = my_iterator.next()
while next_record:
    print 'Description:', next_record.description
    next_record = my_iterator.next()
Utilizing the iterator and record class interfaces, the code ends up being shorter and more understandable.
In many cases this approach will be enough to extract the information you want, while the more general
Scanner and Consumer classes are always available if you need more control over how you deal with the
data.
3.2.4 Downloading and Extracting information from PubMed

Finally, in our search for Taxol information, we would like to search PubMed (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=PubMed) and get Medline articles dealing with Taxol. Biopython provides many nice interfaces for doing just this.
First, we would like to do a PubMed search to get a listing of all articles having to do with Taxol. We
can do this with the following two lines of code:
from Bio.Medline import PubMed
taxol_ids = PubMed.search_for('taxol')
Of course, article ids are not much use unless we can get the Medline records themselves. Here, we create a dictionary that can retrieve a PubMed entry by its id, and use a Medline parser to parse the entry into a usable format. A PubMed dictionary is accessible using Python dictionary semantics, in which the keys are PubMed ids and the values are the records in Medlars format.
from Bio import Medline
my_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = my_parser)
Now that we've got what we need, we can walk through and get the information we require:
for id in taxol_ids[0:5]:
    this_record = medline_dict[id]
    print 'Title:', this_record.title
    print 'Authors:', this_record.authors
Running this code will give output like the following:
# python medline_ex.py
Title: PKC412--a protein kinase inhibitor with a broad therapeutic potential [In Process Citation]
Authors: ['Fabbro D', 'Ruetz S', 'Bodis S', 'Pruschy M', 'Csermak K', 'Man A',
'Campochiaro P', 'Wood J', "O'Reilly T", 'Meyer T']
...
In this example, the Biopython classes make it easy to get PubMed information into a format that can be easily manipulated using standard Python tools. For instance, the authors are returned in a Python list, so they could easily be searched with code like:

if 'Monty P' in authors:
    print 'found author Monty P'
In sum, these examples demonstrate how the Biopython classes can be used to automate common biological tasks and deal with bioinformatic data in a Pythonic manner.