Automatic Analysis of Dewey Decimal
Classification Notations
Ulrike Reiner
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
37077 Göttingen, Germany
ulrike.reiner@gbv.de
Abstract. The Dewey Decimal Classification (DDC) was conceived by Melvil
Dewey in 1873 and published in 1876. Nowadays, the DDC serves as a library
classification system in about 138 countries worldwide. Recently, the German
translation of the DDC was launched, and since then interest in the DDC has
increased rapidly in German-speaking countries.
The complex DDC system (Ed. 22) makes it possible to synthesize (build) a huge
number of DDC notations (numbers) with the aid of instructions. Since the
meaning of built DDC numbers is not obvious, especially to non-DDC experts, a
computer program has been written that automatically analyzes DDC numbers.
Based on Songqiao Liu's dissertation (Liu (1997)), our program decomposes DDC
notations from the main class 700 (as one of the ten main classes). In
addition, our program analyzes notations from all ten classes and determines
the meaning of every semantic atom contained in a built DDC notation. The
extracted DDC atoms can be used for information retrieval, automatic
classification, or other purposes.
1 Introduction
While searching for books, journals, or web resources, you will often come
across numbers such as "025.1740973", "016.02092", or "720.7073". What do
they mean? Library professionals will identify these strings as numbers
(notations) of the Dewey Decimal Classification (DDC), which is named after
its creator, Melvil Dewey. Dewey originally designed the classification for
libraries, but the DDC has since also been adopted for classifying web and
other resources. The DDC is used, among other reasons, because it has a
long-standing tradition and is still up to date: in order to keep pace with
scientific progress, it is under ongoing development by a ten-member
international board (the Editorial Policy Committee, EPC). While the first
edition, published in 1876, comprised only a few pages, the current 22nd
edition of the DDC is a four-volume work of almost 4,000 pages. Today, the
DDC contains approx. 48,000 DDC notations and about 8,000 instructions.
The DDC notations are enumerated in the schedules and tables of the DDC.
With the aid of the instructions mentioned above, human classifiers can build
new (so-called) synthesized notations (numbers) if these are not specifically
listed in the DDC schedules. In this way, an enormous number of synthesized
DDC notations has been built intellectually over the last 130 years. These
mostly unused notations are contained in library catalogues like a hidden
treasure. They can be considered as belonging to the "Deep Lib", one of the
subsets of the "Deep Web" (Bergman (2001)). Can these notations be made
accessible for information retrieval purposes with reasonable effort?
Our answer to this question is the automatic analysis of notations of the
DDC. The analysis program, written in the pattern scanning and processing
language "gawk" (http://www.gnu.org/software/gawk/), determines all DDC
notations (together with their corresponding captions) contained in a
synthesized (built) DDC notation. Before we go into the details of the
automatic analysis of DDC notations in section 3, section 2 provides the
basis for the analysis. In section 4, the results as well as possible
applications are presented.
2 DDC Notations
Notations play an important role in the DDC:
"Notation is the system of symbols used to represent the classes in a
classification system. ... The notation provides a universal language to
identify the class and related classes, regardless of the fact that different
words or languages may be used to describe the class."
(http://www.oclc.org/dewey/versions/ddc22/intro.pdf)
The following figure illustrates this. Class C is represented by the
notation 025.43 or, alternatively, by captions in three different
languages:
[Figure: the notation 025.43 and the captions "Universalklassifikationssysteme",
"General classification systems", and "Système de classification" all point
to class C]
Fig. 1. Class C represented by notation 025.43 or by several captions
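The relation shown in Fig. 1 can be sketched as a small lookup structure. The following Python fragment is only an illustration and not part of the analysis program described in this paper; the names CAPTIONS, caption, and notations_for as well as the language codes "de", "en", and "fr" are assumptions made for the example.

```python
# Captions of notation 025.43 as shown in Fig. 1. The language codes
# ("de", "en", "fr") are labels assumed for this sketch.
CAPTIONS = {
    "025.43": {
        "de": "Universalklassifikationssysteme",
        "en": "General classification systems",
        "fr": "Système de classification",
    },
}


def caption(dno, language):
    """Return the caption of notation `dno` in `language`, or None if unknown."""
    return CAPTIONS.get(dno, {}).get(language)


def notations_for(cap):
    """Return all notations that carry `cap` as a caption in any language."""
    return sorted(dno for dno, caps in CAPTIONS.items() if cap in caps.values())


if __name__ == "__main__":
    print(caption("025.43", "en"))
    print(notations_for("Universalklassifikationssysteme"))
```

Whatever the language of the caption, the lookup leads back to the same notation, which is the sense in which the notation acts as a language-independent identifier of the class.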
The DDC notations are related to hierarchy and structure in the following
way:
"Hierarchy in the DDC is expressed through structure and notation. ...
Structural hierarchy means that all topics (aside from the ten main classes)
are part of all the broader topics above them. The corollary is also true:
whatever is true of the whole is true of the parts. This important concept is
called hierarchical force. ... Notational hierarchy is expressed by length of
notation. Numbers at any given level are usually subordinate to a class whose
notation is one digit shorter; coordinate with a class whose notation has the
same number of significant digits; and superordinate to a class with numbers
one or more digits longer." (http://www.oclc.org/dewey/versions/ddc22/intro.pdf)
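The notational hierarchy described in this quotation can be illustrated with a minimal Python sketch that derives, from the length of a notation alone, the chain of candidate broader notations. It is an illustration under simplifying assumptions and not the analysis program of this paper: the function names format_dno and broader_notations are hypothetical, and because the quotation says "usually", a derived candidate need not be a class that is actually listed in the schedules.

```python
def format_dno(digits):
    """Pad a digit string to three digits and put the decimal point after the third."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else padded[:3] + "." + padded[3:]


def broader_notations(dno):
    """Return candidate broader notations, from most specific to most general.

    Only the length of the notation is used; whether a candidate is a class
    that really occurs in the schedules is not checked.
    """
    digits = dno.replace(".", "")
    if len(digits) < 3 or not digits.isdigit():
        raise ValueError("not a DDC-style notation: %r" % dno)
    current = format_dno(digits)
    chain = []
    for n in range(len(digits) - 1, 0, -1):
        candidate = format_dno(digits[:n])
        # Skip candidates with a trailing zero after the decimal point.
        if "." in candidate and candidate.endswith("0"):
            continue
        # Skip candidates that merely repeat the input or the previous entry
        # (e.g. the prefix "72" of 720 formats to "720" again).
        if candidate == current or (chain and chain[-1] == candidate):
            continue
        chain.append(candidate)
    return chain


if __name__ == "__main__":
    print(broader_notations("025.43"))    # ['025.4', '025', '020', '000']
    print(broader_notations("720.7073"))  # ['720.707', '720.7', '720', '700']
```

For 025.43 the chain ends in the division 020 and the main class 000, each reached by dropping one significant digit.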
In compliance with the DDC system, the automatic analysis of notations
(numbers) of the DDC is carried out in the VZG (VerbundZentrale des
Gemeinsamen Bibliotheksverbundes) project Colibri (COntext generation and
LInguistic tools for Bibliographic Retrieval Interfaces). The goal of this
project is to enrich title records on the basis of the DDC in order to
improve retrieval. The analysis of DDC notations is guided by the following
research questions (which are posed in a similar way in Liu (1993), p. 18):
Q1: Is it possible to automatically decompose molecular DDC notations
into atomic DDC notations?
Q2: Is it possible to improve automatic classification and retrieval by means
of atomic DDC notations?
We define the terms "atomic DDC notation" and "molecular DDC notation" as
follows (a DDC notation being considered a string, i.e., an ordered sequence
of symbols):
Atomic DDC notation:
An atomic DDC notation is a semantically indecomposable string that
represents a DDC class.
Molecular DDC notation:
A molecular DDC notation is a string that is syntactically decomposable
into dno atoms.
General remarks: (1) We use the term "molecular DDC notation" instead of
"synthesized DDC notation" to emphasize the decomposition into dno atoms
(cf. section 3). (2) We abbreviate "DDC notation" as "dno", "atomic DDC
notation" as "dno atom", "molecular DDC notation" as "dno mol", "caption"
as "cap", "schedule notation" as "schedno", and "table notation" as "tabno".
(3) Technical terms (dno atom, dno mol, dno, cap, etc.) with an appended "s"
are to be understood as the plural forms of the respective terms.
DDC notations can be found in several places in the DDC. In the DDC
summaries, the notations for the main classes (or tens), the divisions (or
hundreds), and the sections (or thousands) are enumerated. Other notations are
listed in the schedules ("DDC schedule notations") or (internal) tables ("DDC
table notations"). The DDC schedules are "the series of DDC numbers 000-999,
their headings (captions), and notes." (Mitchell (1996), p. lxv). A DDC table
is "a table of numbers that may be added to other numbers to make a class
number appropriately specific to the work being classified" (Mitchell (1996),
p. lxv). Further notations are contained in the "Relative Index" of the DDC.
The frequency distribution of the schedule notations is shown in Fig. 2 and
that of the table notations in Fig. 3, where schedno0 is shorthand for DDC
schedule notations beginning with 0, schedno1 for DDC schedule notations
beginning with 1, etc. The captions for the main classes are: 000: Computer
science, information & general works; 100: Philosophy & psychology; 200:
Religion; 300: Social sciences; 400: Language; 500: Science; 600: Technology;
700: Arts & recreation; 800: Literature; 900: History & geography. As
illustrated by Fig. 2, DDC notations are not distributed uniformly: most
schednos are found in the class "Technology", followed by the class "Social
sciences". The fewest notations belong to the class "Philosophy &
psychology". With regard to the tabnos (Fig. 3), the 7,816 Table 2 notations
("Geographic Areas, Historical Periods, Persons") stand out, whereas the
quantities of all other tabnos are comparatively small (Table 1: Standard
Subdivisions; Table 3: Subdivisions for the Arts, for Individual Literatures,
for Specific Literary Forms; Table 4: Subdivisions of Individual Languages and
Language Families; Table 5: Ethnic and National Groups; Table 6: Languages).
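The grouping behind such a frequency distribution can be sketched in a few lines of Python. This is only an illustration of the schedno0 ... schedno9 grouping by leading digit; the function name count_by_first_digit is hypothetical, and the sample input consists of notations quoted in this paper, not of the actual schedule data behind Fig. 2.

```python
from collections import Counter


def count_by_first_digit(dnos):
    """Count notations per leading digit, keyed as 'schedno0' ... 'schedno9'."""
    counts = Counter(dno[0] for dno in dnos if dno and dno[0].isdigit())
    # A Counter returns 0 for digits that never occur, so every group is present.
    return {"schedno" + d: counts[d] for d in "0123456789"}


if __name__ == "__main__":
    # Sample notations taken from the text; not real frequency data.
    sample = ["025.1740973", "016.02092", "720.7073", "025.43"]
    for group, n in count_by_first_digit(sample).items():
        print(group, n)
```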
As mentioned before, DDC notations that are not explicitly listed in the
schedules can be built by using DDC instructions. This process is called
"notational synthesis" or "number building". Its results are synthesized DDC
notations (dno mols) that usually only DDC experts are able to interpret.
With the aid of our program component vc day (vzg colibri ddc number
analyzer), however, the meaning of dno mols is revealed, and the determined
dno atoms can be used, among other things, to answer question Q2. The state
of the art of the automatic analysis of DDC notations and the program
components vc day and vc KB (vzg colibri Knowledge Base), the dno input files
for vc day (in gvk all, in liu t), and the vc day output files (vc daygram:
DDC analysis diagram, and vc dayset: DDC analysis set of dnos or captions)
are the subject of the next sections. When we speak of program components, we
want to make clear that they belong to the main program vc ds (vzg colibri
search system, cf. Fig. 4), which will not be discussed here (the dotted
lines) but is described elsewhere (Reiner (2005), Reiner (2007a), and Reiner
(2007b)).
Fig. 2. Frequency distribution of DDC schedule notations
Fig. 3. Frequency distribution of DDC table notations
3 Automatic Analysis of DDC Notations
in gvk all. The GBV Union Catalog GVK (Gemeinsamer VerbundKatalog,
http://gso.gbv.de/) contains 3,073,423 intellectually DDC-classified title records
(status: July, 2004). A few records have more than one DDC notation assigned