Running Head: XML: The Open Source Solution to
Interoperability
XML:
The Open Source Solution to Interoperability
Amy E. Neeser
San Jose State University
Early
Markup Languages
The first markup language, GML
(Generalized Markup Language), was created in 1969 by a group of IBM
researchers. Its original purpose was
for document publishing, text editing and formatting, and allowed basic
information retrieval systems to share documents (Kay, 2005, p. 30). As these technologies progressed, GML expanded
in 1986 and became known as SGML (Standard Generalized Markup Language) which
was created to, provide a set of rules that describe the structure of an
electronic document so that it may be interchanged across various computer
platforms (Chowdhury, 2004, p. 323). In
addition, SGML allowed users to add editorial comments to files, create
different versions of a document in a single file, identify where to place
various types of illustrations and how to incorporate them into text files, and
provide basic information to supporting programs.
Despite these important advancements,
markup languages were nevertheless highly complex and still mostly unknown to
the average user until Tim Berners-Lee, inventor of the World Wide Web, created
HTML (Hypertext Markup Language) in 1990 (Kay, 2005). With the rise of the Internet, data needed to
be displayed in a Web browser so HTML was created to incorporate information
that dealt with presentation. HTML is
criticized as a much too limited markup language because it uses fixed tags
that are completely unrelated to the actual data of the resource; these tags simply
describe how that data should be displayed in a browser (Fichter & Cervone,
2000). Because of the great need for an
advanced yet usable generalized markup language that focuses on the data itself
instead of simply the display, XML (Extensible
Markup Language) was
finally created and is being used in a wide variety of settings today.
Simply put, XML is a combination of SGML
and HTML; it is less complex and resource-intensive than SGML yet surpasses the
ability of HTML in that it does much more than tells a browser how to display
data and link to other items. XML is
intended for computers to generate data, read data, and ensure that the data
structure is unambiguous (Chowdhury, 2004, p. 325). XML is not directly tied to any particular
program or application; instead, it simply describes and structures the data so
it may be interpreted by whatever program happens to be using it. XML is fully interchangeable and customizable
and can be adapted for the particular needs and terminology of individual
fields (Saunders, 1998, p. 45). This
means that the same application could display information on a Web browser,
hand-held computing device, or cell phone simply by using a different style for
each device type (Fichter & Cervone, 2000, p. 32).
XMLs self-describing tags provide a
highly detailed representation of documents and data, which inherently ties
this markup language to information retrieval.
An XML database enables this information to be indexed for powerful,
detailed search It also supports multi-criteria sorting and delivers multiple
options for ordering results (Rogers, 2004, p. 19). Because of these descriptive names and labels
that are assigned through tagging, information can be accessed and retrieved by
a number of different systems and a multitude of applications, making this an
optimal tool to facilitate information retrieval. By breaking down traditional silos which were
barriers to information sharing, XML enables information to be reused by,
integrating text and data from different sources and by searching and linking
across these sources (Adler, Cochrane, Morar, & Spector, 2006, p.
210).
There are many factors, aspects, and
characteristics about XML that have made it, the predominant mechanism for
electronic data interchange between information systems (Adler, Cochrane,
Morar, & Spector, 2006, p. 207).
Some of these characteristic that will be further explored in this paper
include the advantage of being an open source program, using XML to solve information
searching and retrieving dilemmas on the Web, and the examination of XML both
inside and outside of the library setting.
Many information specialists are in favor of the open source extensible
markup language XML as its purpose is to aid information systems in sharing
structured data; however, this is still a fairly new technology and requires
much more analysis and consideration. The
intent of this paper is to examine the ever-increasing role that XML as an open
source entity plays in the field of library and information science,
specifically in regards to information retrieval. This paper will mostly focus on the use of
XML in libraries; however, due to the multi-system accessibility of XML, I will
also explore a wide range of studies and criticism that fall outside of the
library setting as well.
Definitions
Throughout the course of this paper, I
will be returning to a few specific terms that need to be defined in order to
understand their relationship to each other and to the broader field of library
and information science. First and
foremost, the term information retrieval
can be defined as, retrieval of bibliographic
information from stored document databases (Chowdhury, 2004, p. 1). Furthermore, the function of an information
retrieval system is to retrieve the information either the actual
information or the documents containing the information that fully or
partially match the users query (Chowdhury, 2004, p. 2). Information retrieval systems are diverse and
can range from digital libraries, to OPACs (Online Public Access Catalogs), to online databases, to various types of web search
engines.
A general definition for Markup languages may be defined as, A
scheme that allows the tagging and describing of individual structural elements
of text for the purpose of digital storage, appropriate layout display, and
retrieval of individual components (Taylor & Joudrey , 2009, p. 463). XML,
a specific type of markup language, can be distinguished as, A subset of SGML,
designed specifically for Web documents, that omits some features of SGML and
include a few additional features it allows designers to create their own
customized tags, thus overcoming many of the limitations of HTML (Taylor &
Joudrey , 2009, p. 478).
Finally, open source software (OSS) can be defined as According to the Online Dictionary for Library and
Information Science (n.d.), open source is defined as
A computer program for which
the source code
is made available without charge by the owner or licenser, usually via the Internet, to encourage the
rapid development of a more useful and bug-free product through
open peer review.
The practice also allows the product to be customized by its users to suit
local needs. To be certified open source under the Open Source Initiative
(OSI), software
must meet certain established criteria that include no restrictions on access.
According to the OSI, OSS must also
comply with the following criteria in order to fall into this category: free
redistribution, inclusion of source code, must allow the creation of derived
works from the original product, integrity of the authors source code, no
discrimination of persons, groups, or fields of endeavor, distribution of
license, the license must not be specific to a product, the license must not
restrict other software, and the license must be technology-neutral (http://opensource.org/docs/osd). Although
these definitions are good starting points for beginning to understand these
important terms, their relationship to each other and their broader
implications will become much clearer throughout the remainder of this paper.
The
Open Source Advantage
As the need for a more useable yet simultaneously
more advanced markup language became more apparent, it was also becoming more
evident that many differing, specialized languages suited to specific domains
were required to represent the numerous bodies of data used in those domains
(Adler, Cochrane, Morar, & Spector, 2006, p. 208). These individual languages also needed to be
able to be shared and maintained by differing technologies and by a wide body
of users for different purposes. If each
separate domain developed its own language and accompanying system, it would be
a very inefficient use of time and resources because this data could not be
shared across systems. XML was the
solution to this problem in that it provided a very general approach for
satisfying these common requirements. It allowed the definition of languages
in which information is encoded as tagged text and in which different encodings
and tags support different domains of discourse (Adler, Cochrane, Morar, &
Spector, 2006, p. 208). Because XML was
developed to be used by a wide array of disciplines and by both large and small
scale interests, it needed to be open source, and thus nonproprietary, so
multiple parties could contribute and ultimately share technology, information,
and resources.
As XML standards were being developed, a
community of diverse groups and individuals was also being established. There were a wide variety of reasons why both
individuals and companies not only favored but also endorsed and participated
in XMLs open standards: to gain the
benefit of an open community to supplement their own development resources, to
take advantage of the positive marketing perceptions surrounding the
participation in nonproprietary solutions, and to benefit from the vast market
opportunities created (Adler, Cochrane, Morar, & Spector, 2006, p.
209). New standards and prototypes were
now easier to develop with the work and support of an entire community with
various abilities and background knowledge.
The desire not to be left behind the competition, customer requirements
for interoperable solutions, and the simple economics of sharing in a common
pool and community of interests all led to the rapid development and adoption
of open standards (Adler, Cochrane, Morar, & Spector, 2006, p. 209).
The fact that XML is open source greatly
contributed to the markup language playing an important role in information
retrieval. One of the great challenges
in information retrieval is being able to successfully share and retrieve
documents and information from across many databases and disciplines. Largely because it is nonproprietary, XML and
its related standards were allowed to enable, data interoperability, content
manipulation, content sharing and reuse, document assembly, document security
and access control, document filtering, and document formatting across all
disciplines and for all types of devices and applications (Adler, Cochrane,
Morar, & Spector, 2006, p. 209).
When information is more accessible, it is available to a larger
audience and ultimately achieves one of the major goals of information
retrieval: true interoperability.
Solving
Information Searching and Retrieving Dilemmas on the Web
XMLs relationship to information access
and retrieval is especially evident when it comes to resolving information and
content retrieval errors with the use of XML its many associated programs. XML is commonly used by many differing types
of information retrieval systems such as digital libraries, online databases
and OPACs, but the Web is one of the greatest challenges when it comes to
information retrieval. The Web is the
worlds greatest repository of information if you can efficiently find what
youre looking for (Rogers, 2004, p. 19).
When irrelevant documents and information are retrieved after a query,
one must first consider what the problem is, and then how to remedy that
error. the focal point for most
content retrieval errors is the data itself (Yager, 2000, p. 88). XML is a powerful tool that can aid users in
correcting this flawed data to make it more accessible and thus ultimately more
retrievable.
Information and documents on the Web are
often described as being in silos or stranded on information islands when this
data is not searchable beyond a single site.
XML helps ease this problem and, makes it easy to expose information
in a content management system to other sites, enabling searches that cover
multiple databases across many sites (Rogers, 2004, p. 19). XML enable(s) information reuse by
integrating texts and data from different sources and by searching and linking
across these sources, thereby breaking down traditional silos, which were
barriers to information sharing (Adler, Cochrane, Morar, & Spector, 2006, p.
210). With the help of this
user-friendly markup language, users are able to characterize text within a
document with the use of tags and labels so they have the ability to
simultaneously search across multiple information retrieval systems, making
search results on the Web much more accurate.
In addition to examining information
retrieval on the Web, it is also important to consider information extraction (IE)
when discussing the use of XML to aid in locating and retrieving documents and
information. Information extraction
software identifies and removes relevant information from a variety of sources,
pulling information from a variety of sources, and aggregates it to create a
single view. IE translates content into
a homogeneous form through technologies like XML (Adams, 2001, p. 27). While there is certainly interplay between
the two, information retrieval mainly focuses on document retrieval whereas
information extraction focuses on the retrieval of facts. Nevertheless, both must overcome difficulties
of retrieving information on the Web such as language ambiguities including
synonymy, polysemy, morphology, and homogeny (Lancaster, 1991), and XML is
essential for this task.
XML is important
because it facilitates increased access to and description of the content
contained within the documents. The
technology separates the intellectual content of a text from its surrounding
structure, meaning that information can be converted into a uniform structure.
(Adams, 2001, p. 30)
Because of XMLs flexibility, it has the
ability to work in conjunction with and be employed by other programs,
techniques, and applications. AJAX
(Asynchronous JavaScript and XML) is a perfect example of this kind of
interoperability. While AJAX functions
in a number of differing capacities, it is best known for its ability to
retrieve data from a server (asynchronously) in the background without
affecting the functions on the current page that is being displayed. This is achieved by making, a request to a
server via Hypertext Transfer Protocol (HTTP) and continue to process other
data while waiting for the response (Clark, 2008, p. 32). While this may initially sound superfluous,
it is being more commonly used in online information retrieval systems such as
Web search engines, online databases, and digital libraries to aid in the
searching, browsing, and retrieval processes.
AJAX is responsible for simple techniques that many users take for
granted such as username and password verification without having to lose or
reload the data on the entire screen.
Another example is keystroke matching; in other words, as users search
for keywords, subjects, and titles in information retrieval systems, potential
keyword matches are often displayed under the form field in order to help with faster
entry and spelling correction for more efficient and successful information
retrieval. These tasks would simply be
impossible without the use of XML in other related programs and applications.
XML
in the Library Setting
In addition to the myriad of ways that XML,
specifically when being used in AJAX, can aid in perfecting searching and
information retrieval on the Web, it also has practical applications in the
library setting. For example, many of
the ways that AJAX aids in more efficient Web searching could also be used to
improve libraries OPAC systems.
Possible keyword matches, verification of a users personal information
or settings, and digital library search applications could be faster and done
without loss of data or having to wait for pages to reload. AJAX can also make searching and browsing
library resources easier (Clark, 2008, p. 32).
Rather than having to click through the OPACs various subject pages,
AJAX enables the user to browse through the possibilities by simply rolling the
mouse over the links. Ajax reduces the
need to click through to more information, bringing data into the users
working environment (Clark, 2008, p. 32).
In many cases, what may seem like a single search to a user is actually
an unseen complex task as AJAX accesses multiple databases and libraries in
order to bring the requested information to one location; this greatly improves
the quality of the users searches without them ever being aware of it. An increasing number of libraries are
beginning to incorporate AJAX into their systems in various capacities as it
aids in the improvement of information access and retrieval; this would of
course be impossible without the multi-functionality of XML.
XML also plays an important role in the
library setting with its strong impact on electronic records. An increasing number of libraries have
adopted the use of electronic records rather than hard copy records because
they are, easier to transmit, store, and access than the paper records they
represent (Winters, 2005, p. 64).
Because of this wide acceptance, libraries needed to determine how to
manage these new records, and in many cases, XML has been adopted for this
purpose. More specifically, XML aids in
handling an items format which is how the item is displayed, the structure which
shows how to treat each item, and an items meaning which interprets each item
based on the given tags (Winters, 2005, p. 65).
Using XML to handle electronic records is relevant to both the open
source movement and information retrieval because once an item is tagged using
XML, various other applications are able to interpret this data thus making it
usable to more systems. In addition,
more people are able to use these records because XML is human-readable in
addition to being machine-readable; this is important because, it is possible
to interpret an XML document without special training or a glossary of tags
(Winters, 2005, p. 66). Because XML is
open source, a wide range of users are able to customize the markup language to
suit their own needs while still being able to share their resources. Finally, using XML not only allows libraries
to share records and other types of resources, but also helps ensure that these
electronic records are tagged and labeled consistently, ultimately making them
more accessible for retrieval.
MARC (Machine Readable Catalog)
standards, a suite of data element sets that provides the mechanism by which
computers exchange, use, and interpret bibliographic data (Radebaugh, 2007, p.
15) are commonly used in many major libraries.
It is the foundation for most library catalogs that are used today and
was introduced by the Library of Congress in the 1960s. In an effort to increase the sharing and
exchange of bibliographic data, the Library of Congress' Network Development
and MARC Standards Office has and is
continuing to develop a framework for working with MARC data in a XML
environment. This framework is intended
to be flexible and extensible to allow users to work with MARC data in ways
specific to their needs. The framework itself includes many components such as
schemas, style sheets, and software tools (MARC 21 XML Schema, 2009). MARCXML, or using MARC in XML syntax, is
advantageous to libraries and information retrieval as a whole because in the
past, only other libraries that used MARC were able to share data in these
catalogs; they will continue to do so with MARCXML, but also be able to share
it outside of that particular setting as well.
Because of the integration of XML, this is now be a fairly easy
transition because, The MARCXML schema supports all MARC-encoded data
regardless of the format (Radebaugh, 2007, p. 15). Finally, libraries are now able to take
advantage of the many tools that XML has to offer without abandoning MARCs
advantages by adopting the hybrid MARCXML.
As we have seen from the above sections,
one of the major goals in information access and retrieval is, the creation
of data that is sharable, transformable among different systems, and can be
remixed in part, or its entirety, in innovative ways (Walker, 2007, p
28). Aggregation,
which is to gather and reassemble separate sets of data, is one form of this
remixing and can be observed at many levels in the library setting. Libraries aggregate books and electronic
resources to provide the most and best possible resources for their
patrons. Many libraries offer
aggregation through metasearch capabilities so that a user can quickly enter a
single query and receive information that is drawn from a very broad and
diverse set of sources (Walker, 2007, p. 28).
Aggregation is closely tied to XML because as more information resources
are being converted into XML, the cost and time of aggregating these resources
is dramatically reduced. Once the data
is marked up with XML tags, it can be sliced, diced, and reconstructed into the
features and presentation formats that are compelling to users (Walker, 2007,
p. 28).
Xrefer is a classic example of
aggregated information sources that many libraries use today. This
online platform provides both Googles swift, electronic convenience and an
accuracy and focus that the search engines web-scouring mechanisms cannot
deliver (Guz,
2007). With Xrefer, libraries also have the ability
to download MARC records, which allows terms entered into Xrefers search
engine to find resources within the librarys own collection. Xrefer uses XML compatible formats such as Xreferplus
and SFX (Special Effects), the most widely-known OpenURL link server in the
library and publishing community (Desmarais, 2004), to aggregate the many differing information resources
into compatible formats.
Implications
for the Field of Library and Information Science
Because of its foundation in the open source
movement and its wide acceptance and broad uses, XML is certainly not limited
to information retrieval or even to the overarching field of library and
information science. Nevertheless,
despite being a relatively new technology, it has already made a significant
impact on the field and profession through its universality, interoperability, and strong economic
impact. For these reasons and many more,
XML has become one of the most important and widely used paradigms in
distributed computing (Adler, Cochrane, Morar, & Spector, 2006, p. 209).
First and foremost, universality is an
overall goal of the open source movement as well as in library and information
science. To highlight just a few
examples, information professionals strive to provide multilingual access to
all programs and applications, specifically electronic, and to allow
information systems to share resources. XML
allows these opportunities with its inherent universality. Any of the worlds languages can be used for
XML markup or content, due to its incorporation of Unicode as a key component
(Adler, Cochrane, Morar, & Spector, 2006, p. 207). XML defies regional, cultural, and linguistic
barriers in that the user creates the meta-language that is best suited to the
user, regardless of the language, profession, or purpose. Interoperability can also fall under the overarching
objective of universalism; XML also aids in the library and information science
field achieving that goal. Different
users and communities have varying needs but also need to work together in
order to share resources; XML provides the tools to fulfill this opportunity
allowing more people to share information and materials. What is unique about XML is that it can be
uniquely customized to the needs of individual users, yet has the ability to
transcend the barriers of individual programs and systems allowing for optimal
interoperability. One of the most
compelling aspects of XMLs evolution was the intense and spirited collaboration
of communities from different disciplines (Adler, Cochrane, Morar, &
Spector, 2006, p. 215). XMLs inherent
interoperability has united people of different experiences, expectations,
backgrounds, and fields of study and has allowed them to work together to share
ideas, expertise, and resources.
Finally, the economic impact of XML has also
greatly affected the field of library and information science. As discussed earlier, because it is an open
source technology, differing communities have all worked together to achieve
the common goal of creating and using XML.
In a strictly for-profit market, only large companies and firms would
have been able to partake, thus eliminating the purpose of XML to unite systems
of all purposes and sizes for the sake of sharing resources. By working together as a community rather
than competitively, groups are able to drastically reduce costs and not be
hesitant to implement these new technologies that benefit the individual, the
community, and the field as a whole. As
more systems incorporate XML in the future, Distance, time, language and
communication barriers will be vastly reduced (Adler, Cochrane, Morar, &
Spector, 2006, p. 219) in addition to reducing costs.
Conclusion
Although there have been a lot of positive and
productive articles written and studies completed relating the importance of
the XML to information retrieval in the field of library and information
science, there are also some criticisms and the issue calls for additional
research. Many professionals remain
skeptical of XMLs impact on the Web, the information technology field, and the
library and information science community because this is still a relatively
new technology. Despite its history and
evolution, others feel it is a facet of the Web 2.0 craze and simply adds more
layers of abstraction or is a program without any real meaning or solid business
model. Nicholas Petreley (2001) is not
alone when he states, XML is great as a standard way of saying, This next
thing is a widget. But XML doesnt
require that you describe what the widget does, how it works or that the widget
itself conforms to a standard (Petreley, 2001). This is a classic example of the criticism
that it is a fancy application, but has no solid meaning and will quickly
disappear. As XML continues to be used
in more settings and by more diverse groups of people, it will continue to be
tested to see if it endures as a legitimate technology. For those in the field of library and
information science, it is worthwhile to learn about and test these new
technologies and in order to study the impact they may have on our field,
regardless of their permanence. we must keep a firm grounding in the
technologies that drive digital library development, even though they are changing
fast. That means keeping current with
HTML, XML, and the overall Web site administration. (Huwe, 2004, p. 41) Regardless of if these technologies will
disappear in a few years or not, they are shaping our current understanding of
the field as a whole so they are nevertheless worthwhile to continue to learn
from and examine.
Works
Cited
Adams, K. C. (2001).
The Web as a database: new extraction technologies and content
management.
Information Today, Inc. (27-32).
Adler, S., Cochrane,
R., Morar, J. F., & Spector, A. (2006). Technical context and cultural
consequences
of XML. IBM Systems Journal, 45 (2),
207-223.
Chowdhury, G. G.
(2004). Introduction to Modern
Information Retrieval. London: Facet
Publishing.
Clark, J. A. (2008).
AJAX (asynchronous JavaScript and XML): this isnt the web Im used to.
Online, 30 (6), 31-34.
Desmarais, N. (2004).
XML in action. Against the Grain, 15
(3), 102-103.
Fichter, D. & Cervone,
F. (2000). Documents, data, information retrieval, & XML. Online,
24(6),
30-36.
Guz, S. S. (2007). The
promise of Xrefer. Library Journal,
n.d.
Huwe, T. K. (2004).
Keep those Web skills current. Computers in Libraries, 24 (8), 40-42.
Kasdorf, B. (2008). The
XML advantage. Net Connect, 12-15.
Kay, R. (2005). Markup
languages. Computerworld, 30.
Lancaster, F. W.
(1991). Indexing and abstracting in theory and practice. Champaign, Ill:
University
of Illinois, Graduate School of Library and Information Science.
MARC
21 XML Schema. (2009). Retrieved April 21, 2009, from
http://www.loc.gov/standards/marcxml.
Open source. (n.d.) In
Online Dictionary for Library and Information Science. Retrieved April
29,
2009, from http://lu.com/odlis/search.cfm.
Open Source Initiative.
(2009). Open Source Initiative.
Retrieved October 27, 2009
from http://opensource.org/docs/osd
Petreley, N. (2001).
Ontology and the web. Computerworld,
35 (41).
Radebaugh, J. (2007).
MARC 21 / MARCXML. Computers in Libraries,
27 (4), 15.
Rogers, B. (2004).
Solving Web Search Dilemmas. EContent,
19.
Saunders, L. (1998).
Not your mothers HTML: moving on to XML. Computers
in Libraries, 18
(10),
45.
Taylor, A. G. &
Joudrey, D. N. (2009). The Organization
of Information. Wesport: Libraries
Unlimited.
Walker, J. (2007).
Sliced, diced, and reconstructed. Netconnect,
28.
Winters, R. (2005). XML
marks the future for electronic records. The
Information Management
Journal,
39 (6), 64-68.
Yager, T. (2000).
Search no further! Content retrieval is gaining XML boost. Infoworld, 21 (46).