Application of terminology and classification tools for digital collection development and network-based search
ACM DL'98 Workshop
Pittsburgh, June 27th
Workshop Goals
Time and Place for the Workshop
How to Apply to Participate in the Workshop
Schedule for applications and structuring the workshop
Workshop Fee
Strawman Discussion Points

Workshop Goals:

Many current developments in collection building, distributed search and retrieval, and special-topic clearinghouses are struggling with the basic question of the need for terminology and classification tools to support information description and retrieval. Experience has shown, once again, that controlled sets of terms are required and that the simple domain list approach has limited value. Thesauri, classification systems, and authority files are well developed information description systems that need to be adapted to the digital library environment.

This workshop will focus on networked implementation of thesauri, classification tools, and authority files, but it is limited neither to the technical challenges of doing this nor to actual standards development. We hope to have participation from those whose applications would benefit from networked terminology/classification/authority systems so that we can identify the needs, understand the advantages, and lay the groundwork for subsequent standards and technology developments. We look for the workshop to result in the creation of a working group that would develop standards (e.g., an XML definition for a thesaurus as a starting point, or a general scenario for searching and navigating a networked thesaurus or classification system) and, if appropriate, in ongoing ACM DL workshops on this topic.

We invite participants who

  1. have applications using or planning to use thesauri/classifications/authority file systems for digital libraries;
  2. are engaged in current terminology development activities for networked collections;
  3. have conducted studies applying terminology and classification tools to networked retrieval;
  4. have developed approaches that support the shared development, maintenance, and use of terminology and classification tools; and
  5. are developing standards and structures to support the shared use and development of terminology and classification tools.
In all cases, the focus must be on current and planned developments to use terminology and classification tools in distributed information environments and must emphasize the problems and solutions that will benefit from workshop discussion.

Individuals who are known to have active projects in this area will be invited to participate. General announcements will also be made through appropriate email distribution lists. The organizers will select full participants on the basis of their fit with the workshop topic and their potential contribution to progress in this area.

The workshop will run for a full day, from 9:00am to 4:00pm. The morning will be devoted to general discussion of issues and development paths; there will be breakout sessions in the afternoon to develop the major issues in more detail, followed by a summary session and identification of the next steps toward our goals. Participation will be limited to provide a good workshop environment for fruitful discussions.

Discussion leaders among the participants will be assigned to develop crosscutting discussion points. The goal by the end of the meeting is for communities of interest to form around important issues for further development.


Linda L. Hill is a Research Specialist with the Alexandria Digital Library Project at the University of California, Santa Barbara. She has worked extensively with thesaurus and metadata development and digital library projects. Linda says "My personal goal in relation to this workshop is to get thesaurus principles and practices enabled within digital libraries so that existing thesauri can be more known and accessible, so that developing projects will recognize the value of the thesaurus approach and will develop and use thesauri according to established standards. Also, to evaluate the usefulness of thesauri and classification systems for networked information discovery."

Gail Hodge, Information International Associates (IIa), has been involved with production systems for abstracting/indexing services for 20 years. She's currently working as a consultant to USGS on a biodiversity vocabulary for the National Biological Information Infrastructure (NBII). She previously held positions with the NASA Center for AeroSpace Information and with Biosis. Gail says "My main goal is to promote this discussion at a high level so that what we do within the U.S. Geological Survey, Biological Resources Division (BRD) context is based on where we are within the technical community. We know that we can't wait until all the problems are solved, but we want to be in synch as much as possible. We need an understanding of the issues involved in using distributed thesauri: How do we handle rights management, authentication, etc. and charging for commercial databases? What will users want to do with these distributed thesauri? This affects navigation, searching, "transfer," etc. What do we need to know about other thesauri and vocabularies in order to use them in a distributed fashion? Is a registry both of thesaurus elements and of particular thesauri necessary to this effort? Where does this effort intersect with the efforts of others: RDF, XML, metadata schema, metadata registries, Z39.19, Z39.50, search engine vendors, etc.? I would like to come out of this with at least a start toward a way to deal with the architecture so that we can move forward and integrate this effort with other metadata and Internet efforts."

Ron Davies is a consultant with Bibliomatics, Inc., a Canadian information systems consulting firm. He has designed and developed thesaurus management systems for the Organisation for Economic Cooperation and Development (OECD) and the International Development Research Centre (IDRC), and led a project to create a subject classification system for United Nations information available over the Internet. He is currently developing Java-based software for distributed thesaurus management and use. Ron says "My personal goal has always been to get some agreement on an interoperable way to connect to thesauri over a network, so that I could access a thesaurus at one site, and use it to index or search resources at another site. This would mean development of standards in terms of the semantics of thesaural relations as well as the syntax of consulting a thesaurus. This effort could build on other standards (e.g. Z39.50, XML) but there's a lot of specific work that still has to be done."

Time and Place
The workshop will be held in conjunction with the ACM Digital Libraries '98 Conference, Marriott City Center, Pittsburgh, PA, USA, June 23-26, 1998, details of which can be found at the URL above. It will be held on the Saturday following the Conference, June 27th, from 9:00am to 4:00pm. Lunch will be provided.


How to Apply to Participate in the Workshop
Please send a one-paragraph biosketch and a description of your application, study, or development activities related to this workshop to all three of the workshop organizers, along with your name and coordinates (i.e., position, contact information). Please include URLs pointing to relevant projects or publications. The organizers will review all of the applications and statements received and select participants based on potential contribution to the workshop, up to a workable size for genuine discussion. The applicants' statements for workshop participation will be posted on the website.

Schedule for applications and structuring the workshop:

June 5 (Friday)

June 13 (Saturday)

Workshop Fee
The workshop fee will be $50. The fee will be collected at the workshop, and can be paid by check made out to "ACM Digital Libraries." Alternatively, the fee can be paid in cash (US$), but we prefer to get a check.

Strawman Discussion Points

The convenors of the workshop have developed four "strawman" topics in order to provide a framework for discussion at the workshop. We plan to hold discussions on the first two topics in the morning and then give participants a choice of discussing the last two in the afternoon or continuing the morning discussions.

Please note that "thesaurus" is a term often used for convenience in the descriptions of the topics, but we do not intend that any of the discussions should be confined to traditional thesauri. We believe that the problems and solutions apply to a wide range of terminology tools, including classification schemes, taxonomies, and other structured authority files.

The topics are:

1. The Data Model
What kind of data model is needed to support the interactive use of thesauri and other terminologies in online information services such as digital libraries? What data elements and/or relations are needed to convey the content of these resources? What data elements and relations are important for multilingual thesauri? How do we represent system classifications and notations so that client software can understand what a notation means? Does XML hold promise for the representation of thesaurus structures? At the end of this breakout session, we hope to have concrete proposals for developing a generalized thesaurus model.
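As a starting point for discussion only, the questions above can be made concrete with a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Term` class, its field names, the `term_to_xml` function, the XML element names, and the sample notation are hypothetical and do not come from any existing standard or from the organizers' own work.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """One thesaurus entry. Every field name here is an illustrative assumption."""
    term_id: str
    labels: Dict[str, str]                  # language code -> preferred label
    notation: Optional[str] = None          # classification notation, if any
    broader: List[str] = field(default_factory=list)   # ids of broader terms (BT)
    narrower: List[str] = field(default_factory=list)  # ids of narrower terms (NT)
    related: List[str] = field(default_factory=list)   # ids of related terms (RT)


def term_to_xml(term: Term) -> str:
    """Serialize a term to a small, made-up XML vocabulary."""
    elem = ET.Element("term", {"id": term.term_id})
    if term.notation is not None:
        elem.set("notation", term.notation)
    # One <label> per language covers the multilingual case.
    for lang, label in sorted(term.labels.items()):
        ET.SubElement(elem, "label", {"lang": lang}).text = label
    # Relations point at other terms by id rather than by label.
    for tag, refs in (("BT", term.broader), ("NT", term.narrower), ("RT", term.related)):
        for ref in refs:
            ET.SubElement(elem, tag, {"ref": ref})
    return ET.tostring(elem, encoding="unicode")


# Hypothetical sample entry; the ids and the notation are invented.
chemistry = Term(
    term_id="t2",
    labels={"en": "chemistry", "fr": "chimie"},
    notation="540",
    broader=["t1"],
    narrower=["t3"],
)
print(term_to_xml(chemistry))
```

Referring to related terms by identifier rather than by label is one possible design choice: it keeps relations stable across languages, at the cost of requiring a second lookup to display them.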

2. The Functional Model
How do users want to use thesauri and other authority files in searching and resource description? What kind of access is important in exploring or "navigating" through the thesaurus? For example, can a user ask for a single term (e.g., "chemistry"), a subset of terms (e.g., all terms with "chemical" in them), or a range of terms (e.g., "chemi*"), or all three? How do users indicate that they want to see an alphabetical view, a hierarchical list, or a classified (systematic) list? Are there other ways of asking for or looking at thesaurus information that are useful? How do you indicate how much of the list you want to see at one time? At the end of this breakout session, we hope to have the beginning of a functional model of the features that are most important to users consulting terminology services over a network.
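The kinds of requests listed above (a single term, terms containing a string, a truncated pattern, an alphabetical view shown a page at a time, and a hierarchical view) can be sketched as a handful of Python functions. This is a toy illustration under stated assumptions, not a proposed interface: the function names, the `offset`/`limit` paging parameters, and the five-term vocabulary are all invented here.

```python
from fnmatch import fnmatchcase

# Toy vocabulary: term -> broader term (None marks a top term). Invented data.
BROADER = {
    "science": None,
    "chemistry": "science",
    "chemical engineering": "science",
    "organic chemistry": "chemistry",
    "physics": "science",
}


def find_exact(term):
    """Single-term request, e.g. "chemistry"."""
    return [t for t in BROADER if t == term]


def find_containing(text):
    """Subset request: every term containing the given string."""
    return sorted(t for t in BROADER if text in t)


def find_pattern(pattern):
    """Truncation request, e.g. "chemi*" (shell-style wildcard on the whole term)."""
    return sorted(t for t in BROADER if fnmatchcase(t, pattern))


def alphabetical_view(offset=0, limit=10):
    """Alphabetical list; offset and limit say how much to show at one time."""
    return sorted(BROADER)[offset:offset + limit]


def hierarchical_view(root=None, depth=0):
    """Indented hierarchy below root (the whole vocabulary when root is None)."""
    lines = []
    for term in sorted(t for t, parent in BROADER.items() if parent == root):
        lines.append("  " * depth + term)
        lines.extend(hierarchical_view(term, depth + 1))
    return lines


print(find_pattern("chemi*"))
print("\n".join(hierarchical_view()))
```

Note that in this sketch "chemi*" matches only terms that begin with "chemi", so "organic chemistry" is found by `find_containing("chemi")` but not by `find_pattern("chemi*")`. Whether users expect truncation to apply to the whole term or to any word within it is exactly the kind of question a functional model would need to settle.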

3. Thesaurus-level Metadata and Thesaurus Registries
What thesaurus-level metadata is needed to represent the scope, structure, size, ownership, access constraints, etc. of a thesaurus so that potential users (for all applications) will know what is available and how to access and use it? ("Metadata" is intended to mean not the actual attributes of individual terminology tools but the "collection-level metadata" that would describe the terminology tool as a whole.) What is the role of thesaurus-level metadata in enabling the interoperability of online-accessible thesauri? What role could thesaurus registries play in "advertising" the availability of thesauri and facilitating access and use? What tasks are involved in maintaining a registry? What kind of organizations would best fulfill the registry function?
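To make the idea of thesaurus-level metadata and a registry more tangible, here is a minimal Python sketch. It assumes one record per terminology tool, with fields taken from the list in the paragraph above (scope, structure, size, ownership, access constraints) plus a location. The class names, field names, and both sample records are hypothetical and are offered only as a basis for discussion.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ThesaurusRecord:
    """Collection-level description of one terminology tool (assumed fields)."""
    name: str
    scope: str        # subject coverage, free text
    structure: str    # e.g. "hierarchical thesaurus", "classification scheme"
    term_count: int   # size
    owner: str
    access: str       # e.g. "public", "licensed"
    location: str     # where the tool can be consulted


class Registry:
    """In-memory registry that "advertises" the tools registered with it."""

    def __init__(self) -> None:
        self._records: Dict[str, ThesaurusRecord] = {}

    def register(self, record: ThesaurusRecord) -> None:
        # Re-registering under the same name replaces the earlier record.
        self._records[record.name] = record

    def search(self, keyword: str, access: Optional[str] = None) -> List[ThesaurusRecord]:
        """Find tools whose scope mentions keyword, optionally filtered by access."""
        keyword = keyword.lower()
        return [
            r for r in sorted(self._records.values(), key=lambda r: r.name)
            if keyword in r.scope.lower() and (access is None or r.access == access)
        ]


# Invented sample records; none of these tools, owners, or locations is real.
registry = Registry()
registry.register(ThesaurusRecord(
    name="Example Biology Thesaurus", scope="biodiversity and ecology",
    structure="hierarchical thesaurus", term_count=12000,
    owner="Example Agency", access="public",
    location="thesaurus.example.org/bio"))
registry.register(ThesaurusRecord(
    name="Example Chemistry Classification", scope="chemistry",
    structure="classification scheme", term_count=3500,
    owner="Example Publisher", access="licensed",
    location="thesaurus.example.org/chem"))

for record in registry.search("ecology"):
    print(record.name, "-", record.access, "-", record.location)
```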

4. The Business/Intellectual Property Model
What types of collaborative agreements and partnerships are necessary between thesaurus owners? What kinds of relationships are possible between owners of vocabularies published over a public network and users? What are the issues involved? How can users draw on vocabularies from a variety of different organizations -- government, commercial, academic, not-for-profit, and international? What are the issues involved with copyright restrictions, payment, or limited access to a thesaurus? What technologies are necessary to address some of these issues? What impacts will the expanded use of terminology tools beyond their initial application have on the structure, design, and maintenance of the tools?