Mining gazetteer data from digital library collections

Presentation at the NKOS Workshop, JCDL 2002, on Digital gazetteers: integration into distributed digital library services, July 18, 2002

David Smith, Perseus Project, Tufts University, dasmith@perseus.tufts.edu

Gazetteers are immensely important for digital libraries, but they take substantial effort to build. With the availability of sizable geoparsed data collections, digital libraries can begin to aid the builders of gazetteers. In particular, we can mine digital collections for alternate names and for associated temporal information. The Perseus Digital Library collections, ranging from the ancient Mediterranean to nineteenth-century North America and modern scientific documents, provide a testbed for these techniques.

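Many texts, particularly reference works, explicitly list alternate names for geographic entities, at least one of which may match an entry in an existing gazetteer. An article on "Berytus", for example, contains the known name "Beirut", but also "Berotha" and "Berothai". The matching name anchors the article to a gazetteer entry, and the remaining names become candidate alternates for that entry.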
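The matching step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perseus implementation: the function name `propose_alternates`, the gazetteer layout (a dictionary from name to entry identifier), and the identifiers `place-001` and `place-002` are all hypothetical. The sketch proposes new names only when the article's names point to exactly one existing entry.

```python
def propose_alternates(article_names, gazetteer):
    """Anchor an article's names to a single gazetteer entry, if possible.

    article_names: names listed in one reference-work article.
    gazetteer: dict mapping a known name to an entry identifier
               (hypothetical layout, assumed for this sketch).
    Returns (entry_id, new_names); entry_id is None when the article
    matches no entry or matches more than one.
    """
    # Case-insensitive index of the names the gazetteer already knows.
    index = {name.casefold(): entry_id for name, entry_id in gazetteer.items()}

    # Entries that at least one of the article's names points to.
    matched_ids = {
        index[name.casefold()]
        for name in article_names
        if name.casefold() in index
    }
    if len(matched_ids) != 1:
        return None, []  # no anchor, or an ambiguous one

    # Every name the gazetteer does not know yet is a candidate alternate.
    new_names = [n for n in article_names if n.casefold() not in index]
    return matched_ids.pop(), new_names


# Hypothetical gazetteer and the names from the "Berytus" article.
gazetteer = {"Beirut": "place-001", "Athens": "place-002"}
entry_id, alternates = propose_alternates(
    ["Berytus", "Beirut", "Berotha", "Berothai"], gazetteer
)
```

Here `entry_id` is the identifier of the "Beirut" entry and `alternates` holds "Berytus", "Berotha", and "Berothai". Candidates produced this way would still need review before being added to a gazetteer.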

More interestingly, we can now use parallel texts in different languages to "project" geoparsing from one language onto another. This technique has already been demonstrated for part-of-speech tagging and named-entity classification. Most previous efforts at name projection have focused on ad hoc bilingual transliteration rules; methods based on parallel corpora, by contrast, can generalize to any language pair and augment gazetteers with multilingual names.
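A minimal sketch of projection over one aligned sentence pair follows. It assumes that the source sentence has already been geoparsed and that a word alignment between the two sentences is available; the alignment below is written by hand for illustration, whereas in practice it would come from a statistical word aligner. The function names and the example sentences are hypothetical and are not drawn from the Perseus collections.

```python
def project_place_tags(source_tags, alignment, target_length):
    """Copy place-name tags from source tokens to aligned target tokens.

    source_tags: list of bool, True where the source token is a place name.
    alignment: list of (source_index, target_index) pairs.
    target_length: number of tokens in the target sentence.
    """
    target_tags = [False] * target_length
    for src, tgt in alignment:
        if source_tags[src]:
            target_tags[tgt] = True
    return target_tags


def harvest_names(tokens, tags):
    """Join runs of consecutively tagged tokens into candidate names."""
    names, current = [], []
    for token, tagged in zip(tokens, tags):
        if tagged:
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Geoparsed English sentence: only "Athens" is tagged as a place.
english = ["He", "sailed", "to", "Athens"]
english_tags = [False, False, False, True]

# Parallel French sentence with a hand-written word alignment.
french = ["Il", "a", "navigué", "vers", "Athènes"]
alignment = [(0, 0), (1, 2), (2, 3), (3, 4)]

french_tags = project_place_tags(english_tags, alignment, len(french))
french_names = harvest_names(french, french_tags)
```

The French name harvested here, "Athènes", could then be proposed as a multilingual alternate for the gazetteer entry that the English "Athens" resolved to. Real alignments are noisy, so projected names would normally be filtered, for example by how often they recur across the corpus.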

Finally, digital collections can provide other attributes for geographic entities. Another Perseus presentation at JCDL, "Detecting Events with Date and Place Information in Unstructured Text", addresses the mining of texts for the important events associated with particular places. Text can also be mined for historical populations and businesses.