Australex 98 Papers - Baker/Manning

Australasian Association for Lexicography
1998

A dictionary database template for Australian Languages

Brett Baker and Christopher Manning

Department of Linguistics, University of Sydney
brettb@sultry.arts.usyd.edu.au cmanning@sultry.arts.usyd.edu.au

Abstract:

Dictionary-making is an increasingly important avenue for cultural preservation and maintenance for Aboriginal people. It is also one of the main jobs performed by linguists working in Aboriginal communities. However, current tools for making dictionaries are either not specifically designed for the purpose (Word, Nisus), with the result that dictionaries written in them are difficult to maintain, to keep consistent, and to manipulate automatically, or are too complex for many people to use (Shoebox), and are thereby wasted as potential resources. Moreover, neither of these sets of tools provides a suitable user interface for people who simply want to browse or find words in a dictionary. We set out to design a dictionary 'template', written in software that was easy (and fun!) for people to use, and that maintained a consistent relationship among the information in the dictionary.

1 Introduction

Aboriginal people commonly express the view that dictionaries are one of the most important, if not the most important, repositories of information about their language and culture. Consequently, linguists are often called upon to help in the construction of these. This task can involve a number of people with different ranges of experience in computers, dictionaries, and language; and the question then becomes: what software tools should be used to write the dictionary?

First-time users of computers may feel that all but word-processing applications are beyond them. And often more experienced users continue to use a word processor or text editor, because it is perceived as a flexible tool with which it is easy to get the job done. However, the problem with using word-processing applications for dictionary construction is that they were not designed for the task, resulting in inconsistencies in the relationships among pieces of information that are very difficult to iron out later on. While keeping a dictionary in a word processor seems an easy way to get started, down the track the result is always a great deal of manual effort spent making the dictionary consistent, and the need for complex reformatting to prepare it for publication. Moreover, there are only quite limited facilities for browsing and displaying information in different ways.

For the more experienced, there are tools made especially for structured dictionary data, such as SGML editors and the tools distributed by SIL, such as Shoebox and MacLex. SGML stands for Standard Generalized Markup Language, a system for marking up texts with structure tags, of which HTML is a very simple instance. Traditionally, SGML editors have been very expensive, and not intended for casual users. The Summer Institute of Linguistics (SIL: http://www.sil.org/) distributes several tools for dictionary construction, notably Shoebox, which are designed for maintaining dictionaries as plain text files, but in a structured format, known as FOSF (Field-Ordered Standard Format, a simple markup language where each line begins with a backslash code). Unfortunately, our experience as prospective users and teachers of Shoebox is that getting started with Shoebox is just much too difficult for the ordinary working linguist. When Shoebox is all set up right, it can be great to use, but the average linguist just never gets there. At any rate, Shoebox is designed for a trained linguist, and does not provide facilities for a casual user. There is a need for something that is easier to use.

We believe the following attributes to be the minimal ones for a reasonable electronic dictionary:

(1) Consistency of formatting. While dictionaries may use different styles of entries, as much consistency as possible in laying out fields (in terms of order, markup, and type styles) aids usability by human and computer.
(2) Ease of data entry
(3) Ease of search, and search-and-replace
(4) Ease of moving the dictionary data to and from other applications or platforms. Interoperability is the key, and is what is holding a lot of people back from trying new tools.
(5) Ease of output in a printed form which has stylistic formatting (i.e. the output looks like a 'real' dictionary)

Added advantages, especially if the dictionary is used on a computer, as well as on paper, would be these:

(6) Dynamic layout: point and click interfaces
(7) Multimedia capabilities (pictures, sound recordings, and even video)

Point (2) is an important consideration merely on the basis of time and expense. Using a word-processor to create dictionaries will involve not only the entering of the actual lexical information (words, examples, definitions, and so on), but also the entry of any markup. Either the dictionary content is completely uncontrolled, or else the user is responsible for the consistent entry of graphical devices like punctuation, markup tags, ordering of subparts of the entry, and the consistency and correctness of links within and across entries.

Point (3) needs little explanation. One striking example is if (as is not uncommon) the orthography of the language changes while the dictionary is being created, entailing large-scale changes to entries. Because there is perhaps no dictionary software that can do everything a user wants it to do (more on this below), point (4) is quite important. Point (5) may not matter to many linguists, but is certainly an issue in many Aboriginal communities - where a printed, bound dictionary is an important symbol of linguistic and cultural prestige (a point made forcefully by Austin (1997)).

Points (6) and (7) are also issues which are very relevant to the end users, though perhaps less so to linguists. It should not be forgotten that many, perhaps a majority, of the first-language speakers of Australian languages have little or no facility in the written word. Making dictionary performance less reliant on the literacy skills of the user is therefore an important challenge for linguists and lexicographers. At any rate, these concerns aside, restricting oneself purely to textual content is unnecessarily tying oneself to the limitations of past technology.

The importance of (1) may not appear to be so obvious immediately. But in practice it is vital. Not only does consistent layout help a human being, but it is particularly vital for being able to take advantage of the labour saving, and transformational possibilities of data manipulation by computer. For instance, if the dictionary is in a consistent well-marked-up format, then one can do things like automatically convert it into HTML webpages, or automatically produce graphical interfaces for browsing the data. However, such things cannot be done (without much further manual reformatting of the data) if the dictionary is just a free format word processor file. This is actually one of the disadvantages of the SIL FOSF format, since, while there is a well-defined structure for the backslash codes, users tend to see the material after that on the line as just a freeform text entry field, and to (mis)use it accordingly. Trying to maintain consistency of format when using a text editor to maintain a dictionary becomes very difficult. To make sure that every entry has exactly the right fields in the right order is either a very time consuming manual data entry and checking job, or suddenly requires the user to acquire (arcane) new skills in writing macros and using regular expressions.
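To make the consistency check described above concrete, here is a minimal sketch in Python. It is an illustration only, under stated assumptions: the field codes (hw, ca, gl) are borrowed from the Mail Merge example in section 2.4, the required order is our own choice for the example, and the function names are hypothetical. It is not part of Shoebox, FMP, or the template.

```python
import re

# Assumed field order for this illustration; real projects define their own.
EXPECTED_ORDER = ["hw", "ca", "gl"]

# A backslash code is a backslash followed by letters/digits, e.g. \hw
FIELD_RE = re.compile(r"\\(\w+)")


def field_codes(entry: str) -> list[str]:
    """Return the backslash codes of one entry, in the order they appear."""
    return FIELD_RE.findall(entry)


def problems(entry: str) -> list[str]:
    """List the ways an entry departs from EXPECTED_ORDER (empty if none)."""
    codes = field_codes(entry)
    found = [f"missing \\{code}" for code in EXPECTED_ORDER if code not in codes]
    present = [c for c in codes if c in EXPECTED_ORDER]
    in_order = [c for c in EXPECTED_ORDER if c in present]
    if present != in_order:
        found.append("fields out of order or repeated")
    return found
```

Running `problems` over every entry of a text-file dictionary gives a list of entries to fix by hand. A database sidesteps the problem, because the fields and their order are fixed by the file definition rather than by the typist.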

1.1 Is there a better way?

We believe that there are better options. The main problem of using word-processors (e.g. Microsoft Word) or text editors (e.g. Nisus, Qued/M) for creating and maintaining dictionaries is that dictionaries are in effect databases, but such applications do not support this kind of structure. A spreadsheet, such as Excel, can be seen as providing a flat file database structure, but is not really suitable for representing the complex relationships of dictionary entries. There are then a number of database applications, such as dBase or Access, but these are very difficult for the inexperienced user to come to grips with, and usually rather more aimed at manipulating numeric data than textual data.

It seems that people want the flexibility and ease-of-use of a word processor, and this should be an important consideration. But this should not come at the expense of a consistent structure to entries. The aim is for people to gain the benefits of a more structured dictionary representation, while making things easy enough to use that people will actually choose to use it.

Our goals in setting up a 'dictionary database template project' were therefore to design a dictionary on a computer that:

was maximally easy to use
kept data in a consistent format
had powerful data storage and linking possibilities
had powerful data manipulation possibilities
could hook up with other software

The users we had in mind were:

Language researchers of all kinds
Students and teachers in Aboriginal communities
People who wanted to produce a printed dictionary someday

These users need at least the following capabilities:

Easy data entry
Easy changing of the dictionary layout
Information kept consistently related for you

1.2 The database design

Our goal was to design a kind of dictionary database 'template': a specifically designed file or set of files which could be used by a range of people involved in making and using dictionaries of Australian languages. Ideally, it should have all of the qualities identified above in points (1)-(7). For these reasons, we targeted FileMakerPro (v. 3-4) as a potential tool for our project. We believe that it covers a sufficient number of these points, and that current alternatives fare much worse.

2 Features of FMP

FileMakerPro (FMP) is the easiest to use database on the planet. FMP was designed for Mac users, and therefore has the familiar user-friendly features: it's iconic, has lots of buttons, and uses menus and dialogue boxes to guide users through actions. In particular, it's extremely easy to add/change fields and to design new, colourful and dynamic layouts (presentations of data). FMP provides the basic facilities of a relational database (even though it is rather limited in this regard). It runs on both Macs and PCs. This one criterion rules out most database software (and other things like Hypercard). It supports audio, pictures, and even video - clearly an idea whose time has come. No programming is needed for using the simple browser or for basic data-entry. By putting buttons and menus in layouts, you don't even have to know how to use FMP. This makes it perfect for users who have no interest or ability in designing or using complicated software. Things like searches and filtering are just a heck of a lot easier than in other tools such as Shoebox.

2.1 The power of a relational database

A relational database is an exceedingly powerful tool for creating dictionaries (cf. Nathan and Austin 1992), having possibilities that simply aren't provided for in most of the currently available software tools used for dictionaries. Using FMP, one can link information between related files, or even within the same file, using 'keys', allowing many-to-one relationships without fuss. One can, for example, have relationships of:

senses to words
examples to senses
words to other words

The keys are not the actual content, so changes in transcription, etc., don't ruin the linking between files. Rather, relationships are defined (in a dialogue window) over files and fields. The powerful and easy-to-use layout capabilities of FMP allow users to see on-screen information from multiple files simultaneously (through 'portals'). Thereby, users have the considerable cognitive advantage of being able to manipulate relationships between data at several levels simultaneously, without sacrificing the integrity of the data structure itself. Layout design has absolutely no effect on the actual structure of the database, a real problem when using a text processor. This encourages the user to develop a clear cognitive separation between the data and its layout - something that unfortunately tends not to happen when working in a word processor. Consistent updating is guaranteed by the structure: changes in visible, related fields in other files are changes to the actual field.
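The principle of linking through keys rather than through orthographic forms can be illustrated outside FMP. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: FMP defines relationships in a dialogue window, not in SQL, and the table names, column names, and the respelling 'nulx' are invented for the example.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create a tiny in-memory dictionary with a headword and one sense."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE headwords (id INTEGER PRIMARY KEY, form TEXT);
        CREATE TABLE senses (
            id INTEGER PRIMARY KEY,
            headword_id INTEGER REFERENCES headwords(id),  -- the key
            gloss TEXT
        );
        INSERT INTO headwords VALUES (1, 'nulh');
        INSERT INTO senses VALUES (1, 1, 'coolamon');
        """
    )
    return con


def senses_of(con: sqlite3.Connection, headword_id: int) -> list:
    """Return (form, gloss) pairs for one headword, joined through the key."""
    rows = con.execute(
        "SELECT h.form, s.gloss FROM senses s "
        "JOIN headwords h ON h.id = s.headword_id "
        "WHERE h.id = ? ORDER BY s.id",
        (headword_id,),
    )
    return rows.fetchall()


con = build_demo()
# A (made-up) orthography change touches only the 'form' field;
# the sense still points at headword 1, so the link is unaffected.
con.execute("UPDATE headwords SET form = 'nulx' WHERE id = 1")
```

After the update, `senses_of(con, 1)` still finds the sense, now shown with the new spelling. Had the sense been linked by the string 'nulh', the respelling would have orphaned it.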

2.2 Problems of relational databases

Traditionally, dictionaries have allowed many different kinds of entries for different kinds of words. A problem with using relational databases for dictionaries is that it's hard (though not theoretically impossible) to capture this flexibility in a relational database. Any attempt to do so involves the use of complex structures and oodles of tables. We adopt a pragmatic compromise, with some, but not unlimited, flexibility. Importantly, users can customize it easily.

2.3 Other features

A big advantage of FMP is the power of its layout design capability. This means that even inexperienced users can create simple layouts for basic data entry in a couple of minutes, simply by dragging and dropping field icons around a page in 'layout mode'.

FMP supports a range of levels of secure access, similar to Hypercard. Manipulating this capability, along with the ability for a single version of a FMP database to be simultaneously read and edited by a number of users over a network, can lead to a partial elimination of the problem of multiple versions and authors that plagues dictionary projects.

2.4 Export/Publishing

It is fairly easy to do import/export jobs in FMP. In general, the clearly defined structure of a FMP database means that it is easy to export data, and to reformat it automatically in different ways. In particular, one handy ability is export of lists satisfying certain criteria. Say one wanted to create a separate dictionary file in text format of all the words in one particular dialect of the language, or all bird names. As long as one has identified entries by dialect or semantic domain, one simply does a search on all the entries that have one of a certain set of values in the appropriate field. The list created can be exported directly. One can also choose which particular fields are wanted (say, just 'entry' and 'gloss' for a simple wordlist) in the exported version by clicking on the list of fields presented in the export dialogue window. For linguists, the advantages of this are obvious: one can search by part-of-speech, and export lists of the (classes of) verbs, lists of nouns (in a particular semantic domain or noun class), and so on.
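For readers who think more easily in code, the find-then-export procedure just described amounts to a filter followed by a projection onto the wanted fields. The following Python sketch is only an analogy to what FMP does through its Find and Export dialogues. The 'domain' field and its values are invented for the example, and the function name is hypothetical.

```python
import csv
import io

# Mini dictionary; the 'domain' values are invented for illustration.
ENTRIES = [
    {"entry": "nulh", "category": "N", "gloss": "coolamon", "domain": "artefact"},
    {"entry": "ngarriny", "category": "N", "gloss": "hand", "domain": "body part"},
    {"entry": "marji", "category": "N", "gloss": "hand", "domain": "body part"},
]


def export_wordlist(entries, field, wanted_values, columns):
    """Return tab-delimited text of the chosen columns for matching entries."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for e in entries:
        # Keep the entry only if the search field holds one of the wanted values.
        if e.get(field) in wanted_values:
            writer.writerow([e[c] for c in columns])
    return out.getvalue()
```

For instance, `export_wordlist(ENTRIES, "domain", {"body part"}, ["entry", "gloss"])` yields a two-column wordlist of the body-part terms only.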

One can easily make formatted 'Merge' files by exporting as 'Merge' format text. Merge format files are similar to tab-delimited text files, but the content of the columns is labeled. Originally intended for mail merge applications (i.e., generating junk mail), they can be misused for various other purposes such as getting Microsoft Word to create formatted dictionaries. This is easy/nice for Word users, and quite powerful if one uses the Mail Merge language (which has conditionals, etc.). One can also use Mail Merge to create FOSF files.

For example, if we simply export the headword, category, and gloss fields of an example mini dictionary in Merge format, we save a text file that looks like (1), with a name 'dict.data':

(1) file: 'dict.data'
Headword, Category, Gloss
nulh, N, coolamon
ngarriny, N, hand
marji, N, hand

Using Word, we open a new file, saving it as 'dict.templ', and then select the Mail Merge Helper from the menu. This in turn opens a dialogue box asking us what our 'data file' for Merging will be. Selecting the Merge file 'dict.data' which we have already created in (1) automatically places the following instruction at the top of our file:

«DATA dict.data»

Using the iconic commands in Mail Merge Helper we can now select the fields we want in the order we want them in the published version. We select the three fields in order, giving the following result:

«DATA dict.data»«headword»«category»«gloss»

This is effectively a template that Word uses to organise the data in our data file. Rather than simply listing fields, one can also use various conditionals, expressions, and tests to assemble more sophisticated layouts. A neat facility which Word provides is the ability to place standard formatting for each entry into the body of the Merge template itself. We can, for instance, make all of the headwords in our dictionary in bold font, 14 point, by giving the first letter of that field in the template those characteristics. Additionally (or alternatively), we can give each field its standard markup tag, as follows:

\hw «headword» \ca «category» \gl «gloss»

When we perform a Mail Merge on this template file, Word opens a new file with the characteristics we specified:

\hw nulh \ca N \gl coolamon
\hw ngarriny \ca N \gl hand
\hw marji \ca N \gl hand
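The same transformation can be written as a short program, which may help show what the merge template is doing: each labeled column is paired with a markup code and the pairs are written out in a fixed order. This Python sketch is not how Word performs the merge. The mapping from column labels to codes simply mirrors the template above, and the function name is hypothetical.

```python
import csv
import io

# The exported data file from the example above.
DATA = """Headword, Category, Gloss
nulh, N, coolamon
ngarriny, N, hand
marji, N, hand
"""

# Column label -> backslash code, mirroring the merge template.
CODES = {"Headword": "hw", "Category": "ca", "Gloss": "gl"}


def to_backslash_lines(text: str) -> list[str]:
    """Turn labeled, comma-separated rows into one backslash-coded line each."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    lines = []
    for row in reader:
        parts = [f"\\{CODES[name]} {row[name]}" for name in reader.fieldnames]
        lines.append(" ".join(parts))
    return lines
```

Calling `to_backslash_lines(DATA)` gives the three backslash-coded lines shown above.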

2.5 Import

Two kinds of files can be imported into FMP: text files (tab-delimited, Merge, SYLK, etc), or other FMP files. Word files need to be saved as plain text. One needs the text files in a rigid consistent format before import is possible (but this is no different to other applications such as Shoebox). The difference is mainly that Shoebox is set up for import from FOSF files, while FMP is mainly set up for import from spreadsheet-style files with columns of data.

As far as we know there is no easy answer to getting textual data into FMP. The best answer is probably the development of an auxiliary program in something like Perl. This is something that we intend to address.
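As an indication of what such an auxiliary program might look like, here is a minimal sketch, written in Python rather than Perl. It rests on assumptions that real data will often violate: one backslash-coded field per line, every entry beginning with a headword field, and only the three codes used in the examples above. The function name and parameters are hypothetical, and continuation lines and unknown codes are silently dropped, which a real tool should not do.

```python
def fosf_to_rows(text, columns=("hw", "ca", "gl"), record_marker="hw"):
    """Convert backslash-coded text into tab-delimited rows, one per entry."""
    rows, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # skip blank lines and anything without a code
        code, _, value = line[1:].partition(" ")
        if code == record_marker:
            # A headword line starts a new entry with all columns empty.
            current = dict.fromkeys(columns, "")
            rows.append(current)
        if current is not None and code in current:
            current[code] = value.strip()
    # Missing fields become empty cells so every row has the same columns.
    return ["\t".join(r[c] for c in columns) for r in rows]
```

The output has the rigid, column-per-field shape that FMP's tab-delimited import expects. An entry lacking a category, for example, still produces three columns, with the middle one empty.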

2.6 Programmability

FMP has a built-in 'Scriptmaker' facility, which supplies a prespecified list of database actions (e.g. 'Find...', 'Copy...', 'Paste...'), objects (e.g. 'field x' [x is a field name]), conditionals, loops, and so on. There is some use of variables, though this is a bit cumbersome and sometimes has unexpected results. (For example, it is hard to write a script which says: 'run through the fields in an entry, when you find one that is empty, stop and do X'.) On the other hand, one can do a lot of stuff with scripts which is simply unavailable in most Macintosh applications, including Hypercard. As in Hypercard, one can attach scripts to custom buttons, to perform actions like transferring between layouts or files (depending on the name entered into a user-level dialogue box, say), looking up the words in an example sentence (a particularly complex one), or performing a search based on criteria entered by the user.

2.7 Disadvantages of FMP

There are just a couple of disadvantages to FMP, but they are quite serious ones. For some reason (perhaps the FMP developer mob are working on this) it is not possible to copy and paste scripts (or otherwise write them in a text editor) - which means that development is a pain and is much more time consuming and tedious than it could and should be.

Searching is rather limited in FMP: there are some wildcards, but there is certainly nothing like the capability of regular expressions. Doing partial find-and-replace within text strings (of the sort that one is used to from text editors) is difficult and fiddly within FMP. There are some prospects of getting decent results by scripting the steps and building a replace button that would allow users to partially replicate Search-Replace - it still wouldn't have regular expressions though. So, just like Shoebox, you sometimes need to export, edit in a text editor, and reimport to do these kinds of things.
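To show the kind of edit that currently has to be done outside FMP, here is a small Python sketch of a regular-expression replacement restricted to one field of an exported text file. The respelling rule (word-final 'ny' to 'nj') and the gloss 'many hands' are invented for the example, the headword code is the one used in section 2.4, and the function name is hypothetical.

```python
import re


def respell_headwords(text: str, pattern: str, replacement: str) -> str:
    """Apply a regex replacement inside \\hw fields only; other fields are untouched."""

    def fix(match):
        # group(1) is the field code, group(2) the headword itself.
        return match.group(1) + re.sub(pattern, replacement, match.group(2))

    # (?m) makes ^ and $ match at every line, so each \hw line is handled.
    return re.sub(r"(?m)^(\\hw )(.*)$", fix, text)
```

Applied to an export containing the headword 'ngarriny' and a gloss 'many hands', `respell_headwords(text, r"ny\b", "nj")` changes the headword but leaves 'many' in the gloss alone, which a plain search-and-replace would not.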

As a long-time user of FMP and text editors for dictionary and text work, Baker has found it necessary in most cases to maintain at least two separate versions of the dictionary. One is a text file in FOSF, which is used purely for search-replace using regular expressions, and for quickly referring to the dictionary while using Word. The other is a FMP version of the database in its entirety. This is the master version for all changes, and it is the version from which different kinds of text-file exports are made (e.g. lists of verbs). It is also the version which is loaded onto school computers, presented to children and speakers, and used to create published versions of the dictionary (using Mail Merge). None of these things is easily achievable in a text file.

2.8 Comparison shopping

The main alternative tool for building dictionaries is SIL's Shoebox. The great feature of Shoebox is its ability to link between texts and dictionaries. When it is all set up right, this functionality is great, and something that we probably cannot duplicate with what is possible with FMP's Scriptmaker. But Shoebox is difficult to get set up right, and in practice the barriers seem to be too big. Moreover, it is only for a linguist, and does not provide an appropriate interface for a casual user or a not-very-computer-literate language consultant. We briefly summarise some of the other pertinent features of the systems:

                    FMP              Shoebox
Multimedia          Yes              No
Import/Export       No worse/easy    Not too easy
Publishing          Mail Merge       No (other programs)
Layouts, Buttons    Yes, Yes         No, no
Programming         FMP Scripts      Only OS macros
Search/Replace      Limited/No       Limited/No
Relational          Yes              No
Web version         Automatic        No
Security/Control    Yes              No
Text/Dict Links     No/limited       Yes

3 Dictionary Template Structure

The many-to-one relationships licensed by a relational database allow us to model the typical hierarchical structure of dictionaries. A partial relational model of entries looks something like this:

Headword (4)--- Sense (4.1)
                Sense (4.2)------- Example (4.2.1)
                                   Example (4.2.2)
                                   Example (4.2.3)
                Sense (4.3)------- Example (4.3.1)
                Sense (4.4) ...    ...

A (head)word may have any number of senses. In turn, a sense may be illustrated by some examples (which are particular to one sense and not another).

The template version we are currently working on has the following structure. There are eight files (tables): Headwords, Senses, Examples, SeeAlso, Attributes, Cognates, Dialects, SemanticDomains. The five not illustrated above are:

SeeAlso: a particular sense may have links (synonym, antonym, etc.) to other senses (of other words), or, a particular headword may have links (cognacy, morphological relations like compounding) to other headwords.
Attributes: a word may have attribute-value pairs (such as inflected forms of verbs)
Cognates: a word may have cognates in other languages
Dialects: a word may have dialectal forms which differ in sense and/or headword form
SemanticDomains: a hierarchy or list of semantic domains to which senses are assigned

There is flexible linking between related entries. Linking is mediated by key fields rather than using the actual orthographic forms.

4 How much does it cost and where do we get some?

We intend that the template will be available free, for distribution either from a website or on disk. This means that as long as people have access to FMP v. 3 or 4 they can download it directly and start using it.

There is some possibility we will be able to distribute the template on disk with a run-time version of FMP. The runtime version supports all normal use of the dictionary template:

import/export
networking
browsing, data entry, searching

It does not allow modification by users of layouts, scripts or field definitions. The runtime version will also not open other FMP databases that are not supplied with it. However, if users possess a commercial copy of FMP they can still open the same database with full capability (i.e. layouts, scripting, etc). We will also provide support for producing camera-ready copy from dictionaries produced within the system (because we're nice people).

5 Broader plans

This template won't fill the needs of all users, but it seems like it could be a useful tool in the toolset, particularly for applications in language centres, schools and communities, and to assist dictionary compilers in their tasks. It goes some distance to achieving the crucial goals of (i) being very easy to use, (ii) supporting multimedia, and (iii) making a dictionary fun to browse. But there are other things that are probably better suited to other needs. In other work, Manning is working on graphical/network interfaces to dictionaries. Baker is also hoping to experiment with graphics and text fragments in audio as tools for navigating dictionaries. We observe that structured representations of dictionary information are a precondition for making such other applications possible.

Bibliography

Austin, P. 1997. Invited address at the Australian Linguistic Society Annual Meeting, Armidale.
Nathan, D. and P. Austin. 1992. Finderlists, Computer-Generated, for Bilingual Dictionaries. International Journal of Lexicography, 5:32-65.
