'Language Technology as Linguistics: a Phonological Case Study of Dutch Spelling'
(1985) – Jip Wester – Protected by copyright
Language technology as linguistics: A phonological case study of Dutch spelling
Jip Wester
0. Introduction
Recent work on the construction of text-to-speech systems has shown that grapheme-to-phoneme grammars can be very fruitfully modelled after theoretical generative phonologies. The major technological advantages of the phonological approach to grapheme-phoneme work have been explicated e.g. in Hertz (1979, 1982), Kerkhoff et al. (1984), and Wester (1984). The present paper will try to establish a major non-technological reason for this ‘applied phonology’ to work along theoretical lines. It will be argued that a theory-based version of a text-to-speech rule system can function as a restrictive, and non-trivial theory of the formal relation between spelling and speech, and, more generally, harbors the promise of a scientific theory of the formal properties of human secondary language systems (as opposed to the primary language systems which are the usual object of research in generative grammar).
The structure of this paper is as follows. In the first section we will elaborate on the theoretical parallel for applied phonology, focussing on the relation between the ‘deep structure component’ and the ‘rule component’ of the grammar, and on the notion of ‘linear rule ordering’. In the second section we will sketch an interesting problem for a formal theory of Dutch spelling, and offer an analysis of the phenomena presented, on the basis of an applied phonology of Dutch. In the third section some potential wider implications of this (type of) analysis will be indicated.
1. From spelling to deep structure: a rewrite component
The grapheme-phoneme enterprise can to a large extent be viewed as ‘normal phonology’, basically because the problem for the applied grammarian can quite naturally be defined as follows: what should the phonological component of a language look like if we took the orthography of that language to be the relevant ‘underlying form’? This same definition, however, also reveals what seems to be a fundamental difference between theoretical and applied phonology: applied phonology, working from standard orthography, has an a priori fixation of its input structure, whereas theoretical phonology has not. Applied phonology thus seems to lack a possibility which is essential to normal phonology: the possibility of theorizing about its input, in search of the most elegant and insightful co-operation of the deep structure component and the rule component, together constituting the phonology of the language. At second glance, however, this difference appears to be less dramatic. If we adopt the modular generative approach, applied phonology allows for deep structure manipulation as well, via a logically necessary ‘mapping phase’ at the first stage of the grammar.
In its most elementary form, the ‘mapping phase’ is fully implicit in the grammar, and merely constitutes the working strategy that the mapping of graphemes onto deep structure phonemes will be regarded as 1:1. That is to say, the spelling-characters will be assumed to correspond in a one-to-one fashion to the phonemes traditionally symbolized by the same typographical characters. It should be clear, however, that there is no a priori reason to take this traditional parallelism as linguistically significant. Rather, on the basis of the ‘non-phonetic’ complexity of many spelling systems, one would expect the linguistically significant mapping to be less obvious in terms of traditional symbolism, and more straightforward in terms of the linguistic ‘behaviour’ of the graphemes in the text-to-speech process. In other words, one would expect a series of explicit base rules at the beginning of the grammar, which specify the most useful grapheme-to-deep-structure-phoneme correspondences. These base rules rewrite the letter-string into the most efficient input to the phonological rule component. A very elementary example of such a base rule in several Indo-European languages is, of course, the mapping of the grapheme x onto the deep structure phoneme-string /ks/, by the simple rewrite rule (1).
(1) x → ks
In Dutch applied phonology, for example, the generalization expressed by this rule is needed at several levels of the grammar. Not only does rule (1) eliminate the need for an extra and very ad hoc ‘phoneme’ /x/, it also produces the necessary input for several phonological rules of the spelling-grammar (Wester 1984).
In a similar vein, a second important task for the Dutch rewrite component is solving the problem of ‘logical ambiguity’ in vowel sequences in spelling. The Dutch spelling system has a number of digraphs and trigraphs for long vowels and diphthongs, which might be captured by the set of schematic rewrite rules of (2). The trees over the grapheme sequences indicate their ‘constituent-hood’ in terms of letter-to-sound relationships.
(2)
As might be inferred from this list, the V-rules of (2) should be linearly ordered in a generative sense. For example, the presence in (2) of the context-free oei-RULE next to the equally simple ei- and oe-RULES logically requires that oei should be structured before ei and oe. On the level of the data, the necessity of rule ordering is also supported linguistically, as we can see if we consider forms such as (3) ((3a) shows the normal spelling-form, (3b) shows the corresponding letter-to-sound structure).
(3)
Each of these forms is a case of potential VV-ambiguity, the most complex case being the form koeieuier, in which, logically speaking, oe, ei, ie, eu, ui, oei, e, i, o, and u are potential constituents. The mechanism of ‘linear ordering’, however, can easily rule out many of these logical possibilities in favour of the one and only correct linguistic parsing of k[oei][e][ui][e]r in (3b). The correct structuring is guaranteed by the fact that the ui-RULE is placed before the eu-RULE, the oei-RULE before the ie-RULE, etc., in brief: by the ordering relations of (2). In this same way, the specific ordering of (2) not only ensures the correct reading of our koeieuier-example, it also appears to be the correct generalization for the other forms in (3) and for the VV-system in general.
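The effect of linear ordering on the koeieuier example can be made concrete with a small sketch. The rule inventory and its order below are assumptions: the list contains only V-rules mentioned in this paper, arranged in one order that is consistent with the constraints stated in the text, and it should not be taken as the full rule set of (2). The function name `parse_graphemes` is hypothetical.

```python
# Hypothetical fragment of the ordered V-rules; NOT the full list of (2).
# The order only respects constraints stated in the text, e.g. oei before
# ei/oe/ie, and ui before eu.
V_RULES = ["oei", "ee", "ui", "eu", "ei", "oe", "ie"]


def parse_graphemes(letters, rules=V_RULES):
    """Group a letter string into letter-to-sound constituents."""
    claimed = [False] * len(letters)
    spans = []  # (start, end) of every multi-letter constituent
    for rule in rules:  # earlier rules claim their letters first
        i = 0
        while i + len(rule) <= len(letters):
            end = i + len(rule)
            if letters[i:end] == rule and not any(claimed[i:end]):
                claimed[i:end] = [True] * len(rule)
                spans.append((i, end))
                i = end
            else:
                i += 1
    # Letters not claimed by any rule form constituents of their own.
    spans += [(i, i + 1) for i, c in enumerate(claimed) if not c]
    return [letters[s:e] for s, e in sorted(spans)]


print(parse_graphemes("koeieuier"))  # ['k', 'oei', 'e', 'ui', 'e', 'r']
```

Under this assumed order the string is grouped as k[oei][e][ui][e]r. Swapping the positions of `"ui"` and `"eu"` in the list would instead yield k[oei][eu][ie]r, which illustrates why the ordering relation between the two rules matters.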
Note that the motivation of the specific ordering relations in (2) has been purely ‘technological’ (in the sense of: necessary for a correct text-to-speech conversion), and that the notion of rule ordering itself is a motivation for generative ‘orthodoxy’ in this line of work, both in terms of the formal mechanism opted for and of its heuristic power.
2. Formal spelling-theory: the rule of DIAERESIS-PLACEMENT
Dutch linguistics has a relatively substantial tradition in the field of theoretical spelling-research. This research is largely triggered by the relatively rapid evolution of the spelling system, calling for sensible evaluation criteria in the face of new spelling proposals. The progress in this area, however, has been rather slow, not least because of the lack of strict and formal methods of analysis (cf. Zonneveld, 1980, Kerstens, 1981, Wester, to appear). In this section we will try to show the fruitfulness of an ‘orthodox’ applied phonology, as a means of elevating traditional spelling research to a linguistically more significant level. The case study presented will involve the diaeresis-phenomenon in Dutch spelling. Section 2.1. sketches the theoretical problem involving Dutch diaeresis; Section 2.2. presents an analysis which in effect uses no formal mechanisms other than those already present in Dutch applied phonology.
2.1. The problem
The diaeresis as an orthographic symbol in Dutch adds no ‘phonemic value’ to its host vowel, as it does in, for example, German, where the difference between Vater and Väter reflects a difference in pronunciation of the second grapheme. Rather, the function of the diaeresis in Dutch is to ‘disambiguate’ certain V-clusters in spelling, or, in the terms of traditional Dutch spelling-analysis (De spelling van de Nederlandse taal, p. 22):
(4) The diaeresis is written on the second of two adjacent vowels, which together can be read as one vowel or diphthong, but should not be read as such.
[Het deelteken (trema) wordt geschreven op de tweede van twee opeenvolgende klinkers die samen als een klinker of als een tweeklank kunnen, maar niet moeten worden gelezen.]
Thus, the Dutch spelling system calls for a diaeresis to ‘break up’ the V-sequences in (5a), which otherwise would be interpreted as digraphs, leading to incorrect pronunciations. The forms in (5b) and (5c) are ‘diaeresis-less’ according to (4), because the V-sequences should have di-grapheme readings (5b), or because a particular V-sequence simply has no such reading in Dutch (5c).
(5)
The statement under (4), however, is inadequate in the face of further Dutch diaeresis-data, as the whole picture of the relevant phenomena should also include the perfectly natural contrasting forms of (6a) and (6b).
(6)
The simple principle of logical disambiguation of (4) is contradicted by the ‘diaeresis-less’ forms in (6a), which, logically speaking, are as ambiguous as those in (6b). So the question we face is this: why is a logically ambiguous form such as geuit definitely diaeresis-less, whereas equally ambiguous forms such as geëist and smeuïg are natural diaeresis-candidates? This problem becomes even more intriguing if we approach it linguistically. On the basis of the minimal spelling pair zoeven (‘to flash’) - zoëven (‘recently’) from (5), one might infer that the interpretation of a string functions as a parameter for the diaeresis-context. And in fact, on a formal level the difference in interpretation between zoeven and zoëven can be accounted for as a difference in morphological structure: if the form is to be read as the infinitive of a verb (the ‘diaeresis-less’ interpretation), its structure will be that of zoev+en; if zoëven is to mean ‘recently’ (the diaeresis interpretation), its structure will be that of zo+even. However, if we consider the morphological structures of the other forms in (5) and (6), a straightforward relation between morphological structure and the diaeresis-context does not readily meet the eye, as evidenced by the examples of (7).
(7)
So the question remains: why are the forms of geuit and koeieuier definitely diaeresis-less, whereas the structurally comparable forms of geëist, smeuïg, kippeëi, etc. are not?
2.2. A ‘technological’ analysis
As a first step towards a formal analysis of the diaeresis-phenomenon, let us adopt the generative framework, i.e. assume a model in which the phenomenon is dealt with by a RULE of DIAERESIS PLACEMENT. While the rule of DIAERESIS PLACEMENT itself might be stated simply as (8), the actual research, of course, will establish the formal, spelling-internal conditions which trigger the application of (8).
(8) A diaeresis is placed on the leftmost grapheme of a grapheme sequence which encodes a phonemic vowel or diphthong (and thus marks the beginning of a new text-to-speech constituent).
As a second step, let us recall the technological set of base rules of Section 1, and assume i) that this ordered set of base rules is part of the theory of Dutch spelling, and ii) that the application of the rules is ‘lexical’, i.e. takes place on single morphemes, before word formation (this second assumption is a formal statement of the observation that letter-to-sound relations in the Dutch spelling system are not affected by affixation or compounding). Given these assumptions, an analysis of the phenomena is quite straightforward. In fact, the only additional mechanism is the general spelling-principle of (9), as an alternative to the logical diaeresis-principle of (4).
(9) Spelling should be formally predictable in its lexical letter-to-sound relationships.
The analysis will run as follows. When the form geëist should have the interpretation ‘past part. of eisen (‘to demand’)’, its morphological constituents will be a prefix ge, a lexical stem eis, and a suffix t. Over these separate morphemes, the lexical base rules of (2) will project the structures of (10a). After word formation, the form (10b) emerges, where ‘+’ represents a morpheme boundary. After deletion of +, the final letter-to-sound structure will be that of (10c).
(10)
Note, however, that the lexical letter-to-sound structure of (10c) cannot be predicted on the basis of a ‘morphologically blind’ grapheme-string geeist. That is, if we subjected a ‘bare’ grapheme-string geeist to the base rules of (2), the result would be (10d), which is crucially different from the lexical structure of (10c), due to the fact that the base rule which structures the identical V-sequence ee is ordered before the ei-RULE in (2). For a ‘diaeresis-less’ form such as geuit (past part. of uiten ‘to utter’), however, the lexical structure will be identical to the ‘blind’ output of (2) over a grapheme-string geuit. Compare (11), in which (11a) shows the separate morphemes with their lexical letter-to-sound structures, (11b) shows the situation after word formation, and (11c) shows the effect of the base rules on a ‘blind’ letter-string geuit.
(11)
As it turns out, the lexical letter-to-sound structure of the diaeresis-less form of geuit is identical to the non-lexical structure, whereas for the diaeresis-full form of geëist the two structures will differ. The same principle holds for the other forms presented so far, and in fact for the spelling system in general. Compare the examples under (12), where the column under (b) shows the ‘lexical form’, and the column under (c) shows the effect of (2) on the morphologically blind grapheme-string.
(12)
This pattern is, of course, explained by the spelling-principle of (9). Where the lexical structure of a form equals its non-lexical structure, no diaeresis is needed, as the lexical letter-to-sound relationship will be correctly predicted on the basis of the spelling. If the lexical and the non-lexical structures differ, however, the prediction will go formally astray, and the spelling-principle of (9) will call for a correction through the rule of DIAERESIS PLACEMENT. Note that it is essentially the technological ordering of the base rules of (2) which allows us the above analysis. For example, the ‘blind’ letter-to-sound structure of smeuïg differs from its lexical structure only because the ui-RULE is placed before the eu-RULE in (2). And vice versa: the lexical structure of geuit would not be identical to the ‘blind’ variant, if the ui-RULE were not, indeed, ordered before the eu-RULE. In short, the theoretical generalization is contained in the technological ordering of (2).
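The comparison between lexical and ‘blind’ structures, and the resulting diaeresis placement, can be sketched as follows. This is a minimal illustration under stated assumptions: the rule list is a hypothetical fragment ordered consistently with the constraints mentioned in the text (not the full set of (2)), the morpheme boundaries are supplied by hand, and the function names are invented for the example. Following (8), a diaeresis is placed on the first letter of each lexical constituent whose beginning is not predicted by the blind parse.

```python
# Hypothetical fragment of the ordered base rules; NOT the full list of (2).
V_RULES = ["oei", "ee", "ui", "eu", "ei", "oe", "ie"]
DIAERESIS = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü"}


def constituent_starts(letters, rules=V_RULES):
    """Return the start index of every letter-to-sound constituent."""
    claimed = [False] * len(letters)
    inside = [False] * len(letters)  # True for non-initial letters of a group
    for rule in rules:  # earlier rules claim their letters first
        i = 0
        while i + len(rule) <= len(letters):
            end = i + len(rule)
            if letters[i:end] == rule and not any(claimed[i:end]):
                claimed[i:end] = [True] * len(rule)
                inside[i + 1:end] = [True] * (len(rule) - 1)
                i = end
            else:
                i += 1
    return {i for i, flag in enumerate(inside) if not flag}


def spell(morphemes):
    """Spell a word from its morphemes, adding diaereses where (9) requires."""
    word = "".join(morphemes)
    lexical, offset = set(), 0
    for m in morphemes:  # 'lexical' application: base rules per morpheme
        lexical |= {offset + i for i in constituent_starts(m)}
        offset += len(m)
    blind = constituent_starts(word)  # base rules over the bare letter string
    marked = lexical - blind  # constituent starts the blind parse misses
    return "".join(DIAERESIS.get(ch, ch) if i in marked else ch
                   for i, ch in enumerate(word))


# Morpheme boundaries below are hand-supplied assumptions.
for morphemes in (["ge", "eis", "t"], ["ge", "uit"], ["smeu", "ig"],
                  ["zo", "even"], ["zoev", "en"], ["koei", "e", "uier"]):
    print("+".join(morphemes), "->", spell(morphemes))
```

Under these assumptions the sketch yields geëist, smeuïg and zoëven with a diaeresis, and geuit, zoeven and koeieuier without one, matching the contrast discussed above.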
Finally, observe that much of the formal apparatus of the diaeresis-analysis will be independently present in the Dutch spelling system. The ordered set of base rules, for example, should also be present to condition the rule of HYPHENATION, which also observes the lexical letter-to-sound constituents generated by (2). Compare the examples of (13).
(13)
Furthermore, the spelling-principle of (9) seems to be a natural enough candidate for a general parameter which will be needed by a ‘Universal Spelling Grammar’ to distinguish between, for example, possible ‘diaeresis-full’ and ‘diaeresis-less’ spelling systems.
3. Conclusion
In the previous sections, we have sketched a representative fragment of the rewrite component of Dutch applied phonology, a fragment which, as it turned out, could also directly be used as the backbone of a formal analysis of a rather complicated problem in theoretical spelling-research. In this section, we will discuss a few of the more obvious implications of the above analysis, and of the proposed ‘technological strategy’ behind it.
So far, the main strength of the diaeresis-analysis of Section 2 has been its observational value: the fact that the complicated and seemingly irregular phenomena could be captured at all through the formal analysis implied by applied phonology. Beyond its observational merits, however, the analysis can also be evaluated as insightful within the context of Dutch spelling-research, in the following way. One of the few ‘global’ principles known to be operative in the Dutch spelling system is what may be called the ‘Economy Principle’. This principle is in essence a variant of Occam's Razor, and might be stated as in (14).
(14) Eliminate spelling-characters which are redundant in the face of the spelling system.
This principle clearly played its role in the Dutch spelling reform of 1947, where identical vowel sequences aa, ee, oo and uu, standing for ‘long’ vowels on a phonemic level, were allowed to drop one of their members in an open syllable context, because in this context a single a, e, o or u would also suffice to ensure a phonologically ‘long’ reading. Thus, forms of the type (15a) were allowed to drop one of their geminate vowels as redundant, leading to the more ‘economical’ variants of (15b).
(15)
The diaeresis-principles involved in our technological analysis, and especially the linguistic ordering of the base rules of Section 1, can be regarded as a further formal effect of the Economy Principle of (14). The crucial observation is that a simple principle of logical disambiguation such as (4) would not be logically impossible within the context of the Dutch spelling system. That is, with a principle of logical disambiguation the spelling-principle of (9) could still be met: it would only require a dramatic increase of diaereses in Dutch spelled texts. To meet principle (9), Dutch spelling would not only need diaereses in the lefthand column of (7), but also in the other forms, to the effect of (16).
(16) ingeniëur bloeiën geüit eiërdop copiëer koeiëüiër
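The forms in (16) can be reproduced by a sketch of what purely logical disambiguation would amount to: mark every constituent whose first letter could be read as a digraph together with the letter preceding it. The digraph inventory, the hand-supplied letter-to-sound structures, and the function name `logical_spelling` are assumptions made for illustration only.

```python
# Hypothetical inventory of digraphs a reader could mistakenly form
# across a constituent boundary; not an exhaustive list for Dutch.
DIGRAPHS = {"ee", "ei", "eu", "ie", "oe", "ui"}
DIAERESIS = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü"}


def logical_spelling(constituents):
    """Spell a hand-parsed word, marking every logically ambiguous boundary."""
    out = []
    previous = ""
    for unit in constituents:
        # A boundary is ambiguous if the letters on either side of it
        # could also be read together as a single digraph.
        ambiguous = bool(previous) and previous[-1] + unit[0] in DIGRAPHS
        out.append(DIAERESIS[unit[0]] + unit[1:] if ambiguous else unit)
        previous = unit  # keep the unmarked letters for the next comparison
    return "".join(out)


# Letter-to-sound structures below are supplied by hand (assumed).
examples = [
    ["i", "n", "g", "e", "n", "i", "eu", "r"],
    ["b", "l", "oei", "e", "n"],
    ["g", "e", "ui", "t"],
    ["ei", "e", "r", "d", "o", "p"],
    ["c", "o", "p", "i", "ee", "r"],
    ["k", "oei", "e", "ui", "e", "r"],
]
for structure in examples:
    print(logical_spelling(structure))
```

With these inputs the sketch prints the six forms listed in (16), which shows how many additional diaereses a purely logical principle would require compared with the ordering-based analysis.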
In this light, a spelling-internal reason for the ordering principle of (2) becomes apparent. What Dutch spelling seems to do according to our analysis is to standardize the ordered output of the rewrite rules of (2) as unmarked, so the diaeresis will only have to appear in the marked situation where, to obtain a ‘true’ (or ‘lexical’) reading, the ordering of (2) must be overruled. Our analysis thus seems to point in the direction of deep linguistic principles underlying a seemingly superficial spelling system, principles such as ‘linear rule ordering’, ‘markedness’ - and an overall generative framework in which such principles will function. Because of this characteristic, the analysis also seems to ‘rehabilitate’ spelling systems as linguistically interesting products of human communication. In any case, the above analysis should call for a reconsideration of the alleged prescriptive, and (thus) linguistically arbitrary nature of spelling. As far as Dutch diaeresis is concerned, the official and prescriptive principle of (4) turned out to be a far cry from the actual linguistic principles involved in the actual diaeresis-phenomenon. Furthermore, establishing such underlying linguistic operators in spelling (of which our case study, of course, presented only one example) should be relatively easy, as the ‘orthodox’ generative approach to language technology incorporates from the outset many of the basic conditions for applied grammars to become non-trivial linguistic theories of the formal relation between spelling and language. Or rather, theories of the formal relation between human secondary and primary language systems.
References