Doctoral Dissertation Abstract

Dissertation title: PHRASAL ANALYSIS OF UNTAGGED CORPORA: ONLINE CONVERSATION AND WRITTEN LANGUAGE
Author: MARTCHEV, Milen (マルチェフ　ミレン)
Date of doctoral degree conferral: 26 June 2008

　In today's age of electronic information, vast quantities of text are produced daily and published on the Internet. New kinds of conversational media like Internet message boards, online chat and email have emerged as an important component of human communication. In this context, it is important for corpus linguists to develop techniques for working with and analysing raw (i.e. unannotated or unedited) texts in order to keep abreast of the rapid linguistic changes taking place. This paper presents a set of such techniques and uses them to analyse aspects of online communication from a linguistic and behavioural point of view. Written online conversation, which first started to gain widespread popularity in the 1990s, has seen people converse on a large scale not through spoken language but through written speech for the first time in human history. At the same time, enough time has passed since the advent of the Internet to allow different communicational strategies of writing down speech to mature out of their fumbling infancy. It is in these interesting times that this paper attempts to compare computer-mediated communication (CMC) in the form of Internet message board conversations to a more traditional form of writing - literary prose, exploring facets of both but chiefly concentrating on the former. An inter-national comparison of certain aspects of online conversation is also presented, being the first attempt of its kind for the languages concerned (Japanese and Bulgarian).
　This study can be situated in a research field lying at the intersection of corpus linguistics, natural language processing, computer-mediated communication (CMC) studies in general and studies of the language of message boards in particular. The empirical material used consists of strings of text called N-grams. All N-grams in the study are sequences of words (in the case of Bulgarian) or, alternatively, of letters or characters (in the case of Japanese), and their source is electronic text which has been processed by a computer program. These string units are also often referred to as phrases, although the term ‘phrase’ is not as narrowly defined as in traditional grammar. At the same time, the consistent use of a string implies that it does function as a semantically and grammatically cohesive unit.
　The question usually posed in corpus research so far could be summed up as "If we are interested in type A, what is the variety of its uses and/or the quantity of each of them on the basis of textual evidence?" In this paper, a text is instead approached by asking "How much of what is there in the corpus?" The author tries to demonstrate how a partial but useful answer to this question can be obtained through the use of N-grams and their frequencies. Taking these sequences as the fundamental unit of investigation presents a new approach to interrogating a corpus, referred to throughout the paper as Phrasal Analysis (PA). It is argued that its main advantages are that it offers meaningful ways to analyse unannotated texts and that it adds a new and beneficial dimension to corpus studies, which have not thus far sufficiently used N-grams as a primary research tool. It is also proposed that the wider use of PA in corpus linguistics would make the field more dynamic and help it keep abreast of the rapid linguistic developments which the Internet has brought and continues to bring about.
　Much of the analysis in this thesis is based on a comparison of a corpus of online conversation to a corpus of literary prose written between fifty and a hundred years ago. Why compare such different entities? The basic idea is that since these two registers are so distinct functionally and chronologically, they can highlight each other better, displaying a broad range of differences which can be used to define many of their basic and unique characteristics. As this is the first time that a large linguistic study solely based on N-gram Phrasal Analysis has been attempted[1], it was decided that the analysis should not be limited to a narrow set of problems but should try to examine numerous features fundamental to the respective registers. That being said, this has not been done to an equal degree for the two genres explored. The author's primary research interests involve the modern medium of online conversation and he therefore chiefly concentrates on trying to discover and interpret characteristics of this register, although two sections of the exposition deal entirely with aspects of the language of literary prose.
　PA is applied to data coming from two languages - Japanese and Bulgarian. As the Internet originated in the English-speaking world, the literature on English CMC is larger and more systematic, and has a longer history, than CMC studies in any other language. It is therefore especially important to provide data on what is happening to other languages in the modern online medium; presenting the findings in English will hopefully be of value to this kind of research on a more global level. The author is well acquainted with Japanese and a native speaker of Bulgarian, so this is where he feels he can make a contribution. At the same time, when it comes to contrasting very different languages like Bulgarian and Japanese, message boards are a very suitable conversational medium because they allow us to use the message board posts as a common denominator against which we can compare phrasal frequencies.
　The actual text processing technique used in the study can be outlined as follows. A text is a sequence of words (or letters and characters in the case of Japanese). In natural language processing, sub-sequences of n words from a given text are called 'N-grams'. Thus, a 2-word sequence is called a 'bigram' (or a 'digram') and a 3-word sequence is called a 'trigram'. Let us consider the string A B C D E F ..., where each capital letter represents an individual word, i.e. a string of letters with a space immediately following it. In order to find all possible 3-word strings, for example, we break up our text into the sequences ABC, BCD, CDE, DEF and so on. In doing this, we have effectively split our text into overlapping trigrams, so that all the words in it are accounted for. We then find the respective frequencies of all of these (ABC: p times; BCD: q times; CDE: r times and so on) and sort the results. The procedure for 2-word strings, 4-word strings and so forth is analogous. An important fact to realise is that when we look at our results, frequent N-grams are always linguistically meaningful in some way, i.e. they represent phrases in the above-mentioned sense.
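　The procedure just outlined can be sketched in a few lines of code. The study's own programs were written in Perl and are not reproduced in this abstract; the Python below is only an illustrative sketch, and the function names `word_ngrams` and `ngram_frequencies`, as well as the simple whitespace tokenisation, are assumptions made for the example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return every overlapping n-word sequence in `text`, in order."""
    # Assumption of this sketch: a "word" is anything separated by whitespace.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_frequencies(text, n):
    """Count each n-gram and return (ngram, count) pairs, most frequent first."""
    return Counter(word_ngrams(text, n)).most_common()


# The string from the text: A B C D E F is split into ABC, BCD, CDE, DEF.
for gram in word_ngrams("A B C D E F", 3):
    print("".join(gram))

# Frequencies sorted in descending order for a toy sentence.
for gram, count in ngram_frequencies("to be or not to be", 2):
    print(" ".join(gram), count)
```

　Changing `n` gives bigrams, 4-grams and so forth; real corpus text would additionally need decisions about punctuation and case, which this sketch leaves out.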
　The value of using N-grams and their frequencies is that we let the text itself (through our processing of it) reveal to us the phrasal patterns found in it and tell us how much of what there is inside. Thus, the starting point is the data itself and not our preconceived ideas of what we should be looking for - clearly representing a very objective, empirical and corpus-driven (as opposed to corpus-based) approach. This approach also enables us to study in detail the language of unannotated corpora (i.e. texts in their original form), which is important in view of the massive quantities of daily text output and dynamic linguistic change in the Information Age.
　Phrasal data produced from a single text or corpus can be informative, but if we compare the N-grams of at least two corpora we gain a frame of reference and can make our analysis more insightful, situate our findings better, and compare whole registers and hence the underlying linguistic behaviour of their agents. In other words, there is much added value to be gained from contrasting texts, and this is the reason why this paper makes a series of comparisons: Japanese online conversation versus Japanese prose, Bulgarian online conversation versus Bulgarian prose, Japanese general online conversation versus Japanese students' online conversation and, last but not least, characteristics of Bulgarian versus Japanese online conversation. To help evaluate N-gram distributions across corpora, a measure called the Leech-Fallon coefficient is used and all N-gram distributions are tested for significance using the chi-square statistic.
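　The abstract does not give the formulas behind these two measures, so the following Python sketch rests on stated assumptions rather than on the thesis itself. It assumes that the coefficient is the difference coefficient familiar from corpus comparison, (a - b) / (a + b), computed on relative frequencies, and that significance is checked with a Pearson chi-square test on a 2x2 table (N-gram occurrences versus all other N-gram tokens, in corpus A versus corpus B) without continuity correction. The function names and the sample counts are hypothetical.

```python
def difference_coefficient(freq_a, size_a, freq_b, size_b):
    """Assumed form: (a - b) / (a + b) on relative frequencies, from -1 to +1.

    +1 means the N-gram occurs only in corpus A, -1 only in corpus B,
    and 0 means it is equally frequent (relative to corpus size) in both.
    """
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    if rel_a + rel_b == 0:
        return 0.0
    return (rel_a - rel_b) / (rel_a + rel_b)


def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square statistic (1 degree of freedom) for the table
    [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]."""
    observed = [[freq_a, size_a - freq_a], [freq_b, size_b - freq_b]]
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, (size_a - freq_a) + (size_b - freq_b)]
    total = size_a + size_b
    if 0 in col_totals or 0 in row_totals:
        return 0.0  # degenerate table: nothing to compare
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Critical value of chi-square for 1 degree of freedom at p = 0.05.
CRITICAL_P05_DF1 = 3.841

# Hypothetical counts: an N-gram seen 30 times in 1000 tokens of corpus A
# and 10 times in 1000 tokens of corpus B.
coefficient = difference_coefficient(30, 1000, 10, 1000)
statistic = chi_square_2x2(30, 1000, 10, 1000)
print(round(coefficient, 3), round(statistic, 3), statistic > CRITICAL_P05_DF1)
```

　Using relative rather than raw frequencies in the coefficient keeps corpora of different sizes comparable; whether the thesis normalises in exactly this way is not stated in the abstract.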
　The N-gram approach is especially useful in the case of Japanese. Japanese orthography does not use words separated by white space and this makes the sorting and counting of linguistic items very inconvenient due to the multitude of ambiguous strings. The use of N-grams does not completely solve this problem but rather sidesteps it, as we only deal with string frequencies without having to define what is a word and what is not. When we consider sets of related N-grams and their frequencies in parallel, we can isolate items of interest and account for specific meanings of ambiguous strings much more conveniently and accurately.
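　A minimal sketch of how character N-grams sidestep word segmentation is given below, again in Python for illustration only. The two sample lines are invented for the example (they are not taken from the corpora), and the function names are assumptions; the strings 彼は and 彼女は are used because they appear among the phrases discussed later in this abstract.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every overlapping n-character sequence; no word boundaries needed."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_ngram_counts(lines, n):
    """Count character n-grams line by line so that none spans a line break."""
    counts = Counter()
    for line in lines:
        counts.update(char_ngrams(line.strip(), n))
    return counts


# Invented sample lines, used only to show the mechanics.
sample_lines = ["彼は来た。", "彼女は来た。彼は帰った。"]

bigrams = char_ngram_counts(sample_lines, 2)
trigrams = char_ngram_counts(sample_lines, 3)

# Related N-grams viewed in parallel: the 2-character string 彼は and the
# 3-character string 彼女は are counted separately, without deciding
# beforehand where one word ends and the next begins.
print(bigrams["彼は"], trigrams["彼女は"])
```

　Looking at counts for several values of `n` side by side is what allows an ambiguous short string to be separated from the longer strings that contain it.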
　To sum up, this study explains and demonstrates the validity of N-gram-based PA in language research. Moreover, it makes a number of specific socio-linguistic observations regarding online conversation and, to a lesser degree, literary prose. Its major findings are summarised in the next few paragraphs.
　An analysis is made for both Bulgarian and Japanese of trends in the graphic representation of speech. A clear generational contrast is discovered in Japan - high school students use lowercase Hiragana/Katakana about twice as often, and half-width Katakana characters about three times as often, as the general level observed; younger people also use more graphic adornments. The use of numbers instead of certain letters in Bulgarian - a practice born out of technical limitations in the early years of email but now surviving as a social habit - is registered and quantified. Frequencies of different smileys (the use of ASCII letters and symbols to denote facial expressions and emotion) are recorded in both languages. Also, trends in the use of punctuation in modern conversational writing are compared between Japan and Bulgaria. The study shows that some basic punctuation marks (such as exclamation marks and combinations of exclamation and question marks) are used with far greater frequency in the Bulgarian case, while Japanese online writers resort to a greater variety of alternative punctuational expression (such as emotive Kanji, for example), though each of these consequently occurs with a lower frequency. The question mark is the only punctuation symbol that is more or less equally represented in Bulgarian and Japanese casual online writing, and of course it is the only one without a clear alternative in Japanese. On the other hand, the increase in the density of question and exclamation marks in message boards compared to literary prose is greater in Japan.
　A number of so-called boundary patterns are identified. These are frequent N-grams encountered around sentence boundaries - an area which has not received much attention from corpus linguists. A few important patterns of this kind are discussed under the section "Narrative Deixis" of Chapter 3 in regard to Japanese prose and "Boundary Patterns" in Chapter 4 in regard to Bulgarian prose. In the case of Japanese, sentence-initial N-grams are demonstrated to be a very interesting area for studying language typical of discussion.
　A set of discourse phrases is examined for both languages. These are various expressions central to the discourse of online discussion which involve opinion-statement, agreeing, interacting with others, logically linking arguments etc. Identifying typical and commonly used phrases of this kind is one of the contributions of this paper, and selected items from this group are later used to make inter-national comparisons.
　Lists of modern and outdated vocabulary are produced for both languages - taking advantage of working with texts recorded in two clearly separated time periods. A special group of XYXY N-grams[2] is found to be more typical of, and to exhibit a much greater variety in, Japanese prose compared to modern speech.
　Some of the N-grams considered in Chapters 3 and 4 provide insight into gender imbalance. For example, when comparing the phrases 彼は・彼女は・男は and 女は as used in Japanese prose, it is found that while male characters are referred to much more often than female characters (as seen from the frequencies of 'he' and 'she'), explicit reference to men in terms of gender (as in 'the man') is significantly rarer than explicit reference to women (as in 'the woman'). This finds a parallel in Bulgarian prose, where the analysis of a set of adjectives and verbs paints a picture of male characters as typically being happy, satisfied, tired, calm, young, healthy, ready, etc. or as those who see, speak, notice, forget, lose, travel, go out, become, succeed, get appointed, require, receive, sleep, die and so on. On the other hand, the fair sex is typically referred to when stating that certain representatives of it are, indeed, fair (the only phrase of any notable frequency being 'she was pretty'). Outside fiction, women are found to agree more often than men in Bulgarian online conversation.
　Furthermore, some common phrases in online discussional writing are compared across languages. Online conversation is perhaps the best medium for such a comparison because we have a common denominator on which to base frequencies. Among CMC formats the language of message boards is perhaps the most useful source of data, because very big samples of it are easily accessible and it possesses a greater variety of linguistic expression than, say, Internet Relay Chat. On the basis of the N-gram data the author is able to make claims such as "Bulgarian speakers use the phrase 'for example' twice as often as speakers of Japanese, are equally likely to use 'i.e.', and more than three times less likely to use 'by the way'". Empirically supported comparisons of this kind, especially regarding languages as different as a Slavic language is from Japanese, have not, to the knowledge of the researcher, been conducted before.
　A number of aspects of online behaviour are also compared. Linking practices reveal that Japanese net-users are more oriented towards Internet content in their own language, while Bulgarians are more internationally oriented. It is suggested that Bulgarians are more likely to read informational content in English. A comparison of hyperlinks according to various media types[3] reveals that Japanese message board participants seem to show a relatively greater interest in audio/music content, news and blogs, while Bulgarian users are comparatively more likely to refer one another to web pages containing video and images. Moreover, time-patterns of online forum participation are compared for both countries according to hour of the day and day of the week. The biggest difference is found in weekend posting activity - Japanese posters are much more active during that time. Daily peak-times in Japan are also found to be one hour later than in Bulgaria.
　Also, an interesting contrast emerges in the chapter comparing Bulgarian and Japanese online behaviour (Chapter 5). As was already mentioned, the N-gram analysis of hyperlink activity shows that Japanese Internet users are relatively more home-domain and home-language oriented than their Bulgarian counterparts, but at the same time some phrasal data suggests that it is Bulgarians who are more likely to talk in terms of 'us' (i.e. 'we Bulgarians' versus 'we Japanese').

The paper is structurally organised as follows:

Introduction
1. Corpora, N-grams and Phrasal Analysis
2. The Corpus Data
3. Corpus Analysis: Japanese
4. Corpus Analysis: Bulgarian
5. Cross-country boarding (an inter-national comparison of message board language and behaviour)
6. Conclusion

　Chapter 1 provides an overview of N-grams and their basic uses in text processing. It defines and explains the idea behind Phrasal Analysis as well as its importance in relation to untagged (unannotated) corpora. Related research that the paper draws upon is introduced. This first chapter also contains a brief explanation of the computer programs (in Perl) used to process the texts.
　Chapter 2 introduces the data used in the study. It explains the sources, size and composition of the four main corpora used - two in Japanese and two in Bulgarian, each pair consisting of a corpus of recent online conversation and a corpus of literary prose from half a century or more ago. An auxiliary corpus of Japanese high-school and junior-high school students is also introduced here. The last section of this chapter is devoted to some problems surrounding the processing of the original texts in order to achieve an N-gram data format convenient for analysis.
　Structurally very similar, Chapters 3 and 4 present PA in action in the case of Japanese and Bulgarian respectively. They deal with unique characteristics of each register first, and then present a selection of case-studies covering specific topics to demonstrate the kind of data breakdown and interpretation PA allows for.
　Chapter 5 concentrates on drawing parallels between features of online language and behaviour in two different countries with two different languages, and tries to establish relationships between what is becoming a decreasingly novel medium of written conversation and trends in linguistic expression in it (focusing on punctuation). Aspects of online behaviour (Internet use from the point of view of hyperlink referencing) are explored as well. A direct comparison follows of selected Japanese and Bulgarian phrases. Time patterns of message-posting activity in the two countries concerned are also contrasted.
　The conclusion sums up the major findings of the study, points to ways in which Phrasal Analysis can be improved and outlines some other possible applications of PA that can be expected to produce interesting results in the future.

Notes
1 excluding one study, referred to in Chapter 1, Section 1.3, which uses N-grams to evaluate the performance of learners of English as a foreign language
2 i.e. words which consist of four syllables with their first and third, as well as second and fourth element coinciding
3 by looking at N-grams contained in hyperlinks such as .jpg, .gif, image, blog, info, news, .mp3, youtube, etc.
