Scientific report: THE DISTRIBUTION OF WORD LENGTH IN TECHNICAL RUSSIAN
Document information:
In the course of an analysis of several samples of technical Russian undertaken as part of a study in mechanical translation, a number of statistical data reflecting the structure of these samples were compiled. One of these, the distribution of word length, is presented here as Fig. 1.
Content extracted from the document:
[Mechanical Translation, vol. 1, no. 3, December 1954; pp. 38-40]

THE DISTRIBUTION OF WORD LENGTH IN TECHNICAL RUSSIAN

Anthony G. Oettinger
Computation Laboratory, Harvard University

In the course of an analysis of several samples of technical Russian undertaken as part of a study in mechanical translation, a number of statistical data reflecting the structure of these samples were compiled. One of these, the distribution of word length, is presented here as Fig. 1.

The theoretical interest of this distribution arises from the possibility of using it as a basis for an operational definition of words in printed texts. If texts are considered purely as sequences of symbols including the letters, punctuation marks, and space, the resulting sequences are of a length which no practicable machine can manage. A study of the distribution of the number of symbols between pairs of successive symbols of certain classes would be one way to reveal structural characteristics of the text sequences potentially useful toward the definition of manageable and significant subsequences. The subsequences included between successive occurrences of letter pairs have not been investigated. Those included between successive pairs of periods, exclamation points or question marks can be identified with the classical sentence, and finally, those included between successive pairs of punctuation marks or spaces can be identified with words. The length distribution of the latter subsequences has the desirable property, not shared by the others, of being concentrated at relatively low values of length, and of having no elements exceeding a certain length (Fig. 1). Words, defined in this fashion, can readily be identified by a machine and they are of limited variety, so that their listing in a dictionary is practicable.

From the practical point of view, the distribution is useful in planning input and storage facilities in experimental translating equipment.

The samples used were relatively small, and Fig. 1 should therefore be interpreted with great caution. The bar graph represents the distribution of a sample totalling 6,486 words. Points are used to indicate the distributions obtained from smaller constituents of the total. The scattering is such as to indicate that samples 1, 2, and 3 differ significantly among each other in details of their distributions. An examination of the texts indicates that these differences can safely be attributed to differing subject matter and styles. However, all distributions are bimodal, perhaps trimodal, and cut off at k = 18. The mode about k = 7 is attributable to the large number of different words used to define the particular subject of each text. The peaks at k = 1 and at k = 3 are due to a small number of very frequent grammatical words, that is, prepositions, conjunctions, etc. The five most frequent words of length 1, 2, and 3 in the total sample are listed in Table 1. This table shows that the most frequent two letter words are consistently less frequent than three letter words of similar rank. One and two letter words are exclusively grammatical; 90% of the three letter words are also grammatical, leaving 10% dependent on the subject matter. The words of length 4 are nearly all inflected. The fact that only very few Russian words have stems of three or less letters probably accounts for the valley at k = 4. Indications thus are that the modal and cut-off structure of the distributions are functions of the structure of the Russian language, while variations within these structures are characteristic of individual authors. For those who might wish to draw their own conclusions, the raw data is given in Table 2, and the sources of the samples are listed in Table 3. Letter, digram and suffix distributions compiled from the same samples may be found in the reference.

[Figure 1: bar graph of the word-length distribution; horizontal axis k (length in letters).]

TABLE 1

Length 1          Length 2          Length 3
Word   Freq.      Word   Freq.      Word   Freq.
v      210        na      86        pri     93
i      165        iz      57        dlja    72
s       91        po      46        chto    50
k       43        ot      28        kak     29
a       21        ne      26        ili     22

TABLE 2

Word      Frequency
length    Sample 1   Sample 2   Sample 3a   Sample 3b   Total
1            67        204        178          88        537
2            36        147        114          54        351
3            40        170        148          80        438
4            43        130        107          45        325
5            74        203        183         117        577
6            61        258        161          99        579
7            89        332        245         129        795
8            49        ...
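The operational definition of words used in the paper (subsequences lying between successive punctuation marks or spaces) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure applied to the original samples: the delimiter set `DELIMITERS` and the helper names `words`, `length_distribution`, and `local_peaks` are introduced here and do not come from the paper. The last helper looks for peaks in a list of frequencies and is applied to the "Total" column of Table 2 for k = 1 to 7, the only rows fully present in this copy.

```python
import re
import string
from collections import Counter

# Assumption: ASCII punctuation plus guillemets and dashes; the paper does
# not enumerate the punctuation symbols it treated as delimiters.
DELIMITERS = string.punctuation + "\u00ab\u00bb\u2014\u2013"
_SPLIT = re.compile(r"[\s" + re.escape(DELIMITERS) + r"]+")


def words(text):
    """Return the subsequences between successive punctuation marks or spaces."""
    return [w for w in _SPLIT.split(text) if w]


def length_distribution(text):
    """Map each word length k to the number of words of that length."""
    return Counter(len(w) for w in words(text))


def local_peaks(counts):
    """Return the lengths k whose frequency exceeds both neighbours.

    `counts` lists frequencies for k = 1, 2, 3, ... in order. The first
    entry is compared with its right neighbour only, since k = 0 does not
    exist. The last entry is never reported, because the data may continue
    beyond it.
    """
    peaks = []
    for i in range(len(counts) - 1):
        higher_than_left = i == 0 or counts[i] > counts[i - 1]
        if higher_than_left and counts[i] > counts[i + 1]:
            peaks.append(i + 1)
    return peaks


# Hypothetical sample sentence, not taken from the paper's sources.
sample = "Ток в цепи, как и напряжение."
print(words(sample))
print(sorted(length_distribution(sample).items()))

# "Total" column of Table 2 for k = 1..7.
totals = [537, 351, 438, 325, 577, 579, 795]
print(local_peaks(totals))  # [1, 3]
```

On the partial totals, the peaks found at k = 1 and k = 3 agree with the peaks described in the text, and the dip at k = 4 between 438 and 577 corresponds to the valley it mentions. The mode about k = 7 cannot be confirmed from this copy alone, because the table breaks off before the total for k = 8.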