Dear Info-CHILDES-ers,
I am delighted to be able to announce the addition of a major new
corpus from Thomas Hun-tak Lee (Chinese University of Hong Kong),
Colleen H Wong (Hong Kong Polytechnic University), and Samuel Leung
(University of Hong Kong). This corpus is truly huge with a combined
size of 22 megabytes (compressed to 4 MB using GZIP). The file is
called canton.tar.gz on poppy.psy.cmu.edu and on the mirrors in Chukyo
and Antwerp (later tonight). It was collected from 8 children from the
middle of the second year up to about age 3. Both romanized text and
Cantonese script are available in the CHAT files. All of the CHAT files
pass correctly through CHECK. The full 00readme.doc file is appended.
Many many thanks to Thomas, Cathy, Sam and their colleagues for a
fantastic contribution!
--Brian MacWhinney
A Hong Kong Cantonese Child Language Corpus
1. The corpus
This database contains longitudinal data on the language of 8
Cantonese-speaking children, each recorded for approximately one year.
The corpus contains 171 files coded in CHAT format and tagged with a
set of 33 word class labels.
These children were observed in their interactions with the caretakers,
the investigator, and occasionally other adults who chatted with the
children during the visits. Three research students carried out the
observations and the recording. Patricia Man recorded Bohuen and Gakei;
Alice Cheung recorded Bernard, Tsuntsun and Tinfaan; and Kitty Szeto
recorded Johnny, Jenny and Chunyat. The names of the children and the ages
during which they were recorded are as follows:
NAME             Sex   Age at which recording   No. of files
                       began and ended
Bohuen (wbh)     F     2;03;23 - 3;04;08         27
Gakei (cgk)      F     1;11;01 - 2;09;09         19
Bernard (mhz)    M     1;07;00 - 2;08;06         26
Tsuntsun (ckt)   M     1;05;22 - 2;07;22         25
Tinfaan (ltf)    F     2;02;10 - 3;02;18         16
Johnny (hhc)     M     2;04;08 - 3;04;14         16
Jenny (lly)      F     2;08;10 - 3;08;09         20
Chunyat (ccc)    M     1;10;08 - 2;10;27         22
total                                           171
The file name indicates the name of the child and his/her age at the time of
the recording. Each filename is made up of the initials of the child (the
first three characters) and his/her age at the time of recording, in terms
of Year (1 character), Month (2 characters), and Day (2 characters). All
files in the Childes archive have the suffix 'pas'. For instance, the file
WBH20323.pas contains tagged utterances of Bohuen (wbh) when she was 2 years
3 months and 23 days old.
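The naming convention described above can be decoded mechanically. The following is a minimal sketch, not part of the corpus distribution: the names RecordingId, parse_filename and chat_age are hypothetical, and the sketch assumes every file name follows the pattern of three initials, one digit for the year, two for the month and two for the day, with an optional 'pas' suffix.

```python
import re
from typing import NamedTuple


class RecordingId(NamedTuple):
    """Child initials and age decoded from a corpus file name (hypothetical helper)."""
    initials: str
    years: int
    months: int
    days: int


# Three letters, then Y (1 digit), MM (2 digits), DD (2 digits), optional ".pas".
_FILENAME = re.compile(r"^([A-Za-z]{3})(\d)(\d{2})(\d{2})(?:\.pas)?$", re.IGNORECASE)


def parse_filename(name: str) -> RecordingId:
    """Split a name such as 'CGK11101.pas' into initials and age parts."""
    match = _FILENAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised corpus file name: {name!r}")
    initials, years, months, days = match.groups()
    return RecordingId(initials.lower(), int(years), int(months), int(days))


def chat_age(rec: RecordingId) -> str:
    """Format the age in the year;month;day style used in the table of children."""
    return f"{rec.years};{rec.months:02d};{rec.days:02d}"


if __name__ == "__main__":
    rec = parse_filename("CGK11101.pas")
    print(rec.initials, chat_age(rec))
```

Run on CGK11101.pas, the first Gakei file, the sketch gives the initials cgk and the age 1;11;01, which agrees with the starting age listed for Gakei.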
The files of the 8 children and their respective sizes are listed below:
Bohuen (27 files)
WBH20323 PAS 89,489 WBH20926 PAS 106,287
WBH20330 PAS 67,570 WBH21002 PAS 44,733
WBH20402 PAS 37,631 WBH21016 PAS 56,854
WBH20405 PAS 41,410 WBH21023 PAS 58,323
WBH20406 PAS 58,587 WBH21106 PAS 32,651
WBH20413 PAS 41,079 WBH21114 PAS 53,510
WBH20414 PAS 20,251 WBH21128 PAS 59,433
WBH20415 PAS 16,539 WBH30010 PAS 68,317
WBH20506 PAS 34,773 WBH30101 PAS 73,720
WBH20609 PAS 37,001 WBH30220 PAS 88,637
WBH20619 PAS 20,238 WBH30303 PAS 59,221
WBH20703 PAS 37,328 WBH30312 PAS 71,884
WBH20714 PAS 93,592 WBH30408 PAS 113,452
WBH20919 PAS 87,838
Gakei (19 files)
CGK11101 PAS 172,448 CGK20318 PAS 57,860
CGK11108 PAS 70,025 CGK20325 PAS 55,515
CGK11122 PAS 38,073 CGK20408 PAS 84,333
CGK11129 PAS 102,565 CGK20430 PAS 130,782
CGK20008 PAS 63,610 CGK20503 PAS 67,714
CGK20207 PAS 77,541 CGK20711 PAS 116,732
CGK20221 PAS 127,568 CGK20808 PAS 125,984
CGK20228 PAS 58,909 CGK20818 PAS 155,475
CGK20304 PAS 107,718 CGK20909 PAS 186,561
CGK20311 PAS 66,674
Bernard (26 files)
MHZ10700 PAS 70,309 MHZ20115 PAS 188,523
MHZ10800 PAS 145,222 MHZ20129 PAS 143,805
MHZ10814 PAS 48,888 MHZ20212 PAS 140,860
MHZ10828 PAS 46,999 MHZ20226 PAS 180,029
MHZ10904 PAS 191,121 MHZ20309 PAS 160,713
MHZ10918 PAS 135,158 MHZ20328 PAS 200,297
MHZ10925 PAS 197,203 MHZ20407 PAS 169,851
MHZ11010 PAS 169,128 MHZ20421 PAS 202,822
MHZ11023 PAS 198,579 MHZ20504 PAS 164,264
MHZ11106 PAS 168,244 MHZ20519 PAS 158,711
MHZ20003 PAS 176,338 MHZ20604 PAS 164,720
MHZ20016 PAS 184,313 MHZ20618 PAS 242,745
MHZ20101 PAS 224,360 MHZ20806 PAS 198,527
Tsuntsun (25 files)
CKT10522 PAS 29,169 CKT20016 PAS 169,360
CKT10703 PAS 15,793 CKT20108 PAS 209,431
CKT10710 PAS 19,753 CKT20205 PAS 245,170
CKT10800 PAS 178,766 CKT20215 PAS 273,013
CKT10807 PAS 164,152 CKT20303 PAS 223,396
CKT10821 PAS 161,988 CKT20317 PAS 189,870
CKT10907 PAS 189,404 CKT20400 PAS 214,977
CKT10914 PAS 83,333 CKT20414 PAS 192,366
CKT10929 PAS 212,078 CKT20500 PAS 214,916
CKT11030 PAS 200,329 CKT20514 PAS 186,208
CKT11113 PAS 192,529 CKT20618 PAS 210,200
CKT11127 PAS 173,551 CKT20702 PAS 226,511
CKT20009 PAS 241,588
Tinfaan (16 files)
LTF20210 PAS 143,755 LTF20802 PAS 208,942
LTF20302 PAS 138,356 LTF20824 PAS 196,170
LTF20330 PAS 147,790 LTF20907 PAS 176,231
LTF20427 PAS 162,287 LTF21018 PAS 180,737
LTF20518 PAS 117,547 LTF21116 PAS 222,596
LTF20601 PAS 158,564 LTF30020 PAS 206,463
LTF20705 PAS 200,649 LTF30121 PAS 210,528
LTF20720 PAS 183,621 LTF30218 PAS 200,894
Johnny (16 files)
HHC20408 PAS 47,817 HHC20930 PAS 156,411
HHC20503 PAS 188,583 HHC21013 PAS 171,020
HHC20513 PAS 261,319 HHC21108 PAS 219,173
HHC20519 PAS 252,334 HHC30008 PAS 160,351
HHC20610 PAS 195,659 HHC30116 PAS 180,137
HHC20624 PAS 189,396 HHC30216 PAS 167,231
HHC20721 PAS 196,018 HHC30311 PAS 207,616
HHC20808 PAS 151,741 HHC30414 PAS 187,927
Jenny (20 files)
LLY20810 PAS 88,541 LLY30113 PAS 195,725
LLY20822 PAS 108,168 LLY30130 PAS 198,491
LLY20909 PAS 34,691 LLY30213 PAS 197,223
LLY20914 PAS 162,846 LLY30315 PAS 170,454
LLY20928 PAS 188,326 LLY30326 PAS 163,255
LLY21101 PAS 139,403 LLY30422 PAS 182,423
LLY21108 PAS 180,172 LLY30520 PAS 156,676
LLY21129 PAS 206,843 LLY30616 PAS 187,465
LLY30011 PAS 165,590 LLY30725 PAS 186,118
LLY30022 PAS 173,217 LLY30809 PAS 178,805
Chunyat (22 files)
CCC11008 PAS 80,465 CCC20523 PAS 207,903
CCC11100 PAS 31,915 CCC20608 PAS 193,847
CCC11121 PAS 50,132 CCC20624 PAS 170,164
CCC20110 PAS 173,940 CCC20706 PAS 207,702
CCC20117 PAS 100,270 CCC20713 PAS 196,504
CCC20206 PAS 145,920 CCC20800 PAS 200,565
CCC20213 PAS 192,652 CCC20817 PAS 153,548
CCC20307 PAS 185,136 CCC20907 PAS 203,038
CCC20323 PAS 188,144 CCC20923 PAS 176,641
CCC20410 PAS 177,589 CCC21013 PAS 196,346
CCC20507 PAS 143,956 CCC21027 PAS 217,228
2. The background of the 8 Cantonese-speaking children
a) Bohuen and Gakei
Both children were brought up in monolingual Cantonese-speaking working class
families. Bohuen's father was working in the warehouse of a mass transport
company and her mother was a part-time piano teacher. The child had a younger
brother who was about two years younger than her. They lived with the child's
grandmother and uncle. The child had already started attending a nursery
school when data collection started. After school, she was taken care of by
her parents and grandmother.
Gakei's father was a technician in an electronics company and her mother was a
housewife. They lived with the child's grandmother. Gakei's parents
were both born in Hong Kong. The child was not yet enrolled in a nursery during
the whole period of data collection. She was entirely taken care of by her mother.
b) Bernard, Tsuntsun and Tinfaan
All three were Cantonese-speaking children living in Hong Kong. Tsuntsun and
Tinfaan were born in Hong Kong, while Bernard was born in Kent, United Kingdom, and
was brought back to Hong Kong at the age of 8 1/2 months.
Tsuntsun was the only son of the family. His father was a Census &
Survey Officer working in the government and his mother a secondary school teacher teaching
Chinese and Religious Studies. Since his birth, he had been living in
his maternal grandparents' house during weekdays and was taken care of by his
grandmother. His parents visited him occasionally during the weekday evenings and took
him back home on Friday nights to stay over the weekend. They communicated in
Cantonese. When Tsuntsun was 1 year 10 months old, his mother went to study for a year in the
United Kingdom. He started to attend a nursery at the age of 2 years 1 month.
Bernard was the only son of the family. His father was a lecturer in
the Division of Construction and Land Use of the Hong Kong Polytechnic. His mother was a
lecturer in the English Language Teaching Unit of the Chinese University
of Hong Kong. Bernard's mother brought him back from the United Kingdom at the
age of 8 1/2 months. He was then taken care of by his maternal grandmother at her house
until the age of about 1 year 1 month. From that
time to the age of 2 years 6 months, he was taken care of by a caretaker
during the weekdays. He communicated in Cantonese, though his parents occasionally
introduced some English terms to him. He started to attend nursery play-groups
at the age of 2 years 6 months.
Tinfaan was the youngest child in the family. She had a sister who was
four years older. Her father was an engineer working in the government and her
mother was a piano teacher teaching at home. During the first one-and-a-half years
from her birth, she was taken care of mostly by a Filipino helper while her
mother worked as a school music teacher. After her mother had stopped working in school,
Tinfaan was mostly taken care of by her mother, except at times when her mother
had to give piano lessons or had to go out, when Tinfaan would be looked after by
her Filipino helper. She communicated in Cantonese except when speaking to her
Filipino helper, with whom she used 'something English-like' (as described by her mother). She
started to attend kindergarten at the age of 2 years 9 months.
c) Johnny, Jenny and Chunyat
All three children were born in Hong Kong and were brought up in monolingual
Cantonese-speaking families. They had not started going to a nursery
during the period of data collection.
Jenny was the youngest child in the family. She had an elder brother who
was ten years older and an elder sister who was four years older. Jenny's father was a
businessman and her mother was a housewife. The family employed a
Filipino helper, who spoke some Cantonese and English to the children.
Johnny was the youngest child in the family. He had an elder sister who
was seven years older. His father was an engineer and his mother was a typist. The
family employed a Thai helper, who spoke Cantonese to the children.
Chunyat was the only son in the family. His father was a merchant and
his mother taught English in a secondary school. They lived with the child's maternal
grandparents.
3. Tags
Below is a summary list of the syntactic categories used in coding the
corpus. The romanizations are based on the Cantonese romanization scheme of the Linguistic
Society of Hong Kong (LSHK) (see Matthews and Yip 1994: 400-401).
    Category                                e.g.
 1. adj   = adjective                       hung4
 2. advf  = focus adverb                    zung6, dou1, jau6, zoi3
 3. advi  = adverb of intensity             hou3, gei2, gam3, zan1
 4. advm  = adverb of manner                maan6maan6dei2, ma4ma4dei2
 5. advs  = sentential adverb               bat1jyu4, gam2(joeng2), jat1cai4
 6. asp   = aspectual marker                zo2, zyu6, gan2, gwo3, hoi1
 7. aux   = auxiliary / modal verb          jing1goi1, hang2, ho2ji5, wui, sai2
 8. cl    = classifier                      go3, zek3, bun2, bui1, di1
 9. com   = comparative morpheme            gwo3 (as in dai6 gwo3), di1 (as in hung4 di1)
10. conj  = connective                      dan6hai6, tung4maai4, waak6ze2
11. corr  = correlative                     jut6...jut6, jau6...jau6,
                                            gam2...gam3, jat1...jat1
12. ctc   = clitic                          dak1, dou3
13. det   = determiner                      nei1, go2, dai6
14. dir   = directional verb                lok6, soeng5, ceot1, jap6, lai4
15. ex    = expressive utterance            baai1baai3, zou2san4
16. gen   = genitive marker                 ge3
17. ins   = emphatic inserted marker        gwai2 (as in hou3 gwai2 leng3)
18. nn    = noun                            ping4gwo2, ba4ba1
19. nnloc = locative noun phrase            soeng6mien6, leoi3mien6
20. nnpr  = pronoun                         ngo5, nei3, keoi3
21. nnpp  = proper name                     tin1faan4, zeon3zeon3
22. neg   = negative morpheme               m4, mai6, mou5
23. prt   = post-verbal particle            faan1, sai3, can1, maai4, gwo3, ha2
24. prep  = preposition                     tung4maai2, hai2, bei2
25. q     = quantifier                      jat1, saam1, sap6, gei2, mui5
26. rfl   = reflexive pronoun               zi6gei2
27. sfp   = sentence final particle         &la3, &ga1 &ma3, &ge3 &le1
28. vd    = ditransitive verb               bai2, bei2
29. verg  = ergative verb                   dit3
30. vf    = function verb                   hai6, jau5, hai2
31. vi    = intransitive verb               siu3
32. vt    = transitive verb                 teoi1
33. wh    = wh words                        mat1, mat1je5, dim2, dim2gaai2,
                                            dim2jeong2
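The romanized examples above write each syllable as a run of letters followed by a tone digit. As a rough illustration only, a romanized word could be split into (syllable, tone) pairs as below. The helper name split_syllables is hypothetical, and the sketch assumes tone digits 1 to 6 and lower-case letters; a form written without a tone digit (such as 'wui' in the list) is not matched.

```python
import re
from typing import List, Tuple

# One syllable = letters followed by a single tone digit (assumed range 1-6).
_SYLLABLE = re.compile(r"([a-z]+)([1-6])")


def split_syllables(word: str) -> List[Tuple[str, int]]:
    """Return (letters, tone) pairs for a romanized word such as 'maan6maan6dei2'."""
    return [(letters, int(tone)) for letters, tone in _SYLLABLE.findall(word.lower())]


if __name__ == "__main__":
    print(split_syllables("maan6maan6dei2"))
    print(split_syllables("dim2gaai2"))
```

Characters that are not part of a syllable, such as the '&' in front of sentence final particles, are simply skipped by this sketch.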
4. Chinese characters
a) The corpus has three versions, all for use in the DOS environment. The Chinese
version requires Eten 3.5 or a later version. As the data contain
Cantonese characters which are not found in the standard GB or Big-5
character set, we have created userfonts to represent these Cantonese characters, which are in
common use in Hong Kong but not in China or Taiwan. Anyone using the Chinese
version of the corpus will need to copy the following files (which come
with the corpus) to their Eten subdirectory:
usrfont.15m
usrfont.24m
b) The romanized version is derived from the Chinese tagged corpus by
means of a conversion program based on a dictionary. Since a character may have different
pronunciations (due to language variation or context), the romanized
data files sometimes give more than one romanized form for a single character,
separated by '^', a convention suggested by Brian MacWhinney. Thus, for example, the
Cantonese morpheme for 'you' can have an alveolar lateral initial or an alveolar nasal
initial. The morpheme will be rendered as 'lei^nei' in the romanized data. The
romanized corpus contains the categorial tags below each romanized utterance, but it does
not contain English glosses. In time, we hope to seek resources to enable us to
disambiguate the romanized forms, and provide English glosses.
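Until the forms are disambiguated, anyone processing the romanized data has to handle the '^' separator. The sketch below shows one possible way to do so; the function names are hypothetical, and it assumes that words in an utterance are separated by whitespace and that '^' appears only between alternative readings.

```python
from typing import List


def expand_alternatives(token: str) -> List[str]:
    """Return every romanized reading packed into one token, e.g. 'lei^nei'."""
    return token.split("^")


def first_reading(utterance: str) -> str:
    """Keep only the first listed reading of each whitespace-separated token."""
    return " ".join(expand_alternatives(token)[0] for token in utterance.split())


if __name__ == "__main__":
    print(expand_alternatives("lei^nei"))
    print(first_reading("lei^nei hai2"))
```

Taking the first reading is an arbitrary choice made for illustration; it does not reflect which pronunciation the child actually used.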
Both the Chinese version and the romanized version (without
Chinese characters) are available by ftp from the following sites:
ftp address: humanum.arts.cuhk.hk
-for the Chinese-only corpus:
/usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Cantonese
-for the romanizations-only corpus:
/usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Romanized
c) The CHAT version now in the Childes archive is a version that incorporates the
Chinese characters on a '%can' tier, with the romanizations on the main
tier. This amalgamation was done first by Brian MacWhinney, whose help and advice
in the final stages of the corpus preparation are gratefully acknowledged, and then
checked by the research team. This version has passed the CHECK test for format consistency.
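For readers who want to line up the two tiers, the following is a minimal sketch of how each main tier could be paired with its '%can' tier. It is not the CLAN tooling and has not been run against the corpus files. It assumes the usual CHAT layout: main tiers begin with '*' and a speaker code, dependent tiers begin with '%' and a tier name, header lines begin with '@', and continuation lines begin with a tab. The function name and the sample lines are invented for illustration.

```python
from typing import Iterable, List, Optional, Tuple

Utterance = Tuple[str, str, Optional[str]]  # (speaker, main tier, %can tier)


def pair_main_with_can(lines: Iterable[str]) -> List[Utterance]:
    """Pair each main tier with the %can tier that follows it, if any."""
    pairs: List[Utterance] = []
    current: Optional[list] = None   # [speaker, main text, can text]
    last: Optional[int] = None       # index that a continuation line extends
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("*"):
            if current is not None:
                pairs.append(tuple(current))
            head, _, text = line.partition(":")
            current = [head[1:], text.strip(), None]
            last = 1
        elif line.startswith("%can:") and current is not None:
            current[2] = line[len("%can:"):].strip()
            last = 2
        elif line.startswith("\t") and current is not None and last is not None:
            current[last] = current[last] + " " + line.strip()
        else:
            last = None  # header lines and other dependent tiers are ignored
    if current is not None:
        pairs.append(tuple(current))
    return pairs


if __name__ == "__main__":
    sample = [  # made-up lines, not taken from the corpus
        "@Begin",
        "*CHI:\tba4ba1 siu3 .",
        "%can:\t<characters>",
        "*MOT:\thou3",
        "\thung4 .",
        "@End",
    ]
    for utterance in pair_main_with_can(sample):
        print(utterance)
```

An utterance with no '%can' tier is returned with None in the third position, so missing tiers are easy to spot.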
5. Acknowledgments
The creation of this corpus was made possible by a three-year grant to
Thomas Hun-tak Lee (Chinese University of Hong Kong), Colleen H Wong (Hong Kong
Polytechnic University), and Samuel Leung (University of Hong Kong) [RGC earmarked
grant CUHK 2/91]. The project was supported by two studentships from the Hong Kong
Polytechnic awarded to Patricia Man and Alice Cheung, and a studentship from the
University of Hong Kong awarded to Kitty Szeto. In addition, funding for the later
stages of the project was provided by a direct grant from the Faculty of Arts, Chinese
University of Hong Kong, a grant from the Freemason's Fund for East Asian Studies, as
well as research assistantships from the Hong Kong Polytechnic University. The
support of these funding agencies is hereby acknowledged.
Further details are given in the following report, which should be cited
if data from this corpus are used:
Lee, Thomas H.T., Colleen H Wong, Samuel Leung, Patricia Man, Alice
Cheung, Kitty Szeto, and Cathy S P Wong, "The Development of
Grammatical Competence in Cantonese-speaking Children", Report of
RGC earmarked grant 1991-94.