Dear Info-CHILDES-ers,
  I am delighted to be able to announce the addition of a major new
corpus from Thomas Hun-tak Lee (Chinese University of Hong Kong),
Colleen H Wong (Hong Kong Polytechnic University), and Samuel Leung
(University of Hong Kong).  This corpus is truly huge with a combined
size of 22 megabytes (compressed to 4 MB using GZIP).  The file is
called canton.tar.gz on poppy.psy.cmu.edu and on the mirrors in Chukyo
and Antwerp (later tonight).  It was collected from 8 children from the
middle of the second year up to about age 3.  Both romanized and
Cantonese script are available in the CHAT files.  All of the CHAT files
pass correctly through CHECK.  The full 00readme.doc file is appended. 
Many many thanks to Thomas, Cathy, Sam and their colleagues for a
fantastic contribution!

--Brian MacWhinney


       A Hong Kong Cantonese Child Language Corpus

1. The corpus

This database contains longitudinal data on the language of 8 
Cantonese-speaking children, each recorded for approximately one year. 
The corpus contains 171 files coded in CHAT format and tagged with a 
set of 33 word class labels. 

These children were observed in their interactions with the caretakers, 
the investigator, and occasionally other adults who chatted with the 
children during the visits. Three research students carried out the 
observations and the recording. Patricia Man recorded Bohuen and Gakei; 
Alice Cheung recorded Bernard, Tsuntsun and Tinfaan; and Kitty Szeto 
recorded Johnny, Jenny and Chunyat. The names of the children and the ages 
during which they were recorded are as follows:

NAME             Sex   Age at which recording    No. of files
                       began and ended

Bohuen (wbh)     F     2;03;23 - 3;04;08          27
Gakei (cgk)      F     1;11;01 - 2;09;09          19
Bernard (mhz)    M     1;07;00 - 2;08;06          26
Tsuntsun (ckt)   M     1;05;22 - 2;07;22          25
Tinfaan (ltf)    F     2;02;10 - 3;02;18          16
Johnny (hhc)     M     2;04;08 - 3;04;14          16
Jenny (lly)      F     2;08;10 - 3;08;09          20
Chunyat (ccc)    M     1;10;08 - 2;10;27          22
Total                                            171

Each file name encodes the child and his/her age at the time of the 
recording: the initials of the child (the first three characters) followed 
by the age in terms of Year (1 character), Month (2 characters), and Day 
(2 characters).  All files in the CHILDES archive have the suffix 'pas'. For 
instance, the file WBH20323.pas contains tagged utterances of Bohuen (wbh) 
when she was 2 years 3 months and 23 days old.
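
The naming convention described above can be decoded mechanically. The 
following is a minimal Python sketch, not part of the corpus distribution; 
the names RecordingId and parse_filename are hypothetical, and the sketch 
assumes every name is three letters followed by five digits, with an 
optional '.pas' suffix.

```python
import re
from typing import NamedTuple


class RecordingId(NamedTuple):
    """Child code and age decoded from a corpus file name (hypothetical helper)."""
    initials: str  # three-letter child code, lower-cased (e.g. "wbh")
    years: int
    months: int
    days: int

    def chat_age(self) -> str:
        """Return the age in the Y;MM;DD notation used in the table above."""
        return f"{self.years};{self.months:02d};{self.days:02d}"


# Three letters, then Year (1 digit), Month (2 digits), Day (2 digits),
# optionally followed by the '.pas' suffix.
_NAME = re.compile(r"^([A-Za-z]{3})(\d)(\d{2})(\d{2})(?:\.pas)?$", re.IGNORECASE)


def parse_filename(name: str) -> RecordingId:
    """Split a file name such as 'WBH20323.pas' into initials and age."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised corpus file name: {name!r}")
    initials, years, months, days = match.groups()
    return RecordingId(initials.lower(), int(years), int(months), int(days))
```

For example, parse_filename("WBH20323.pas").chat_age() gives "2;03;23", the 
age at which Bohuen's recordings began.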

The files of the 8 children and their respective sizes are listed below:

Bohuen (27 files)

WBH20323 PAS        89,489       WBH20926 PAS       106,287
WBH20330 PAS        67,570       WBH21002 PAS        44,733
WBH20402 PAS        37,631       WBH21016 PAS        56,854
WBH20405 PAS        41,410       WBH21023 PAS        58,323
WBH20406 PAS        58,587       WBH21106 PAS        32,651
WBH20413 PAS        41,079       WBH21114 PAS        53,510
WBH20414 PAS        20,251       WBH21128 PAS        59,433
WBH20415 PAS        16,539       WBH30010 PAS        68,317
WBH20506 PAS        34,773       WBH30101 PAS        73,720
WBH20609 PAS        37,001       WBH30220 PAS        88,637
WBH20619 PAS        20,238       WBH30303 PAS        59,221
WBH20703 PAS        37,328       WBH30312 PAS        71,884
WBH20714 PAS        93,592       WBH30408 PAS       113,452
WBH20919 PAS        87,838


Gakei (19 files)

CGK11101 PAS       172,448      CGK20318 PAS        57,860
CGK11108 PAS        70,025      CGK20325 PAS        55,515
CGK11122 PAS        38,073      CGK20408 PAS        84,333
CGK11129 PAS       102,565      CGK20430 PAS       130,782
CGK20008 PAS        63,610      CGK20503 PAS        67,714
CGK20207 PAS        77,541      CGK20711 PAS       116,732
CGK20221 PAS       127,568      CGK20808 PAS       125,984
CGK20228 PAS        58,909      CGK20818 PAS       155,475
CGK20304 PAS       107,718      CGK20909 PAS       186,561
CGK20311 PAS        66,674


Bernard (26 files)

MHZ10700 PAS        70,309        MHZ20115 PAS       188,523
MHZ10800 PAS       145,222        MHZ20129 PAS       143,805
MHZ10814 PAS        48,888        MHZ20212 PAS       140,860
MHZ10828 PAS        46,999        MHZ20226 PAS       180,029
MHZ10904 PAS       191,121        MHZ20309 PAS       160,713
MHZ10918 PAS       135,158        MHZ20328 PAS       200,297
MHZ10925 PAS       197,203        MHZ20407 PAS       169,851
MHZ11010 PAS       169,128        MHZ20421 PAS       202,822
MHZ11023 PAS       198,579        MHZ20504 PAS       164,264
MHZ11106 PAS       168,244        MHZ20519 PAS       158,711
MHZ20003 PAS       176,338        MHZ20604 PAS       164,720
MHZ20016 PAS       184,313        MHZ20618 PAS       242,745
MHZ20101 PAS       224,360        MHZ20806 PAS       198,527


Tsuntsun (25 files)

CKT10522 PAS        29,169        CKT20016 PAS       169,360
CKT10703 PAS        15,793        CKT20108 PAS       209,431
CKT10710 PAS        19,753        CKT20205 PAS       245,170
CKT10800 PAS       178,766        CKT20215 PAS       273,013
CKT10807 PAS       164,152        CKT20303 PAS       223,396
CKT10821 PAS       161,988        CKT20317 PAS       189,870
CKT10907 PAS       189,404        CKT20400 PAS       214,977
CKT10914 PAS        83,333        CKT20414 PAS       192,366
CKT10929 PAS       212,078        CKT20500 PAS       214,916
CKT11030 PAS       200,329        CKT20514 PAS       186,208
CKT11113 PAS       192,529        CKT20618 PAS       210,200
CKT11127 PAS       173,551        CKT20702 PAS       226,511
CKT20009 PAS       241,588


Tinfaan (16 files)

LTF20210 PAS       143,755        LTF20802 PAS       208,942
LTF20302 PAS       138,356        LTF20824 PAS       196,170
LTF20330 PAS       147,790        LTF20907 PAS       176,231
LTF20427 PAS       162,287        LTF21018 PAS       180,737
LTF20518 PAS       117,547        LTF21116 PAS       222,596
LTF20601 PAS       158,564        LTF30020 PAS       206,463
LTF20705 PAS       200,649        LTF30121 PAS       210,528
LTF20720 PAS       183,621        LTF30218 PAS       200,894


Johnny (16 files)

HHC20408 PAS        47,817        HHC20930 PAS       156,411
HHC20503 PAS       188,583        HHC21013 PAS       171,020
HHC20513 PAS       261,319        HHC21108 PAS       219,173
HHC20519 PAS       252,334        HHC30008 PAS       160,351
HHC20610 PAS       195,659        HHC30116 PAS       180,137
HHC20624 PAS       189,396        HHC30216 PAS       167,231
HHC20721 PAS       196,018        HHC30311 PAS       207,616
HHC20808 PAS       151,741        HHC30414 PAS       187,927


Jenny (20 files)

LLY20810 PAS        88,541        LLY30113 PAS       195,725
LLY20822 PAS       108,168        LLY30130 PAS       198,491
LLY20909 PAS        34,691        LLY30213 PAS       197,223
LLY20914 PAS       162,846        LLY30315 PAS       170,454
LLY20928 PAS       188,326        LLY30326 PAS       163,255
LLY21101 PAS       139,403        LLY30422 PAS       182,423
LLY21108 PAS       180,172        LLY30520 PAS       156,676
LLY21129 PAS       206,843        LLY30616 PAS       187,465
LLY30011 PAS       165,590        LLY30725 PAS       186,118
LLY30022 PAS       173,217        LLY30809 PAS       178,805


Chunyat (22 files)

CCC11008 PAS        80,465        CCC20523 PAS       207,903
CCC11100 PAS        31,915        CCC20608 PAS       193,847
CCC11121 PAS        50,132        CCC20624 PAS       170,164
CCC20110 PAS       173,940        CCC20706 PAS       207,702
CCC20117 PAS       100,270        CCC20713 PAS       196,504
CCC20206 PAS       145,920        CCC20800 PAS       200,565
CCC20213 PAS       192,652        CCC20817 PAS       153,548
CCC20307 PAS       185,136        CCC20907 PAS       203,038
CCC20323 PAS       188,144        CCC20923 PAS       176,641
CCC20410 PAS       177,589        CCC21013 PAS       196,346
CCC20507 PAS       143,956        CCC21027 PAS       217,228


2. The background of the 8 Cantonese-speaking children

a) Bohuen and Gakei

Both children were brought up in monolingual Cantonese-speaking working class 
families.  Bohuen's father was working in the warehouse of a mass transport 
company and her mother was a part-time piano teacher.  The child had a 
brother who was about two years younger than her.  They lived with the child's 
grandmother and uncle. The child had already started attending a nursery
school when data collection started.  After school, she was taken care of by
her parents and grandmother. 

Gakei's father was a technician in an electronics company and her mother was a 
housewife.  They lived with the child's grandmother.  Gakei's parents
were both born in Hong Kong.  The child was not yet enrolled in a nursery during
the whole period of data collection.  She was entirely taken care of by her mother.

b) Bernard, Tsuntsun and Tinfaan

All three were Cantonese-speaking children living in Hong Kong.  Tsuntsun and 
Tinfaan were born in Hong Kong, while Bernard was born in Kent, United Kingdom and 
was brought back to Hong Kong at the age of 8 1/2 months.
 
Tsuntsun was the only son of the family.  His father was a Census & Survey 
Officer working in the government and his mother was a secondary school 
teacher teaching Chinese and Religious Studies.  Since his birth, he had been 
living in his maternal grandparents' house on weekdays and was taken care of 
by his grandmother.  His parents visited him occasionally on weekday evenings 
and took him back home on Friday nights to stay over the weekend.  They 
communicated in Cantonese.  When Tsuntsun was 1 year 10 months old, his mother 
went to study for a year in the United Kingdom.  He started to attend a 
nursery at the age of 2 years 1 month.
 
Bernard was the only son of the family.  His father was a lecturer in the 
Division of Construction and Land Use of the Hong Kong Polytechnic.  His 
mother was a lecturer in the English Language Teaching Unit of the Chinese 
University of Hong Kong.  Bernard's mother brought him back from the United 
Kingdom at the age of 8 1/2 months.  He was then taken care of by his maternal 
grandmother at her house until the age of about 1 year 1 month.  From that 
time to the age of 2 years 6 months, he was taken care of by a caretaker on 
weekdays.  He communicated in Cantonese, though his parents occasionally 
introduced some English terms to him.  He started to attend nursery 
play-groups at the age of 2 years 6 months.

Tinfaan was the youngest child in the family.  She had a sister who was four 
years older.  Her father was an engineer working in the government and her 
mother was a piano teacher teaching at home.  During the first one and a half 
years of her life, she was taken care of mostly by a Filipino helper while her 
mother worked as a school music teacher.  After her mother had stopped working 
in school, Tinfaan was mostly taken care of by her mother, except when her 
mother had to give piano lessons or go out, when Tinfaan would be looked after 
by her Filipino helper.  She communicated in Cantonese except when speaking to 
her Filipino helper, with whom she used 'something English-like' (as described 
by her mother).  She started to attend kindergarten at the age of 2 years 
9 months.
 
 
c) Johnny, Jenny and Chunyat
 
All three children were born in Hong Kong and were brought up in monolingual 
Cantonese-speaking families. They had not started going to a nursery
during the period of data collection.
 
Jenny was the youngest child in the family. She had an elder brother who
was ten years older and an elder sister who was four years older. Jenny's father was a 
businessman and her mother was a housewife. The family employed a
Filipino helper, who spoke some Cantonese and English to the children. 
 
Johnny was the youngest child in the family. He had an elder sister who
was seven years older. His father was an engineer and his mother was a typist. The
family employed a Thai helper, who spoke Cantonese to the children.
 
Chunyat was the only son in the family. His father was a merchant and
his mother taught English in a secondary school. They lived with the child's maternal 
grandparents. 

3. Tags

Below is a summary list of the syntactic categories used in coding the
corpus. The romanizations are based on the Cantonese romanization scheme of the Linguistic 
Society of Hong Kong (LSHK) (see Matthews and Yip 1994: 400-401).


     Category                        e.g.
1.   adj = adjective                 hung4
2.   advf = focus adverb             zung6, dou1, jau6, zoi3
3.   advi = adverb of intensity      hou3, gei2, gam3, zan1
4.   advm = adverb of manner         maan6maan6dei2, ma4ma4dei2
5.   advs = sentential adverb        bat1jyu4, gam2(joeng2), jat1cai4
6.   asp = aspectual marker          zo2, zyu6, gan2, gwo3, hoi1
7.   aux = auxiliary / modal verb    jing1goi1, hang2, ho2ji5, wui, sai2
8.   cl = classifier                 go3, zek3, bun2, bui1, di1
9.   com = comparative morpheme      gwo3 (as in dai6 gwo3), di1 (as in hung4 di1)
10.  conj = connective               dan6hai6, tung4maai4, waak6ze2
11.  corr = correlative              jut6...jut6, jau6...jau6,
                                     gam2...gam3, jat1...jat1
12.  ctc = clitic                    dak1, dou3
13.  det = determiner                nei1, go2, dai6
14.  dir = directional verb          lok6, soeng5, ceot1, jap6, lai4
15.  ex = expressive utterance       baai1baai3, zou2san4
16.  gen = genitive marker           ge3
17.  ins = emphatic inserted marker  gwai2 (as in hou3 gwai2 leng3)
18.  nn = noun                       ping4gwo2, ba4ba1
19.  nnloc = locative noun phrase    soeng6mien6, leoi3mien6
20.  nnpr = pronoun                  ngo5, nei3, keoi3
21.  nnpp = proper name              tin1faan4, zeon3zeon3
22.  neg = negative morpheme         m4, mai6, mou5
23.  prt = post-verbal particle      faan1, sai3, can1, maai4, gwo3, ha2
24.  prep = preposition              tung4maai2, hai2, bei2
25.  q = quantifier                  jat1, saam1, sap6, gei2, mui5
26.  rfl = reflexive pronoun         zi6gei2
27.  sfp = sentence final particle   &la3, &ga1 &ma3, &ge3 &le1
28.  vd = ditransitive verb          bai2, bei2
29.  verg = ergative verb            dit3
30.  vf = function verb              hai6, jau5, hai2
31.  vi = intransitive verb          siu3
32.  vt = transitive verb            teoi1
33.  wh = wh words                   mat1, mat1je5, dim2, dim2gaai2,
                                     dim2jeong2

4. Chinese characters

a) The corpus has three versions, all for use in the DOS environment. The 
Chinese version requires the use of Eten 3.5 or later versions. As the data 
contain Cantonese characters which are not found in the standard GB or Big-5 
character sets, we have created user fonts to represent these Cantonese 
characters, which are in common use in Hong Kong but not in China or Taiwan. 
Anyone using the Chinese version of the corpus will need to copy the following 
files (which come with the corpus) to their Eten subdirectory:

usrfont.15m
usrfont.24m

b) The romanized version is derived from the Chinese tagged corpus by
means of a conversion program based on a dictionary. Since a character may have different 
pronunciations (due to language variation or context), the romanized
data files sometimes give more than one romanized form for a single character,
separated by '^', a convention suggested by Brian MacWhinney. Thus, for example, the
Cantonese morpheme for 'you' can have an alveolar lateral initial or an alveolar nasal
initial. The morpheme will be rendered as 'lei^nei' in the romanized data. The
romanized corpus contains the categorial tags below each romanized utterance, but it does
not contain English glosses. In time, we hope to seek resources to enable us to
disambiguate the romanized forms, and provide English glosses.
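
Until the romanized forms are disambiguated, users may want to enumerate the 
possible readings themselves. The following is a minimal Python sketch, 
assuming that tokens are separated by whitespace and that '^' occurs only as 
the separator between alternative forms; the function names and the sample 
utterance are illustrative and not taken from the corpus tools or data.

```python
from itertools import product


def alternatives(token: str) -> list[str]:
    """Return the alternative romanized forms packed into one token."""
    return token.split("^")


def expand_readings(utterance: str) -> list[str]:
    """List every reading of an utterance, one per combination of alternatives.

    The number of readings is the product of the number of alternatives per
    token, so this is only practical for short utterances.
    """
    choices = [alternatives(token) for token in utterance.split()]
    return [" ".join(combo) for combo in product(*choices)]


# 'lei^nei' is the example given above; 'hou' is an invented second token.
print(expand_readings("lei^nei hou"))
```

A search tool could instead match a query against any member of 
alternatives(token), which avoids expanding whole utterances.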

Both the Chinese version and the romanized version (without Chinese 
characters) are available by ftp from the following site:

ftp address:     humanum.arts.cuhk.hk

-for the Chinese-only corpus:

/usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Cantonese

-for the romanizations-only corpus:

/usr2/ftp/pub/Faculty/lee_thomas/Canton_Corpus/Romanized

c) The CHAT version now in the CHILDES archive incorporates the Chinese 
characters on a '%can' tier, with the romanizations on the main tier. This 
amalgamation was done first by Brian MacWhinney, whose help and advice in the 
final stages of the corpus preparation are gratefully acknowledged, and then 
checked by the research team. This version has passed the CHECK test for 
format consistency.
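
The pairing of main tier and '%can' tier can be recovered with a short 
script. The following is a minimal Python sketch, assuming the usual CHAT 
layout in which main tiers begin with '*', dependent tiers with '%', headers 
with '@', and continuation lines with a tab. The function name pair_tiers 
and the sample lines are invented for illustration, and a placeholder stands 
in for the Chinese characters.

```python
def pair_tiers(lines, tier="%can"):
    """Pair each main-tier utterance with the text of one dependent tier.

    Returns (speaker, main_text, dependent_text) tuples; dependent_text is
    None when the utterance has no such tier.
    """
    records = []
    field = None  # which field a tab-indented continuation line extends
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            records.append({"speaker": speaker, "main": text.strip(), "dep": None})
            field = "main"
        elif line.startswith("%") or line.startswith("@"):
            name, _, text = line.partition(":")
            if name == tier and records:
                records[-1]["dep"] = text.strip()
                field = "dep"
            else:
                field = None  # some other tier or a header: skip it
        elif line.startswith("\t") and field is not None:
            records[-1][field] += " " + line.strip()
    return [(r["speaker"], r["main"], r["dep"]) for r in records]


# Invented sample; CHARACTERS is a placeholder for the Chinese text.
sample = [
    "@Begin",
    "*CHI:\tlei^nei hou .",
    "%can:\tCHARACTERS",
    "*MOT:\thou .",
    "@End",
]
print(pair_tiers(sample))
```

The sketch ignores every dependent tier other than the one requested, so the 
tier holding the word class tags would need a second call with its own name.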


5. Acknowledgments

The creation of this corpus was made possible by a three-year grant to Thomas 
Hun-tak Lee (Chinese University of Hong Kong), Colleen H Wong (Hong Kong 
Polytechnic University), and Samuel Leung (University of Hong Kong) [RGC 
earmarked grant CUHK 2/91]. The project was supported by two studentships from 
the Hong Kong Polytechnic awarded to Patricia Man and Alice Cheung, and a 
studentship from the University of Hong Kong awarded to Kitty Szeto. In 
addition, funding for the later stages of the project was provided by a direct 
grant from the Faculty of Arts, Chinese University of Hong Kong, a grant from 
the Freemason's Fund for East Asian Studies, and research assistantships from 
the Hong Kong Polytechnic University. The support of these funding agencies is 
hereby acknowledged.

Further details are given in the following report, which should be cited
if data from this corpus are used:

Lee, Thomas H.T., Colleen H Wong, Samuel Leung, Patricia Man, Alice 
Cheung, Kitty Szeto, and Cathy S P Wong, "The Development of 
Grammatical Competence in Cantonese-speaking Children", Report of 
RGC earmarked grant 1991-94.