Quickstart#
After you have downloaded and installed PyLangAcq (see Download and Install),
import the package pylangacq
in your Python interpreter:
>>> import pylangacq
No errors? Great! Now you’re ready to proceed.
Reading CHAT data#
First off, we need some CHAT data to work with.
The function read_chat()
takes a data source and returns a CHAT data reader.
The data source can be either local files on your computer
or a remote ZIP archive file containing .cha files.
A prototypical example of the latter is a dataset from CHILDES.
To illustrate, let’s use Eve’s data from the Brown corpus of American English:
Caution
By default, CHAT data is processed with parallelized code to speed things up.
On Windows in particular, you may need to put your code under the
if __name__ == "__main__":
idiom in a script to avoid running into an error;
a minimal sketch of this layout is shown after the example below.
For reference, please see the “safe importing of main module” section
on parallelization in the official Python documentation.
>>> url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
>>> eve = pylangacq.read_chat(url, "Eve")
>>> eve.n_files()
20
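If your code lives in a script rather than an interactive session, the layout mentioned in the Caution above might look like the following minimal sketch (the script name here is hypothetical):

# eve_analysis.py -- a hypothetical script name
import pylangacq

def main():
    url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
    eve = pylangacq.read_chat(url, "Eve")
    print(eve.n_files())  # expects 20, as in the interactive session above

if __name__ == "__main__":
    # Guarding the entry point keeps the parallelized CHAT processing from
    # re-running the whole script when worker processes are started on Windows.
    main()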
eve is a Reader instance.
It has Eve’s 20 CHAT data files all parsed and ready for your analysis.
eve has various methods through which you can access different information
with Python data structures.
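As noted above, the data source can also be local. Here is a minimal sketch, assuming you have .cha files saved in a local directory (the path below is hypothetical):

>>> # Hypothetical local directory containing .cha files
>>> local_reader = pylangacq.read_chat("path/to/your/chat/files")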
More on Reading CHAT Data.
Header Information#
CHAT transcript files store metadata in the header, in lines beginning with @.
Among other things, eve has Eve’s age at the time of each recording,
which ranges from 1 year and 6 months old to 2 years and 3 months old:
>>> eve.ages()
[(1, 6, 0),
(1, 6, 0),
(1, 7, 0),
(1, 7, 0),
(1, 8, 0),
(1, 9, 0),
(1, 9, 0),
(1, 9, 0),
(1, 10, 0),
(1, 10, 0),
(1, 11, 0),
(1, 11, 0),
(2, 0, 0),
(2, 0, 0),
(2, 1, 0),
(2, 1, 0),
(2, 2, 0),
(2, 2, 0),
(2, 3, 0),
(2, 3, 0)]
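If a single number per file is handier, e.g. for plotting, ages() can also report each age in months; this is a sketch assuming the months keyword:

>>> eve.ages(months=True)[:4]  # assumes the ``months`` keyword; first four files
[18.0, 18.0, 19.0, 19.0]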
More on Accessing Headers.
Transcriptions and Annotations#
words()
is the basic method to access the transcriptions:
>>> words = eve.words() # list of strings, for all the words across all 20 files
>>> len(words) # total word count
119779
>>> words[:8]
['more', 'cookie', '.', 'you', 'more', 'cookies', '?', 'how_about']
By default, words() returns a flat list of results from all the files.
If we are interested in the results for individual files,
the method has the optional boolean parameter by_files:
>>> words_by_files = eve.words(by_files=True) # list of lists of strings, each inner list for one file
>>> len(words_by_files) # expects 20 -- that's the number of files of ``eve``
20
>>> for words_one_file in words_by_files:
... print(len(words_one_file))
...
5808
5252
2488
5739
5707
4338
5299
8901
4454
4533
4195
6195
4444
5207
8073
7361
10870
8403
6901
5611
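Methods such as words() can also be restricted to particular participants, for example Eve’s own speech only; a sketch assuming the participants keyword:

>>> child_words = eve.words(participants="CHI")  # assumes the ``participants`` keyword
>>> len(child_words) < len(words)  # the child's speech is a subset of all the words
True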
Apart from transcriptions, CHAT data has rich annotations for linguistic
and extra-linguistic information.
Such annotations are accessible through the methods tokens() and utterances().
Many CHAT datasets on CHILDES have the %mor and %gra tiers
for morphological information and grammatical relations, respectively.
A reader such as eve from above has all this information readily available
to you via tokens(). Think of tokens() as words() with annotations:
>>> some_tokens = eve.tokens()[:5]
>>> some_tokens
[Token(word='more', pos='qn', mor='more', gra=Gra(dep=1, head=2, rel='QUANT')),
Token(word='cookie', pos='n', mor='cookie', gra=Gra(dep=2, head=0, rel='INCROOT')),
Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
Token(word='you', pos='pro:per', mor='you', gra=Gra(dep=1, head=3, rel='SUBJ')),
Token(word='more', pos='adv', mor='more', gra=Gra(dep=2, head=3, rel='JCT'))]
>>>
>>> # The Token class is a dataclass. A Token instance has attributes as shown above.
>>> for token in some_tokens:
... print(token.word, token.pos)
...
more qn
cookie n
. .
you pro:per
more adv
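Because tokens are plain Python objects, the standard library works on them directly. For instance, here is a quick way to tally part-of-speech tags across all of Eve’s files (the counts themselves are not shown here):

>>> import collections
>>> pos_counts = collections.Counter(token.pos for token in eve.tokens())
>>> # pos_counts.most_common(5) would give the five most frequent POS tags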
Beyond the %mor and %gra tiers,
an utterance has yet more information from the original CHAT data file.
If you need information such as the unsegmented transcription, time marks,
or any unparsed tiers, utterances() is what you need:
>>> eve.utterances()[0]
Utterance(participant='CHI',
tokens=[Token(word='more', pos='qn', mor='more', gra=Gra(dep=1, head=2, rel='QUANT')),
Token(word='cookie', pos='n', mor='cookie', gra=Gra(dep=2, head=0, rel='INCROOT')),
Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT'))],
time_marks=None,
tiers={'CHI': 'more cookie . [+ IMP]',
'%mor': 'qn|more n|cookie .',
'%gra': '1|2|QUANT 2|0|INCROOT 3|2|PUNCT',
'%int': 'distinctive , loud'})
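Since each Utterance records its participant, counting utterances per speaker is a one-liner; a quick sketch (counts not shown):

>>> import collections
>>> utterances_per_speaker = collections.Counter(u.participant for u in eve.utterances())
>>> # e.g., utterances_per_speaker["CHI"] is the number of utterances by Eve herself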
More on Transcriptions and Annotations.
Word Frequencies and Ngrams#
For word combinatorics, check out
word_frequencies() and word_ngrams():
>>> word_freq = eve.word_frequencies() # a collections.Counter object
>>> word_freq.most_common(5)
[('.', 20071),
('?', 6358),
('you', 3681),
('the', 2524),
('it', 2363)]
>>> bigrams = eve.word_ngrams(2) # a collections.Counter object
>>> bigrams.most_common(5)
[(('it', '.'), 703),
(('that', '?'), 619),
(('what', '?'), 560),
(('yeah', '.'), 510),
(('there', '.'), 471)]
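These are ordinary collections.Counter objects, so plain Python is enough to post-process them, for example to rank words while ignoring punctuation marks:

>>> import collections
>>> punctuation = {".", "?", "!"}
>>> content_freq = collections.Counter(
...     {word: count for word, count in word_freq.items() if word not in punctuation}
... )
>>> content_freq.most_common(3)
[('you', 3681), ('the', 2524), ('it', 2363)]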
More on Word Frequencies and Ngrams.
Developmental Measures#
To get the mean length of utterance (MLU), use mlu():
>>> eve.mlu()
[2.309041835357625,
2.488372093023256,
2.8063241106719365,
2.618803418803419,
2.8852691218130313,
3.203358208955224,
3.179732313575526,
3.4171011470281543,
3.8439306358381504,
3.822669104204753,
3.8814317673378076,
4.177847113884555,
4.2631578947368425,
3.9936974789915967,
4.457182320441989,
4.416536661466458,
4.501661129568106,
4.288242730720607,
4.3813169984686064,
3.3172541743970316]
The result is the MLU for each CHAT file. Since this is a plain list of floats, it can readily be piped into other packages, for example to make plots.
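For instance, the MLU values line up with the per-file ages from ages(), so plotting a growth curve takes only a few lines. Here is a sketch assuming matplotlib is installed (matplotlib is not a PyLangAcq dependency):

# A hypothetical plotting script, assuming matplotlib is available
import matplotlib.pyplot as plt
import pylangacq

eve = pylangacq.read_chat("https://childes.talkbank.org/data/Eng-NA/Brown.zip", "Eve")
ages_in_months = [year * 12 + month + day / 30 for year, month, day in eve.ages()]

plt.plot(ages_in_months, eve.mlu(), marker="o")  # one point per CHAT file
plt.xlabel("Eve's age (months)")
plt.ylabel("Mean length of utterance")
plt.show()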
The other language developmental measures implemented so far are
ttr() for the type-token ratio (TTR) and
ipsyn() for the index of productive syntax (IPSyn).
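These appear to follow the same per-file pattern as mlu(), one value per CHAT file; a quick sketch under that assumption (the values themselves are not shown here):

>>> ttrs = eve.ttr()  # type-token ratio, assumed to be one value per file like mlu()
>>> len(ttrs)
20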
More on Developmental Measures.
Questions?#
If you have any questions, comments, bug reports, etc., please open an issue at the GitHub repository or contact Jackson L. Lee.