Quickstart#
After you have downloaded and installed PyLangAcq (see Download and Install),
import the package pylangacq
in your Python interpreter:
>>> import pylangacq
No errors? Great! Now you’re ready to proceed.
Reading CHAT data#
First off, we need some CHAT data to work with.
The function read_chat()
takes a data source and returns a CHAT data reader.
The data source can be either local files on your computer
or a remote ZIP archive file containing .cha files.
A prototypical example of the latter is a dataset from CHILDES.
To illustrate, let’s use Eve’s data from the Brown corpus of American English:
Caution
By default, CHAT data is processed with parallelized code to speed things up.
On Windows in particular, you may need to put your code under the
if __name__ == "__main__":
idiom in a script to avoid running into an error;
a minimal sketch of this layout is shown after the example below.
For reference, please see the “safe importing of main module” section
on parallelization in the official Python documentation.
>>> url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
>>> eve = pylangacq.read_chat(url, "Eve")
>>> eve.n_files()
20
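If your code lives in a script rather than an interactive session, the layout mentioned in the Caution above might look like the following minimal sketch (the script name here is hypothetical):

# eve_analysis.py -- a hypothetical script name
import pylangacq

def main():
    url = "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
    eve = pylangacq.read_chat(url, "Eve")
    print(eve.n_files())  # expects 20, as in the interactive session above

if __name__ == "__main__":
    # Guarding the entry point keeps the parallelized CHAT processing from
    # re-running the whole script when worker processes are started on Windows.
    main()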
eve is a Reader instance.
It has Eve’s 20 CHAT data files all parsed and ready for your analysis.
eve has various methods through which you can access different information
with Python data structures.
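As noted above, the data source can also be local. Here is a minimal sketch, assuming you have .cha files saved in a local directory (the path below is hypothetical):

>>> # Hypothetical local directory containing .cha files
>>> local_reader = pylangacq.read_chat("path/to/your/chat/files")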
More on Reading CHAT Data.
Header Information#
CHAT transcript files store metadata in the header, in lines beginning with @.
Among other things, eve has Eve’s age at the time of each recording,
which ranges from 1 year and 6 months old to 2 years and 3 months old:
>>> eve.ages()
[(1, 6, 0),
(1, 6, 0),
(1, 7, 0),
(1, 7, 0),
(1, 8, 0),
(1, 9, 0),
(1, 9, 0),
(1, 9, 0),
(1, 10, 0),
(1, 10, 0),
(1, 11, 0),
(1, 11, 0),
(2, 0, 0),
(2, 0, 0),
(2, 1, 0),
(2, 1, 0),
(2, 2, 0),
(2, 2, 0),
(2, 3, 0),
(2, 3, 0)]
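If a single number per file is handier, e.g. for plotting, ages() can also report each age in months; this is a sketch assuming the months keyword:

>>> eve.ages(months=True)[:4]  # assumes the ``months`` keyword; first four files
[18.0, 18.0, 19.0, 19.0]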
More on Accessing Headers.
Transcriptions and Annotations#
words()
is the basic method to access the transcriptions:
>>> words = eve.words() # list of strings, for all the words across all 20 files
>>> len(words) # total word count
119779
>>> words[:8]
['more', 'cookie', '.', 'you', 'more', 'cookies', '?', 'how_about']
By default, words() returns a flat list of results from all the files.
If we are interested in the results for individual files,
the method has the optional boolean parameter by_files:
>>> words_by_files = eve.words(by_files=True) # list of lists of strings, each inner list for one file
>>> len(words_by_files) # expects 20 -- that's the number of files of ``eve``
20
>>> for words_one_file in words_by_files:
... print(len(words_one_file))
...
5808
5252
2488
5739
5707
4338
5299
8901
4454
4533
4195
6195
4444
5207
8073
7361
10870
8403
6901
5611
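Methods such as words() can also be restricted to particular participants, for example Eve’s own speech only; a sketch assuming the participants keyword:

>>> child_words = eve.words(participants="CHI")  # assumes the ``participants`` keyword
>>> len(child_words) < len(words)  # the child's speech is a subset of all the words
True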
Apart from transcriptions, CHAT data has rich annotations for linguistic
and extra-linguistic information.
Such annotations are accessible through the methods tokens() and utterances().
Many CHAT datasets on CHILDES have the %mor and %gra tiers
for morphological information and grammatical relations, respectively.
A reader such as eve from above has all this information readily available
to you via tokens(). Think of tokens() as words() with annotations:
>>> some_tokens = eve.tokens()[:5]
>>> some_tokens
[Token(word='more', pos='qn', mor='more', gra=Gra(dep=1, head=2, rel='QUANT')),
Token(word='cookie', pos='n', mor='cookie', gra=Gra(dep=2, head=0, rel='INCROOT')),
Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT')),
Token(word='you', pos='pro:per', mor='you', gra=Gra(dep=1, head=3, rel='SUBJ')),
Token(word='more', pos='adv', mor='more', gra=Gra(dep=2, head=3, rel='JCT'))]
>>>
>>> # The Token class is a dataclass. A Token instance has attributes as shown above.
>>> for token in some_tokens:
... print(token.word, token.pos)
...
more qn
cookie n
. .
you pro:per
more adv
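Because tokens are plain Python objects, the standard library works on them directly. For instance, here is a quick way to tally part-of-speech tags across all of Eve’s files (the counts themselves are not shown here):

>>> import collections
>>> pos_counts = collections.Counter(token.pos for token in eve.tokens())
>>> # pos_counts.most_common(5) would give the five most frequent POS tags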
Beyond the %mor and %gra tiers,
an utterance has yet more information from the original CHAT data file.
If you need information such as the unsegmented transcription, time marks,
or any unparsed tiers, utterances() is what you need:
>>> eve.utterances()[0]
Utterance(participant='CHI',
tokens=[Token(word='more', pos='qn', mor='more', gra=Gra(dep=1, head=2, rel='QUANT')),
Token(word='cookie', pos='n', mor='cookie', gra=Gra(dep=2, head=0, rel='INCROOT')),
Token(word='.', pos='.', mor='', gra=Gra(dep=3, head=2, rel='PUNCT'))],
time_marks=None,
tiers={'CHI': 'more cookie . [+ IMP]',
'%mor': 'qn|more n|cookie .',
'%gra': '1|2|QUANT 2|0|INCROOT 3|2|PUNCT',
'%int': 'distinctive , loud'})
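Since each Utterance records its participant, counting utterances per speaker is a one-liner; a quick sketch (counts not shown):

>>> import collections
>>> utterances_per_speaker = collections.Counter(u.participant for u in eve.utterances())
>>> # e.g., utterances_per_speaker["CHI"] is the number of utterances by Eve herself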
More on Transcriptions and Annotations.
Word Frequencies and Ngrams#
For word combinatorics, check out
word_frequencies() and word_ngrams():
>>> word_freq = eve.word_frequencies() # a collections.Counter object
>>> word_freq.most_common(5)
[('.', 20071),
('?', 6358),
('you', 3681),
('the', 2524),
('it', 2363)]
>>> bigrams = eve.word_ngrams(2) # a collections.Counter object
>>> bigrams.most_common(5)
[(('it', '.'), 703),
(('that', '?'), 619),
(('what', '?'), 560),
(('yeah', '.'), 510),
(('there', '.'), 471)]
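These are ordinary collections.Counter objects, so plain Python is enough to post-process them, for example to rank words while ignoring punctuation marks:

>>> import collections
>>> punctuation = {".", "?", "!"}
>>> content_freq = collections.Counter(
...     {word: count for word, count in word_freq.items() if word not in punctuation}
... )
>>> content_freq.most_common(3)
[('you', 3681), ('the', 2524), ('it', 2363)]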
More on Word Frequencies and Ngrams.
Developmental Measures#
To get the mean length of utterance (MLU), use mlu():
>>> eve.mlu()
[2.309041835357625,
2.488372093023256,
2.8063241106719365,
2.618803418803419,
2.8852691218130313,
3.203358208955224,
3.179732313575526,
3.4171011470281543,
3.8439306358381504,
3.822669104204753,
3.8814317673378076,
4.177847113884555,
4.2631578947368425,
3.9936974789915967,
4.457182320441989,
4.416536661466458,
4.501661129568106,
4.288242730720607,
4.3813169984686064,
3.3172541743970316]
The result is the MLU for each CHAT file. Since this is a plain list of floats, it can readily be piped into other packages, for example to make plots.
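For instance, the MLU values line up with the per-file ages from ages(), so plotting a growth curve takes only a few lines. Here is a sketch assuming matplotlib is installed (matplotlib is not a PyLangAcq dependency):

# A hypothetical plotting script, assuming matplotlib is available
import matplotlib.pyplot as plt
import pylangacq

eve = pylangacq.read_chat("https://childes.talkbank.org/data/Eng-NA/Brown.zip", "Eve")
ages_in_months = [year * 12 + month + day / 30 for year, month, day in eve.ages()]

plt.plot(ages_in_months, eve.mlu(), marker="o")  # one point per CHAT file
plt.xlabel("Eve's age (months)")
plt.ylabel("Mean length of utterance")
plt.show()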
The other language developmental measures implemented so far are
ttr() for the type-token ratio (TTR) and
ipsyn() for the index of productive syntax (IPSyn).
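These appear to follow the same per-file pattern as mlu(), one value per CHAT file; a quick sketch under that assumption (the values themselves are not shown here):

>>> ttrs = eve.ttr()  # type-token ratio, assumed to be one value per file like mlu()
>>> len(ttrs)
20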
More on Developmental Measures.
Questions?#
If you have any questions, comments, bug reports, etc., please open an issue at the GitHub repository or contact Jackson L. Lee.