This page provides details of the CHAT data format that PyLangAcq assumes and explains how CHAT data files are read.
The CHAT transcription format¶
PyLangAcq is designed to handle the CHAT transcription format as used in the CHILDES database for language acquisition research. The bare minimum of what PyLangAcq assumes about CHAT is explained here. (As of January 2016, PyLangAcq conforms to the latest CHAT manual dated September 22, 2015.)
A CHAT transcript file (typically with the extension name
.cha, though not
strictly required) provides metadata headers using lines starting with
Among the many possible headers,
@ID are of particular interest:
@Participants: Code1 Name1 Role1 , Code2 Name2 Role2 @ID: ||Code1|1;6.||||Role1||| @ID: ||Code2|||||Role2|||
@Participants header states the participants of the transcript. In this
hypothetical example shown just above, there are two participants.
Each participant has a participant code (e.g.,
Code1), a participant name
Name1), and a participant role (e.g.,
The participant code must be an alphanumeric three-character string
which begins with a letter, and all letters must be in uppercase.
The participant code must come first, immediately
followed by a space, and then by the participant name, and in turn by
another space and then the participant role. A comma separates
information between two participants.
If there are
@ID headers, they must appear after, but not before, the
The number of
@ID headers is equal to the number of participants.
@ID header contains detailed information about a
@ID, the fields
must match the information of the relevant participant in the
Often of interest in language acquisition research is the age of the
participant (e.g., the target child). For instance, the age of
Code1 is 1 year and 6 months, as given by
Fields are left empty if no information is available.
While all other
@ headers are optional, PyLangAcq has built-in functions
@Date for potential usage.
After the headers come the transcriptions. All transcriptions are signaled by
* at the beginning of the line:
*Code1: good morning .
* is immediately followed by the participant code (e.g.,
Code1), and then
by a colon
: and a space (or tab). Then the transcribed line follows.
For research purposes, many CHAT transcripts have additional tiers signaled by
%mor (for morphological information such as part-of-speech tag and lemma),
%gra (for dependency and grammatical relations), and other
Much of what PyLangAcq can do relies on the annotations in these tiers with
rich linguistic information.
For any header and transcription lines that are too long, line continuations
are allowed in CHAT files with a tab character (denoted by
<TAB> below), from:
*Code1: this line is to demonstrate how line continuations are done in CHAT .
*Code1: this line is to demonstrate how line continuations <TAB>are done in CHAT .
This implies that all lines in a CHAT file must begin with any of the following
Initializing a Reader instance¶
Assuming that the CHAT transcripts for the Brown portion of CHILDES
are at the current directory:
>>> import pylangacq as pla >>> eve = pla.read_chat('Brown/Eve/*.cha')
The UTF-8 encoding is assumed for all CHAT files.
If your data files are encoded differently (e.g., in Latin-1 instead),
read_chat takes the keyword argument
'encoding'. For instance,
you can read in data like this:
data = pla.read_chat('path/to/data/files', encoding='latin1')
read_chat() can take one or multiple filenames.
These filenames can be either relative paths to the current directory
(as exemplified here) or
absolute paths. Filename pattern matching with
(wildcard for zero or more characters) and
? (wildcard for one or more
characters) can be used. In this example with Eve’s files,
* matches all
the 20 CHAT
files in the subdirectory
Brown/Eve/ relative to the current directory.
read_chat() returns an instance of the
class. For example,
eve is a
Reader instance for short.
Most of the functionality of PyLangAcq is accessed via methods of
instances, in the form of
If your CHAT data comes as an in-memory string (a string of what a single
CHAT data file would look like), a
Reader instance can be created by
from_chat_str class method:
>>> import pylangacq as pla >>> reader = pla.Reader.from_chat_str(your_chat_data_str, encoding='utf-8')
Basic information of a
Reader instance such as
eve can be accessed
>>> eve.number_of_files() # from eve01.cha through eve20.cha 20 >>> len(eve) # same as number_of_files() 20 >>> eve.number_of_utterances() # across all 20 files and all participants 26979
The bulk of the library documentation is about the various
The full API details can be found in The Reader class API.
For the method
number_of_utterances(), an utterance is a transcription line that
* in the CHAT transcripts.
Many methods of
Reader have a dual structure in terms of the return object.
It depends on whether or not you are interested in an return object that
organizes contents by the individual source files.
These methods have the optional parameter
For a given method
some_method() called for a
Reader instance named
number_of_utterances() is one of the methods with
>>> eve.number_of_utterances() # by_files is False by default 26979 >>> counts_by_files = eve.number_of_utterances(by_files=True) # dict(filename: num of utterances) >>> import os >>> for abs_filename, n in sorted(counts_by_files.items()): ... print(os.path.basename(abs_filename), n) ... 010600a.cha 1601 010600b.cha 1304 010700a.cha 618 010700b.cha 1456 010800.cha 1479 010900a.cha 1075 010900b.cha 1277 010900c.cha 2058 011000a.cha 1024 011000b.cha 1060 011100a.cha 952 011100b.cha 1339 020000a.cha 959 020000b.cha 1094 020100a.cha 1651 020100b.cha 1500 020200a.cha 2156 020200b.cha 1760 020300a.cha 1348 020300b.cha 1268
(Many data access methods have the parameter
for the dual possibilities of return objects;
see Transcriptions and annotations.)
We are often interested in what concerns specific participants in the data,
e.g., the target child whose participant code is
methods accept an optional argument to specify the parameter
(see also Child speech versus child directed speech):
>>> for abs_filename, n in sorted(eve.number_of_utterances(participant='CHI', by_files=True).items()): ... print(os.path.basename(abs_filename), n) ... 010600a.cha 749 010600b.cha 488 010700a.cha 253 010700b.cha 590 010800.cha 707 010900a.cha 542 010900b.cha 528 010900c.cha 959 011000a.cha 521 011000b.cha 547 011100a.cha 447 011100b.cha 648 020000a.cha 460 020000b.cha 476 020100a.cha 724 020100b.cha 641 020200a.cha 904 020200b.cha 791 020300a.cha 653 020300b.cha 539
Adding and removing transcripts in a reader¶
It is possible to add or remove transcripts in a
this is important where dynamic data handling is needed.
Three methods are available:
To illustrate, we initialize
corpus as an empty
>>> corpus = pla.read_chat() # empty, no filenames given
To add transcripts, use
add() which takes one or more filenames
>>> corpus.add('Brown/Eve/01*.cha') # all data prior to 2;0. (files are conveniently named by age) >>> corpus.number_of_files() 12
To remove transcripts with
>>> corpus.remove('Brown/Eve/010*.cha') # remove data files prior to 1;10. >>> corpus.number_of_files() 4
update() takes a
Reader instance and updates the current one:
>>> new_corpus = pla.read_chat('Brown/Eve/02*.cha') # all data from 2;0. >>> new_corpus.number_of_files() 8 >>> corpus.update(new_corpus) # use "update" to combine new_corpus into corpus >>> corpus.number_of_files() 12
clear() applies to a
Reader instance to clear everything and reset it
as an empty
>>> corpus.clear() >>> corpus.number_of_files() 0