chat — Reading and parsing CHAT transcripts

The chat module defines two classes for reading and parsing CHAT transcripts:

Reader(*filenames[, encoding]) A class for reading multiple CHAT files.
SingleReader(filename[, encoding]) A class for reading a single CHAT file.

Users do not usually need to create Reader or SingleReader objects directly in their code. Under most circumstances, pylangacq.read_chat() is sufficient; this function returns a Reader object, which relies on SingleReader to handle the individual data files.

Most of the methods of interest belong to the Reader class. Many of them take the optional parameter by_files. By default, by_files is False and a given method X() returns a single result pooled over all the files in question. When by_files is True, the return object is instead dict(absolute-path filename: X() for that file).
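The two return shapes can be pictured with a small sketch. It does not call pylangacq: the per-file word lists and the helper words() below are hypothetical stand-ins, assumed only to show how a pooled result differs from a by_files=True result.

```python
# Hypothetical per-file data standing in for parsed CHAT files.
WORDS_BY_FILE = {
    "/data/eve01.cha": ["more", "cookie"],
    "/data/eve02.cha": ["that", "radio"],
}


def words(by_files=False):
    """Mimic the by_files convention on the stand-in data above."""
    if by_files:
        # One result per file, keyed by absolute-path filename.
        return {name: list(ws) for name, ws in WORDS_BY_FILE.items()}
    # One pooled result over all files (here in sorted filename order).
    pooled = []
    for name in sorted(WORDS_BY_FILE):
        pooled.extend(WORDS_BY_FILE[name])
    return pooled


print(words())
print(words(by_files=True))
```

The pooled call yields one flat list, while by_files=True keeps each file's result separate under its filename.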

The Reader methods are categorized into Metadata methods and Data methods.

Metadata methods

filenames([sorted_by_age]) Return the set of absolute-path filenames, or a list sorted by the target child’s age if sorted_by_age is True.
find_filename(file_basename) Return the absolute-path filename of file_basename.
number_of_files() Return the number of files.
number_of_utterances([participant, by_files]) Return the number of utterances for participant in all files.
headers() Return a dict mapping an absolute-path filename to the headers of that file.
participants() Return a dict mapping an absolute-path filename to the file’s participant info dict.
participant_codes([by_files]) Return the participant codes (e.g., {'CHI', 'MOT'}) from all files.
languages() Return a dict mapping an absolute-path filename to the list of languages used.
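date_of_recording() Return a dict mapping an absolute-path filename to the date of recording in the form of (year, month, day).
date_of_birth() Return a dict mapping an absolute-path filename to the date-of-birth dict for that file.
age([participant, month]) Return a dict mapping an absolute-path filename to the participant’s age in the form of (years, months, days).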

Data methods

index_to_tiers() Return a dict mapping an absolute-path filename to the file’s index_to_tiers dict.
utterances([participant, clean, by_files]) Return a list of (participant, utterance) pairs from all files.
words([participant, by_files]) Return a list of words by participant in all files.
tagged_words([participant, by_files]) Return a list of tagged words by participant in all files.
sents([participant, by_files]) Return a list of sents by participant in all files.
tagged_sents([participant, by_files]) Return a list of tagged sents by participant in all files.
part_of_speech_tags([participant, by_files]) Return the part-of-speech tags in the data for participant.
word_frequency([participant, keep_case, ...]) Return a Counter of word frequencies for participant in all files.
word_ngrams(n[, participant, keep_case, ...]) Return a Counter of word n-grams by participant in all files.
search(search_item[, participant, ...]) Return a list of elements containing search_item by participant in all files.
concordance(search_item[, participant, ...]) Return a list of utterances (as strings) each containing search_item by participant.
MLU([participant]) Return a dict mapping a filename to the file’s mean length of utterance (MLU) in morphemes for participant (default to 'CHI'); same as MLUm().
MLUm([participant]) Return a dict mapping a filename to the file’s mean length of utterance (MLU) in morphemes for participant (default to 'CHI'); same as MLU().
MLUw([participant]) Return a dict mapping a filename to the file’s mean length of utterance (MLU) in words for participant (default to 'CHI').
TTR([participant]) Return a dict mapping a filename to the file’s type-token ratio (TTR) for participant (default to 'CHI').
IPSyn([participant]) Return a dict mapping a filename to the file’s index of productive syntax (IPSyn) for participant (default to 'CHI').
update(reader) Combine the current CHAT Reader instance with reader.
add(*filenames) Add one or multiple CHAT files to the current reader by filenames.
remove(*filenames) Remove one or multiple CHAT files from the current reader by filenames.
clear() Clear everything and reset as an empty Reader instance.

The Reader class API

class pylangacq.chat.Reader(*filenames, encoding='utf8')

A class for reading multiple CHAT files.

IPSyn(participant='CHI')

Return a dict mapping a filename to the file’s index of productive syntax (IPSyn) for participant (default to 'CHI').

Parameters: participant – The specified participant; defaults to 'CHI'
Return type: dict(str: int)

MLU(participant='CHI')

Return a dict mapping a filename to the file’s mean length of utterance (MLU) in morphemes for participant (default to 'CHI'); same as MLUm().

Parameters: participant – The specified participant; defaults to 'CHI'
Return type: dict(str: float)

MLUm(participant='CHI')

Return a dict mapping a filename to the file’s mean length of utterance (MLU) in morphemes for participant (default to 'CHI'); same as MLU().

Parameters: participant – The specified participant; defaults to 'CHI'
Return type: dict(str: float)

MLUw(participant='CHI')

Return a dict mapping a filename to the file’s mean length of utterance (MLU) in words for participant (default to 'CHI').

Parameters: participant – The specified participant; defaults to 'CHI'
Return type: dict(str: float)

TTR(participant='CHI')

Return a dict mapping a filename to the file’s type-token ratio (TTR) for participant (default to 'CHI').

Parameters: participant – The specified participant; defaults to 'CHI'
Return type: dict(str: float)
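
As a rough illustration of the two word-based measures above (MLUw and TTR), the sketch below computes a mean length of utterance in words and a type-token ratio from a list of tokenized utterances. The helper names mlu_w and ttr and the sample data are hypothetical, and the library's own counting rules (for example, which tokens it excludes) may differ.

```python
def mlu_w(sents):
    """Mean number of words per utterance; 0.0 if there are no utterances."""
    if not sents:
        return 0.0
    return sum(len(sent) for sent in sents) / len(sents)


def ttr(sents):
    """Number of distinct word types divided by the number of word tokens."""
    tokens = [word for sent in sents for word in sent]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical utterances, each already split into words.
sents = [["more", "cookie"], ["more", "juice", "please"], ["no"]]
print(mlu_w(sents))  # 6 tokens over 3 utterances
print(ttr(sents))    # 5 types over 6 tokens
```
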

add(*filenames)

Add one or multiple CHAT files to the current reader by filenames. filenames may take glob patterns with wildcards * and ?.

age(participant='CHI', month=False)

Return a dict mapping an absolute-path filename to the participant’s age in the form of (years, months, days).

Parameters:
  • participant – The specified participant; defaults to 'CHI'
  • month – If True (default: False), return the age in months as a float instead.
Return type:

dict(str: tuple(int, int, int)) or dict(str: float)
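
The relationship between the two return forms can be sketched as follows. The conversion assumes a 30-day month, which is an assumption made for illustration and not necessarily the convention the library uses; age_in_months is a hypothetical helper.

```python
def age_in_months(age, days_per_month=30):
    """Convert a (years, months, days) tuple to a float number of months."""
    years, months, days = age
    # Assumption: days are converted with a fixed month length.
    return years * 12 + months + days / days_per_month


print(age_in_months((1, 6, 0)))   # 18.0
print(age_in_months((2, 3, 15)))  # 27.5
```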

clear()

Clear everything and reset as an empty Reader instance.

concordance(search_item, participant='**ALL**', match_entire_word=True, lemma=False, by_files=False)

Return a list of utterances (as strings) each containing search_item by participant. All strings are padded with spaces so that search_item is aligned across them, creating the word concordance effect.

Parameters:
  • search_item – The word or lemma to search for.
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • match_entire_word – If False (default: True), substring matching is performed.
  • lemma – If True (default: False), search_item refers to the lemma (from “mor” in the tagged word) instead.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list, or dict(str: list)
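
The alignment by space padding can be sketched as below. This is an illustration only: the helper concordance_lines is hypothetical, it matches whole words in utterances that are already tokenized, and it aligns on the first occurrence of the search word in each utterance.

```python
def concordance_lines(sents, search_item):
    """Return utterance strings padded so that search_item lines up."""
    hits = []
    for sent in sents:
        if search_item in sent:
            i = sent.index(search_item)
            left = " ".join(sent[:i])
            right = " ".join(sent[i:])
            hits.append((left, right))
    if not hits:
        return []
    width = max(len(left) for left, _ in hits)
    # Right-justify the left context so every match starts in the same column.
    return [left.rjust(width) + " " + right for left, right in hits]


# Hypothetical utterances, each already split into words.
sents = [
    ["I", "want", "cookie"],
    ["cookie", "please"],
    ["no", "more", "cookie", "now"],
]
for line in concordance_lines(sents, "cookie"):
    print(line)
```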

date_of_birth()

Return a dict mapping an absolute-path filename to the date-of-birth dict for that file.

Return type: dict(str: dict(str: tuple(int, int, int)))

date_of_recording()

Return a dict mapping an absolute-path filename to the date of recording in the form of (year, month, day).

Return type: dict(str: tuple(int, int, int))

filenames(sorted_by_age=False)

Return the set of absolute-path filenames, or a list sorted by the target child’s age if sorted_by_age is True.

Return type: set(str) or list(str)

find_filename(file_basename)

Return the absolute-path filename of file_basename.

Parameters: file_basename – CHAT file basename such as eve01.cha

headers()

Return a dict mapping an absolute-path filename to the headers of that file.

Return type: dict(str: dict)

index_to_tiers()

Return a dict mapping an absolute-path filename to the file’s index_to_tiers dict.

Return type: dict(str: dict)

languages()

Return a dict mapping an absolute-path filename to the list of languages used.

Return type: dict(str: list(str))

number_of_files()

Return the number of files.

number_of_utterances(participant='**ALL**', by_files=False)

Return the number of utterances for participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

int, or dict(str: int)
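
The participant matching described above can be illustrated with Python's re module. The sketch assumes that each pattern must match a whole participant code (re.fullmatch), which is consistent with 'CHI' being an exact match but remains an assumption about the implementation; select_participants and its handling of '**ALL**' are hypothetical.

```python
import re


def select_participants(codes, participant="**ALL**"):
    """Return the subset of codes selected by one pattern or several."""
    if participant == "**ALL**":
        return set(codes)
    # Accept either one pattern string or a collection of patterns.
    patterns = [participant] if isinstance(participant, str) else list(participant)
    return {
        code
        for code in codes
        if any(re.fullmatch(pattern, code) for pattern in patterns)
    }


codes = {"CHI", "MOT", "FAT"}
print(select_participants(codes, "CHI"))
print(select_participants(codes, {"CHI", "MOT"}))
print(select_participants(codes, "^(?!.*CHI).*$"))  # everyone except CHI
```

The last pattern uses a negative lookahead, so it accepts any code that does not contain CHI.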

part_of_speech_tags(participant='**ALL**', by_files=False)

Return the part-of-speech tags in the data for participant.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

set or dict(str: set)

participant_codes(by_files=False)

Return the participant codes (e.g., {'CHI', 'MOT'}) from all files.

Parameters: by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type: set(str), or dict(str: set(str))

participants()

Return a dict mapping an absolute-path filename to the file’s participant info dict.

Return type: dict(str: dict)

remove(*filenames)

Remove one or multiple CHAT files from the current reader by filenames. filenames may take glob patterns with wildcards * and ?.

search(search_item, participant='**ALL**', match_entire_word=True, lemma=False, output_tagged=True, output_sents=True, by_files=False)

Return a list of elements containing search_item by participant in all files. Depending on output_tagged and output_sents, each element is either a list of tagged or untagged words, or a single word (tagged or untagged).

Parameters:
  • search_item – The word or lemma to search for.
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • match_entire_word – If False (default: True), substring matching is performed.
  • lemma – If True (default: False), search_item refers to the lemma (from “mor” in the tagged word) instead.
  • output_tagged – If True (default), a word in the return object is a tagged word of the (word, pos, mor, rel) tuple; otherwise just a word string.
  • output_sents – If True (default), each element in the return object is a list for each utterance; otherwise each element is a word (tagged or untagged) without the utterance structure.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list, or dict(str: list)
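
How output_tagged and output_sents shape the result can be sketched on stand-in data. Everything below is hypothetical: the tagged utterances are invented, the function only mimics the documented parameters on whole-word matches, and treating the lemma as the part of mor before a '-' suffix is an assumption made for this example.

```python
# Hypothetical tagged utterances; each word is a (word, pos, mor, rel) tuple.
TAGGED_SENTS = [
    [("more", "QN", "more", (1, 2, "QUANT")),
     ("cookie", "N", "cookie", (2, 0, "INCROOT"))],
    [("cookies", "N", "cookie-PL", (1, 0, "INCROOT"))],
    [("no", "CO", "no", (1, 0, "INCROOT"))],
]


def search(search_item, lemma=False, output_tagged=True, output_sents=True):
    """Find utterances or words whose word (or lemma) equals search_item."""

    def matches(tagged_word):
        word, _pos, mor, _rel = tagged_word
        # Assumption: the lemma is the part of mor before any '-' suffix.
        target = mor.split("-")[0] if lemma else word
        return target == search_item

    def render(tagged_word):
        return tagged_word if output_tagged else tagged_word[0]

    results = []
    for sent in TAGGED_SENTS:
        hits = [w for w in sent if matches(w)]
        if not hits:
            continue
        if output_sents:
            # Keep the utterance structure: one list per matching utterance.
            results.append([render(w) for w in sent])
        else:
            # Flatten: only the matching words themselves.
            results.extend(render(w) for w in hits)
    return results


print(search("cookie", output_tagged=False))
print(search("cookie", lemma=True, output_tagged=False, output_sents=False))
```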

sents(participant='**ALL**', by_files=False)

Return a list of sents (each a list of word strings) by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(str)), or dict(str: list(list(str)))

tagged_sents(participant='**ALL**', by_files=False)

Return a list of tagged sents by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(list(tuple)), or dict(str: list(list(tuple)))

tagged_words(participant='**ALL**', by_files=False)

Return a list of tagged words by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(tuple), or dict(str: list(tuple))

update(reader)

Combine the current CHAT Reader instance with reader.

Parameters: reader – a Reader instance

utterances(participant='**ALL**', clean=True, by_files=False)

Return a list of (participant, utterance) pairs from all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • clean – Whether to filter away the CHAT annotations in the utterance; defaults to True.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(tuple(str, str)), or dict(str: list(tuple(str, str)))

word_frequency(participant='**ALL**', keep_case=True, by_files=False)

Return a Counter of word frequencies for participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • keep_case – If keep_case is True (the default), case distinctions are kept and word tokens like “the” and “The” are treated as distinct types. If keep_case is False, all case distinctions are collapsed, with all word tokens forced to be in lowercase.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

Counter, or dict(str: Counter)
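
The effect of keep_case can be shown with collections.Counter from the standard library. This is a minimal sketch on a hypothetical word list; the function below only mirrors the documented parameter and is not the library's implementation.

```python
from collections import Counter


def word_frequency(words, keep_case=True):
    """Count word tokens, optionally collapsing case distinctions."""
    if not keep_case:
        words = (word.lower() for word in words)
    return Counter(words)


words = ["The", "dog", "saw", "the", "cat"]
print(word_frequency(words))                   # "The" and "the" counted apart
print(word_frequency(words, keep_case=False))  # both counted under "the"
```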

word_ngrams(n, participant='**ALL**', keep_case=True, by_files=False)

Return a Counter of word n-grams by participant in all files.

Parameters:
  • n – The number of words in each n-gram.
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • keep_case – If keep_case is True (the default), case distinctions are kept and word tokens like “the” and “The” are treated as distinct types. If keep_case is False, all case distinctions are collapsed, with all word tokens forced to be in lowercase.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

Counter, or dict(str: Counter)
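
Counting word n-grams can be sketched in the same way. The version below assumes that n-grams are formed from adjacent words within one utterance and never span two utterances; that boundary rule, the function body, and the sample data are assumptions for illustration.

```python
from collections import Counter


def word_ngrams(sents, n, keep_case=True):
    """Count n-grams of adjacent words within each utterance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    counts = Counter()
    for sent in sents:
        if not keep_case:
            sent = [word.lower() for word in sent]
        # Slide a window of n words; utterances shorter than n yield nothing.
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts


# Hypothetical utterances, each already split into words.
sents = [["I", "want", "more"], ["want", "more", "juice"]]
print(word_ngrams(sents, 2))
```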

words(participant='**ALL**', by_files=False)

Return a list of words by participant in all files.

Parameters:
  • participant – The participant(s) of interest; defaults to all participants. Pass a single code such as 'CHI' for the target child only, or a collection such as {'CHI', 'MOT'} to select several participants. The value is matched against participant codes as a regular expression, so 'CHI' matches exactly the participant code 'CHI'. For child-directed speech (i.e., all participants except 'CHI'), use ^(?!.*CHI).*$.
  • by_files – If True (default: False), return dict(absolute-path filename: X for that file) instead of X for all files altogether.
Return type:

list(str), or dict(str: list(str))