chat — Reading and parsing CHAT transcripts

The chat module defines two classes for reading and parsing CHAT transcripts:

Reader(*filenames, **kwargs)

A class for reading multiple CHAT files.

Users rarely need to instantiate Reader or SingleReader directly. In most cases, pylangacq.read_chat() is sufficient: it returns a Reader object, which relies on SingleReader to handle individual data files.

Most of the methods of interest belong to the Reader class. Many of them take the optional parameter by_files. By default, by_files is False and a given method X() returns a single result pooled over all the files in question. When by_files is True, the method instead returns a dict mapping each absolute-path filename to the result of X() for that file.
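
The by_files behaviour described above can be pictured with a toy stand-in. This is only a sketch of the return shapes, not the library's code; the function pooled_or_per_file and the file paths are hypothetical.

```python
# Hypothetical per-file data: absolute path -> list of words in that file.
words_by_file = {
    "/data/eve01.cha": ["more", "cookie"],
    "/data/eve02.cha": ["more", "juice", "please"],
}


def pooled_or_per_file(by_files=False):
    """Toy stand-in for a Reader data method that accepts by_files."""
    if by_files:
        # One result per file, keyed by absolute-path filename.
        return {path: list(ws) for path, ws in words_by_file.items()}
    # One pooled result across all files.
    return [w for ws in words_by_file.values() for w in ws]


pooled = pooled_or_per_file()
per_file = pooled_or_per_file(by_files=True)
```

With by_files left at False the caller gets one flat list; with by_files=True the same data comes back grouped under each filename.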

The Reader methods are categorized into Metadata methods and Data methods.

Metadata methods

filenames(self[, sorted_by_age])

Return the set of absolute-path filenames.

abspath(self, basename)

Return the absolute path of basename.

number_of_files(self)

Return the number of files.

number_of_utterances(self[, participant, …])

Return the number of utterances for participant in all files.

headers(self)

Return a dict mapping a file path to the headers of that file.

participants(self)

Return a dict mapping a file path to the file’s participant info.

participant_codes(self[, by_files])

Return the participant codes (e.g., {'CHI', 'MOT'}).

languages(self)

Return a map from a file path to the languages used.

dates_of_recording(self)

Return a map from a file path to the date of recording.

age(self[, participant, months])

Return a map from a file path to the participant’s age.

Data methods

index_to_tiers(self)

Return a dict mapping a file path to the file’s index_to_tiers dict.

utterances(self[, participant, exclude, …])

Return a list of (participant, utterance) pairs from all files.

words(self[, participant, exclude, by_files])

Return a list of words by participant in all files.

tagged_words(self[, participant, exclude, …])

Return a list of tagged words by participant in all files.

sents(self[, participant, exclude, by_files])

Return a list of sents by participant in all files.

tagged_sents(self[, participant, exclude, …])

Return a list of tagged sents by participant in all files.

part_of_speech_tags(self[, participant, …])

Return the part-of-speech tags in the data for participant.

word_frequency(self[, participant, exclude, …])

Return a word frequency counter for participant in all files.

word_ngrams(self, n[, participant, exclude, …])

Return a word n-gram counter by participant in all files.

search(self, search_item[, participant, …])

Return a list of elements containing search_item by participant.

concordance(self, search_item[, …])

Return a list of utterances with search_item for participant.

MLU(self[, participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUm(self[, participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUw(self[, participant])

Return a map from a file path to the file’s MLU by words.

TTR(self[, participant])

Return a map from a file path to the file’s TTR.

IPSyn(self[, participant])

Return a map from a file path to the file’s IPSyn.

update(self, reader)

Combine the current CHAT Reader instance with reader.

add(self, *filenames)

Add one or more CHAT filenames to the current reader.

remove(self, *filenames)

Remove one or more CHAT filenames from the current reader.

clear(self)

Clear everything and reset as an empty Reader instance.

The Reader class API

class pylangacq.chat.Reader(*filenames, **kwargs)

A class for reading multiple CHAT files.

Parameters
filenames : str or iterable of str, optional

One or more filenames. A filename may match exactly one CHAT file (e.g., 'eve01.cha') or match multiple files by glob patterns (e.g., 'eve*.cha', for 'eve01.cha', 'eve02.cha', etc.). * matches any number (including zero) of characters, while ? matches exactly one character. A filename can be either an absolute or relative path. If no filenames are provided, an empty Reader instance is created.

kwargs

Only the keyword encoding is recognized, which defaults to ‘utf8’. (New in version 0.9)
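
The wildcard rules for filenames can be tried out with the standard library's fnmatch module, which follows the same * and ? semantics described above. This is an illustration only; whether Reader uses fnmatch or glob internally is not stated here, and the candidate filenames are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames in a data directory.
candidates = ["eve01.cha", "eve02.cha", "eve.cha", "eve1.cha", "adam01.cha"]

# * matches any number of characters, including zero.
star = [name for name in candidates if fnmatchcase(name, "eve*.cha")]

# ? matches exactly one character.
question = [name for name in candidates if fnmatchcase(name, "eve0?.cha")]
```

Here star keeps every name starting with "eve" (including 'eve.cha', where * matches nothing), while question keeps only 'eve01.cha' and 'eve02.cha'.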

Methods

IPSyn(self[, participant])

Return a map from a file path to the file’s IPSyn.

MLU(self[, participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUm(self[, participant])

Return a map from a file path to the file’s MLU by morphemes.

MLUw(self[, participant])

Return a map from a file path to the file’s MLU by words.

TTR(self[, participant])

Return a map from a file path to the file’s TTR.

abspath(self, basename)

Return the absolute path of basename.

add(self, *filenames)

Add one or more CHAT filenames to the current reader.

age(self[, participant, months])

Return a map from a file path to the participant’s age.

clear(self)

Clear everything and reset as an empty Reader instance.

concordance(self, search_item[, …])

Return a list of utterances with search_item for participant.

date_of_birth(self)

Return a map from a file path to the date of birth.

dates_of_recording(self)

Return a map from a file path to the date of recording.

filenames(self[, sorted_by_age])

Return the set of absolute-path filenames.

from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

from_chat_str(chat_str[, encoding])

Create a Reader object with CHAT data as a string.

headers(self)

Return a dict mapping a file path to the headers of that file.

index_to_tiers(self)

Return a dict mapping a file path to the file’s index_to_tiers dict.

languages(self)

Return a map from a file path to the languages used.

number_of_files(self)

Return the number of files.

number_of_utterances(self[, participant, …])

Return the number of utterances for participant in all files.

part_of_speech_tags(self[, participant, …])

Return the part-of-speech tags in the data for participant.

participant_codes(self[, by_files])

Return the participant codes (e.g., {'CHI', 'MOT'}).

participants(self)

Return a dict mapping a file path to the file’s participant info.

remove(self, *filenames)

Remove one or more CHAT filenames from the current reader.

search(self, search_item[, participant, …])

Return a list of elements containing search_item by participant.

sents(self[, participant, exclude, by_files])

Return a list of sents by participant in all files.

tagged_sents(self[, participant, exclude, …])

Return a list of tagged sents by participant in all files.

tagged_words(self[, participant, exclude, …])

Return a list of tagged words by participant in all files.

update(self, reader)

Combine the current CHAT Reader instance with reader.

utterances(self[, participant, exclude, …])

Return a list of (participant, utterance) pairs from all files.

word_frequency(self[, participant, exclude, …])

Return a word frequency counter for participant in all files.

word_ngrams(self, n[, participant, exclude, …])

Return a word n-gram counter by participant in all files.

words(self[, participant, exclude, by_files])

Return a list of words by participant in all files.

IPSyn(self, participant='CHI')

Return a map from a file path to the file’s IPSyn.

IPSyn = index of productive syntax

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: int)

MLU(self, participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLUm.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)

MLUm(self, participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLU.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)

MLUw(self, participant='CHI')

Return a map from a file path to the file’s MLU by words.

MLU = mean length of utterance.

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
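
As a rough illustration of MLU by words, the sketch below averages the number of words per utterance. It is not the library's implementation: the helper name is hypothetical, and details such as how punctuation or unintelligible utterances are treated are ignored here.

```python
def mean_words_per_utterance(utterances):
    """Average number of words per utterance; 0.0 if there are none."""
    if not utterances:
        return 0.0
    return sum(len(words) for words in utterances) / len(utterances)


# Hypothetical utterances, each already split into words.
sample_sents = [["more", "cookie"], ["want", "more", "juice"], ["no"]]
mlu_w = mean_words_per_utterance(sample_sents)
```

The three sample utterances contain 2, 3, and 1 words, so the value is 6 / 3.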

TTR(self, participant='CHI')

Return a map from a file path to the file’s TTR.

TTR = type-token ratio

Parameters
participant : str, optional

The specified participant (default to 'CHI').

Returns
dict(str: float)
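
The type-token ratio is the number of distinct word types divided by the number of word tokens. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and any filtering the library applies before counting is not modelled.

```python
def type_token_ratio(tokens):
    """Distinct word types divided by total word tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical word tokens from one participant.
sample_tokens = ["more", "cookie", "more", "juice"]
ttr = type_token_ratio(sample_tokens)
```

The sample has three types ("more", "cookie", "juice") over four tokens.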

abspath(self, basename)

Return the absolute path of basename.

Parameters
basename : str

The basename (e.g., “foobar.cha”) of the desired data file.

Returns
str

add(self, *filenames)

Add one or more CHAT filenames to the current reader.

Parameters
*filenames

Filenames may take glob patterns with wildcards * and ?.

age(self, participant='CHI', months=False)

Return a map from a file path to the participant’s age.

The age is in the form of (years, months, days).

Parameters
participant : str, optional

The specified participant (default to 'CHI').

months : bool, optional

If True, age is in months.

Returns
dict(str: tuple(int, int, int)) or dict(str: float)
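
The two return shapes of age() differ only in how the same age is expressed. The sketch below converts a (years, months, days) tuple into a single number of months. The 30-day month is an assumption made for illustration, and the helper name is hypothetical; the library's exact conversion is not given here.

```python
def age_tuple_to_months(age, days_per_month=30):
    """Convert (years, months, days) to months as a float.

    days_per_month=30 is an assumed convention, not taken from the docs.
    """
    years, months, days = age
    return years * 12 + months + days / days_per_month


eighteen = age_tuple_to_months((1, 6, 0))
twenty_seven_and_a_half = age_tuple_to_months((2, 3, 15))
```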

clear(self)

Clear everything and reset as an empty Reader instance.

concordance(self, search_item, participant=None, exclude=None, match_entire_word=True, lemma=False, by_files=False)

Return a list of utterances with search_item for participant.

All strings are aligned for search_item by space padding to create the word concordance effect.

Parameters
search_item : str

Word or lemma to search for.

match_entire_word : bool, optional

If False (default: True), substring matching is performed.

lemma : bool, optional

If True (default: False), search_item refers to the lemma (from “mor” in the tagged word) instead.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list, or dict(str: list)
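
The space-padding idea behind a concordance can be sketched as follows: the left context of every hit is right-justified to a common width so that search_item starts in the same column on every line. This is an illustration with a hypothetical helper, not the library's code, and it only covers whole-word matching.

```python
def align_concordance(utterances, search_item):
    """Return one padded string per hit, with search_item column-aligned."""
    hits = []
    for words in utterances:
        for i, word in enumerate(words):
            if word == search_item:
                left = " ".join(words[:i])
                right = " ".join(words[i + 1:])
                hits.append((left, word, right))
    if not hits:
        return []
    # Pad every left context to the width of the longest one.
    width = max(len(left) for left, _, _ in hits)
    return [
        f"{left.rjust(width)} {word} {right}".rstrip()
        for left, word, right in hits
    ]


# Hypothetical utterances, each already split into words.
sample_utterances = [
    ["I", "want", "more", "cookie"],
    ["more", "juice"],
    ["no", "more"],
]
lines = align_concordance(sample_utterances, "more")
```

Printing the lines one per row shows "more" stacked in a single column, which is the concordance effect the method description refers to.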

date_of_birth(self)

Return a map from a file path to the date of birth.

Returns
dict(str: dict(str: tuple(int, int, int)))

dates_of_recording(self)

Return a map from a file path to the date of recording.

The date of recording is in the form of (year, month, day).

Returns
dict(str: list(tuple(int, int, int)))

filenames(self, sorted_by_age=False)

Return the set of absolute-path filenames.

Parameters
sorted_by_age : bool, optional

Whether to return the filenames as a list sorted by the target child’s age.

Returns
set of str or list of str

classmethod from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

Parameters
filenames : str or iterable of str, optional

One or more filenames. A filename may match exactly one CHAT file (e.g., 'eve01.cha') or match multiple files by glob patterns (e.g., 'eve*.cha', for 'eve01.cha', 'eve02.cha', etc.). * matches any number (including zero) of characters, while ? matches exactly one character. A filename can be either an absolute or relative path. If no filenames are provided, an empty Reader instance is created.

kwargs

Only the keyword encoding is recognized, which defaults to ‘utf8’. (New in version 0.9)

Returns
Reader

Notes

Because CHAT data most likely comes as files on disk, an equivalent library top-level function pylangacq.read_chat is defined for convenience.

classmethod from_chat_str(chat_str, encoding='utf8')

Create a Reader object with CHAT data as a string.

Parameters
chat_str : str

CHAT data as an in-memory string. It would be what a single CHAT data file contains.

encoding : str, optional

Encoding of the CHAT data (default to 'utf8').

Returns
Reader

headers(self)

Return a dict mapping a file path to the headers of that file.

Returns
dict(str: dict)

index_to_tiers(self)

Return a dict mapping a file path to the file’s index_to_tiers dict.

Returns
dict(str: dict)

languages(self)

Return a map from a file path to the languages used.

Returns
dict(str: list(str))

number_of_files(self)

Return the number of files.

Returns
int

number_of_utterances(self, participant=None, exclude=None, by_files=False)

Return the number of utterances for participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
int or dict(str: int)

part_of_speech_tags(self, participant=None, exclude=None, by_files=False)

Return the part-of-speech tags in the data for participant.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
set or dict(str: set)

participant_codes(self, by_files=False)

Return the participant codes (e.g., {'CHI', 'MOT'}).

Parameters
by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
set(str) or dict(str: set(str))

participants(self)

Return a dict mapping a file path to the file’s participant info.

Returns
dict(str: dict)

remove(self, *filenames)

Remove one or more CHAT filenames from the current reader.

Parameters
*filenames

Filenames may take glob patterns with wildcards * and ?.

search(self, search_item, participant=None, exclude=None, match_entire_word=True, lemma=False, output_tagged=True, output_sents=True, by_files=False)

Return a list of elements containing search_item by participant.

Parameters
search_item : str

Word or lemma to search for.

match_entire_word : bool, optional

Whether to match for the entire word.

lemma : bool, optional

Whether the search_item refers to the lemma (from “mor” in the tagged word) instead.

output_tagged : bool, optional

Whether a word in the return object is a tagged word of the (word, pos, mor, rel) tuple; otherwise just a word string.

output_sents : bool, optional

Whether each element in the return object is a list for each utterance; otherwise each element is a word (tagged or untagged) without the utterance structure.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list or dict(str: list)

sents(self, participant=None, exclude=None, by_files=False)

Return a list of sents by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(list(str)) or dict(str: list(list(str)))

tagged_sents(self, participant=None, exclude=None, by_files=False)

Return a list of tagged sents by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(list(tuple)) or dict(str: list(list(tuple)))

tagged_words(self, participant=None, exclude=None, by_files=False)

Return a list of tagged words by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(tuple) or dict(str: list(tuple))

update(self, reader)

Combine the current CHAT Reader instance with reader.

Parameters
reader : Reader

utterances(self, participant=None, exclude=None, clean=True, by_files=False)

Return a list of (participant, utterance) pairs from all files.

Parameters
clean : bool, optional

Whether to filter away the CHAT annotations in the utterance.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(tuple(str, str)) or dict(str: list(tuple(str, str)))
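
The participant and exclude parameters recur across the data methods. The sketch below shows one plausible way to resolve them into a set of participant codes and filter (participant, utterance) pairs. The helper is hypothetical, and letting exclude win when a code appears in both parameters is an assumption, not documented behaviour.

```python
def resolve_participants(all_codes, participant=None, exclude=None):
    """Return the set of participant codes to keep."""

    def as_set(value):
        # Accept None, a single code, or an iterable of codes.
        if value is None:
            return set()
        if isinstance(value, str):
            return {value}
        return set(value)

    # None means "all participants"; otherwise keep only known codes.
    if participant is None:
        included = set(all_codes)
    else:
        included = as_set(participant) & set(all_codes)
    # Assumption: exclusion takes precedence over inclusion.
    return included - as_set(exclude)


# Hypothetical (participant, utterance) pairs.
pairs = [
    ("CHI", "more cookie ."),
    ("MOT", "you want more ?"),
    ("FAT", "here you go ."),
]
codes = {code for code, _ in pairs}

keep_child = resolve_participants(codes, participant="CHI")
child_only = [pair for pair in pairs if pair[0] in keep_child]

keep_adults = resolve_participants(codes, exclude="CHI")
adults_only = [pair for pair in pairs if pair[0] in keep_adults]
```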

word_frequency(self, participant=None, exclude=None, keep_case=True, by_files=False)

Return a word frequency counter for participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns
Counter, or dict(str: Counter)
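
The effect of keep_case on a frequency counter can be seen with collections.Counter, the type named in the Returns entry. The helper count_words is a hypothetical stand-in, not the library's method.

```python
from collections import Counter


def count_words(words, keep_case=True):
    """Count word tokens, optionally folding everything to lowercase."""
    if not keep_case:
        words = [word.lower() for word in words]
    return Counter(words)


# Hypothetical word tokens.
sample_words = ["The", "dog", "the", "dog"]
case_kept = count_words(sample_words)
case_folded = count_words(sample_words, keep_case=False)
```

With keep_case=True, "The" and "the" are counted separately; with keep_case=False they merge into a single entry.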

word_ngrams(self, n, participant=None, exclude=None, keep_case=True, by_files=False)

Return a word n-gram counter by participant in all files.

Parameters
n : int

The number of words in each n-gram.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns
Counter, or dict(str: Counter)
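
A word n-gram counter can be sketched with a sliding window over each utterance. The helper count_ngrams is hypothetical, and counting within utterances only (never across an utterance boundary) is an assumption made for this illustration.

```python
from collections import Counter


def count_ngrams(sents, n, keep_case=True):
    """Count n-grams (as tuples of words) within each utterance."""
    counter = Counter()
    for sent in sents:
        if not keep_case:
            sent = [word.lower() for word in sent]
        # Slide a window of length n across the utterance.
        for i in range(len(sent) - n + 1):
            counter[tuple(sent[i:i + n])] += 1
    return counter


# Hypothetical utterances, each already split into words.
ngram_sents = [["the", "dog", "barks"], ["The", "dog", "sleeps"]]
bigrams = count_ngrams(ngram_sents, 2)
bigrams_folded = count_ngrams(ngram_sents, 2, keep_case=False)
```

As with word frequencies, folding case merges ("The", "dog") into ("the", "dog").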

words(self, participant=None, exclude=None, by_files=False)

Return a list of words by participant in all files.

Parameters
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns
list(str) or dict(str: list(str))