chat — Reading and parsing CHAT transcripts

The chat module defines two classes for reading and parsing CHAT transcripts:

Reader(*filenames, **kwargs) A class for reading multiple CHAT files.
SingleReader(filename[, encoding]) A class for reading a single CHAT file.

Users do not usually need to create Reader or SingleReader objects directly in their code. In most circumstances, pylangacq.read_chat() is sufficient; under the hood, this function returns a Reader object, which relies on SingleReader to handle individual data files.

Most of the methods of interest belong to the Reader class. Many of them take the optional parameter by_files. By default, by_files is False, and a given method X() returns its result pooled across all the files in question. When by_files is True, the return object is instead dict(absolute-path filename: X() for that file).
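
As a rough illustration of the two return shapes, the sketch below uses plain Python data rather than the library itself. The file paths and word lists are made up, pooled_words is a hypothetical helper (not part of pylangacq), and pooling in sorted filename order is an assumption.

```python
# Hypothetical per-file results, standing in for what X(by_files=True) returns.
words_by_file = {
    "/data/eve01.cha": ["more", "cookie"],
    "/data/eve02.cha": ["more", "juice", "please"],
}


def pooled_words(by_file):
    """Pool per-file lists into one list, mimicking the by_files=False shape."""
    pooled = []
    # Assumption: files are visited in sorted filename order.
    for filename in sorted(by_file):
        pooled.extend(by_file[filename])
    return pooled


all_words = pooled_words(words_by_file)
print(all_words)
```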

The Reader methods are categorized into Metadata methods and Data methods.

Metadata methods

filenames([sorted_by_age]) Return the set of absolute-path filenames.
abspath(basename) Return the absolute path of basename.
number_of_files() Return the number of files.
number_of_utterances([participant, exclude, …]) Return the number of utterances for participant in all files.
headers() Return a dict mapping a file path to the headers of that file.
participants() Return a dict mapping a file path to the file’s participant info.
participant_codes([by_files]) Return the participant codes (e.g., {'CHI', 'MOT'}).
languages() Return a map from a file path to the languages used.
dates_of_recording() Return a map from a file path to the date of recording.
age([participant, months]) Return a map from a file path to the participant’s age.

Data methods

index_to_tiers() Return a dict mapping a file path to the file’s index_to_tiers dict.
utterances([participant, exclude, clean, …]) Return a list of (participant, utterance) pairs from all files.
words([participant, exclude, by_files]) Return a list of words by participant in all files.
tagged_words([participant, exclude, by_files]) Return a list of tagged words by participant in all files.
sents([participant, exclude, by_files]) Return a list of sents by participant in all files.
tagged_sents([participant, exclude, by_files]) Return a list of tagged sents by participant in all files.
part_of_speech_tags([participant, exclude, …]) Return the part-of-speech tags in the data for participant.
word_frequency([participant, exclude, …]) Return a word frequency counter for participant in all files.
word_ngrams(n[, participant, exclude, …]) Return a word n-gram counter by participant in all files.
search(search_item[, participant, exclude, …]) Return a list of elements containing search_item by participant.
concordance(search_item[, participant, …]) Return a list of utterances with search_item for participant.
MLU([participant]) Return a map from a file path to the file’s MLU by morphemes.
MLUm([participant]) Return a map from a file path to the file’s MLU by morphemes.
MLUw([participant]) Return a map from a file path to the file’s MLU by words.
TTR([participant]) Return a map from a file path to the file’s TTR.
IPSyn([participant]) Return a map from a file path to the file’s IPSyn.
update(reader) Combine the current CHAT Reader instance with reader.
add(*filenames) Add one or more CHAT filenames to the current reader.
remove(*filenames) Remove one or more CHAT filenames from the current reader.
clear() Clear everything and reset as an empty Reader instance.

The Reader class API

class pylangacq.chat.Reader(*filenames, **kwargs)

A class for reading multiple CHAT files.

Parameters:
filenames : str or iterable of str, optional

One or more filenames. A filename may match exactly one CHAT file (e.g., 'eve01.cha') or match multiple files by glob patterns (e.g., 'eve*.cha', for 'eve01.cha', 'eve02.cha', etc.). * matches any number (including zero) of characters, while ? matches exactly one character. A filename can be either an absolute or relative path. If no filenames are provided, an empty Reader instance is created.

kwargs

Only the keyword encoding is recognized, which defaults to ‘utf8’. (New in version 0.9)
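
The wildcard behavior described above can be tried out with the standard-library fnmatch module, which follows the same * and ? conventions. This is only a sketch of the pattern semantics; how Reader resolves patterns against the file system internally is not shown here, and the candidate filenames are made up.

```python
from fnmatch import fnmatchcase

# Made-up filenames to test the patterns against.
candidates = ["eve01.cha", "eve02.cha", "eve.cha", "adam01.cha"]

# "*" matches any number of characters, including zero.
star_matches = [name for name in candidates if fnmatchcase(name, "eve*.cha")]

# "?" matches exactly one character.
question_matches = [name for name in candidates if fnmatchcase(name, "eve0?.cha")]

print(star_matches)
print(question_matches)
```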

Methods

IPSyn([participant]) Return a map from a file path to the file’s IPSyn.
MLU([participant]) Return a map from a file path to the file’s MLU by morphemes.
MLUm([participant]) Return a map from a file path to the file’s MLU by morphemes.
MLUw([participant]) Return a map from a file path to the file’s MLU by words.
TTR([participant]) Return a map from a file path to the file’s TTR.
abspath(basename) Return the absolute path of basename.
add(*filenames) Add one or more CHAT filenames to the current reader.
age([participant, months]) Return a map from a file path to the participant’s age.
clear() Clear everything and reset as an empty Reader instance.
concordance(search_item[, participant, …]) Return a list of utterances with search_item for participant.
date_of_birth() Return a map from a file path to the date of birth.
dates_of_recording() Return a map from a file path to the date of recording.
filenames([sorted_by_age]) Return the set of absolute-path filenames.
from_chat_files(*filenames, **kwargs) Create a Reader object with CHAT data files.
from_chat_str(chat_str[, encoding]) Create a Reader object with CHAT data as a string.
headers() Return a dict mapping a file path to the headers of that file.
index_to_tiers() Return a dict mapping a file path to the file’s index_to_tiers dict.
languages() Return a map from a file path to the languages used.
number_of_files() Return the number of files.
number_of_utterances([participant, exclude, …]) Return the number of utterances for participant in all files.
part_of_speech_tags([participant, exclude, …]) Return the part-of-speech tags in the data for participant.
participant_codes([by_files]) Return the participant codes (e.g., {'CHI', 'MOT'}).
participants() Return a dict mapping a file path to the file’s participant info.
remove(*filenames) Remove one or more CHAT filenames from the current reader.
search(search_item[, participant, exclude, …]) Return a list of elements containing search_item by participant.
sents([participant, exclude, by_files]) Return a list of sents by participant in all files.
tagged_sents([participant, exclude, by_files]) Return a list of tagged sents by participant in all files.
tagged_words([participant, exclude, by_files]) Return a list of tagged words by participant in all files.
update(reader) Combine the current CHAT Reader instance with reader.
utterances([participant, exclude, clean, …]) Return a list of (participant, utterance) pairs from all files.
word_frequency([participant, exclude, …]) Return a word frequency counter for participant in all files.
word_ngrams(n[, participant, exclude, …]) Return a word n-gram counter by participant in all files.
words([participant, exclude, by_files]) Return a list of words by participant in all files.

IPSyn(participant='CHI')

Return a map from a file path to the file’s IPSyn.

IPSyn = index of productive syntax

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

Returns:
dict(str: int)

MLU(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLUm.

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

Returns:
dict(str: float)

MLUm(participant='CHI')

Return a map from a file path to the file’s MLU by morphemes.

MLU = mean length of utterance. This method is identical to MLU.

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

Returns:
dict(str: float)

MLUw(participant='CHI')

Return a map from a file path to the file’s MLU by words.

MLU = mean length of utterance.

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

Returns:
dict(str: float)
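
To make the word-based measure concrete, here is a minimal sketch that averages utterance lengths over a made-up list of sents. The helper mean_length_of_utterance is hypothetical; the library may count tokens differently (for example, in how it treats punctuation), so this should be read as the general idea rather than its exact computation.

```python
def mean_length_of_utterance(sents):
    """Average number of words per utterance; 0.0 if there are no utterances."""
    if not sents:
        return 0.0
    return sum(len(sent) for sent in sents) / len(sents)


# Made-up child utterances, each a list of word strings.
child_sents = [["more", "cookie"], ["want", "more", "juice"], ["no"]]

mluw = mean_length_of_utterance(child_sents)
print(mluw)
```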

TTR(participant='CHI')

Return a map from a file path to the file’s TTR.

TTR = type-token ratio

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

Returns:
dict(str: float)
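
The type-token ratio is the number of distinct word types divided by the number of word tokens. The sketch below shows that ratio on a made-up word list; type_token_ratio is a hypothetical helper, and any normalization the library applies before counting (such as case folding) is not modeled.

```python
def type_token_ratio(words):
    """Distinct word types divided by total word tokens; 0.0 for no words."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Five tokens, four types ("the" occurs twice).
sample_words = ["the", "dog", "saw", "the", "cat"]

ttr = type_token_ratio(sample_words)
print(ttr)
```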

abspath(basename)

Return the absolute path of basename.

Parameters:
basename : str

The basename (e.g., “foobar.cha”) of the desired data file.

Returns:
str

add(*filenames)

Add one or more CHAT filenames to the current reader.

Parameters:
*filenames

Filenames may take glob patterns with wildcards * and ?.

age(participant='CHI', months=False)

Return a map from a file path to the participant’s age.

The age is in the form of (years, months, days).

Parameters:
participant : str, optional

The specified participant (defaults to 'CHI').

months : bool, optional

If True, the age is given in months as a float instead of a (years, months, days) tuple.

Returns:
dict(str: tuple(int, int, int)) or dict(str: float)
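
The relationship between the two return forms can be sketched as a conversion from a (years, months, days) tuple to a float number of months. The helper age_in_months is hypothetical, and treating a month as 30 days is an assumption; the library's own conversion may differ.

```python
def age_in_months(age):
    """Convert a (years, months, days) tuple to months as a float.

    Assumption: a month is treated as 30 days.
    """
    years, months, days = age
    return years * 12 + months + days / 30


print(age_in_months((1, 6, 0)))
print(age_in_months((2, 3, 15)))
```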

clear()

Clear everything and reset as an empty Reader instance.

concordance(search_item, participant=None, exclude=None, match_entire_word=True, lemma=False, by_files=False)

Return a list of utterances with search_item for participant.

All strings are aligned for search_item by space padding to create the word concordance effect.

Parameters:
search_item : str

Word or lemma to search for.

match_entire_word : bool, optional

If False (default: True), substring matching is performed.

lemma : bool, optional

If True (default: False), search_item refers to the lemma (from “mor” in the tagged word) instead.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list, or dict(str: list)
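
The space-padding idea behind the concordance effect can be sketched as follows. The helper align_concordance and the sample utterances are hypothetical, and the exact layout of the library's output strings may differ; the point is only that the left context is padded so the search item lines up in one column.

```python
def align_concordance(sents, search_item):
    """Pad the left context so that search_item starts in the same column."""
    hits = []
    for sent in sents:
        for i, word in enumerate(sent):
            if word == search_item:
                # Split the utterance into the context before the hit
                # and the hit plus everything after it.
                hits.append((" ".join(sent[:i]), " ".join(sent[i:])))
    # Width of the longest left context decides the amount of padding.
    width = max((len(left) for left, _ in hits), default=0)
    return [f"{left:>{width}} {right}" for left, right in hits]


sents = [
    ["I", "want", "cookie"],
    ["cookie", "please"],
    ["more", "cookie", "now"],
]

lines = align_concordance(sents, "cookie")
for line in lines:
    print(line)
```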

date_of_birth()

Return a map from a file path to the date of birth.

Returns:
dict(str: dict(str: tuple(int, int, int)))

dates_of_recording()

Return a map from a file path to the date of recording.

The date of recording is in the form of (year, month, day).

Returns:
dict(str: list(tuple(int, int, int)))

filenames(sorted_by_age=False)

Return the set of absolute-path filenames.

Parameters:
sorted_by_age : bool, optional

Whether to return the filenames as a list sorted by the target child’s age.

Returns:
set of str or list of str
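
The difference between the unordered set and the age-sorted list can be illustrated with plain Python. The paths and ages below are made up, and sorting by the (years, months, days) tuple is an assumption about how an age ordering could be derived, not a description of the library's internals.

```python
# Hypothetical mapping from file path to the target child's age.
ages = {
    "/data/eve02.cha": (1, 7, 0),
    "/data/eve01.cha": (1, 6, 0),
    "/data/eve03.cha": (1, 8, 0),
}

# Without sorting, the filenames form an unordered set.
filenames_as_set = set(ages)

# With sorting, tuples compare by years, then months, then days.
filenames_by_age = sorted(ages, key=lambda filename: ages[filename])

print(filenames_by_age)
```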

classmethod from_chat_files(*filenames, **kwargs)

Create a Reader object with CHAT data files.

Parameters:
filenames : str or iterable of str, optional

One or more filenames. A filename may match exactly one CHAT file (e.g., 'eve01.cha') or match multiple files by glob patterns (e.g., 'eve*.cha', for 'eve01.cha', 'eve02.cha', etc.). * matches any number (including zero) of characters, while ? matches exactly one character. A filename can be either an absolute or relative path. If no filenames are provided, an empty Reader instance is created.

kwargs

Only the keyword encoding is recognized, which defaults to ‘utf8’. (New in version 0.9)

Returns:
Reader

Notes

Because CHAT data most likely comes as files on disk, an equivalent library top-level function pylangacq.read_chat is defined for convenience.

classmethod from_chat_str(chat_str, encoding='utf8')

Create a Reader object with CHAT data as a string.

Parameters:
chat_str : str

CHAT data as an in-memory string, i.e., the content of a single CHAT data file.

encoding

Encoding of the CHAT data

Returns:
Reader

headers()

Return a dict mapping a file path to the headers of that file.

Returns:
dict(str: dict)

index_to_tiers()

Return a dict mapping a file path to the file’s index_to_tiers dict.

Returns:
dict(str: dict)

languages()

Return a map from a file path to the languages used.

Returns:
dict(str: list(str))

number_of_files()

Return the number of files.

Returns:
int

number_of_utterances(participant=None, exclude=None, by_files=False)

Return the number of utterances for participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
int or dict(str: int)
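
The participant and exclude parameters recur across many methods. The sketch below shows one plausible reading of how they combine, using made-up (participant, utterance) pairs. The helper select_participants is hypothetical, and applying exclude after participant when both are given is an assumption.

```python
def select_participants(pairs, participant=None, exclude=None):
    """Filter (participant, utterance) pairs by participant codes."""

    def as_set(value):
        # Accept None, a single code, or an iterable of codes.
        if value is None:
            return None
        if isinstance(value, str):
            return {value}
        return set(value)

    included = as_set(participant)
    excluded = as_set(exclude) or set()
    return [
        (code, utterance)
        for code, utterance in pairs
        if (included is None or code in included) and code not in excluded
    ]


pairs = [
    ("CHI", "more cookie"),
    ("MOT", "you want more ?"),
    ("CHI", "yes"),
    ("FAT", "here you go"),
]

print(len(select_participants(pairs)))                       # all participants
print(len(select_participants(pairs, participant="CHI")))    # one participant
print(len(select_participants(pairs, exclude={"CHI", "FAT"})))
```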

part_of_speech_tags(participant=None, exclude=None, by_files=False)

Return the part-of-speech tags in the data for participant.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
set or dict(str: set)

participant_codes(by_files=False)

Return the participant codes (e.g., {'CHI', 'MOT'}).

Parameters:
by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
set(str) or dict(str: set(str))

participants()

Return a dict mapping a file path to the file’s participant info.

Returns:
dict(str: dict)

remove(*filenames)

Remove one or more CHAT filenames from the current reader.

Parameters:
*filenames

Filenames may take glob patterns with wildcards * and ?.

search(search_item, participant=None, exclude=None, match_entire_word=True, lemma=False, output_tagged=True, output_sents=True, by_files=False)

Return a list of elements containing search_item by participant.

Parameters:
search_item : str

Word or lemma to search for.

match_entire_word : bool, optional

Whether to match the entire word. If False, substring matching is performed.

lemma : bool, optional

Whether search_item refers to the lemma (from “mor” in the tagged word) instead of the word itself.

output_tagged : bool, optional

Whether a word in the return object is a tagged word, i.e., a (word, pos, mor, rel) tuple; otherwise it is just a word string.

output_sents : bool, optional

Whether each element in the return object is a list for each utterance; otherwise each element is a word (tagged or untagged) without the utterance structure.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list or dict(str: list)
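
How the four boolean options interact can be sketched on made-up tagged sents. Everything here is illustrative: search_sketch is a hypothetical helper, the tag values are invented, rel is left as a None placeholder, and the mor field is assumed to hold the bare lemma, which simplifies whatever the library actually stores there.

```python
def search_sketch(tagged_sents, search_item, match_entire_word=True,
                  lemma=False, output_tagged=True, output_sents=True):
    """Collect utterances (or words) whose word or lemma matches search_item."""
    results = []
    for sent in tagged_sents:
        hits = []
        for tagged_word in sent:
            word, pos, mor, rel = tagged_word
            # Assumption: mor holds the lemma directly.
            target = mor if lemma else word
            if match_entire_word:
                found = target == search_item
            else:
                found = search_item in target
            if found:
                hits.append(tagged_word)
        if not hits:
            continue
        if output_sents:
            # One element per matching utterance.
            results.append(list(sent) if output_tagged else [w[0] for w in sent])
        else:
            # One element per matching word, without utterance structure.
            results.extend(hits if output_tagged else [w[0] for w in hits])
    return results


tagged_sents = [
    [("cookies", "n", "cookie", None), ("please", "co", "please", None)],
    [("no", "co", "no", None)],
    [("cook", "v", "cook", None), ("it", "pro", "it", None)],
]

by_lemma = search_sketch(tagged_sents, "cookie", lemma=True, output_tagged=False)
by_substring = search_sketch(tagged_sents, "cook", match_entire_word=False,
                             output_tagged=False, output_sents=False)
print(by_lemma)
print(by_substring)
```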

sents(participant=None, exclude=None, by_files=False)

Return a list of sents by participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list(list(str)) or dict(str: list(list(str)))

tagged_sents(participant=None, exclude=None, by_files=False)

Return a list of tagged sents by participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list(list(tuple)) or dict(str: list(list(tuple)))

tagged_words(participant=None, exclude=None, by_files=False)

Return a list of tagged words by participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list(tuple) or dict(str: list(tuple))

update(reader)

Combine the current CHAT Reader instance with reader.

Parameters:
reader : Reader

utterances(participant=None, exclude=None, clean=True, by_files=False)

Return a list of (participant, utterance) pairs from all files.

Parameters:
clean : bool, optional

Whether to filter away the CHAT annotations in the utterance.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list(tuple(str, str)) or dict(str: list(tuple(str, str)))

word_frequency(participant=None, exclude=None, keep_case=True, by_files=False)

Return a word frequency counter for participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns:
Counter, or dict(str: Counter)
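
The effect of keep_case on the counts can be shown with collections.Counter from the standard library. The helper count_words and the word list are hypothetical; only the case handling described above is modeled.

```python
from collections import Counter


def count_words(words, keep_case=True):
    """Count word tokens, optionally folding everything to lowercase."""
    if not keep_case:
        words = [word.lower() for word in words]
    return Counter(words)


sample_words = ["The", "dog", "saw", "the", "cat", "The"]

case_sensitive = count_words(sample_words)
case_folded = count_words(sample_words, keep_case=False)
print(case_sensitive)
print(case_folded)
```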

word_ngrams(n, participant=None, exclude=None, keep_case=True, by_files=False)

Return a word n-gram counter by participant in all files.

Parameters:
n : int

The number of words in each n-gram.

participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase.

Returns:
Counter, or dict(str: Counter)
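
A word n-gram counter can be sketched as a Counter keyed by tuples of n consecutive words. The helper count_ngrams is hypothetical, and counting n-grams within each utterance only (never across utterance boundaries) is an assumption about the intended behavior.

```python
from collections import Counter


def count_ngrams(sents, n, keep_case=True):
    """Count tuples of n consecutive words within each utterance."""
    counter = Counter()
    for sent in sents:
        if not keep_case:
            sent = [word.lower() for word in sent]
        # An utterance shorter than n words contributes no n-grams.
        for i in range(len(sent) - n + 1):
            counter[tuple(sent[i:i + n])] += 1
    return counter


sents = [["more", "cookie", "please"], ["more", "cookie"]]

bigrams = count_ngrams(sents, 2)
print(bigrams)
```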

words(participant=None, exclude=None, by_files=False)

Return a list of words by participant in all files.

Parameters:
participant : str or iterable of str, optional

Participants of interest. If unspecified or None, all participants are included.

exclude : str or iterable of str, optional

Participants to exclude. If unspecified or None, no participants are excluded.

by_files : bool, optional

If True, return dict(absolute-path filename: X for that file) instead of X for all files altogether.

Returns:
list(str) or dict(str: list(str))