API Reference

read_chat()

pylangacq.read_chat(path: str, match: str = None, exclude: str = None, encoding: str = 'utf-8', cls: type = <class 'pylangacq.chat.Reader'>) → pylangacq.chat.Reader[source]

Create a reader of CHAT data.

Parameters
path : str

A path that points to one of the following:

  • ZIP file. Either a local .zip file path or a URL (one that begins with "https://" or "http://"). Example of a URL: "https://childes.talkbank.org/data/Eng-NA/Brown.zip"

  • A local directory. CHAT files under this directory are read recursively.

  • A single .cha CHAT file.

match : str, optional

If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES is in a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.

exclude : str, optional

If provided, the file paths that match this string (by regular expression matching) are excluded for reading and parsing.

encoding : str, optional

Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.

cls : type, optional

Either Reader (the default), or a subclass of it that expects the same arguments for the methods from_zip(), from_dir(), and from_files(). Pass in your own Reader subclass for new or modified behavior of the returned reader object.

Returns
Reader
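
The effect of match and exclude can be pictured with a small stand-alone sketch. It is not pylangacq's implementation: the helper name filter_paths and the sample paths are hypothetical, and the use of re.search (a match anywhere in the path) is an assumption.

```python
import re


def filter_paths(paths, match=None, exclude=None):
    """Keep paths that satisfy `match` and do not satisfy `exclude`.

    Hypothetical helper; assumes regex *search* semantics.
    """
    kept = []
    for path in paths:
        if match is not None and not re.search(match, path):
            continue  # `match` given, but this path does not match it
        if exclude is not None and re.search(exclude, path):
            continue  # `exclude` given, and this path matches it
        kept.append(path)
    return kept


# Hypothetical file names following the Brown/<child>/xxx.cha layout.
paths = [
    "Brown/Adam/010100.cha",
    "Brown/Eve/010600.cha",
    "Brown/Sarah/020305.cha",
]
eve_only = filter_paths(paths, match="Eve")
without_adam = filter_paths(paths, exclude="Adam")
```

With the package installed, the corresponding call would be pylangacq.read_chat(path, match="Eve").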

Reader

class pylangacq.Reader[source]

A reader that handles CHAT data.

Methods

ages([participant, months])

Return the ages of the given participant in the data.

append(reader)

Append data from another reader.

append_left(reader)

Left-append data from another reader.

clear()

Remove all data from this reader.

dates_of_recording([by_files])

Return the dates of recording.

extend(readers)

Extend data from other readers.

extend_left(readers)

Left-extend data from other readers.

file_paths()

Return the file paths.

from_dir(path[, match, exclude, extension, …])

Instantiate a reader from a local directory with CHAT data files.

from_files(paths[, match, exclude, encoding])

Instantiate a reader from local CHAT data files.

from_strs(strs[, ids])

Instantiate a reader from in-memory CHAT data strings.

from_zip(path[, match, exclude, extension, …])

Instantiate a reader from a local or remote ZIP file.

headers()

Return the headers.

ipsyn([participant])

Return the indexes of productive syntax (IPSyn).

languages([by_files])

Return the languages in the data.

mlu([participant])

Return the mean lengths of utterance (MLU).

mlum([participant])

Return the mean lengths of utterance by morphemes.

mluw([participant])

Return the mean lengths of utterance by words.

n_files()

Return the number of files.

participants([by_files])

Return the participants (e.g., CHI, MOT).

pop()

Drop the last data file from the reader and return it as a reader.

pop_left()

Drop the first data file from the reader and return it as a reader.

sents([participants, exclude, by_files])

Return the sents.

tagged_sents([participants, exclude, by_files])

Return the tagged sents.

tagged_words([participants, exclude, by_files])

Return the tagged words.

tokens([participants, exclude, …])

Return the tokens.

ttr([keep_case, participant])

Return the type-token ratios (TTR).

utterances([participants, exclude, by_files])

Return the utterances.

word_frequencies([keep_case, participants, …])

Return word frequencies.

word_ngrams(n[, keep_case, participants, …])

Return word ngrams.

words([participants, exclude, …])

Return the words.

ages(participant='CHI', months=False) → Union[List[Tuple[int, int, int]], List[float]][source]

Return the ages of the given participant in the data.

Parameters
participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

months : bool, optional

If False (the default), age is represented as a tuple of (years, months, days), e.g., “1;06.00” in CHAT becomes (1, 6, 0). If True, age is a float for the number of months, e.g., “1;06.00” in CHAT becomes 18.0 for 18 months.

Returns
List[Tuple[int, int, int]] if months is False, otherwise List[float]
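
The two age representations can be illustrated with a minimal sketch. The helper names are hypothetical, only fully specified CHAT ages such as "1;06.00" are handled, and counting a month as 30 days when converting the day component is an assumption rather than documented behavior.

```python
def parse_age(age_str):
    """Split a CHAT age like "1;06.00" into (years, months, days)."""
    years, rest = age_str.split(";")
    months, days = rest.split(".")
    return int(years), int(months), int(days)


def age_in_months(age, days_per_month=30):
    """Convert (years, months, days) to months as a float.

    `days_per_month=30` is an assumption made for this sketch.
    """
    years, months, days = age
    return years * 12 + months + days / days_per_month


age_tuple = parse_age("1;06.00")
age_months = age_in_months(age_tuple)
```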
append(reader: pylangacq.chat.Reader) → None[source]

Append data from another reader.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters
reader : Reader

A reader from which to append data

append_left(reader: pylangacq.chat.Reader) → None[source]

Left-append data from another reader.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters
reader : Reader

A reader from which to left-append data
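
As an analogy only, append() and append_left() behave like the corresponding operations on a double-ended queue of files. The sketch below uses collections.deque with hypothetical file paths; it does not claim to mirror how Reader stores its data.

```python
from collections import deque

# Hypothetical file paths standing in for the data files held by a reader.
files = deque(["a.cha", "b.cha"])

files.append("b.cha")      # like append(): the duplicate path is kept as-is
files.appendleft("z.cha")  # like append_left(): new data goes to the front

ordered = list(files)
```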

clear() → None[source]

Remove all data from this reader.

dates_of_recording(by_files=False) → Union[Set[datetime.date], List[Set[datetime.date]]][source]

Return the dates of recording.

Parameters
by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
Set[datetime.date] if by_files is False,
otherwise List[Set[datetime.date]]
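
The relationship between the two return shapes can be sketched as follows. The dates are made up for illustration, and taking the set union is simply one way to picture how the per-file results collapse when by_files is False.

```python
import datetime

# Hypothetical per-file results, as returned when by_files is True:
# one set of recording dates per file.
per_file = [
    {datetime.date(1962, 10, 15)},
    {datetime.date(1962, 10, 29), datetime.date(1962, 11, 12)},
]

# Collapsing the file distinction, as when by_files is False.
collapsed = set().union(*per_file)
```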
extend(readers: Iterable[pylangacq.chat.Reader]) → None[source]

Extend data from other readers.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters
readers : Iterable[Reader]

Readers from which to extend data

extend_left(readers: Iterable[pylangacq.chat.Reader]) → None[source]

Left-extend data from other readers.

New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.

Parameters
readers : Iterable[Reader]

Readers from which to left-extend data

file_paths() → List[str][source]

Return the file paths.

If the data comes from in-memory strings, then the “file paths” are randomly generated UUID strings.

Returns
List[str]
classmethod from_dir(path: str, match: str = None, exclude: str = None, extension: str = '.cha', encoding: str = 'utf-8') → pylangacq.chat.Reader[source]

Instantiate a reader from a local directory with CHAT data files.

Parameters
path : str

Local directory that contains CHAT data files. Files are searched for recursively under this directory, and those that satisfy match and extension are parsed and handled by the reader.

match : str, optional

If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES is in a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.

exclude : str, optional

If provided, the file paths that match this string (by regular expression matching) are excluded for reading and parsing.

encoding : str, optional

Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.

extension : str, optional

File extension for CHAT data files. The default value is ".cha".

Returns
pylangacq.Reader
classmethod from_files(paths: List[str], match: str = None, exclude: str = None, encoding: str = 'utf-8') → pylangacq.chat.Reader[source]

Instantiate a reader from local CHAT data files.

Parameters
paths : List[str]

List of local file paths of the CHAT data. The ordering of the paths determines that of the parsed CHAT data in the resulting reader.

match : str, optional

If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES is in a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.

exclude : str, optional

If provided, the file paths that match this string (by regular expression matching) are excluded for reading and parsing.

encoding : str, optional

Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.

Returns
pylangacq.Reader
classmethod from_strs(strs: List[str], ids: Optional[List[str]] = None) → pylangacq.chat.Reader[source]

Instantiate a reader from in-memory CHAT data strings.

Parameters
strs : List[str]

List of CHAT data strings. The ordering of the strings determines that of the parsed CHAT data in the resulting reader.

ids : List[str], optional

List of identifiers. If not provided, randomly generated UUID strings are used. When file paths are referred to in other parts of this package, they mean these identifiers if you have instantiated the reader by this method.

Returns
pylangacq.Reader
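
How identifiers stand in for file paths can be sketched with the standard uuid module. The helper resolve_ids is hypothetical, and both the choice of uuid4 and the length check are assumptions of this sketch, not documented behavior.

```python
import uuid


def resolve_ids(strs, ids=None):
    """Return one identifier per CHAT string (hypothetical helper)."""
    if ids is None:
        # No identifiers given: fall back to random UUID strings.
        return [str(uuid.uuid4()) for _ in strs]
    ids = list(ids)
    if len(ids) != len(strs):
        # Assumption of this sketch: the two lists must line up one-to-one.
        raise ValueError("ids and strs must have the same length")
    return ids


chat_strs = ["@UTF8\n@Begin\n@End\n"]
generated = resolve_ids(chat_strs)
named = resolve_ids(chat_strs, ids=["session-1"])
```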
classmethod from_zip(path: str, match: str = None, exclude: str = None, extension: str = '.cha', allow_remote: bool = True, encoding: str = 'utf-8') → pylangacq.chat.Reader[source]

Instantiate a reader from a local or remote ZIP file.

Parameters
path : str

Either a local file path or a URL (one that begins with "https://" or "http://") for a ZIP file containing CHAT data files. For instance, you can provide either a local path to a ZIP file downloaded from CHILDES, or simply a URL such as "https://childes.talkbank.org/data/Eng-NA/Brown.zip".

match : str, optional

If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES is in a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.

exclude : str, optional

If provided, the file paths that match this string (by regular expression matching) are excluded for reading and parsing.

allow_remote : bool, optional

If True (the default), and if the data source looks like a URL, downloading the data from the internet will be attempted.

encoding : str, optional

Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.

extension : str, optional

File extension for CHAT data files. The default value is ".cha".

Returns
pylangacq.Reader
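
The interplay between path and allow_remote can be sketched as below. The function name and its "remote"/"local" return values are hypothetical; the sketch only encodes the rule stated above that a download is attempted when the source looks like a URL and allow_remote is True.

```python
def classify_zip_source(path, allow_remote=True):
    """Decide whether a ZIP source would be downloaded (hypothetical helper)."""
    looks_like_url = path.startswith(("https://", "http://"))
    if looks_like_url and allow_remote:
        return "remote"  # a download would be attempted
    return "local"       # treated as a path on the local file system


from_url = classify_zip_source("https://childes.talkbank.org/data/Eng-NA/Brown.zip")
from_disk = classify_zip_source("Brown.zip")
```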
headers() → List[Dict][source]

Return the headers.

Returns
List[Dict]
ipsyn(participant='CHI') → List[int][source]

Return the indexes of productive syntax (IPSyn).

Parameters
participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns
List[int]
languages(by_files=False) → Union[Set[str], List[List[str]]][source]

Return the languages in the data.

Parameters
by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
Set[str] if by_files is False, otherwise List[List[str]]

When by_files is True, the ordering of languages given by the list indicates language dominance. Such ordering would not make sense when by_files is False, in which case the returned object is a set instead of a list.

mlu(participant='CHI') → List[float][source]

Return the mean lengths of utterance (MLU).

This method is equivalent to mlum().

Parameters
participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns
List[float]
mlum(participant='CHI') → List[float][source]

Return the mean lengths of utterance by morphemes.

Parameters
participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns
List[float]
mluw(participant='CHI') → List[float][source]

Return the mean lengths of utterance by words.

Parameters
participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns
List[float]
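
As a rough illustration of what a mean length of utterance by words measures, the sketch below averages the word counts of a few made-up utterances. It is a simplification: the function name is hypothetical, and any exclusion of punctuation or special word forms that the package may apply is not modeled.

```python
def mean_length_by_words(utterances):
    """Average number of words per utterance (simplified sketch)."""
    if not utterances:
        return 0.0
    return sum(len(words) for words in utterances) / len(utterances)


# Hypothetical child utterances from one file, already split into words.
one_file = [["more", "cookie"], ["no"], ["I", "want", "that"]]
mlu_words = mean_length_by_words(one_file)
```

The methods above return one such value per file, which is why their return type is a list of floats.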
n_files() → int[source]

Return the number of files.

participants(by_files=False) → Union[Set[str], List[Set[str]]][source]

Return the participants (e.g., CHI, MOT).

Parameters
by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
Set[str] if by_files is False, otherwise List[Set[str]]
pop() → pylangacq.chat.Reader[source]

Drop the last data file from the reader and return it as a reader.

Returns
pylangacq.Reader
pop_left() → pylangacq.chat.Reader[source]

Drop the first data file from the reader and return it as a reader.

Returns
pylangacq.Reader
sents(participants=None, exclude=None, by_files=False) → Union[List[List[str]], List[List[List[str]]]][source]

Return the sents.

Deprecated since version 0.13.0: Please use words() with by_utterances=True instead.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[List[str]] if by_files is False, otherwise List[List[List[str]]]
tagged_sents(participants=None, exclude=None, by_files=False) → Union[List[List[pylangacq.objects.Token]], List[List[List[pylangacq.objects.Token]]]][source]

Return the tagged sents.

Deprecated since version 0.13.0: Please use tokens() with by_utterances=True instead.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[List[Token]] if by_files is False,
otherwise List[List[List[Token]]]
tagged_words(participants=None, exclude=None, by_files=False) → Union[List[pylangacq.objects.Token], List[List[pylangacq.objects.Token]]][source]

Return the tagged words.

Deprecated since version 0.13.0: Please use tokens() with by_utterances=False instead.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[Token] if by_files is False, otherwise List[List[Token]]
tokens(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[pylangacq.objects.Token], List[List[pylangacq.objects.Token]], List[List[List[pylangacq.objects.Token]]]][source]

Return the tokens.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_utterances : bool, optional

If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), such utterance-level list structure does not exist.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[List[List[Token]]] if both by_utterances and by_files are True
List[List[Token]] if by_utterances is True and by_files is False
List[List[Token]] if by_utterances is False and by_files is True
List[Token] if both by_utterances and by_files are False
ttr(keep_case=True, participant='CHI') → List[float][source]

Return the type-token ratios (TTR).

Parameters
keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.

participant : str, optional

Participant of interest, which defaults to the typical use case of "CHI" for the target child.

Returns
List[float]
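
The effect of keep_case on a type-token ratio can be seen in a minimal sketch. The function below is hypothetical and computes the plain ratio of distinct words to total words; any further preprocessing the package may perform is not modeled.

```python
def type_token_ratio(words, keep_case=True):
    """Number of distinct words divided by number of words (sketch)."""
    if not keep_case:
        words = [word.lower() for word in words]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


words = ["The", "dog", "saw", "the", "dog"]
ttr_cased = type_token_ratio(words)                      # "The" and "the" differ
ttr_lowered = type_token_ratio(words, keep_case=False)   # "The" and "the" merge
```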
utterances(participants=None, exclude=None, by_files=False) → Union[List[pylangacq.objects.Utterance], List[List[pylangacq.objects.Utterance]]][source]

Return the utterances.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[Utterance] if by_files is False, otherwise List[List[Utterance]]
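
The participants and exclude parameters shared by these methods can be pictured with the sketch below. The helper is hypothetical, and raising ValueError when both parameters are given is an assumption; the text above only states that the two cannot be used together.

```python
def select_participants(available, participants=None, exclude=None):
    """Return the participant codes that would be kept (hypothetical helper)."""
    if participants is not None and exclude is not None:
        # The exception type is an assumption of this sketch.
        raise ValueError("participants and exclude cannot be used together")

    def as_set(value):
        # Accept either a single string or an iterable of strings.
        return {value} if isinstance(value, str) else set(value)

    if participants is not None:
        return set(available) & as_set(participants)
    if exclude is not None:
        return set(available) - as_set(exclude)
    return set(available)


available = {"CHI", "MOT", "INV"}
child_only = select_participants(available, participants="CHI")
child_directed = select_participants(available, exclude="CHI")
```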
word_frequencies(keep_case=True, participants=None, exclude=None, by_files=False) → Union[collections.Counter, List[collections.Counter]][source]

Return word frequencies.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.

Returns
collections.Counter if by_files is False,
otherwise List[collections.Counter]
word_ngrams(n, keep_case=True, participants=None, exclude=None, by_files=False) → Union[collections.Counter, List[collections.Counter]][source]

Return word ngrams.

Parameters
n : int

Length of each n-gram, e.g., 2 for bigrams.

participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

keep_case : bool, optional

If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.

Returns
collections.Counter if by_files is False,
otherwise List[collections.Counter]
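
A word n-gram count stored in a collections.Counter can be sketched as follows. The helper name is hypothetical, and the choice to count n-grams within each utterance without crossing utterance boundaries is an assumption of this sketch.

```python
from collections import Counter


def count_word_ngrams(utterances, n):
    """Count n-grams (as tuples of words) within each utterance."""
    counts = Counter()
    for words in utterances:
        # Slide a window of length n over the words of one utterance.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts


# Hypothetical utterances, already split into words.
utterances = [["I", "want", "that"], ["want", "that"]]
bigrams = count_word_ngrams(utterances, 2)
```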
words(participants=None, exclude=None, by_utterances=False, by_files=False) → Union[List[str], List[List[str]], List[List[List[str]]]][source]

Return the words.

Parameters
participants : str or iterable of str, optional

Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.

exclude : str or iterable of str, optional

Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.

by_utterances : bool, optional

If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), such utterance-level list structure does not exist.

by_files : bool, optional

If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.

Returns
List[List[List[str]]] if both by_utterances and by_files are True
List[List[str]] if by_utterances is True and by_files is False
List[List[str]] if by_utterances is False and by_files is True
List[str] if both by_utterances and by_files are False
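
The four return shapes listed above can be reproduced on toy data. The sketch assumes a hypothetical nested list organized as files, then utterances, then words, and a hypothetical helper that flattens whichever levels are switched off.

```python
def shape_words(data, by_utterances=False, by_files=False):
    """Flatten a files -> utterances -> words nesting as requested (sketch)."""
    if by_files and by_utterances:
        return data
    if by_files:
        # Keep the file level, drop the utterance level.
        return [[w for utt in file for w in utt] for file in data]
    if by_utterances:
        # Keep the utterance level, drop the file level.
        return [utt for file in data for utt in file]
    return [w for file in data for utt in file for w in utt]


# Two hypothetical files: the first has two utterances, the second has one.
data = [[["more", "cookie"], ["no"]], [["want", "that"]]]

flat = shape_words(data)
per_utterance = shape_words(data, by_utterances=True)
per_file = shape_words(data, by_files=True)
```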

Token

class pylangacq.objects.Token(word: str, pos: Optional[str], mor: Optional[str], gra: Optional[pylangacq.objects.Gra])[source]

Token with attributes as parsed from a CHAT utterance.

Attributes
word : str

Word form of the token

pos : str

Part-of-speech tag

mor : str

Morphological information

gra : Gra

Grammatical relation

Gra

class pylangacq.objects.Gra(dep: int, head: int, rel: str)[source]

Grammatical relation of a word in an utterance.

Attributes
dep : int

The position of the dependent (i.e., the word itself) in the utterance

head : int

The position of the head in the utterance

rel : str

Grammatical relation
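
The three Gra attributes can be illustrated with a stand-in type. GraSketch below is a hypothetical NamedTuple, not the package's class, and the "dep|head|REL" item layout of a %gra tier is assumed for the purpose of the example.

```python
from typing import NamedTuple


class GraSketch(NamedTuple):
    """Hypothetical stand-in with the same three fields as Gra."""
    dep: int   # position of the word itself in the utterance
    head: int  # position of its head
    rel: str   # grammatical relation label


def parse_gra_item(item):
    """Parse one "dep|head|REL" item (assumed layout) into a GraSketch."""
    dep, head, rel = item.split("|")
    return GraSketch(int(dep), int(head), rel)


# Made-up %gra tier content for a two-word utterance plus punctuation.
relations = [parse_gra_item(item) for item in "1|2|SUBJ 2|0|ROOT 3|2|PUNCT".split()]
```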

Utterance

class pylangacq.objects.Utterance(participant: str, tokens: List[pylangacq.objects.Token], time_marks: Optional[Tuple[int, int]], tiers: Dict[str, str])[source]

Utterance in CHAT transcript data.

Attributes
participant : str

Participant of the utterance, e.g., "CHI", "MOT"

tokens : List[Token]

List of tokens of the utterance

time_marks : Tuple[int, int]

If available from the CHAT data, these are the start and end times (in milliseconds) for a segment in a digitized video or audio file, e.g., (0, 1073), extracted from "·0_1073·" in the CHAT data. "·" is ASCII code 21 (0x15), for NAK (Negative Acknowledgment).

tiers : Dict[str, str]

This dictionary contains all the original, unparsed data from the utterance, including the transcribed utterance (signaled by *CHI:, *MOT:, etc. in CHAT), common tiers such as %mor and %gra, as well as all other tiers associated with the utterance. This dictionary is useful for retrieving any information not readily handled by this package.
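
The time_marks extraction described above can be sketched with a regular expression over the 0x15 delimiter. The helper name and the sample line are hypothetical, and the pattern is a plausible reading of the format rather than the package's own parser.

```python
import re

# Start and end times in milliseconds, delimited by the NAK character (0x15).
TIME_MARKS = re.compile(r"\x15(\d+)_(\d+)\x15")


def extract_time_marks(line):
    """Return (start, end) in milliseconds, or None if no marks are present."""
    found = TIME_MARKS.search(line)
    if found is None:
        return None
    return int(found.group(1)), int(found.group(2))


# Made-up utterance line with time marks at the end.
marks = extract_time_marks("*CHI:\tmore cookie . \x150_1073\x15")
no_marks = extract_time_marks("*CHI:\tmore cookie .")
```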