API Reference
read_chat()
- pylangacq.read_chat(path: str, match: str = None, exclude: str = None, encoding: str = 'utf-8', cls: type = <class 'pylangacq.chat.Reader'>) -> Reader
Create a reader of CHAT data.
If path is a remote ZIP file and you expect to call this function with the same path multiple times, consider downloading the data to the local system and then reading it from there to avoid unnecessary re-downloading. Caching a remote ZIP file isn’t implemented (yet) as the upstream CHILDES / TalkBank data is updated in minor ways from time to time.
- Parameters:
- path : str
A path that points to one of the following:
- A ZIP file. Either a local .zip file path or a URL (one that begins with "https://" or "http://"). Example of a URL: "https://childes.talkbank.org/data/Eng-NA/Brown.zip"
- A local directory, for files under this directory recursively.
- A single .cha CHAT file.
- match : str, optional
If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
- exclude : str, optional
If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
- encoding : str, optional
Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
- cls : type, optional
Either Reader (the default), or a subclass of it that expects the same arguments for the methods from_zip(), from_dir(), and from_files(). Pass in your own Reader subclass for new or modified behavior of the returned reader object.
- Returns:
- Reader
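The match and exclude parameters described above filter file paths by regular expression. The following is a minimal sketch of that idea using only the standard library; it is not PyLangAcq's own code. The helper name filter_paths and the sample file names are hypothetical, and the use of re.search (a pattern may hit anywhere in the path) is an assumption that agrees with the "Eve" example above.

```python
import re
from typing import Iterable, List, Optional


def filter_paths(
    paths: Iterable[str],
    match: Optional[str] = None,
    exclude: Optional[str] = None,
) -> List[str]:
    """Keep paths that hit `match` (if given) and do not hit `exclude` (if given)."""
    kept = []
    for path in paths:
        # Assumption: the pattern may match anywhere in the path (re.search).
        if match is not None and not re.search(match, path):
            continue
        if exclude is not None and re.search(exclude, path):
            continue
        kept.append(path)
    return kept


# Hypothetical file names that follow the Brown/<child>/xxx.cha layout.
paths = ["Brown/Adam/010100.cha", "Brown/Eve/010600.cha", "Brown/Sarah/020305.cha"]
eve_only = filter_paths(paths, match="Eve")
no_adam = filter_paths(paths, exclude="Adam")
```

Because the strings are regular expressions, a pattern such as "Eve|Sarah" would select two children at once in this sketch.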
Reader
- class pylangacq.Reader
A reader that handles CHAT data.
Methods
ages([participant, months]): Return the ages of the given participant in the data.
append(reader): Append data from another reader.
append_left(reader): Left-append data from another reader.
clear(): Remove all data from this reader.
dates_of_recording([by_files]): Return the dates of recording.
extend(readers): Extend data from other readers.
extend_left(readers): Left-extend data from other readers.
file_paths(): Return the file paths.
filter([match, exclude]): Return a new reader filtered by file paths.
from_dir(path[, match, exclude, extension, ...]): Instantiate a reader from a local directory with CHAT data files.
from_files(paths[, match, exclude, ...]): Instantiate a reader from local CHAT data files.
from_strs(strs[, ids, parallel]): Instantiate a reader from in-memory CHAT data strings.
from_zip(path[, match, exclude, extension, ...]): Instantiate a reader from a local or remote ZIP file.
head([n, participants, exclude]): Return the first several utterances.
headers(): Return the headers.
info([verbose]): Print a summary of this Reader's data.
ipsyn([participant]): Return the indexes of productive syntax (IPSyn).
languages([by_files]): Return the languages in the data.
mlu([participant, exclude_switch]): Return the mean lengths of utterance (MLU).
mlum([participant, exclude_switch]): Return the mean lengths of utterance in morphemes.
mluw([participant, exclude_switch]): Return the mean lengths of utterance in words.
n_files(): Return the number of files.
participants([by_files]): Return the participants (e.g., CHI, MOT).
pop(): Drop the last data file from the reader and return it as a reader.
pop_left(): Drop the first data file from the reader and return it as a reader.
sents([participants, exclude, by_files]): Return the sents.
tagged_sents([participants, exclude, by_files]): Return the tagged sents.
tagged_words([participants, exclude, by_files]): Return the tagged words.
tail([n, participants, exclude]): Return the last several utterances.
to_chat(path[, is_dir, filenames, tabular, ...]): Export to CHAT data files.
to_strs([tabular]): Yield CHAT data strings.
tokens([participants, exclude, ...]): Return the tokens.
ttr([keep_case, participant]): Return the type-token ratios (TTR).
utterances([participants, exclude, by_files]): Return the utterances.
word_frequencies([keep_case, participants, ...]): Return word frequencies.
word_ngrams(n[, keep_case, participants, ...]): Return word ngrams.
words([participants, exclude, ...]): Return the words.
- ages(participant='CHI', months=False) -> List[Tuple[int, int, int]] | List[float]
Return the ages of the given participant in the data.
- Parameters:
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- months : bool, optional
If False (the default), age is represented as a tuple of (years, months, days), e.g., “1;06.00” in CHAT becomes (1, 6, 0). If True, age is a float for the number of months, e.g., “1;06.00” in CHAT becomes 18.0 for 18 months.
- Returns:
- List[Tuple[int, int, int]] if months is False, otherwise List[float]
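The two age representations can be illustrated with a small standalone sketch. The helpers parse_age and age_in_months are hypothetical names, not part of PyLangAcq, and the choice of 30 days per month for the fractional part is an assumption; it has no effect on the “1;06.00” example from the text above.

```python
from typing import Tuple


def parse_age(age: str) -> Tuple[int, int, int]:
    """Parse a CHAT age string such as "1;06.00" into (years, months, days)."""
    years, _, rest = age.partition(";")
    months, _, days = rest.partition(".")
    # Missing month or day fields are treated as zero in this sketch.
    return int(years), int(months or 0), int(days or 0)


def age_in_months(age: Tuple[int, int, int], days_per_month: float = 30.0) -> float:
    """Convert (years, months, days) to months; days_per_month is an assumption."""
    years, months, days = age
    return years * 12 + months + days / days_per_month


age = parse_age("1;06.00")
months = age_in_months(age)
```

Here age is the tuple form that corresponds to months=False, and months is the float form that corresponds to months=True.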
- append(reader: Reader) -> None
Append data from another reader.
New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.
- Parameters:
- reader : Reader
A reader from which to append data.
- append_left(reader: Reader) -> None
Left-append data from another reader.
New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.
- Parameters:
- reader : Reader
A reader from which to left-append data.
- dates_of_recording(by_files=False) -> Set[date] | List[Set[date]]
Return the dates of recording.
- Parameters:
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- Set[datetime.date] if by_files is False, otherwise List[Set[datetime.date]]
- extend(readers: Iterable[Reader]) -> None
Extend data from other readers.
New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.
- Parameters:
- readers : Iterable[Reader]
Readers from which to extend data.
- extend_left(readers: Iterable[Reader]) -> None
Left-extend data from other readers.
New data is appended as-is with no filtering of any sort, even for files whose file paths duplicate those already in the current reader.
- Parameters:
- readers : Iterable[Reader]
Readers from which to left-extend data.
- file_paths() -> List[str]
Return the file paths.
If the data comes from in-memory strings, then the “file paths” are arbitrary UUID random strings.
- Returns:
- List[str]
- filter(match: str = None, exclude: str = None) -> Reader
Return a new reader filtered by file paths.
- Parameters:
- match : str, optional
If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
- exclude : str, optional
If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
- Returns:
- Reader
- Raises:
- TypeError
If neither match nor exclude is specified.
- classmethod from_dir(path: str, match: str = None, exclude: str = None, extension: str = '.cha', encoding: str = 'utf-8', parallel: bool = True) -> Reader
Instantiate a reader from a local directory with CHAT data files.
- Parameters:
- path : str
Local directory that contains CHAT data files. Files are searched for recursively under this directory, and those that satisfy match and extension are parsed and handled by the reader.
- match : str, optional
If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
- exclude : str, optional
If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
- encoding : str, optional
Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
- extension : str, optional
File extension for CHAT data files. The default value is ".cha".
- parallel : bool, optional
If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), you may want to consider setting this parameter to False.
- Returns:
- Reader
- classmethod from_files(paths: List[str], match: str = None, exclude: str = None, encoding: str = 'utf-8', parallel: bool = True) -> Reader
Instantiate a reader from local CHAT data files.
- Parameters:
- paths : List[str]
List of local file paths of the CHAT data. The ordering of the paths determines that of the parsed CHAT data in the resulting reader.
- match : str, optional
If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
- exclude : str, optional
If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
- encoding : str, optional
Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
- parallel : bool, optional
If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), you may want to consider setting this parameter to False.
- Returns:
- Reader
- classmethod from_strs(strs: List[str], ids: List[str] = None, parallel: bool = True) -> Reader
Instantiate a reader from in-memory CHAT data strings.
- Parameters:
- strs : List[str]
List of CHAT data strings. The ordering of the strings determines that of the parsed CHAT data in the resulting reader.
- ids : List[str], optional
List of identifiers. If not provided, UUID random strings are used. When file paths are referred to in other parts of this package, they mean these identifiers if you have instantiated the reader by this method.
- parallel : bool, optional
If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), you may want to consider setting this parameter to False.
- Returns:
- Reader
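The fallback for ids described above (random UUID strings standing in for file paths) can be pictured with the sketch below. It is only an illustration: resolve_ids is a hypothetical helper, the use of uuid.uuid4 is an assumption about which kind of UUID is meant, and the length check is an assumption rather than documented behavior.

```python
import uuid
from typing import List, Optional


def resolve_ids(strs: List[str], ids: Optional[List[str]] = None) -> List[str]:
    """Return one identifier per CHAT string, generating random ones if needed."""
    if ids is None:
        # Assumption: version-4 (random) UUIDs rendered as strings.
        return [str(uuid.uuid4()) for _ in strs]
    ids = list(ids)
    if len(ids) != len(strs):
        # Assumption: a mismatch is treated as an error in this sketch.
        raise ValueError("ids and strs must have the same length")
    return ids


chat_strs = ["*CHI:\tmore cookie .\n", "*MOT:\tyou want more ?\n"]
generated = resolve_ids(chat_strs)
explicit = resolve_ids(chat_strs, ["session-1", "session-2"])
```

Passing explicit identifiers keeps results reproducible across runs, since generated UUIDs differ every time.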
- classmethod from_zip(path: str, match: str = None, exclude: str = None, extension: str = '.cha', encoding: str = 'utf-8', parallel: bool = True, use_cached: bool = True, session: Session = None) -> Reader
Instantiate a reader from a local or remote ZIP file.
If the input data is a remote ZIP file and you expect to call this method with the same path multiple times, consider downloading the data to the local system and then reading it from there to avoid unnecessary re-downloading. Caching a remote ZIP file isn’t implemented (yet) as the upstream CHILDES / TalkBank data is updated in minor ways from time to time.
- Parameters:
- path : str
Either a local file path or a URL (one that begins with "https://" or "http://") for a ZIP file containing CHAT data files. For instance, you can provide either a local path to a ZIP file downloaded from CHILDES, or simply a URL such as "https://childes.talkbank.org/data/Eng-NA/Brown.zip".
- match : str, optional
If provided, only the file paths that match this string (by regular expression matching) are read and parsed. For example, to work with the American English dataset Brown (containing data for the children Adam, Eve, and Sarah), you can pass in "Eve" here to only handle the data for Eve, since the unzipped Brown data from CHILDES has a directory structure of Brown/Eve/xxx.cha for Eve’s data. If this parameter is not specified or None is passed in (the default), such file path filtering does not apply.
- exclude : str, optional
If provided, the file paths that match this string (by regular expression matching) are excluded from reading and parsing.
- encoding : str, optional
Text encoding to parse the CHAT data. The default value is "utf-8" for Unicode UTF-8.
- extension : str, optional
File extension for CHAT data files. The default value is ".cha".
- parallel : bool, optional
If True (the default), CHAT reading and parsing is parallelized for speed-up, because in most cases multiple CHAT data files and/or strings are being handled. Under certain circumstances (e.g., your application is already parallelized and further parallelization from within PyLangAcq might be undesirable), you may want to consider setting this parameter to False.
- use_cached : bool, optional
If True (the default), and if the path is a URL for a remote ZIP archive, then CHAT reading attempts to use the previously downloaded data cached on disk. This setting allows you to call this function with the same URL repeatedly without hitting the CHILDES / TalkBank server more than once for the same data. Pass in False to force a new download; the upstream CHILDES / TalkBank data is updated in minor ways from time to time, e.g., for CHAT format, header/metadata information, updated annotations. See also the helper functions: pylangacq.chat.cached_data_info(), pylangacq.chat.remove_cached_data().
- session : requests.Session, optional
If the path is a URL for a remote ZIP archive, data downloading is done with reasonable settings of retries and timeout by default, in order to be robust against intermittent network issues. If necessary, pass in your own instance of requests.Session to customize.
- Returns:
- Reader
- head(n: int = 5, participants=None, exclude=None)
Return the first several utterances.
- Parameters:
- n : int, optional
The number of utterances to return.
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- Returns:
- list of utterances
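The participants and exclude parameters follow the same pattern across many Reader methods: each accepts a string or an iterable of strings, and the two are mutually exclusive. A minimal standalone sketch of that selection logic follows. The function select and the (speaker, text) tuples are hypothetical stand-ins for real utterance objects, and raising ValueError when both are given is an assumption, since the text above only says the two cannot be combined.

```python
from typing import Iterable, List, Optional, Set, Tuple, Union

Participants = Optional[Union[str, Iterable[str]]]


def _as_set(value: Participants) -> Optional[Set[str]]:
    """Normalize None, a single string, or an iterable of strings."""
    if value is None:
        return None
    if isinstance(value, str):
        return {value}
    return set(value)


def select(
    utterances: List[Tuple[str, str]],
    participants: Participants = None,
    exclude: Participants = None,
) -> List[Tuple[str, str]]:
    """Filter (speaker, text) pairs by speaker code."""
    if participants is not None and exclude is not None:
        # Assumption: the exception type is illustrative only.
        raise ValueError("participants and exclude cannot be used together")
    include, drop = _as_set(participants), _as_set(exclude)
    return [
        (who, text)
        for who, text in utterances
        if (include is None or who in include) and (drop is None or who not in drop)
    ]


data = [("CHI", "more cookie ."), ("MOT", "you want more ?"), ("INV", "okay .")]
child = select(data, participants="CHI")
child_directed = select(data, exclude="CHI")
first_two = select(data, participants={"MOT", "INV"})[:2]  # head-like slice
```

Slicing the filtered list, as in the last line, mirrors the idea of taking the first n utterances after participant filtering.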
- info(verbose=False) -> None
Print a summary of this Reader’s data.
- Parameters:
- verbose : bool, optional
If True (default is False), show the details of all the files.
- ipsyn(participant='CHI') -> List[int]
Return the indexes of productive syntax (IPSyn).
- Parameters:
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- Returns:
- List[int]
- languages(by_files=False) -> Set[str] | List[List[str]]
Return the languages in the data.
- Parameters:
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- Set[str] if by_files is False, otherwise List[List[str]]
When by_files is True, the ordering of languages given by the list indicates language dominance. Such ordering would not make sense when by_files is False, in which case the returned object is a set instead of a list.
- mlu(participant='CHI', exclude_switch: bool = False) -> List[float]
Return the mean lengths of utterance (MLU).
This method is equivalent to mlum().
- Parameters:
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- exclude_switch : bool, optional
If True, exclude words with the suffix “@s” for switching to another language (not uncommon in code-mixing or multilingual acquisition). The default is False.
- Returns:
- List[float]
- mlum(participant='CHI', exclude_switch: bool = False) -> List[float]
Return the mean lengths of utterance in morphemes.
- Parameters:
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- exclude_switch : bool, optional
If True, exclude words with the suffix “@s” for switching to another language (not uncommon in code-mixing or multilingual acquisition). The default is False.
- Returns:
- List[float]
- mluw(participant='CHI', exclude_switch: bool = False) -> List[float]
Return the mean lengths of utterance in words.
- Parameters:
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- exclude_switch : bool, optional
If True, exclude words with the suffix “@s” for switching to another language (not uncommon in code-mixing or multilingual acquisition). The default is False.
- Returns:
- List[float]
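As a rough illustration of a per-file mean length of utterance in words, and of what exclude_switch changes, consider the sketch below. It is not the library's implementation: mlu_in_words is a hypothetical function, the input is assumed to be already split into files, utterances, and words with punctuation removed, and detecting a switched word by the presence of "@s" is an assumption.

```python
from typing import List


def mlu_in_words(files: List[List[List[str]]], exclude_switch: bool = False) -> List[float]:
    """Return one mean utterance length (in words) per file."""
    results = []
    for utterances in files:
        lengths = []
        for words in utterances:
            if exclude_switch:
                # Assumption: a code-switched word carries an "@s" marker.
                words = [w for w in words if "@s" not in w]
            lengths.append(len(words))
        # An empty file yields 0.0 in this sketch.
        results.append(sum(lengths) / len(lengths) if lengths else 0.0)
    return results


# Two hypothetical files: the first has two utterances, the second has one.
files = [
    [["more", "cookie"], ["I", "want", "agua@s", "please"]],
    [["no"]],
]
with_switch = mlu_in_words(files)
without_switch = mlu_in_words(files, exclude_switch=True)
```

The result has one float per file, which matches the List[float] return type documented for the MLU methods.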
- participants(by_files=False) -> Set[str] | List[Set[str]]
Return the participants (e.g., CHI, MOT).
- Parameters:
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- Set[str] if by_files is False, otherwise List[Set[str]]
- pop_left() -> Reader
Drop the first data file from the reader and return it as a reader.
- Returns:
- Reader
- sents(participants=None, exclude=None, by_files=False) -> List[List[str]] | List[List[List[str]]]
Return the sents.
Deprecated since version 0.13.0: Please use words() with by_utterances=True instead.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[List[str]] if by_files is False, otherwise List[List[List[str]]]
- tagged_sents(participants=None, exclude=None, by_files=False) -> List[List[Token]] | List[List[List[Token]]]
Return the tagged sents.
Deprecated since version 0.13.0: Please use tokens() with by_utterances=True instead.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[List[Token]] if by_files is False, otherwise List[List[List[Token]]]
- tagged_words(participants=None, exclude=None, by_files=False) -> List[Token] | List[List[Token]]
Return the tagged words.
Deprecated since version 0.13.0: Please use tokens() with by_utterances=False instead.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[Token] if by_files is False, otherwise List[List[Token]]
- tail(n: int = 5, participants=None, exclude=None)
Return the last several utterances.
- Parameters:
- n : int, optional
The number of utterances to return.
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- Returns:
- list of utterances
- to_chat(path: str, is_dir: bool = False, filenames: Iterable[str] = None, tabular: bool = True, encoding: str = 'utf-8') -> None
Export to CHAT data files.
- Parameters:
- path : str
The path to a file where you want to output the CHAT data, e.g., “data.cha”, “foo/bar/data.cha”.
- is_dir : bool, optional
If True (default is False), then path is interpreted as a directory instead. The CHAT data is written to possibly multiple files under this directory. The number of files you get can be checked by calling n_files(), which depends on how this reader object is created.
- filenames : Iterable[str], optional
Used only when is_dir is True. These are the filenames of the CHAT files to write. If None or not given, {0001.cha, 0002.cha, …} are used.
- tabular : bool, optional
If True, adjust spacing such that the three tiers of the utterance, %mor, and %gra are aligned in a tabular form. Note that such alignment would drop annotations (e.g., pauses) on the main utterance tier.
- encoding : str, optional
Text encoding to output the CHAT data as. The default value is "utf-8" for Unicode UTF-8.
- Raises:
- ValueError
- If you attempt to output data to a single local file, but the CHAT data in this reader appears to be organized in multiple files.
- If you attempt to output data to a directory while providing your own filenames, but the number of your filenames doesn’t match the number of CHAT files in this reader object.
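The filename handling described for directory output can be sketched as follows. This is an illustration under stated assumptions, not PyLangAcq's code: resolve_filenames is a hypothetical helper, and the zero-padded four-digit pattern is inferred from the {0001.cha, 0002.cha, …} example above.

```python
from typing import Iterable, List, Optional


def resolve_filenames(n_files: int, filenames: Optional[Iterable[str]] = None) -> List[str]:
    """Return the filenames to write, one per CHAT file in the reader."""
    if filenames is None:
        # Default names: 0001.cha, 0002.cha, and so on.
        return [f"{i:04d}.cha" for i in range(1, n_files + 1)]
    names = list(filenames)
    if len(names) != n_files:
        raise ValueError(
            f"got {len(names)} filenames for {n_files} CHAT files in the reader"
        )
    return names


default_names = resolve_filenames(3)
custom_names = resolve_filenames(2, ["eve01.cha", "eve02.cha"])
```

Checking n_files() before exporting tells you how many names to supply and avoids the ValueError case.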
- to_strs(tabular: bool = True) -> Generator[str, None, None]
Yield CHAT data strings.
Note
The header information may not be completely reproduced in the output CHAT strings. Known issues all have to do with a header field used multiple times in the original CHAT data. For Date, only the first date of recording is retained in the output string. For all other multiply used header fields (e.g., Tape Location, Time Duration), only the last value in a given CHAT file is retained. Note that ID for participant information is not affected.
- Parameters:
- tabular : bool, optional
If True, adjust spacing such that the three tiers of the utterance, %mor, and %gra are aligned in a tabular form. Note that such alignment would drop annotations (e.g., pauses) on the main utterance tier.
- Yields:
- str
CHAT data string for one file.
- tokens(participants=None, exclude=None, by_utterances=False, by_files=False) -> List[Token] | List[List[Token]] | List[List[List[Token]]]
Return the tokens.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_utterances : bool, optional
If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), such utterance-level list structure does not exist.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[List[List[Token]]] if both by_utterances and by_files are True
- List[List[Token]] if by_utterances is True and by_files is False
- List[List[Token]] if by_utterances is False and by_files is True
- List[Token] if both by_utterances and by_files are False
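The four return shapes listed above come from two independent choices: whether to keep the utterance level and whether to keep the file level. The sketch below makes that concrete on plain nested lists. shape_tokens is a hypothetical function, and single-letter strings stand in for Token objects.

```python
from typing import Any, List


def shape_tokens(
    files: List[List[List[Any]]],
    by_utterances: bool = False,
    by_files: bool = False,
) -> list:
    """Reshape data nested as files -> utterances -> tokens."""
    if by_utterances and by_files:
        return files  # keep both levels
    if by_utterances:
        # Drop the file level, keep one list per utterance.
        return [utt for file_ in files for utt in file_]
    if by_files:
        # Drop the utterance level, keep one list per file.
        return [[tok for utt in file_ for tok in utt] for file_ in files]
    # Drop both levels: one flat list of tokens.
    return [tok for file_ in files for utt in file_ for tok in utt]


# Two files: the first has two utterances, the second has one.
files = [[["a", "b"], ["c"]], [["d"]]]
flat = shape_tokens(files)
per_utterance = shape_tokens(files, by_utterances=True)
per_file = shape_tokens(files, by_files=True)
nested = shape_tokens(files, by_utterances=True, by_files=True)
```

Note that per_utterance and per_file have the same nesting depth but group the tokens differently, which is why the two List[List[Token]] cases in the list above are not interchangeable.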
- ttr(keep_case=True, participant='CHI') -> List[float]
Return the type-token ratios (TTR).
- Parameters:
- keep_case : bool, optional
If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.
- participant : str, optional
Participant of interest, which defaults to the typical use case of "CHI" for the target child.
- Returns:
- List[float]
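To see how keep_case affects a type-token ratio, here is a small sketch of the plain definition (number of distinct word types divided by number of word tokens) for a single file's words. The function type_token_ratio is hypothetical, and any additional word-selection rules the library may apply are not modeled here.

```python
from typing import List


def type_token_ratio(words: List[str], keep_case: bool = True) -> float:
    """Distinct word types divided by total word tokens."""
    if not keep_case:
        words = [w.lower() for w in words]
    # An empty input yields 0.0 in this sketch.
    return len(set(words)) / len(words) if words else 0.0


words = ["The", "dog", "saw", "the", "dog"]
ttr_cased = type_token_ratio(words)                      # 4 types / 5 tokens
ttr_lowered = type_token_ratio(words, keep_case=False)   # 3 types / 5 tokens
```

With case kept, “The” and “the” count as two types, so the ratio is higher than after lowercasing.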
- utterances(participants=None, exclude=None, by_files=False) -> List[Utterance] | List[List[Utterance]]
Return the utterances.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[Utterance] if by_files is False, otherwise List[List[Utterance]]
- word_frequencies(keep_case=True, participants=None, exclude=None, by_files=False) Counter | List[Counter] [source]#
Return word frequencies.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- keep_case : bool, optional
If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.
- Returns:
- collections.Counter if by_files is False, otherwise List[collections.Counter]
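As a rough illustration of how keep_case and by_files shape the result, the sketch below counts words with `collections.Counter` from the standard library. It is not pylangacq's implementation: the toy word lists and the helper name `count_words` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: one list of word tokens per file.
FILES = [["The", "dog", "saw", "the", "cat"], ["the", "cat", "ran"]]


def count_words(files, keep_case=True, by_files=False):
    """Count word tokens, optionally lowercasing first and optionally per file."""
    counters = [
        Counter(word if keep_case else word.lower() for word in words)
        for words in files
    ]
    if by_files:
        return counters  # one Counter per file, in file order
    total = Counter()
    for counter in counters:
        total.update(counter)  # merge the per-file counts into one Counter
    return total


cased = count_words(FILES)                    # "The" and "the" counted separately
lowered = count_words(FILES, keep_case=False)  # everything folded to lowercase
per_file = count_words(FILES, by_files=True)   # list with one Counter per file
```

With case kept, "The" and "the" are separate keys; with keep_case set to False they merge into a single "the" entry.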
- word_ngrams(n, keep_case=True, participants=None, exclude=None, by_files=False) Counter | List[Counter] [source]#
Return word ngrams.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- keep_case : bool, optional
If True (the default), case distinctions are kept, e.g., word tokens like “the” and “The” are treated as distinct. If False, all word tokens are forced to be in lowercase as a preprocessing step. CHAT data from CHILDES intentionally does not follow the orthographic convention of capitalizing the first letter of a sentence in the transcriptions (as would have been done in many European languages), and so leaving keep_case as True is appropriate in most cases.
- Returns:
- collections.Counter if by_files is False, otherwise List[collections.Counter]
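For readers unfamiliar with word n-grams, here is a minimal standard-library sketch of counting them. It is an illustration only: the helper name `count_ngrams` and the toy data are hypothetical, and two details are assumptions rather than documented behavior of this package, namely that n-grams are formed within each utterance (never across utterance boundaries) and that each n-gram is keyed as a tuple of words.

```python
from collections import Counter

# Hypothetical toy data: one list of word tokens per utterance.
UTTERANCES = [["I", "want", "more", "cookie"], ["want", "more", "juice"]]


def count_ngrams(utterances, n, keep_case=True):
    """Count n-grams (as tuples of words) within each utterance."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    counts = Counter()
    for words in utterances:
        if not keep_case:
            words = [word.lower() for word in words]
        # Zipping n staggered slices yields every window of n consecutive words;
        # utterances shorter than n contribute nothing.
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


bigrams = count_ngrams(UTTERANCES, 2)
```

In this toy data the bigram ("want", "more") occurs once in each utterance, so its count is 2.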
- words(participants=None, exclude=None, by_utterances=False, by_files=False) List[str] | List[List[str]] | List[List[List[str]]] [source]#
Return the words.
- Parameters:
- participants : str or iterable of str, optional
Participants of interest. You may pass in a string (e.g., "CHI" for studying child speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are included. If you pass in None (the default), all participants are included. This parameter cannot be used together with exclude.
- exclude : str or iterable of str, optional
Participants to exclude. You may pass in a string (e.g., "CHI" for child-directed speech) or an iterable of strings (e.g., {"MOT", "INV"}). Only the specified participants are excluded. If you pass in None (the default), no participants are excluded. This parameter cannot be used together with participants.
- by_utterances : bool, optional
If True, the resulting objects are wrapped as a list at the utterance level. If False (the default), such utterance-level list structure does not exist.
- by_files : bool, optional
If True, return a list X of results, where len(X) is the number of files in the Reader object, and each element in X is the result for one file; the ordering of X corresponds to that of the file paths from file_paths(). If False (the default), return the result that collapses the file distinction just described for when by_files is True.
- Returns:
- List[List[List[str]]] if both by_utterances and by_files are True
- List[List[str]] if by_utterances is True and by_files is False
- List[List[str]] if by_utterances is False and by_files is True
- List[str] if both by_utterances and by_files are False
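The four return shapes listed above are easier to see side by side. The following sketch reproduces the nesting logic in plain Python on hypothetical toy data (files containing utterances containing words); the helper name `nest_words` is made up for this illustration and is not part of the package.

```python
# Hypothetical toy data: files -> utterances -> words.
DATA = [
    [["more", "cookie"], ["no"]],
    [["what", "is", "that"]],
]


def nest_words(data, by_utterances=False, by_files=False):
    """Return words with the nesting implied by the two flags."""
    if by_utterances and by_files:
        # List[List[List[str]]]: files -> utterances -> words
        return [[list(utt) for utt in one_file] for one_file in data]
    if by_utterances:
        # List[List[str]]: utterances -> words, file distinction collapsed
        return [list(utt) for one_file in data for utt in one_file]
    if by_files:
        # List[List[str]]: files -> words, utterance distinction collapsed
        return [[word for utt in one_file for word in utt] for one_file in data]
    # List[str]: one flat list of words
    return [word for one_file in data for utt in one_file for word in utt]


flat = nest_words(DATA)
by_utt = nest_words(DATA, by_utterances=True)
by_file = nest_words(DATA, by_files=True)
both = nest_words(DATA, by_utterances=True, by_files=True)
```

Note that the two List[List[str]] cases have the same type but different groupings: `by_utt` has one inner list per utterance, whereas `by_file` has one inner list per file.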
Token
#
- class pylangacq.objects.Token(word: str, pos: str | None, mor: str | None, gra: Gra | None)[source]#
Token with attributes as parsed from a CHAT utterance.
- Attributes:
- word : str
Word form of the token
- pos : str
Part-of-speech tag
- mor : str
Morphological information
- gra : Gra
Grammatical relation
Methods
to_gra_tier(): Return the %gra representation.
to_mor_tier(): Return the %mor representation.
Gra
#
Utterance
#
- class pylangacq.objects.Utterance(participant: str, tokens: List[Token], time_marks: Tuple[int, int] | None, tiers: Dict[str, str])[source]#
An utterance in CHAT transcript data.
- Attributes:
- participant : str
Participant of the utterance, e.g., "CHI", "MOT"
- tokens : List[Token]
List of tokens of the utterance
- time_marks : Tuple[int, int]
If available from the CHAT data, these are the start and end times (in milliseconds) for a segment in a digitized video or audio file, e.g., (0, 1073), extracted from "·0_1073·" in the CHAT data. "·" is ASCII code 21 (0x15), for NAK (Negative Acknowledgment).
- tiers : Dict[str, str]
This dictionary contains all the original, unparsed data from the utterance, including the transcribed utterance (signaled by *CHI:, *MOT:, etc. in CHAT), common tiers such as %mor and %gra, as well as all other tiers associated with the utterance. This dictionary is useful for retrieving any information not readily handled by this package.
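To show how a time mark such as (0, 1073) relates to the raw CHAT text, here is a small sketch that extracts the start and end times delimited by the NAK character (0x15). It uses only the standard library and is not the package's parser; the helper name `parse_time_marks` and the sample line are hypothetical, and returning None when no time marks are present is an assumption suggested by the `Tuple[int, int] | None` type in the class signature.

```python
import re
from typing import Optional, Tuple

# Time marks are delimited by the NAK control character, ASCII 21 (0x15).
NAK = "\x15"
TIME_MARKS = re.compile(NAK + r"(\d+)_(\d+)" + NAK)


def parse_time_marks(line: str) -> Optional[Tuple[int, int]]:
    """Return (start_ms, end_ms) from a CHAT line, or None if no marks are found."""
    match = TIME_MARKS.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical utterance line ending in NAK-delimited time marks.
line = "*CHI:\tmore cookie . \x150_1073\x15"
marks = parse_time_marks(line)
```

Because the delimiter is a control character, it is usually invisible or rendered as a placeholder glyph when a .cha file is opened in a text editor.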