Changelog#
[Unreleased]#
Added#
Changed#
Deprecated#
Removed#
Fixed#
Security#
[0.19.1] - 2024-03-29#
Fixed#
Handled the duration mark (e.g., [# 0.4]) in utterance cleaning.
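As a rough illustration of what removing a duration mark involves, the sketch below strips marks such as [# 0.4] with a regular expression. The function name strip_duration_marks and the pattern are assumptions made for illustration only; they are not pylangacq's actual cleaning code, and other duration formats are not covered.

```python
import re

# Hypothetical pattern: "[#", optional whitespace, a number such as 0.4 or 2, then "]".
_DURATION_MARK = re.compile(r"\[#\s*\d+(?:\.\d+)?\]")


def strip_duration_marks(utterance: str) -> str:
    """Remove duration marks like "[# 0.4]" and collapse leftover whitespace."""
    without_marks = _DURATION_MARK.sub(" ", utterance)
    return " ".join(without_marks.split())
```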
[0.19.0] - 2023-12-13#
Added#
Added support for Python 3.12.
Handled pre-clitics and post-clitics from %mor tiers and honored their distinction in the parsed utterance.
[0.18.0] - 2023-03-11#
Added#
Added support for Python 3.11.
Changed#
Updated the Brown’s Eve test data to match the upstream CHILDES.
Removed#
Dropped support for Python 3.7.
[0.17.0] - 2022-06-09#
Added#
Added the exclude_switch option for MLU (mlu(), mlum(), and mluw()), so that words with @s for switching language may be excluded.
Fixed#
Fixed MLU computation (mlu(), mlum(), and mluw()):
If xxx, yyy, or www appears in an utterance, the whole utterance is ignored.
If there are no MLU-relevant words/morphemes in an utterance, the whole utterance is ignored.
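The two rules above, together with the exclude_switch option, can be sketched for a word-based MLU as follows. This is a minimal sketch under stated assumptions, not pylangacq's implementation: the name mluw_sketch, the helper _is_switch, and the choice to treat ".", "?", and "!" as not MLU-relevant are all hypothetical.

```python
from typing import Iterable, List

UNINTELLIGIBLE = {"xxx", "yyy", "www"}
PUNCTUATION = {".", "?", "!"}  # assumption: punctuation is not MLU-relevant


def _is_switch(word: str) -> bool:
    """True for words carrying the @s marker, e.g. "mira@s" or "mira@s:spa"."""
    _, sep, marker = word.partition("@")
    return sep == "@" and (marker == "s" or marker.startswith("s:"))


def mluw_sketch(utterances: Iterable[List[str]], exclude_switch: bool = False) -> float:
    """Mean length of utterance in words, following the rules listed above."""
    lengths = []
    for words in utterances:
        # Rule 1: xxx, yyy, or www makes the whole utterance ineligible.
        if UNINTELLIGIBLE.intersection(words):
            continue
        if exclude_switch:
            words = [w for w in words if not _is_switch(w)]
        relevant = [w for w in words if w not in PUNCTUATION]
        # Rule 2: an utterance with nothing left to count is ignored.
        if not relevant:
            continue
        lengths.append(len(relevant))
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Ignored utterances are left out of both the numerator and the denominator, so they do not pull the mean toward zero.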
[0.16.2] - 2022-03-17#
Fixed#
Moved the download_and_extract_brown test function to under the pylangacq package namespace, as tests from BaseTestCHATReader require downloaded CHAT data files.
[0.16.1] - 2022-03-17#
Changed#
Restructured the repository to use top-level src/ and tests/ directories.
Removed#
Removed setup.py.
Fixed#
Moved BaseTestCHATReader back under the pylangacq package namespace so that downstream packages can import BaseTestCHATReader for testing.
[0.16.0] - 2021-12-27#
Added#
Reader objects can now be concatenated by the addition operator +.
Implemented the head, tail, and info methods at Reader.
Added support for Python 3.10.
Turned on Windows testing on CircleCI.
Added pyproject.toml. Related to prioritizing setup.cfg for specifying build metadata and options.
Changed#
The to_strs and to_chat methods of a Reader object return tabulated outputs by default.
Prioritized to_chat for the single file output use case.
Unzipping CHAT data now uses less memory.
Switched to setup.cfg to fully specify build metadata and options, while keeping a minimal setup.py for backward compatibility. Related to the new pyproject.toml.
Switched the Sphinx docs theme from sphinx-rtd-theme to furo.
Removed#
Dropped support for Python 3.6.
Security#
Turned on safety and bandit checks at CircleCI builds.
[0.15.0] - 2021-06-06#
Added#
Reader.from_zip (also read_chat) now keeps the downloaded ZIP archive in a non-temporary directory for possible re-use.
Added the kwarg use_cached in Reader.from_zip, so that we use the cached data by default for the same input URL, and that we can force re-downloading by setting use_cached to False.
Added the kwarg session in Reader.from_zip, in case using a customized requests.Session instance is desired. session also makes it possible to write tests for the new kwarg use_cached.
Added the helper functions cached_data_info and remove_cached_data.
Reader has the new to_strs method that yields CHAT data strings.
Reader has the new to_chat method that exports data to local files.
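The caching behavior described for use_cached can be pictured with the sketch below: the same input URL maps to the same local file, which is reused unless re-downloading is forced. Everything here is hypothetical, including the name get_zip, the URL-hash file naming, and the fetch callable, which stands in for the network call that a session would make. It is not how Reader.from_zip is implemented.

```python
import hashlib
from pathlib import Path
from typing import Callable


def get_zip(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes],
    use_cached: bool = True,
) -> Path:
    """Return a local path to the ZIP for ``url``, downloading only when needed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache on the URL so that the same input maps to the same file.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".zip"
    path = cache_dir / name
    if use_cached and path.exists():
        return path
    # Either nothing is cached yet or re-downloading was requested.
    path.write_bytes(fetch(url))
    return path
```

Passing the download step in as an argument is what makes such behavior testable without network access: a test can supply a fake fetch and count how often it is called.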
Changed#
CHAT parsing for the header information is now more robust for varying whitespace characters between the head and its associated value.
Removed#
Dropped kwarg allow_remote in Reader.from_zip. This kwarg wouldn’t make any sense anymore, or at least would be confusing with the introduction of use_cached.
[0.14.1] - 2021-05-16#
Fixed#
The header/metadata has a more reasonable representation for emptiness when input data is empty.
[0.14.0] - 2021-05-12#
Added#
Added the parallel optional argument to the Reader methods {from_zip, from_dir, from_files, from_strs} so that parallelization can be turned off if desired.
Added the filter method to Reader for filtering data by file paths.
[0.13.3] - 2021-05-07#
Fixed#
The methods append, append_left, extend, and extend_left now work with a subclass of Reader, not just Reader itself.
[0.13.2] - 2021-05-02#
Fixed#
Fixed utterance cleaning so that it is now compatible with all CHILDES datasets.
[0.13.1] - 2021-03-23#
Fixed#
Fixed a CHAT parsing issue when correction and repetition are combined.
[0.13.0] - 2021-03-15#
API-breaking changes:
The Reader class has been completely rewritten.
A couple of methods have been removed, while others have been renamed.
For methods that remain (renamed or not), the output data structures and the allowed arguments have changed.
The details follow.
Added#
New classmethods of Reader for reader instantiation:
from_zip
from_dir
New classes to better structure CHAT data:
Utterance
Token
Gra
New Reader methods:
append_left, extend, extend_left, pop, pop_left
tokens (which gives Token objects, essentially the “tagged words” from before)
In the header dictionary, each participant’s info has the new key "dob" for date of birth (if the info is available in the CHAT header). The corresponding value is a datetime.date object. (The same info was previously exposed as the Reader method date_of_birth, now removed.)
The test suite now covers code snippets in both the docstrings and .rst doc files.
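To show what a "dob" value amounts to, the sketch below converts a CHAT-style date string such as 28-JUN-2001 into a datetime.date object. The function name parse_chat_date and the explicit month table are assumptions for illustration; this is not the parsing code used in pylangacq.

```python
import datetime

# Explicit month table, so the result does not depend on the system locale.
_MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def parse_chat_date(text: str) -> datetime.date:
    """Convert a DD-MMM-YYYY string (e.g. "28-JUN-2001") to a datetime.date."""
    day, month, year = text.strip().split("-")
    return datetime.date(int(year), _MONTHS[month.upper()], int(day))
```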
Changed#
CHAT parsing in Reader instantiation has been completely rewritten. The previous private class _SingleReader has been removed. This private class duplicated a lot of the Reader code, which made it hard to make changes.
The Reader rewrite has also greatly sped up the reading and parsing of CHAT data.
The by_files argument, which many Reader methods have, now gives you a simpler list of results for each data file, no longer the previous output of a dict that mapped a file path to the file’s result.
The participant argument, which many Reader methods have for specifying which participants’ data to include in the output, has been renamed as participants to avoid confusion. There is no change to its behavior of handling either a single string (e.g., "CHI") or a collection of strings (e.g., {"CHI", "MOT"}).
The following Reader methods have been renamed as indicated, some for stylistic or Pythonic reasons, others for reasons as given:
age -> ages
number_of_utterances -> n_utterances
number_of_files -> n_files
filenames -> file_paths
MLU -> mlu
MLUm -> mlum
MLUw -> mluw
TTR -> ttr
IPSyn -> ipsyn
word_frequency -> word_frequencies
from_chat_str -> from_strs
from_chat_files -> from_files
add -> append. Since the data files in a Reader have a natural ordering (by time of recording sessions, and therefore commonly by file paths as well), a reader is list-like rather than an unordered set of data files, which add would suggest.
participant_codes -> participants. Before this version, the methods participant_codes (for CHI, MOT, etc) and participants (for, say, Eve, Mother, Investigator, etc) co-existed, but in practice we mostly only care about CHI, MOT, etc. So the method participants for Eve etc has been removed, and participant_codes has been renamed as participants.
Each participant’s info in a header dictionary has these keys renamed:
participant_name -> name
participant_role -> role
SES -> ses (socioeconomic status)
The class DependencyGraph has been made private (i.e., now _DependencyGraph with a leading underscore). Its functionality hasn’t really changed (it’s used in the computation of IPSyn). It may be made more visible again in the future if more functionality related to grammatical relations is developed in the package.
Switched to sphinx-rtd-theme as the documentation theme.
Switched to CircleCI orbs; updated dev requirements’ versions.
Deprecated#
The following Reader methods have been deprecated:
tagged_sents (use tokens with by_utterances=True instead)
tagged_words (use tokens with by_utterances=False instead)
sents (use words with by_utterances=True instead)
Removed#
The following methods of the Reader class have been removed:
abspath. Use file_paths instead.
index_to_tiers. All the unparsed tiers are now available from utterances.
participant_codes. It’s been renamed as participants, another method now removed; see “Changed” above.
part_of_speech_tags
update and remove. A reader is a list-like collection of CHAT data files, not a set (which update and remove would suggest).
search and concordance. To search, use one of the words, tokens, and utterances methods to walk through a reader’s CHAT data and keep track of elements of interest.
date_of_birth. The info is now available under headers, in each participant’s "dob" key.
Fixed#
Handled [/-] in cleaning utterances.
[x <number>] means a repetition of the previous word/item, not repetition of the entire utterance.
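The [x <number>] interpretation noted above can be sketched as follows: only the item immediately before the marker is repeated, and the rest of the utterance is left alone. The function name expand_repetitions is hypothetical, and the sketch deliberately ignores angle-bracket scoping such as <...> [x 2]; it is an illustration, not pylangacq's cleaning code.

```python
import re

# A whitespace-delimited item followed by "[x N]", e.g. "yes [x 3]".
_REPETITION = re.compile(r"(\S+)\s+\[x\s+(\d+)\]")


def expand_repetitions(utterance: str) -> str:
    """Repeat the item before each "[x N]" marker N times in total."""

    def _expand(match: "re.Match[str]") -> str:
        item, count = match.group(1), int(match.group(2))
        return " ".join([item] * count)

    return _REPETITION.sub(_expand, utterance)
```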
[0.12.0] - 2020-10-11#
Added#
Added support for Python 3.9.
Enabled black to enforce styling consistency.
[0.11.0] - 2020-07-02#
Added#
Started testing Python 3.7 and 3.8 on continuous integration. (#9)
Added time marker support (available at _SingleReader), originally contributed at #3 by @hellolzc. (#8)
Changed#
Switched from Travis CI to CircleCI for autobuilds. (#9)
Switched README from reStructuredText to Markdown. (#9)
Removed conversational quotes in utterance processing; updated test CHAT file to match the latest CHILDES data. (#7)
Removed#
Dropped support for Python 2.7, 3.4, and 3.5. All code related to Python 2+3 cross compatibility was removed. (#9)
[0.10.0] - 2017-11-02#
Fixed unicode handling across Python 2 and 3.
Renamed method find_filename of Reader as abspath.
Fixed bug in Reader method decorators.
Handled multiple dates of recording in one CHAT file. The method dates_of_recording of a Reader instance now returns a list of dates.
Implemented the exclude parameter in various Reader methods for excluding specific participants.
Fixed bug in IPSyn.
[0.9.0] - 2017-10-25#
Python 2 and 3 cross compatibility
Renamed the grammar.py module as dependency.py
Rewrote the class DependencyGraph; it no longer subclasses networkx’s DiGraph (and networkx is removed as a dependency of this library)
Removed multiprocessing in reading data files. (Datasets are usually small enough that the performance gain, if any, wouldn’t be worth it for the potential issues w.r.t. spawning multiple processes.)
Developed capabilities to handle PhonBank data (%pho and %mod tiers)
Improved clean_utterance()
Added parameter encoding in read_chat()
Added get_lemma_from_mor()
Added date_of_recording() and date_of_birth(); removed date()
Added clean_word()
Restricted get_IPSyn() to only the first 100 utterances
Added tests
[0.8] - 2016-01-30#
Library now compatible only with Python 3.4 or above
For class Reader:
Defined read_chat() for initializing a Reader object
Added parameter by_files to various methods; removed the “all_” methods
Added reader manipulation methods: update(), add(), remove(), clear()
Added parameter sorted_by_age in filenames()
Added parameter month in age()
Added word_ngrams()
Added find_filename()
Added language development measures: MLUm(), MLUw(), TTR(), IPSyn()
Added search() and concordance()
Allowed regular expression matching for parameter participant
Added output formats for dependency graphs: to_tikz() and to_conll()
Distinguished participant_name and participant_role in metadata
The @Languages header contents are now treated as a list rather than a set, to preserve ordering in bi/multilingualism
Undid collapses in transcriptions such as [x 4]
Various bug fixes
[0.7] - 2016-01-06#
Added part_of_speech_tags() in SingleReader
Added “all X” methods in Reader
Bug fixes: clean_utterance(), DependencyGraph
[0.6] - 2015-12-27#
cha_lines optimized
Methods added: tagged_words(), words(), tagged_sents(), sents()
Tier detection revamped. tier_sniffer() method removed, with self.tier_markers in SingleReader now being a set of %-tier markers.
len() for SingleReader added
word_frequency() for SingleReader added
Module grammar added, with class DependencyGraph being set up
Static methods in classes pulled out
[0.5] - 2015-12-16#
New utterances() method for extracting utterances from transcripts
_clean_utterance method developed for filtering CHAT annotations away in utterances
Standardizing terminology: use “participant(s)” consistently instead of “speaker(s)”
[0.4] - 2015-12-13#
New number_of_utterances() method for both Reader and SingleReader
To avoid confusion, metadata() method is removed.
Extraction of utterances and tiers with dict index_to_tiers
[0.3] - 2015-12-09#
Class Reader can read multiple .cha files. The methods associated with Reader mostly return a dict mapping from an absolute-path filename to something. Reader depends on the class SingleReader for a single CHAT file.
Following the conventional CHILDES and CHAT terminology, the metadata() method in Reader is renamed headers() (though a “new” metadata() method is defined and points to headers() for convenience).
[0.2] - 2015-12-05#
new methods for class Reader: languages(), date(), participants(), participant_codes()
[0.1] - 2015-12-04#
first commit; set up the chat submodule
class Reader defined for reading CHAT files, with methods cha_lines(), metadata(), and age()