Word frequency and ngrams¶
The way words combine in natural language data is of great interest in all areas of linguistic research. PyLangAcq provides several handy functionality for word combinatorics.
We are often interested in word frequency information and word ngrams:
Counter dict(word: word token count)
Counter dict(ngram as a tuple: ngram token count)
Both methods return Counter dicts from collections in the Python standard library. As a result (and as an advantage!), all Counter methods apply to the return objects of these methods.
word_ngrams(n) (with the obligatory parameter
n) share the following optional parameters:
participant: see Child speech versus child directed speech
by_files: see Reader methods
keep_case: whether case distinction such as “Rocky” and “rocky” is kept; defaults to
True. If “Rocky” and “rocky” are treated as distinct, then they are two distinct word types and have separate key:value entries in the return object.
To illustrate, we set up a
based on Eve’s data from CHILDES Brown:
>>> import pylangacq as pla >>> eve = pla.read_chat('Brown/Eve/*.cha')
We use the
participant parameter and the
most_common() method available for Counter dicts below.
>>> word_freq = eve.word_frequency() # all participants >>> word_freq_CDS = eve.word_frequency(exclude='CHI') # child-directed speech >>> word_freq_child = eve.word_frequency(participant='CHI') # child speech >>> word_freq.most_common(5) [('.', 20130), ('?', 6358), ('you', 3695), ('the', 2524), ('it', 2365)] >>> word_freq_CDS.most_common(5) [('.', 9687), ('?', 4909), ('you', 3061), ('the', 1966), ('it', 1563)] >>> word_freq_child.most_common(5) [('.', 10443), ('?', 1449), ('I', 1199), ('that', 1051), ('a', 968)]
This tiny example already shows a key and expected difference (“you” versus “I”) between child speech and child directed speech in terms of pronoun distribution.
>>> from pprint import pprint >>> trigrams_CDS = eve.word_ngrams(3, exclude='CHI') # 3 for trigrams; for child-directed speech >>> trigrams_child = eve.word_ngrams(3, participant='CHI') # child speech >>> pprint(trigrams_CDS.most_common(10)) # lots of questions in child-directed speech! [(("that's", 'right', '.'), 178), (('what', 'are', 'you'), 149), (('is', 'that', '?'), 124), (('do', 'you', 'want'), 122), (('what', 'is', 'that'), 99), (('are', 'you', 'doing'), 94), (("what's", 'that', '?'), 92), (('is', 'it', '?'), 89), (('what', 'do', 'you'), 89), (('would', 'you', 'like'), 89)] >>> pprint(trigrams_child.most_common(10)) # was grape juice Eve's favorite? [(('grape', 'juice', '.'), 74), (('another', 'one', '.'), 55), (('what', 'that', '?'), 50), (('a', 'b', 'c'), 47), (('right', 'there', '.'), 45), (('in', 'there', '.'), 43), (('b', 'c', '.'), 42), (('hi', 'Fraser', '.'), 39), (('I', 'want', 'some'), 39), (('a', 'minute', '.'), 35)]