Word Length Frequency
The resources on this page were originally created by Dr. Aaron Bradley of Summit Middle School. I've done some reformatting to add clarity but that is about it. Enjoy!
During this lesson, you will learn several Python tools:
- How to extract text from a file and turn it into a string
- How to split the string into separate words
- How to calculate the number of words in a string
- How to calculate the length of each word in a string
- How to create a tally
- How to use "for loops"
How frequently do short words appear in a given book? How frequently do long words appear? How does frequency vary by text or by author? The goal of this project is to explore word length frequency in texts.
One way of analyzing word length frequency is to sit down with a book and start tallying the number of words of lengths 1, 2, 3, and so on. For example, in this paragraph,
"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to heaven, we were all going direct the other way - in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only." (Charles Dickens, A Tale of Two Cities)
approximately 32.5% of the words have length 3. I don't know about you, but manually analyzing a long text does not excite me.
In contrast to us, computers are fast and lack the capacity for impatience. They eat data for breakfast and then ask for a second helping, and a third, and a fourth, if they're available. In this project, we'll write a program to do the tallying for us. A simplifying assumption is that we won't worry about punctuation, so that, in the text above, "times," will count as a 6-letter word because of the comma. In other words, we'll treat only whitespace as marking word boundaries. When you feel ready, revisit this exploration and modify the program to get rid of this assumption.
Let's get started!
Extracting Words From a String
1. Try the following:
>>> x = "It was the best of times, it was the worst of times..."
>>> words = x.split()
>>> words
['It', 'was', 'the', 'best', 'of', 'times,', 'it', 'was', 'the', 'worst', 'of', 'times...']
>>> len(words)
12
>>> [len(word) for word in words]
[2, 3, 3, 4, 2, 6, 2, 3, 3, 5, 2, 8]
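2. Make sure that you understand what each statement accomplishes. x.split() separates the string x into a list of strings, breaking at whitespace (spaces, tabs, and line breaks). len(words) counts the words, and the last statement builds a list holding the length of each word. Verify that the list of word lengths matches the list of words.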
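When you later come back to drop the "punctuation counts as letters" assumption, one possible approach is sketched below. It is only a sketch: it trims punctuation from the two ends of each word using str.strip() and string.punctuation from the standard library, and the names stripped and clean_words are my own, not part of the original lesson.

```python
import string

x = "It was the best of times, it was the worst of times..."

# Split on whitespace, then trim punctuation from both ends of each piece.
# Punctuation inside a word (for example the apostrophe in "they're") is kept.
stripped = [word.strip(string.punctuation) for word in x.split()]

# A piece made only of punctuation (such as a lone "-") becomes empty; drop it.
clean_words = [word for word in stripped if word != ""]

clean_lengths = [len(word) for word in clean_words]
print(clean_words)
print(clean_lengths)
```

With this change, "times," and "times..." both count as 5-letter words. Note that string.punctuation covers only ASCII punctuation, so curly quotes in some texts would need extra handling.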
Tallying Lengths
Given a list as above,
>>> lengths = [len(word) for word in words]
>>> lengths
[2, 3, 3, 4, 2, 6, 2, 3, 3, 5, 2, 8]
we need to count the number of times that "2" appears, that "3" appears, and so on. To do so, we will maintain a list of tallies.
1. First, what is the longest length?
>>> max_length = max(lengths)
>>> max_length
8
2. Initially, the tallies are 0 for all lengths:
>>> tallies = [0 for x in range(max_length + 1)]
>>> tallies
[0, 0, 0, 0, 0, 0, 0, 0, 0]
Make sure that all possible lengths between 0 and 8 are covered. Why do we use range(max_length + 1) instead of, say, range(max_length)?
3. The first length in lengths is 2, so we should increase its corresponding tally:
>>> tallies[2] = tallies[2] + 1
>>> tallies
[0, 0, 1, 0, 0, 0, 0, 0, 0]
4. The next length in lengths is 3, so we should increase its corresponding tally:
>>> tallies[3] += 1
>>> tallies
[0, 0, 1, 1, 0, 0, 0, 0, 0]
Here I used a convenient shorthand for increasing tallies[3] by 1.
5. But this process is almost as tedious as counting manually! Let's make the computer work for us. First, let's reset tallies to all 0s:
>>> tallies = [0 for x in range(max_length + 1)]
>>> tallies
[0, 0, 0, 0, 0, 0, 0, 0, 0]
6. Now let's process all of lengths in one fell swoop:
>>> for length in lengths:
...     tallies[length] += 1
...
>>> tallies
[0, 0, 4, 4, 1, 1, 1, 0, 1]
Verify that the final word length tallies are as expected.
The "for loop" iterates over each item in the list lengths, temporarily calls the item length, and allows the indented block below it to act on length.
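If you would like the computer to do the verifying too, here is one way to cross-check the loop. It is a sketch rather than part of the original lesson: it recounts each length with the list method count(), and the name check is just an illustrative choice.

```python
lengths = [2, 3, 3, 4, 2, 6, 2, 3, 3, 5, 2, 8]
max_length = max(lengths)

# Tally with the for loop, as in the steps above.
tallies = [0 for x in range(max_length + 1)]
for length in lengths:
    tallies[length] += 1

# Independent check: lengths.count(n) returns how many times n appears.
check = [lengths.count(n) for n in range(max_length + 1)]

print(tallies == check)              # the two approaches should agree
print(sum(tallies) == len(lengths))  # every word should be tallied exactly once
```

The count() version is shorter but rescans the whole list once per length, so for a full book the single for loop does less work.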
Printing the Results
The data should be displayed to the user in a meaningful way. When analyzing multiple texts, seeing that words of length 3 appear in one text 51 times and in the other 69 times is not informative, because the texts may not contain the same total number of words. Relative frequencies are more informative. Therefore, we will print the fraction of words of length 1, 2, 3, and so on (multiply by 100 to get a percentage).
1. First, we need a way of printing numbers:
>>> x = 3
>>> print("Three: " + x)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
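Whoops! Python will not join a string and a number with +. (The exact wording of the error message varies between Python versions.) Let's convert x, which is a number, to a string:
>>> x = 3
>>> print("Three: " + str(x))
Three: 3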
That's better.
2. Now let's print the tallies data in a readable fashion:
>>> total_words = len(words)
>>> for length in range(len(tallies)):
...     freq = tallies[length]/total_words
...     print("Length: " + str(length) + " Frequency: " + str(freq))
...
Length: 0 Frequency: 0.0
Length: 1 Frequency: 0.0
Length: 2 Frequency: 0.3333333333333333
Length: 3 Frequency: 0.3333333333333333
Length: 4 Frequency: 0.08333333333333333
Length: 5 Frequency: 0.08333333333333333
Length: 6 Frequency: 0.08333333333333333
Length: 7 Frequency: 0.0
Length: 8 Frequency: 0.08333333333333333
You can of course play around with how you want to display the data.
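As one example of playing around with the display, the sketch below prints rounded percentages in aligned columns. It uses Python's built-in str.format() with format specifications. The layout and the name lines are my own choices, not something the original lesson prescribes.

```python
tallies = [0, 0, 4, 4, 1, 1, 1, 0, 1]
total_words = sum(tallies)

lines = []
for length in range(1, len(tallies)):   # skip length 0, which split() never produces
    percent = 100 * tallies[length] / total_words
    # {:2d} pads the length to 2 columns; {:5.1f} shows one decimal place.
    lines.append("Length: {:2d}  Percent: {:5.1f}%".format(length, percent))

for line in lines:
    print(line)
```

Rounding to one decimal place keeps the columns easy to compare from one text to the next.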
Putting it all Together
We've been using the interactive command line to explore each step of the process, but now it's time to write a single program. Open a new Python module, delete the default content, save it as "WordFrequency.py" in "My Programs," and assemble the above steps into one program:
1. Ask the user for the name of a file to read. (See this lesson to review how to use input().)
2. Read the file and extract the words. (see "Reading a File" section below)
Reading a File
1. Python makes reading and processing files simple. To read a file named "AFile.txt," first open it:
>>> file = open("AFile.txt")
[Be careful: you must use the exact file name along with the extension!]
2. Then read it:
>>> text = file.read()
3. The variable text is a string. Finally, close the file:
>>> file.close()
3. Perform the word length analysis.
4. Print the results.
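If you get stuck, here is a rough sketch of how the four steps might fit together. Treat it as one possible arrangement, not the official solution. The function names (read_words, tally_lengths, print_report, main) and the prompt text are my own, the "with" statement is a common alternative to calling close() yourself, and the encoding="utf-8" argument assumes your text file is saved as UTF-8.

```python
def read_words(filename):
    # "with" closes the file automatically once the block ends.
    # Assumption: the file is UTF-8 text.
    with open(filename, encoding="utf-8") as file:
        text = file.read()
    return text.split()


def tally_lengths(words):
    # Assumes words is not empty; max() raises an error on an empty list.
    lengths = [len(word) for word in words]
    tallies = [0 for x in range(max(lengths) + 1)]
    for length in lengths:
        tallies[length] += 1
    return tallies


def print_report(tallies, total_words):
    for length in range(len(tallies)):
        freq = tallies[length] / total_words
        print("Length: " + str(length) + " Frequency: " + str(freq))


def main():
    filename = input("Name of the file to analyze: ")
    words = read_words(filename)
    print_report(tally_lengths(words), len(words))
```

To run it, add a final line containing main() at the bottom of WordFrequency.py. Splitting the work into small functions lets you test each step on its own from the interactive prompt.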
Go to Project Gutenberg and select a few books (e.g., by Mark Twain; by Jane Austen; from English III, The Strange Case of Dr. Jekyll and Mr. Hyde, Much Ado about Nothing; from English IV, Frankenstein, Gawain and the Green Knight) to analyze. For each book, select the text format, save it in your "My Programs" directory, open it with Notepad, and delete the header and footer information, leaving only the actual text; then run your program, providing that file as the input file.
How do word length frequencies vary across the books? Across books from the same author? Across books from different authors?
Extensions
All programs can be improved. Here are some possible extensions, all of which will require further advancement in programming proficiency:
- Remove punctuation so that it does not affect the analysis.
- Plot the frequency distribution using text. For example:
  1      0.00
  2 xxx  0.33
  3 xxx  0.33
  4 x    0.08
  5 x    0.08
  6 x    0.08
  7      0.00
  8 x    0.08
- Plot the frequency data using graphics.
- Analyze other characteristics of texts: mean sentence length, mean number of sentences per paragraph, mean amount of punctuation per sentence or paragraph, standard deviations of these quantities, etc.
- Analyze multiple texts, and plot the data in a way that allows comparing among the texts.
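To give a feel for the text-plot extension listed above, here is a small sketch. It assumes one "x" stands for roughly 10% of the words, which matches the sample plot, but that scale and the name rows are my own choices.

```python
tallies = [0, 0, 4, 4, 1, 1, 1, 0, 1]
total_words = sum(tallies)

rows = []
for length in range(1, len(tallies)):
    freq = tallies[length] / total_words
    bar = "x" * round(freq * 10)   # one "x" per 10% of the words, rounded
    # ljust(10) pads the bar with spaces so the numbers line up in a column.
    rows.append("{} {} {:.2f}".format(length, bar.ljust(10), freq))

for row in rows:
    print(row)
```

Changing the 10 in both places to a larger number gives a finer-grained plot, at the cost of wider lines.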