We begin our analysis of "Walden" by computing a list of all words appearing in the text, together with the number of occurrences of each word:

In [53]:
walden = sorted_list('walden.txt')
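
For readers who do not have `sorted_list` at hand, the following is a minimal sketch of what such a function might do. It assumes the result is a list of `(word, count)` pairs sorted by decreasing count, which is consistent with how `walden` is indexed in the cells that follow. The name `word_counts` and the tokenization rule (lowercase text, runs of letters and apostrophes) are assumptions made for illustration, so the counts it yields need not match the ones reported here.

```python
import re
from collections import Counter

def word_counts(text):
    """Return (word, count) pairs sorted by decreasing count.

    Hypothetical stand-in for sorted_list; the tokenization rule
    (lowercase, runs of letters and apostrophes) is an assumption.
    """
    words = re.findall(r"[a-z']+", text.lower())
    # most_common() with no argument returns all pairs, highest count first
    return Counter(words).most_common()

sample = "The pond and the woods and the sky"
counts = word_counts(sample)
print(counts[:2])   # [('the', 3), ('and', 2)]
```

To apply the sketch to a file, one would read the file contents into a string first and pass that string to `word_counts`.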

The length of this list gives the number of unique words in the text:

In [54]:
print('Number of unique words in Walden: {}'.format(len(walden)))
Number of unique words in Walden: 10456

The next computation shows that almost half of these words appear in the text only once:

In [55]:
uniques = [w for w in walden if w[1] == 1]
print('Number of words appearing only once: {}'.format(len(uniques)))
Number of words appearing only once: 5025
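
The "almost half" claim can be checked directly from the two counts printed above. The sketch below hard-codes those two numbers so that it runs on its own; in the notebook itself one could use `len(uniques) / len(walden)` instead. The variable names are chosen for illustration.

```python
# Counts copied from the outputs above
n_unique = 10456   # number of unique words in the text
n_once = 5025      # number of words appearing exactly once

share = n_once / n_unique
print('Share of words appearing only once: {:.1%}'.format(share))
# Share of words appearing only once: 48.1%
```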

At the other end of the spectrum, there are words that appear in the text hundreds or even thousands of times. The 10 most frequently occurring words are as follows:

In [63]:
print('rank   word     occurrences')
print('----   ------   -----------')
for i in range(10):
    print('{:2}     {:8} {}'.format(i+1, walden[i][0], walden[i][1]))
rank   word     occurrences
----   ------   -----------
 1     the      6249
 2     of       3157
 3     a        2616
 4     to       2557
 5     and      2370
 6     in       1677
 7     i        1191
 8     it       1158
 9     is       1118
10     that     1022
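
To get a feel for how heavily these few words dominate the text, we can add up their counts. The sketch below copies the ten pairs from the table above so that it is self-contained; with the full list available, `sum(count for _, count in walden[:10])` would give the same total. The name `top10` is introduced here for illustration.

```python
# (word, count) pairs copied from the table above
top10 = [('the', 6249), ('of', 3157), ('a', 2616), ('to', 2557),
         ('and', 2370), ('in', 1677), ('i', 1191), ('it', 1158),
         ('is', 1118), ('that', 1022)]

total = sum(count for _, count in top10)
print('Occurrences of the top 10 words combined: {}'.format(total))
# Occurrences of the top 10 words combined: 23115
```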