List Algorithms

This chapter is a bit different from what we’ve done so far: rather than introduce more new Python syntax and features, we’re going to focus on the program development process, and some algorithms that work with lists.

As in all parts of this book, our expectation is that you, the reader, will copy our code into your Python environment, play and experiment, and work along with us.

Part of this chapter works with the book Alice in Wonderland and a vocabulary file. Your browser should be able to download and save these files from these links.

14.1. Test-driven development

Early in our Fruitful functions chapter we introduced the idea of incremental development, where we added small fragments of code to slowly build up the whole, so that we could easily find problems early. Later in that same chapter we introduced unit testing and gave code for our testing framework so that we could capture, in code, appropriate tests for the functions we were writing.

Test-driven development (TDD) is a software development practice which takes these practices one step further. The key idea is that automated tests should be written first. This technique is called test-driven because — if we are to believe the extremists — non-testing code should only be written when there is a failing test to make pass.

We can still retain our mode of working in small incremental steps, but now we’ll define and express those steps in terms of a sequence of increasingly sophisticated unit tests that demand more from our code at each stage.

We’ll turn our attention to some standard algorithms that process lists now, but as we proceed through this chapter we’ll attempt to do so in the spirit envisaged by TDD.
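Throughout this chapter our tests use the small test function from the testing framework we wrote in the Fruitful functions chapter. If you don’t have it handy, something minimal along these lines will do (the line-number reporting is just a convenience):

import sys

def test(did_pass):
    """ Print the result of a test, reporting the caller's line number. """
    linenum = sys._getframe(1).f_lineno    # line number of the call to test(...)
    if did_pass:
        print("Test at line {0} ok.".format(linenum))
    else:
        print("Test at line {0} FAILED.".format(linenum))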

14.2. The linear search algorithm

We’d like to know the index where a specific item occurs in a list of items. Specifically, we’ll return the index of the item if it is found, or we’ll return -1 if the item doesn’t occur in the list. Let us start with some tests:

friends = ["Joe", "Zoe", "Brad", "Angelina", "Zuki", "Thandi", "Paris"]
test(search_linear(friends, "Zoe") == 1)
test(search_linear(friends, "Joe") == 0)
test(search_linear(friends, "Paris") == 6)
test(search_linear(friends, "Bill") == -1)

Motivated by the fact that our tests don’t even run, let alone pass, we now write the function:

def search_linear(xs, target):
    """ Find and return the index of target in sequence xs """
    for (i, v) in enumerate(xs):
        if v == target:
            return i
    return -1

There are some points to learn here. We’ve seen a similar algorithm in section 8.10 when we searched for a character in a string. There we used a while loop; here we’ve used a for loop, coupled with enumerate to extract the (i, v) pair on each iteration. There are other variants — for example, we could have used range and made the loop run only over the indexes, or we could have used the idiom of returning None when the item was not found in the list. But the essential similarity in all these variations is that we test every item in the list in turn, from first to last, using the pattern of the short-circuit eureka traversal that we introduced earlier — we return from the function as soon as we find the target that we’re looking for.
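To make the first of those variants concrete, here is a sketch of the same search written with range over the indexes. It is only an illustration; the enumerate version above is the one we’ll keep using:

def search_linear_by_index(xs, target):
    """ Variant sketch: the same linear search, looping over indexes. """
    for i in range(len(xs)):
        if xs[i] == target:
            return i
    return -1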

Searching all items of a sequence from first to last is called a linear search. Each time we check whether v == target we’ll call it a probe. We like to count probes as a measure of how efficient our algorithm is, and this will be a good enough indication of how long our algorithm will take to execute.

Linear searching is characterized by the fact that the number of probes needed to find some target depends directly on the length of the list. So if the list becomes ten times bigger, we can expect to wait ten times longer when searching for things. Notice too, that if we’re searching for a target that is not present in the list, we’ll have to go all the way to the end before we can return the negative value. So this case needs N probes, where N is the length of the list. However, if we’re searching for a target that does exist in the list, we could be lucky and find it immediately in position 0, or we might have to look further, perhaps even all the way to the last item. On average, when the target is present, we’re going to need to go about halfway through the list, or N/2 probes.
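If you would like to see those probe counts for yourself, here is a small instrumented variant (a hypothetical helper of our own, not something we’ll use later) that returns the number of probes along with the result:

def search_linear_counting(xs, target):
    """ Sketch: return (index, probes) so we can measure the work done. """
    probes = 0
    for (i, v) in enumerate(xs):
        probes += 1                  # one probe per comparison against target
        if v == target:
            return (i, probes)
    return (-1, probes)

With the friends list from above, searching for "Paris" costs 7 probes, and searching for "Bill" also costs all 7.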

We say that this search has linear performance (linear meaning straight line) because, if we were to measure the average search times for different sizes of lists (N), and then plot a graph of time-to-search against N, we’d get a more-or-less straight line graph.

Analysis like this is pretty meaningless for small lists — the computer is quick enough not to bother if the list only has a handful of items. So generally, we’re interested in the scalability of our algorithms — how do they perform if we throw bigger problems at them. Would this search be a sensible one to use if we had a million or ten million items (perhaps the catalog of books in your local library) in our list? What happens for really large datasets, e.g. how does Google search so brilliantly well?

14.3. A more realistic problem

As children learn to read, there are expectations that their vocabulary will grow. So a child of age 14 is expected to know more words than a child of age 8. When prescribing reading books for a grade, an important question might be “which words in this book are not in the expected vocabulary at this level?”

Let us assume we can read a vocabulary of words into our program, and read the text of a book, and split it into words. Let us write some tests for what we need to do next. Test data can usually be very small, even if we intend to finally use our program for larger cases:

vocab = ["apple", "boy", "dog", "down",
         "fell", "girl", "grass", "the", "tree"]
book_words = "the apple fell from the tree to the grass".split()
test(find_unknown_words(vocab, book_words) == ["from", "to"])
test(find_unknown_words([], book_words) == book_words)
test(find_unknown_words(vocab, ["the", "boy", "fell"]) == [])

Notice we were a bit lazy, and used split to create our list of words — it is easier than typing out the list, and very convenient if you want to input a sentence into the program and turn it into a list of words.

We now need to implement the function for which we’ve written tests, and we’ll make use of our linear search. The basic strategy is to run through each of the words in the book, look it up in the vocabulary, and if it is not in the vocabulary, save it into a new resulting list which we return from the function:

def find_unknown_words(vocab, wds):
    """ Return a list of words in wds that do not occur in vocab """
    result = []
    for w in wds:
        if (search_linear(vocab, w) < 0):
            result.append(w)
    return result

We can happily report now that the tests all pass.

Now let us look at the scalability. We have a more realistic vocabulary in the text file that could be downloaded at the beginning of this chapter, so let us read in the file (as a single string) and split it into a list of words. For convenience, we’ll create a function to do this for us, and test it on a file we happen to have available:

def load_words_from_file(filename):
    """ Read words from filename, return list of words. """
    f = open(filename, "r")
    file_content = f.read()
    f.close()
    wds = file_content.split()
    return wds

bigger_vocab = load_words_from_file("vocab.txt")
print("There are {0} words in the vocab, starting with\n {1} "
              .format(len(bigger_vocab), bigger_vocab[:6]))

Python responds with:

There are 19469 words in the vocab, starting with
['a', 'aback', 'abacus', 'abandon', 'abandoned', 'abandonment']

So we’ve got a more sensible size vocabulary. Now let us load up a book; once again, we’ll use the one we downloaded at the beginning of this chapter. Loading a book is much like loading words from a file, but we’re going to do a little extra black magic. Books are full of punctuation, and have mixtures of lowercase and uppercase letters. We need to clean up the contents of the book. This will involve removing punctuation and converting everything to the same case (lowercase, because our vocabulary is all in lowercase). So we’ll want a more sophisticated way of converting text to words.

test(text_to_words("My name is Earl!") == ["my", "name", "is", "earl"])
test(text_to_words('"Well, I never!", said Alice.') ==
                             ["well", "i", "never", "said", "alice"])

There is a powerful translate method available for strings. The idea is that one sets up desired substitutions — for every character, we can give a corresponding replacement character. The translate method will apply these replacements throughout the whole string. So here we go:

def text_to_words(the_text):
    """ return a list of words with all punctuation removed,
        and all in lowercase.
    """

    my_substitutions = the_text.maketrans(
      # If you find any of these
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
      # Replace them by these
      "abcdefghijklmnopqrstuvwxyz                                          ")

    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

The translation turns all uppercase characters into lowercase, and all punctuation characters and digits into spaces. Then, of course, split will get rid of the spaces as it breaks the text into a list of words. The tests pass.
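If typing out those long strings feels error-prone, the same kind of table can be built from the constants in the standard string module. This is just an alternative sketch, not the version we use in the rest of the chapter:

import string

def text_to_words_alt(the_text):
    """ Alternative sketch: same cleanup, with the translation table
        built from string module constants rather than typed by hand.
    """
    to_space = string.digits + string.punctuation
    table = str.maketrans(string.ascii_uppercase + to_space,
                          string.ascii_lowercase + " " * len(to_space))
    return the_text.translate(table).split()

The two tests above should pass for this variant too.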

Now we’re ready to read in our book:

def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words. """
    f = open(filename, "r")
    content = f.read()
    f.close()
    wds = text_to_words(content)
    return wds

book_words = get_words_in_book("AliceInWonderland.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
           format(len(book_words), book_words[:100]))

Python prints the following (all on one line, we’ve cheated a bit for the textbook):

There are 27336 words in the book, the first 100 are
['alice', 's', 'adventures', 'in', 'wonderland', 'lewis', 'carroll',
    'chapter', 'i', 'down', 'the', 'rabbit', 'hole', 'alice', 'was',
    'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by',
    'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having',
    'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had',
    'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading',
    'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in',
    'it', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book',
    'thought', 'alice', 'without', 'pictures', 'or', 'conversation',
    'so', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind',
    'as', 'well', 'as', 'she', 'could', 'for', 'the', 'hot', 'day',
    'made', 'her', 'feel', 'very', 'sleepy', 'and', 'stupid',
    'whether', 'the', 'pleasure', 'of', 'making', 'a']

Well now we have all the pieces ready. Let us see what words in this book are not in the vocabulary:

>>> missing_words = find_unknown_words(bigger_vocab, book_words)

We wait a considerable time now, something like a minute, before Python finally works its way through this, and prints a list of 3398 words in the book that are not in the vocabulary. Mmm... This is not particularly scalable. For a vocabulary that is twenty times larger (you’ll often find school dictionaries with 300 000 words, for example), and longer books, this is going to be slow. So let us make some timing measurements while we think about how we can improve this in the next section.

import time

t0 = time.perf_counter()
missing_words = find_unknown_words(bigger_vocab, book_words)
t1 = time.perf_counter()
print("There are {0} unknown words.".format(len(missing_words)))
print("That took {0:.4f} seconds.".format(t1-t0))

We get the results and some timing that we can refer back to later:

There are 3398 unknown words.
That took 49.8014 seconds.

14.5. Removing adjacent duplicates from a list

We often want to get the unique elements in a list, i.e. produce a new list in which each different element occurs just once. Consider our case of looking for words in Alice in Wonderland that are not in our vocabulary. We had a report that there are 3398 such words, but there are duplicates in that list. In fact, the word “alice” occurs 398 times in the book, and it is not in our vocabulary! How should we remove these duplicates?

A good approach is to sort the list, then remove all adjacent duplicates. Let us start with removing adjacent duplicates:

test(remove_adjacent_dups([1,2,3,3,3,3,5,6,9,9]) == [1,2,3,5,6,9])
test(remove_adjacent_dups([]) == [])
test(remove_adjacent_dups(["a", "big", "big", "bite", "dog"]) ==
                                   ["a", "big", "bite", "dog"])

The algorithm is easy and efficient. We simply have to remember the most recent item that was inserted into the result, and avoid inserting it again:

def remove_adjacent_dups(xs):
    """ Return a new list in which all adjacent
        duplicates from xs have been removed.
    """
    result = []
    most_recent_elem = None
    for e in xs:
        if e != most_recent_elem:
            result.append(e)
            most_recent_elem = e

    return result

The amount of work done in this algorithm is linear — each item in xs causes the loop to execute exactly once, and there are no nested loops. So doubling the number of elements in xs should cause this function to run twice as long: the relationship between the size of the list and the time to run will be graphed as a straight (linear) line.
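If you want to convince yourself of this linear relationship, a rough experiment is to time the function on one list and on another twice its size; the exact numbers depend on your machine, but the second run should take roughly twice as long. A sketch of such an experiment:

import random
import time

smaller = sorted(random.randrange(1000) for i in range(100000))
bigger = sorted(random.randrange(1000) for i in range(200000))

for data in (smaller, bigger):
    t0 = time.perf_counter()
    remove_adjacent_dups(data)
    t1 = time.perf_counter()
    print("{0} items took {1:.4f} seconds".format(len(data), t1 - t0))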

Let us go back now to our analysis of Alice in Wonderland. Before checking the words in the book against the vocabulary, we’ll sort those words into order, and eliminate duplicates. So our new code looks like this:

all_words = get_words_in_book("AliceInWonderland.txt")
all_words.sort()
book_words = remove_adjacent_dups(all_words)
print("There are {0} words in the book. Only {1} are unique.".
                      format(len(all_words), len(book_words)))
print("The first 100 words are\n{0}".
           format(book_words[:100]))

Almost magically, we get the following output:

There are 27336 words in the book. Only 2570 are unique.
The first 100 words are
['_i_', 'a', 'abide', 'able', 'about', 'above', 'absence', 'absurd',
 'acceptance', 'accident', 'accidentally', 'account', 'accounting',
 'accounts', 'accusation', 'accustomed', 'ache', 'across', 'act',
 'actually', 'ada', 'added', 'adding', 'addressed', 'addressing',
 'adjourn', 'adoption', 'advance', 'advantage', 'adventures',
 'advice', 'advisable', 'advise', 'affair', 'affectionately',
 'afford', 'afore', 'afraid', 'after', 'afterwards', 'again',
 'against', 'age', 'ago', 'agony', 'agree', 'ah', 'ahem', 'air',
 'airs', 'alarm', 'alarmed', 'alas', 'alice', 'alive', 'all',
 'allow', 'almost', 'alone', 'along', 'aloud', 'already', 'also',
 'altered', 'alternately', 'altogether', 'always', 'am', 'ambition',
 'among', 'an', 'ancient', 'and', 'anger', 'angrily', 'angry',
 'animal', 'animals', 'ann', 'annoy', 'annoyed', 'another',
 'answer', 'answered', 'answers', 'antipathies', 'anxious',
 'anxiously', 'any', 'anything', 'anywhere', 'appealed', 'appear',
 'appearance', 'appeared', 'appearing', 'applause', 'apple',
 'apples', 'arch']

Lewis Carroll was able to write a classic piece of literature using only 2570 different words!
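As an aside, this also helps the vocabulary problem from earlier: with duplicates gone there are only 2570 words to look up rather than 27336, so even the slow linear search from before has roughly a tenth of the work to do. A rough sketch, reusing the variables defined above (your timing will differ from ours):

t0 = time.perf_counter()
missing_words = find_unknown_words(bigger_vocab, book_words)
t1 = time.perf_counter()
print("There are {0} unique unknown words.".format(len(missing_words)))
print("That took {0:.4f} seconds.".format(t1 - t0))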

14.6. Merging sorted lists

Suppose we have two sorted lists. Devise an algorithm to merge them together into a single sorted list.

A simple but inefficient algorithm is to append the two lists together and sort the result:

newlist = (xs + ys)
newlist.sort()

But this doesn’t take advantage of the fact that the two lists are already sorted, and is going to have poor scalability and performance for very large lists.

Let’s get some tests together first:

xs = [1,3,5,7,9,11,13,15,17,19]
ys = [4,8,12,16,20,24]
zs = xs+ys
zs.sort()
test(merge(xs, []) == xs)
test(merge([], ys) == ys)
test(merge([], []) == [])
test(merge(xs, ys) == zs)
test(merge([1,2,3], [3,4,5]) == [1,2,3,3,4,5])
test(merge(["a", "big", "cat"], ["big", "bite", "dog"]) ==
               ["a", "big", "big", "bite", "cat", "dog"])