The Programming Historian (Jul 2012)

From HTML to List of Words (part 2)

  • William J. Turkel
  • Adam Crymble

Abstract

In this lesson, you will learn the Python commands needed to implement the second part of the algorithm begun in From HTML to a List of Words (part 1). The first half of the algorithm gets the content of an HTML page and saves only the content that follows the tags. The second half of the algorithm does the following:

  • Look at every character in the pageContents string, one character at a time.
  • If the character is a left angle bracket (<), we are now inside a tag; ignore it and each following character.
  • If the character is a right angle bracket (>), we are now leaving the tag; ignore the current character, but look at each following character.
  • If we're not inside a tag, append the current character to a new variable: text.
  • Split the text string into a list of individual words that can later be manipulated further.
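
The second half of the algorithm described above can be sketched in Python roughly as follows. This is a minimal illustration, not the lesson's own code: the function names strip_tags and to_words and the sample string are hypothetical, and only the names pageContents (written here as page_contents) and text come from the abstract. It assumes that angle brackets appear only as tag delimiters.

```python
def strip_tags(page_contents):
    """Return the characters of page_contents that lie outside <...> tags."""
    inside = False   # True while we are between '<' and '>'
    text = ""        # accumulates every character found outside a tag

    for char in page_contents:
        if char == "<":
            # Entering a tag: skip this character and those that follow.
            inside = True
        elif char == ">":
            # Leaving a tag: skip this character, resume reading afterwards.
            inside = False
        elif not inside:
            text += char
    return text


def to_words(page_contents):
    """Strip the tags, then split the remaining text on whitespace."""
    return strip_tags(page_contents).split()


# Hypothetical sample input, used only for illustration.
sample = "<p>Hello, <b>world</b> of words.</p>"
words = to_words(sample)
print(words)  # ['Hello,', 'world', 'of', 'words.']
```

Splitting on whitespace leaves punctuation attached to the words, which is consistent with the abstract's note that the list can be manipulated further later. A stray < or > inside ordinary text would confuse this simple flag-based approach.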

Keywords