Split a string into a list of words in Python

How do I split a string into a list of words using Python?

The Solution

We can do this using the string method split:

    sentence = 'Jackdaws love my big sphinx of quartz.'
    wordlist = sentence.split()
    print(wordlist)
    # will print ['Jackdaws', 'love', 'my', 'big', 'sphinx', 'of', 'quartz.']

The split method takes an optional sep argument, allowing us to specify a substring to treat as the separator between items in our list. If no separator is specified, as in the above example, the string is split on each run of one or more whitespace characters (spaces, tabs, and newlines), and any leading or trailing whitespace is ignored.
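To make the effect of sep concrete, here is a small sketch comparing an explicit separator with the default behavior. The sample strings and variable names are made up for illustration and are not part of the original example.

```python
# With an explicit separator, every occurrence of sep marks a boundary,
# so two adjacent commas produce an empty string in the result.
csv_line = 'apple,banana,,cherry'
fields = csv_line.split(',')
print(fields)  # ['apple', 'banana', '', 'cherry']

# With no separator, runs of whitespace collapse into one boundary and
# leading/trailing whitespace is dropped.
messy = '  Jackdaws \t love\nquartz  '
default_words = messy.split()
print(default_words)  # ['Jackdaws', 'love', 'quartz']

# With sep=' ', each single space is its own boundary, so a double
# space yields an empty string between the words.
space_words = 'a  b'.split(' ')
print(space_words)  # ['a', '', 'b']
```

For splitting ordinary sentences into words, the no-argument form is usually the more convenient choice, since it does not produce empty strings when the input contains repeated spaces.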

Splitting the string on whitespace characters leaves sentence punctuation such as periods and commas attached to the neighboring word, as with quartz. in the example. If this is acceptable, you can stick with split. But if you would like to remove punctuation from the wordlist, use a regular expression with re.findall to build a list containing all words from the string. For example:

    import re

    sentence = 'Jackdaws love my big sphinx of quartz.'
    wordlist = re.findall(r'\b\w+\b', sentence)
    print(wordlist)
    # will print ['Jackdaws', 'love', 'my', 'big', 'sphinx', 'of', 'quartz']

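The regular expression \b\w+\b matches every run of one or more word characters (\w+ covers letters, digits, and underscores) that starts and ends at a word boundary (\b). A word boundary is a zero-width position between a word character and a non-word character such as a space, comma, or period, or the start or end of the string. Because apostrophes and hyphens are also non-word characters, this code will split words such as Jack's and mother-in-law into separate pieces, which you may not want. To avoid that, you need to make the regular expression a little more complicated: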

    import re

    sentence = "Jack's mother-in-law's favorite cozy tavern was a quaint pub where zebras, lynxes, and quokkas danced."
    wordlist = re.findall(r"\b\w+(?:[-']\w+)*\b", sentence)
    print(wordlist)
    # will print ["Jack's", "mother-in-law's", 'favorite', 'cozy', 'tavern', 'was', 'a', 'quaint', 'pub', 'where', 'zebras', 'lynxes', 'and', 'quokkas', 'danced']

Here, we’ve added (?:[-']\w+)* to the regular expression. This non-capturing group lets a word continue past an apostrophe or hyphen as long as at least one more word character follows, and it can repeat any number of times within a word. Our code now splits the sentence into a list of words, preserving words with apostrophes and hyphens, while discarding commas, periods, and other sentence-level punctuation. The regular expression can be further tweaked to serve different use cases.
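As one illustration of such a tweak, the sketch below runs the two patterns from this answer side by side and adds a third variant that also accepts the typographic apostrophe (U+2019). That third pattern, the constant names, and the sample text are assumptions made for this example and are not part of the original answer.

```python
import re

# The two patterns discussed in this answer.
SIMPLE = re.compile(r"\b\w+\b")
WITH_JOINERS = re.compile(r"\b\w+(?:[-']\w+)*\b")

# Hypothetical tweak: also treat the typographic apostrophe (U+2019)
# as a character that may join parts of a word.
WITH_CURLY = re.compile(r"\b\w+(?:[-'\u2019]\w+)*\b")

text = "Jack's mother-in-law can\u2019t come."

simple_words = SIMPLE.findall(text)
print(simple_words)
# ['Jack', 's', 'mother', 'in', 'law', 'can', 't', 'come']

joined_words = WITH_JOINERS.findall(text)
print(joined_words)
# ["Jack's", 'mother-in-law', 'can', 't', 'come']

curly_words = WITH_CURLY.findall(text)
print(curly_words)
# ["Jack's", 'mother-in-law', 'can’t', 'come']
```

Text copied from word processors or web pages often contains the typographic apostrophe, so whether to include it in the character class depends on where your input comes from.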

