Split a string into a list of words in Python

David Y.

The Problem

How do I split a string into a list of words using Python?

The Solution

We can do this using the string method split:

sentence = 'Jackdaws love my big sphinx of quartz.'
wordlist = sentence.split()
print(wordlist)  # will print ['Jackdaws', 'love', 'my', 'big', 'sphinx', 'of', 'quartz.']

The split method takes an optional sep argument, allowing us to specify a substring to treat as the separator between items in our list. If no separator is specified, as in the above example, the string is split on each run of one or more whitespace characters (spaces, tabs, and newlines).
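As a small sketch of the difference between passing sep and leaving it out, the following uses made-up sample strings (they are not from the example above) to show that an explicit separator splits on every occurrence, while the default collapses runs of whitespace:

```python
# Sample strings are illustrative assumptions, not part of the original answer.
csv_line = 'red,green,,blue'
messy = '  spaced   out\ttext\n'

# With an explicit separator, every occurrence splits,
# so two adjacent commas produce an empty string.
by_comma = csv_line.split(',')
print(by_comma)  # ['red', 'green', '', 'blue']

# With no separator, runs of whitespace count as one split point
# and leading/trailing whitespace is ignored.
by_whitespace = messy.split()
print(by_whitespace)  # ['spaced', 'out', 'text']

# Passing a single space explicitly is not the same as the default.
by_space = 'a  b'.split(' ')
print(by_space)  # ['a', '', 'b']
```

If the input may contain repeated or mixed whitespace, the no-argument form is usually the safer choice for word splitting.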

Splitting the string on whitespace characters preserves sentence punctuation such as periods and commas, like quartz. in the example. If this is acceptable, you can stick with split. But if you would like to remove punctuation from the wordlist, use a regular expression with re.findall to build a list containing all words from the string. For example:

import re

sentence = 'Jackdaws love my big sphinx of quartz.'
wordlist = re.findall(r'\b\w+\b', sentence)
print(wordlist)  # will print ['Jackdaws', 'love', 'my', 'big', 'sphinx', 'of', 'quartz']

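The regular expression \b\w+\b matches each run of one or more word characters (\w+, meaning letters, digits, and underscores) that sits between word boundaries (\b). A word boundary is the position between a word character and a non-word character such as a space, comma, or period, or the start or end of the string. However, because apostrophes and hyphens are not word characters, this code will also split on them, which you may not want. To avoid that, you need to make the regular expression a little more complicated: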

import re

sentence = "Jack's mother-in-law's favorite cozy tavern was a quaint pub where zebras, lynxes, and quokkas danced."
wordlist = re.findall(r"\b\w+(?:[-']\w+)*\b", sentence)
print(wordlist)  # will print ["Jack's", "mother-in-law's", 'favorite', 'cozy', 'tavern', 'was', 'a', 'quaint', 'pub', 'where', 'zebras', 'lynxes', 'and', 'quokkas', 'danced']

Here, we've added (?:[-']\w+)* to the regular expression. This non-capturing group lets a word continue through an apostrophe or hyphen, as long as at least one more word character follows it. Our code now splits the sentence into a list of words, preserving words with apostrophes and hyphens, while discarding commas, periods, and other sentence-level punctuation. The regular expression can be further tweaked to serve different use cases.
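To see the effect of the two patterns side by side, here is a minimal sketch that runs both on the same short input. The names SIMPLE and EXTENDED and the sample text are chosen here for illustration only:

```python
import re

# Pattern names and the sample text are illustrative assumptions.
SIMPLE = r"\b\w+\b"
EXTENDED = r"\b\w+(?:[-']\w+)*\b"

text = "Jack's mother-in-law"

# The simple pattern stops at every apostrophe and hyphen.
simple_words = re.findall(SIMPLE, text)
print(simple_words)  # ['Jack', 's', 'mother', 'in', 'law']

# The extended pattern keeps them when more word characters follow.
extended_words = re.findall(EXTENDED, text)
print(extended_words)  # ["Jack's", 'mother-in-law']
```

Because the group is non-capturing, re.findall returns the whole match for each word instead of only the last group's contents.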
