Why You Should Use Python 3 for Text Processing

Date:	2013-03-16
Speaker:	David Mertz

Why?

Native Unicode (Python 3's str is Python 2's unicode; Python 3's bytes is Python 2's str)
String and bytes have grown handy methods
The speaker wrote “Text Processing in Python” (download: http://gnosis.cx/TPiP)
Impressionistic review of nice-to-have improvements
These features are backported into 2.7
- 3.1 -> 2.7
- 3.2 -> 2.7.3
- 3.3 -> 2.7.4?

Cool Stuff in collections

namedtuple, OrderedDict, Counter, ChainMap

namedtuple

Useful for dealing with CSV and database rows:

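import csv, collections

with open('user.csv', newline='') as users:
    reader = csv.reader(users)
    headers = next(reader)  # the first row holds the column names
    # rename=True belongs to namedtuple, not csv.reader
    UserRecord = collections.namedtuple('UserRecord', headers, rename=True)
    for row in reader:
        print(UserRecord(*row))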

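If you set rename=True, field names that are not valid identifiers (reserved keywords, duplicates, names starting with a digit or underscore) are replaced with positional names such as _0 and _1
Instances use less memory than dicts (the class sets __slots__ = (), so there is no per-instance __dict__)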
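
A self-contained sketch of the same idea, with an in-memory file (io.StringIO) standing in for user.csv. The column names and rows are made up for illustration; the point is how rename=True turns the reserved word class into a positional field name.

```python
import collections
import csv
import io

# Hypothetical CSV data standing in for a real file; 'class' is a reserved word.
data = io.StringIO(
    "name,class,email\n"
    "alice,admin,alice@example.com\n"
    "bob,user,bob@example.com\n"
)

reader = csv.reader(data)
headers = next(reader)  # first row holds the column names

# rename=True swaps invalid field names for positional ones (_0, _1, ...).
UserRecord = collections.namedtuple('UserRecord', headers, rename=True)

records = [UserRecord(*row) for row in reader]
for rec in records:
    print(rec.name, rec.email)

# The second column was renamed because 'class' cannot be an attribute name.
print(UserRecord._fields)
```

The renamed column is still reachable by position (rec[1]) or as rec._1.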

Counters

Useful for histograms (such as commonality of letters):

import collections
c1 = collections.Counter('abracadabra')
print(c1.most_common(4))
# Counts may go negative; the unary +c1 that strips them again is 3.3 only:
# c1['d'] -= 10

Counters support pseudo-arithmetic; a Counter behaves much like a defaultdict whose missing keys default to 0:

c2 = collections.Counter('ramalama bim boom')
print((c1 + c2).most_common(4))
# +c1 (3.3+) returns a copy without zero or negative counts
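
The Counter behaviour described above can be tried end to end with a short sketch. It reuses the two strings from the notes; the variable names missing, total and positive are only illustrative.

```python
from collections import Counter

c1 = Counter('abracadabra')
c2 = Counter('ramalama bim boom')

# Missing keys read as 0 instead of raising KeyError.
missing = c1['z']

# Addition sums the counts key by key.
total = c1 + c2
print(total.most_common(1))

# Counts are allowed to go negative...
c1['d'] -= 10
# ...and unary plus (Python 3.3+) returns a new Counter
# that keeps only the strictly positive counts.
positive = +c1
print(sorted(positive))
```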

ChainMap

New in Python 3.3
Collection of mappings... “container of containers”
Sneaky equivalent of dynamic inheritance and MRO: lookups search each mapping in order

ChainMaps can include ChainMaps:

from collections import ChainMap
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}
chain = ChainMap(d1, d2)
print(chain['a'], chain['d'])
# => 1 4
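
A minimal sketch of the "container of containers" idea, including a ChainMap nested inside another ChainMap. The settings dictionaries are hypothetical and exist only to show the lookup order.

```python
from collections import ChainMap

defaults = {'color': 'red', 'user': 'guest'}   # hypothetical settings
overrides = {'user': 'admin'}

# Lookups search the maps left to right, like attribute lookup walking an MRO.
settings = ChainMap(overrides, defaults)
print(settings['user'], settings['color'])

# A ChainMap is itself a mapping, so it can be a link in another chain.
session = ChainMap({'color': 'blue'}, settings)
print(session['color'], session['user'])

# Writes only touch the first mapping; the dicts further down stay unchanged.
session['user'] = 'root'
print(session['user'], overrides['user'], defaults['user'])
```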

Unicode is hard

Most (not all!) Unicode is in the BMP (Basic Multilingual Plane)
All of Latin-1 sits in the first 256 code points of the BMP (U+0000 to U+00FF, i.e. high byte 00)
Internal encoding matters
- Fixed-width (UTF-32/UCS-4): Uses a lot of memory
- Variable-width (UTF-8): positional indexing is very slow, because finding the nth character means scanning from the start
- With UTF-16/UCS-2 you get the worst of everything:
  - Not strictly fixed-width (i.e. surrogate pairs)
  - Usually wasted memory
(Fixed-width means every code point takes the same number of bytes; variable-width means the byte count depends on the code point)
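
To make the fixed- versus variable-width distinction concrete, this sketch encodes four sample characters and counts the bytes each encoding needs. The sample characters and the helper name encoded_sizes are chosen for illustration only.

```python
def encoded_sizes(ch):
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}

samples = {
    'ASCII letter': 'a',
    'Latin-1 letter': '\u00e9',          # e with acute accent
    'BMP character': '\u20ac',           # euro sign
    'non-BMP character': '\U0001d11e',   # musical G clef
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

UTF-32 always uses four bytes, UTF-8 ranges from one to four, and UTF-16 uses two bytes inside the BMP but needs a four-byte surrogate pair outside it.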

PEP-393

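Strings whose characters all fit in Latin-1 are stored at one byte per character
Otherwise Python picks the most compact fixed width it can (2 or 4 bytes per character), based on the widest character in the string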
Python 3.3 also adds back explicit unicode literals such as u'bacon' (PEP 414).
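
The flexible storage can be observed indirectly with sys.getsizeof. This is a rough sketch that relies on a CPython 3.3+ implementation detail; other interpreters may report different numbers. The helper bytes_per_char is hypothetical, not part of any library.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string by n characters."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

print(bytes_per_char('a'))            # ASCII text
print(bytes_per_char('\u20ac'))       # euro sign, inside the BMP
print(bytes_per_char('\U0001d11e'))   # musical G clef, outside the BMP
```

A single wide character is enough to widen the storage of the whole string, which is why the estimate uses strings made of one repeated character.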

Pro Tips

str.startswith() and str.endswith() accept a tuple of strings as well as a single string
- (But not lists or other iterables)
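
A short sketch of the tuple form, using a made-up list of file names. It also shows the TypeError raised when a list is passed instead of a tuple.

```python
filenames = ['notes.txt', 'talk.rst', 'slides.pdf', 'README']  # hypothetical

# One call checks every suffix in the tuple.
text_suffixes = ('.txt', '.rst')
text_files = [name for name in filenames if name.endswith(text_suffixes)]
print(text_files)

# A list is rejected with TypeError; convert it with tuple() first.
rejected = False
try:
    'notes.txt'.endswith(['.txt', '.rst'])
except TypeError as exc:
    rejected = True
    print('TypeError:', exc)
```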

Module textwrap

People reimplement this over and over (cough Trigger)

Don’t roll your own:

textwrap.fill(s, width=35, initial_indent='| ', subsequent_indent='| ')

Dedent multiline strings:

multi_line = """
    foo
    bar
    bacon"""
multi_line = textwrap.dedent(multi_line)

textwrap.indent() new in Python 3.3:

def my_pred(line):
  return not line.endswith('wrote:\n')
print(textwrap.indent(s, '| ', predicate=my_pred))
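
The three textwrap calls above can be combined into one runnable sketch. The quoted message in s is invented for the example; my_pred is the predicate from the notes.

```python
import textwrap

# Hypothetical quoted email body; dedent strips the indentation
# that the triple-quoted block shares with the surrounding code.
s = textwrap.dedent("""\
    Alice wrote:
    Python 3 makes text processing much more pleasant than it used to be.
    """)

def my_pred(line):
    # Leave attribution lines such as "Alice wrote:" unprefixed.
    return not line.endswith('wrote:\n')

# indent() prefixes only the lines for which the predicate is true.
quoted = textwrap.indent(s, '| ', predicate=my_pred)
print(quoted)

# fill() wraps to the given width, counting the indent prefixes.
wrapped = textwrap.fill(s.splitlines()[1], width=35,
                        initial_indent='| ', subsequent_indent='| ')
print(wrapped)
```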

Module html.entities:

from html.entities import html5, entitydefs, codepoint2name
print(html5['Exists;'])
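
A slightly fuller sketch of the entity tables. The html.unescape() call at the end is not from the talk; it was added in Python 3.4 and is included here only as the usual way to apply these tables to a whole string.

```python
import html
from html.entities import html5, name2codepoint, codepoint2name

# HTML5 names keep the trailing semicolon in the html5 table.
exists = html5['Exists;']
print(exists)

# The older tables map between names (without semicolon) and code points.
print(codepoint2name[0xe9])
print(chr(name2codepoint['eacute']))

# Assumption: Python 3.4+ for html.unescape().
decoded = html.unescape('caf&eacute; &Exists; x')
print(decoded)
```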

Module unicodedata
- Get names of glyphs
- Validate, inspect
Proper quoting
- Hidden in pipes.quote()
- In Python 3.3 it’s moved to shlex.quote()
- Useful for generating shell scripts or CLI/subprocess args
Use format()
- Mini language
- More powerful and robust than ‘%s’ style.
- e.g. Thousands separator (the ',' option is not locale aware; use the 'n' presentation type for locale-aware grouping)
  - '${:,.2f}'.format(1000000) # => '$1,000,000.00'
  - This is in Python 2.7, also
Module email
- msg = email.message_from_file(...)
- payload = msg.get_payload()  (a list of sub-messages if multipart, otherwise a string)
- part.get_content_type() for each part in msg.walk()
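
The last four tips can be tried together in one sketch. The file name, the message text and the addresses are made up, and email.message_from_string() is used in place of message_from_file() only so the example needs no file on disk.

```python
import email
import shlex
import unicodedata

# unicodedata: look up a character's name and category, and go back by name.
name = unicodedata.name('\u00e9')
category = unicodedata.category('\u00e9')
same_char = unicodedata.lookup('LATIN SMALL LETTER E WITH ACUTE')
print(name, category)

# shlex.quote (Python 3.3+): make a string safe as a single shell argument.
filename = "my file; rm -rf ~.txt"      # hypothetical hostile file name
command = 'cat ' + shlex.quote(filename)
print(command)

# format mini-language: ',' inserts thousands separators.
price = '${:,.2f}'.format(1000000)
print(price)

# email: parse a message and inspect the content type of each part.
raw = "From: a@example.com\nSubject: hi\nContent-Type: text/plain\n\nHello!\n"
msg = email.message_from_string(raw)
types = [part.get_content_type() for part in msg.walk()]
body = msg.get_payload()   # a string here, because the message is not multipart
print(types, repr(body))
```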