###############################################
Why You Should Use Python 3 for Text Processing
###############################################

:Date: 2013-03-16
:Speaker: David Mertz

Why?
====

+ Native Unicode: Python 3 ``str`` is what Python 2 called ``unicode``, and
  Python 3 ``bytes`` is what Python 2 called ``str``
+ Strings and bytes have grown handy methods
+ The speaker wrote "Text Processing in Python" (download: http://gnosis.cx/TPiP)
+ Impressionistic review of nice-to-have improvements
+ Many of these features are backported into 2.7

  - 3.1 -> 2.7
  - 3.2 -> 2.7.3
  - 3.3 -> 2.7.4?

Cool Stuff in collections
=========================

+ namedtuple, OrderedDict, Counter, ChainMap

namedtuple
----------

+ Useful for dealing with CSV and database rows::

    import collections
    import csv

    with open('user.csv', newline='') as users:
        reader = csv.reader(users)
        headers = next(reader)  # first row holds the column names
        UserRecord = collections.namedtuple('UserRecord', headers, rename=True)
        for row in reader:
            print(UserRecord(*row))

+ ``rename=True`` is an argument to ``namedtuple()``, not to ``csv.reader()``.
  It replaces invalid field names (Python keywords, duplicates, names that
  are not identifiers) with positional names such as ``_1``
+ Less memory than dicts (instances set ``__slots__ = ()``, so there is no
  per-instance ``__dict__``)

Counters
--------

+ Useful for histograms (such as how common each letter is)::

    import collections
    c1 = collections.Counter('abracadabra')
    print(c1.most_common(4))
    # Counts can go negative:
    # c1['d'] -= 10

+ Pseudo-arithmetic on counts; a missing key reads as 0, much like a
  ``defaultdict`` with a default of 0::

    c2 = collections.Counter('ramalama bim boom')
    print((c1 + c2).most_common(4))
    # 3.3 only: unary plus drops zero and negative counts
    # +c1

ChainMap
--------

+ New in Python 3.3
+ Collection of mappings... "container of containers"
+ Sneaky equivalent to dynamic inheritance and MRO: lookups search the
  mappings in order and the first one that has the key wins
+ ChainMaps can include ChainMaps::

    from collections import ChainMap
    d1 = {'a': 1, 'b': 2}
    d2 = {'c': 3, 'd': 4}
    chain = ChainMap(d1, d2)
    print(chain['a'], chain['d'])  # => 1 4

Unicode is hard
===============

+ *Most* (not all!) Unicode is in the BMP (Basic Multilingual Plane)
+ All of Latin-1 is in the first 256 code points of the BMP (U+0000 to U+00FF)
+ Internal encoding matters

  - Fixed-width (UTF-32/UCS-4): uses a lot of memory
  - Variable-width (UTF-8): positional indexing is very slow
  - With UTF-16/UCS-2 you get the worst of everything:

    * Not strictly fixed-width (i.e. surrogate pairs)
    * Usually wasted memory

+ Fixed-width means every code point takes the same number of bytes;
  variable-width means the number of bytes depends on the code point

PEP-393
=======

+ Strings whose characters all fit in Latin-1 are stored at 1 byte per character
+ Python picks the most compact fixed-width form (1, 2, or 4 bytes per
  character) that can hold every character in the string.
+ v3.3 adds back explicit unicode literals ``u'bacon'`` (this part is PEP 414).

Pro Tips
========

+ ``str.startswith()`` and ``str.endswith()`` accept a tuple of strings as
  well as a single string

  - (But not lists or other iterables)

+ Module ``textwrap``

  - People reimplement this over and over (*cough* Trigger)
  - Don't roll your own::

      textwrap.fill(s, width=35, initial_indent='| ', subsequent_indent='| ')

  - Dedent multiline strings::

      multiline = """
          foo
          bar
          bacon"""
      multiline = textwrap.dedent(multiline)

  - ``textwrap.indent()`` new in Python 3.3::

      def my_pred(line):
          return not line.endswith('wrote:\n')
      print(textwrap.indent(s, '| ', predicate=my_pred))

+ Module ``html.entities``::

    from html.entities import html5, entitydefs, codepoint2name
    print(html5['Exists;'])

+ Module ``unicodedata``

  - Get names of characters
  - Validate, inspect

+ Proper quoting

  - Hidden in ``pipes.quote()``
  - In Python 3.3 it's moved to ``shlex.quote()``
  - Useful for generating shell scripts or CLI/subprocess args

+ Use ``format()``

  - Mini language
  - More powerful and robust than '%s' style.
  - e.g. thousands separator with ``,`` (not locale-aware; the ``n``
    presentation type is the locale-aware one)

    * ``'${:,.2f}'.format(1000000)  # => '$1,000,000.00'``
    * This is in Python 2.7, also

+ Module ``email``

  - ``msg = email.message_from_file(...)``
  - ``payload = msg.get_payload()`` (a list of ``Message`` parts when the
    message is multipart, otherwise a string)
  - ``part.get_content_type()`` on each part in ``payload``
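
Several of the tips above come without an example, so here is a minimal sketch of four of them, assuming Python 3.3 or later. The file name, the shell argument, and the two-part message are made up for illustration, and ``message_from_file`` is fed an in-memory ``io.StringIO`` rather than a real file.

```python
import email
import io
import shlex
import unicodedata

# startswith()/endswith() accept a tuple of candidates (a list raises TypeError).
filename = "notes.rst"  # hypothetical file name
is_text = filename.endswith((".txt", ".rst"))

# unicodedata maps a character to its official name and back again.
name = unicodedata.name("\u00e9")
char = unicodedata.lookup("GREEK SMALL LETTER ALPHA")

# shlex.quote() wraps a string so a POSIX shell treats it as one argument.
command = "ls -l " + shlex.quote("my file; rm -rf ~")

# A hypothetical two-part message, parsed from a file-like object.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>hello</p>\n"
    "--XX--\n"
)
msg = email.message_from_file(io.StringIO(raw))

# For a multipart message get_payload() returns a list of Message parts.
types = [part.get_content_type() for part in msg.get_payload()]

print(is_text, name, char)
print(command)
print(types)
```

For a message that is not multipart, ``get_payload()`` returns a string instead of a list, so real code would normally check ``msg.is_multipart()`` first.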