###############################################
Why You Should Use Python 3 for Text Processing
###############################################

:Date:
    2013-03-16

:Speaker:
    David Mertz

Why?
====

+ Native Unicode (``str`` is what Python 2 called ``unicode``; ``bytes`` is what Python 2 called ``str``)
+ ``str`` and ``bytes`` have grown handy methods
+ Wrote "Text Processing in Python" (download: http://gnosis.cx/TPiP)
+ Impressionistic review of nice-to-have improvements
+ These features are backported into 2.7

  - 3.1 -> 2.7
  - 3.2 -> 2.7.3
  - 3.3 -> 2.7.4?

Cool Stuff in collections
=========================

+ namedtuple, OrderedDict, Counter, ChainMap

namedtuple
----------

+ Useful for dealing with CSV and database rows::

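    import csv, collections
    users = open('user.csv', newline='')
    reader = csv.reader(users)
    # The first row holds the column names
    headers = next(reader)
    # rename=True belongs to namedtuple, not csv.reader
    UserRecord = collections.namedtuple('UserRecord', headers, rename=True)
    for row in reader:
        print(UserRecord(*row))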
  
+ If you pass ``rename=True`` to ``namedtuple``, invalid field names (keywords, duplicates) are replaced with positional names such as ``_1``
+ Less memory than dicts (because of ``__slots__``)
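
To see what ``rename=True`` does in practice, here is a minimal sketch. The CSV content and the ``Row`` name are made up for illustration; only ``csv.reader`` and ``collections.namedtuple`` come from the notes above.

```python
import collections
import csv
import io

# Hypothetical CSV with a keyword ("class") and a repeated header ("name").
data = io.StringIO("name,class,name\nalice,admin,Alice A.\n")
reader = csv.reader(data)
headers = next(reader)

# Invalid or duplicate headers are replaced by "_<position>".
Row = collections.namedtuple("Row", headers, rename=True)
fields = Row._fields
rows = [Row(*row) for row in reader]

print(fields)
print(rows[0].name, rows[0]._1)
```

Without ``rename=True`` the same headers would raise ``ValueError``, so the flag is mostly useful when the column names come from data you do not control.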

Counters
--------

+ Useful for histograms (such as commonality of letters)::

    import collections
    c1 = collections.Counter('abracadabra')
    print(c1.most_common(4))
    # Counts may go negative (cleaning them up with unary + is 3.3 only):
    # c1['d'] -= 10

+ Multiset arithmetic; missing keys default to 0, much like ``defaultdict(int)``::

    c2 = collections.Counter('ramalama bim boom')
    print((c1 + c2).most_common(4))
    # +c1 (new in 3.3) drops zero and negative counts
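
The Counter behaviour described above can be checked with a short, self-contained sketch. It reuses the two strings from the notes; the variable names are only illustrative.

```python
from collections import Counter

c1 = Counter("abracadabra")
c2 = Counter("ramalama bim boom")

missing = c1["z"]        # missing keys count as 0, no KeyError
combined = c1 + c2       # adds counts key by key

c1["d"] -= 10            # ordinary item update; counts may go negative
positive_only = +c1      # unary plus (Python 3.3+) drops counts <= 0

print(missing, combined["a"], c1["d"], "d" in positive_only)
```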

ChainMap
--------

+ New in Python 3.3
+ Collection of mappings... "container of containers"
+ Sneaky equiv. to dynamic inheritance and MRO
+ ChainMaps can include ChainMaps::

    from collections import ChainMap
    d1 = {'a': 1, 'b': 2}
    d2 = {'c': 3, 'd': 4}
    chain = ChainMap(d1, d2)
    print(chain['a'], chain['d'])
    # => 1 4
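
The "dynamic inheritance" comparison is easier to see when two mappings share a key. The following is a rough sketch; the settings dictionaries and their names are hypothetical and not from the talk.

```python
from collections import ChainMap

defaults = {"color": "blue", "user": "guest"}   # hypothetical settings
overrides = {"user": "david"}

settings = ChainMap(overrides, defaults)

# Lookups search the maps left to right, so overrides win.
user = settings["user"]
color = settings["color"]

# Writes only touch the first mapping; defaults stays untouched.
settings["color"] = "red"

# A ChainMap can itself be a member of another ChainMap.
session = ChainMap({"debug": True}, settings)

print(user, color, session["color"], defaults["color"])
```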

Unicode is hard
===============

+ *Most* (not all!) Unicode is in the BMP (Basic Multilingual Plane)
+ All of Latin-1 is in range 00 of the BMP
+ Internal encoding matters

  - Fixed-width (UTF-32/UCS-4): Uses a lot of memory
  - Variable-width (UTF-8): positional indexing is very slow
  - With UTF-16/UCS-2 you get the worst of everything:

    * Not strictly fixed-width (i.e. surrogate pairs)
    * Usually wasted memory

+ I have no idea what any of this means (fixed- vs. variable- width)
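
One way to make the fixed- versus variable-width distinction concrete is to count the bytes each encoding needs for a single character. This is only an illustrative sketch; the sample characters are my own choice.

```python
# Byte length of one character in each encoding.
# The "-le" codec variants are used so that no byte-order mark is added.
samples = {
    "a": "ASCII letter",
    "\u00e9": "Latin-1 letter (e with acute accent)",
    "\u20ac": "BMP symbol (euro sign)",
    "\U0001F600": "non-BMP character (emoji)",
}
encodings = ("utf-8", "utf-16-le", "utf-32-le")

widths = {}
for char, label in samples.items():
    widths[char] = tuple(len(char.encode(enc)) for enc in encodings)
    print(label, dict(zip(encodings, widths[char])))
```

UTF-32 always uses four bytes (fixed width), UTF-8 uses between one and four (variable width), and UTF-16 uses two for BMP characters but four for anything outside it, which is the surrogate-pair case mentioned above.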

PEP-393
=======

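+ Strings whose characters all fit in Latin-1 are stored one byte per character
+ Python stores each string in the most compact fixed-width form it can (1, 2, or 4 bytes per character).
+ v3.3 also adds back explicit unicode literals ``u'bacon'`` (PEP 414).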
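
The compact storage can be observed indirectly with ``sys.getsizeof``. This sketch assumes CPython 3.3 or later, since the layout is an implementation detail; ``bytes_per_char`` is a hypothetical helper written for this example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by growing a string of `ch` by n characters."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

# ASCII, Latin-1, BMP, and non-BMP sample characters.
widths = [bytes_per_char(ch) for ch in ("a", "\u00e9", "\u20ac", "\U0001F600")]
print(widths)
```

Measuring the difference between two lengths cancels out the fixed object header, so only the per-character cost remains.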

Pro Tips
========

+ ``str.startswith()`` and ``str.endswith()`` take a tuple of strings as well as a single string

  - (But not lists or other iterables)

+ Module ``textwrap``

  - People reimplement this over and over (*cough* Trigger)
  - Don't roll your own::

      textwrap.fill(s, width=35, initial_indent='| ', subsequent_indent='| ')

  - Dedent multiline strings::

      multiline = """
          foo
          bar
          bacon"""
      multiline = textwrap.dedent(multiline)

  - ``textwrap.indent()`` new in Python 3.3::

      def my_pred(line):
          return not line.endswith('wrote:\n')
      print(textwrap.indent(s, '| ', predicate=my_pred))
    
+ Module ``html.entities``::

    from html.entities import html5, entitydefs, codepoint2name
    print(html5['Exists;'])

+ Module unicodedata

  - Get names of glyphs
  - Validate, inspect

+ Proper quoting

  - Hidden in ``pipes.quote()``
  - In Python 3.3 it's moved to ``shlex.quote()``
  - Useful for generating shell scripts or CLI/subprocess args

+ Use ``format()``

  - Mini language
  - More powerful and robust than '%s' style.
  - e.g. Thousands separator (``,`` is not locale aware; use the ``n`` type for that)

    * ``'${:,.2f}'.format(1000000) # => '$1,000,000.00'``
    * This is in Python 2.7, also

+ Module email

  - ``msg = email.message_from_file(...)``
  - ``part = msg.get_payload(0)`` (first part of a multipart message)
  - ``part.get_content_type()``
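
Several of the tips in this list can be tried in one short script. This is a sketch under assumptions: the file name, the sample text, and the raw message are invented for illustration, and only standard-library calls are used.

```python
import email
import shlex
import textwrap
import unicodedata

# endswith()/startswith() accept a tuple of candidates (but not a list).
filename = "notes.rst"  # hypothetical file name
is_text = filename.endswith((".rst", ".txt"))

# textwrap.fill() wraps the text and prefixes every line in one call.
quoted = textwrap.fill("spam " * 12, width=35,
                       initial_indent="| ", subsequent_indent="| ")

# unicodedata.name() reports the official name of a character.
glyph_name = unicodedata.name("\u2203")

# shlex.quote() (Python 3.3+) makes a string safe to use in a shell command.
command = "cat " + shlex.quote("my file.txt")

# format() mini-language: "," groups thousands, ".2f" keeps two decimals.
price = "${:,.2f}".format(1000000)

# email: get_payload(0) returns the first part of a multipart message.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX--\n"
)
msg = email.message_from_string(raw)
part = msg.get_payload(0)
part_type = part.get_content_type()

print(is_text, glyph_name, command, price, part_type, sep="\n")
print(quoted)
```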