Detect encoding tags in Python

Posted 2012-04-22 19:50. Tagged python, unicode, hack, fileinput.

Please note that this post is 13 years old. The information herein may be outdated.

When iterating over a set of input files in Python, how do you decode the lines of each file with the correct encoding?

I wrote a hook for the fileinput module to look for a -*- coding: foo -*- marker in the first lines of each file.

In Python, there is a useful function called fileinput.input for reading input lines from multiple files, commonly used like so:

import fileinput

# files may come from the command line arguments
for line in fileinput.input(files):
    parse(line)

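Used as above, each line will consist of the encoded bytes exactly as they appear in the files (this is the Python 2 behaviour; Python 3 decodes with a platform-dependent default encoding instead). That is rarely useful, so there is a way to open each input file with a hook that decodes the data: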

for line in fileinput.input(files,
                            openhook=fileinput.hook_encoded('utf-8')):
    parse(line)

This is good if you know the file encoding when you write the program (or can supply it with e.g. a command line argument). But what if different input files may be written in different encodings?
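To make the problem concrete, here is a small sketch that writes the same text to two temporary files in different encodings and then reads both with a fixed utf-8 hook. The file names and the sample text are made up for illustration.

```python
import fileinput
import os
import tempfile

# Two hypothetical input files holding the same text in different encodings.
tmpdir = tempfile.mkdtemp()
paths = []
for name, encoding in [('a.txt', 'utf-8'), ('b.txt', 'latin-1')]:
    path = os.path.join(tmpdir, name)
    with open(path, 'wb') as f:
        f.write(u'caf\xe9\n'.encode(encoding))
    paths.append(path)

lines = []
error = None
try:
    # Every file is decoded as utf-8, whatever it really contains.
    for line in fileinput.input(paths,
                                openhook=fileinput.hook_encoded('utf-8')):
        lines.append(line)
except UnicodeDecodeError as exc:
    # The latin-1 file is not valid utf-8, so decoding it fails.
    error = exc
finally:
    fileinput.close()

print(lines)
print(type(error).__name__)
```

The first file decodes fine, while the second stops the loop with a UnicodeDecodeError. A single hard-coded encoding only works as long as every input file agrees with it.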

Luckily, there is a convention often used to specify the encoding of a file at the beginning of the file itself, in rather plain text, like so:

#  -*- coding: utf-8 -*-
actual content goes here.
The first line doesn't have to start with a hash, but is often
a comment in whatever language / format the file is in.

I didn’t find any existing way to detect the coding tag, which would let me stop worrying about encoding issues and simply read the files as intended, so I wrote one. Maybe it can be of use to someone else?

import codecs
import fileinput
import re

def detect_encoding(default='utf-8'):
    """A hook for fileinput to detect encoding tags at start of each file."""
    def open_hook(filename, mode):
        # Read the first two lines as raw bytes. latin-1 maps every byte
        # to a character, so decoding them for the search cannot fail.
        with open(filename, 'rb') as f:
            head = (f.readline() + f.readline()).decode('latin-1')
        # Check the two first lines for an encoding mark!
        coding = re.search(r'-\*- +(en)?coding: (?P<c>[a-z0-9_-]+) +-\*-',
                           head)
        # Open the file again, decoding with the detected or default encoding.
        return codecs.open(filename, mode,
                           coding.group('c') if coding else default)
    return open_hook

for line in fileinput.input(sourcefiles, openhook=detect_encoding()):
    parse(line)

As you can see, I open each file (in raw bytes mode) and read two lines from it, before closing it and opening it again with the correct encoding.
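The pattern in the hook is fairly strict, so it can help to see which tag lines it accepts. The sketch below applies the same regular expression to a few sample lines. The helper name find_coding and the samples are only there for illustration.

```python
import re

# The same pattern as in the hook above.
CODING_TAG = re.compile(r'-\*- +(en)?coding: (?P<c>[a-z0-9_-]+) +-\*-')

def find_coding(text):
    """Return the encoding named by a coding tag in text, or None."""
    match = CODING_TAG.search(text)
    return match.group('c') if match else None

samples = [
    '#  -*- coding: utf-8 -*-',             # the example from this post
    '/* -*- encoding: iso-8859-1 -*- */',   # "encoding" and other comment styles work
    '# -*- coding: UTF-8 -*-',              # upper case names are not matched
    '# coding: utf-8',                      # no -*- markers, not matched
]
for sample in samples:
    print('%-40s %s' % (sample, find_coding(sample)))
```

Lines that are not matched simply fall back to the default encoding. If upper case names or tags with extra fields matter for your files, the pattern would need to be loosened.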

Yes, I’m sure there is a performance penalty to this, but I can’t see a more efficient way to do it in Python. Can you?
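One possible answer, as a sketch rather than a tested improvement: open the file once in binary mode, read the two lines, seek back to the start and wrap the same file object in a decoder with io.TextIOWrapper. The name detect_encoding_once and the temporary files are hypothetical, and whether this is measurably faster has not been benchmarked here.

```python
import fileinput
import io
import os
import re
import tempfile

# Same pattern as in the hook above, applied to bytes.
CODING_TAG = re.compile(br'-\*- +(en)?coding: (?P<c>[a-z0-9_-]+) +-\*-')

def detect_encoding_once(default='utf-8'):
    """Hypothetical variant of the hook that opens each file only once."""
    def open_hook(filename, mode):
        f = io.open(filename, 'rb')
        try:
            head = f.readline() + f.readline()
            f.seek(0)  # rewind so the tag lines are still part of the input
        except Exception:
            f.close()
            raise
        coding = CODING_TAG.search(head)
        encoding = coding.group('c').decode('ascii') if coding else default
        # Wrap the already open binary file in a text decoder.
        return io.TextIOWrapper(f, encoding=encoding)
    return open_hook

# Try it on two made-up files: one tagged as latin-1, one untagged utf-8.
tmpdir = tempfile.mkdtemp()
tagged = os.path.join(tmpdir, 'tagged.txt')
untagged = os.path.join(tmpdir, 'untagged.txt')
with io.open(tagged, 'wb') as f:
    f.write(u'# -*- coding: latin-1 -*-\ncaf\xe9\n'.encode('latin-1'))
with io.open(untagged, 'wb') as f:
    f.write(u'caf\xe9\n'.encode('utf-8'))

lines = list(fileinput.input([tagged, untagged],
                             openhook=detect_encoding_once()))
fileinput.close()
print(lines)
```

One difference from the codecs.open version is that io.TextIOWrapper translates line endings to '\n' by default, so files with Windows line endings would come out slightly differently.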
