Detect encoding tags in Python
Posted 2012-04-22 19:50. Tagged python, unicode, hack, fileinput.
When iterating over a set of input files in Python, how to decode the lines of each file with the correct encoding?
I wrote a hook for the fileinput module to look for a
-*- coding: foo -*- marker in the first lines of each file.
In Python, there is a useful function, fileinput.input(), for
reading input lines from multiple files, commonly used like so:

```python
import fileinput

# files may be from command line arguments
for line in fileinput.input(files):
    process(line)
```
Used as above, each line will consist of the encoded bytes exactly as they appear in the files. That is rarely useful, so there is a way to open each input file with a hook that decodes the data:
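For a single known encoding, the standard library already provides such a hook, fileinput.hook_encoded(). A minimal sketch (the throwaway sample file and its contents are my own, just for illustration):

```python
import fileinput
import io
import os
import tempfile

# A throwaway UTF-8 sample file (path and contents are just for illustration).
fd, path = tempfile.mkstemp()
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as f:
    f.write('blåbær\n')

# hook_encoded tells fileinput to decode every file with the given codec.
lines = list(fileinput.input([path], openhook=fileinput.hook_encoded('utf-8')))
os.remove(path)
```

Each yielded line is now a decoded string rather than raw bytes.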
This is good if you know the file encoding when you write the program (or can supply it via e.g. a command-line argument). But what if different input files are written in different encodings?
Luckily, there is a convention often used to specify the encoding of a file at the beginning of the file itself, in fairly plain text, like so:
```
# -*- coding: utf-8 -*-
actual content goes here
```

The first line doesn't have to start with a hash, but it is typically a comment in whatever language / format the file is in.
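The tag can be picked out with a small regular expression. The pattern below follows the PEP 263 convention; the variable names are mine:

```python
import re

# PEP 263-style pattern: matches "coding: utf-8" as well as "coding=utf-8".
coding_re = re.compile(rb'coding[:=]\s*([-\w.]+)')

first_line = b'# -*- coding: latin-1 -*-\n'
match = coding_re.search(first_line)
encoding = match.group(1).decode('ascii')  # → 'latin-1'
```

Searching the raw bytes avoids having to decode the line before we know its encoding.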
I didn’t find any existing way to detect the coding tag so that I could just ignore encoding issues and read the files as intended, so I wrote one. Maybe it can be of use to someone else?
"""A hook for fileinput to detect encoding tags at start of each file.""" = # Check the two first lines for an encoding mark! = return return
As you can see, I open each file (in raw bytes mode) and read two lines from it, before closing it and opening it again with the correct encoding.
Yes, I’m sure there is a performance penalty to this, but I can’t see a more efficient way to do it in Python. Can you?
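Putting it together, here is a self-contained sketch of the whole scheme; the hook name, the sample file names, and their contents are my own invention:

```python
import fileinput
import io
import os
import re
import tempfile

# PEP 263-style declaration, e.g. "-*- coding: utf-8 -*-"
coding_re = re.compile(rb'coding[:=]\s*([-\w.]+)')

def hook_encoding_detect(filename, mode):
    """Open filename, decoding with a tag found in its first two lines."""
    encoding = 'utf-8'  # fallback when no tag is found
    with io.open(filename, 'rb') as f:
        for raw in (f.readline(), f.readline()):
            match = coding_re.search(raw)
            if match:
                encoding = match.group(1).decode('ascii')
                break
    return io.open(filename, mode, encoding=encoding)

# Two sample files in different encodings, each declaring its own tag.
tmpdir = tempfile.mkdtemp()
samples = {
    'a.txt': ('latin-1', '# -*- coding: latin-1 -*-\nblåbær\n'),
    'b.txt': ('utf-8', '# -*- coding: utf-8 -*-\nblåbær\n'),
}
paths = []
for name, (enc, text) in samples.items():
    path = os.path.join(tmpdir, name)
    with io.open(path, 'w', encoding=enc) as f:
        f.write(text)
    paths.append(path)

# Both files decode correctly, each with its own detected encoding.
lines = list(fileinput.input(paths, openhook=hook_encoding_detect))
```

Every non-tag line comes back as the same decoded string, no matter which encoding the file on disk used.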