Monday, August 28, 2006

How can I tell the difference between a text file and a binary file?

Short answer: You can't. The question does not make sense. The "text" file type is a subset of binary files. Let's make the question more generic: how do I tell the difference between file type X and a binary file (where X is any chosen file type: text, MP3, WAV, etc.)? Or, to put it in a different domain, how do I tell the difference between oak and wood? Now does it make a bit more sense why you can't ask this question? If not, read on...

A binary file contains binary data. Binary data is composed of bytes with values ranging from 0 to 255. Therefore, every file is a binary file. A file of any given type is a binary file which has a specific structure. What differentiates one type of file from another is simply convention and the interpretation of the structure of the data. Many file types have specific values which are expected at particular byte offsets within the file. If these values are not found, then the file is not of that type. Of course, finding those bytes at the expected locations does not guarantee that the file is of that type; it only indicates that it might be.
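As a rough illustration of checking for expected values at fixed offsets, here is a minimal Python sketch. The table and the function name guess_types are made up for this example; the signatures listed are the well-known leading bytes of those formats, all at offset 0. A match only suggests a type, it never proves it.

```python
# Illustrative table: well-known leading bytes ("magic numbers") at offset 0.
# The table and the function name are hypothetical, chosen for this sketch.
MAGIC_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "gif": b"GIF8",
    "pdf": b"%PDF-",
    "zip": b"PK\x03\x04",
}


def guess_types(data):
    """Return the names of all formats whose signature starts the data.

    An empty list means none of the listed formats matched. A non-empty
    list is only a hint: the bytes could be a coincidence.
    """
    return [name for name, sig in MAGIC_SIGNATURES.items()
            if data.startswith(sig)]
```

To use it on a real file, read the first few bytes in binary mode, for example open(path, "rb").read(16), and pass them in.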

The text file type typically does not have this kind of structure, an exception being some of the Unicode encodings. A file in one of these may begin with a byte order mark (BOM): two bytes for UTF-16, three bytes for UTF-8. If these bytes are present, it's assumed the file is Unicode text. The mark is optional, though, so its absence proves nothing.
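A small sketch of that check in Python, using the byte order mark constants from the standard codecs module. The function name detect_bom and the returned labels are choices made for this example. The UTF-32 marks are tested first because the little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs

# Longest marks first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE
# (FF FE), so checking UTF-16 first would misreport UTF-32 data.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return a label for the byte order mark at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of None does not mean the file is not text; plenty of text files carry no mark at all.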

All this said, for some uses, it might be possible to define, in a limited way, what it means to be a text file. One definition would be: if the file contains any byte value other than 9 (tab), 10 (new line), 13 (carriage return) or a value in the range 32 to 127, then it is not a text file. (Strictly speaking, 127 is the DEL control character, so printable ASCII ends at 126.) The downside is that this rules out accented characters and other control characters which some applications might use. The definition could be expanded to include the accented characters in the range 128 through 255, as used by single-byte encodings such as Latin-1. However, this now covers most of the range of byte values, so many files that are not text would pass the test.
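Here is a minimal Python sketch of that limited definition. The names ASCII_TEXT, HIGH_BYTES and looks_like_text are invented for this example, and taking the high range as 128 through 255 is an assumption; adjust the sets to whatever definition suits your use.

```python
# Allowed byte values under the narrow definition: tab, new line,
# carriage return, and 32 through 127.
ASCII_TEXT = frozenset([9, 10, 13]) | frozenset(range(32, 128))

# Assumed high range for accented characters in single-byte encodings.
HIGH_BYTES = frozenset(range(128, 256))


def looks_like_text(data, allowed=ASCII_TEXT):
    """Return True if every byte in data is in the allowed set.

    Empty data trivially passes, since it contains no disallowed byte.
    """
    return all(byte in allowed for byte in data)
```

With the default set, b"caf\xe9" (the word "café" in Latin-1) is rejected; passing allowed=ASCII_TEXT | HIGH_BYTES accepts it, at the cost of accepting a lot of data that is not text.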

The bottom line is that every file is a binary file. Every other type of file is a matter of interpretation of the binary data.
