2008-09-20

Data file format

Initially the XML format was heavily hyped. It was praised for its flexibility and human readability (clear text), and hailed as the ultimate solution for every export and import problem. Although it is true that the format can be used for many different things, I never liked it.

The format is flexible, that is true, but the other claimed advantages are myths:
  • Human readability:
    Yes, it is a text format. But in practice it can be very difficult to get an overview of a larger file, because without an appropriate tool the matching opening and closing tags are not obvious unless the text is really well formatted.

  • Data Exchange (import, export and application interoperability):
    The great flexibility of XML could be seen as a disadvantage here, but since any common format has to be flexible, it cannot really be counted against XML. The problem lies elsewhere: no matter which format you use, you always need additional definitions for the tags, columns, sections or whatever - which elements are required, which are optional and exactly which values are allowed. For application interoperability you always need a detailed specification of how the format is used and which elements it contains. So the file format itself can never be a general solution.

  • Simple and efficient:
    Although the general rules are simple, there are a lot of optional extras such as DTD declarations, parameters of tags and XML headers. When I once tried to write a parser for XML, I came to the conclusion that there cannot be an algorithm that reads XML with really good performance. Parsing other formats like CSV, INI and so on will always be faster. If you have to reach your goal quickly, you need to use existing libraries that parse the XML for you. Furthermore, XML cannot be efficient simply because of its character overhead. XML is simply big.

What developers have already noticed has not yet been widely publicized, but alternatives such as YAML and JSON have already been created. Both formats are very easy to learn, easy for humans to read, easy to create and easy to parse. They also produce smaller files and hence less traffic. So I think these formats should be emphasized more in the future. However, a lot of people are still betting on XML...
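
To make the size argument a bit more concrete, here is a minimal Python sketch that writes the same small record once as JSON and once as XML and prints the length of each. The record and the element names (record, id, name, tags, tag) are invented for illustration, and the XML layout is only one possible encoding: a layout based on attributes would be more compact, so the result is an indication rather than a measurement.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample record, invented purely for this comparison.
record = {"id": 42, "name": "Example", "tags": ["a", "b", "c"]}


def to_json(rec):
    # Compact separators so that optional whitespace is not counted.
    return json.dumps(rec, separators=(",", ":"))


def to_xml(rec):
    # One element per field; every value is wrapped in an opening and a closing tag.
    root = ET.Element("record")
    ET.SubElement(root, "id").text = str(rec["id"])
    ET.SubElement(root, "name").text = rec["name"]
    tags = ET.SubElement(root, "tags")
    for tag in rec["tags"]:
        ET.SubElement(tags, "tag").text = tag
    return ET.tostring(root, encoding="unicode")


json_text = to_json(record)
xml_text = to_xml(record)

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

With this element-per-field layout the XML version is longer, mostly because every field name appears twice, once in the opening and once in the closing tag.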

An argument for XML could be the possibility of specifying element types and so on. Well, in any case there must be business logic that checks the input (in whatever format) for its validity. I do not think that all validation can be done through XML definitions. As code logic may be involved anyway, it is better to have only one place where the checks are done.
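
The idea of having only one place for the checks can be sketched roughly like this in Python. The order record, its fields (customer, quantity) and the function names are all hypothetical; the point is only that the same validation function is applied to the data regardless of whether it arrived as JSON or as XML.

```python
import json
import xml.etree.ElementTree as ET


def validate_order(order):
    """Return a list of problems; an empty list means the input is acceptable."""
    errors = []
    if not order.get("customer"):
        errors.append("customer is required")
    try:
        quantity = int(order.get("quantity", ""))
    except (TypeError, ValueError):
        errors.append("quantity must be an integer")
    else:
        if quantity < 1:
            errors.append("quantity must be at least 1")
    return errors


def from_json(text):
    # JSON objects map directly onto a dict.
    return json.loads(text)


def from_xml(text):
    # Flat mapping of child element names to their text (assumes no nesting).
    return {child.tag: child.text for child in ET.fromstring(text)}


json_errors = validate_order(from_json('{"customer": "ACME", "quantity": 0}'))
xml_errors = validate_order(
    from_xml("<order><customer>ACME</customer><quantity>0</quantity></order>")
)

print(json_errors)
print(xml_errors)
```

Both inputs are rejected by the same rule, and no schema or DTD is needed for that particular check.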

Related post: Document file format, Ignorance of the different.
