XML is a cheap DSL (unplannedobsolescence.com)

by y1n0 • 265 comments • 276 points

265 comments

[−] necovek 63d ago

XML is notoriously expensive to properly parse in many languages. Basically, the entire world centers around 3 open source implementations (libxml2, expat and Xerces), if you want to get anywhere close to actual compliance. Even with them, you might hit challenges (libxml2 was largely unmaintained recently, yet it is the basis for many bindings in other languages).

The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags), and have two axes for adding metadata: one being the tag name, another being attributes.

So while it is a suitable DSL for many things (it is also seeing new life in web components definition), we are mostly only talking about XML-lookalike language, and not XML proper. If you go XML proper, you need to throw "cheap" out the window.

Another comment to make here is that you can have an imperative looking DSL that is interpreted as a declarative one: nothing really stops you from saying that

  totalOwed = totalTax - totalPayments
  totalTax = tentativeTaxNetNonRefundableCredits + totalOtherTaxes
  totalPayments = totalEstimatedTaxesPaid +
                      totalTaxesPaidOnSocialSecurityIncome +
                      totalRefundableCredits

means exactly the same as the XML-alike DSL you've got.

One declarative language looking like an imperative language but really using "equations" which I know about is METAFONT. See eg. https://en.wikipedia.org/wiki/Metafont#Example (the example might not demonstrate it well, but you can reorder all equations and it should produce exactly the same result).

[−] Someone1234 63d ago

I keep seeing people make the same mistake as XML made over and over; without learning from it. I will clarify the problem thusly:

> The more capabilities you add to a interchange format, the harder that format is to parse.

There is a reason why JSON is so popular, it supports so little, that it is legitimately easy to import. Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.

CSV may be under-specified, but it remains popular largely due to its simplicity to produce/consume. Unfortunately, we're seeing people slowly ruin JSON by adding e.g. commands to the format, with others than using those "comments" to hold data (e.g. type information), which must be parsed. Which is a bad version of an XML Attribute.

[−] GuB-42 63d ago

I think JSON has the opposite problem, it is too simple, the lack of comments in particular is particularly bad for many common usages of the format today.

I know some implementations of JSON support comments and other things, but is is not true JSON, in the same way that most simple XML implementations are not true XML. That's what I say "opposite problem", XML is too complex, and most practical uses of XML use incomplete implementations, while many practical uses of JSON use extended implementations.

By the way, this is not a problem for what JSON was designed for: a text interchange format, with JS being the language of choice, but it has gone beyond its design: configuration files, data stores, etc...

[−] conartist6 63d ago

A lot of people dislike that decision not to include comments in JSON, but I think while shocking it was and is totally correct.

In a programming language it's usually free to have comments because the comment is erased before the program runs; we usually render comments in grey text because they can't change the meaning of the program.

In a data language you have no such luxury. In a data language there's no comment erasure happening between the producer and the consumer, so comments are just dangerous as they would without doubt evolve into a system of annotations -- an additional layer of communication which would then not be standardized at all and which then would grow into a wild west of nonstandard features and compatibility workarounds.

[−] da_chicken 63d ago

I've said it before, but I maintain that XML has only two real problems:

1. Attributes should not exist. They make the document suddenly have two dimensions instead of one, which significantly increases complexity. Anything that could be an attribute should actually be a child element.

2. There should be one close tag: which closes the last element, which burns a significant amount of space with useless syntax. Other than that and the self-closing (which itself is less useful without attributes) there isn't much that you need. Maybe a document close tag like

You'll notice that, yes, JSON solves both of those things. That's a part of why it's so popular. The other is just that a lot more effort was put into maximizing the performance of JavaScript than shredding XML, and XSLT, the intended solution to this problem, is infamous at this point.

The problem of comments is kind of a non-issue in practice, IMO. You can just add a "_COMMENT" element or similar. Sure, yes, it will get parsed. But you shouldn't have that many comments that it will cause a genuine performance issue.

However, JSON still has two problems:

1. Schema support. You can't validate that a file before de-serializing it in your application. JSON Schema does exist, but it's support is still thin, IMX.

2. Many serializers are pretty bad with tabular data, and nearly all of them are bad with tabular data by default. So sometimes it's a data serialization format that's bad at serializing bulk data. Yeah, XML is worse at this. Yeah, you can use the "colNames": ["id", ...], "rows": [ [1,...],[2,...] ] method or go columnar with "id": [1,2,...], "name": [...], "createDate": [...], but you had better be sure both ends can support that format.

In both cases, it seems like there is an attempt to resolve both of those issues. OpenAPI 3.1 has JSON schema included in it. The most popular JSON parsers seem to be adding tabular data support. I guess we'll see.

[−] python-b5 63d ago

I've been working on an XML parser of my own recently and, to be honest, as long as you're fine with a non-validating parser (which are still compliant), it's really not that bad. You have to parse DTDs, but you don't need to actually _do_ anything with them. Namespaces are annoying but they're not in the main spec. CDATA sections aren't all that useful, but they're easy to parse. As far as I'm aware, parsers don't actually need to handle xml:lang/xml:space/etc themselves - they're for use by applications using the parser. Really the only thing that's been particularly frustrating for me is entity expansion.

If you want to support the wider XML ecosystem, with all the complex auxiliary standards, then yes, it's a lot of work, but the language itself isn't that awful to parse. It's a little messy, but I appreciate it at least being well-specified, which JSON is absolutely not.

[−] conartist6 63d ago

Just gonna drop this here : ) https://docs.bablr.org/guides/cstml

CSTML is my attempt to fix all these issues with XML and revive the idea of HTML as a specific subset of a general data language.

As you mention one of the major learnings from the success of JSON was to keep the syntax stupid-simple -- easy to parse, easy to handle. Namespaces were probably the feature to get the most rework.

In theory it could also revive the ability we had with XHTML/XSLT to describe a document in a minimal, fully-semantic DSL, only generating the HTML tag structure as needed for presentation.

[−] 0xbadcafebee 63d ago

The problem is that engineers of data formats have ignored the concept of layers. With network protocols, you make one layer (Ethernet), you add another layer (IP), then another (TCP), then another (HTTP). Each one fits inside the last, but is independent, and you can deal with them separately or together. Each one has a specialty and is used for certain things. The benefits are 1) you don't need "a kitchen sink", 2) you can replace layers as needed for your use-case, 3) you can ship them together or individually.

I don't think anyone designs formats this way, and I doubt any popular formats are designed for this. I'm not that familiar with enterprise/big-data formats so maybe one of them is?

For example: CSV is great, but obviously limited, and not specified all that well. A replacement table data format could be binary (it's 2026, let's stop "escaping quotes", and make room for binary data). Each row can have header metadata to define which columns are contained, so you can skip empty columns. Each cell can be any data format you want (specifically so you can layer!). The header at the beginning of the data format could (optionally) include an index of all the rows, or it could come at the end of the file. And this whole table data format could be wrapped by another format. Due to this design, you can embed it in other formats, you can choose how to define cells (pick a cell-data-format of your choosing to fit your data/type/etc, replace it later without replacing the whole table), you can view it out-of-order, you can stream it, and you can use an index.

[−] PunchyHamster 63d ago

Constant erosion of data formats into the shittiest DSLs in existence is annoying. "Oh, hey, instead of writing Python, how about you write in

* YAML, with magical keywords that turn data into conditions/commands * template language for the YAML in places when that isn't enough * ....Python, because you need to eventually write stuff that ingests the above either way .... ansible is great isn't it?"

... and for some reason others decide "YES THIS IS AWESOME" and we now have a bunch of declarative YAML+template garbage.

> There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas.

It's just a bunch of records put in tables with pretty simple data types. And it's trivial to convert into other formats while being compact and queryable on its own. So as far as formats go, you could do a whole lot worse.

[−] necovek 63d ago

Funnily enough, XML was an attempt to simplify SGML so it is easier to parse (as SGML only ever had one compliant parser, nsgml).

[−] neonstatic 63d ago

What do you think about Apache Arrow binary formats in this context?

[−] xienze 63d ago

> Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

But you don't have to use all those things. Configure your parser without namespace support, DTD support, etc. I'd much rather have a tool with tons of capabilities that can be selectively disabled rather than a "simple" one that requires _me_ to bolt on said extra capabilities.

[−] moron4hire 63d ago

I consider CSV to be a signal of an unserious organization. The kind of place that uses thousand line Excel files with VBA macros instead of just buying a real CRM already. The kind of place that thinks junior developers are cheaper than senior developers. The kind of place where the managers brow beat you into working overtime by arguing from a single personal perspective that "this is just how business is done, son."

People will blithely parrot, "it's a poor Workman who blames his tools." But I think the saying, as I've always heard it used to suggest that someone who is complaining is a just bad at their job, is a backwards sentiment. Experts in their respective fields do not complain about their tools not because they are internalizing failure as their own fault. They don't complain because they insist on only using the best tools and thus have nothing to complain about.

[−] quotemstr 63d ago

> XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink.

Ah, the old "throw a bag of nouns at the reader and hope he's intimidated" rhetorical flutist. These things are either non-issues (like QName), things a parser does for you, or optional standards adjacent to XML but not essential to it, e.g. XInclude.

[−] alexpetros 63d ago

Author here. I agree with all this, and I think it's important to note that nothing precludes you from doing a declarative specification that looks like imperative math notation, but it's also somewhat besides the point. Yes, you could make your own custom language, but then you have created the problem that the article is about: You need to port your parser to every single different place you want to use it.

That's to say nothing of all the syntax decisions you have to make now. If you want to do infix math notation, you're going to be making a lot of choices about operator precedence. The article is using a lot of simple functions to explain the domain, but we also have switch statements—how are those going to expressed? Ditto functions that don't have a common math notation, like stepwise multiply. All of these can be solved, but they also make your parser much more complicated and create a situation where you are likely to only have one implementation of it.

If you try to solve that by standardizing on prefix notations and parenthesis, well, now you have s-expressions (an option also discussed in the post).

That's what "cheap" means in this context: There's a library in every environment that can immediately parse it and mature tooling to query the document. Adding new ideas to your XML DSL does not at all increase the complexity of your parsing. That's really helpful on a small team! I agonized over the word "cheap" in the title and considered using something more obviously positive like "cost-effective" but I still think "cheap" is the right one. You're making a cost-cutting choice with the syntax, and that has expressiveness tradeoffs like OP notes, but it's a decision that is absolutely correct in many domains, especially one where you want people to be able to widely (and cheaply) build on the thing you're specifying.

[−] petcat 63d ago

> XML Is a Cheap [...]

> XML is notoriously expensive to properly parse in many languages.

I'm glad this is the top comment. I have extensive experience in enterprise-y Java and XML and XML is anything but cheap. In fact, doing anything non-trivial with XML was regularly a memory and CPU bottleneck.

[−] twoodfin 63d ago

Much of XML’s complexity derives from either the desire to be round-trip compatible with any number of existing character and data encodings or the desire to be largely forward-compatible with SGML.

A parser that only had to support a specified “profile” of XML (say, UTF-8 only, no user-defined entities or DTD support generally) could be much simpler and more efficient while still capturing 99% of the value of the language expressed by this post.

[−] PantaloonFlames 63d ago

Your first counterpoint seems unnecessarily picky.

> So while it is a suitable DSL for many things (it is also seeing new life in web components definition), we are mostly only talking about XML-lookalike language, and not XML proper. If you go XML proper, you need to throw "cheap" out the window.

But the TWE did not embrace all that stuff. It’s not required for its purpose. And to call it “xml lookalike” on that basis seems odd. It’s objectively XML. It doesn’t use every xml feature, but it’s still XML.

It’s as if you’re saying, a school bus isn’t a bus, it’s just a bus-lookalike. Buses can have cup holders and school buses lack cup holders. Therefore a school bus is not really a bus.

I don’t see the validity or the relevance.

[−] phlakaton 63d ago

Unless you are compiling really large systems of DSL specification, speed of parsing is not the operation you want to be optimizing. XML for this use case, even if you DOM it, is plenty fast.

What are more concerning are the issues that result in unbounded parses – but there are several ways to control for this.

[−] electroly 62d ago

I shipped 20MB of XML with a product back in 2014; we loaded it at startup, validated it against the XSD, and the performance for this use case was fine. It was big because we did something kinda like what TFA suggests: I designed a declarative XML "DSL" and then wrote a bunch of "code" in it. We had lots of performance problems in that project, but the XML DSL wasn't the cause of any of them; that part was fine. I think "expensive" can mean a lot of different things. It was cheap in terms of development time and the loading/validation time, even on 20MB of XML, was not a problem. Visual Studio ships a tool that generates C# classes from the XSDs which was handy. I just wrote the XSDs and the framework provided the parsing, validation, node classes, and tree construction. This is as "XML proper" as I think it's possible to get.

I don't believe that .NET's XML serializer uses any of the open source projects mentioned in your post, so maybe we just have especially good XML support in .NET. I think Java has its own XML serializer, too. I bet most XML generated and consumed in the world is not one of those three open source C/C++ libraries. I think Java alone might be responsible for more than half of it.

[−] necovek 63d ago

FWIW, this is also one of the reasons MathML has never become the "input" language for mathematics, and the layout-focused (La)TeX remains the de-facto standard.

Ergonomics of input are important because they increase chances of it being correct, and you can usually still keep it strict and semantic enough (eg. LaTeX is less layout-focused than Plain TeX)

[−] twic 63d ago

You don't even need to specify a DSL to make that code declarative. It can be real code that's manipulating expression objects instead of numbers (though not in JavaScript, where there's no operator overloading), with the graph of expression objects being the result.

[−] gchamonlive 63d ago

That's a strange comment...

Cheap here is semantically different from cheap in the article. Here it means "how hard it hits the CPU" and in the article is "how hard it is to specify and widely support your DSL".

You also posted a piece of code that the author himself acknowledged that is not bad and ommited the one pathological example where implementation details leak when translating to JavaScript.

It just seems like you didn't approach reading the article willing to understand what the author was trying to say, as if you already decided the author is wrong before reading.

[−] sriku 63d ago

While this can give a notation for the domain, you'd still need an engine to process it. Prolong+CLPFD perhaps meets it well (not too familiar with the tax domain) and one could perhaps paraphrase Greenspun's tenth rule to this combo too.

[−] raverbashing 63d ago

> and have two axes for adding metadata: one being the tag name, another being attributes

Yes let's not even get started on implementations who do

[−] tannhaeuser 63d ago

The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags) ...

I think you're missing the forrest for the trees ;)

The major point of SGML in this context is that elements have content models defined by regular expressions, just like any other grammar productions eg. BNF.

[−] quotemstr 63d ago

> The main property of SGML-derived languages is that they make "list" a first class object, and nesting second class (by requiring "end" tags),

As opposed to JSON, which famously lacks lists? What does "second class" even mean here? How is having an end-indicator somehow a demotion?

> talking about XML-lookalike language, and not XML proper. If you go XML proper, you need to throw "cheap" out the window.

libxml2 and expat are plenty fast. You can get ~120MB/s out of them and that's nowhere near the limit. Something like pugixml or VTD can do faster once you've detected you're not working with some kind of exotic document with DTD entities.

[−] jaen 63d ago

Or... you could just use a programming language that looks good and has great support for embedded domain-specific languages (eDSL), like Haskell, OCaml or Scala.

Or, y'know, use the language you have (JavaScript) properly, eg. add a sum abstraction instead of .reduce((acc, val) => { return acc+val }, 0).

In particular, the problem of "all the calculations are blocked for a single user input" is solved by eg. applicatives or arrows (these are fairly trivial abstract algebraic concepts, but foreign to most programmers), which have syntactic support in the abovementioned languages.

(Of course, avoid the temptation to overcomplicate it with too abstract functional programming concepts.)

If you write an XML DSL:

1. You have to solve the problem of "what parts can I parallelize and evaluate independently" anyway. Except in this case, that problem has been solved a long time ago by functional programming / abstract algebra / category-theoretic concepts.

2. It looks ugly (IMHO).

3. You are inventing an entirely new vocabulary unreadable to fellow programmers.

4. You will very likely run into Greenspun's tenth rule if the domain is non-trivial.

[−] twic 63d ago

FWIW you can do a better job with the JSON structure than in the article:

    {"GreaterOf": [
        {"Value": [0, "Dollar"]},
        {"Subtract": [
            {"Dependency": ["/totalTentativeTax"]},
            {"Dependency": ["/totalNonRefundableCredits"]}
        ]}
    ]}

Basically, a node is an object with one entry, whose key is the type and whose value is an array. It's a rather S-expressiony approach. if you really don't like using arrays for all the contents, you could always use more normal values at the leaves:

    {"GreaterOf": [
        {"Value": {"value": 0, "kind": "Dollar"}},
        {"Subtract": {
            "minuend": {"Dependency": "/totalTentativeTax"},
            "subtrahend": {"Dependency": "/totalNonRefundableCredits"}
        }}
    ]}

It has the nice property that you're always guaranteed to see the type before any of the contents, even if object keys get reordered, so you can do streaming decoding without having to buffer arbitrary amounts of JSON. Probably not important when parsing a tax code, but can be useful for big datasets.

[−] sgarland 63d ago

While a great article, I actually found this linked post [0] to be even better, in which the author lays out how so much modern tooling for web dev exists simply because XML lost the browser war.

EDIT: obviously, JSON tooling sprang up because JSON became the lingua franca. I meant that it became necessary to address the shortcomings of JSON, which XML had solved.

0: https://marcosmagueta.com/blog/the-lost-art-of-xml/

[−] Decabytes 63d ago

S-expressions are a cheap dsl too. I use it in my desktop browser runtime that is powered by wasm that I’m developing As the “HTML”^1 and CSS^2 in fact it works so well I use it also reused it to do the styling for html exports in my markup language designed to fight documentation drift^3.

1. https://gitlab.com/canvasui/canvasui-engine/-/blame/main/exa...

2. https://gitlab.com/canvasui/canvasui-engine/-/blob/main/exam...

3. https://gitlab.com/sablelang/libcuidoc

[−] Kwpolska 63d ago

XML is beloved by tax authorities. The Polish tax authorities really love their e-documents and online filing. Except their XML documents are completely human-unreadable, since the schemas are based on field numbers in paper forms. Even in the brand new National e-Invoicing System, designed from scratch, with no paper forms, most fields have names like ‹P_19N›1‹/P_19N›. You read the XML schema to find out it is a "Marker of lack of delivery of goods or provision of services exempt from tax under Article 43 paragraph 1 of the [VAT] Act, Article 113 paragraphs 1 and 9 of the Act or regulations issued under Article 82 paragraph 3 of the Act or under other provisions" (Google Translated, because of course everything is in Polish). So my invoice is saying "yes [1], I am not [N] exempt from tax under $allThatNonsense [P_19]".

In unrelated news, the main author of the VAT Act is offering tax consulting services, as Registered Tax Advisor #00001.

[−] jfengel 63d ago

It's not a DSL. It's a generic lexer and parser. It takes the text and gives you an abstract syntax tree. The actual DSL is your spec, and the syntax you apply.

It's one of many equivalent such parser tools, a particularly verbose one. As such it's best for stuff not written by hand, but it's ok for generated text.

It has some advantages mostly stemming from its ubiquity, so it has a big tool kit. It has a lot of (somewhat redundant) features, making it complex compared to other options, but sometimes one of those features really fits your use case.

[−] 1a527dd5 63d ago

The trouble with XML has never been XML itself.

It was also about how easy it was to generate great XML.

Because it is complicated and everyone doesn't really agree on how to properly representative an idea or concept, you have to deal with varying output between producers.

I personally love well formed XML, but the std dev is huge.

Things like JSON have a much more tighter std dev.

The best XML I've seen is generated by hashdeep/md5deep. That's how XML should be.

Financial institutions are basically run on XML, but we do a tonne of work with them and my god their "XML" makes you pray and weep for a swift end.

[−] exabrial 63d ago

Given that that is had strong schema XSD verification built in, where you can tell in an instant whether or not the document is correct; it’s the right tool for a majority of jobs.

My experience has been the people complaining about it were simply not using automated tools to handle it. It’s be like people complaining that “binaries/assembly are too hard to handle” and never using a disassembler.

[−] somat 63d ago

XML makes for a pretty good markup language and an ok data interchange format(not a great fit, but the tooling is pretty good). but every single time I have seen it used as a programing language I found it deeply regrettable.

For comparison JSON is a terrible markup language, a pretty good data interchange format, and again, a deeply regrettable programing language. I don't know if anyone has put programing language in straight JSON (I suspect they have shudders) but ansible has quite a few programing structures and is in YAML which is JSON dressed in a config language's clothes.

However as a counter point to my json indictment, it may be possible to make a decent language out of it, look to lisp, it's S-expressions are a sort of a data interchange format(roughly equivalent to json) and it is a pretty good language.