Thinking about the future of data

XML pioneer Dave Hollander had a vision for Internet documents a decade ago. Now it's about to become a reality.

Paul Festa, Staff Writer, CNET News.com

Paul Festa covers browser development and Web standards.

Oct. 2, 2002 11:36 a.m. PT

In the early 1990s, Dave Hollander devised a markup language for CD-ROM browsers. The language became an early contender for use with the nascent World Wide Web, but Hewlett-Packard, Hollander's employer at the time, kept a tight leash on its intellectual property. As a result, HTML carried the day.

But history has a way of circling back, and 10 years later HTML (Hypertext Markup Language) is giving way to another of Hollander's brainchildren with the introduction of an XML-based HTML substitute called XHTML.

While Hollander has long since moved away from document-based technologies in favor of data-focused work in the e-commerce and Web services areas, his early work on XML (Extensible Markup Language) is bringing the Web back to a more machine-readable model like the one he originally envisioned for it.

A native of Baldwinsville, N.Y., Hollander worked his way through school, driving a truck and working other odd jobs before graduating with a bachelor of science degree from Michigan Tech. These days he is the chief technology officer of Contivo. The 46-year-old Hollander also is the co-author of "XML Applications," technical editor of "XML Schemas," and a contributor to standards including OAGI, RosettaNet, OBI and the ECO Framework. He co-chairs the World Wide Web Consortium's (W3C) XML Schema working group as well as its Web Services Architecture working group, and is co-author of Namespaces in XML, a W3C recommendation.

From his home in Loveland, Colo., Hollander spoke with CNET News.com about the alphabet soup of specifications underlying his work with data integration and Web services.

Q: What was your role in the creation of XML?
A: Back in the early '90s I was publishing all the manuals for HP--Unix and others--on CD-ROM. This is pre-Web. And I created a language called SDL (Semantic Delivery Language), which has nothing to do with the Semantic Web! (laughing) There were SGML languages that were around, which we couldn't deliver onto CD-ROM. So SDL was an intermediate language that could be read by CD-ROM browsers, which gave me a good forum for translating from high-level SGML to something closer to the computer.

I talked to my peers, including (World Wide Web inventor and W3C director) Tim Berners-Lee, and we saw it as something we could use to access the Web. At the time, Tim was working on the precursor of the Web, within the CERN environment. Tim was working on gaining broad support for it, but the Web as we know it today did not exist. We didn't even allow commercial traffic until April of '94. There was an informal group called the Davenport Group, in '91, but I couldn't get (intellectual property) clearance from HP to use SDL. So they started the HTML project. And by the time we did get this cleared, that project was well along.

Now, 10 years later, the W3C is trying to move us away from HTML to XHTML, the XML-based alternative. What if HP had given you that IP clearance? Would that have saved us having to go through this transition?
It might have been less painful. This is part of a learning process that the community at large had to go through to treat information differently, to begin to understand the concept of markup and metadata and be concerned about those aspects of their data as well as the content. When I was doing HP.com, we had a primitive set of metadata we needed for search engines. And until I got the managers to add that into the contract, they didn't worry about it as long as the pages looked right. The world was starting to understand that a Web page may be pretty to look at, but that it may be difficult for computers to work with. People now understand that there's a necessity for that.

So how did XML come out of HTML?
It was at the second World Wide Web conference, in 1994. By then I'd become Webmaster for HP.com, which I started. We showed up in Chicago and were dealing with HTML 2.0, and as a publisher, I didn't find that a very suitable technology. There were others there with like minds. On the way back to the airport, (Sun Microsystems engineer) Jon Bosak and I shared a cab and knew we had to have something more confined than SGML.

What do you mean confined?
Meaning that SGML is a deep and rich spec with lots of features that aren't used, which kept it from being implemented on the Web. Today you find XML tools and parsers everywhere. There were only three in the world for SGML and they were very expensive to buy. We wanted to keep the idea of having a metalanguage, and didn't want to have to use somebody else's semantics, and wanted to be able to create our own languages. Something a little higher-level--more abstract than SDL or HTML. Within a couple of months Jon had a team of people who worked together and we started drafting what became XML.

We were working on this on our own. We started meeting Saturday mornings for a couple of months, figured out our goals and principles, while Jon worked on finding us a venue, which wound up being the W3C.

What was the next step in your career?
I went from HP to CommerceNet. My interpretation of their charter was to explore the white space between companies--to look at how companies do business and try to fill the huge gaps with respect to security, payment standards, etc. I was doing a lot of work in catalogs, payment, in trying to define XML standards to help businesses do these kinds of B2B transactions. For me it was a way to stop thinking about documents and manuals and start thinking about those things as being the tools we use in business every day. From CommerceNet I went to Contivo. Whether it's documents or B2B, to me the biggest issue turns out to be a transformation problem.

What do you mean by transformation?
It means, how does the receiver interpret the intent of the sender of the information? The easy one to think about is in language, going from French to German, "rue" to "strasse." Being able to understand that intent, and transform what the sender intended to what the receiver needs to do, becomes a transformation. Semantics is understanding the intent of a thing, of a concept. I like to think of it as the boundary between data and behavior. If you send out the same data and get five different responses, then there are five different semantics associated with that data. In order to do a transformation from "rue" to "strasse," I have to understand the fundamental transformation that this is a street, and whether it's a street or highway or road needs to be differentiated. In order to make meaningful transformations, you have to understand the semantics of the information.
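One way to picture the point about intent is a lookup that goes through a shared concept instead of mapping word to word. The sketch below is only an illustration: the concept labels and the two tiny vocabularies are invented for the example and do not come from any real standard.

```python
# Hypothetical vocabularies: each side maps its local term to a shared concept.
FRENCH_TO_CONCEPT = {"rue": "street", "autoroute": "highway"}
CONCEPT_TO_GERMAN = {"street": "Strasse", "highway": "Autobahn"}


def transform(term: str) -> str:
    """Translate a French term to German by way of the concept it stands for."""
    concept = FRENCH_TO_CONCEPT.get(term.lower())
    if concept is None:
        # Without a known concept there is no meaningful transformation.
        raise KeyError(f"no known concept for {term!r}")
    return CONCEPT_TO_GERMAN[concept]


print(transform("rue"))
print(transform("autoroute"))
```

Because the street/highway distinction lives in the concept layer, "rue" and "autoroute" land on different German words even though both are roads.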

How does all this relate to the W3C's Semantic Web activity?
The Semantic Web is based on a couple of fundamental technologies and is a research project. If you look at the funding for the Semantic Web project, it's primarily through a research grant, not member fees. The underlying technologies are RDF, and ontology languages such as DAML or OWL. Those technologies are not mature or proven, and it's not clear to me how I build a commercial product with them. It's perfectly appropriate to do a research project with them, but right now I can't go out and sell a large volume of accounts on a technology that hasn't been proven.

What do you think about the comparison people make between the Semantic Web and artificial intelligence?
Many people point to them as being historical parallels. I'm not sure I would go that far, because the verdict is still out on the Semantic Web. And even though AI has a black eye, it's showing up as a viable technology all over the place. In decision systems used in business intelligence, in medical and clinical analysis, you use AI techniques, even though they don't claim to be AI systems. Conceptually, though, one of the fundamentals of AI was some sort of knowledge tree, some way of understanding and classifying information. KIF is a framework for describing the relationships that you build knowledge around, and DAML is also a hierarchical framework for storing related information. If you discovered that pins are sharp, it shows you where in that hierarchy to store that information so that you could later associate pins and nails as sharp objects.

You're on the W3C's XML Schema working group. What's a schema?
In general they're any sort of formal description of structure and data types, the pattern by which something is designed. XML Schemas are a much stronger way of doing that than DTDs (document type definitions). They have a way of being very precise about the data types, as you are in relational databases.

What was wrong with DTDs?
With DTDs there was no way to describe that you were going to have a date or a dollar value between zero and 100. Everything was a string and all you could do was check whether that string existed. Schemas also provide the ability to structure information very differently. They support namespaces, so I can actively use namespaces in my document where it was very difficult with DTDs.
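The kind of check described here, a date or a value between zero and 100, can be sketched in a few lines. This is a hand-rolled illustration, not a schema validator: the fragment uses standard XML Schema facets (`xs:restriction`, `xs:minInclusive`, `xs:maxInclusive`), but the type name `amount` and the helper functions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal, InvalidOperation

XS = "http://www.w3.org/2001/XMLSchema"

# A schema fragment that limits a value to a decimal between 0 and 100.
SCHEMA_FRAGMENT = f"""
<xs:simpleType xmlns:xs="{XS}" name="amount">
  <xs:restriction base="xs:decimal">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="100"/>
  </xs:restriction>
</xs:simpleType>
"""


def read_bounds(fragment: str) -> tuple[Decimal, Decimal]:
    """Read the inclusive lower and upper bounds out of the fragment."""
    restriction = ET.fromstring(fragment).find(f"{{{XS}}}restriction")
    low = restriction.find(f"{{{XS}}}minInclusive").get("value")
    high = restriction.find(f"{{{XS}}}maxInclusive").get("value")
    return Decimal(low), Decimal(high)


def is_valid_amount(text: str, low: Decimal, high: Decimal) -> bool:
    """True only if the text is a finite decimal number inside the bounds."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    return value.is_finite() and low <= value <= high


def is_valid_date(text: str) -> bool:
    """True only if the text parses as an ISO calendar date."""
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


low, high = read_bounds(SCHEMA_FRAGMENT)
for candidate in ("42.50", "250", "lots"):
    print(candidate, is_valid_amount(candidate, low, high))
for candidate in ("2002-10-02", "next Tuesday"):
    print(candidate, is_valid_date(candidate))
```

A DTD could only confirm that the string was present; every one of these candidates would have passed.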

OK, what are namespaces?
Usually, as a co-author of XML Namespaces, when I answer this question it becomes a long debate. In computing, namespaces are a more general concept than in XML. XML Namespaces are ways to identify who is the authority for a set of tags. So in XHTML 2, if I have a tag sitting in the middle of my page that says bracket P close-bracket, whose paragraph tag is that? Who says what the content can be? Using namespaces, I can differentiate between XHTML and the funny language of the day...It's a very powerful tool that we couldn't use fully until we had schemas.
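The "whose paragraph tag is that?" question can be made concrete with a short sketch using Python's standard XML parser. The first namespace URI is the one XHTML 1.0 uses; the second, `http://example.com/notes`, is a made-up vocabulary for illustration, and `tag_owners` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NOTES_NS = "http://example.com/notes"  # hypothetical vocabulary

# Two elements share the local name "p" but belong to different namespaces.
DOCUMENT = f"""
<doc xmlns:h="{XHTML_NS}" xmlns:n="{NOTES_NS}">
  <h:p>An XHTML paragraph.</h:p>
  <n:p>A "p" from another vocabulary.</n:p>
</doc>
"""


def tag_owners(xml_text: str) -> list[tuple[str, str]]:
    """Return (local name, namespace URI) for each child of the root."""
    owners = []
    for element in ET.fromstring(xml_text):
        # ElementTree expands a qualified tag to "{namespace-uri}local-name".
        namespace, local_name = element.tag[1:].split("}", 1)
        owners.append((local_name, namespace))
    return owners


for local_name, namespace in tag_owners(DOCUMENT):
    print(local_name, "is defined by", namespace)
```

Both elements are named `p`, yet a program can tell them apart because each carries the URI of the authority that defines it.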

You use an XML Schema to describe a class of documents. I want to describe a purchase order, or on my Web site I have several pages that are link lists pointing to research pages. So I've created a schema that will capture my research notes and bibliographic references, and it describes all those documents. They have a certain structure and predictability. I can make sure I capture the date and the relevant data I want to have. In the business context, it can be the basis of a contract. If you send me all the info that's relevant within these constraints, we have the legal intent to do this kind of transaction, to send me a shipping notice for something like that.

What's the relationship of schemas to Web services?
That's a very interesting question. It goes to the root of what the Web services community is struggling with. There are two factions in it right now: those who think of it in terms similar to EDI, or B2B transactions, and those who come to it from the perspective of the Semantic Web. One of the Web's guiding principles is partial understanding. If I get a message I don't completely understand, I still deal with it. In business, or commerce, it's different: if I send you a purchase order and you don't understand the street address or the product description, you don't do your best; you return it until we have an understanding.

When you go to the Web and look up the site for the bicycle, you might go to another page, but you figure out what you need to know. The question is: Are Web services going to act more like a commerce transaction, or is there no presumption of how things are set up ahead of time and you do your best with how the information is presented? For those who think about it in terms of commerce, schemas are the way of making that contract. Send me this information and I will respond with another information set. A purchase order for an acknowledgement. Send me a stock quote request, and I'll send you back the stock values.
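The two attitudes toward an incoming message can be sketched side by side. Everything here is hypothetical: the field names stand in for whatever a real schema would require, and the two handlers are illustrations, not part of any Web services specification.

```python
# Hypothetical contract: the fields a purchase order must carry.
REQUIRED_FIELDS = ("item", "quantity", "street_address")


def handle_web_style(message: dict) -> dict:
    """Partial understanding: use the fields we recognize, ignore the rest."""
    return {key: message[key] for key in REQUIRED_FIELDS if key in message}


def handle_commerce_style(message: dict) -> dict:
    """Contract style: refuse the message unless every required field is there."""
    missing = [key for key in REQUIRED_FIELDS if key not in message]
    if missing:
        raise ValueError(f"returned to sender, missing: {', '.join(missing)}")
    return {key: message[key] for key in REQUIRED_FIELDS}


order = {"item": "bicycle", "quantity": 1, "color": "red"}

print(handle_web_style(order))
try:
    handle_commerce_style(order)
except ValueError as error:
    print(error)
```

The Web-style handler quietly proceeds with the two fields it understands, while the commerce-style handler sends the order back because the street address is absent.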

What's the difference between the goals of the Web Services Architecture working group and the Web Services Interoperability (WS-I) Organization?
The goals of the WS-I--I'm not a member--are to help assure that this new technology emerges in an interoperable way. They do not have the goal to be the developers of standards. The W3C's goal is to develop standards based on industry practice and secondarily to do research in electronic communication. And right now, there are lots of active conversations between the organizations to figure out the best way to work together to reach this goal of being interoperable.

Do you think the WS-I--which was originally created by Microsoft and IBM--is trying to steal the W3C's thunder?
Creating conflict where none exists is not helpful to the community at large. There is some inherent conflict because both are trying to achieve interoperability. The W3C is firmly entrenched with SOAP, WSDL and XML, so the fundamental underpinnings of Web services are in place at the W3C. Where's the most productive place to work on the rest of the pieces? The W3C has a list of projects; they don't need to work on these things to be successful, but they know it's part of the evolution of the Web and want to be part of it.

Where does work stand on security and transactions standards? How does the W3C's work mesh with standards being announced by Microsoft and IBM?
These are all questions that are in active negotiations. XML Digital Signatures is going through the W3C, and signatures are a building block of any security standard. Where do we begin to draw the line in an architecture? ISO has a seven-layer stack. We are going to have to have one for Web services. Will another group build a higher level of security on top of what's provided by the W3C? We'll have to wait and see. But the good thing is that I hear from everyone in the industry how important it is to make these things happen in an interoperable way. The only question is what's the most effective way to get there. That wasn't true back in the browser war era.

What are your thoughts on the intellectual property issues the W3C is dealing with now--RAND (reasonable and nondiscriminatory) vs. royalty free?
I think it is the most fundamental issue facing the whole standards arena right now. There's a lot more work going into it, a lot more confusion about the right way to go. There's a fine line between rewarding people for work they've done and putting things into public domain.

If it were up to you, if you were czar, how would you solve the problem?
I'd put everybody in a room, like a jury, and tell them you don't come out until you have an agreement. I don't care what the agreement is.

Wait, you really don't care which way it turns out?
There's part of me that says, sure I'd like everything to be free. But another part of me knows that nothing's free, and considering the cost of it in terms of the pace and progress, I may be willing to pay for something.

Were you ever compensated for your work on XML?
No. Nobody was directly. We were just trying to solve a problem that we all shared. XML was not something that was being pushed by the industry. We had support from our companies at a personal project level. It was 12 people in that group, who all knew each other. Now there are more than 12 working group chairs in the XML activity at the W3C. They're running 50 to 70 people in each group. Whole companies are betting their futures on Web services. It's a very different place than where we were in '94, '95 with XML.