The Variety of XML Schema Technologies: Grammar-Based Languages

Source: Internet   Posted by: 美团商家数据   Editor: 程序博客网   Date: 2024/06/12 00:38

3. Grammar-based languages (RELAX NG)

We have seen that a schema can be described as a set of rules formalized in a language such as Schematron or XSLT (other languages such as Prolog are probably good candidates too). That this is possible does not mean that it is easy, and people have developed other, more specific classes of schema languages that describe the structure of the documents rather than the rules to apply to validate them.

RELAX NG is the main example of such languages, which are called "grammar based" because they describe documents in the manner of a BNF adapted to describing XML trees.

Although its syntax is very different from that of XPath, RELAX NG is all about the named patterns that are allowed in the document structure.

3.1. Getting started

The description of our simplified library could be:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://relaxng.org/ns/structure/1.0">
    <start>
      <element name="library">
        <zeroOrMore>
          <element name="book">
            <attribute name="id"/>
          </element>
        </zeroOrMore>
      </element>
    </start>
  </grammar>

This schema reads almost like plain English. It describes 'a grammar starting with a document element named "library" containing zero or more elements named "book" with an attribute named "id"', and it is equivalent to our closed XSLT schema accepting only "/library", "/library/book" and "/library/book/@id", except that the restriction that ids must be unique is not (yet) captured in our RELAX NG schema.
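
To make this reading concrete, here is a minimal Python sketch of the check that the schema expresses. It is not a RELAX NG validator: the function name `matches_simple_library` is hypothetical, only the standard `xml.etree.ElementTree` module is used, and the single pattern of this schema is hard-coded rather than read from a grammar. Treating whitespace-only text as ignorable is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET


def matches_simple_library(xml_text: str) -> bool:
    """Hard-coded check: a library of zero or more books, each with only an id."""
    root = ET.fromstring(xml_text)
    # The document element must be "library", without attributes or text.
    if root.tag != "library" or root.attrib or (root.text or "").strip():
        return False
    for child in root:
        # Only "book" children are accepted, carrying exactly one attribute, "id".
        if child.tag != "book" or set(child.attrib) != {"id"}:
            return False
        # A book has no element children and no text (whitespace is ignored).
        if len(child) or (child.text or "").strip() or (child.tail or "").strip():
            return False
    return True


print(matches_simple_library('<library><book id="b1"/><book id="b2"/></library>'))
print(matches_simple_library('<library><magazine id="m1"/></library>'))
```

The first call accepts the document and the second rejects it, because "magazine" is not part of the closed pattern.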

3.2. Non-XML syntax

The XML syntax of this schema is still quite verbose, and James Clark has proposed an equivalent yet more concise non-XML syntax. Using this syntax, our schema becomes:

  grammar {
    start = element library{
      element book {attribute id {text}} *
    }
  }

This syntax has roughly the same meaning, except that (a) it is not XML and (b) some DTD goodies are used: here "*" means "zero or more", and we will see more of these goodies in more complete examples later on.

3.3. Identifiers

We are still behind what we had implemented with our XSLT and Schematron schemas, which did test the uniqueness of the book identifiers. Although a grammar-based XML schema language generally cannot express every constraint that can be expressed as rules, this example has been chosen so that we can define a nearly equivalent constraint with RELAX NG.

This is achievable through a set of features designed to provide a certain level of compatibility with XML DTDs, combined with the ability of RELAX NG to interface with datatype systems.

The datatype library to use in this case is "http://relaxng.org/ns/compatibility/datatypes/1.0" and the datatype to use is "ID", since our "id" attributes can be treated as DTD ID attributes (they are unique across the whole document and they match the XML "Name" production, as DTD ID values must).

The amended schema expressing this new constraint becomes:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://relaxng.org/ns/structure/1.0"
           datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0">
    <start>
      <element name="library">
        <zeroOrMore>
          <element name="book">
            <attribute name="id">
              <data type="ID"/>
            </attribute>
          </element>
        </zeroOrMore>
      </element>
    </start>
  </grammar>

The syntax is still straightforward: the attribute is now specified as holding data of type "ID" from the datatype library "http://relaxng.org/ns/compatibility/datatypes/1.0", which is selected through the datatypeLibrary attribute of an ancestor of the "data" element.

The non-XML syntax uses a namespace prefix declaration (also available in the XML syntax) and becomes:

  datatypes dtd = "http://relaxng.org/ns/compatibility/datatypes/1.0"
  grammar {
    start = element library{
      element book {attribute id {dtd:ID}} *
    }
  }

We will see later on that these datatypes are not without side effects: they exist for compatibility with DTDs, and they emulate DTDs to the point of reducing the flexibility of RELAX NG.
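
The uniqueness constraint that the "ID" type adds on top of the structural pattern can be illustrated with a short Python sketch. This is only an approximation under stated assumptions: the helper `duplicate_ids` is hypothetical, it treats every attribute named "id" as an ID (as our schema does), and it does not check that the values match the lexical rules of the datatype.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(xml_text: str) -> list:
    """Return the sorted list of id values that occur more than once."""
    root = ET.fromstring(xml_text)
    # Collect the "id" attribute of every element in the document.
    counts = Counter(
        el.get("id") for el in root.iter() if el.get("id") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)


print(duplicate_ids('<library><book id="b1"/><book id="b1"/></library>'))
```

A document is acceptable with respect to this constraint when the returned list is empty.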

3.4. Patterns

Throughout our brief experience with RELAX NG we have been manipulating patterns, and it is worth coming back to this concept, which is really fundamental.

The basic thing to note is that when we write something such as "element library{element book {attribute id {dtd:ID}} *}", we are not defining what the elements "library" and "book" and the attribute "id" are; we are defining a pattern of nodes which may appear in the documents.

In this respect, we are much closer to the schemas we have written with XSLT or Schematron than to the schemas we will write later on with W3C XML Schema. The meaning of the pattern defined above is "accept here an element node library with child element nodes book, each having an id attribute holding data of type ID".

The nodes manipulated in this pattern are always anonymous, which means that we cannot refer to them elsewhere in the schema. What is possible, though, is to define global named patterns (comparable to named templates in an XSLT transformation) and to refer to these patterns in other patterns.

The syntax to define a named pattern holding the set of book elements would be:

  <define name="bookElements">
    <zeroOrMore>
      <element name="book">
        <attribute name="id">
          <data type="ID"/>
        </attribute>
      </element>
    </zeroOrMore>
  </define>

or (non-XML):

  bookElements = element book {attribute id {dtd:ID}} *

And a reference to this pattern would be:

  <start>
    <element name="library">
      <ref name="bookElements"/>
    </element>
  </start>

or (non-XML):

  start = element library{ bookElements }

Note that there is no restriction on the "content" located in named patterns. We have chosen here to include a set of zero or more book elements, but we could also have created patterns holding a single book element or the id attribute. In every case, named patterns are containers: even when a named pattern contains a single element, it is a pattern containing that element rather than a definition of the element.

3.5. More features

It is now time to add some more elements and explore more features of RELAX NG. Let's describe the "author" element:

  <author id="Charles-M.-Schulz">
    <name>
      Charles M. Schulz
    </name>
    <nickName>
      SPARKY
    </nickName>
    <born>
      1922-11-26
    </born>
    <dead>
      2000-02-12
    </dead>
  </author>

Since the definition of the id attribute is common to several elements, we can isolate it in a pattern:

  <define name="idAttribute">
    <attribute name="id">
      <data type="ID" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
    </attribute>
  </define>

or:

  idAttribute = attribute id {dtd:ID}

The description of the author element is straightforward, using the few features we have already seen:

  <element name="author">
    <ref name="idAttribute"/>
    <element name="name">
      <text/>
    </element>
    <element name="nickName">
      <text/>
    </element>
    <element name="born">
      <text/>
    </element>
    <element name="dead">
      <text/>
    </element>
  </element>

or:

  element author {
    idAttribute,
    element name {text},
    element nickName {text},
    element born {text},
    element dead {text}
  }

Note that we have defined all the sub-elements as "text", meaning that they can hold any text node. We could also use a datatype library such as the W3C XML Schema datatype library, which we can declare as the default datatype library since we have specified the datatype library used for the id attribute in its own definition.

The definition then involves choosing the right type for each element. Here, for instance, we are lucky enough to have dates expressed in the ISO 8601 date format supported by W3C XML Schema, so we can use this type in our schema. For string types, we need to choose between "token" and "string" depending on the whitespace normalization we want (token collapses and trims whitespace, while string applies no normalization). Depending on these choices, our definition might become:

  <element name="author">
    <ref name="idAttribute"/>
    <element name="name">
      <data type="token"/>
    </element>
    <element name="nickName">
      <data type="token"/>
    </element>
    <element name="born">
      <data type="date"/>
    </element>
    <element name="dead">
      <data type="date"/>
    </element>
  </element>

or:

  element author {
    idAttribute,
    element name {xs:token},
    element nickName {xs:token},
    element born {xs:date},
    element dead {xs:date}
  }
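
The difference between "token", "string" and "date" can be sketched in a few lines of Python. The function names below are hypothetical and the code only approximates the W3C XML Schema datatypes: `parse_date` relies on `datetime.date.fromisoformat`, which handles the plain YYYY-MM-DD form used in our example but not the optional time zone that the date type also allows.

```python
import re
from datetime import date


def normalize_token(value: str) -> str:
    """Collapse runs of space, tab, CR and LF to one space, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", value).strip(" ")


def normalize_string(value: str) -> str:
    """The string type applies no whitespace normalization at all."""
    return value


def parse_date(value: str) -> date:
    """Whitespace is collapsed before the date itself is parsed."""
    return date.fromisoformat(normalize_token(value))


print(repr(normalize_token("\n  Charles M.   Schulz\n")))
print(repr(normalize_string("\n  SPARKY\n")))
print(parse_date("\n  1922-11-26\n"))
```

With "token", the line breaks and indentation around the author's name in our instance document disappear from the value; with "string", they would be part of it.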

3.6. Full schema

Writing the full schema for the complete example is pretty much a matter of repeating the same process:

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://relaxng.org/ns/structure/1.0"
           datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
    <start>
      <element name="library">
        <oneOrMore>
          <choice>
            <ref name="bookElement"/>
            <ref name="authorElement"/>
            <ref name="characterElement"/>
          </choice>
        </oneOrMore>
      </element>
    </start>
    <define name="idAttribute">
      <attribute name="id">
        <data type="ID" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
      </attribute>
    </define>
    <define name="idrefAttribute">
      <attribute name="id">
        <data type="IDREF" datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0"/>
      </attribute>
    </define>
    <define name="bookElement">
      <element name="book">
        <ref name="idAttribute"/>
        <element name="isbn">
          <data type="token"/>
        </element>
        <element name="title">
          <data type="token"/>
        </element>
        <zeroOrMore>
          <element name="author-ref">
            <ref name="idrefAttribute"/>
          </element>
        </zeroOrMore>
        <zeroOrMore>
          <element name="character-ref">
            <ref name="idrefAttribute"/>
          </element>
        </zeroOrMore>
      </element>
    </define>
    <define name="authorElement">
      <element name="author">
        <ref name="idAttribute"/>
        <element name="name">
          <data type="token"/>
        </element>
        <element name="nickName">
          <data type="token"/>
        </element>
        <element name="born">
          <data type="date"/>
        </element>
        <element name="dead">
          <data type="date"/>
        </element>
      </element>
    </define>
    <define name="characterElement">
      <element name="character">
        <ref name="idAttribute"/>
        <element name="name">
          <data type="token"/>
        </element>
        <element name="since">
          <data type="date"/>
        </element>
        <element name="qualification">
          <data type="string"/>
        </element>
      </element>
    </define>
  </grammar>

or:

  datatypes dtd = "http://relaxng.org/ns/compatibility/datatypes/1.0"
  datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
  grammar {
    start = element library{ (bookElement|authorElement|characterElement)+ }
    idAttribute = attribute id {dtd:ID}
    idrefAttribute = attribute id {dtd:IDREF}
    bookElement = element book {
      idAttribute,
      element isbn {xs:token},
      element title {xs:token},
      element author-ref{idrefAttribute} *,
      element character-ref{idrefAttribute} *
    }
    authorElement = element author {
      idAttribute,
      element name {xs:token},
      element nickName {xs:token},
      element born {xs:date},
      element dead {xs:date}
    }
    characterElement = element character {
      idAttribute,
      element name {xs:token},
      element since {xs:date},
      element qualification {xs:string}
    }
  }

Note the use, in the definition of the "library" element, of the "choice" element (XML syntax), represented in the non-XML syntax by the "|" operator. This compositor allows exactly one of the listed alternatives. Here, the choice is wrapped in "oneOrMore" ("+" in the non-XML syntax), which means that it may be repeated indefinitely.
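
A repeated choice behaves much like an alternation with a repetition operator in a regular expression. The following Python sketch makes the analogy explicit for the children of "library". It is only an illustration under simplifying assumptions: the name `library_children_ok` is hypothetical, and the check looks at the names of the child elements only, not at their content or attributes.

```python
import re
import xml.etree.ElementTree as ET

# (bookElement|authorElement|characterElement)+ reduced to element names,
# each name being followed by a single space in the string we build below.
CONTENT_MODEL = re.compile(r"(?:book |author |character )+")


def library_children_ok(xml_text: str) -> bool:
    """Check that library holds one or more book, author or character children."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return root.tag == "library" and CONTENT_MODEL.fullmatch(names) is not None


print(library_children_ok("<library><author/><book/><book/><character/></library>"))
print(library_children_ok("<library/>"))
```

The empty library is rejected because the repetition requires at least one occurrence of the choice.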

3.7. When order doesn't matter

There are cases where the relative order of elements does not matter to the application. For instance, one may wonder what the point is of constraining the order of the sub-elements of "author" and forcing document writers to write:

  <author id="Charles-M.-Schulz">
    <name>
      Charles M. Schulz
    </name>
    <nickName>
      SPARKY
    </nickName>
    <born>
      1922-11-26
    </born>
    <dead>
      2000-02-12
    </dead>
  </author>

rather than

  <author id="Charles-M.-Schulz">
    <name>
      Charles M. Schulz
    </name>
    <dead>
      2000-02-12
    </dead>
    <born>
      1922-11-26
    </born>
    <nickName>
      SPARKY
    </nickName>
  </author>

After all, the elements have names, and it is not much more complex to write applications that retrieve the information they need whatever the order of the sub-elements. So why should we bother document writers with a fixed order?

RELAX NG allows such definitions without any restriction through the use of "interleave" elements (XML syntax) or the "&" operator (non-XML syntax). The updated definition of the author element, with the restriction on the order of the sub-elements removed, would be:

  <element name="author">
    <ref name="idAttribute"/>
    <interleave>
      <element name="name">
        <data type="token"/>
      </element>
      <element name="nickName">
        <data type="token"/>
      </element>
      <element name="born">
        <data type="date"/>
      </element>
      <element name="dead">
        <data type="date"/>
      </element>
    </interleave>
  </element>

or:

  element author {
    idAttribute&
    element name {xs:token}&
    element nickName {xs:token}&
    element born {xs:date}&
    element dead {xs:date}
  }

Note that this applies even when some of the sub-elements may occur more than once, as in our "book" element:

  bookElement = element book {
    idAttribute&
    element isbn {xs:token}&
    element title {xs:token}&
    element author-ref{idrefAttribute} *&
    element character-ref{idrefAttribute} *
  }
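
The effect of interleaving on the "author" element can be approximated in Python by comparing counts of child names instead of their sequence. This is a sketch under stated assumptions: `author_children_ok` is a hypothetical helper, it checks only that each of the four sub-elements occurs exactly once in any order, and it ignores the id attribute and the datatypes.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Each of these sub-elements must occur exactly once, in any order.
REQUIRED_ONCE = ("name", "nickName", "born", "dead")


def author_children_ok(xml_text: str) -> bool:
    """Order-insensitive check of the sub-elements of author."""
    author = ET.fromstring(xml_text)
    counts = Counter(child.tag for child in author)
    return author.tag == "author" and counts == Counter(REQUIRED_ONCE)


shuffled = (
    '<author id="Charles-M.-Schulz"><name>Charles M. Schulz</name>'
    "<dead>2000-02-12</dead><born>1922-11-26</born>"
    "<nickName>SPARKY</nickName></author>"
)
print(author_children_ok(shuffled))
```

The shuffled document is accepted, while a document with a missing or duplicated sub-element would still be rejected.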

3.8. Opening our schema

If we come back to our highly simplified example with only "library" and "book" elements, we have achieved a pretty good equivalence with the closed schemas previously developed with XSLT and Schematron. You may wonder whether we can open our schema to allow arbitrary text and element nodes within our book element, as we were able to do there.

The first step is to define an open pattern accepting any element. RELAX NG has no predefined pattern for this, but it is not a big deal with what we have seen so far plus one new goody: the "anyName" element ("*" in the non-XML syntax), which implements name wildcards:

  <define name="anyElement">
    <element>
      <anyName/>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

or:

  anyElement = element * {(attribute * {text}|text|anyElement)*}

The other thing to note is that recursive patterns are allowed when the recursion happens within an element, as is the case here.

The surprise comes when we try to use this named pattern in our book element:

  <element name="book">
    <attribute name="id">
      <data type="ID"/>
    </attribute>
    <zeroOrMore>
      <choice>
        <ref name="anyElement"/>
        <text/>
      </choice>
    </zeroOrMore>
  </element>

or:

  element book {attribute id {dtd:ID}, (anyElement|text)*}

The schema is then detected as invalid with the following error:

 Error at URL ...
line number 5, column number 22:
conflicting ID-types for attribute "id" of element "book"

We have been hit by a side effect of the DTD compatibility library used for our id attribute. To make sure that this is not a limitation of the RELAX NG language itself, we can simply change the definition of this attribute to plain text:

  <element name="book">
    <attribute name="id">
      <text/>
    </attribute>
    <zeroOrMore>
      <choice>
        <ref name="anyElement"/>
        <text/>
      </choice>
    </zeroOrMore>
  </element>

or:

  element book {attribute id {text}, (anyElement|text)*}

And our schema becomes valid.

What is happening here is that, to emulate the behavior of a DTD, RELAX NG requires that if an ID attribute is defined somewhere for an element, the same ID attribute must be defined in all the other definitions of this element. This is not the case in the definition of our "anyElement" pattern, which may, through the wildcard, match a "book" element that does not carry a mandatory id attribute of type dtd:ID.

To work around this issue, we may either avoid using the ID type, as shown above, or, if we want to use this type, exclude or handle separately the case of a book element nested inside the top-level book. The exclusion can be done through the "except" and "name" elements (or the "-" operator in the non-XML syntax):

  <define name="anyElement">
    <element>
      <anyName>
        <except>
          <name>book</name>
        </except>
      </anyName>
      <zeroOrMore>
        <choice>
          <attribute>
            <anyName/>
          </attribute>
          <text/>
          <ref name="anyElement"/>
        </choice>
      </zeroOrMore>
    </element>
  </define>

or:

 anyElement = element * - book {(attribute * {text}|text|anyElement)*}
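
The recursive wildcard with its exclusion can be mirrored by a small recursive function. The Python sketch below is an illustration only, with hypothetical names (`is_any_element`, `open_book_ok`): it accepts any attributes and text inside the wildcard elements, rejects any nested element named "book", and does not check the ID datatype or namespaces.

```python
import xml.etree.ElementTree as ET


def is_any_element(el, excluded=("book",)) -> bool:
    """Mirror of anyElement: any name except the excluded ones, recursively."""
    if el.tag in excluded:
        return False
    # Attributes and text are unconstrained; only child elements are checked.
    return all(is_any_element(child, excluded) for child in el)


def open_book_ok(xml_text: str) -> bool:
    """A book with exactly an id attribute and open content (no nested book)."""
    book = ET.fromstring(xml_text)
    if book.tag != "book" or set(book.attrib) != {"id"}:
        return False
    return all(is_any_element(child) for child in book)


print(open_book_ok('<book id="b1"><title lang="en">Being a Dog</title> note</book>'))
print(open_book_ok('<book id="b1"><extra><book id="b2"/></extra></book>'))
```

The first document is accepted and the second is rejected, since a "book" element appears inside the open content.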

3.9. Other features

RELAX NG has other nice features which we will not cover here and which are detailed in the very good tutorial available on its web site (http://relaxng.org), such as:

  • Schema composition and pattern redefinitions.
  • Namespace support
  • Annotations
  • List of values