# Resumen: Satisfiability of Constraint Specifications on XML Documents

## Fecha

## Editor

## Publicado en

## Licencia

## Resumen

In this work, we present an approach for specifying the structure of XML documents using three kinds of constraints based on XPath, together with a sound and complete method for reasoning about them. Currently, the standard specification of classes of XML documents is done by means of DTDs or XML Schemas. In both cases, we essentially describe the abstract syntax of the class of documents and the types of its attributes. This is quite limited. In particular, we may want to state more complex conditions about the structure of documents in a given class or about their contents. For example, with respect to the structure of documents, we may want to state that if an element includes an attribute with a given content, then these documents should not include some other element. Or, with respect to the contents of documents, we may want to express that the value of some numeric attribute of a certain element is smaller than the value of another attribute of a different element. In this paper, we concentrate on the specification of the structure of documents, not paying much attention to their contents. In this sense, we present an abstract approach for the specification of (the structure of) classes of XML documents using sets of constraints that are based on XPath [8, 9] queries, as given in [4], using the concept of tree patterns. Roughly, a tree pattern describes a basic property on the structure of documents. Its root repre sents the root of documents. Nodes represent elements that must be present on the given documents and their labels represent their contents, i.e. the names of elements and their value, if any. A wild card (the symbol ∗), means that we don’t know or we don’t care about the contents of that element. Finally, single edges represent parent/child relations between elements, while double edges represent a descendant relationship between elements. Again, if any of these two relations is included in a tree pattern, then it should also be included in the documents satisfying that property. For instance, on the left of Fig. 1 we show a tree pattern p describing documents D whose root node is labelled with a, some child node of the root node in D is labelled b, and some descendant node of the root node in D has two child nodes labelled c and d, respectively. Similarly, we represent, in an abstract way, XML documents using the same kind of trees. The difference between a document and a tree pattern is that a document does not include double edges or wildcards. For example, on the right of Fig. 1 we show a document that satisfies the pattern on the left. In particular, we may see that the root of the document is labelled by a. Moreover, that root has a child node labelled b and a descendant node (the element labelled f) that has two child nodes labelled c and d, respectively. We consider three kinds of (atomic) constraints. The first one, called positive constraints, are tree patterns. The second one are negative constraints, ¬p, where p is a tree pattern, expressing that documents should not satisfy p. Finally, the third sort of constraint are conditional constraints, written ∀(c : p → q), where both p and q are tree patterns. Roughly speaking, these constraints express that if a document satisfies p then it must also satisfy q. Moreover, these constraints can be combined using the connectives ∧ and ∨. These kinds of constraints are similar to the graph constraints studied in [6, 7] in the context of graph transformation. The work presented in those papers, shows how to use graph constraints as a specification formalism, and how to reason about these specifications. However, the application of these ideas to our setting is not trivial. Specifically, the descendant relation in our constraints makes non-trivial the application of these techniques since, the descendent relation would be second-order in the logic of graph constraints defined in [6, 7]. Obviously, there are conditions on the structure of XML documents that are not expressible using the kind of constraints studied in this paper. However, our experience in the area of graph transformation [6, 7] shows that, in practice, these constraints are sufficient in most cases. Nevertheless, we believe that the ideas presented here can be extended to a class of XML constraints, similar to the class of nested graph conditions that has been shown equivalent to first-order logic of graphs [2]. However, we also believe that this extension is not straightforward. Since our aim is to be able to reason about these specifications, we present inference rules that are shown, by means of tableaux, to define a sound and complete refutation procedure for checking satisfiability of a given specification. We strongly believe that satisfiability problem for this class of constraints is only semidecidable, since we believe that it would be similar to the (un)decidability of the satisfiability problem for the Horn clause fragment of first-order logic. As a consequence, if a given specification is inconsistent, we can be sure that our procedure will terminate showing that unsatisfiability. However, our procedure may not terminate if the given specification is satisfiable. Nevertheless, we may like to have an idea about the performance of our approach when the procedure terminates. One could think, that this performance would be quite poor, since checking if there is a monomorphism between two trees (a basic operation in our deduction procedure) is an NP-complete problem [3]. Actually, this is not our experience with the tool that we have implemented [1]. We think that the situation is similar to what happens with graph transformation tools. In these tools, applying a graph transformation rule means finding a subgraph isomorphism, which is also a wellknown NP-complete problem. However, the fact that the graphs are typed (in our case, the trees are labelled), in practice, reduces considerably the search. Finally, in the future, we plan to extend our approach to consider also cross-references and properties about the contents of documents. The former problem means, in fact, to extend our approach to graphs and graph patterns. For the latter case, we plan to follow the same approach that we used to extend our results for graphs in [6, 7] to the case of attributed graphs in [5].