XML::Twig - A perl module for processing huge XML documents in tree mode.
single-tree mode

    my $t= new XML::Twig();
    $t->parse( '<doc><para>para1</para></doc>');
    $t->print;

chunk mode

    my $t= new XML::Twig( TwigHandlers => { section => \&flush});
    $t->parsefile( 'doc.xml');
    $t->flush;
    sub flush { $_[0]->flush; }

    my $t= new XML::Twig( TwigHandlers =>
                            { 'section/title' => \&print_elt_text} );
    $t->parsefile( 'doc.xml');
    sub print_elt_text
      { my( $t, $elt)= @_;
        print $elt->text;
      }

    my $t= new XML::Twig( TwigHandlers =>
                            { 'section[@level="1"]' => \&print_elt_text }
                        );
    $t->parsefile( 'doc.xml');

roots mode (builds only the required sub-trees)

    my $t= new XML::Twig(
             TwigRoots => { 'section/title' => \&print_elt_text}
                        );
    $t->parsefile( 'doc.xml');
    sub print_elt_text
      { my( $t, $elt)= @_;
        print $elt->text;
      }

This module provides a way to process XML documents. It is built on top
of XML::Parser.
The module offers a tree interface to the document, while allowing you
to output the parts of it that have been completely processed.
It allows minimal resource (CPU and memory) usage by building the tree
only for the parts of the document that need actual processing, through the
use of the TwigRoots and TwigPrintOutsideRoots options. The finish and
finish_print methods also help to improve performance.
XML::Twig tries to make simple things easy, so it does its best to take care
of a lot of the (usually) annoying (but sometimes necessary) features that
come with XML and XML::Parser.
- Whitespaces
-
Whitespace that looks non-significant is discarded. This behaviour can be
controlled using the KeepSpaces, KeepSpacesIn and DiscardSpacesIn options.
- Encoding
-
You can specify that you want the output in the same encoding as the input
(provided you have valid XML, which means you have to specify the encoding
either in the document or when you create the Twig object) using the KeepEncoding
option.
A twig is a subclass of XML::Parser, so all XML::Parser methods can be
called on a twig object, including parse and parsefile.
setHandlers, on the other hand, cannot be used, see
- new
-
This is a class method, the constructor for XML::Twig. Options are passed
as keyword value pairs. Recognized options are the same as XML::Parser,
plus some XML::Twig specifics:
- TwigHandlers
-
This argument replaces the corresponding XML::Parser argument. It consists
of a hash { expression => \&handler} where expression is a
generic_attribute_condition, a string_condition, a regexp_condition,
an attribute_condition, a full_path, a partial_path, a gi,
_default_ or _all_.
The idea is to support a useful but efficient (thus limited) subset of
XPATH. A fuller expression set will be supported in the future, as users
ask for more and as I manage to implement it efficiently. This will never
encompass all of XPATH due to the streaming nature of parsing (no lookahead
after the element end tag).
A generic_attribute_condition is a condition on an attribute, in the form
*[@att="val"] or *[@att]. Single quotes can be used instead of double
quotes and the leading '*' is actually optional. No matter what the gi of the
element is, the handler will be triggered either if the attribute has the
specified value or if it just exists.
A string_condition is a condition on the content of an element, in the form
gi[string()="foo"]. Single quotes can be used instead of double quotes. At
the moment you cannot escape the quotes (this will be added as soon as I
dig out my copy of Mastering Regular Expressions from its storage box).
The text returned is, as per what I (and Matt Sergeant!) understood from
the XPATH spec, the concatenation of all the text in the element, excluding
all markup. Thus to call a handler on the element <p>text <b>bold</b></p>
the appropriate condition is p[string()="text bold"]. Note that this is not
exactly conformant to the XPATH spec, it just tries to mimic it while being
still quite concise.
An extension of that notation is gi[string(child_gi)="foo"] where the handler
will be called if a child of a gi element has a text value of foo.
At the moment only direct children of the gi element are checked. If you
need to test on descendants of the element let me know. The fix is trivial
but would slow down the checks, so I'd like to keep it the way it is.
A regexp_condition is a condition on the content of an element, in the form
gi[string()=~ /foo/]. This is the same as a string condition except that
the text of the element is matched to the regexp. The i, m, s and o
modifiers can be used on the regexp.
The gi[string(child_gi)=~ /foo/] extension is also supported.
An attribute_condition is a simple condition on an attribute of the
current element, in the form gi[@att="val"] (single quotes can be used
instead of double quotes; you cannot escape quotes either).
If several attribute_conditions are true for the same element, all the handlers
can be called in turn (in the order in which they were first defined).
If the ="val" part is omitted (the condition is then gi[@att]) then
the handler is triggered if the attribute actually exists for the element,
no matter what its value is.
A full_path looks like '/doc/section/chapter/title', it starts with
a / then gives all the gi's to the element. The handler will be called if
the path to the current element (in the input document) is exactly as
defined by the full_path.
A partial_path is like a full_path except it does not start with a /:
'chapter/title' for example. The handler will be called if the path to
the element (in the input document) ends as defined in the partial_path.
WARNING: (hopefully temporary) at the moment string_condition, regexp_condition and
attribute_condition are only supported on a simple gi, not on a path.
A gi (generic identifier) is just a tag name.
A special gi _all_ is used to call a function for each element.
The special gi _default_ is used to call a handler for each element
that does NOT have a specific handler.
The order of precedence to trigger a handler is: generic_attribute_condition,
string_condition, regexp_condition, attribute_condition, full_path, longer partial_path,
shorter partial_path, gi, _default_ .
Important: once a handler has been triggered, if it returns 0 then no other handler
is called, except an _all_ handler, which will be called anyway.
If a handler returns a true value and other handlers apply, then the next
applicable handler will be called. Lather, rinse, repeat...
When an element is CLOSED the corresponding handler is called, with 2
arguments: the twig and the element. The twig includes the document
tree that has been built so far, the element is the complete sub-tree for
the element.
Text is stored in elements where gi is #PCDATA (due to mixed content, text
and sub-element in an element there is no way to store the text as just an
attribute of the enclosing element).
Warning: if you have used purge or flush on the twig the element might not
be complete, some of its children might have been entirely flushed or purged,
and the start tag might even have been printed (by flush) already, so changing
its gi might not give the expected result.
More generally, the full_path, partial_path and gi expressions are
evaluated against the input document, which means that even if you have changed
the gi of an element (changing the gi of a parent element from a handler for
example) the change will not impact the expression evaluation. Attributes in
attribute_condition are different though. As the initial value of an attribute
is not stored, the handler will be triggered if the current attribute/value
pair is found when the element end tag is found. Although this can be quite
confusing it should not impact most users, and it allows others to play clever
tricks with temporary attributes. Let me know if this is a problem for you.
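The handler expressions described above can be tried on a small in-memory document. The following is a minimal sketch, not a reference implementation: the sample XML, the @titles and @top_levels arrays and the anonymous handlers are made up for illustration, and it uses the XML::Twig->new call form, which is equivalent to new XML::Twig.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document with two sections of different levels.
my $xml = '<doc>'
        . '<section level="1"><title>Intro</title><para>p1</para></section>'
        . '<section level="2"><title>Details</title></section>'
        . '</doc>';

my @titles;      # text of every title that is a child of a section
my @top_levels;  # title text of the sections whose level attribute is "1"

my $t = XML::Twig->new(
    TwigHandlers => {
        # partial_path: triggered when the path to the element ends in section/title
        'section/title' => sub {
            my( $twig, $elt) = @_;
            push @titles, $elt->text;
        },
        # attribute_condition: triggered when a section with level="1" is closed
        'section[@level="1"]' => sub {
            my( $twig, $elt) = @_;
            push @top_levels, $elt->first_child( 'title')->text;
        },
    },
);
$t->parse( $xml);

print "titles: @titles\n";
print "level 1: @top_levels\n";
```

Each handler receives the twig and the element that was just closed, so the section handler can already read the title child that was parsed inside it.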
- TwigRoots
-
This argument lets you build the tree only for those elements you are interested
in.
Example: my $t= new XML::Twig( TwigRoots => { title => 1, subtitle => 1});
         $t->parsefile( file);
         my $t= new XML::Twig( TwigRoots => { 'section/title' => 1});
         $t->parsefile( file);
The first example returns a twig containing a document including only title and subtitle elements,
as children of the root element.
You can use generic_attribute_condition, attribute_condition, full_path,
partial_path, gi,
_default_ and _all_ to trigger the building of the twig.
string_condition and regexp_condition cannot be used, as the content of the element, and thus
the string, has not yet been parsed when the condition is checked.
WARNING: paths are checked against the document. Even if the TwigRoots option is used
they will be checked against the full document tree, not the virtual tree created
by XML::Twig.
WARNING: TwigRoots elements should NOT be nested, that would hopelessly confuse
XML::Twig ;--(
Note: you can set handlers (TwigHandlers) using TwigRoots
Example: my $t= new XML::Twig( TwigRoots => { title    => sub { $_[1]->print;},
                                              subtitle => \&process_subtitle });
         $t->parsefile( file);
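To see what a twig built with TwigRoots contains, here is a small sketch under the assumptions stated in this section (only the root elements are built, and they end up as children of the document root). The sample document and variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document: titles nested inside sections.
my $xml = '<doc>'
        . '<section><title>A</title><para>skipped</para></section>'
        . '<section><title>B</title></section>'
        . '</doc>';

# Build the tree only for title elements.
my $t = XML::Twig->new( TwigRoots => { title => 1 });
$t->parse( $xml);

# The titles are attached directly to the root of the twig ...
my @titles = map { $_->text } $t->root->children( 'title');

# ... and elements outside the roots were never built.
my @paras = $t->root->descendants( 'para');

print "titles: @titles\n";
print "para elements in the twig: ", scalar @paras, "\n";
```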
- TwigPrintOutsideRoots
-
To be used in conjunction with the TwigRoots argument. When set to a true value this
will print the document outside of the TwigRoots elements.
Example: my $t= new XML::Twig( TwigRoots => { title => \&number_title },
                              TwigPrintOutsideRoots => 1,
                            );
         $t->parsefile( file);
         { my $nb;
           sub number_title
             { my( $twig, $title)= @_;
               $nb++;
               $title->prefix( "$nb ");
               $title->print;
             }
         }
This example prints the document outside of the title element, calls number_title for
each title element, prints it, and then resumes printing the document. The twig is built
only for the title elements.
- StartTagHandlers
-
A hash { expression => \&handler}. Sets element handlers that are called when the element
is open (at the end of the XML::Parser Start handler). The handlers are called with
2 params: the twig and the element. The element is empty at that point, its
attributes are created though.
You can use generic_attribute_condition, attribute_condition, full_path,
partial_path, gi, _default_ and _all_ to trigger the handler.
string_condition and regexp_condition cannot be used, as the content of the element,
and thus the string, has not yet been parsed when the condition is checked.
The main use for those handlers is probably to create temporary attributes
that will be used when processing sub-elements with TwigHandlers.
You should also use it to change tags if you use flush. If you change the tag in a
regular TwigHandler then the start tag might already have been flushed.
Note: StartTag handlers can be called outside of TwigRoots if this
argument is used. In this case handlers are called with the following arguments:
$t (the twig), $gi (the gi of the element) and %att (a hash of the attributes
of the element).
If the TwigPrintOutsideRoots argument is also used then the start tag will be printed
if the last handler called returns a true value. If it does not then the start tag will
not be printed (so you can print a modified string yourself, for example).
- EndTagHandlers
-
A hash { expression => \&handler}. Sets element handlers that are called when the element
is closed (at the end of the XML::Parser End handler). The handlers are called with
2 params: the twig and the gi of the element.
TwigHandlers are called when an element is completely parsed, so why have this redundant
option? There is only one use for EndTagHandlers: when using the TwigRoots option, to
trigger a handler for an element outside the roots. It is for example very useful to
number titles in a document using nested sections:
my @no= (0);
my $no;
my $t= new XML::Twig(
StartTagHandlers => { section => sub { $no[$#no]++; $no= join '.', @no; push @no, 0; } },
TwigRoots => { title => sub { $_[1]->prefix( $no); $_[1]->print; } },
EndTagHandlers => { section => sub { pop @no; } },
TwigPrintOutsideRoots => 1
);
$t->parsefile( $file);
Using the EndTagHandlers argument without TwigRoots will result in an error.
- CharHandler
-
A reference to a subroutine that will be called every time PCDATA is found.
- KeepEncoding
-
This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and
you want to keep it that way, then setting KeepEncoding will use the Expat
original_string method for characters, thus keeping the original encoding, as
well as the original entities in the strings.
WARNING: if the original encoding is multi-byte then attribute parsing will
be EXTREMELY unsafe under any Perl before 5.6, as it uses regular expressions
which do not deal properly with multi-byte characters.
WARNING: this option is NOT used when parsing with the non-blocking parser
(parse_start, parse_more, parse_done methods).
- LoadDTD
-
If this argument is set to a true value, parse or parsefile on the twig will load
the DTD information. This information can then be accessed through the twig,
in a DTDHandler for example. This will load even an external DTD.
See DTD Handling for more information
- DTDHandler
-
Sets a handler that will be called once the doctype (and the DTD) have been loaded,
with 2 arguments, the twig and the DTD.
- Id
-
This optional argument gives the name of an attribute that can be used as
an ID in the document. Elements whose ID is known can be accessed through
the elt_id method. Id defaults to 'id'.
See
- DiscardSpaces
-
If this optional argument is set to a true value then spaces are discarded
when they look non-significant: strings containing only spaces are discarded.
This argument is set to true by default.
- KeepSpaces
-
If this optional argument is set to a true value then all spaces in the
document are kept, and stored as PCDATA.
KeepSpaces and DiscardSpaces cannot both be set.
- DiscardSpacesIn
-
This argument sets KeepSpaces to true but will cause the twig builder to
discard spaces in the elements listed.
The syntax for using this argument is:
new XML::Twig( DiscardSpacesIn => [ 'elt1', 'elt2']);
- KeepSpacesIn
-
This argument sets DiscardSpaces to true but will cause the twig builder to
keep spaces in the elements listed.
The syntax for using this argument is:
new XML::Twig( KeepSpacesIn => [ 'elt1', 'elt2']);
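The effect of the whitespace options above can be observed by counting the children of an element. This is a minimal sketch with a made-up document; it assumes only the default behaviour and the KeepSpaces option as described in this section.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document: whitespace-only text (with line feeds) around <a>.
my $xml = "<doc>\n  <a>x</a>\n</doc>";

# Default behaviour: strings containing only spaces are discarded.
my $discard = XML::Twig->new;
$discard->parse( $xml);
my @discard_children = $discard->root->children;

# KeepSpaces: all spaces are kept and stored as #PCDATA elements.
my $keep = XML::Twig->new( KeepSpaces => 1);
$keep->parse( $xml);
my @keep_children = $keep->root->children;

printf "default: %d child(ren), KeepSpaces: %d child(ren)\n",
       scalar @discard_children, scalar @keep_children;
```

With KeepSpaces the extra children are the #PCDATA elements holding the whitespace before and after the a element, which is why code that relies on first_child without a gi can behave differently under the two settings.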
- PrettyPrint
-
Sets the pretty print method, amongst 'none' (default), 'nsgmls', 'nice',
'indented', 'record' and 'record_c'.
- EmptyTags
-
Sets the empty tag display style (normal, html or expand).
- Comments
-
Sets the way comments are processed: drop (default), keep or process
- drop
-
drops the comments, they are not read, nor printed to the output
- keep
-
comments are loaded and will appear on the output, they are not
accessible within the twig and will not interfere with processing
though
Bug: comments in the middle of a text element such as
  <p>text <!-- comment --> more text</p>
are output at the end of the text:
  <p>text more text <!-- comment --></p>
- process
-
comments are loaded in the twig and will be treated as regular elements
(their gi is #COMMENT). This can interfere with processing if you
expect $elt->{first_child} to be an element but find a comment there.
Validation will not protect you from this as comments can happen anywhere.
You can use $elt->first_child( 'gi') (which is a good habit anyway)
to get where you want. Consider using
- Pi
-
Sets the way processing instructions are processed:
drop, keep (default) or process
Note that you can also set PI handlers in the TwigHandlers option:
  '?'       => \&handler
  '?target' => \&handler_2
The handlers will be called with 2 parameters, the twig and the PI element, if Pi is set to process,
and with 3, the twig, the target and the data, if Pi is set to keep. Of course they will not be
called if Pi is set to drop.
If Pi is set to keep the handler should return a string that will be used as-is as the PI text
(it should look like "<?target data?>", or '' if you want to remove the PI).
Only one handler will be called, ?target or ? if no specific handler for that target is available.
Note: I _HATE_ the Java-like name of arguments used by most XML modules. As XML::Twig
is based on XML::Parser I kept the style, but you can also use a more perlish syntax, using
twig_print_outside_roots instead of TwigPrintOutsideRoots or pretty_print instead of
PrettyPrint, XML::Twig then normalizes all the argument names.
- parse(SOURCE [, OPT => OPT_VALUE [...]])
-
This method is inherited from XML::Parser.
The SOURCE parameter should either be a string containing the whole XML
document, or it should be an open IO::Handle. Constructor options to
XML::Parser::Expat given as keyword-value pairs may follow the SOURCE
parameter. These override, for this call, any options or attributes passed
through from the XML::Parser instance.
A die call is thrown if a parse error occurs. Otherwise it will return
the twig built by the parse. Use safe_parse if you want the parsing
to return even when an error occurs.
- parsestring
-
This is just an alias for parse for backwards compatibility.
- parsefile(FILE [, OPT => OPT_VALUE [...]])
-
This method is inherited from XML::Parser.
Open FILE for reading, then call parse with the open handle. The file
is closed no matter how parse returns.
A die call is thrown if a parse error occurs. Otherwise it will return
the twig built by the parse. Use safe_parsefile if you want the parsing
to return even when an error occurs.
- safe_parse( SOURCE [, OPT => OPT_VALUE [...]])
-
This method is similar to parse except that it wraps the parsing in an
eval block. It returns the twig on success and 0 on failure (the twig object
also contains the parsed twig). $@ contains the error message on failure.
Note that the parsing still stops as soon as an error is detected, there is
no way to keep going after an error.
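A short sketch of the difference between a successful and a failed safe_parse follows. The two sample strings are made up, and the code only relies on what is stated above: a true return value on success, a false one on failure, and the error message in $@.

```perl
use strict;
use warnings;
use XML::Twig;

# Well-formed input: safe_parse returns a true value (the twig).
my $good_twig = XML::Twig->new;
my $ok = $good_twig->safe_parse( '<doc><para>fine</para></doc>');

# Malformed input (para is never closed): safe_parse returns a false
# value instead of dying.
my $bad_twig = XML::Twig->new;
my $bad = $bad_twig->safe_parse( '<doc><para>broken</doc>');
my $error = $@;    # copy the message right away, before another eval resets it

print "first parse: ",  ( $ok  ? "ok" : "failed"), "\n";
print "second parse: ", ( $bad ? "ok" : "failed"), "\n";
print "error was: $error\n" if $error;
```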
- safe_parsefile(FILE [, OPT => OPT_VALUE [...]])
-
This method is similar to parsefile except that it wraps the parsing in an
eval block. It returns the twig on success and 0 on failure (the twig object
also contains the parsed twig). $@ contains the error message on failure.
Note that the parsing still stops as soon as an error is detected, there is
no way to keep going after an error.
- setTwigHandlers ($handlers)
-
Set the Twig handlers. $handlers is a reference to a hash similar to the
one in the TwigHandlers option of new. All previous handlers are unset.
The method returns the reference to the previous handlers.
- setTwigHandler ($gi, $handler)
-
Set a single Twig handler for the $gi element. $handler is a reference to
a subroutine. If the handler was previously set then the reference to the
previous handler is returned.
- setStartTagHandlers ($handlers)
-
Set the StartTag handlers. $handlers is a reference to a hash similar to the
one in the StartTagHandlers option of new. All previous handlers are unset.
The method returns the reference to the previous handlers.
- setStartTagHandler ($gi, $handler)
-
Set a single StartTag handler for the $gi element. $handler is a reference to
a subroutine. If the handler was previously set then the reference to the
previous handler is returned.
- setEndTagHandlers ($handlers)
-
Set the EndTag handlers. $handlers is a reference to a hash similar to the
one in the EndTagHandlers option of new. All previous handlers are unset.
The method returns the reference to the previous handlers.
- setEndTagHandler ($gi, $handler)
-
Set a single EndTag handler for the $gi element. $handler is a reference to
a subroutine. If the handler was previously set then the reference to the
previous handler is returned.
- dtd
-
Returns the dtd (an XML::Twig::DTD object) of a twig
- root
-
Returns the root element of a twig
- first_elt ($optional_gi)
-
Returns the first element of a twig whose gi is $optional_gi. If
no $optional_gi is given then the root is returned.
- elt_id ($id)
-
Returns the element whose id attribute is $id
- entity_list
-
Returns the entity list of a twig
- change_gi ($old_gi, $new_gi)
-
Performs a (very fast) global change. All elements old_gi are now new_gi.
See
- flush ($optional_filehandle, $options)
-
Flushes a twig up to (and including) the current element, then deletes
all unnecessary elements from the tree that's kept in memory.
flush keeps track of which elements need to be open/closed, so if you
flush from handlers you don't have to worry about anything. Just keep
flushing the twig every time you're done with a sub-tree and it will
come out well-formed. After the whole parsing don't forget to flush
one more time to print the end of the document.
The doctype and entity declarations are also printed.
flush takes an optional filehandle as an argument.
options: use the Update_DTD option if you have updated the (internal) DTD
and/or the entity list and you want the updated DTD to be output.
The PrettyPrint option sets the pretty printing of the document.
Example: $t->flush( Update_DTD => 1);
         $t->flush( \*FILE, Update_DTD => 1);
         $t->flush( \*FILE);
- flush_up_to ($elt, $optional_filehandle, %options)
-
Flushes up to the $elt element. This allows you to keep part of the
tree in memory when you flush.
options: see flush.
- purge
-
Does the same as a flush except it does not print the twig. It just deletes
all elements that have been completely parsed so far.
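A typical use of purge is to process repeated elements one at a time and drop each of them once it has been handled. The sketch below assumes a made-up log document and entry handler; only the parse, purge, root and children calls described in this document are used.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document: five entry elements in a log.
my $xml = '<log>'
        . join( '', map { "<entry id=\"$_\">event $_</entry>" } 1..5)
        . '</log>';

my $count = 0;    # number of entries seen by the handler

my $t = XML::Twig->new(
    TwigHandlers => {
        entry => sub {
            my( $twig, $elt) = @_;
            $count++;        # process the entry (here: just count it)
            $twig->purge;    # then free everything parsed so far
        },
    },
);
$t->parse( $xml);

# Entries purged in the handler are no longer part of the tree.
my @left = $t->root->children( 'entry');
print "processed $count entries, ", scalar @left, " still in memory\n";
```

Calling flush instead of purge in the handler would follow the same pattern, with the difference that the processed part of the document is printed before it is deleted.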
- purge_up_to ($elt)
-
Purges up to the $elt element. This allows you to keep part of the
tree in memory when you purge.
- print ($optional_filehandle, %options)
-
Prints the whole document associated with the twig. To be used only AFTER the
parse.
options: see flush.
- sprint
-
Returns the text of the whole document associated with the twig. To be used only
AFTER the parse.
options: see flush.
- set_pretty_print ($style)
-
Sets the pretty print method, amongst 'none' (default), 'nsgmls', 'nice',
'indented', 'record' and 'record_c'.
WARNING: the pretty print style is a GLOBAL variable, so once set it's
applied to ALL print's (and sprint's). Same goes if you use XML::Twig
with mod_perl. This should not be a problem as the XML that's generated
is valid anyway, and XML processors (as well as HTML processors, including
browsers) should not care. Let me know if this is a big problem, but at the
moment the performance/cleanliness trade-off clearly favors the global
approach.
- set_empty_tag_style ($style)
-
Sets the empty tag display style (normal, html or expand). As with
set_pretty_print this sets a global flag.
normal outputs an empty tag '<tag/>', html adds a space '<tag />' and
expand outputs '<tag></tag>'.
- print_prolog ($optional_filehandle, %options)
-
Prints the prolog (XML declaration + DTD + entity declarations) of a document.
options: see flush.
- prolog ($optional_filehandle, %options)
-
Returns the prolog (XML declaration + DTD + entity declarations) of a document.
options: see flush.
- finish
-
Calls Expat's finish method.
Unsets all handlers (including internal ones that set context), but expat
continues parsing to the end of the document or until it finds an error.
It should finish up a lot faster than with the handlers set.
- finish_print
-
Stop twig processing, flush the twig and proceed to finish printing the document as
fast as possible. Use this method when modifying a document and the modification is
done.
- depth
-
Calls Expat's depth method, which returns the depth in the tree during the parsing.
This is useful when using the TwigRoots option to still get info on the actual document.
- in_element ($gi)
-
Calls Expat's in_element method.
Returns true if $gi is equal to the name of the innermost currently opened
element. If namespace processing is being used and you want to check
against a name that may be in a namespace, then use the generate_ns_name
method to create the $gi argument. Useful when using the TwigRoots option.
- within_element ($gi)
-
Calls Expat's within_element method.
Returns the number of times the given name appears in the context list.
If namespace processing is being used and you want to check
against a name that may be in a namespace, then use the generate_ns_name
method to create the $gi argument. Useful when using the TwigRoots option.
- context
-
Returns a list of element names that represent open elements, with the last
one being the innermost. Inside start and end tag handlers, this will be the
tag of the parent element.
- path ($gi)
-
Returns the element context in a form similar to XPath's short
form: '/root/gi1/../gi'
- get_xpath ($xpath, $optional_offset)
-
Performs a get_xpath on the document root (see Elt)
- new ($optional_gi, $optional_atts, @optional_content)
-
The gi is optional (but then you can't have content), the optional atts
is the ref of a hash of attributes, the content can be just a string or a
list of strings and elements. A content of '#EMPTY' creates an empty element.
Examples: my $elt= new XML::Twig::Elt();
          my $elt= new XML::Twig::Elt( 'para', { align => 'center' });
          my $elt= new XML::Twig::Elt( 'br', '#EMPTY');
          my $elt= new XML::Twig::Elt( 'para');
          my $elt= new XML::Twig::Elt( 'para', 'this is a para');
          my $elt= new XML::Twig::Elt( 'para', $elt3, 'another para');
The strings are not parsed, the element is not attached to any twig.
WARNING: if you rely on ID's then you will have to set the id yourself. At
this point the element does not belong to a twig yet, so the ID attribute
is not known so it won't be stored in the ID list.
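The constructor forms listed above can be checked by serializing the new elements. This is a minimal sketch; it assumes the sprint method also applies to a stand-alone element and that the global pretty print and empty tag styles are left at their defaults ('none' and normal). The variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# An element with a gi, a hash of attributes and a text content.
my $para = XML::Twig::Elt->new( 'para', { align => 'center' }, 'this is a para');

# An empty element, created with the special '#EMPTY' content.
my $br = XML::Twig::Elt->new( 'br', '#EMPTY');

# Neither element is attached to a twig, but both can be serialized.
my $para_xml = $para->sprint;
my $br_xml   = $br->sprint;

print "$para_xml\n";
print "$br_xml\n";
```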
- parse ($string, %args)
-
Creates an element from an XML string. The string is actually
parsed as a new twig, then the root of that twig is returned.
The arguments in %args are passed to the twig.
As always if the parse fails the parser will die, so use an
eval if you want to trap syntax errors.
As obviously the element does not exist beforehand this method has to be
called on the class:
my $elt= parse XML::Twig::Elt( "<a> string to parse, with <sub/>
<elements>, actually tons of </elements>
h</a>");
- set_gi ($gi)
-
Sets the gi of an element
- gi
-
Returns the gi of the element
- closed
-
Returns true if the element has been closed. Might be useful if you are
somewhere in the tree, during the parse, and have no idea whether a parent
element is completely loaded or not.
- is_pcdata
-
Returns 1 if the element is a #PCDATA element, returns 0 otherwise.
- pcdata
-
Returns the text of a PCDATA element or undef if the element is not PCDATA.
- set_pcdata ($text)
-
Sets the text of a PCDATA element.
- append_pcdata ($text)
-
Add the text at the end of a #PCDATA element.
- is_cdata
-
Returns 1 if the element is a #CDATA element, returns 0 otherwise.
- is_text
-
Returns 1 if the element is a #CDATA or #PCDATA element, returns 0 otherwise.
- cdata
-
Returns the text of a CDATA element or undef if the element is not CDATA.
- set_cdata ($text)
-
Sets the text of a CDATA element.
- append_cdata ($text)
-
Add the text at the end of a #CDATA element.
- is_empty
-
Returns 1 if the element is empty, 0 otherwise
- set_empty
-
Flags the element as empty. No further check is made, so if the element
is actually not empty the output will be messed up. The only effect of this
method is that the output will be <gi att="value"/>.
- set_not_empty
-
Flags the element as not empty. If it is actually empty then the element will
be output as <gi att="value"></gi>.
- root
-
Returns the root of the twig in which the element is contained.
- twig
-
Returns the twig containing the element.
- parent ($optional_gi)
-
Returns the parent of the element, or the first ancestor whose gi is
$optional_gi.
- first_child ($optional_gi)
-
Returns the first child of the element, or the first child whose gi is
$optional_gi (ie the first of the element children whose gi matches).
- child ($offset, $optional_gi)
-
Returns the $offset-th child of the element, optionally the $offset-th child
with a gi of $optional_gi. The children are treated as a list, so $elt->child( 0)
is the first child, while $elt->child( -1) is the last child.
- child_text ($offset, $optional_gi)
-
Returns the text of a child or undef if the child does not exist. Arguments are
the same as child.
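The offsets accepted by child and child_text behave like Perl list indexes. The following sketch uses a made-up list document to show positive and negative offsets, with and without a gi; the variable names are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document: items with a note in between.
my $t = XML::Twig->new;
$t->parse( '<list><item>one</item><item>two</item><note>n</note><item>three</item></list>');
my $list = $t->root;

my $first       = $list->child( 0)->text;           # first child, whatever its gi
my $last        = $list->child( -1)->text;          # last child, whatever its gi
my $second_item = $list->child( 1, 'item')->text;   # second child whose gi is item
my $note        = $list->child_text( 0, 'note');    # text of the first note child

print "first: $first, last: $last, second item: $second_item, note: $note\n";
```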
- first_child_text ($optional_gi)
-
Returns the text of the first child of the element, or the first child
whose gi is $optional_gi (ie the first of the element children whose gi matches).
If there is no first_child then returns ''. This avoids getting the
child, checking for its existence then getting the text for trivial cases.
- field ($optional_gi)
-
Same method as first_child_text with a different name
- last_child ($optional_gi)
-
Returns the last child of the element, or the last child whose gi is
$optional_gi (ie the last of the element children whose gi matches).
- last_child_text ($optional_gi)
-
Same as first_child_text but for the last child.
- prev_sibling ($optional_gi)
-
Returns the previous sibling of the element, or the first one whose gi is
$optional_gi.
- sibling ($offset, $optional_gi)
-
Returns the next or previous $offset-th sibling of the element, or the $offset-th one
whose gi is $optional_gi. If $offset is negative then a previous sibling is returned,
if $offset is positive then a next sibling is returned. $offset=0 returns the element
if there is no $optional_gi or if the element gi matches $optional_gi, undef otherwise.
- sibling_text ($offset, $optional_gi)
-
Returns the text of a sibling or undef if the sibling does not exist. Arguments are
the same as sibling.
- next_sibling ($optional_gi)
-
Returns the next sibling of the element, or the first one whose gi is $optional_gi.
- next_elt ($optional_elt, $optional_gi)
-
Returns the next elt (optionally whose gi is $optional_gi) of the element. This is
defined as the next element which opens after the current element opens,
which usually means the first child of the element.
Counter-intuitive as it might look this allows you to loop through the
whole document by starting from the root.
The $optional_elt is the root of a subtree. When the next_elt is out of the
subtree then the method returns undef. You can then walk a sub tree with:
  my $elt= $subtree_root;
  while( $elt= $elt->next_elt( $subtree_root))
    { # insert processing code here
    }
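The loop above can be run on a small document to see the order in which next_elt visits elements. This is a minimal sketch with a made-up document that contains no text (so no #PCDATA elements show up in the walk); the @whole and @inside_a arrays are hypothetical names.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document without any text content.
my $t = XML::Twig->new;
$t->parse( '<doc><a><b/></a><c/></doc>');

# Walk the whole document, starting from the root.
my @whole;
my $elt = $t->root;
while( $elt = $elt->next_elt( $t->root)) {
    push @whole, $elt->gi;
}

# Walk only the sub-tree rooted at the a element: the walk stops
# (next_elt returns undef) as soon as it would leave that sub-tree.
my @inside_a;
my $subtree_root = $t->root->first_child( 'a');
$elt = $subtree_root;
while( $elt = $elt->next_elt( $subtree_root)) {
    push @inside_a, $elt->gi;
}

print "whole document: @whole\n";
print "inside a: @inside_a\n";
```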
- prev_elt ($optional_gi)
-
Returns the previous elt (optionally whose gi is $optional_gi) of the
element. This is the first element which opens before the current one.
It is usually either the last descendant of the previous sibling or
simply the parent.
- children ($optional_gi)
-
Returns the list of children (optionally whose gi is $optional_gi) of the element.
The list is in document order.
- descendants ($optional_gi)
-
Returns the list of all descendants (optionally whose gi is $optional_gi) of the element.
This is the equivalent of the getElementsByTagName of the DOM.
- ancestors ($optional_gi)
-
Returns the list of ancestors (optionally whose gi is $optional_gi) of the element.
The list is ordered from the innermost ancestor to the outermost one.
NOTE: the element itself is not part of the list, in order to include it
you will have to write:
  my @array= ($elt, $elt->ancestors)
- prev_siblings ($optional_gi)
-
Returns the list of previous siblings (optionally whose gi is $optional_gi)
for the element. The elements are ordered in document order.
- next_siblings ($optional_gi)
-
Returns the list of siblings (optionally whose gi is $optional_gi)
following the element. The elements are ordered in document order.
- get_xpath ($xpath, $optional_offset)
-
Returns a list of elements satisfying the $xpath. $xpath is an XPATH-like expression.
A subset of the XPATH abbreviated syntax is covered:
gi
gi[1] (or any other positive number)
gi[last()]
gi[@att] (the attribute exists for the element)
gi[@att="val"]
gi[@att1="val1" and @att2="val2"]
gi[@att1="val1" or @att2="val2"]
gi[string()="toto"] (returns gi elements whose text (as per the text method) is toto)
gi[string()=~/regexp/] (returns gi elements whose text (as per the text method) matches regexp)
expressions can start with / (search starts at the document root)
expressions can start with . (search starts at the current element)
// can be used to get all descendants instead of just direct children
* matches any gi
So the following examples from the XPATH recommendation (http://www.w3.org/TR/xpath.html#path-abbrev)
work:
  para                                  selects the para element children of the context node
  *                                     selects all element children of the context node
  para[1]                               selects the first para child of the context node
  para[last()]                          selects the last para child of the context node
  */para                                selects all para grandchildren of the context node
  /doc/chapter[5]/section[2]            selects the second section of the fifth chapter of the doc
  chapter//para                         selects the para element descendants of the chapter element
                                        children of the context node
  //para                                selects all the para descendants of the document root and thus
                                        selects all para elements in the same document as the context node
  //olist/item                          selects all the item elements in the same document as the
                                        context node that have an olist parent
  .//para                               selects the para element descendants of the context node
  ..                                    selects the parent of the context node
  para[@type="warning"]                 selects all para children of the context node that have a type
                                        attribute with value warning
  employee[@secretary and @assistant]   selects all the employee children of the context node that have
                                        both a secretary attribute and an assistant attribute
The elements will be returned in the document order.
If $optional_offset is used then only one element will be returned, the one with the
appropriate offset in the list, starting at 0.
Quoting and interpolating variables can be a pain when the Perl syntax and the XPATH syntax collide, so here are some
more examples to get you started:
my $p1= "p1";
my $p2= "p2";
my @res= $t->get_xpath( "p[string( '$p1') or string( '$p2')]");
my $a= "a1";
my @res= $t->get_xpath( "//*[\@att=\"$a\"]");
my $val= "a1";
my $exp= "//p[ \@att='$val']"; # note that you need to use \@ or you will get a warning
my @res= $t->get_xpath( $exp);
XML::Twig does not provide full XPATH support. If that's what you want then look no further than the XML::XPath module
on CPAN.
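One way to sidestep most of the quoting trouble is to pick Perl quoting operators that do not clash with the expression. The sketch below is only an illustration on a made-up document; it calls get_xpath on the root element and checks the text of what comes back.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document
my $t = XML::Twig->new;
$t->parse( '<doc><p att="a1">first</p><p att="a2">second</p><p>third</p></doc>');
my $root = $t->root;

my $val = 'a1';
# qq{} leaves both quote characters free for the expression;
# the @ must still be escaped so Perl does not see an array
my @by_att = $root->get_xpath( qq{//p[\@att="$val"]});

# single quotes switch interpolation off altogether
my @second = $root->get_xpath( 'p[2]');

# with an offset a single element comes back, counting from 0
my $third  = $root->get_xpath( 'p', 2);

print $by_att[0]->text, "\n";   # first
print $second[0]->text, "\n";   # second
print $third->text, "\n";       # third
```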
- level ($optional_gi)
-
Returns the depth of the element in the twig (root is 0).
If the optional gi is given then only ancestors of the given type are counted.
WARNING: in a tree created using the TwigRoots option this will not return the
level in the document tree, level 0 will be the document root, level 1 will be
the TwigRoots elements. During the parsing (in a TwigHandler)
you can use the depth method on the twig object to get the real parsing depth.
- in ($potential_parent)
-
Returns true if the element is in the potential_parent ($potential_parent is an element)
- in_context ($gi, $optional_level)
-
Returns true if the element is included in an element whose gi is $gi,
optionally within $optional_level levels. The returned value is the including
element.
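The three position tests above can be tried on a small nested document. This is a sketch under the assumption that the whole tree is built (no TwigRoots); the document and variable names are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document with two nested sect elements
my $t = XML::Twig->new;
$t->parse( '<doc><sect><sect><p>x</p></sect></sect></doc>');
my $p = ( $t->root->descendants( 'p'))[0];

my $depth      = $p->level;            # ancestors: sect, sect, doc
my $sect_depth = $p->level( 'sect');   # only sect ancestors are counted
my $inside     = $p->in( $t->root);    # true: the root includes p
my $holder     = $p->in_context( 'sect');     # the including sect element
my $near_doc   = $p->in_context( 'doc', 1);   # false: doc is 3 levels up

print "level: $depth, sect level: $sect_depth\n";   # level: 3, sect level: 2
print "included in: ", $holder->gi, "\n";           # included in: sect
```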
- atts
-
Returns a hash ref containing the element attributes
- set_atts ({att1=>$att1_val, att2=> $att2_val... })
-
Sets the element attributes with the hash ref supplied as the argument
- del_atts
-
Deletes all the element attributes.
- set_att ($att, $att_value)
-
Sets the attribute of the element to the given value
- att ($att)
-
Returns the attribute value
- del_att ($att)
-
Delete the attribute for the element
- inherit_att ($att, @optional_gi_list)
-
Returns the value of an attribute inherited from parent tags. The value
returned is found by looking for the attribute in the element then in turn
in each of its ancestors. If the @optional_gi_list is supplied only those
ancestors whose gi is in the list will be checked.
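The attribute methods can be combined as in the following sketch. The document, the attribute names and the values are all hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document: lang is only set on the root
my $t = XML::Twig->new;
$t->parse( '<doc lang="en"><sect><p type="note">text</p></sect></doc>');
my $p = $t->root->first_child( 'sect')->first_child( 'p');

$p->set_att( status => 'draft');          # add an attribute
my $type = $p->att( 'type');              # read one back
my $lang = $p->inherit_att( 'lang');      # not on p or sect, found on doc
$p->del_att( 'type');                     # remove a single attribute
my @left = sort keys %{ $p->atts };       # atts returns a hash ref

print "type: $type, lang: $lang\n";       # type: note, lang: en
print "remaining: @left\n";               # remaining: status
```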
- set_id ($id)
-
Sets the id attribute of the element to the value.
See
to change the id attribute name
- id
-
Gets the id attribute value
- del_id ($id)
-
Deletes the id attribute of the element and removes it from the id list
for the document
- cut
-
Cuts the element from the tree.
- copy ($elt)
-
Returns a copy of the element. The copy is a "deep" copy: all sub elements of
the element are duplicated.
- paste ($optional_position, $ref)
-
Pastes a (previously cut) element relative to the $ref element.
The optional position argument can be:
- first_child (default)
-
The element is pasted as the first child of $ref.
- last_child
-
The element is pasted as the last child of $ref.
- before
-
The element is pasted before $ref, as its previous sibling.
- after
-
The element is pasted after $ref, as its next sibling.
- move ($optional_position, $ref)
-
Move an element in the tree.
This is just a cut then a paste. The syntax is the same as paste.
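A short sketch of cut, paste and move, assuming that the element the method is called on is the one being placed and that $ref is the reference element. The document is invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document with three children
my $t = XML::Twig->new;
$t->parse( '<doc><a>A</a><b>B</b><c>C</c></doc>');
my $root  = $t->root;
my $a_elt = $root->first_child( 'a');
my $c_elt = $root->first_child( 'c');

$a_elt->cut;                             # doc now holds b, c
$a_elt->paste( 'after', $c_elt);         # b, c, a
$c_elt->move( 'first_child', $root);     # c, b, a

my @order = map { $_->gi } $root->children;
print "@order\n";                        # c b a
```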
- replace ($ref)
-
Replaces an element in the tree. Sometimes it is just not possible to cut
an element then paste another in its place, so replace comes in handy.
- prefix ($text)
-
Adds a prefix to an element. If the element is a PCDATA element the text
is added to the pcdata, if the element's first_child is a PCDATA then the
text is added to its pcdata, otherwise a new PCDATA element is created
and pasted as the first child of the element.
- suffix ($text)
-
Adds a suffix to an element. If the element is a PCDATA element the text
is added to the pcdata, if the element's last_child is a PCDATA then the
text is added to its pcdata, otherwise a new PCDATA element is created
and pasted as the last child of the element.
- erase
-
Erases the element: the element is deleted and all of its children are
pasted in its place.
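The following sketch applies prefix, suffix and erase to one paragraph of a made-up document, then looks at the resulting text.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document with mixed content
my $t = XML::Twig->new;
$t->parse( '<doc><p>some <b>bold</b> text</p></doc>');
my $p = $t->root->first_child( 'p');

$p->prefix( '[');               # added to the first PCDATA child
$p->suffix( ']');               # added to the last PCDATA child
$p->first_child( 'b')->erase;   # b goes away, its text stays in place

print $p->text, "\n";           # [some bold text]
```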
- delete
-
Cuts the element and frees its memory.
- DESTROY
-
Frees the element from memory.
- start_tag
-
Returns the string for the start tag for the element, including
the /> at the end of an empty element tag
- end_tag
-
Returns the string for the end tag of an element. For an empty
element, this returns the empty string ('').
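A small sketch of start_tag and end_tag on a regular and on an empty element. The document is invented; only the results for the non-empty element and the empty end tag are shown as expected values.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document with one empty element
my $t = XML::Twig->new;
$t->parse( '<doc><p class="x">text</p><br/></doc>');
my $p  = $t->root->first_child( 'p');
my $br = $t->root->first_child( 'br');

my $open  = $p->start_tag;    # <p class="x">
my $close = $p->end_tag;      # </p>
my $none  = $br->end_tag;     # empty string for an empty element

print "$open...$close\n";
print "empty end tag: '", $none, "'\n";
```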
- print ($optional_filehandle, $pretty_print_style)
-
Prints an entire element, including the tags, optionally to a $optional_filehandle,
optionally with a $pretty_print_style.
- sprint ($elt, $optional_no_enclosing_tag)
-
Returns the string for an entire element, including the tags. To be used
with caution!
If the optional second argument is true then only the string inside the
element is returned (the start and end tag for $elt are not).
- set_pretty_print ($style)
-
Sets the pretty print method, amongst 'none' (default), 'nsgmls', 'nice',
'indented', 'record' and 'record_c'
- none
-
the default, no \n is used
- nsgmls
-
nsgmls style, with \n added within tags
- nice
-
adds \n wherever possible (NOT SAFE, can lead to invalid XML)
- indented
-
same as nice plus indents elements (NOT SAFE, can lead to invalid XML)
- record
-
table-oriented pretty print, one field per line
- record_c
-
table-oriented pretty print, more compact than record, one record per line
- set_empty_tag_style ($style)
-
Sets the method to output empty tags, amongst 'normal' (default), 'html',
and 'expand'.
- set_indent ($string)
-
Sets the indentation for the indented pretty print style (default is 2 spaces)
- set_quote ($quote)
-
Sets the quotes used for attributes. can be 'double' (default) or 'single'
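The output settings can be tried with sprint, as in this sketch on a made-up document. Note the assumption, based on the module-scoped values listed further down, that these settings are global and therefore affect every element printed afterwards.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document with an attribute and an empty element
my $t = XML::Twig->new;
$t->parse( '<doc><p class="x">text</p><br/></doc>');
my $root = $t->root;

$root->set_quote( 'single');             # attributes quoted with '
$root->set_empty_tag_style( 'expand');   # empty tags get a separate end tag
my $out = $root->sprint;
print "$out\n";

# the 'indented' style adds line breaks and indentation on output
$root->set_indent( '    ');
$root->set_pretty_print( 'indented');
$root->print;
print "\n";
```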
- text
-
Returns a string consisting of all the PCDATA and CDATA in an element,
without any tags.
- set_text ($string)
-
Sets the text for the element: if the element is a PCDATA, just set its
text, otherwise cut all the children of the element and create a single
PCDATA child for it, which holds the text.
- set_content ( $optional_atts, @list_of_elt_and_strings)
( $optional_atts, '#EMPTY')
-
Sets the content for the element, from a list of strings and
elements. Cuts all the element children, then pastes the list
elements as the children. This method will create a PCDATA element
for any strings in the list.
The optional_atts argument is the ref of a hash of attributes. If this
argument is used then the previous attributes are deleted, otherwise they
are left untouched.
WARNING: if you rely on ID's then you will have to set the id yourself. At
this point the element does not belong to a twig yet, so the ID attribute
is not known so it won't be stored in the ID list.
A content of '#EMPTY' creates an empty element.
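The sketch below contrasts set_text and set_content on a made-up document. It assumes that a new element can be created with XML::Twig::Elt->new( $gi, @content), the constructor documented for elements elsewhere in the module.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document
my $t = XML::Twig->new;
$t->parse( '<doc><title>old</title><p>x</p></doc>');
my $title = $t->root->first_child( 'title');
my $p     = $t->root->first_child( 'p');

# replace the text of the title
$title->set_text( 'new title');

# rebuild p from strings and an element, resetting its attributes
my $em = XML::Twig::Elt->new( 'em', 'important');
$p->set_content( { class => 'intro' }, 'This is ', $em, '.');

print $title->text, "\n";   # new title
print $p->text, "\n";       # This is important.
print $p->sprint, "\n";
```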
- insert (@gi)
-
For each gi in the list inserts an element $gi as the only child of the
element. All children of the element are set as children of the new element.
The upper level element is returned.
$p->insert( 'table', 'tr', 'td') puts the content of $p in a table with a
single tr and a single td and returns the table element.
- wrap_in (@gi)
-
Wraps elements $gi as the successive ancestors of the element, returns the
new element.
$elt->wrap_in( 'td', 'tr', 'table') wraps the element as a single cell in a
table for example.
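To see the difference between the two methods, the sketch below first wraps a paragraph in a table cell and then inserts a b element inside the paragraph. The document is invented, and the example inspects the tree rather than relying on the return values.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document
my $t = XML::Twig->new;
$t->parse( '<doc><p>cell</p></doc>');
my $p = $t->root->first_child( 'p');

# new ancestors are added around p: td first, table outermost
$p->wrap_in( 'td', 'tr', 'table');
my $path = $p->path;
print "$path\n";                 # /doc/table/tr/td/p

# a new only child is added inside p and takes over its content
$p->insert( 'b');
my $inner = $p->first_child->gi;
print "$inner\n";                # b
print $t->root->sprint, "\n";
```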
- cmp ($elt)
-
Compares the order of the 2 elements in a twig.
$a is the <A>..</A> element, $b is the <B>...</B> element
document                           $a->cmp( $b)
<A> ... </A> ... <B> ... </B>      -1
<A> ... <B> ... </B> ... </A>      -1
<B> ... </B> ... <A> ... </A>       1
<B> ... <A> ... </A> ... </B>       1
$a == $b                            0
$a and $b not in the same tree     undef
- before ($elt)
-
Returns 1 if the element starts before $elt, 0 otherwise. If the 2 elements
are not in the same twig then returns undef.
For elements in the same twig, $a->before( $b) amounts to:
if( $a->cmp( $b) == -1) { return 1; } else { return 0; }
- after ($elt)
-
Returns 1 if the element starts after $elt, 0 otherwise. If the 2 elements
are not in the same twig then returns undef.
For elements in the same twig, $a->after( $b) amounts to:
if( $a->cmp( $b) == 1) { return 1; } else { return 0; }
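The cmp values from the table above can be checked on a tiny document, as in this sketch. The document and the variable names are hypothetical; $x and $y are used rather than $a and $b to stay clear of the variables Perl reserves for sort.

```perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document: a comes before b, both inside doc
my $t = XML::Twig->new;
$t->parse( '<doc><a>1</a><b>2</b></doc>');
my $root = $t->root;
my $x = $root->first_child( 'a');
my $y = $root->first_child( 'b');

my $first  = $x->cmp( $y);      # a starts before b
my $second = $y->cmp( $x);      # b starts after a
my $same   = $x->cmp( $x);      # same element
my $nested = $root->cmp( $x);   # doc opens before the a it contains

print "$first $second $same $nested\n";   # -1 1 0 -1
```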
- path
-
Returns the element context in a form similar to XPath's short
form: '/root/gi1/../gi'
- private methods
-
- set_parent ($parent)
-
- set_first_child ($first_child)
-
- set_last_child ($last_child)
-
- set_prev_sibling ($prev_sibling)
-
- set_next_sibling ($next_sibling)
-
- set_twig_current
-
- del_twig_current
-
- twig_current
-
- flushed
-
This method should NOT be used, always flush the twig, not an element.
- set_flushed
-
- del_flushed
-
- flush
-
- contains_text
-
Those methods should not be used, unless of course you find some creative
and interesting, not to mention useful, ways to do it.
- new
-
Creates an entity list.
- add ($ent)
-
Adds an entity to an entity list.
- delete ($ent or $gi)
-
Deletes an entity (defined by its name or by the Entity object)
from the list.
- print ($optional_filehandle)
-
Prints the entity list.
- new ($name, $val, $sysid, $pubid, $ndata)
-
Same arguments as the Entity handler for XML::Parser.
- print ($optional_filehandle)
-
Prints an entity declaration.
- text
-
Returns the entity declaration text.
See the test file in t/test[1-n].t
Additional examples (and a complete tutorial) can be found at
http://www.xmltwig.cx/
To figure out what flush does call the following script with an
xml file and an element name as arguments
use XML::Twig;
my ($file, $elt)= @ARGV;
my $t= new XML::Twig( TwigHandlers =>
{ $elt => sub {$_[0]->flush; print "\n[flushed here]\n";} });
$t->parsefile( $file, ErrorContext => 2);
$t->flush;
print "\n";
There are 3 possibilities for the DTD of a document:
- No DTD
-
No doctype, no DTD information, no entity information, the world is simple...
- Internal DTD
-
The XML document includes an internal DTD, and maybe entity declarations.
If you use the LoadDTD option when creating the twig the DTD information and
the entity declarations can be accessed.
The DTD and the entity declarations will be flush'ed (or print'ed) either as is
(if they have not been modified) or as reconstructed (poorly, comments are lost,
order is not kept, due to its content this DTD should not be viewed by anyone)
if they have been modified. You can also modify them directly by changing the
$twig->{twig_doctype}->{internal} field (straight from XML::Parser, see the
Doctype handler doc)
- External DTD
-
The XML document includes a reference to an external DTD, and maybe entity
declarations.
If you use the LoadDTD option when creating the twig the DTD information and the entity
declarations can be accessed. The entity declarations will be flush'ed (or
print'ed) either as is (if they have not been modified) or as reconstructed (badly,
comments are lost, order is not kept).
You can change the doctype through the $twig->set_doctype method and print the
dtd through the $twig->dtd_text or $twig->dtd_print methods.
If you need to modify the entity list this is probably the easiest way to do it.
If you set handlers and use flush, do not forget to flush the twig one
last time AFTER the parsing, or you might be missing the end of the document.
Remember that element handlers are called when the element is CLOSED, so
if you have handlers for nested elements the inner handlers will be called
first. This makes it trickier than it would seem to number nested
clauses, for example.
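The calling order can be observed with a handler that records each element as it closes. This is a minimal sketch on a made-up document; the clause element and its n attribute are invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my @closed;   # value of n for each clause, in the order handlers fire

my $t = XML::Twig->new(
    TwigHandlers => {
        # called when a clause element is closed
        clause => sub { my( $twig, $elt)= @_; push @closed, $elt->att( 'n'); },
    },
);

# hypothetical sample document with nested clauses
$t->parse( '<doc><clause n="1"><clause n="1.1"/><clause n="1.2"/></clause>'
         . '<clause n="2"/></doc>');

print "@closed\n";   # 1.1 1.2 1 2
```

The outer clause 1 is reported after its sub-clauses, so a handler that numbers clauses as it goes would see the inner ones first.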
- ID list
-
The ID list is NOT updated when ID's are modified or elements cut or
deleted.
- change_gi
-
This method will not function properly if you do:
$twig->change_gi( $old1, $new);
$twig->change_gi( $old2, $new);
$twig->change_gi( $new, $even_newer);
- sanity check on XML::Parser method calls
-
XML::Twig should really prevent calls to some XML::Parser methods, especially
the setHandlers method.
These are the things that can mess up calling code, especially if threaded.
They might also cause problem under mod_perl.
- Exported constants
-
Whether you want them or not you get them! These are subroutines to use
as constants when creating or testing elements
- PCDATA
-
returns '#PCDATA'
- CDATA
-
returns '#CDATA'
- PI
-
returns '#PI', I had the choice between PROC and PI :--(
- Module scoped values: constants
-
these should cause no trouble:
%base_ent= ( '>' => '&gt;',
             '<' => '&lt;',
             '&' => '&amp;',
             "'" => '&apos;',
             '"' => '&quot;',
           );
CDATA_START = "<![CDATA[";
CDATA_END = "]]>";
PI_START = "<?";
PI_END = "?>";
COMMENT_START = "<!--";
COMMENT_END = "-->";
pretty print styles
( $NSGMLS, $NICE, $INDENTED, $RECORD1, $RECORD2)= (1..5);
empty tag output style
( $HTML, $EXPAND)= (1..2);
- Module scoped values: might be changed
-
Most of these deal with pretty printing, so the worst that can
happen is probably that XML output does not look right, but is
still valid and processed identically by XML processors.
$empty_tag_style can mess up HTML browsers though and changing $ID
would most likely create problems.
$pretty=0;            # pretty print style
$quote='"';           # quote for attributes
$INDENT= '  ';        # indent for indented pretty print
$empty_tag_style= 0;  # how to display empty tags
$ID                   # attribute used as an ID ('id' by default)
- Module scoped values: definitely changed
-
These 2 variables are used to replace gi's by an index, thus
saving some space when creating a twig. If they really cause
you too much trouble, let me know, it is probably possible to
create either a switch or at least a version of XML::Twig that
does not perform this optimisation.
%gi2index; # gi => index
@index2gi; # list of gi's
- multiple twigs are not well supported
-
A number of twig features are just global at the moment. These include
the ID list and the "gi pool" (if you use change_gi then you change the gi
for ALL twigs).
The next version will try to support this while trying not to be too
hard on performance (at least when a single twig is used!).
- XML::Parser-like handlers
-
Sometimes it would be nice to be able to use both XML::Twig handlers and
XML::Parser handlers, for example to perform generic tasks on all open
tags, like adding an ID, or taking care of the autonumbering.
Next version...
You can use the benchmark_twig file to do additional benchmarks.
Please send me benchmark information for additional systems.
Michel Rodriguez <m.v.rodriguez@ieee.org>
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
Bug reports and comments to m.v.rodriguez@ieee.org.
The XML::Twig page is at http://www.xmltwig.cx/
XML::Parser