
= Python - XML Processing =

  



= What is XML? =

The Extensible Markup Language (XML) is a markup language much like HTML or SGML. It is recommended by the World Wide Web Consortium (W3C) and available as an open standard.

XML is a portable, openly specified format that lets programmers build applications whose data can be read by other applications, regardless of operating system or programming language.

XML is well suited to keeping track of small to medium amounts of data without requiring a SQL-based back end.

= XML Parser Architectures and APIs: =

The Python standard library provides a minimal but useful set of interfaces for working with XML. The two most basic and broadly used APIs for XML data are the SAX and DOM interfaces.

 * **Simple API for XML (SAX):** You register callbacks for events of interest and then let the parser proceed through the document. This is useful when your documents are large or memory is limited: the parser processes the file as it reads it from disk, and the entire file is never stored in memory.
 * **Document Object Model (DOM) API:** This is a World Wide Web Consortium recommendation in which the entire file is read into memory and stored in a hierarchical (tree-based) form that represents all the features of an XML document.

Because SAX streams through the document once, it cannot answer repeated or random-access queries as quickly as DOM can once a tree has been built. On the other hand, DOM holds every document entirely in memory, so using it exclusively can exhaust your resources, both on large files and when many small files are loaded at once.

SAX is read-only, while DOM allows changes to the XML document. Since the two APIs complement each other, there is no reason why you can't use both in a large project.

For all our XML code examples, let's use a simple XML file //movies.xml// as input:

code
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>
code

= Parsing XML with SAX APIs: =

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX generally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.

Your //ContentHandler// handles the particular tags and attributes of your flavor(s) of XML. A ContentHandler object provides methods to handle various parsing events. Its owning parser calls the ContentHandler methods as it parses the XML file.

The methods //startDocument// and //endDocument// are called at the start and the end of the XML file. The method //characters(text)// is passed the character data of the XML file via the parameter text.

The ContentHandler is called at the start and end of each element. If the parser is not in namespace mode, the methods //startElement(tag, attributes)// and //endElement(tag)// are called; otherwise, the corresponding methods //startElementNS// and //endElementNS// are called. Here, tag is the element tag, and attributes is an Attributes object.

Here are other important methods to understand before proceeding:

The //make_parser// Method:
The following function creates a new parser object and returns it. The parser object created will be of the first parser type the system finds.

code
xml.sax.make_parser( [parser_list] )
code

Here is the detail of the parameters:

 * **parser_list:** An optional list of module names to try first; each named module must provide a //create_parser()// function.
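As a minimal sketch of how //make_parser// fits together with namespace mode, the snippet below creates a parser, turns namespace mode on, and feeds it a small in-memory document. The class name NamespaceRecorder, the urn:example namespace, and the sample XML are illustrative assumptions and are not part of //movies.xml//.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces


class NamespaceRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that records (namespace URI, local name) pairs."""

    def __init__(self):
        super().__init__()
        self.names = []

    # In namespace mode the parser calls startElementNS instead of startElement.
    def startElementNS(self, name, qname, attributes):
        self.names.append(name)  # name is a (uri, localname) tuple


# Illustrative document; the namespace URI is made up for this sketch.
SAMPLE = b'<root xmlns="urn:example"><item/></root>'

parser = xml.sax.make_parser()               # first parser type the system finds
parser.setFeature(feature_namespaces, True)  # turn namespace mode on
recorder = NamespaceRecorder()
parser.setContentHandler(recorder)
parser.parse(io.BytesIO(SAMPLE))             # parse() also accepts a file-like object

print(recorder.names)
```

With namespace mode off, the same handler would see nothing, because the parser would call //startElement// rather than //startElementNS//.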

The //parse// Method:
The following function creates a SAX parser and uses it to parse a document.

code
xml.sax.parse( xmlfile, contenthandler[, errorhandler])
code

Here is the detail of the parameters:

 * **xmlfile:** This is the name of the XML file to read from, or an open file object.
 * **contenthandler:** This must be a ContentHandler object.
 * **errorhandler:** If specified, errorhandler must be a SAX ErrorHandler object.
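The optional errorhandler argument is easiest to understand with a malformed document. The sketch below is one possible arrangement, not the only one: ElementCounter, RecordingErrorHandler, and count_elements are hypothetical names, and the documents are parsed from in-memory byte streams instead of files so the example is self-contained.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts element start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, tag, attributes):
        self.count += 1


class RecordingErrorHandler(xml.sax.ErrorHandler):
    """Hypothetical error handler that keeps the message, then re-raises."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception  # stop parsing, as the default handler would


def count_elements(xml_bytes):
    counter = ElementCounter()
    errors = RecordingErrorHandler()
    try:
        # A file-like object is passed here in place of a file name.
        xml.sax.parse(io.BytesIO(xml_bytes), counter, errors)
    except xml.sax.SAXParseException:
        pass  # the message has already been recorded by the error handler
    return counter.count, errors.messages


good_count, good_errors = count_elements(b"<a><b/><b/></a>")
bad_count, bad_errors = count_elements(b"<a><b></a>")  # mismatched tag

print("well-formed:", good_count, "elements,", len(good_errors), "errors")
print("malformed:", len(bad_errors), "error(s) recorded")
```

Re-raising inside //fatalError// is a design choice in this sketch: it keeps the parser from continuing after the document is known to be broken.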

The //parseString// Method:
There is one more function, which creates a SAX parser and parses the specified **XML string**.

code
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
code

Here is the detail of the parameters:

 * **xmlstring:** This is the XML text to parse, given as a bytes or string object.
 * **contenthandler:** This must be a ContentHandler object.
 * **errorhandler:** If specified, errorhandler must be a SAX ErrorHandler object.
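To see the order in which a ContentHandler receives its callbacks, here is a small sketch built on //parseString//. EventRecorder is a hypothetical class written for this illustration, and the one-line document is a cut-down assumption modeled on //movies.xml//.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that logs each parsing event in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        self.events.append("start:" + tag)

    def endElement(self, tag):
        self.events.append("end:" + tag)

    def characters(self, content):
        # Skip whitespace-only chunks; real text may arrive in several calls.
        if content.strip():
            self.events.append("text")


recorder = EventRecorder()
xml.sax.parseString(b'<movie title="Trigun"><stars>10</stars></movie>', recorder)

for event in recorder.events:
    print(event)
```

The document events bracket everything else, and each element's start callback always arrives before its text and its end callback.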

Example:
code
#!/usr/bin/env python3

import xml.sax


class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Called with the character data found inside the current element
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")
code

This would produce the following result:

code
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
code

For complete details on the SAX API, please refer to the standard Python SAX API documentation.

= Parsing XML with DOM APIs: =

The Document Object Model, or "DOM," is a cross-language API from the World Wide Web Consortium (W3C) for accessing and modifying XML documents.

The DOM is extremely useful for random-access applications. SAX only allows you a view of one bit of the document at a time. If you are looking at one SAX element, you have no access to another.

Here is the easiest way to quickly load an XML document and create a DOM tree using the xml.dom.minidom module. The module provides a simple parser function that quickly creates a DOM tree from the XML file.
The example calls the parse( file [,parser] ) function of the xml.dom.minidom module to parse the XML file designated by file into a DOM tree object.

code
#!/usr/bin/env python3

import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detail of each movie.
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # Names avoid shadowing the built-ins type() and format()
    movie_type = movie.getElementsByTagName('type')[0]
    print("Type: %s" % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName('format')[0]
    print("Format: %s" % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description.childNodes[0].data)
code

This would produce the following result:

code
Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom
code

For complete details on the DOM API, please refer to the standard Python DOM API documentation.
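Random access and in-place modification are the two things DOM offers that SAX does not. The sketch below illustrates both with //xml.dom.minidom//. The SAMPLE string is an abbreviated, assumed stand-in for //movies.xml//, and the new values written into the tree are arbitrary.

```python
from xml.dom.minidom import parseString

# Abbreviated stand-in for movies.xml, kept in memory for this sketch.
SAMPLE = (
    '<collection shelf="New Arrivals">'
    '<movie title="Enemy Behind"><stars>10</stars></movie>'
    '<movie title="Ishtar"><stars>2</stars></movie>'
    '</collection>'
)

dom = parseString(SAMPLE)
movies = dom.documentElement.getElementsByTagName("movie")

# Random access: jump straight to the last movie, then back to the first.
last_title = movies[-1].getAttribute("title")
first_title = movies[0].getAttribute("title")
print(last_title, "/", first_title)

# Modification: change the text of the first movie's <stars> element in place.
stars = movies[0].getElementsByTagName("stars")[0]
stars.firstChild.data = "9"

# Add an attribute to the second movie, then serialize the modified tree.
movies[1].setAttribute("format", "VHS")
print(dom.documentElement.toxml())
```

Nothing is written to disk here; to persist the changes you would write the serialized text to a file yourself.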