removing duplicate nodes in the sitemap

Posted: 9 June 2009 in CSSDK, Java, Site map, SitePublisher

To remove duplicate nodes from the sitemap, the following code might help. A duplicate is a second node in the site map that points to the same page as another node; the two nodes can have different labels.

                // we need to remove duplicates.
                // we're going to store the Ids of the duplicate nodes in this collection
                ArrayList<String> duplicateNodes = new ArrayList<String>();


                siteMapFile = (CSSimpleFile) workarea.getFile(new CSAreaRelativePath("sites/" + siteName + "/default.sitemap"));

                org.dom4j.Document doc = Dom4jUtils.newDocument(new File(siteMapFile.getVPath().getPathNoServer().toString()));

                SiteMapXml siteMapXML = new SiteMapXml(doc, siteMapFile.getVPath().toString());

                siteMapDocument = siteMapXML.getDocument();
                nodeIterator = siteMapDocument.getRootElement().element("segment").elementIterator("node");
                while (nodeIterator.hasNext()) {
                    // we need to get the link of this node and see if there are others pointing to it
                    org.dom4j.Element node = (org.dom4j.Element) nodeIterator.next();
                    String thisNodeLink = node.element("link").element("value").getText();
                    System.out.println("this node points to " + thisNodeLink);

                    if (!duplicateNodes.contains(node.attribute("id").getText())) {
                        // this node has not already been scheduled for removal in the sitemap
                        Iterator otherNodeIterator = siteMapDocument.getRootElement().element("segment").elementIterator("node");
                        while (otherNodeIterator.hasNext()) {
                            org.dom4j.Element otherNode = (org.dom4j.Element) otherNodeIterator.next();
                            // if this other node points to the same page
                            // but unless it's the same node
                            // remove the node
                            if (thisNodeLink.equals(otherNode.element("link").element("value").getText())) {
                                if (!node.element("label").getText().equals(otherNode.element("label").getText())) {
                                    // keep this id. We will want to do all the deletes after we've processed this iterator, otherwise, we'd need to refresh the data
                                    // however, it poses 1 problem for us. When we get round to process the other node, we must make sure it's not on the deletion list!
                                    String otherNodeId = otherNode.attribute("id").getText();
                                    // schedule the other node for removal, but only once
                                    if (!duplicateNodes.contains(otherNodeId)) {
                                        duplicateNodes.add(otherNodeId);
                                    }
                                    System.out.println(otherNode.element("label").getText() + " is a duplicate of " + node.element("label").getText());
                                } else {
                                    System.out.println("we found ourselves in this sitemap. that's ok");
                                }
                            } else {
                                System.out.println("this node does not point to the same page as us.");
                            }
                        }
                    }
                }

                System.out.println("removing duplicates");
                for (Iterator<String> it = duplicateNodes.iterator(); it.hasNext();) {
                    String nodeId = it.next();
                    System.out.println("removing node with id " + nodeId);
                    siteMapXML.deleteNode(new SiteMapOperation.Delete(nodeId));
                }
                System.out.println("removing duplicates - Done");
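
If you want to try the duplicate detection outside of TeamSite, here is a minimal standalone sketch. It is not the CSSDK code above: it uses the JDK's built-in DOM parser instead of dom4j, and it swaps the nested loop for a single pass with a HashSet of links already seen, so the first node pointing to a page is kept and every later one is reported. The class name, the `site-map` root element and the sample paths are made up for illustration; only the segment/node/link/value layout and the `id` attribute are taken from the code above. It also assumes a flat list of nodes, with no nodes nested inside other nodes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the CSSDK.
class SitemapDuplicates {

    // Returns the ids of all nodes whose link was already used by an earlier node.
    static List<String> findDuplicateIds(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Set<String> seenLinks = new HashSet<>();
        List<String> duplicateIds = new ArrayList<>();

        // getElementsByTagName returns the elements in document order
        NodeList nodes = doc.getElementsByTagName("node");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element node = (Element) nodes.item(i);
            Element link = (Element) node.getElementsByTagName("link").item(0);
            String target = link.getElementsByTagName("value").item(0).getTextContent().trim();

            // Set.add returns false when the link is already in the set
            if (!seenLinks.add(target)) {
                duplicateIds.add(node.getAttribute("id"));
            }
        }
        return duplicateIds;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample: n1 and n2 point to the same page under different labels
        String xml =
                "<site-map><segment>"
              + "<node id=\"n1\"><label>Home</label><link><value>/index.page</value></link></node>"
              + "<node id=\"n2\"><label>Welcome</label><link><value>/index.page</value></link></node>"
              + "<node id=\"n3\"><label>News</label><link><value>/news.page</value></link></node>"
              + "</segment></site-map>";

        System.out.println(findDuplicateIds(xml)); // prints [n2]
    }
}
```

Unlike the label comparison in the code above, this version also catches two nodes that share both the link and the label. The returned ids could then be fed to the same delete loop.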
