Hello Everyone

I am having some trouble parsing an XML document with a Perl script.
I have a file like the one attached (I have taken just a part of the original file,
as it is too big to post here and hard to analyze manually).
What I have to do is count the number of different authors publishing a paper
in a single year.
To count this, I can use either the issue printdate or the cpyrtdate.

The trouble is that I am not able to pass a value into the XPath expression. I can hard-code the date:
my $nodeset = $xp->findnodes('//issue[@printdate="1917-07-00"]/..//author');
but I cannot get it to work with a variable:
my $nodeset = $xp->findnodes('//issue[@printdate="$x"]/..//author');
where $x comes from a list containing all the years, like 1917, 1913, etc.

I am using this code, but it is not helping much.

use strict;
use warnings;
use XML::XPath;

my $file = 'Aj.xml';
my $xp = XML::XPath->new(filename => $file);
my $nodeset = $xp->findnodes('//issue[@printdate="1917-07-00"]/..//author');
my @authors;
if (my @nodelist = $nodeset->get_nodelist) {
    # string_value concatenates the text of the name elements inside <author>
    @authors = map { $_->string_value } @nodelist;
    @authors = sort @authors;
    local $" = "\n";    # print the list one name per line
    print "I found these authors:\n@authors\n";
}

I have analyzed the file manually and the things to be considered are as follows:
issue printdate="1913-01-00"
Author names:
DavidW.Cornelius.
FrederickSlate

issue printdate="1913-02-00"
Author names:
DavidW.Cornelius.
LachlanGilchrist

issue printdate="1917-08-00"
Author names:
H.W.Nichols.

issue printdate="1917-07-00"
Author names:
JohnZeleny

So, what I want should look like this:

Year No. of different authors publishing in a single year
1913 3
1917 2

I am kind of stuck with it. Can somebody please help?

Thanks
Aj

Attachments
<articles>
<article doi="10.1103/PhysRev.1.16">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>1</volume>
<issue printdate="1913-01-00">1</issue>
<fpage>16</fpage>
<lpage>34</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.1.16</doi>
<title>The Velocity of Electrons in the Photo-electric Effect, as a Function of the Wave Lengths of the Light</title>
<authgrp>
<author><givenname>David</givenname><middlename>W.</middlename><surname>Cornelius.</surname></author>
<aff>Laboratory of Physics, University of Illinois</aff>
</authgrp>
<history>
<received date="1912-05-06"/>
</history>
<cpyrt>
<cpyrtdate date="1913" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.1.56">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>1</volume>
<issue printdate="1913-01-00">1</issue>
<fpage>56</fpage>
<lpage>59</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.1.56</doi>
<title>Angle in Vector Algebra; and Composition of Rotations</title>
<authgrp>
<author><givenname>Frederick</givenname><surname>Slate</surname></author>
<aff >University of California</aff>
</authgrp>
<cpyrt>
<cpyrtdate date="1913" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.1.60">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>1</volume>
<issue printdate="1913-01-00">1</issue>
<fpage>60</fpage>
<lpage>84</lpage>
<seqno>1</seqno>
<price></price><tocsec>American Physical Society</tocsec>
<arttype type="miscellaneous"></arttype><doi>10.1103/PhysRev.1.60</doi>
<title>Minutes of the Sixty-Fourth Meeting</title>
<cpyrt>
<cpyrtdate date="1913" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.1.124">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>1</volume>
<issue printdate="1913-02-00">2</issue>
<fpage>124</fpage>
<lpage>140</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.1.124</doi>
<title>An Absolute Determination of the Viscosity of Air</title>
<authgrp>
<author><givenname>Lachlan</givenname><surname>Gilchrist</surname></author>
<aff>Ryerson Physical Laboratory, University of Chicago</aff>
</authgrp>
<history>
<received date="1912-08-00"/>
</history>
<cpyrt>
<cpyrtdate date="1913" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.1.124">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>1</volume>
<issue printdate="1913-02-00">2</issue>
<fpage>124</fpage>
<lpage>140</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.1.124</doi>
<title>An Absolute Determination of the Viscosity of Air</title>
<authgrp>
<author><givenname>David</givenname><middlename>W.</middlename><surname>Cornelius.</surname></author>
<aff>Laboratory of Physics, University of Illinois</aff>
<aff>Ryerson Physical Laboratory, University of Chicago</aff>
</authgrp>
<history>
<received date="1912-08-00"/>
</history>
<cpyrt>
<cpyrtdate date="1913" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.10.171">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>10</volume>
<issue printdate="1917-08-00">2</issue>
<fpage>171</fpage>
<lpage>193</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.10.171</doi>
<title>Theory of Variable Dynamical-Electrical Systems</title>
<authgrp>
<author><givenname>H.</givenname><middlename>W.</middlename><surname>Nichols.</surname></author>
</authgrp>
<cpyrt>
<cpyrtdate date="1917" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
<article doi="10.1103/PhysRev.10.1">
<journal jcode="PR" short="Phys. Rev.">Physical Review</journal>
<volume>10</volume>
<issue printdate="1917-07-00">1</issue>
<fpage>1</fpage>
<lpage>6</lpage>
<seqno>1</seqno>
<price></price><tocsec>Articles</tocsec>
<arttype type="article"></arttype><doi>10.1103/PhysRev.10.1</doi>
<title>Instability of Electrified Liquid Surfaces</title>
<authgrp>
<author><givenname>John</givenname><surname>Zeleny</surname></author>
<aff >Sloane Laboratory, Yale University.</aff>
</authgrp>
<cpyrt>
<cpyrtdate date="1917" /><cpyrtholder>The American Physical Society</cpyrtholder>
</cpyrt>
</article>
</articles>

I found this incredibly difficult, as I've done very little XML. I tried using XPath but don't know enough about it. Finally, I tried XML::Simple.

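#!/usr/bin/perl
# parse_xml_simple.pl
use strict;
use warnings;

# use module
use XML::Simple;

# create object
my $xml = XML::Simple->new;

# read XML file
my $file = '/home/david/Programming/Perl/data/Aj.xml';
my $data = $xml->XMLin($file,
                        ForceArray => 1,
                        KeyAttr    => {},
                      );

my %count;
foreach my $article (@{$data->{article}}) {
    my $year = $article->{cpyrt}->[0]->{cpyrtdate}->[0]->{date};

    # Some articles (e.g. meeting minutes) have no <authgrp>; skip them.
    next unless defined $year && $article->{authgrp};

    # Look at every <author> in the group, not just the first one.
    foreach my $author (@{ $article->{authgrp}->[0]->{author} || [] }) {
        # Join the name parts in a fixed order. Looping over keys %$author
        # is not reliable, because key order can differ from hash to hash.
        my $autname = join '',
            map  { $author->{$_}->[0] }
            grep { exists $author->{$_} } qw(givenname middlename surname);

        $count{$year}->{$autname}++ if length $autname;
    }
}

# Sort the years so the output order is predictable.
foreach (sort keys %count) {
    my $ctr = keys %{$count{$_}};
    print "$_ $ctr\n";
}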

This gives the following output:

1913 3
1917 2
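
The counting step on its own, separated from the XML parsing, is just a hash of hashes. Here is a minimal sketch of that idea, using the (printdate, author) pairs from the manual analysis in the question as hard-coded data. The names @pairs, %seen and %distinct are made up for illustration and do not come from the scripts in this thread.

```perl
use strict;
use warnings;

# (printdate, author) pairs copied from the manual analysis in the question.
my @pairs = (
    [ '1913-01-00', 'DavidW.Cornelius.' ],
    [ '1913-01-00', 'FrederickSlate'    ],
    [ '1913-02-00', 'DavidW.Cornelius.' ],
    [ '1913-02-00', 'LachlanGilchrist'  ],
    [ '1917-08-00', 'H.W.Nichols.'      ],
    [ '1917-07-00', 'JohnZeleny'        ],
);

my %seen;    # $seen{year}{author} = number of papers by that author in that year
for my $pair (@pairs) {
    my ($date, $author) = @$pair;
    my $year = substr $date, 0, 4;    # assumes every printdate starts with YYYY
    $seen{$year}{$author}++;
}

# The number of distinct authors in a year is the number of keys in its inner hash.
my %distinct;
$distinct{$_} = keys %{ $seen{$_} } for keys %seen;

print "$_ $distinct{$_}\n" for sort keys %distinct;
```

An author who publishes twice in the same year only bumps an existing key, so the key count stays at one per author.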

ajaj_p5, I slightly changed your code in the XPath and data-processing lines.

use strict;
use warnings;

use XML::XPath;
my $file = 'Aj.xml';
my $xp = XML::XPath->new(filename => $file);

print "Year\tNo. of different authors publishing in a single year\n";
foreach my $year (1913, 1917) {
    # Start each year with an empty hash
    my %hash;

    # Build the XPath in a double-quoted string so that $year is interpolated;
    # the @ is escaped so Perl does not treat @printdate as an array.
    my $nodeset = $xp->findnodes("//issue[contains(\@printdate, '$year')]/..//author");
    my @nodelist = $nodeset->get_nodelist;

    # Use the author names as hash keys so that duplicates collapse
    $hash{ $_->string_value }++ for @nodelist;
    print $year, "\t", scalar(keys %hash), "\n";
}
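
As far as I can tell, the original attempt failed because the XPath was written in single quotes, and Perl does not interpolate variables inside single-quoted strings. The sketch below only builds the two strings and prints them, so it needs no XML module. The value given to $x is a stand-in for one entry from the list of years.

```perl
use strict;
use warnings;

my $x = '1917';    # stand-in for one year taken from the list

# Single quotes: no interpolation, so the XPath engine receives the
# literal characters $x and matches nothing.
my $literal = '//issue[@printdate="$x"]/..//author';

# qq{} behaves like double quotes: $x is replaced by its value.
# The @ must be escaped, otherwise Perl looks for an array named @printdate.
my $interpolated = qq{//issue[contains(\@printdate, "$x")]/..//author};

print "$literal\n";
print "$interpolated\n";
```

Using qq{} rather than plain double quotes means the double quotes inside the XPath expression do not need escaping.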
Comments
Well done.