Hi,
I'm new to perl, and need help about comparing 2 columns within 2 different .tsv files.
I have search through the forum for some similar case of mine, such as http://www.daniweb.com/software-development/perl/threads/335711 , http://www.daniweb.com/software-development/perl/threads/336421 and http://www.daniweb.com/software-development/perl/threads/311399 , I have tried to modified them according to my needs but somehow the final output is not right, so I decided to make a new thread.
I have 2 input tsv files,
The first one is archive.tsv. It is a paper database with the format of
\t\t\t
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck <a href="http://aclweb.org/anthology-new/P/P08/P08-1016.pdf">http://aclweb.org/anthology-new/P/P08/P08-1016.pdf</a>
P08-1021 Correcting Misuse of Verb Forms John Lee; Stephanie Seneff <a href="http://aclweb.org/anthology-new/P/P08/P08-1021.pdf">http://aclweb.org/anthology-new/P/P08/P08-1021.pdf</a>
P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman <a href="http://aclweb.org/anthology-new/P/P08/P08-1030.pdf">http://aclweb.org/anthology-new/P/P08/P08-1030.pdf</a>
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge <a href="http://aclweb.org/anthology-new/P/P08/P08-1038.pdf">http://aclweb.org/anthology-new/P/P08/P08-1038.pdf</a>
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran <a href="http://aclweb.org/anthology-new/P/P08/P08-1039.pdf">http://aclweb.org/anthology-new/P/P08/P08-1039.pdf</a>
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller <a href="http://aclweb.org/anthology-new/P/P08/P08-1040.pdf">http://aclweb.org/anthology-new/P/P08/P08-1040.pdf</a>
P08-1042 Ad Hoc Treebank Structures Markus Dickinson <a href="http://aclweb.org/anthology-new/P/P08/P08-1042.pdf">http://aclweb.org/anthology-new/P/P08/P08-1042.pdf</a>
P08-3003 Inferring Activity Time in News through Event Modeling Vladimir Eidelman <a href="http://aclweb.org/anthology-new/P/P08/P08-3003.pdf">http://aclweb.org/anthology-new/P/P08/P08-3003.pdf</a>
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu <a href="http://aclweb.org/anthology-new/P/P08/P08-5003.pdf">http://aclweb.org/anthology-new/P/P08/P08-5003.pdf</a>
P08-5004 Advanced Online Learning for Natural Language Processing Koby Crammer <a href="http://aclweb.org/anthology-new/P/P08/P08-5004.pdf">http://aclweb.org/anthology-new/P/P08/P08-5004.pdf</a>
and the second file is program.tsv. It is a conference program database with the format of\t\t
Information Extraction 2 Refining Event Extraction through Cross-Document Inference Ji, Heng; Ralph Grishman
Syntax & Parsing 1 A Logical Basis for the D Combinator and Normal Form in CCG Hoyt, Frederick; Jason Baldridge
Speech Processing Lexicalized Phonotactic Word Segmentation Fleck, Margaret M.
Syntax & Parsing 1 Parsing Noun Phrase Structure with CCG Vadas, David; James R. Curran
Syntax & Parsing 1 Sentence Simplification for Semantic Role Labeling Vickrey, David; Daphne Koller
Student Research Workshop A Supervised Learning Approach to Automatic Synonym Identification Based on Distributional Features Hagiwara, Masato
Student Research Workshop An Integrated Architecture for Generating Parenthetical Constructions Banik, Eva
These datas that I posted are only some of the data that I think will represent my questions.My objectives:
1. I want to match the Paper Title each line of the archive.tsv file with the Paper Title in the program.tsv files.
If the title in the archive file exists in the program file, I should print exactly each line in the archive file, then added the respective session from the program file.
If the title in the archive file does not exist in the program file, I should print exactly each line in the archive file, then added *NA* in the session column (added in the last column).
2. If there is a Title that is available in the program.tsv files but not available in the archive.tsv files, it should be ignored.
Here's my code:
#!/usr/bin/perl
open(FILE1,"<archive.tsv");
open(FILE2,"<program.tsv");
open (OUT,">combine.tsv");
my @array1=<FILE1>;
my @array2=<FILE2>;
close FILE1;
close FILE2;
for (@array2){
chomp;
my($session,$titleprog,$authorprog)=split(/\t/);
push(@sesname,"$session");
push(@title2, "$titleprog");
}
for (@array1){
chomp;
my($paperid,$titlearcv,$authorarcv,$url)=split(/\t/);
push(@all, "$_");
push(@title1, "$titlearcv");
}
$count = 0;
foreach my $progline (@title2){
foreach my $arcline (@all){
if ($arcline =~ m/$progline/i){
print OUT "$arcline\t$sesname[$i]\n";
$count++;
}
else {$count++;}print OUT "$arcline\t*NA*\n" if ($count == $#all);
}
$i++;
}
close OUT; and I got the following output: P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman <a href="http://aclweb.org/anthology-new/P/P08/P08-1030.pdf">http://aclweb.org/anthology-new/P/P08/P08-1030.pdf</a> Information Extraction 2
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu <a href="http://aclweb.org/anthology-new/P/P08/P08-5003.pdf">http://aclweb.org/anthology-new/P/P08/P08-5003.pdf</a> *NA*
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge <a href="http://aclweb.org/anthology-new/P/P08/P08-1038.pdf">http://aclweb.org/anthology-new/P/P08/P08-1038.pdf</a> Syntax & Parsing 1
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck <a href="http://aclweb.org/anthology-new/P/P08/P08-1016.pdf">http://aclweb.org/anthology-new/P/P08/P08-1016.pdf</a> Speech Processing
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran <a href="http://aclweb.org/anthology-new/P/P08/P08-1039.pdf">http://aclweb.org/anthology-new/P/P08/P08-1039.pdf</a> Syntax & Parsing 1
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller <a href="http://aclweb.org/anthology-new/P/P08/P08-1040.pdf">http://aclweb.org/anthology-new/P/P08/P08-1040.pdf</a> Syntax & Parsing 1
My desired output should be like this:
format:\t\t\t\t
P08-1016 Lexicalized Phonotactic Word Segmentation Margaret M. Fleck <a href="http://aclweb.org/anthology-new/P/P08/P08-1016.pdf">http://aclweb.org/anthology-new/P/P08/P08-1016.pdf</a> Speech Processing
P08-1021 Correcting Misuse of Verb Forms John Lee; Stephanie Seneff <a href="http://aclweb.org/anthology-new/P/P08/P08-1021.pdf">http://aclweb.org/anthology-new/P/P08/P08-1021.pdf</a> *NA*
P08-1030 Refining Event Extraction through Cross-Document Inference Heng Ji; Ralph Grishman <a href="http://aclweb.org/anthology-new/P/P08/P08-1030.pdf">http://aclweb.org/anthology-new/P/P08/P08-1030.pdf</a> Information Extraction 2
P08-1038 A Logical Basis for the D Combinator and Normal Form in CCG Frederick Hoyt; Jason Baldridge <a href="http://aclweb.org/anthology-new/P/P08/P08-1038.pdf">http://aclweb.org/anthology-new/P/P08/P08-1038.pdf</a> Syntax & Parsing 1
P08-1039 Parsing Noun Phrase Structure with CCG David Vadas; James R. Curran <a href="http://aclweb.org/anthology-new/P/P08/P08-1039.pdf">http://aclweb.org/anthology-new/P/P08/P08-1039.pdf</a> Syntax & Parsing 1
P08-1040 Sentence Simplification for Semantic Role Labeling David Vickrey; Daphne Koller <a href="http://aclweb.org/anthology-new/P/P08/P08-1040.pdf">http://aclweb.org/anthology-new/P/P08/P08-1040.pdf</a> Syntax & Parsing 1
P08-1042 Ad Hoc Treebank Structures Markus Dickinson <a href="http://aclweb.org/anthology-new/P/P08/P08-1042.pdf">http://aclweb.org/anthology-new/P/P08/P08-1042.pdf</a> *NA*
P08-3003 Inferring Activity Time in News through Event Modeling Vladimir Eidelman <a href="http://aclweb.org/anthology-new/P/P08/P08-3003.pdf">http://aclweb.org/anthology-new/P/P08/P08-3003.pdf</a> *NA*
P08-5003 Semi-Supervised Learning for Natural Language Processing John Blitzer; Xiaojin Jerry Zhu <a href="http://aclweb.org/anthology-new/P/P08/P08-5003.pdf">http://aclweb.org/anthology-new/P/P08/P08-5003.pdf</a> *NA*
P08-5004 Advanced Online Learning for Natural Language Processing Koby Crammer <a href="http://aclweb.org/anthology-new/P/P08/P08-5004.pdf">http://aclweb.org/anthology-new/P/P08/P08-5004.pdf</a> *NA*
Can anyone please help me, what is wrong with my code?
I'll have another look when I get some time but the first thing I'd suggest:
use strict;
use warnings;
belong in every non-trivial Perl script. Add them to your script and declare any undeclared variables, such as @sesname, @title1, @all, @title2 and $i.
Another general suggestion: when opening files, use lexical filehandles instead of bare words and test for success or failure to open by including an or die "blah blah: $!" clause.
open my $fh1, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh2, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";Because you only need to read each record from the archive file once, you don't need to save it into an array. Just iterate through it.
Because each program title should occur no more than once in the program file, you can read it into a hash for more efficient lookup than having to iterate through an array each time you want to search for a program title. The following works for me:
#!/usr/bin/perl
#match_archive.pl
use strict;
use warnings;
my %programs;#Hash to save program title => program session for each program rec
open my $fh1, '<', 'program.tsv' or die "Failed to open program.tsv: $!";
foreach (<$fh1>){
chomp;#Remove newline character from end of line
my ($ps, $pt) = split(/\t/);
$programs{$pt} = $ps;
}
close $fh1;
open my $fh2, '<', 'archive.tsv' or die "Failed to open archive.tsv: $!";
open my $fh3, '>', 'combine.tsv' or die "Failed to open combine.tsv: $!";#Output
foreach (<$fh2>){
chomp;#Remove newline character from end of line
my $pt = (split(/\t/))[1];#Paper title
if (exists $programs{$pt}){
print $fh3 "$_\t$programs{$pt}\n";
}
else{
print $fh3 "$_\t*NA*\n";
}
}Hi, d5e5!
thanks, it works like a charm.
I have one question regarding the code,
During the first 'foreach' loop, you split each column into $pt and $ps , and then assign the values of $programs{$pt} into $ps .
foreach (<$fh1>){
chomp;#Remove newline character from end of line
my ($ps, $pt) = split(/\t/);
$programs{$pt} = $ps;
}
$programs{$pt} refers to the content of program session, am I correct?
and, during the 'if' loop in the second 'foreach' loop,
foreach (<$fh2>){
chomp;#Remove newline character from end of line
my $pt = (split(/\t/))[1];#Paper title
if (exists $programs{$pt}){
print $fh3 "$_\t$programs{$pt}\n";
}
else{
print $fh3 "$_\t*NA*\n";
}
}
if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?
I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";
During the first 'foreach' loop, you split each column into $pt and $ps...
I split eachrow and assign the first two elements of the resulting list to the variables $pt (program title) and $ps (program session).
...and then assign the values of $programs{$pt} into $psNo, I assign the content of $ps to the $programs{$pt} entry in the %programs hash. A hash contains one or more key-value pairs, where the key must be unique for the hash. This allows us to look up any value in the hash by specifying its key. The key of this particular hash entry is $pt and the value is $ps.if (exists $programs{$pt}){ isn't that mean you're trying to check the program session whether it exists in the archive file?No, I'm checking to see if there is an entry in the %programs hash having that program title as key. Remember that the %program hash contains all data read from program.tsv so in effect I'm checking to see if that program title (from the archive file) exists in the program file.I can't understand the difference between $programs{$pt} in if (exists $programs{$pt}){ and in print $fh3 "$_\t$programs{$pt}\n";There is no difference. I check if $programs{$pt} exists before trying to print $programs{$pt} because if I try to print a hash value for a key that doesn't exist in the hash perl will give me an error message. For an introduction to using hashes in Perl see http://www.perltutorial.org/perl-hash.aspx
Thanks so much for the detailed explanations and the tutorial! I really appreciate it.