read a file

Please support our Perl advertiser: Programming Forums - DaniWeb Sister Site
Reply

Join Date: Jun 2006
Posts: 10
Reputation: b.janahi is an unknown quantity at this point 
Solved Threads: 0
b.janahi b.janahi is offline Offline
Newbie Poster

read a file

 
0
  #1
Jun 30th, 2006
hello,

i'm trying to run a progrm written in perl

this is the code i wrote
perl aramorph.pl <infile> outfile

where aramorph is a notpad file with my code saved in c:\perlscripts



infile is the file that should be read from
outfile ................................wirte the output to.
both saved in c:\perlscripts

but the msg
"the system cannot find the file specified" is appere :rolleyes:



shuold i chang the in/outfile locations , or what ?

so how can i run my script:mrgreen: , any body knows the solution?


B.janahi
Reply With Quote Quick reply to this message  
Join Date: Dec 2004
Posts: 2,413
Reputation: Comatose is a jewel in the rough Comatose is a jewel in the rough Comatose is a jewel in the rough Comatose is a jewel in the rough 
Solved Threads: 211
Team Colleague
Comatose's Avatar
Comatose Comatose is offline Offline
Taboo Programmer

Re: read a file

 
0
  #2
Jun 30th, 2006
Post the Perl code...
Reply With Quote Quick reply to this message  
Join Date: Jun 2006
Posts: 10
Reputation: b.janahi is an unknown quantity at this point 
Solved Threads: 0
b.janahi b.janahi is offline Offline
Newbie Poster

Re: read a file

 
0
  #3
Jun 30th, 2006
it's to long brother Comatose and toooo compilcated
i will try to get the site that my teacher got the code from
as i told the code is error free but i can't run it


it's 390 lines ithink its hard to post it here
so what should i do?
Reply With Quote Quick reply to this message  
Join Date: Dec 2004
Posts: 2,413
Reputation: Comatose is a jewel in the rough Comatose is a jewel in the rough Comatose is a jewel in the rough Comatose is a jewel in the rough 
Solved Threads: 211
Team Colleague
Comatose's Avatar
Comatose Comatose is offline Offline
Taboo Programmer

Re: read a file

 
0
  #4
Jun 30th, 2006
you can post it here, but make sure you use code tags. You do this by enclosing the source in [ CODE ] and [ /CODE ] but don't use spaces.... all one word.
Reply With Quote Quick reply to this message  
Join Date: Jun 2006
Posts: 10
Reputation: b.janahi is an unknown quantity at this point 
Solved Threads: 0
b.janahi b.janahi is offline Offline
Newbie Poster

Re: read a file

 
0
  #5
Jul 1st, 2006
this is the code , briefly it the program reads a file written in arabic then it will (analyze) all words and show the output in another file with all possible meanings for all words

so plz show mehow to run it and get the output






  1.  
  2. ################################################################################
  3. # AraMorph.pl
  4. # Portions (c) 2002 QAMUS LLC (<a rel="nofollow" class="t" href="http://www.qamus.org/" target="_blank">www.qamus.org</a>),
  5. # (c) 2002 Trustees of the University of Pennsylvania
  6. #
  7. # This program is free software; you can redistribute it and/or
  8. # modify it under the terms of the GNU General Public License
  9. # as published by the Free Software Foundation version 2.
  10. #
  11. # This program is distributed in the hope that it will be useful,
  12. # but WITHOUT ANY WARRANTY; without even the implied warranty of
  13. # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  14. # GNU General Public License for more details (../gpl.txt).
  15. #
  16. # You should have received a copy of the GNU General Public License
  17. # along with this program; if not, write to the Free Software
  18. # Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  19. #
  20. # You can contact LDC by sending electronic mail to: <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
  21. # or by writing to:
  22. # Linguistic Data Consortium
  23. # 3600 Market Street
  24. # Suite 810
  25. # Philadelphia, PA, 19104-2653, USA.
  26. ################################################################################
  27. # usage:
  28. # perl -w AraMorph.pl < infile > outfile
  29. # were "infile" is the input text in Arabic Windows encoding (cp1256)
  30. # and "outfile" is the output text with morphology analyses and POS tags
  31. # Not found items and related stats are written to STDERR and filename "notFound"
  32. # load 3 compatibility tables (load these first so when we load the lexicons we can check for undeclared $cat values)
  33. %hash_AB = load_table("tableAB"); # load compatibility table for prefixes-stems (AB)
  34. %hash_AC = load_table("tableAC"); # load compatibility table for prefixes-suffixes (AC)
  35. %hash_BC = load_table("tableBC"); # load compatibility table for stems-suffixes (BC)
  36. # load 3 lexicons
  37. %prefix_hash = load_dict("dictPrefixes"); # dict of prefixes (A)
  38. %stem_hash = load_dict("dictStems"); # dict of stems (B)
  39. %suffix_hash = load_dict("dictSuffixes"); # dict of suffixes (C)
  40. while (<STDIN>) {
  41. #$file .= $_; # we might want to get stats on the orthography and fix it (eg. if there is no "Y" convert all "y" to "Y")
  42. print STDERR "reading input line $.\r";
  43. @tokens = tokenize($_); # returns a list of tokens (one line at a time)
  44. foreach $token (@tokens) {
  45. if ($token =~ m/[\x81\x8D\x8E\x90\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC-\xED\xF0-\xF3\xF5\xF6\xF8\xFA]/) {
  46. # it's an Arabic word because it has 1 or more Ar. chars
  47. print "\nINPUT STRING: $token\n";
  48. $lookup_word = get_lookup($token); # returns the Arabic string without vowels/diacritics and converted to transliteration
  49. print "LOOK-UP WORD: $lookup_word\n"; $tokens++; $types{$lookup_word}++;
  50. if ( exists($found{$lookup_word}) ) {
  51. print $found{$lookup_word}; # no need to re-analyse it
  52. }
  53. elsif ( exists($notfound{$lookup_word}) ) { # we keep %found and %notfound separate because %notfound can have additional lookups
  54. print $notfound{$lookup_word}; $freqnotfound{$lookup_word}++;
  55. }
  56. else {
  57. if ( @solutions = analyze($lookup_word) ) { # if lookup word has 1 or more solutions
  58. foreach $solution (@solutions) {
  59. $found{$lookup_word} .= $solution;
  60. }
  61. print $found{$lookup_word};
  62. }
  63. else {
  64. $notfound{$lookup_word} = " Comment: $lookup_word NOT FOUND\n";
  65. if ( @alternatives = get_alternatives($lookup_word) ) {
  66. foreach $alt (@alternatives) {
  67. $notfound{$lookup_word} .= " ALTERNATIVE: $alt\n";
  68. if ( exists($found{$alt}) ) {
  69. $notfound{$lookup_word} .= $found{$alt};
  70. }
  71. else {
  72. if ( @solutions = analyze($alt) ) {
  73. foreach $solution (@solutions) {
  74. $notfound{$lookup_word} .= $solution;
  75. }
  76. }
  77. else {
  78. $notfound{$lookup_word} .= " Comment: $alt NOT FOUND\n";
  79. }
  80. }
  81. }# end foreach
  82. }# end if
  83. print $notfound{$lookup_word}; $freqnotfound{$lookup_word}++;
  84. }
  85. }#end else
  86. }
  87. else {
  88. # it's not an Arabic word
  89. @nonArabictokens = tokenize_nonArabic($token); # tokenize it on white space
  90. foreach $item (@nonArabictokens) {
  91. print "\nINPUT STRING: $item\n Comment: Non-Alphabetic Data\n" unless ($item eq " " or $item eq "");
  92. }
  93. }
  94. }#end foreach
  95. }#end while
  96. # ====================================================
  97. # print out not-found words by frequency:
  98. open (OUT, ">notFound") || die "cannot open: $!";
  99. print STDERR "\n\n========= Some stats ============================\n";
  100. @types = keys %types; $types = @types;
  101. print STDERR "Tokens: $tokens -- Types: $types\n";
  102. print STDERR "\n========= Frequency count of Not-Found =========\n";
  103. my @items = keys %freqnotfound;
  104. foreach my $item (sort { $freqnotfound{$b} <=> $freqnotfound{$a} } @items) {
  105. $rank++;
  106. print STDERR "$item\t$freqnotfound{$item}\n" unless ( $rank > 25 );
  107. print OUT "$item\t$freqnotfound{$item}\n"; # unless ( $rank > 25 );
  108. }
  109. close OUT;
  110. # ============================
  111. sub analyze { # returns a list of 1 or more solutions
  112. $this_word = shift @_; @solutions = (); $cnt = 0;
  113. segmentword($this_word); # get a list of valid segmentations
  114. foreach $segmentation (@segmented) {
  115. ($prefix,$stem,$suffix) = split ("\t",$segmentation); #print $segmentation, "\n";
  116. if (exists($prefix_hash{$prefix})) {
  117. if (exists($stem_hash{$stem})) {
  118. if (exists($suffix_hash{$suffix})) {
  119. # all 3 components exist in their respective lexicons, but are they compatible? (check the $cat pairs)
  120. foreach $prefix_value (@{$prefix_hash{$prefix}}) {
  121. ($prefix, $voc_a, $cat_a, $gloss_a, $pos_a) = split (/\t/, $prefix_value);
  122. foreach $stem_value (@{$stem_hash{$stem}}) {
  123. #($stem, $voc_b, $cat_b, $gloss_b, $pos_b) = split (/\t/, $stem_value);
  124. ($stem, $voc_b, $cat_b, $gloss_b, $pos_b, $lemmaID) = split (/\t/, $stem_value);
  125. if ( exists($hash_AB{"$cat_a"." "."$cat_b"}) ) {
  126. foreach $suffix_value (@{$suffix_hash{$suffix}}) {
  127. ($suffix, $voc_c, $cat_c, $gloss_c, $pos_c) = split (/\t/, $suffix_value);
  128. if ( exists($hash_AC{"$cat_a"." "."$cat_c"}) ) {
  129. if ( exists($hash_BC{"$cat_b"." "."$cat_c"}) ) {
  130. #$cnt++; push (@solutions, " SOLUTION $cnt: ($voc_a$voc_b$voc_c) $pos_a$pos_b$pos_c\n (GLOSS): $gloss_a + $gloss_b + $gloss_c\n");
  131. $cnt++; push (@solutions, " SOLUTION $cnt: ($voc_a$voc_b$voc_c) [$lemmaID] $pos_a$pos_b$pos_c\n (GLOSS): $gloss_a + $gloss_b + $gloss_c\n");
  132. }
  133. }
  134. }
  135. }
  136. }
  137. }# end foreach $prefix_value
  138. }
  139. }# end if (exists($stem_hash{$stem}))
  140. }
  141. }# end foreach $segmentation
  142. return (@solutions);
  143. }
  144. # ============================
  145. sub get_alternatives { # returns a list of alternative spellings
  146. $word = shift @_; @alternatives = ();
  147. $temp = $word;
  148. if ($temp =~ m/Y'$/) { # Y_w'_Y'
  149. $temp =~ s/Y/y/g; # y_w'_y'
  150. push (@alternatives, $temp); # y_w'_y' -- pushed
  151. if ($temp =~ s/w'/&/) { # y_&__y'
  152. push (@alternatives, $temp); # y_&__y' -- pushed
  153. }
  154. $temp = $word; # Y_w'_Y'
  155. $temp =~ s/Y/y/g; # y_w'_y'
  156. $temp =~ s/y'$/}/; # y_w'_}
  157. push (@alternatives, $temp); # y_w'_} -- pushed
  158. if ($temp =~ s/w'/&/) { # y_&__}
  159. push (@alternatives, $temp); # y_&__} -- pushed
  160. }
  161. }
  162. elsif ($temp =~ m/y'$/) { # Y_w'_y'
  163. if ($temp =~ s/Y/y/g) { # Y_w'_y'
  164. push (@alternatives, $temp); # y_w'_y' -- pushed
  165. }
  166. if ($temp =~ s/w'/&/) { # y_w'_y'
  167. push (@alternatives, $temp); # y_&__y' -- pushed
  168. }
  169. $temp = $word; # Y_w'_y'
  170. $temp =~ s/Y/y/g; # y_w'_y'
  171. $temp =~ s/y'$/}/; # y_w'_}
  172. push (@alternatives, $temp); # y_w'_} -- pushed
  173. if ($temp =~ s/w'/&/) { # y_&__}
  174. push (@alternatives, $temp); # y_&__} -- pushed
  175. }
  176. }
  177. elsif ($temp =~ s/Y$/y/) { # Y_w'_y
  178. $temp =~ s/Y/y/g; # y_w'_y
  179. push (@alternatives, $temp); # y_w'_y -- pushed
  180. if ($temp =~ s/w'/&/) { # y_&__y
  181. push (@alternatives, $temp); # y_&__y -- pushed
  182. }
  183. }
  184. elsif ($temp =~ m/y$/) { # Y_w'_y
  185. $temp =~ s/Y/y/g; # y_w'_y
  186. if ($temp =~ s/w'/&/) { # y_&__y
  187. push (@alternatives, $temp); # y_&__y -- pushed
  188. }
  189. $temp = $word; # Y_w'_y
  190. $temp =~ s/Y/y/g; # y_w'_y
  191. $temp =~ s/y$/Y/g; # y_w'_Y
  192. push (@alternatives, $temp); # y_w'_Y -- pushed
  193. if ($temp =~ s/w'/&/) { # y_&__Y
  194. push (@alternatives, $temp); # y_&__Y -- pushed
  195. }
  196. }
  197. elsif ($temp =~ m/h$/) { # Y_w'_h
  198. if ($temp =~ s/Y/y/g) { # y_w'_h
  199. push (@alternatives, $temp); # y_w'_h -- pushed
  200. }
  201. if ($temp =~ s/w'/&/) { # y_&__h
  202. push (@alternatives, $temp); # y_&__h -- pushed
  203. }
  204. $temp =~ s/h$/p/; # y_w'_p
  205. push (@alternatives, $temp); # y_&__p -- pushed
  206. }
  207. elsif ($temp =~ m/p$/) { # Y_w'_h
  208. if ($temp =~ s/Y/y/g) { # y_w'_h
  209. push (@alternatives, $temp); # y_w'_h -- pushed
  210. }
  211. if ($temp =~ s/w'/&/) { # y_&__h
  212. push (@alternatives, $temp); # y_&__h -- pushed
  213. }
  214. $temp =~ s/p$/h/; # y_w'_p
  215. push (@alternatives, $temp); # y_&__p -- pushed
  216. }
  217. elsif ($temp =~ s/Y/y/g) { # Y_w'__
  218. push (@alternatives, $temp); # y_w'__ -- pushed
  219. if ($temp =~ s/w'/&/) { # y_&___
  220. push (@alternatives, $temp); # y_&___ -- pushed
  221. }
  222. }
  223. elsif ($temp =~ s/w'/&/) { # y_w'__
  224. push (@alternatives, $temp); # y_&___ -- pushed
  225. }
  226. else {
  227. # nothing
  228. }
  229. return @alternatives;
  230. }
  231. # ============================
  232. sub tokenize_nonArabic { # tokenize non-Arabic strings by splitting them on white space
  233. $nonArabic = shift @_;
  234. $nonArabic =~ s/^\s+//; $nonArabic =~ s/\s+$//; # remove leading & trailing space
  235. @nonArabictokens = split (/\s+/, $nonArabic);
  236. return @nonArabictokens;
  237. }
  238. # ============================
  239. sub tokenize { # returns a list of tokens
  240. $line = shift @_;
  241. chomp($line);
  242. $line =~ s/^\s+//; $line =~ s/\s+$//; $line =~ s/\s+/ /g; # remove or minimize white space
  243. @tokens = split (/([^\x81\x8D\x8E\x90\xC1-\xD6\xD8-\xDF\xE1\xE3-\xE6\xEC-\xED\xF0-\xF3\xF5\xF6\xF8\xFA]+)/,$line);
  244. return @tokens;
  245. }
  246. # ================================
  247. sub get_lookup { # creates a suitable lookup version of the Arabic input string (removes diacritics; transliterates)
  248. $input_str = shift @_;
  249. $tmp_word = $input_str; # we need to modify the input string for lookup
  250. $tmp_word =~ s/\xDC//g; # remove kashida/taTwiyl (U+0640)
  251. $tmp_word =~ s/[\xF0-\xF3\xF5\xF6\xF8\xFA]//g; # remove fatHatAn and all vowels/diacritics (ًٌٍَُِّْ)
  252. $tmp_word =~ tr/\x81\x8D\x8E\x90\xA1\xBA\xBF\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE1\xE3\xE4\xE5\xE6\xEC\xED\xF0\xF1\xF2\xF3\xF5\xF6\xF8\xFA/PJRG,;?'|>&<}AbptvjHxd*rzs\$SDTZEg_fqklmnhwYyFNKaui~o/; # convert to transliteration
  253. return $tmp_word;
  254. }
  255. # ============================
  256. sub segmentword { # returns a list of valid segmentations
  257. $str = shift @_;
  258. @segmented = ();
  259. $prefix_len = 0;
  260. $suffix_len = 0;
  261. $str_len = length($str);
  262. while ( $prefix_len <= 4 ) {
  263. $prefix = substr($str, 0, $prefix_len);
  264. $stem_len = ($str_len - $prefix_len);
  265. $suffix_len = 0;
  266. while (($stem_len >= 1) and ($suffix_len <= 6)) {
  267. $stem = substr($str, $prefix_len, $stem_len);
  268. $suffix = substr($str, ($prefix_len + $stem_len), $suffix_len);
  269. push (@segmented, "$prefix\t$stem\t$suffix");
  270. $stem_len--;
  271. $suffix_len++;
  272. }
  273. $prefix_len++;
  274. }
  275. return @segmented;
  276. }
  277. # ==============================================================
  278. sub load_dict { # loads a dict into a hash table where the key is $entry and its value is a list (each $entry can have multiple values)
  279. %temp_hash = (); $entries = 0; $lemmaID = "";
  280. $filename = shift @_;
  281. open (IN, $filename) || die "cannot open: $!";
  282. print STDERR "loading $filename ...";
  283. while (<IN>) {
  284. if (m/^;; /) {
  285. $lemmaID = $';
  286. chomp($lemmaID);
  287. if ( exists($seen{$lemmaID}) ) {
  288. die "lemmaID $lemmaID in $filename (line $.) isn't unique\n" ; # lemmaID's must be unique
  289. }
  290. else {
  291. $seen{$lemmaID} = 1; $lemmas++;
  292. }
  293. }
  294. elsif (m/^;/) { } # comment
  295. else {
  296. chomp(); $entries++;
  297. # a little error-checking won't hurt:
  298. $trcnt = tr/\t/\t/; if ($trcnt != 3) { die "entry in $filename (line $.) doesn't have 4 fields (3 tabs)\n" };
  299. ($entry, $voc, $cat, $glossPOS) = split (/\t/, $_); # get the $entry for use as key
  300. # two ways to get the POS info:
  301. # (1) explicitly, by extracting it from the gloss field:
  302. if ($glossPOS =~ m!<pos>(.+?)</pos>!) {
  303. $POS = $1; # extract $POS from $glossPOS
  304. $gloss = $glossPOS; # we clean up the $gloss later (see below)
  305. }
  306. # (2) by deduction: use the $cat (and sometimes the $voc and $gloss) to deduce the appropriate POS
  307. else {
  308. $gloss = $glossPOS; # we need the $gloss to guess proper names
  309. if ($cat =~ m/^(Pref-0|Suff-0)$/) {$POS = ""} # null prefix or suffix
  310. elsif ($cat =~ m/^F/) {$POS = "$voc/FUNC_WORD"}
  311. elsif ($cat =~ m/^IV/) {$POS = "$voc/VERB_IMPERFECT"}
  312. elsif ($cat =~ m/^PV/) {$POS = "$voc/VERB_PERFECT"}
  313. elsif ($cat =~ m/^CV/) {$POS = "$voc/VERB_IMPERATIVE"}
  314. elsif (($cat =~ m/^N/)
  315. and ($gloss =~ m/^[A-Z]/)) {$POS = "$voc/NOUN_PROP"} # educated guess (99% correct)
  316. elsif (($cat =~ m/^N/)
  317. and ($voc =~ m/iy~$/)) {$POS = "$voc/NOUN"} # (was NOUN_ADJ: some of these are really ADJ's and need to be tagged manually)
  318. elsif ($cat =~ m/^N/) {$POS = "$voc/NOUN"}
  319. else { die "no POS can be deduced in $filename (line $.) "; };
  320. }
  321. # clean up the gloss: remove POS info and extra space, and convert upper-ASCII to lower (it doesn't convert well to UTF-8)
  322. $gloss =~ s!<pos>.+?</pos>!!; $gloss =~ s/\s+$//; $gloss =~ s!;!/!g;
  323. $gloss =~ tr/ہءآأؤإابةتثجحخدرزسشصضظعغـ/AAAAAACEEEEIIIINOOOOOUUUU/;
  324. $gloss =~ tr/àلâمنهçèéêëىيîïٌٍَôُِùْûü/aaaaaaceeeeiiiinooooouuuu/;
  325. $gloss =~ s/ئ/AE/g; $gloss =~ s/ٹ/Sh/g; $gloss =~ s/ژ/Zh/g; $gloss =~ s/ك/ss/g;
  326. $gloss =~ s/و/ae/g; $gloss =~ s/ڑ/sh/g; $gloss =~ s/*/zh/g;
  327. # note that although we read 4 fields from the dict we now save 5 fields in the hash table
  328. # because the info in last field, $glossPOS, was split into two: $gloss and $POS
  329. #push ( @{ $temp_hash{$entry} }, "$entry\t$voc\t$cat\t$gloss\t$POS") ; # the value of $temp_hash{$entry} is a list of values
  330. push ( @{ $temp_hash{$entry} }, "$entry\t$voc\t$cat\t$gloss\t$POS\t$lemmaID") ; # the value of $temp_hash{$entry} is a list of values
  331. }
  332. }
  333. close IN;
  334. print STDERR " $lemmas lemmas and" unless ($lemmaID eq "");
  335. print STDERR " $entries entries \n";
  336. return %temp_hash;
  337. }
  338. # ==============================================================
  339. sub load_table { # loads a compatibility table into a hash table where the key is $_ and its value is 1
  340. %temp_hash = ();
  341. $filename = shift @_;
  342. open (IN, $filename) || die "cannot open: $!";
  343. while (<IN>) {
  344. unless ( m/^;/ ) {
  345. chomp();
  346. s/^\s+//; s/\s+$//; s/\s+/ /g; # remove or minimize white space
  347. $temp_hash{$_} = 1;
  348. }
  349. }
  350. close IN;
  351. return %temp_hash;
  352. }
  353. # ==============================================================
  354. ] [ /################################################################################
  355. # AraMorph.pl
  356. # Portions (c) 2002 QAMUS LLC (<a rel="nofollow" class="t" href="http://www.qamus.org/" target="_blank">www.qamus.org</a>),
  357. # (c) 2002 Trustees of the University of Pennsylvania
  358. #
  359. # This program is free software; you can redistribute it and/or
  360. # modify it under the terms of the GNU General Public License
  361. # as published by the Free Software Foundation version 2.
  362. #
  363. # This program is distributed in the hope that it will be useful,
  364. # but WITHOUT ANY WARRANTY; without even the implied warranty of
  365. # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
  366. # GNU General Public License for more details (../gpl.txt).
  367. #
  368. # You should have received a copy of the GNU General Public License
  369. # along with this program; if not, write to the Free Software
  370. # Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
  371. #
  372. # You can contact LDC by sending electronic mail to: <a href="mailto:ldc@ldc.upenn.edu">ldc@ldc.upenn.edu</a>
  373. # or by writing to:
  374. # Linguistic Data Consortium
  375. # 3600 Market Street
  376. # Suite 810
  377. # Philadelphia, PA, 19104-2653, USA.
  378. ################################################################################
  379. # usage:
  380. # perl -w AraMorph.pl < infile > outfile
  381. # were "infile" is the input text in Arabic Windows encoding (cp1256)
  382. # and "outfile" is the output text with morphology analyses and POS tags
  383. # Not found items and related stats are written to STDERR and filename "notFound"
  384. # load 3 compatibility tables (load these first so when we load the lexicons we can check for undeclared $cat values)
  385. %hash_AB = load_table("tableAB"); # load compatibility table for prefixes-stems (AB)
  386. %hash_AC = load_table("tableAC"); # load compatibility table for prefixes-suffixes (AC)
  387. %hash_BC = load_table("tableBC"); # load compatibility table for stems-suffixes (BC)
  388. # load 3 lexicons
  389. %prefix_hash = load_dict("dictPrefixes"); # dict of prefixes (A)
  390. %stem_hash = load_dict("dictStems"); # dict of stems (B)
  391. %suffix_hash = load_dict("dictSuffixes"); # dict of suffixes (C)
  392. while (<STDIN>) {
  393. #$file .= $_; # we might want to get stats on the orthography and fix it (eg. if there is no "Y" convert all "y" to "Y")
  394. print STDERR "reading input line $.\r";
  395. @tokens = tokenize($_); # returns a list of tokens (one line at a time)
  396. foreach $token (@tokens) {
  397. if ($token =~ m/[\x81\x8D\x8E\x90\xC1-\xD6\xD8-\xDB\xDD-\xDF\xE1\xE3-\xE6\xEC-\xED\xF0-\xF3\xF5\xF6\xF8\xFA]/) {
  398. # it's an Arabic word because it has 1 or more Ar. chars
  399. print "\nINPUT STRING: $token\n";
  400. $lookup_word = get_lookup($token); # returns the Arabic string without vowels/diacritics and converted to transliteration
  401. print "LOOK-UP WORD: $lookup_word\n"; $tokens++; $types{$lookup_word}++;
  402. if ( exists($found{$lookup_word}) ) {
  403. print $found{$lookup_word}; # no need to re-analyse it
  404. }
  405. elsif ( exists($notfound{$lookup_word}) ) { # we keep %found and %notfound separate because %notfound can have additional lookups
  406. print $notfound{$lookup_word}; $freqnotfound{$lookup_word}++;
  407. }
  408. else {
  409. if ( @solutions = analyze($lookup_word) ) { # if lookup word has 1 or more solutions
  410. foreach $solution (@solutions) {
  411. $found{$lookup_word} .= $solution;
  412. }
  413. print $found{$lookup_word};
  414. }
  415. else {
  416. $notfound{$lookup_word} = " Comment: $lookup_word NOT FOUND\n";
  417. if ( @alternatives = get_alternatives($lookup_word) ) {
  418. foreach $alt (@alternatives) {
  419. $notfound{$lookup_word} .= " ALTERNATIVE: $alt\n";
  420. if ( exists($found{$alt}) ) {
  421. $notfound{$lookup_word} .= $found{$alt};
  422. }
  423. else {
  424. if ( @solutions = analyze($alt) ) {
  425. foreach $solution (@solutions) {
  426. $notfound{$lookup_word} .= $solution;
  427. }
  428. }
  429. else {
  430. $notfound{$lookup_word} .= " Comment: $alt NOT FOUND\n";
  431. }
  432. }
  433. }# end foreach
  434. }# end if
  435. print $notfound{$lookup_word}; $freqnotfound{$lookup_word}++;
  436. }
  437. }#end else
  438. }
  439. else {
  440. # it's not an Arabic word
  441. @nonArabictokens = tokenize_nonArabic($token); # tokenize it on white space
  442. foreach $item (@nonArabictokens) {
  443. print "\nINPUT STRING: $item\n Comment: Non-Alphabetic Data\n" unless ($item eq " " or $item eq "");
  444. }
  445. }
  446. }#end foreach
  447. }#end while
  448. # ====================================================
  449. # print out not-found words by frequency:
  450. open (OUT, ">notFound") || die "cannot open: $!";
  451. print STDERR "\n\n========= Some stats ============================\n";
  452. @types = keys %types; $types = @types;
  453. print STDERR "Tokens: $tokens -- Types: $types\n";
  454. print STDERR "\n========= Frequency count of Not-Found =========\n";
  455. my @items = keys %freqnotfound;
  456. foreach my $item (sort { $freqnotfound{$b} <=> $freqnotfound{$a} } @items) {
  457. $rank++;
  458. print STDERR "$item\t$freqnotfound{$item}\n" unless ( $rank > 25 );
  459. print OUT "$item\t$freqnotfound{$item}\n"; # unless ( $rank > 25 );
  460. }
  461. close OUT;
  462. # ============================
  463. sub analyze { # returns a list of 1 or more solutions
  464. $this_word = shift @_; @solutions = (); $cnt = 0;
  465. segmentword($this_word); # get a list of valid segmentations
  466. foreach $segmentation (@segmented) {
  467. ($prefix,$stem,$suffix) = split ("\t",$segmentation); #print $segmentation, "\n";
  468. if (exists($prefix_hash{$prefix})) {
  469. if (exists($stem_hash{$stem})) {
  470. if (exists($suffix_hash{$suffix})) {
  471. # all 3 components exist in their respective lexicons, but are they compatible? (check the $cat pairs)
  472. foreach $prefix_value (@{$prefix_hash{$prefix}}) {
  473. ($prefix, $voc_a, $cat_a, $gloss_a, $pos_a) = split (/\t/, $prefix_value);
  474. foreach $stem_value (@{$stem_hash{$stem}}) {
  475. #($stem, $voc_b, $cat_b, $gloss_b, $pos_b) = split (/\t/, $stem_value);
  476. ($stem, $voc_b, $cat_b, $gloss_b, $pos_b, $lemmaID) = split (/\t/, $stem_value);
  477. if ( exists($hash_AB{"$cat_a"." "."$cat_b"}) ) {
  478. foreach $suffix_value (@{$suffix_hash{$suffix}}) {
  479. ($suffix, $voc_c, $cat_c, $gloss_c, $pos_c) = split (/\t/, $suffix_value);
  480. if ( exists($hash_AC{"$cat_a"." "."$cat_c"}) ) {
  481. if ( exists($hash_BC{"$cat_b"." "."$cat_c"}) ) {
  482. #$cnt++; push (@solutions, " SOLUTION $cnt: ($voc_a$voc_b$voc_c) $pos_a$pos_b$pos_c\n (GLOSS): $gloss_a + $gloss_b + $gloss_c\n");
  483. $cnt++; push (@solutions, " SOLUTION $cnt: ($voc_a$voc_b$voc_c) [$lemmaID] $pos_a$pos_b$pos_c\n (GLOSS): $gloss_a + $gloss_b + $gloss_c\n");
  484. }
  485. }
  486. }
  487. }
  488. }
  489. }# end foreach $prefix_value
  490. }
  491. }# end if (exists($stem_hash{$stem}))
  492. }
  493. }# end foreach $segmentation
  494. return (@solutions);
  495. }
  496. # ============================
  497. sub get_alternatives { # returns a list of alternative spellings
  498. $word = shift @_; @alternatives = ();
  499. $temp = $word;
  500. if ($temp =~ m/Y'$/) { # Y_w'_Y'
  501. $temp =~ s/Y/y/g; # y_w'_y'
  502. push (@alternatives, $temp); # y_w'_y' -- pushed
  503. if ($temp =~ s/w'/&/) { # y_&__y'
  504. push (@alternatives, $temp); # y_&__y' -- pushed
  505. }
  506. $temp = $word; # Y_w'_Y'
  507. $temp =~ s/Y/y/g; # y_w'_y'
  508. $temp =~ s/y'$/}/; # y_w'_}
  509. push (@alternatives, $temp); # y_w'_} -- pushed
  510. if ($temp =~ s/w'/&/) { # y_&__}
  511. push (@alternatives, $temp); # y_&__} -- pushed
  512. }
  513. }
  514. elsif ($temp =~ m/y'$/) { # Y_w'_y'
  515. if ($temp =~ s/Y/y/g) { # Y_w'_y'
  516. push (@alternatives, $temp); # y_w'_y' -- pushed
  517. }
  518. if ($temp =~ s/w'/&/) { # y_w'_y'
  519. push (@alternatives, $temp); # y_&__y' -- pushed
  520. }
  521. $temp = $word; # Y_w'_y'
  522. $temp =~ s/Y/y/g; # y_w'_y'
  523. $temp =~ s/y'$/}/; # y_w'_}
  524. push (@alternatives, $temp); # y_w'_} -- pushed
  525. if ($temp =~ s/w'/&/) { # y_&__}
  526. push (@alternatives, $temp); # y_&__} -- pushed
  527. }
  528. }
  529. elsif ($temp =~ s/Y$/y/) { # Y_w'_y
  530. $temp =~ s/Y/y/g; # y_w'_y
  531. push (@alternatives, $temp); # y_w'_y -- pushed
  532. if ($temp =~ s/w'/&/) { # y_&__y
  533. push (@alternatives, $temp); # y_&__y -- pushed
  534. }
  535. }
  536. elsif ($temp =~ m/y$/) { # Y_w'_y
  537. $temp =~ s/Y/y/g; # y_w'_y
  538. if ($temp =~ s/w'/&/) { # y_&__y
  539. push (@alternatives, $temp); # y_&__y -- pushed
  540. }
  541. $temp = $word; # Y_w'_y
  542. $temp =~ s/Y/y/g; # y_w'_y
  543. $temp =~ s/y$/Y/g; # y_w'_Y
  544. push (@alternatives, $temp); # y_w'_Y -- pushed
  545. if ($temp =~ s/w'/&/) { # y_&__Y
  546. push (@alternatives, $temp); # y_&__Y -- pushed
  547. }
  548. }
  549. elsif ($temp =~ m/h$/) { # Y_w'_h
  550. if ($temp =~ s/Y/y/g) { # y_w'_h
  551. push (@alternatives, $temp); # y_w'_h -- pushed
  552. }
  553. if ($temp =~ s/w'/&/) { # y_&__h
  554. push (@alternatives, $temp); # y_&__h -- pushed
  555. }
  556. $temp =~ s/h$/p/; # y_w'_p
  557. push (@alternatives, $temp); # y_&__p -- pushed
  558. }
  559. elsif ($temp =~ m/p$/) { # Y_w'_h
  560. if ($temp =~ s/Y/y/g) { # y_w'_h
  561. push (@alternatives, $temp); # y_w'_h -- pushed
  562. }
  563. if ($temp =~ s/w'/&/) { # y_&__h
  564. push (@alternatives, $temp); # y_&__h -- pushed
  565. }
  566. $temp =~ s/p$/h/; # y_w'_p
  567. push (@alternatives, $temp); # y_&__p -- pushed
  568. }
  569. elsif ($temp =~ s/Y/y/g) { # Y_w'__
  570. push (@alternatives, $temp); # y_w'__ -- pushed
  571. if ($temp =~ s/w'/&/) { # y_&___
  572. push (@alternatives, $temp); # y_&___ -- pushed
  573. }
  574. }
  575. elsif ($temp =~ s/w'/&/) { # y_w'__
  576. push (@alternatives, $temp); # y_&___ -- pushed
  577. }
  578. else {
  579. # nothing
  580. }
  581. return @alternatives;
  582. }
  583. # ============================
  584. sub tokenize_nonArabic { # tokenize non-Arabic strings by splitting them on white space
  585. $nonArabic = shift @_;
  586. $nonArabic =~ s/^\s+//; $nonArabic =~ s/\s+$//; # remove leading & trailing space
  587. @nonArabictokens = split (/\s+/, $nonArabic);
  588. return @nonArabictokens;
  589. }
  590. # ============================
  591. sub tokenize { # returns a list of tokens
  592. $line = shift @_;
  593. chomp($line);
  594. $line =~ s/^\s+//; $line =~ s/\s+$//; $line =~ s/\s+/ /g; # remove or minimize white space
  595. @tokens = split (/([^\x81\x8D\x8E\x90\xC1-\xD6\xD8-\xDF\xE1\xE3-\xE6\xEC-\xED\xF0-\xF3\xF5\xF6\xF8\xFA]+)/,$line);
  596. return @tokens;
  597. }
  598. # ================================
  599. sub get_lookup { # creates a suitable lookup version of the Arabic input string (removes diacritics; transliterates)
  600. $input_str = shift @_;
  601. $tmp_word = $input_str; # we need to modify the input string for lookup
  602. $tmp_word =~ s/\xDC//g; # remove kashida/taTwiyl (U+0640)
  603. $tmp_word =~ s/[\xF0-\xF3\xF5\xF6\xF8\xFA]//g; # remove fatHatAn and all vowels/diacritics (ًٌٍَُِّْ)
  604. $tmp_word =~ tr/\x81\x8D\x8E\x90\xA1\xBA\xBF\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE1\xE3\xE4\xE5\xE6\xEC\xED\xF0\xF1\xF2\xF3\xF5\xF6\xF8\xFA/PJRG,;?'|>&<}AbptvjHxd*rzs\$SDTZEg_fqklmnhwYyFNKaui~o/; # convert to transliteration
  605. return $tmp_word;
  606. }
  607. # ============================
  608. sub segmentword { # returns a list of valid segmentations
  609. $str = shift @_;
  610. @segmented = ();
  611. $prefix_len = 0;
  612. $suffix_len = 0;
  613. $str_len = length($str);
  614. while ( $prefix_len <= 4 ) {
  615. $prefix = substr($str, 0, $prefix_len);
  616. $stem_len = ($str_len - $prefix_len);
  617. $suffix_len = 0;
  618. while (($stem_len >= 1) and ($suffix_len <= 6)) {
  619. $stem = substr($str, $prefix_len, $stem_len);
  620. $suffix = substr($str, ($prefix_len + $stem_len), $suffix_len);
  621. push (@segmented, "$prefix\t$stem\t$suffix");
  622. $stem_len--;
  623. $suffix_len++;
  624. }
  625. $prefix_len++;
  626. }
  627. return @segmented;
  628. }
  629. # ==============================================================
  630. sub load_dict { # loads a dict into a hash table where the key is $entry and its value is a list (each $entry can have multiple values)
  631. %temp_hash = (); $entries = 0; $lemmaID = "";
  632. $filename = shift @_;
  633. open (IN, $filename) || die "cannot open: $!";
  634. print STDERR "loading $filename ...";
  635. while (<IN>) {
  636. if (m/^;; /) {
  637. $lemmaID = $';
  638. chomp($lemmaID);
  639. if ( exists($seen{$lemmaID}) ) {
  640. die "lemmaID $lemmaID in $filename (line $.) isn't unique\n" ; # lemmaID's must be unique
  641. }
  642. else {
  643. $seen{$lemmaID} = 1; $lemmas++;
  644. }
  645. }
  646. elsif (m/^;/) { } # comment
  647. else {
  648. chomp(); $entries++;
  649. # a little error-checking won't hurt:
  650. $trcnt = tr/\t/\t/; if ($trcnt != 3) { die "entry in $filename (line $.) doesn't have 4 fields (3 tabs)\n" };
  651. ($entry, $voc, $cat, $glossPOS) = split (/\t/, $_); # get the $entry for use as key
  652. # two ways to get the POS info:
  653. # (1) explicitly, by extracting it from the gloss field:
  654. if ($glossPOS =~ m!<pos>(.+?)</pos>!) {
  655. $POS = $1; # extract $POS from $glossPOS
  656. $gloss = $glossPOS; # we clean up the $gloss later (see below)
  657. }
  658. # (2) by deduction: use the $cat (and sometimes the $voc and $gloss) to deduce the appropriate POS
  659. else {
  660. $gloss = $glossPOS; # we need the $gloss to guess proper names
  661. if ($cat =~ m/^(Pref-0|Suff-0)$/) {$POS = ""} # null prefix or suffix
  662. elsif ($cat =~ m/^F/) {$POS = "$voc/FUNC_WORD"}
  663. elsif ($cat =~ m/^IV/) {$POS = "$voc/VERB_IMPERFECT"}
  664. elsif ($cat =~ m/^PV/) {$POS = "$voc/VERB_PERFECT"}
  665. elsif ($cat =~ m/^CV/) {$POS = "$voc/VERB_IMPERATIVE"}
  666. elsif (($cat =~ m/^N/)
  667. and ($gloss =~ m/^[A-Z]/)) {$POS = "$voc/NOUN_PROP"} # educated guess (99% correct)
  668. elsif (($cat =~ m/^N/)
  669. and ($voc =~ m/iy~$/)) {$POS = "$voc/NOUN"} # (was NOUN_ADJ: some of these are really ADJ's and need to be tagged manually)
  670. elsif ($cat =~ m/^N/) {$POS = "$voc/NOUN"}
  671. else { die "no POS can be deduced in $filename (line $.) "; };
  672. }
  673. # clean up the gloss: remove POS info and extra space, and convert upper-ASCII to lower (it doesn't convert well to UTF-8)
  674. $gloss =~ s!<pos>.+?</pos>!!; $gloss =~ s/\s+$//; $gloss =~ s!;!/!g;
  675. $gloss =~ tr/ہءآأؤإابةتثجحخدرزسشصضظعغـ/AAAAAACEEEEIIIINOOOOOUUUU/;
  676. $gloss =~ tr/àلâمنهçèéêëىيîïٌٍَôُِùْûü/aaaaaaceeeeiiiinooooouuuu/;
  677. $gloss =~ s/ئ/AE/g; $gloss =~ s/ٹ/Sh/g; $gloss =~ s/ژ/Zh/g; $gloss =~ s/ك/ss/g;
  678. $gloss =~ s/و/ae/g; $gloss =~ s/ڑ/sh/g; $gloss =~ s/*/zh/g;
  679. # note that although we read 4 fields from the dict we now save 5 fields in the hash table
  680. # because the info in last field, $glossPOS, was split into two: $gloss and $POS
  681. #push ( @{ $temp_hash{$entry} }, "$entry\t$voc\t$cat\t$gloss\t$POS") ; # the value of $temp_hash{$entry} is a list of values
  682. push ( @{ $temp_hash{$entry} }, "$entry\t$voc\t$cat\t$gloss\t$POS\t$lemmaID") ; # the value of $temp_hash{$entry} is a list of values
  683. }
  684. }
  685. close IN;
  686. print STDERR " $lemmas lemmas and" unless ($lemmaID eq "");
  687. print STDERR " $entries entries \n";
  688. return %temp_hash;
  689. }
  690. # ==============================================================
  691. sub load_table { # loads a compatibility table into a hash table where the key is $_ and its value is 1
  692. %temp_hash = ();
  693. $filename = shift @_;
  694. open (IN, $filename) || die "cannot open: $!";
  695. while (<IN>) {
  696. unless ( m/^;/ ) {
  697. chomp();
  698. s/^\s+//; s/\s+$//; s/\s+/ /g; # remove or minimize white space
  699. $temp_hash{$_} = 1;
  700. }
  701. }
  702. close IN;
  703. return %temp_hash;
  704. }
  705. # ==============================================================
Last edited by b.janahi; Jul 1st, 2006 at 5:21 am.
Reply With Quote Quick reply to this message  
Reply

This thread is more than three months old.
Perhaps start a new thread instead?
Message:


Thread Tools Search this Thread



About Us | Contact Us | Advertise | DaniWeb | Acceptable Use Policy | RSS Feed

©2003 - 2009 DaniWeb® LLC