classification of text in a corpus of articles

Question

10 Years Ago

classification of a text in a corpus with five themes and each theme contains 10 items in perl
for exemple:
i have:
1 theme food
10 articles about food
2 theme politic
10 articles about politic
...etc
and i give him a new text and he classify this text on the appropriate theme
using:
- K-means or
- decision tree or
- segmentation or
-naive bayesien
Pleeeeease help me

perl

Edited 10 Years Ago by chahinez.abdelo.9

3 Contributors
9 Replies
704 Views
6 Years Discussion Span
Latest Post 4 Years Ago Latest Post by Sarah_41

All 9 Replies

2teez 43 Posting Whiz

10 Years Ago

Hi,
What have you tried, and where are you having issues with your script? What desired output are you expecting?

2teez 43 Posting Whiz

10 Years Ago

Can you help me please

Yes, I really want to help, but unless you could show the sample of your input, and how your desired output should look like, it would be difficult to tell.

Moreover, I love French language, but I am a beginner in it and it has been long I checked up my class-work on french language. So, please could you show your code in English so that one can follow, especially your comments.

WIll be waiting to hear from you soon.

2teez 43 Posting Whiz

10 Years Ago

Hi,
I would have loved to see how your data raw data look like, because I don't think you will have to use s/// these much parsing your data.

Secondly, you could module File::Find instead of the trying to handpick your files from the directory.

It a lot better to use three arguments open function and a lexcical filehandler than you are presently doing something like:
open my $fh, '<', $filename or die "can't open file: $!"

Reply to this topic

Be a part of the DaniWeb community

We're a friendly, industry-focused community of developers, IT pros, digital marketers, and technology enthusiasts meeting, networking, learning, and sharing knowledge.

chahinez.abdelo.9 0 Newbie Poster · Answer 1 · 2014-12-17T19:38:51+00:00

#!/usr/bin/perl

use strict; use warnings;

# Lancement des traitements sur le corpus

sub Run 
{
   my ($rep) = @_;
   my $DIRarticles = "$rep/art";
   my $DIRclean= "$rep/clean";
   my $DIRtag= "$rep/tag";
   my $DIRvect= "$rep/vect";
   my $nbfiles= 0;

   #On recupere tout les fichiers contenu dans le repertoire
   opendir (REP, $DIRarticles) or die ("CAUTION : Impossible d'ouvrir le
repertoire");

   #On les stocke dans dans un tableau
   my @articles = readdir (REP);
   closedir (REP);
   #Ensuite pour chaque fichier on extrait les candidats termes
   print ("PRE TRAITEMENT sur le REPERTOIRE : $DIRarticles en cours \n");
   foreach my $entree (@articles)
   {
      if (not($entree eq ".") and not($entree eq ".."))
      {
          print ("CLEANING $DIRarticles,$entree ...\n");
          Clean ("$DIRarticles","$entree","$DIRclean");
          print "TAGGING $DIRclean,$entree ...\n";
          Tagger ("$DIRclean","$entree","$DIRtag");
          $nbfiles ++;
      }
   }
   print ("\t$nbfiles dans $DIRarticles ont أ©tأ© traitأ©s \n\n");
}

#Operation de nettoyage
sub Clean
{
   my ($dir2clean,$fichier,$rep2clean) = @_;
   #Ouverture du fichier
   open (F,"$dir2clean/$fichier");
   #Fichier de sortie
   open (Sortie,"> $rep2clean/$fichier");
   #On parcours le texte
   while (my $chaine = <F>) 
   {
       $chaine =~ s/\./ \. /g;
       $chaine =~ s/\,/ \. /g;
       $chaine =~ s/\:/ \. /g;
       $chaine =~ s/\;/ \. /g;
       $chaine =~ s/\'/ \. /g;
       $chaine =~ s/\"/ \. /g;
       $chaine =~ s/\?/ \. /g;
       $chaine =~ s/\!/ \. /g;
       $chaine =~ s/\// \. /g;




       $chaine =~ s/أ€/\$A/g;
       $chaine =~ s/أپ/\\A/g;
       $chaine =~ s/أ‚/#A/g;
       $chaine =~ s/أ„/~A/g;
       $chaine =~ s/أ /\$a/g;
       $chaine =~ s/أ،/\\A/g;
       $chaine =~ s/أ¢/#A/g;
       $chaine =~ s/أ¤/~a/g;
       $chaine =~ s/أ’/\$O/g;
       $chaine =~ s/أ“/\\O/g;
       $chaine =~ s/أ”/#O/g;
       $chaine =~ s/أ–/~O/g;
       $chaine =~ s/أ²/\$o/g;
       $chaine =~ s/أ³/\\o/g;
       $chaine =~ s/أ´/#o/g;
       $chaine =~ s/أ¶/~o/g;
       $chaine =~ s/أˆ/\$E/g;
       $chaine =~ s/أ‰/\\E/g;
       $chaine =~ s/أٹ/#E/g;
       $chaine =~ s/أ‹/~E/g;
       $chaine =~ s/أ¨/\$e/g;
       $chaine =~ s/أ©/\\e/g;
       $chaine =~ s/أھ/#e/g;
       $chaine =~ s/أ«/~e/g;
       $chaine =~ s/أŒ/\$I/g;
       $chaine =~ s/أچ/\\I/g;
       $chaine =~ s/أژ/#I/g;
       $chaine =~ s/أڈ/~I/g;
       $chaine =~ s/أ¬/\$i/g;
       $chaine =~ s/أ/\\i/g;
       $chaine =~ s/أ®/#i/g;
       $chaine =~ s/أ¯/~i/g;
       $chaine =~ s/أ™/\$U/g;
       $chaine =~ s/أڑ/\\U/g;
       $chaine =~ s/أ›/#U/g;
       $chaine =~ s/أœ/~U/g;
       $chaine =~ s/أ¹/\$u/g;
       $chaine =~ s/أ؛/\\u/g;
       $chaine =~ s/أ»/#u/g;
       $chaine =~ s/أ¼/~u/g;
       $chaine =~ s/أ؟/~y/g;
       $chaine =~ s/أ‡/\\C/g;
       $chaine =~ s/أ§/\\c/g;
       print Sortie $chaine;
   }
   #Fermeture des fichiers
   close (F);
   close (Sortie);
}


#

sub makeChaine
{
     my ($dir2open,$fichier) = @_;
     my $chaine = "";
     #Ouverture du fichier
     open (F,"$dir2open/$fichier");
     while (my $ligne = <F>) 
     {
           $chaine .= $ligne;
     }
     close(F);
     return ($chaine);
}

#
sub appel_Sygmart 
{
     my $chaine = @_;
     require LWP::UserAgent;
     my $ua = LWP::UserAgent->new;
     $ua->timeout(10);
     $ua->env_proxy;
     my $response = $ua->post(' http://www.lirmm.fr/~chauche/cgi-
bin/runsygmart.cgi', {
              Services => 'lslm',
              FormeSortie => 'lslmsc',
              filtre => 'tout3',
              texte_entree => $chaine,
              Charset => 'utf8',
            });
     if ($response->is_success) 
     {
         return $response->content; # or whatever
     }
     else 
     {
         return $response->status_line;
     }
}



#
sub Tagger
{
     my ($dir2open,$fichier,$rep2tag) = @_;
     my $texte = makeChaine("$dir2open","$fichier");
     my $textetag;
     #Fichier de sortie
     open (Sortie,"> $rep2tag/$fichier");
     $textetag = appel_Sygmart($texte);
     print Sortie $textetag;
     close (Sortie);
}


Run("../article/Donquichote");
Run("../article/ParisElection");
Run("../article/SarkozyCarla");
Run("../article/SkiGrange");
Run("../article/Tf1DaylimotionYoutube");

chahinez.abdelo.9 0 Newbie Poster · Answer 2 · 2014-12-17T19:41:24+00:00

Hello 2teez
That is my script
When i execute it
he show an error in function Run i don t know how!!!
Can you help me please

chahinez.abdelo.9 0 Newbie Poster · Answer 3 · 2014-12-20T21:12:40+00:00

#!/usr/bin/perl
use strict; use warnings;

# Launch treatments on the corpus

sub Run 
{
   my ($rep) = @_;     #repertory
   my $DIRarticles = "$rep/art";
   my $DIRclean= "$rep/clean";
   my $DIRtag= "$rep/tag";
   my $DIRvect= "$rep/vect";
   my $nbfiles= 0;

   #We Recovered all the files contained in the directory
   opendir (REP, $DIRarticles) or die ("CAUTION : Impossible to open the directory");

   #We Stores them in an array
   my @articles = readdir (REP);
   closedir (REP);
   # Then for each file is extracted candidate terms
   print ("PRE TRAITEMENT  : $DIRarticles en cours \n"); #pre tritment on the directory
   foreach my $entree (@articles)
   {
      if (not($entree eq ".") and not($entree eq ".."))
      {
          print ("CLEANING $DIRarticles,$entree ...\n");
          Clean ("$DIRarticles","$entree","$DIRclean");
          print "TAGGING $DIRclean,$entree ...\n";
          Tagger ("$DIRclean","$entree","$DIRtag");
          $nbfiles ++;
      }
   }
   print ("\t$nbfiles dans $DIRarticles ont أ©tأ© traitأ©s \n\n");
}

# Cleaning Operation
sub Clean
{
   my ($dir2clean,$fichier,$rep2clean) = @_;
   # opening file
   open (F,"$dir2clean/$fichier");
   #output file
   open (Sortie,"> $rep2clean/$fichier");
   # We travel the text    chaine-> caracter
   while (my $chaine = <F>) 
   {
       $chaine =~ s/\./ \. /g;
       $chaine =~ s/\,/ \. /g;
       $chaine =~ s/\:/ \. /g;
       $chaine =~ s/\;/ \. /g;
       $chaine =~ s/\'/ \. /g;
       $chaine =~ s/\"/ \. /g;
       $chaine =~ s/\?/ \. /g;
       $chaine =~ s/\!/ \. /g;
       $chaine =~ s/\// \. /g;




       $chaine =~ s/أ€/\$A/g;
       $chaine =~ s/أپ/\\A/g;
       $chaine =~ s/أ‚/#A/g;
       $chaine =~ s/أ„/~A/g;
       $chaine =~ s/أ /\$a/g;
       $chaine =~ s/أ،/\\A/g;
       $chaine =~ s/أ¢/#A/g;
       $chaine =~ s/أ¤/~a/g;
       $chaine =~ s/أ’/\$O/g;
       $chaine =~ s/أ“/\\O/g;
       $chaine =~ s/أ”/#O/g;
       $chaine =~ s/أ–/~O/g;
       $chaine =~ s/أ²/\$o/g;
       $chaine =~ s/أ³/\\o/g;
       $chaine =~ s/أ´/#o/g;
       $chaine =~ s/أ¶/~o/g;
       $chaine =~ s/أˆ/\$E/g;
       $chaine =~ s/أ‰/\\E/g;
       $chaine =~ s/أٹ/#E/g;
       $chaine =~ s/أ‹/~E/g;
       $chaine =~ s/أ¨/\$e/g;
       $chaine =~ s/أ©/\\e/g;
       $chaine =~ s/أھ/#e/g;
       $chaine =~ s/أ«/~e/g;
       $chaine =~ s/أŒ/\$I/g;
       $chaine =~ s/أچ/\\I/g;
       $chaine =~ s/أژ/#I/g;
       $chaine =~ s/أڈ/~I/g;
       $chaine =~ s/أ¬/\$i/g;
       $chaine =~ s/أ/\\i/g;
       $chaine =~ s/أ®/#i/g;
       $chaine =~ s/أ¯/~i/g;
       $chaine =~ s/أ™/\$U/g;
       $chaine =~ s/أڑ/\\U/g;
       $chaine =~ s/أ›/#U/g;
       $chaine =~ s/أœ/~U/g;
       $chaine =~ s/أ¹/\$u/g;
       $chaine =~ s/أ؛/\\u/g;
       $chaine =~ s/أ»/#u/g;
       $chaine =~ s/أ¼/~u/g;
       $chaine =~ s/أ؟/~y/g;
       $chaine =~ s/أ‡/\\C/g;
       $chaine =~ s/أ§/\\c/g;
       print Sortie $chaine;
   }
   #close file
   close (F);
   close (Sortie);
}


#
#ligne -> line
sub makeChaine
{
     my ($dir2open,$fichier) = @_;
     my $chaine = "";
     #opening the file
     open (F,"$dir2open/$fichier");
     while (my $ligne = <F>) 
     {
           $chaine .= $ligne;
     }
     close(F);
     return ($chaine);
}

#We can delete this function it s operationnal
sub appel_Sygmart 
{
     my $chaine = @_;
     require LWP::UserAgent;
     my $ua = LWP::UserAgent->new;
     $ua->timeout(10);
     $ua->env_proxy;
     my $response = $ua->post(' http://www.lirmm.fr/~chauche/cgi-
bin/runsygmart.cgi', {
              Services => 'lslm',
              FormeSortie => 'lslmsc',
              filtre => 'tout3',
              texte_entree => $chaine,
              Charset => 'utf8',
            });
     if ($response->is_success) 
     {
         return $response->content; # or whatever
     }
     else 
     {
         return $response->status_line;
     }
}



#
sub Tagger
{
     my ($dir2open,$fichier,$rep2tag) = @_;
     my $texte = makeChaine("$dir2open","$fichier");
     my $textetag;
     #Fichier de sortie
     open (Sortie,"> $rep2tag/$fichier");
     $textetag = appel_Sygmart($texte);
     print Sortie $textetag;
     close (Sortie);
}


Run("../article/Donquichote");
Run("../article/ParisElection");
Run("../article/SarkozyCarla");
Run("../article/SkiGrange");
Run("../article/Tf1DaylimotionYoutube");

chahinez.abdelo.9 0 Newbie Poster · Answer 4 · 2014-12-20T21:47:14+00:00

Hi 2teez,
I totally forget that this website is in english so i tried to convert my script in english i hope that it will help you

chahinez.abdelo.9 0 Newbie Poster · Answer 5 · 2014-12-29T10:33:38+00:00

Hello,
firstly i thank you to take time to answer me.

Here exactly what i should exactly do: 

Established the corpus.
Prepare our project structure.
Write a Perl script that :
      1 Browse the corpus.

      2 Cleans files and makes the necessary substitutions SYGMART .

      3 Call Sgmart and save the result .
The purpose of this project is to implement and evaluate a document classification method programmed in Perl.
**First step: formation of the corpus**
In a first step, a body should be formed . We propose to develop a body of five distinct themes (for exemple: politics , cooking, etc. ). This corpus will be normalized (removal HTML tags , etc ) . To do this , you will find ten texts written in French or English relating to each of these five themes.
**Second step: implementation of a classification algorithm**
Further work will be to implement a classification algorithm . many
learning approaches can be used for text classification :
• K nearest neighbors
• Decision Trees
• Naïve Bayes
• Neural Networks
• support vector machines
In this project, we propose to use the well-known method of K nearest neighbors ( KNN ) view
in progress.
Third step : taking account of linguistic information
The goal here is to use your texts with different information:
• Gross Texts .
• lemmatised Texts .
• Texts lemmatised with parsing .

**The project structure** as I see it is this:

ROOT
|____REP Article
     |____REP Donquichote
          |
          |
          |____REP Art
               |
               |
               |
               |____Txt files
          |
          |
          |
          |
          |
          |____REP clean
               |____Txt files cleaned
          |
          |
          |
          |____REP tag
               |____Tagged files in .txt format
          |
          |
          |
          |
          |
          |____REP vect
               |____Txt files
     |
     |____REP ParisElection
          |
          |
          |____REP Art
               |____Txt files
          |
          |
          |____REP clean
               |____Txt files cleaned
          |
          |
          |____REP tag
               |____Tagged files in .txt format
          |
          |
          |____REP vect
               |____Txt files
     |
     |____REP SarkozyCarla
          |
          |
          |____REP Art
               |____Txt files
          |
          |
          |____REP clean
               |____txt files cleaned
          |
          |
          |____REP tag
               |____Tagged files in .txt format
          |
          |
          |____REP vect
               |____Txt files
    |
    |____REP SkiGrange
          |
          |
          |____REP Art
               |____Txt files
          |
          |
          |____REP clean
               |____Txt files cleaned
          |
          |
          |____REP tag
               |____Tagged files in .txt format
          | 
          |
          |____REP vect
               |____Txt files
          |
          |____REP Tf1DaylimotionYoutube
               |
               |____REP Art
                    |____Txt files
               |
               |
               |____REP clean
                   |____Txt files cleaned
               |
               |
               |____REP tag
                    |____Tagged files in .txt format
               |
               |
               |____REP vect
                    |____Txt files
|
|____REP Binary
     |____Executions files
|
|____REP Data
     |____...

Sarah_41 0 Newbie Poster · Answer 6 · 2021-01-14T18:26:32+00:00

Sarah_41 0 Newbie Poster

4 Years Ago

hello do you have the solution please??

classification of text in a corpus of articles

Recommended Answers Collapse Answers

All 9 Replies

Recommended Answers