My company is retiring an online document warehousing application that stored lots of text data. The application stored the data in a folder hierarchy that looked just like a Windows folder tree. I'm trying to replicate that hierarchy on a UNIX file system, but the tools provided with the application to extract the hierarchy information are not terribly useful.

One tool will give me a list of folder ID numbers and names like so:

Folder 8777 - fozzy.
Folder 8778 - fozzy1.
Folder 8779 - fozzy2.
Folder 8780 - grover1.
Folder 8781 - grover2.
Folder 8782 - rolf1.
Folder 8783 - rolf2.
Folder 8784 - rolf3.
Folder 8785 - rolf4.
Folder 8786 - travel_statements.
Folder 8787 - invoices.

Another tool will give me sub-folder relationships based on folder ID like the following:

Folder 100 - <root>.
   subfolder 101 - flag 0.
   subfolder 119 - flag 0.
   subfolder 227 - flag 0.
   subfolder 239 - flag 0.
   subfolder 1198 - flag 0.
   subfolder 1320 - flag 0.
   subfolder 2264 - flag 0.
   subfolder 3025 - flag 0.
   subfolder 3028 - flag 0.
   subfolder 3031 - flag 0.
Folder 1198 - kermit1.
   subfolder 1227 - flag 0.
   subfolder 1231 - flag 0.
   subfolder 1238 - flag 0.
   subfolder 1374 - flag 0.
   subfolder 1504 - flag 0.
   subfolder 1538 - flag 0.
   subfolder 1642 - flag 0.
   subfolder 2459 - flag 0.
   subfolder 2635 - flag 0.
   subfolder 2642 - flag 0.
   subfolder 3998 - flag 0.
   subfolder 7942 - flag 0.
   subfolder 8656 - flag 0.
Folder 1227 - monkey1.
   subfolder 1228 - flag 0.
   subfolder 1327 - flag 0.
   subfolder 1347 - flag 0.
   subfolder 1390 - flag 0.
   subfolder 1396 - flag 0.
Folder 3333 - piggy1.
   No sub folders.

I first approached this problem by looping through the list of folder IDs and, for each one, running a recursive function that scanned the sub-folder information until a path could be built back to the root folder (folder ID 100). This appeared to work great, but I encountered two problems:

  1. I discovered that some sub-folders were present in more than one location, but my code only picked up the first instance
  2. I also found that some folders were positioned outside the hierarchy of the root folder

Next, I tried using the sub-folder information to start with. I built a list of simple strings representing one parent/child pair like this: 100/1198. Then, for each pair, I looped through the sub-folder info again and tried building paths based on the child element matching the parent element of any scanned lines. This caught some of the duplicate paths, but I ended up with a bunch of paths that had no relationship to the beginning or end of the tree.

Can anyone here think of how I could build folder hierarchy based on this kind of data? Or can anyone here even think of a good way I could represent this data internally so I could build the paths without missing any possible path combinations? Any assistance would be greatly appreciated. Thank you!

use strict;
use warnings;
use File::Path;
use Cwd;

## source.txt holds the sub-folder listing shown above
open (my $fin, '<', 'source.txt') or die "Cannot open the input file: $!";
# Slurp the whole file; $/ is only undefined inside this block
my $file = do { local $/; <$fin> };
close ($fin);

my ($root, @lines, @folder, $flag, $cwd);

# Get the root folder ID (the number on the line that contains <root>)
$root=$1 if ($file=~ m{(\d+).*?<root>}s);

# Split the input file into an array of lines
@lines=split(/\n/, $file);

# ( 0 => processing the root folder, 1 => processing a second-level folder )
$flag=0;

# Get the current working directory and build the root folder path.
# Instead of $cwd you may assign your own location.
$cwd=cwd();
$root="$cwd/$root";

# $lines[0] is skipped because the root folder ID has already been captured.
foreach my $i ( 1 .. $#lines)
{
	# Generate 2nd level folder
	if ( $lines[$i]=~ m{^folder (\d+)}i)
	{
		push (@folder, $1); $flag=1;
		mkpath ("$root/$folder[$#folder]");
	}
	# Generate sub folders of 2nd level folder
	elsif ($lines[$i]=~ m{^\s*subfolder (\d+)}i && ($flag == 1))
	{
		mkpath ("$root/$folder[$#folder]/$1");
	}
	# Generate sub folders of root folder
	elsif ($lines[$i]=~ m{^\s*subfolder (\d+)}i)
	{
		mkpath ("$root/$1");
	}
}

I assumed the data has only two levels:
1) 1st level (the root folder and its sub-folders).
2) 2nd level (each second-level folder and its sub-folders).

Hello k_manimuthu, thank you for your post. Unfortunately, the data was not merely level 1 and level 2 data; the depth could have extended indefinitely (though usually no deeper than 5 or 6 levels). However, I was able to get the assistance I needed over at PerlMonks. The link will take you directly to the node with my question. The code they provided me with was brilliant, and worth a look if you're interested. Thanks to everyone else who even looked at this post!
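
For readers who want something self-contained, here is a minimal sketch of one way to handle arbitrary depth. It is not the PerlMonks code; it is just an illustration of the general idea of storing the sub-folder listing as a hash of parent ID => child IDs and walking it depth-first. The sub names (parse_listing, top_level_ids, paths_from) and the sample listing, including the orphan1 folder, are made up for the example, and it assumes Perl 5.10 or later for the // operator. A folder listed under several parents produces one path per parent, and folders that are never listed as a sub-folder are treated as extra starting points, which should cover folders that sit outside the root hierarchy.

```perl
use strict;
use warnings;

# Parse the sub-folder listing into two hashes:
#   %name     folder ID => folder name
#   %children folder ID => [ child folder IDs ]
sub parse_listing {
    my ($text) = @_;
    my (%name, %children, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^Folder\s+(\d+)\s+-\s+(.*?)\.\s*$/) {
            $current = $1;
            $name{$current} = $2;
            $children{$current} //= [];
        }
        elsif (defined $current && $line =~ /^\s+subfolder\s+(\d+)/) {
            push @{ $children{$current} }, $1;
        }
    }
    return (\%name, \%children);
}

# IDs that never appear as anyone's sub-folder: the root plus any
# folders positioned outside the root hierarchy.
sub top_level_ids {
    my ($children) = @_;
    my %is_child = map { $_ => 1 } map { @$_ } values %$children;
    return sort { $a <=> $b } grep { !$is_child{$_} } keys %$children;
}

# Depth-first walk that returns every path at or below $id.
# %$on_path holds the IDs on the current branch so a cycle cannot
# recurse forever.
sub paths_from {
    my ($id, $prefix, $name, $children, $on_path) = @_;
    return () if $on_path->{$id};
    my $label = $name->{$id} // $id;    # fall back to the ID if no name is known
    my $path  = length $prefix ? "$prefix/$label" : $label;
    my @paths = ($path);
    $on_path->{$id} = 1;
    for my $kid (@{ $children->{$id} // [] }) {
        push @paths, paths_from($kid, $path, $name, $children, $on_path);
    }
    delete $on_path->{$id};
    return @paths;
}

# Invented sample in the same format as the tool output above.
my $listing = <<'END';
Folder 100 - <root>.
   subfolder 1198 - flag 0.
   subfolder 3333 - flag 0.
Folder 1198 - kermit1.
   subfolder 1227 - flag 0.
   subfolder 3333 - flag 0.
Folder 1227 - monkey1.
   No sub folders.
Folder 3333 - piggy1.
   No sub folders.
Folder 9000 - orphan1.
   subfolder 1227 - flag 0.
END

my ($name, $children) = parse_listing($listing);
for my $top (top_level_ids($children)) {
    print "$_\n" for paths_from($top, '', $name, $children, {});
}
```

Each printed path could then be handed to mkpath, as in the script earlier in the thread, under whatever base directory you choose. Note that the root folder's name comes through literally as <root>, so you would probably want to substitute a real directory name for it first.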
