The Problem: I have a long string of serial numbers that looks like this:

"1_2_3_4_5 0_100_1_0_4 1_2_3_4_5 0_100_1_0_4 0_100_1_0_4
3_1_34_5_6";
I want to remove all serial numbers that do not occur EXACTLY 3 or 1 times in the string.

The Result: In this case the return string should be:

0_100_1_0_4 0_100_1_0_4 0_100_1_0_4 3_1_34_5_6

The question: How can I use pure regex in PHP to remove all serial numbers that do not occur exactly 3 or 1 times? I do not want to use array_unique or array_count_values, as they are too slow when dealing with large amounts of data. Pure regex would be faster in my case.

EDITED to give a full explanation of what the code does, in case someone knows of a better way to approach the problem: the code reads in 10 files with millions of serial numbers in them, merges them, and prints out all numbers that appear 3 times. This is performed thousands of times, and the array_count_values call is killing the speed. Everything else is fast; I know because I profiled the code.

    $result_hold=""; 

for($i=0; $i<10000; $i++)
 {  
$linespreset=file("sometext2.txt",FILE_IGNORE_NEW_LINES); 
$holdpreset=explode(" ",$linespreset[0]);   
$holdpreset=array_map("trim", $holdpreset); 

$print1=file_get_contents('six.txt'); 
$print2=file_get_contents('seven.txt'); 
$print3=file_get_contents('thetext.txt'); 
$print4=file_get_contents('sometext.txt'); 
$print5=file_get_contents('moretext.txt'); 
$print6=file_get_contents('alltext.txt'); 
$print7=file_get_contents('trialtext.txt'); 
$print8=file_get_contents('two.txt'); 
$print9=file_get_contents('four.txt'); 
$print10=file_get_contents('five.txt');

        $resultround=$print1."\r\n".$print2."\r\n".$print3."\r\n".$print4."\r\n".$print5."\r\n".$print6."\r\n".$print7."\r\n".$print8."\r\n".$print9."\r\n".$print10;
$frequency = array_count_values($resultround=explode("\r\n",$resultround));

$result = array_filter($frequency, function ($x) { return $x ==3; });
$result=array_keys($result);

$resultnew=implode("\r\n",$result);
$result_hold .=$resultnew."\r\n";

unset($result);

        unset($frequency);
        unset($resultnew);
        }
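
For what it's worth, on the pure-regex question at the top: a single pattern cannot easily count occurrences scattered through an unsorted string, but if the serials are one per line and sorted first (so duplicates sit next to each other), one preg_replace_callback pass can keep just the runs of length 1 or 3. A rough sketch under those assumptions -- not a drop-in replacement for the loop above, and whether it beats array_count_values is something you would have to time (the sort is the expensive part):

    // Sketch: $lines is the merged array of serials, one serial per element.
    sort($lines);                       // duplicates become adjacent
    $data = implode("\n", $lines);

    // Each match is a run of identical adjacent lines; keep runs of length 1 or 3.
    $kept = preg_replace_callback('/^(.+)(?:\n\1)*$/m', function ($m) {
        $count = substr_count($m[0], "\n") + 1;
        return ($count === 1 || $count === 3) ? $m[0] : '';
    }, $data);

    // Collapse the blank lines left behind by the dropped runs.
    $kept = trim(preg_replace('/\n{2,}/', "\n", $kept));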

All 30 Replies

Where does this data originate from? Is there some reason you are not using a database? Is everything between underscores a separate serial number?

Hello!

  1. The data originates from flat .txt files that hold millions of serial numbers.

  2. I am not using a database because, in short, I am not experienced enough with database indexing to handle millions of constantly changing records. I talked to an expert and he felt that in my case I really needed to stay away from MySQL, given that I did not have the experience needed to index millions of rows properly. Insertion alone was super slow.

  3. The serial numbers are separated by a single space. I can see now that it is hard to see that, so I will write it again here:

    "1_2_3_4_5      0_100_1_0_4     1_2_3_4_5     0_100_1_0_4       0_100_1_0_4     3_1_34_5_6";

So 0_100_1_0_4 is one serial number.

Result should be:

`0_100_1_0_4     0_100_1_0_4      0_100_1_0_4      3_1_34_5_6`

I know the data is in flat files. I meant: how does it get there in the first place? I am sure no one is hand-typing millions of rows of data.

Is the data just one big bunch of serial numbers separated by whitespace, or is it x serial numbers per line?

If you can, post or PM one of the text files.

You said you want to remove serial numbers that do not appear 3 OR 1 time. How can you expect to remove something that is not there at least one time?

It would be really helpful if you could give an explanation of what you are doing rather than your attempt at doing it.

Also, for which platform? And can you use something different from PHP?

  1. @Cereal Yes, I can use something other than PHP, and I am on macOS.
  2. @benanamen The files are just a continuously changing smaller data dump from 3 larger data files. The serial number files always have one serial number per line. So every file is formatted with one serial number per line, like below:

    0_100_1_0_4
    0_200_1_0_4
    1_5_6_7_8
    12_22_45_33_2
    0_3_10_1_7

What I mean by 3 or 1 times is that I will accept a serial number if it is in the file only one time or if it is in the file 3 times, but I want to remove it for any other number of occurrences. Put another way: I want to keep it if it is in the file only one time or 3 times, and remove it if it occurs any other number of times.

Also, a full explanation of the big picture of what I am doing is there under EDIT. It is as simple as that. I take 10 files that each have millions of serial numbers within them, each serial number being on its own line as demonstrated above. The 10 file arrays are merged into one big array, and then the array_count_values function is used to give back all serial numbers that occur only 3 times or only 1 time in the larger merged array.

array_count_values is super slow when handling millions of numbers, so I am looking for an alternative to it. Someone suggested a regex on the string as an alternative.
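
One alternative that avoids building the giant merged array at all is to stream each file line by line and increment a plain associative array, then filter the counts. Whether it actually beats array_count_values on your data is something only a timing test will show. A minimal sketch, with placeholder file names:

    // Sketch: count serials by streaming the files instead of merging them first.
    $files = ['six.txt', 'seven.txt', 'thetext.txt'];   // placeholder list of the ten files

    $counts = [];
    foreach ($files as $file) {
        $fh = fopen($file, 'r');
        while (($line = fgets($fh)) !== false) {
            $serial = trim($line);
            if ($serial !== '') {
                $counts[$serial] = ($counts[$serial] ?? 0) + 1;
            }
        }
        fclose($fh);
    }

    // Keep only the serials seen exactly 1 or 3 times.
    $keep = array_keys(array_filter($counts, function ($n) {
        return $n === 1 || $n === 3;
    }));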

I wonder if this is just something we can get done with the Linux, UNIX, or your OS's command line.
Be aware that your top post described the data one way, but in a later post you write:

So every file is formatted one serial number per line like below:

For example "uniq" has a method to not only strip repeats but count them. See the -c option at https://www.computerhope.com/unix/uuniq.htm

OK, so let's say I run this file through uniq with -c. Now I can strip that down to the 1s and the 3s with any number of methods or commands.
Why code this?
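
If you wanted to drive that from PHP rather than by hand, the pipeline might look roughly like this. Note that uniq only collapses adjacent duplicates, so the file has to go through sort first; the awk step then keeps counts of exactly 1 or 3. The file name is a placeholder:

    // Sketch: sort so duplicates are adjacent, count with uniq -c, then keep
    // serials whose count is exactly 1 or 3. In the awk program, $1 is the
    // count column and $2 is the serial (awk fields, not PHP variables).
    $cmd = "sort merged.txt | uniq -c | awk '\$1 == 1 || \$1 == 3 {print \$2}'";
    $output = shell_exec($cmd);   // newline-separated list of the kept serials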

Ok, a bit more important info revealed. You don't seem to understand what I am asking.

Where do these 3 other data files come from? I am trying to find out what the actual source of all this data is. Where does it come from, and how do the newly revealed 3 bigger data files get their data? I don't know how to explain it any simpler.

Your explanation is an explanation of your ATTEMPT. I don't care about your attempt. This is an XY Problem.

Are these 3 bigger data files also text files? This whole thing really boils down to you not using the right tool for the job, which is a database. Anything else is just a hack.

@benanamen The larger data files are pulled from a database that I don't have access to and given to me in flat-file text format. Those 3 large databases never change, but the files that I create from them do -- constantly. The 10 files each hold 2-3 million random serial numbers, extracted from the large files with random array slices.

I already consulted with a database expert. Given that my project is not linked to a larger corporate environment (I have only my local computer and local Apache), that I am dealing with millions of numbers, and that I am in no way a database expert, it was they who said I would have to do it outside of a database.

According to them (and they have been 100% correct, in my opinion), millions of changing database rows without a ten-year database expert to oversee them and write the proper indexes was going to be ludicrously slow. Insertions alone were taking hours, because each file changes constantly with millions of numbers.

Flat files have been way speedier at my skill level. What I am doing is trivial (merging some numbers and using array_count_values to get back info), but doing that in a database with constant new insertions and correct indexes on millions of numbers is not trivial, according to the database guy.

@rproffitt Thank you for this suggestion; I will try this and get back to you. I am looking for any new way that might help me. I was coding this because sometimes I pass what is left over (the 3s or 1s) to other objects later in my script. In this particular case, I don't have to pass the info anywhere, just print it to a file.

@rproffitt Thanks for the idea! I timed it against array_count_values using approx. 1 million numbers and it was 2x slower than the PHP function -- but we tried.

Ok, that's the information I was looking for. I knew there had to be a database involved somewhere.

I don't know who this "expert" is you talked to, but this is a very simple matter for a database. Data handling is what they are made for. MySQL can easily handle millions and millions of records. In your case it is even simpler: you have a single column of data. I have to question this "expert" if he is telling you about inserts, which would indeed be slow. The proper method is to import the data.

If you can provide one of your data files I will show you how this can be done easily and painlessly as well as automated.

Unless I missed it, the one thing you have not mentioned is what you want to do with the results.
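
benanamen does not post his setup in this thread, but the usual shape of the import route he is describing is a one-column table, a bulk LOAD DATA import, and a grouped count. A rough sketch of that idea -- connection details, table and file names are made up, and LOAD DATA LOCAL INFILE has to be enabled on both the MySQL server and the client:

    // Sketch of the database route: one-column table, bulk import, grouped count.
    $pdo = new PDO('mysql:host=localhost;dbname=serials_db', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);

    $pdo->exec("CREATE TABLE IF NOT EXISTS serials (serial VARCHAR(64) NOT NULL)");
    $pdo->exec("TRUNCATE TABLE serials");

    // Bulk import is far faster than row-by-row INSERTs.
    $pdo->exec("LOAD DATA LOCAL INFILE '/path/to/merged.txt'
                INTO TABLE serials LINES TERMINATED BY '\\n' (serial)");

    // Serials that occur exactly 1 or 3 times.
    $stmt = $pdo->query("SELECT serial FROM serials GROUP BY serial HAVING COUNT(*) IN (1, 3)");
    $keep = $stmt->fetchAll(PDO::FETCH_COLUMN);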

@Charles. It was just one of those no-code ideas. The problem is that to benchmark this you have to run it twice and take the second run's time, to even out caching effects. Same for the PHP test.

Anyhow, if I had to do this, the commands would be my first release so the job is up and running. If it's just a matter of 5 minutes vs 10 minutes then you have to think about ROI. That is, to code it up to go faster, how much time will your programmer be using up?

@rproffitt - Unfortunately, I am doing all of the coding myself (LOL)... array_count_values (or uniq -c) has to be run something like 48,000 times in one script.

Right now it takes me 1 hour to run a script with array_count_values, and that's while using PHP pthreads; I have 10 worker threads fired up and going constantly.

I suspect that the uniq command line was much slower than PHP's array_count_values because the 10 threads were bottlenecking on access to the command line. It appears there is no way to give each thread its own private access to the command line like you can with resources. But maybe I'm wrong about that.

I am trying to get this figured out because it takes 1 hour to run one script, but I have 40 different scripts to run, each a slight variation in how it uses array_count_values. So that's 40 hours, which sucks for me.

@benanamen Thanks! I think I should do a better job of making clear what is going on and why it might be hard to use a database with only local computer use. I don't have a server to work off of; I am using easyPHP on my macOS laptop.

I have thousands of files that must be processed at one time. They are formatted with only one number per line, and there can be up to 2-3 million numbers in each file. I use a for loop to run thousands of iterations of array_merge (which merges 10 of these files together) and then array_count_values on the result to give back only the numbers that occur 3 times or 1 time.

Sometimes I just print out the results and sometimes they get fed to another object working in PHP parallel pthreads.

The thousands of files are constantly changing, which means that in easyPHP MySQL I would have thousands of columns that are constantly changing their total data sets. The files change according to other scripts that I have that dump their data into the files, based on specs I have for creating them from the non-changing large data files.

I have attached one of the files, cheers.

@benanamen I had to delete a lot of numbers from the text file to make it small enough to zip and upload to DaniWeb. So it is possible that you could get an answer where there are no numbers that occur 3 times or only once in the file.

EDIT: I see the download

Do you have Skype or any other instant messenger?

Just to give you an example, it took 0.1 seconds to get a list of all the serial numbers that occur only 1 or 3 times. This was run on a Windows 7 machine.

In the file you gave me, there are 43,663 serials that occur only 1 time and 9,786 that occur 3 times, out of 120,050 serials.

@benanamen
Very cool!!! On your machine, is it possible to copy and paste the text in the file 10 times so that there are a million numbers in the file? How long is it taking for you to insert or import the data into 10 columns, merge the data, and then do the 3-or-1 pull over a 24,000-iteration for loop? I am trying to get a sense of the 24,000 calculations that have to be done this way, with each new set of 10 files needing to be loaded into the database. Right now it is taking me 1 hour with PHP pthreads parallel programming.

I don't have an instant messenger on my macOS machine, but I could download one... not a problem for me to...

Edited post: wrote 2400 when it should be 24000

If you would, install Skype and send me your username so we can talk realtime. What country and timezone are you in?

You keep telling me the how (how you are attempting this) and not the why. I need to know all about the why so I can properly tell you the how.

What we have going on here is what is known as an XY Problem. Read about it here.

Don't worry, we will get this handled.

@C. Ahh, running threads from the command line is interesting. I don't have an Apple here so you may have to do some testing but...

On Linux you just type in the command and then add the ampersand to detach and let it run.
Example: https://www.tecmint.com/run-linux-command-process-in-background-detach-process/

On Windows it's nutty. You use the START command then your command.

On Apple? I think they are BSDish so try the same ampersand. Read https://www.cyberciti.biz/tips/freebsd-job-management.html
Another at http://commandlinemac.blogspot.com/2008/12/bash-job-control-fg-bg-jobs-and-ctrl-z.html but not as succinct.
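
As a rough PHP-side sketch of that ampersand trick (file names are placeholders): redirecting the output and appending "&" detaches each pipeline, so exec() returns immediately instead of serializing the workers behind one shell call.

    // Sketch: launch one sort|uniq|awk pipeline per file in the background.
    $files = ['six.txt', 'seven.txt', 'thetext.txt'];   // placeholder list

    foreach ($files as $i => $file) {
        $in  = escapeshellarg($file);
        $out = escapeshellarg("counts_$i.txt");
        // stdout goes to a file, stderr is discarded, and the trailing "&"
        // detaches the job so exec() does not wait for it to finish.
        exec("sort $in | uniq -c | awk '\$1 == 1 || \$1 == 3 {print \$2}' > $out 2>/dev/null &");
    }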

@benanamen I appreciated the link on XY problem! I admit that I am the noob stumbling through the problem by just parroting how I am doing things....I thank both you and
@rproffitt for continuing to help me.

I am going to pull my thoughts together and explain the broader picture of why I am doing things by editing the original post. In the meantime I will download Skype messenger and send you my username.

commented: Thanks. Sharing alternative ideas. You'll get there IMO because your mind is open. +12

@rproffitt Cheers for these pointers, I'll work on adding them and let you know what happened.

EDIT to original post to answer the question of why:

The broad picture of what my goal is ---

I have 17 million serial numbers that were given to me in one flat text file (going forward we will refer to this file as 'BIGFILE').

I do not have access to the database that they came from, and the 17 million serial numbers never change.

I have to extract 2 million random serial numbers from the BIGFILE, between 2 and 25 times depending on what I am looking for at the moment. Sometimes I only need 5 files each with 2 million random numbers extracted from the BIGFILE and sometimes I need 25 files each with 2 million random numbers extracted from the BIGFILE. None of the newly extracted files are exactly the same.

After the smaller files are created (let's use 10 in this example), I need to know which numbers appear amongst the 10 files a certain number of times. For example, I could want to know which numbers appear 3 or 4 times within the 10 files, and print those numbers to be used later in some other program that is beyond this scope.
Frequently used conditions are <15, >=3, 5 or 6, ==12, and >=3 && <=10.

I must do these calls thousands of times, sometimes as many as 50,000 times in one sitting, and I am looking for the quickest way possible to get the info.
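
Since the condition keeps changing from run to run, one way to avoid forty near-identical scripts is to keep a single counting function and pass the condition in as a closure. A small sketch along the lines of the streaming counter earlier in the thread (the function name and file list are made up):

    // Sketch: one counting pass, with the keep/drop rule supplied as a closure.
    function filterByCount(array $files, callable $keepIf): array
    {
        $counts = [];
        foreach ($files as $file) {
            foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
                $serial = trim($line);
                $counts[$serial] = ($counts[$serial] ?? 0) + 1;
            }
        }
        return array_keys(array_filter($counts, $keepIf));
    }

    $tenFiles = ['six.txt', 'seven.txt', 'thetext.txt'];   // placeholder list

    // A couple of the conditions mentioned above:
    $threeOrFour = filterByCount($tenFiles, function ($n) { return $n == 3 || $n == 4; });
    $threeToTen  = filterByCount($tenFiles, function ($n) { return $n >= 3 && $n <= 10; });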

If you have your Skype set up, I am available right now. Your last post made it not really necessary now, but it would still be helpful.

From the info you just provided I see this is actually a very simple task to do as well as automate.

There are many free large file transfer sites available. Upload a zip of your "Big File" and get me the link. You could also use Dropbox if you have the space.

I will make you something really, well, depending on when you grew up, neato, keen, groovy, smokin', rad, dandy, bitchin', dope, far out, happening, super, most excellent, etc., etc.

FYI: This last post is more like what should have been the first post. It's all good though.

I have 17 million serial numbers that were given to me in one flat text file (going forward we will refer to this file as 'BIGFILE'). I do not have access to the database that they came from and the 17 million serial numbers, never change.

All of which sounds very dodgy indeed. Care to share the nature of these 'serial numbers' to assure the community that it is not helping you in some illicit activity here?

@rproffitt Thank you so much for the FreeBSD job management links!!!! I sent the worker threads to their own command lines and my code sped up from one hour total time to 8 minutes total execution time. A very impressive speed hike. I thank you for that.

I've also started going over the MySQL tutorials, and I have a book, so that I can set up a quick speed test against my PHP coding to see if basic MySQL can do any better than 8 minutes. Very curious indeed to see how well MySQL can power through on basic, run-of-the-mill indexes.

@happygeek
They are not serial numbers in the sense of business use; they are numbers that I generated for my own personal project. I call all numbers that I personally generate serial numbers. I can generate "serial numbers" all day long in the good old USA and use them in my personal projects. If they were REAL serial numbers to something, I certainly would have never shared them with the ENTIRE DaniWeb community. I am not brain dead.

I do regret calling them serial numbers because scammers on these boards might actually think someone would be stupid enough to have real serial numbers and ask for help so that they can see the numbers and put spyware on their computers to get the numbers by sending them downloads.

This board has been most helpful to me! I learned that the term "serial numbers", although not a big deal to me, can be taken very seriously by scammers, and I need to simply change serial numbers to first and last names when asking for help, to weed out any potential scammers asking why, why, why, why and hiding behind the XY problem. Of course they are going to ask why; they want to know if I am stupid enough to give out "real serial numbers".

If anyone is dodgy here, it is people asking repeatedly to access my computer and to be given personal papers with names and phone contacts on them, when I have said no many times because that is insecure. Why don't you talk about that, @happygeek?

But it has a happy ending because, regardless of a few hasslers on this forum, I got my schoolwork sped up quite nicely.

@Charles.

Just a few weeks ago I had to update a factory app I wrote years ago that leverages CPUs and threads (I have to handle both, since some machines have so many cores and some cores have more than one thread available).

It was nearly impossible to scale the app as a single app, but when I used a system to launch copies that each dealt with one update, it started to, in non-computer terms, "hum."

For each OS, the method you use to "spread the load" can vary but once you see the power (of the dark side?) you can't ignore it and will look like a powerful dark lord to some. (all puns and refs to today's lore intended.)

This board has been most helpful to me! I learned that the term "serial numbers", although not a big deal to me, can be taken very seriously by scammers, and I need to simply change serial numbers to first and last names when asking for help, to weed out any potential scammers asking why, why, why, why and hiding behind the XY problem. Of course they are going to ask why; they want to know if I am stupid enough to give out "real serial numbers".

Uh :o are you sure it was not just people being too young and eager to help? Apart from Skype & co., I had a few questions too about your task because, for example, it was not clear enough to me how the input was structured (basically type and size: alphanumeric, integer, with spaces, split into multiple files and to be considered as a whole, or repetitions and uniqueness for each file...) and how to output. For example, depending on the method, awk from the command line could work fine to identify repetitions, along with rproffitt's suggestion about uniq.
