Hi there,

I have a file with 500.000 columns and 300.000 lines. The format looks like this:

ColXX ColWW ColQQ ColTT ... ColEE
H1 G1 H1 K1 ... L1
G1 H1 K1 L1 ... O1
.
.
.

Based on the information in the first line (ColXX to ColEE), what is the best way to take specific columns from the file and write them to another file?

I tried transposing the file and selecting lines, but then I need to transpose again to get back to the original layout. Can anyone help me, please?

Thanks a lot!

I'm not sure what information this first line should give?

Hi Nick Evan,

The first line holds the column IDs. I will read a file containing the column IDs I want and, based on that, select just those specific columns.

For example: ColQQ and ColEE -> I will get just 2 columns.

The first row is the column header.

Have you looked at a split() function (depending on which flavor of C++ you're using)? Standard C++ has no built-in split(), but Boost and .NET each provide one.

Also look at strtok().

If you're using .NET, just use the Split() method on the string, giving it your list of separators, which in this case will be just a space.
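For anyone on plain standard C++, the same idea can be sketched without split() or strtok(). This is only a minimal sketch: the helper names split_fields and pick_fields are made up for illustration, and it assumes the columns are separated by spaces or tabs.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split one line into whitespace-separated fields (standard C++ only).
std::vector<std::string> split_fields(const std::string& line)
{
    std::istringstream iss(line);
    std::vector<std::string> fields;
    std::string field;
    while (iss >> field)               // operator>> skips runs of spaces and tabs
        fields.push_back(field);
    return fields;
}

// Keep only the fields at the given zero-based positions, in that order.
std::vector<std::string> pick_fields(const std::vector<std::string>& fields,
                                     const std::vector<std::size_t>& positions)
{
    std::vector<std::string> picked;
    for (std::size_t pos : positions) {
        assert(pos < fields.size());   // caller must pass valid positions
        picked.push_back(fields[pos]);
    }
    return picked;
}
```

Calling pick_fields(split_fields("H1 G1 H1 K1 L1"), {1, 3}) keeps the second and fourth fields, G1 and K1.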

/****************************************************************
	Even if you are using a different flavor of C++,
	you should be able to read and understand what this does.
	It takes a space-delimited text file with five columns
	and converts it to a tab-delimited text file with two
	columns (taking the second and fourth columns from the input)
*****************************************************************/
// DW_377578.cpp : main project file.
#include "stdafx.h"

using namespace System;
using namespace System::IO;

int main(array<System::String ^> ^args)
{
	 try
	 {
		 StreamReader^ fileIn = gcnew StreamReader("c:/science/test.txt");
		 StreamWriter^ fileOut = gcnew StreamWriter("c:/science/test_out.txt");

		 String^ strInData = "";
		 String^ strOutData = "";
		 //
		 array<String^>^ arr_strInData = gcnew array<String^>(1);

		 if(!fileIn->EndOfStream)
		 {
			strInData = fileIn->ReadLine(); // burn the header
		 }

		 fileOut->WriteLine("ColWW\tColTT");//new header

		 while(!fileIn->EndOfStream)
		 {
			 strInData = fileIn->ReadLine();
			 arr_strInData = strInData->Split(' ');//separate columns by space

			 if(arr_strInData->Length >4) // must be the right number of columns
			 {
				 strOutData = String::Format("{0}\t{1}",
					 arr_strInData[1], // take column 2
					 arr_strInData[3]  // take column 4
				 );

				 Console::WriteLine(strOutData);	// Show it on the console
				 fileOut->WriteLine(strOutData);	// Write to output file
			 }
		 }

		 fileOut->Close();
		 fileIn->Close();
	 }
	 catch(Exception^ exc)
	 {
		 Console::WriteLine(exc->Message);
	 }

    return 0;
}

Hi thines01, thanks a lot for your attention.

I'm planning to fill a vector from a file containing the column IDs and use it in place of:

fileOut->WriteLine("ColWW\tColTT");//new header

What do you think? I have many columns (500.000). Thanks for your opinion!

Is that five-hundred or five-hundred-thousand?
COLUMNS?! WOW!

Well, it would just take a little management to go in and grab 255 (or so) columns at a time and handle those.

thines01,

five-hundred-thousand ... :)

Do you think that a vector containing new headers will work well?

Thanks!

Yes, but you will need to process just part of the data at a time.
I'm not sure of the max size of a vector, but logic dictates this will be done in chunks.
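The vector-of-headers idea could look roughly like the sketch below in standard C++. The function name select_columns and its parameters are hypothetical, the wanted IDs are assumed to be already loaded into a vector, and the output is tab-delimited to match the earlier sample. It works one row at a time, so only a single row is ever held in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Copy only the columns named in `wanted` from `in` to `out`.
// Returns the number of data rows written.
std::size_t select_columns(std::istream& in,
                           const std::vector<std::string>& wanted,
                           std::ostream& out)
{
    assert(!wanted.empty());                // nothing to select otherwise

    std::string line;
    if (!std::getline(in, line))
        throw std::runtime_error("input has no header line");

    // Map each header name to its zero-based position.
    std::unordered_map<std::string, std::size_t> position;
    std::istringstream header(line);
    std::string name;
    for (std::size_t i = 0; header >> name; ++i)
        position.emplace(name, i);          // first occurrence wins on duplicates

    // Translate the wanted IDs into positions and write the new header.
    std::vector<std::size_t> keep;
    std::size_t max_keep = 0;
    for (std::size_t k = 0; k < wanted.size(); ++k) {
        auto it = position.find(wanted[k]);
        if (it == position.end())
            throw std::runtime_error("unknown column: " + wanted[k]);
        keep.push_back(it->second);
        if (it->second > max_keep) max_keep = it->second;
        out << (k ? "\t" : "") << wanted[k];
    }
    out << '\n';

    // Stream the data rows; only one row is held in memory at a time.
    std::size_t rows = 0;
    std::vector<std::string> fields;
    while (std::getline(in, line)) {
        fields.clear();
        std::istringstream row(line);
        for (std::string f; row >> f; )
            fields.push_back(f);
        if (fields.size() <= max_keep)
            continue;                       // skip blank or short rows
        for (std::size_t k = 0; k < keep.size(); ++k)
            out << (k ? "\t" : "") << fields[keep[k]];
        out << '\n';
        ++rows;
    }
    return rows;
}
```

With real files you would pass a std::ifstream and a std::ofstream. Splitting every row into 500,000 strings is simple but not fast, so treat this as a starting point rather than a tuned solution.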

Are all the columns the same width? What is the width of each column (number of characters)? If each column were 2 characters wide, then 500,000 columns would occupy about 1 MB of memory per row. But if you know before reading the file that you only need two or three of those columns, then those are the only ones you need to keep in memory. Reading that huge file into memory will be ungodly sloooooow; the actual time will depend on the operating system and the disk access time of the hardware. *nix is faster at disk access than MS-Windows, from what I saw in my tests about 5 years ago. So if speed is any concern to you at all, then you might want to do this on a *nix machine instead of MS-Windows.

If all the columns are exactly the same width, with the same number of spaces or tabs between columns, then you might be able to treat the file as a binary file and use random access techniques on it. That way you could directly access any given row and column without having to read the whole file serially. This will not work though if some of the columns have varying widths.
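A minimal sketch of that random-access idea in standard C++ follows. The helper names read_at and demo_second_cell are made up for illustration. It assumes you already know the byte offset of the cell you want, and that a real file would be opened with std::ios::binary so that offsets count raw bytes (CR/LF included).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read `width` bytes starting at absolute byte position `offset`.
std::string read_at(std::istream& in, std::uint64_t offset, std::size_t width)
{
    assert(width > 0);
    in.clear();                                    // reset any earlier EOF/fail state
    in.seekg(static_cast<std::streamoff>(offset), std::ios::beg);
    std::string cell(width, '\0');
    in.read(&cell[0], static_cast<std::streamsize>(width));
    if (in.gcount() != static_cast<std::streamsize>(width))
        throw std::runtime_error("offset lies outside the file");
    return cell;
}

// Usage on a small in-memory "file" with CR/LF line endings.
std::string demo_second_cell()
{
    std::istringstream file("AA BB CC\r\nH1 G1 K1\r\nG1 H1 L1\r\n");
    // The header line is 10 bytes including CR/LF; skipping "H1 " adds 3 more.
    return read_at(file, 13, 2);                   // returns "G1"
}
```

Nothing before the requested cell is read, which is the whole point: the cost of fetching one cell no longer depends on where it sits in the file.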

Hi friends,

Each column is 2 characters wide and the columns are equally spaced. I can split the file and work with multiple files to reduce the dimensions. What do you think? Do you have a simple example of taking specific columns from a binary file? Do I need to use strtok() or split()? As I understand it, those will just break up the row. How can I associate a variable with its column?

Thanks a lot! The comments were amazing!

The first row in the file is 500,000*4 or 2,000,000 bytes long (taking each header entry as 4 bytes). So the first column of the first data row starts at position 2,000,002 (after the 2-byte CR/LF line terminator). The beginning of the second column of the first data row would be 2,000,002 + 3 (2 characters + 1 space). The beginning of the first column of the second row would be 2,000,002 + (3 * 500,000) + 2 = 3,500,004.

Assuming 300,000 data rows, the beginning of the first column on the last data row would be 2,000,002 + (299,999 * 3 * 500,000) + (299,999 * 2) = 450,001,100,000; add 499,999 * 3 to reach the last column of that row. That is far beyond the range of a 32-bit integer, so you will need to use a 64-bit integer.

All other calculations are similar to those I described above. If I were you I would write a very small program to code and test the algorithm for many different column numbers. Once you get the algorithm and formula correct you can add them to your larger program.
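As a starting point for that small test program, the arithmetic above can be written as one function. This is only a sketch: cell_offset and its parameter names are hypothetical, and it follows the same assumptions as the calculation above, namely that every column (including the last) occupies its width plus one separator byte and that every line ends with the same terminator. If your rows have no trailing space, each row is one byte shorter, and the header length is better measured from the real file than assumed.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset of the cell at zero-based (row, col) among the data rows.
// header_bytes is the header text length WITHOUT its line terminator.
std::uint64_t cell_offset(std::uint64_t header_bytes,
                          std::uint64_t n_cols,
                          std::uint64_t cell_width,
                          std::uint64_t eol_width,   // 2 for CR/LF, 1 for LF
                          std::uint64_t row,
                          std::uint64_t col)
{
    assert(col < n_cols);
    const std::uint64_t slot = cell_width + 1;                  // cell plus one separator
    const std::uint64_t row_bytes = n_cols * slot + eol_width;  // one full data line
    return header_bytes + eol_width + row * row_bytes + col * slot;
}
```

With header_bytes = 2,000,000, n_cols = 500,000, cell_width = 2 and eol_width = 2, this gives 2,000,002 for row 0, column 0 and 3,500,004 for row 1, column 0, matching the figures worked out by hand. Using std::uint64_t throughout keeps the multiplication from overflowing.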
