LinuxQuestions.org - [SOLVED] Splitting text file into multiple files

Splitting text file into multiple files

I have a text file that is filled with references to duplicate files.

I'm trying to create a text file for each duplicate file found that contains the paths to the duplicates. I would also like the text file names to be based on the size and file name.

Something like:
231.5 KB - P&S.doc.txt
138.5 KB - LIMITED#C71.doc.txt

If someone could point me in the right direction I would greatly appreciate it.

Code:

Name        Path        Size        Last Change        Last Access        File Type        Owner        Attributes

P&S.doc        (3 Files)                                                

  P&S.doc        Z:\Leg\_Pri_Leg\Pur\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C

  P&S.doc        Z:\Leg\_Pri_Leg\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C

  P&S.doc        Z:\Leg\_Pri_Leg\Props\Pur\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C

LIMITED#C71.doc        (2 Files)                                                

  LIMITED#C71.doc        Z:\Leg\_Pri_Leg\Pur\CV\        138.5 KB        12/15/2003 1:04 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C

  LIMITED#C71.doc        Z:\Leg\_Pri_Leg\Props\Pur\CV\        138.5 KB        12/15/2003 1:04 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C

ps revised.8.30.05.clean.doc        (3 Files)                                                

  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\Props\Pur\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C

  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C

  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\Pur\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C

Copy of 08 Lee All July Billing.xls        (2 Files)                                                

  Copy of 08 Lee All July Billing.xls        Z:\IS\_Sh_IS\Dev\Doc\Docl 26 upgrade\AS6 backup code\APImport\        131.5 KB        7/30/2010 12:11 PM        11/22/2010 2:38 AM        .xls (Microsoft Office Excel 97-2003 Worksheet)        Administrators        C

  Copy of 08 Lee All July Billing.xls        Z:\AP\Kellie\        131.5 KB        7/30/2010 10:03 AM        11/22/2010 2:38 AM        .xls (Microsoft Office Excel 97-2003 Worksheet)        Kellie        C

Do you have any previous programming experience? The problem you describe could be solved fairly easily in any number of languages. I would go for Python (or Perl, if you prefer, or even Bash, if you're strangely masochistic).

And I assume that that was an extract you posted, rather than the whole file, or it would be far far quicker to just do it by hand ;)

I do have some programming experience. I'm familiar with C, C++, VB 6, PHP, ASP, COBOL... but I haven't coded in a while. I have taken a look at Python in the past.

And you are correct, that is only a partial sample of the file in question. The file is a 50 MB text file.

A good job for awk. Example:

Code:

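# Requires GNU awk (gawk): gensub() is a gawk extension.
BEGIN {
  FS = "\t"   # fields are tab-separated
  getline     # read and discard the column-header line
}

# A line that does not start with a space opens a new group of duplicates.
!/^ / {
  dupname = $1
  # Strip the decoration from the count field: "(3 Files)" -> "3"
  ndup = gensub(/\(| Files\)/, "", "g", $2)
  for ( i = 1; i <= ndup; i++ ) {
    if ( (getline) <= 0 ) break          # stop at end of input
    file = ( $3 " - " dupname ".txt" )   # e.g. "231.5 KB - P&S.doc.txt"
    print $2 dupname >> file             # full path of this copy
  }
  close(file)   # release the output file before the next group
}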

Colucix, thanks for the link.

The text you provided, is it a filter for awk?

I think I see how it is used:

Code:

awk -f (your file) input_file

Quote:

Originally Posted by Mithrilhall (Post 4187107)

The text you provided, is it a filter for awk?

I don't know what you mean by filter. It is simply a piece of awk code.

Quote:

Originally Posted by Mithrilhall (Post 4187107)

I think I see how it is used:

Code:

awk -f (your file) input_file

Exactly! :)

awk is a very powerful tool for parsing and extracting information from text files. However, if you want to learn a new language or refresh an old one, Python is more complete since, as you already know, it offers a huge collection of libraries for a large variety of tasks. Anyway, you normally don't need to develop complicated awk programs; you can treat awk as a handy command-line utility and limit your learning to the basics.
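
For comparison, here is a rough Python sketch of the same logic as the awk program earlier in the thread. It assumes the report is tab-separated with no blank lines between a group line and its detail rows, and that each group line carries a "(N Files)" count in its second field. The function name `split_duplicates` and its `out_dir` argument are made up for illustration; they are not from the thread.

```python
import re
from pathlib import Path


def split_duplicates(lines, out_dir):
    """Write one '<size> - <name>.txt' file per duplicate group.

    Mirrors the awk approach: skip the column-header line, treat every line
    that does not start with a space as a group line, then read as many
    detail rows as its '(N Files)' field announces.
    Returns the list of files written, in first-seen order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(lines)
    next(rows, None)  # discard the column-header line
    written = []
    for line in rows:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith(" "):
            continue  # blank line or a detail row outside any group
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        match = re.search(r"\((\d+) Files\)", fields[1])
        if match is None:
            continue  # not a group line in the assumed format
        name = fields[0]
        for _ in range(int(match.group(1))):
            detail = next(rows, None)
            if detail is None:
                break  # input ended before the announced count
            cols = detail.rstrip("\r\n").split("\t")
            if len(cols) < 3:
                continue  # malformed detail row
            path, size = cols[1], cols[2]
            target = out_dir / f"{size} - {name}.txt"
            # Append, like the awk '>>' redirection.
            with open(target, "a", encoding="utf-8") as handle:
                handle.write(path + name + "\n")
            if target not in written:
                written.append(target)
    return written
```

You could call it with an open file object, for example `split_duplicates(open("report.txt"), "out")`, where `report.txt` and `out` are placeholder names. Because it appends, running it twice over the same report adds the paths a second time, just as the awk version would.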