LinuxQuestions.org

Mithrilhall 12-10-2010 08:31 AM

Splitting text file into multiple files
 
I have a text file that is filled with references to duplicate files.

I'm trying to create a text file for each duplicate file found that contains the paths to the duplicates. I would also like the text file names to be based on the size and file name.

Something like:
231.5 KB - P&S.doc.txt
138.5 KB - LIMITED#C71.doc.txt


If someone could point me in the right direction I would greatly appreciate it.

Code:

Name        Path        Size        Last Change        Last Access        File Type        Owner        Attributes
P&S.doc        (3 Files)                                               
  P&S.doc        Z:\Leg\_Pri_Leg\Pur\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C
  P&S.doc        Z:\Leg\_Pri_Leg\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C
  P&S.doc        Z:\Leg\_Pri_Leg\Props\Pur\P&S\BUY\Barry V\        231.5 KB        11/2/2001 4:07 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C
LIMITED#C71.doc        (2 Files)                                               
  LIMITED#C71.doc        Z:\Leg\_Pri_Leg\Pur\CV\        138.5 KB        12/15/2003 1:04 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C
  LIMITED#C71.doc        Z:\Leg\_Pri_Leg\Props\Pur\CV\        138.5 KB        12/15/2003 1:04 PM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C
ps revised.8.30.05.clean.doc        (3 Files)                                               
  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\Props\Pur\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C
  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        DMs        C
  ps revised.8.30.05.clean.doc        Z:\Leg\_Pri_Leg\Pur\P&S\Sell\VP\Summit\        54.5 KB        8/31/2005 11:46 AM        11/22/2010 2:38 AM        .doc (Microsoft Office Word 97 - 2003 Document)        Lou_A        C
Copy of 08 Lee All July Billing.xls        (2 Files)                                               
  Copy of 08 Lee All July Billing.xls        Z:\IS\_Sh_IS\Dev\Doc\Docl 26 upgrade\AS6 backup code\APImport\        131.5 KB        7/30/2010 12:11 PM        11/22/2010 2:38 AM        .xls (Microsoft Office Excel 97-2003 Worksheet)        Administrators        C
  Copy of 08 Lee All July Billing.xls        Z:\AP\Kellie\        131.5 KB        7/30/2010 10:03 AM        11/22/2010 2:38 AM        .xls (Microsoft Office Excel 97-2003 Worksheet)        Kellie        C
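
To make the layout of the listing above concrete, here is a minimal Python sketch of how a group header could be recognised and how the requested output name could be built from one entry. It assumes the columns are separated by tabs in the real export (the listing shows them as runs of spaces) and that a header's second column always looks like "(N Files)". The helper names parse_group_header and output_name are hypothetical, not part of any tool.

```python
import re

# Assumption: the second column of a group header always looks like "(3 Files)".
HEADER_RE = re.compile(r"^\((\d+) Files\)$")


def parse_group_header(line):
    """Return (name, count) for a group header line, or None for any other line."""
    if line.startswith(" "):
        return None  # indented lines are the individual duplicate entries
    fields = line.rstrip("\r\n").split("\t")  # assumption: tab-separated columns
    if len(fields) < 2:
        return None
    match = HEADER_RE.match(fields[1].strip())
    if match is None:
        return None  # for example the column-title line
    return fields[0], int(match.group(1))


def output_name(size, name):
    """Build the requested output file name from the size and name columns."""
    return f"{size} - {name}.txt"


header = "P&S.doc\t(3 Files)\t\t\t\t\t\t"
entry = "  P&S.doc\tZ:\\Leg\\_Pri_Leg\\P&S\\BUY\\Barry V\\\t231.5 KB\t11/2/2001 4:07 PM"

print(parse_group_header(header))                 # ('P&S.doc', 3)
fields = entry.split("\t")
print(output_name(fields[2], fields[0].strip()))  # 231.5 KB - P&S.doc.txt
```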


Snark1994 12-10-2010 08:54 AM

Do you have any previous programming experience? The problem you describe could be solved fairly easily in any number of languages. I would go for Python (or Perl, if you prefer, or even Bash, if you're strangely masochistic).

And I assume that that was an extract you posted, rather than the whole file, or it would be far far quicker to just do it by hand ;)

Mithrilhall 12-10-2010 08:58 AM

I do have some programming experience. I'm familiar with C, C++, VB 6, PHP, ASP, COBOL... but I haven't coded in a while. I have taken a look at Python in the past.

And you are correct, that is only a partial sample of the file in question. The file is a 50 MB text file.

colucix 12-10-2010 09:24 AM

A good job for awk. Example:
Code:

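# Requires GNU awk (gawk) for gensub().
BEGIN {
  FS = "\t"   # columns are tab-separated
  getline     # skip the column-title line
}

# Group header lines are the ones that do not start with a space.
!/^ / {
  dupname = $1
  # "(3 Files)" -> 3
  ndup = gensub(/\(| Files\)/, "", "g", $2)
  for ( i = 1; i <= ndup; i++ ) {
    getline   # read the next duplicate entry
    file = ( $3 " - " dupname ".txt" )
    print $2 dupname >> file
    close(file)   # avoid running out of open files on large inputs
  }
}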
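
For comparison, the same split can be sketched in Python. This is not a translation of the awk script: instead of counting entries from the "(N Files)" header, it treats every indented line as an entry and groups entries by size and name. It assumes tab-separated columns in the order name, path, size, as the awk code does. The function name split_duplicates and the sample rows are illustrative only.

```python
import os
import tempfile
from collections import defaultdict


def split_duplicates(lines, out_dir):
    """Write one '<size> - <name>.txt' file per duplicate group into out_dir."""
    groups = defaultdict(list)
    for line in lines:
        if not line.startswith(" "):
            continue  # skip the column titles and the "(N Files)" group headers
        fields = line.rstrip("\r\n").split("\t")  # assumption: tab-separated
        if len(fields) < 3:
            continue  # not a complete entry
        name, path, size = fields[0].strip(), fields[1], fields[2]
        groups[f"{size} - {name}.txt"].append(path + name)
    for file_name, paths in groups.items():
        with open(os.path.join(out_dir, file_name), "w", encoding="utf-8") as out:
            out.write("\n".join(paths) + "\n")
    return sorted(groups)


# Shortened sample rows (only the first three columns) for a quick trial run.
sample = [
    "Name\tPath\tSize\n",
    "LIMITED#C71.doc\t(2 Files)\t\n",
    "  LIMITED#C71.doc\tZ:\\Leg\\_Pri_Leg\\Pur\\CV\\\t138.5 KB\n",
    "  LIMITED#C71.doc\tZ:\\Leg\\_Pri_Leg\\Props\\Pur\\CV\\\t138.5 KB\n",
]

with tempfile.TemporaryDirectory() as tmp:
    print(split_duplicates(sample, tmp))
```

On the real file, the function could be called with the open file object as lines and the target directory as out_dir. The encoding of the export is not stated in the thread, so it would need checking first.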


Mithrilhall 12-10-2010 09:51 AM

Colucix, thanks for the link.

The text you provided, is it a filter for awk?



I think I see how it is used:

Code:

awk -f (your file) input_file

colucix 12-10-2010 10:32 AM

Quote:

Originally Posted by Mithrilhall (Post 4187107)
The text you provided, is it a filter for awk?

I don't know what you mean by filter. It is simply a piece of awk code.
Quote:

Originally Posted by Mithrilhall (Post 4187107)
I think I see how it is used:

Code:

awk -f (your file) input_file

Exactly! :)

awk is a very powerful tool for parsing and extracting information from text files. However, if you want to learn a new language or refresh an old one, Python is more complete since, as you already know, it offers a huge collection of libraries for a large variety of tasks. Anyway, you normally don't need to write complicated awk programs; you can treat awk as a handy command-line utility and limit your learning to the basics.


All times are GMT -5.