LinuxQuestions.org
Old 09-10-2012, 12:29 AM   #1
mrpurple
Member
 
Registered: May 2010
Posts: 50

Rep: Reputation: 1
using awk to get blocks of data from a text file


I want to extract multiline blocks of data from a text file into an array of multiline strings, one block per element. The data file looks like
Code:
[DATATYPE1]
multiple unknown number of lines
of data that I dont want
[END-type1]

[DATATYPE2]
multiple unknown number of lines of
data that I do want
[END]

[DATATYPE3]
multiple unknown number of lines
of data that I dont want
[END]

[DATATYPE2]
another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable
[END]
I want the lines from each block I need placed together in a single element of a string array. So, in the example above, the result would be something like
Code:
$dataarray[1]="multiple unknown number of lines of
data that I do want"

$dataarray[2]="another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable"
I made a regex that can find the instances of my desired data blocks in a multiline string, so I started a bash script to process my text file, but I can't pull out the data blocks or put them into an array. Here's what I have:
Code:
# /bin/sh
echo "Reading file $1"
blockregex="\[DATATYPE2].+?\[END]"
$dataarray=$(awk 'BEGIN {FS="\n" RS=""} /$blockregex/ { print $0 }' $1)
It's probably really simple, but these multiline requirements mean I can't simply cut and paste a simple solution from elsewhere, nor can I figure out what's wrong with my code.
 
Old 09-10-2012, 01:43 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1948
I'm confused. Are you trying to extract the lines with awk for a shell script array, or trying to set the lines in an awk array, or what? Could you explain the context for your request a bit more?

In awk, you'd probably have to set the RS to a string that matches each block, then further process it to exclude the lines you don't want, perhaps with sub/gsub. After that, it would depend on what you want to do with it.
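As a rough sketch of the awk route: rather than a block-matching RS, the version below uses awk's paragraph mode (RS=""), which is what the first post was reaching for. It assumes the blocks are separated by blank lines, as in the sample data, and that the ASCII record separator character (octal 036) never appears in the file. The sample file and its contents are made up for illustration.

```bash
#!/bin/bash
# Hypothetical sample file mirroring the layout from the first post.
datafile=$(mktemp)
cat > "$datafile" <<'EOF'
[DATATYPE1]
unwanted line
[END-type1]

[DATATYPE2]
first wanted line
second wanted line
[END]

[DATATYPE2]
another wanted line
[END]
EOF

dataarray=()
# RS="" puts awk in paragraph mode: each blank-line-separated block is one record.
# FS="\n" makes each line a field, so $1 is the header and $NF the END marker.
while IFS= read -r -d $'\036' block; do
    dataarray+=( "$block" )
done < <(awk 'BEGIN { RS = ""; FS = "\n" }
    $1 == "[DATATYPE2]" {
        body = ""
        for (i = 2; i < NF; i++) {        # skip header ($1) and END line ($NF)
            sep = (i > 2) ? "\n" : ""
            body = body sep $i
        }
        printf "%s\036", body             # \036 = ASCII record separator
    }' "$datafile")
rm -f "$datafile"

# Show each element in brackets so the embedded newlines are visible.
printf '[%s]\n' "${dataarray[@]}"
```

Each wanted block ends up in its own array element with its line breaks intact. If the blocks are not blank-line separated, paragraph mode will not split them correctly and a different record separator is needed.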

Code:
$dataarray[1]="text"
This doesn't match either awk or shell syntax. There is generally no $ at the front when setting a variable.
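For comparison, here is a minimal bash sketch of the assignment. The element text is just the sample data from the first post. A bash array element is an ordinary string, so it can hold newlines as long as the expansion is quoted when it is used.

```bash
#!/bin/bash
dataarray=()

# No leading $ when assigning; the quoted string may span several lines.
dataarray[1]="multiple unknown number of lines of
data that I do want"

# The $ and braces are only needed when expanding. Keep the quotes,
# otherwise word splitting turns the newline into a space.
printf '%s\n' "${dataarray[1]}"
```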
 
1 members found this post helpful.
Old 09-10-2012, 02:30 AM   #3
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
First, thanks for looking. Second, sorry: please disregard the description above of what I am trying to achieve, as it is unclear and uses incorrect syntax.

To be more clear: I do want to extract the desired lines into a shell array. I was hoping for one block of lines at each position in the array.

In the example above I would expect to be able to add the following to the end of the script
Code:
echo "${dataarray[1]}" > firstblock.txt
cat firstblock.txt
which on execution in bash would result in
Code:
multiple unknown number of lines of
data that I do want
Can I have multiple lines inside a string array like this? Or perhaps by adding a dimension to the array somehow?
Also, I'd rather do it by adding dimensions than by adding functions.

Again, sorry for being unclear. I would have thought that pulling blocks of multi-line text from a text file against a regex would be common as dirt, but I'm having real trouble figuring it out. There are lots of hits in Google, but they're either far too complicated for me to follow for my "simple" problem or too far off topic.

Last edited by mrpurple; 09-10-2012 at 02:32 AM. Reason: spelling
 
Old 09-10-2012, 05:17 AM   #4
ip_address
Member
 
Registered: Apr 2012
Distribution: RedHat
Posts: 42

Rep: Reputation: 2
Maybe this helps:

Put the contents in a text file named "solve_problem.txt"

Code:
more solve_problem.txt 

[DATATYPE1]
multiple unknown number of lines
of data that I dont want
[END-type1]

[DATATYPE2]
multiple unknown number of lines of
data that I do want
[END]

[DATATYPE3]
multiple unknown number of lines
of data that I dont want
[END]

[DATATYPE2]
another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable
[END]
and try using this bash script

Code:
#!/bin/bash

#text file to be processed
filename='solve_problem.txt'

#create a temporary file
temp_file='temp.txt'

#initialize index
echo "0" > "$temp_file"

#generate index numbers
sed -n '/\[DATATYPE2\]/,/\[END\]/p' "$filename" | sed '/^\[DATATYPE2\]/d' | grep -n "\[END\]" | awk -F: '{print $1}' >> "$temp_file"

#extract portions of the file
sed -n '/\[DATATYPE2\]/,/\[END\]/p' "$filename" | sed '/^\[DATATYPE2\]/d' > temp_"$filename"

#number of lines to be processed
loop=`wc -l < "$temp_file"`

#initialize a counter
counter2=1


for ((counter=1;counter<="$loop"-1;counter++));do

counter1=$((counter+1))

#get lines from temp file
first=`sed -n "$counter"p "$temp_file"`
sec=`sed -n "$counter1"p "$temp_file"`

#adjustments to extract only the desired part
first=$((first+1))

sec=$((sec-1))

#put the contents in array
array_var[counter2]=`sed -n "$first","$sec"p temp_"$filename"`

#display the array variable
echo ""
echo "${array_var[counter2]}"
echo ""
echo "********************"
counter2=$((counter2+1))

done

#remove only the two temporary files created above
rm -f "$temp_file" temp_"$filename"
It will create an array named "array_var" whose contents can be accessed using echo "${array_var[n]}" where n is the nth element in the array
 
1 members found this post helpful.
Old 09-10-2012, 09:34 AM   #5
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 36
Quote:
Originally Posted by mrpurple View Post
To be more clear: I do want to extract the desired lines into a shell array. I was hoping for one block of lines at each position in the array.
Perhaps something like this. First I have used awk to extract the blocks of text. Clumsy awk, but it's early morning here. Then I have used the mapfile builtin to read the text blocks into a shell array. And finally a declare command to show the array's contents.
Code:
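$~ IFS=$'\n' mapfile -t dataarray < <(awk '
                    /\[DATATYPE[0-9]+\]/ {
                        x = ""
                        # read until the END marker; also stop at end of input
                        # so an unterminated block cannot loop forever
                        while ((getline) > 0 && $0 !~ /^\[END.*\]/) {
                            if (x != "") x = x " "
                            x = x $0
                        }
                        print x }' data.dat)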

$~ declare -p dataarray | sed 's/\[/\n[/g'

declare -a dataarray='(
[0]="multiple unknown number of lines of data that I dont want" 
[1]="multiple unknown number of lines of data that I do want" 
[2]="multiple unknown number of lines of data that I dont want" 
[3]="another set of multiple unknown number of lines of data that I do want placed into the next index of my array variable")'
Happy with your solution... then tick "yes" and mark as Solved!
 
Old 09-10-2012, 03:44 PM   #6
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
dru8274
Since yours was shorter I had a look first (thanks btw). I'm afraid I needed to retain the line breaks in the resulting shell array.
Quote:
ip_address
Thanks, that's the output I was after. Shame it had to go to temporary files, but it's the exact solution nonetheless.
 
Old 09-10-2012, 11:15 PM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1948
Thanks for the clarification. I also just noticed that you only want to extract one datatype. That makes sed easier to use here.

Here's my quick solution:
Code:
while read -r -d "@" line ; do

	array+=( "$line" )

done < <( sed -n '/^\[DATATYPE2/,/^\[END/ { /^\[DATA/d ; s/^\[END]$/@/ ; p }' infile.txt )
Similar to ip_address's sed commands, it uses two addresses to target blocks from "DATATYPE2" to "END", but then it also runs three nested commands on them. The first one removes the DATATYPE line, the second one replaces the END line with an "@" character, and the third prints. The at-mark is then used by the bash read command to delimit the array elements instead of newlines.

Be sure to change it to a different character if at-marks could exist in the text, of course.
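One way to sidestep the collision, sketched here under the assumption that the data is plain text, is to use a control character that text files rarely contain, such as the ASCII record separator (octal 036). The sample file is hypothetical, and the variable names sep, infile and blocks are only for illustration. Because this version sets IFS= to preserve leading whitespace, it also has to trim the newlines that sed leaves around the delimiter.

```bash
#!/bin/bash
# Hypothetical sample file mirroring the layout from the first post.
infile=$(mktemp)
cat > "$infile" <<'EOF'
[DATATYPE1]
unwanted line
[END-type1]

[DATATYPE2]
first wanted line
second wanted line
[END]

[DATATYPE2]
another wanted @ line
[END]
EOF

sep=$'\036'    # ASCII record separator, assumed absent from the data
blocks=()
while IFS= read -r -d "$sep" block; do
    block=${block#$'\n'}    # newline that followed the previous separator
    block=${block%$'\n'}    # newline that preceded this separator
    blocks+=( "$block" )
done < <(sed -n '/^\[DATATYPE2/,/^\[END/ { /^\[DATA/d; s/^\[END]$/'"$sep"'/; p; }' "$infile")
rm -f "$infile"

printf '[%s]\n\n' "${blocks[@]}"
```

The at-mark in the second block now survives, since it no longer doubles as the delimiter.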

Running it on the example text above, here's the output of the array.
Code:
$ printf '[%s]\n\n' "${array[@]}"
[multiple unknown number of lines of
data that I do want]

[another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable]

Last edited by David the H.; 09-10-2012 at 11:20 PM. Reason: fixet dypos
 
1 members found this post helpful.
Old 09-13-2012, 10:39 PM   #8
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
Just a note for people using this. Please be aware of the difference between \n and \r which cost me a day of grief.
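To illustrate the pitfall: with DOS line endings every line ends in \r\n, so a pattern anchored with $ such as ^\[END]$ never matches, because the line is really "[END]\r". A small sketch, using a made-up three-line file, that shows the effect and one way to strip the carriage returns with tr:

```bash
#!/bin/bash
# Hypothetical file written with DOS (CRLF) line endings.
crlf_file=$(mktemp)
printf '[DATATYPE2]\r\nwanted\r\n[END]\r\n' > "$crlf_file"

# Count lines that match the anchored END pattern before conversion.
# grep -c exits non-zero when the count is 0, hence the || true.
before=$(grep -c '^\[END]$' "$crlf_file" || true)

# Strip the carriage returns (dos2unix does the same job where installed).
unix_file=$(mktemp)
tr -d '\r' < "$crlf_file" > "$unix_file"
after=$(grep -c '^\[END]$' "$unix_file" || true)

echo "matches before: $before, after: $after"
rm -f "$crlf_file" "$unix_file"
```

Running `file` on the data, or viewing it with `cat -A` on GNU systems, is a quick way to spot CRLF endings before they cost a day.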
 
Old 09-13-2012, 11:22 PM   #9
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
Here's the final version for processing the lines from each data block of my data file one by one. Note, as above, that I needed to run a DOS2UNIX conversion on my data file to convert the \r\n line endings into \n. I also had to ensure that my dummy delimiter character @ was not used in my file.

Code:
#create block array
blockarray=()
while read -r -d "@" line ; do
        blockarray+=( "$line" )
done < <( sed -n '/^\[DATATYPE2/,/^\[END/ { /^\[DATA/d ; s/^\[END]$/@/ ; p }' infile.txt )

#Loop through blocks in block array
baLen=${#blockarray[@]}
for (( i=0; i<${baLen}; i++ ));
do
   #Pull a multiline element of block array
   currentblock="${blockarray[$i]}"

   #Create line array from current block
   linearray=()
   while IFS= read -r line; do
      linearray+=("$line")
   done <<< "$currentblock"

   #Loop through lines of current block for processing
   laLen=${#linearray[@]}
   for (( p=0; p<${laLen}; p++ ));
   do
      currentline="${linearray[$p]}"
         #process the current line
         echo "processing line $p of block $i"
         echo "$currentline"
   done
done
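As a possible shortcut, the inner loop that splits a block into lines can be replaced by the mapfile builtin (bash 4 or later). This is only a sketch with a made-up block. The variable names follow the script above.

```bash
#!/bin/bash
# Hypothetical multiline element, standing in for "${blockarray[$i]}".
currentblock=$'first wanted line\nsecond wanted line'

# -t strips the trailing newline from each line as it is stored.
mapfile -t linearray <<< "$currentblock"

# "${!linearray[@]}" expands to the indices, so no length variable is needed.
for p in "${!linearray[@]}"; do
    echo "processing line $p"
    echo "${linearray[$p]}"
done
```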
 
Old 09-14-2012, 12:16 AM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
I am presuming this is merely an exercise, as you could have simply read the file in a while loop at the start and delivered the same final output?
 
Old 09-16-2012, 03:40 PM   #11
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
Thanks for looking, grail. No, this was a real problem successfully solved by David. I'm curious. How would the script/code of your solution differ from that posted? Are you suggesting the use of switches? Are you suggesting the use of case? How would you capture the data block as described in the OP differently to the way David described? Perhaps you have a link or example code where this problem was solved more efficiently by your alternate method? I'd definitely be happy to hear of something more efficient.
 
Old 09-17-2012, 02:37 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Well, I was looking at your final code, and using the input provided in your first post, an example of the output would be:
Code:
processing line 1 of block 1
multiple unknown number of lines
processing line 2 of block 1
of data that I dont want
processing line 1 of block 2
multiple unknown number of lines of
processing line 2 of block 2
data that I do want
...
If I have read the code correctly and this is the output, I would have thought you could do something like:
Code:
#!/bin/bash

start_regex='\[DATATYPE[0-9]+\]'
end_regex='\[END.*\]'

block_count=0
line_count=0

while read -r line
do
    if [[ $line =~ $end_regex || -z $line ]]
    then
        line_count=0
        continue
    fi

    if [[ $line =~ $start_regex ]]
    then
        (( block_count++ ))
        continue
    fi

   (( line_count++ ))

    echo "processing line $line_count of block $block_count"
    echo "$line"
done<input_file
I haven't tested this as I'm not near a Linux box at the moment, but you get the general idea
 
Old 09-18-2012, 12:21 AM   #13
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
That's nice tidy code and makes immediate sense to me, thanks. I didn't know I could use regex in simple string comparisons in bash. However, now that I've seen your solution I can see that I needed to be more explicit. I needed/wanted to import an entire block before I began processing that block, hence my request to get all the whole blocks into a multiline string array.

I'm not sure how I might modify your code to capture whole blocks at a time before processing them, but presumably it might be straightforward. One might write the blocks to temporary files, or capture multiline strings into a string array as above, giving essentially the same result. If, on the other hand, your loop could be split into two nested parts, capturing blocks of data in the inner loop and processing them in an outer loop, that would be the most efficient and useful method of all, especially compared to sed, since different "start_regex" variables might be used to parse out different kinds of data blocks that might require different processing tasks in the outer loop.

Thanks a lot for having a look!

I don't mean to muck anyone around here. It's just that I see data files laid out like this all the time, so I think a solution to parse them in bash would be very useful to lots of people.
 
Old 09-18-2012, 09:49 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
You may have to explain a little further, as I am not sure I see why you would get the data and then perform further tasks on the data you have already parsed? Is it not more expedient to work on the data as you hit it? The only thing I can think of off the top of my head is if you wish to maybe perform tasks out of order??

Sorry if I missed the point here

If you are wanting the data stored in an array first, then the examples presented by others would seem appropriate. Instead of echoing the data, you could store it in an array to be used later, so that the entire solution is bash instead of using other commands as well.
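A rough, bash-only sketch of that idea follows. It is one possible arrangement rather than a tested drop-in: the sample file is made up, and the names blocks, current and in_block are hypothetical. It collects each DATATYPE2 block into one array element and leaves all processing for later.

```bash
#!/bin/bash
# Hypothetical sample file mirroring the layout from the first post.
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
[DATATYPE1]
unwanted line
[END-type1]

[DATATYPE2]
first wanted line
second wanted line
[END]

[DATATYPE2]
another wanted line
[END]
EOF

start_regex='^\[DATATYPE2\]'
end_regex='^\[END.*\]'

blocks=()
current=""
in_block=0

while IFS= read -r line; do
    if (( in_block )); then
        if [[ $line =~ $end_regex ]]; then
            blocks+=( "${current%$'\n'}" )   # store block minus final newline
            current=""
            in_block=0
        else
            current+="$line"$'\n'            # keep the line breaks
        fi
    elif [[ $line =~ $start_regex ]]; then
        in_block=1
    fi
done < "$input_file"
rm -f "$input_file"

printf '[%s]\n\n' "${blocks[@]}"
```

Changing start_regex selects a different kind of block, and no external commands or delimiter characters are involved.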
 
Old 09-18-2012, 07:25 PM   #15
mrpurple
Member
 
Registered: May 2010
Posts: 50

Original Poster
Rep: Reputation: 1
Quote:
Originally Posted by grail View Post
Is it not more expedient to work on the data as you hit it? The only thing I can think of off the top of my head is if you wish to maybe perform tasks out of order??
No to the first question and yes to the second. Oftentimes data contained in a block is not well conditioned. The way information in line 1 is processed may depend on data in line 3. Each line is highly interdependent on data which may appear in later lines. So instead of writing complex loops, preparing switches and variables in an ad hoc manner, it's more straightforward for this kind of interdependent data to get the lot first and then process it. I guess my desired output didn't imply this well enough. I have come across this interdependency so much that I thought it was more common than not. Perhaps it is rare, in which case I may be from a small group of users. Sorry, I honestly didn't realise that my situation wasn't the norm. It can be hard to know that sometimes.

Your suggestion to write lines to a multi line string array is probably what I would do to modify your code.
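To make the "get the lot first, then process" idea concrete, here is a small two-pass sketch. The rule it applies is invented purely for illustration: if any line in the block is "units=mm", every numeric line is reported in millimetres, otherwise in inches. The function name process_block is hypothetical too.

```bash
#!/bin/bash
# Hypothetical rule, for illustration only: a "units=mm" line anywhere in
# the block changes how the numeric lines before it are reported.
process_block() {
    local block=$1 unit="in" line
    local -a lines
    mapfile -t lines <<< "$block"

    # First pass: scan the whole block for settings that affect other lines.
    for line in "${lines[@]}"; do
        [[ $line == "units=mm" ]] && unit="mm"
    done

    # Second pass: process each line now that the full context is known.
    for line in "${lines[@]}"; do
        [[ $line =~ ^[0-9]+$ ]] && echo "$line $unit"
    done
    return 0
}

# The setting arrives on line 3 but still applies to lines 1 and 2.
result=$(process_block $'10\n20\nunits=mm')
echo "$result"
```

With the block held as one string, the look-ahead is a plain loop. Doing the same while streaming line by line would need the ad hoc flags and buffers described above.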
Thanks again
 
  



Tags
array, awk, multiline, parse

