LinuxQuestions.org


mrpurple 09-10-2012 12:29 AM

using awk to get blocks of data from a text file
 
I want to extract multiline blocks of data from a text file into a string array, with each element holding one multiline block. The data file looks like this:
Code:

[DATATYPE1]
multiple unknown number of lines
of data that I dont want
[END-type1]

[DATATYPE2]
multiple unknown number of lines of
data that I do want
[END]

[DATATYPE3]
multiple unknown number of lines
of data that I dont want
[END]

[DATATYPE2]
another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable
[END]

I want the lines from each block I need placed together into a single element of a string array. So, in the example above, the result would be something like:
Code:

$dataarray[1]="multiple unknown number of lines of
data that I do want"

$dataarray[2]="another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable"

I made a regex which can find the instances of my desired data blocks in a multiline string, so I started a bash script to process my text file, but I can't pull out the data blocks or put them into an array. Here's what I have:
Code:

# /bin/sh
echo "Reading file $1"
blockregex="\[DATATYPE2].+?\[END]"
$dataarray=$(awk 'BEGIN {FS="\n" RS=""} /$blockregex/ { print $0 }' $1)

It's probably really simple, but these multiline requirements mean I can't just cut and paste a simple solution from elsewhere, nor can I figure out what's wrong with my code.

David the H. 09-10-2012 01:43 AM

I'm confused. Are you trying to extract the lines with awk for a shell script array, or trying to set the lines in an awk array, or what? Could you explain the context for your request a bit more?

In awk, you'd probably have to set the RS to a string that matches each block, then further process it to exclude the lines you don't want, perhaps with sub/gsub. After that, it would depend on what you want to do with it.

Code:

$dataarray[1]="text"
This doesn't match either awk or shell syntax. There is generally no $ at the front when setting a variable.
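
For what it's worth, a rough and untested sketch of that RS idea (this assumes the blocks really are separated by blank lines, so awk's paragraph mode applies, and "infile.txt" is just a placeholder for your data file):
Code:

awk 'BEGIN { RS=""; ORS="\n\n" }       # RS="" reads one blank-line-separated block per record
     /^\[DATATYPE2\]/ {
         sub(/^\[DATATYPE2\]\n/, "")   # strip the opening tag line
         sub(/\n\[END\]$/, "")         # strip the closing tag line
         print
     }' infile.txt

Each record that starts with [DATATYPE2] gets printed with its tag lines removed, so only the wanted blocks come out, still separated by blank lines.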

mrpurple 09-10-2012 02:30 AM

First, thanks for looking. Second, sorry - please disregard the description above of what I am trying to achieve, as it is unclear and uses incorrect syntax.

To be more clear: I do want to extract the desired lines into a shell array. I was hoping for one block of lines at each position in the array.

In the example above I would expect to be able to add the following to the end of the script
Code:

echo "${dataarray[1]}" > firstblock.txt
cat firstblock.txt

which on execution in bash would result in
Code:

multiple unknown number of lines of
data that I do want

Can I have multiple lines inside a string array like this? Or perhaps by adding a dimension to the array somehow?
Also, I'd rather do it by adding dimensions rather than adding functions.

Again, sorry for being unclear. I would have thought that pulling blocks of multi-line text from a text file against a regex would be common as dirt, but I'm having real trouble figuring it out. There are lots of hits on Google, but they're either far too complicated for me to follow for my "simple" problem or they are too far off topic.

ip_address 09-10-2012 05:17 AM

Maybe this helps:

Put the contents in a text file named "solve_problem.txt":

Code:

more solve_problem.txt

[DATATYPE1]
multiple unknown number of lines
of data that I dont want
[END-type1]

[DATATYPE2]
multiple unknown number of lines of
data that I do want
[END]

[DATATYPE3]
multiple unknown number of lines
of data that I dont want
[END]

[DATATYPE2]
another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable
[END]

and try using this bash script

Code:

#!/bin/bash

#text file to be processed
filename='solve_problem.txt'

#create a temporary file
temp_file='temp.txt'

#initialize index
echo "0" > "$temp_file"

#generate index numbers
sed -n '/\[DATATYPE2\]/,/\[END\]/p' "$filename" | sed '/^\[DATATYPE2\]/d' | grep -n "\[END\]" | awk -F: '{print $1}' >> "$temp_file"

#extract portions of the file
sed -n '/\[DATATYPE2\]/,/\[END\]/p' "$filename" | sed '/^\[DATATYPE2\]/d' > temp_"$filename"

#number of lines to be processed
loop=`wc -l < "$temp_file"`

#initialize a counter
counter2=1


for ((counter=1;counter<="$loop"-1;counter++));do

counter1=$((counter+1))

#get lines from temp file
first=`sed -n "$counter"p "$temp_file"`
sec=`sed -n "$counter1"p "$temp_file"`

#adjustments to extract only the desired part
first=$((first+1))

sec=$((sec-1))

#put the contents in array
array_var[counter2]=`sed -n "$first","$sec"p temp_"$filename"`

#display the array variable
echo ""
echo "${array_var[counter2]}"
echo ""
echo "********************"
counter2=$((counter2+1))

done

rm temp*

It will create an array named "array_var" whose contents can be accessed using echo "${array_var[n]}", where n is the index of the desired element in the array.

dru8274 09-10-2012 09:34 AM

Quote:

Originally Posted by mrpurple (Post 4776630)
To be more clear: I do want to extract the desired lines into a shell array. I was hoping for one block of lines at each position in the array.

Perhaps something like this. First I have used awk to extract the blocks of text. Clumsy awk, but it's early morning here. Then I have used the mapfile builtin to read the text-blocks into a shell-array. And finally a declare command to show the array's contents.
Code:

$~ IFS=$'\n' mapfile -t dataarray < <(awk '
                    /\[DATATYPE[0-9]+\]/ {
                        x="";
                        getline
                        while ($0!~/^\[END.*\]/) {
                            if (x!="") x=x" "
                            x=x$0
                            getline
                        }
                        print x }' data.dat)

$~ declare -p dataarray | sed 's/\[/\n[/g'

declare -a dataarray='(
[0]="multiple unknown number of lines of data that I dont want"
[1]="multiple unknown number of lines of data that I do want"
[2]="multiple unknown number of lines of data that I dont want"
[3]="another set of multiple unknown number of lines of data that I do want placed into the next index of my array variable")'

Happy with ur solution... then tick "yes" and mark as Solved!

mrpurple 09-10-2012 03:44 PM

Quote:

dru8274
Since yours was shorter I had a look first (thanks btw). I'm afraid I needed to retain the line breaks in the resulting shell array.
Quote:

ip_address
Thanks, that's the output I was after. Shame it had to go to temporary files, but it's the exact solution nonetheless.

David the H. 09-10-2012 11:15 PM

Thanks for the clarification. I also just noticed that you only want to extract one datatype. That makes sed easier to use here.

Here's my quick solution:
Code:

while read -r -d "@" line ; do

        array+=( "$line" )

done < <( sed -n '/^\[DATATYPE2/,/^\[END/ { /^\[DATA/d ; s/^\[END]$/@/ ; p }' infile.txt )

Similar to ip_address's sed commands, it uses two addresses to target blocks of "DATATYPE2" to "END", but then it also runs three nested commands on them. The first one removes the DATATYPE line, the second one replaces the END line with an "@" character, and the third prints. The at-mark is then used in the bash read command to delimit the array elements instead of newlines.

Be sure to change it to a different character if at-marks could exist in the text, of course.

Running it on the example text above, here's the output of the array.
Code:

$ printf '[%s]\n\n' "${array[@]}"
[multiple unknown number of lines of
data that I do want]

[another set of multiple unknown number of lines of
data that I do want placed into the next
index of my array variable]


mrpurple 09-13-2012 10:39 PM

Just a note for people using this: please be aware of the difference between \n and \r, which cost me a day of grief.
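
In case it saves someone else the same grief, either of these should strip the carriage returns before parsing (the second is an option if dos2unix isn't installed; the file names are just placeholders):
Code:

dos2unix infile.txt
tr -d '\r' < infile.txt > infile.unix.txt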

mrpurple 09-13-2012 11:22 PM

Here's the final version for processing the lines from each data block of my data file one by one. Note, as above, that I needed to run dos2unix on my data file to strip the \r carriage returns (converting \r\n line endings to plain \n). Also, I had to ensure that my dummy delimiter character @ was not used in my file.

Code:

#create block array
blockarray=()
while read -r -d "@" line ; do
        blockarray+=( "$line" )
done < <( sed -n '/^\[DATATYPE2/,/^\[END/ { /^\[DATA/d ; s/^\[END]$/@/ ; p }' infile.txt )

#Loop through blocks in block array
baLen=${#blockarray[@]}
for (( i=0; i<${baLen}; i++ ));
do
  #Pull a multiline element of block array
  currentblock="${blockarray[$i]}"

  #Create line array from current block
  linearray=()
  while IFS= read -r line; do
      linearray+=("$line")
  done <<< "$currentblock"

  #Loop through lines of current block for processing
  laLen=${#linearray[@]}
  for (( p=0; p<${laLen}; p++ ));
  do
      currentline="${linearray[$p]}"
        #process the current line
        echo "processing line $p of block $i"
        echo "$currentline"
  done
done


grail 09-14-2012 12:16 AM

I am presuming this is merely an exercise, as you could have simply read the file in a while loop at the start and delivered the same final output?

mrpurple 09-16-2012 03:40 PM

Thanks for looking, grail. No, this was a real problem, successfully solved by David. I'm curious: how would the script/code of your solution differ from that posted? Are you suggesting the use of switches? Are you suggesting the use of case? How would you capture the data block described in the OP differently from the way David described? Perhaps you have a link or example code where this problem was solved more efficiently by your alternate method? I'd definitely be happy to hear of something more efficient.

grail 09-17-2012 02:37 AM

Well, I was looking at your final code, and using the input provided in your first post, an example of the output would be:
Code:

processing line 1 of block 1
multiple unknown number of lines
processing line 2 of block 1
of data that I dont want
processing line 1 of block 2
multiple unknown number of lines of
processing line 2 of block 2
data that I do want
...

If I have read the code correctly and this is the output, I would have thought you could do something like:
Code:

#!/bin/bash

start_regex='\[DATATYPE[0-9]+\]'
end_regex='\[END.*\]'

block_count=0
line_count=0

while read -r line
do
    if [[ $line =~ $end_regex || -z $line ]]
    then
        line_count=0
        continue
    fi

    if [[ $line =~ $start_regex ]]
    then
        (( block_count++ ))
        continue
    fi

    (( line_count++ ))

    echo "processing line $line_count of block $block_count"
    echo "$line"
done<input_file

I haven't tested this as I'm not near a linux box at the mo, but you get the general idea :)

mrpurple 09-18-2012 12:21 AM

That's nice tidy code and makes immediate sense to me, thanks. I didn't know I could use regex in simple string comparisons in bash. However, now that I've seen your solution, I can see that I needed to be more explicit. I needed/wanted to import an entire block before I began processing that block - hence my request to get all the whole blocks into a multiline string array.

I'm not sure how I might modify your code to capture whole blocks at a time before processing them, but presumably it might be straightforward. One might write the blocks to temporary files? Or capture multiline strings into a string array as above, giving essentially the same result? If, on the other hand, your loop could be split into two nested parts, somehow capturing blocks of data in one loop and processing them in another, that would be the most efficient and useful method of all, especially compared to sed, since different "start_regex" variables might be used to parse out different kinds of data blocks that might require different processing tasks in the processing loop.

Thanks a lot for having a look!

I don't mean to muck anyone around here. It's just that I see data files laid out like this all the time, and so I think a solution for parsing them in bash would be very useful to lots of people.

grail 09-18-2012 09:49 AM

You may have to explain a little further, as I am not sure I see why you would get the data and then perform further tasks on the data you have already parsed. Is it not more expedient to work on the data as you hit it? The only thing I can think of off the top of my head is if you wish to perform tasks out of order?

Sorry if I missed the point here :(

If you are wanting the data stored in an array first, then the examples presented by others would seem appropriate. Instead of echoing the data, you could store it in an array to be used later, so that the entire solution is bash instead of using other commands as well.
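
Untested, but building on the loop from my earlier post, splitting it into a capture stage and a processing stage might look something like this (narrow start_regex to \[DATATYPE2\] if only that block type is wanted):
Code:

#!/bin/bash

start_regex='\[DATATYPE[0-9]+\]'
end_regex='\[END.*\]'

block_count=-1
blocks=()

# capture stage: append each data line to the current block's array element
while IFS= read -r line
do
    if [[ $line =~ $end_regex || -z $line ]]
    then
        continue
    fi

    if [[ $line =~ $start_regex ]]
    then
        (( block_count++ ))
        blocks[block_count]=""
        continue
    fi

    blocks[block_count]+="${line}"$'\n'
done < input_file

# processing stage: each element now holds one whole multiline block
for (( i=0; i<${#blocks[@]}; i++ ))
do
    echo "processing block $i"
    echo "${blocks[$i]}"
done

That way the whole block is in memory before the second loop touches it, without any temporary files.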

mrpurple 09-18-2012 07:25 PM

Quote:

Originally Posted by grail (Post 4783330)
Is it not more expedient to work on the data as you hit it? The only thing I can think of off the top of my head is if you wish to maybe perform tasks out of order??

No to the first question and yes to the second. Often, data contained in a block is not well conditioned. The way information in line 1 is processed may depend on data in line 3. Each line is highly interdependent with data which may appear in later lines. So instead of writing complex loops, preparing switches and variables in an ad hoc manner, it's more straightforward for this kind of interdependent data to get the lot first and then process it. I guess my desired output didn't imply this well enough. I have come across this interdependency so much I thought it was more common than not. Perhaps this interdependency is rare, in which case I may be from a small group of users. Sorry, I honestly didn't realise that my situation wasn't the norm. It can be hard to know that sometimes.

Your suggestion to write lines to a multi line string array is probably what I would do to modify your code.
Thanks again

