LinuxQuestions.org

Linux - General: Convert length-indicated variable length record file to LF-terminated (http://www.linuxquestions.org/questions/linux-general-1/convert-length-indicated-variable-length-record-file-to-lf-terminated-4175438727/)

Z038 11-26-2012 01:20 AM

Convert length-indicated variable length record file to LF-terminated
 
I have files containing variable length records without linefeeds that I want to convert to a standard linefeed x'0a' terminated file format.

Each "record" begins with a two byte length field followed by two null bytes followed by the record data. There is no LF to indicate the end of each line, just another length field and record data.

Here is an example of the first three records of a file in hex format. I've separated the four-byte record length descriptor from the record data for clarity, but in the file there are no terminating linefeeds, and all the data is a single stream. The 1200 in the first two bytes of the first record indicates a length of 18 (decimal) bytes. The 0900 in the second record indicates 9 bytes of data. The 1700 in the third indicates 23 bytes of data.

Code:

12000000 F0F0F0F1F0F0F0F0615C40D9C5E7E7405C61
09000000 F0F0F0F2F0F0F0F040
17000000 F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C3

In reality, it looks like this in the file:

Code:

12000000F0F0F0F1F0F0F0F0615C40D9C5E7E7405C6109000000F0F0F0F2F0F0F0F04017000000F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C3
You can see that the length in the record descriptor doesn't include the length of the descriptor, just the data that follows.

So what I'd like to do is read the file, copy only the data bytes that follow each length field for the indicated number of bytes, then add a linefeed character x'0A', write that to an output file, and continue with the next record until the end of the file.

In this example, I'd expect to end up with the following (hex format):

Code:

F0F0F0F1F0F0F0F0615C40D9C5E7E7405C610A
F0F0F0F2F0F0F0F0400A
F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C30A

Note that the four byte record length descriptor fields are gone, and each line is terminated with a linefeed character.

Is it possible to accomplish this in a bash script? I'm not really looking for a working script, just an idea of some commands or techniques to get me started in the right direction.

pan64 11-26-2012 07:57 AM

What language do you prefer? I suggest C, but you could also try Java or Perl or .... It would be really hard using only bash.
You can find an interesting discussion here: http://unix.stackexchange.com/questi...y-file-content

Z038 11-26-2012 10:11 AM

pan64, thank you for the link. I am going to try using dd to strip off the length bytes so I can read the following data bytes and then write them out a line at a time. If I can't get it to work in a bash script, I'll try a programming language.

Z038 11-26-2012 01:06 PM

Based on the examples at the link you posted, pan64, my script is almost doing what I want.

Code:

#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
#    where the first byte is the length of the data that follows, not counting the length 
#    field itself.  There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record 
#    (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while true ; do
  #
  # Get the length byte
  ((count=1)) # count of bytes to read
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
  strlen=$((0x$(<tmp1 xxd -ps))) 
  #
  # Copy the data for strlen bytes
  ((count=strlen)) # count of bytes to read
  ((skip+=4))      # start reading at plus 4 offset
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  tmp1ct=$(<tmp1 wc -c)
  (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
  echo -e "\n" >>tmp1 # add a newline at the end of the line
  cat tmp1 >>$2
  #
  ((skip=skip+count))  # increment past end of input data already processed
done
rm tmp1

This script extracts the data portion of each input record correctly, but I'm having trouble with the inserted linefeed characters in the output file. If I omit the echo statement that adds the \n linefeed to the temporary file just before I append it to the output file on the following cat statement, I get no linefeeds at all between each record. But if I use the echo statement, I get double linefeed characters x'0A0A' at the end of each line.

How can I get just one x'0A' at the end of each record?

Z038 11-26-2012 03:15 PM

I haven't figured out why using the echo statement gives me the double linefeed, but I got around it by replacing the echo with a dd command with conv=unblock. The conv=unblock causes dd to add the x'0A' to the end of the record, apparently.
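[Editor's note: the double linefeed has a simple cause. echo appends its own trailing newline after printing its arguments, so in bash `echo -e "\n"` emits two x'0A' bytes. A bare echo, or printf '\n', emits exactly one:]

```shell
echo -e "\n" | wc -c    # 2 bytes: the escaped \n plus echo's own trailing newline
echo         | wc -c    # 1 byte: just echo's trailing newline
printf '\n'  | wc -c    # 1 byte: printf adds nothing you didn't ask for
```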

Here is the current script, which does exactly what I wanted to do.

Code:

#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
#    where the first byte is the length of the data that follows, not counting the length 
#    field itself.  There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record 
#    (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while true ; do
  #
  # Get the length byte
  ((count=1)) # count of bytes to read
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
  strlen=$((0x$(<tmp1 xxd -ps))) 
  #
  # Copy the data for strlen bytes
  ((count=strlen)) # count of bytes to read
  ((skip+=4))      # start reading at plus 4 offset
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  tmp1ct=$(<tmp1 wc -c)
  (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
  dd if=tmp1 bs=$count cbs=$count conv=unblock of=tmp2 2>/dev/null
  cat tmp2 >>$2
  #
  ((skip=skip+count))  # increment past end of input data already processed
done
rm tmp1 tmp2


Z038 11-26-2012 03:28 PM

I should mention that the way I've coded this (with many thanks to the writer of the example I worked from and to pan64 for pointing it out to me), it works for one-byte length codes only, i.e., up to x'ff' (255 decimal) bytes. That's OK for my current situation since none of my records exceed 255 bytes. I'll try to revise it to work with 2 byte lengths since I may need to handle up to 32767 byte records at some point.

pan64 11-27-2012 02:21 AM

I cannot complete it right now, but here is an idea:
nums=$(od -j $skip -A n -N 4 -t d1)
will return the 4 bytes at the given offset ($skip) as decimal numbers.
you can use something like this:
echo "$nums 256 * + 256 * + 256 * + p" | dc
to calculate the length (but it depends on the byte order also)

Z038 11-27-2012 08:11 PM

Even though the length descriptor field is four bytes long, the length should be a two byte value. The second two bytes of the descriptor are probably always zero.

Considering the system of origin for this data, I suspect it is a signed 16 bit binary number with a maximum value of x'7FFF' = 32767 decimal. It's possible that it could be an unsigned 16 bit binary number, x'FFFF' = 65535, but I doubt that.

The first record in my sample has a descriptor of '12000000'. The data that follows the descriptor is indeed 18 decimal bytes long. x'1200' is little-endian byte order. On the big-endian system it came from, it would have been x'0016'. That length includes the 4-byte RDW (record descriptor word) for a variable blocked dataset, hence x'0016' instead of x'0012'. But this file was processed with zip on the mainframe, so I believe the endianness was changed, along with the data lengths.

It is probably safe to treat it as either a 16 bit or 32 bit unsigned number.
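[Editor's note: the byte-order reasoning above is easy to check with bash arithmetic alone, using the first record's descriptor from the sample. Reading the same two bytes in the wrong order gives an obviously wrong length:]

```shell
hdr=12000000                  # first record's 4-byte descriptor, as hex text
b0=${hdr:0:2} b1=${hdr:2:2}   # the two length bytes: "12" and "00"
echo $(( 0x$b1$b0 ))          # little-endian reading: 0x0012 = 18, matches the data
echo $(( 0x$b0$b1 ))          # big-endian reading: 0x1200 = 4608, clearly wrong
```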

When I run your example above, it hangs before the echo is executed. I need to read up on the od command.

pan64 11-28-2012 12:36 AM

Oh, yes, I had forgotten the filename:
nums=$(od -j $skip -A n -N 4 -t d1 filename)
For 2 bytes you need -N 2, I think, and
echo "$nums 256 * + p" | dc

Z038 11-28-2012 11:49 PM

That works, both the 2 byte and the 4 byte version.

Thank you. That will allow me to handle arbitrarily long variable length records.

Z038 11-29-2012 11:59 PM

After messing around with it a lot, I never could get the od and dc commands to work right for all the lengths I encountered. But I figured out how to read the little-endian 16-bit integer from the file a byte at a time with xxd, concatenate the bytes in the order bash expects for a hex value 0xnnnn (which is not little-endian), and convert that to a decimal value for use in the dd commands.
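[Editor's note: one likely reason the od/dc route misbehaved for some lengths is that -t d1 prints *signed* bytes, so any length byte of x'80' or above comes out negative and wrecks the arithmetic. -t u1 (unsigned) avoids that, and plain bash arithmetic can then replace dc. The file name hdr.bin below is just for illustration:]

```shell
# Illustration: a 2-byte little-endian length of 0x0190 = 400 decimal
printf '\x90\x01' > hdr.bin
od -A n -N 2 -t d1 hdr.bin    # signed bytes: 0x90 prints as -112
od -A n -N 2 -t u1 hdr.bin    # unsigned bytes: 144 and 1
# With unsigned bytes, bash arithmetic gives the length directly:
set -- $(od -A n -N 2 -t u1 hdr.bin)
reclen=$(( $1 + 256 * $2 ))   # 144 + 256*1 = 400
echo "$reclen"
```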

$1 is the input file name. The data is in EBCDIC and contains the record descriptor length fields and no line terminators. The script is driven by a find command to convert all the files in a directory and concatenate them together.

Code:

((skip=0))
filelen=$(<"$1" wc -c)
while [[ $skip -lt $filelen ]] ; do
  # Read the two length bytes and assemble them in the order bash expects
  byte0=$(xxd -l 1 -ps -c 1 -s $skip "$1")
  ((skip+=1))
  byte1=$(xxd -l 1 -ps -c 1 -s $skip "$1")
  reclen=$((0x$byte1$byte0))
  ((skip+=3))      # start reading after the four byte length field
  dd if="$1" bs=1 skip=$skip count=$reclen of=tmp1 2>/dev/null
  dd if=tmp1 bs=$reclen cbs=$reclen conv=ascii,unblock of=tmp2 2>/dev/null
  cat tmp2 >>"$outfile"
  # Increment past the copied data to point to the next length field
  ((skip=skip+reclen))
done
rm tmp1 tmp2
exit

The above code converts the files to variable-length ASCII records with standard LF line terminators.

I ran a test on 145 files, and aside from being dreadfully slow, it worked like a champ. I suspect the slowness comes from the dd command having to pass over the data in each file repeatedly in order to get to the next record. Since I have many directories needing conversion that contain thousands of files each, I suppose it's time to start working on a C program to do this. Hopefully I can make that run much more quickly. Still, it's good to know how to deal with the null bytes and little-endian integer format in bash.
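[Editor's note: much of the slowness likely comes from spawning several dd/xxd processes per record rather than from dd's seeking. One way to stay in bash is to open the input once on a spare file descriptor, so each read continues where the previous one stopped. A rough single-pass sketch, not the poster's script; it assumes the format described in the thread (2-byte little-endian length, two zero bytes, then data) and skips the EBCDIC-to-ASCII step. Note also that conv=unblock in the script above trims trailing blanks from each record, which may or may not be wanted:]

```shell
#!/bin/bash
# Sketch: single-pass conversion using one open file descriptor (fd 3).
# Reads each 4-byte descriptor, copies the indicated number of data
# bytes, and appends one x'0A' per record.
convert() {
  local infile=$1 outfile=$2 hdr reclen
  : > "$outfile"
  exec 3< "$infile"
  # A short read (fewer than 8 hex digits from xxd) means end of file.
  while hdr=$(dd bs=1 count=4 <&3 2>/dev/null | xxd -ps) ; [[ ${#hdr} -eq 8 ]] ; do
    reclen=$(( 0x${hdr:2:2}${hdr:0:2} ))   # swap bytes: little-endian length
    head -c "$reclen" <&3 >> "$outfile"    # copy the data portion
    printf '\n' >> "$outfile"              # exactly one linefeed per record
  done
  exec 3<&-
}
```

Because the command substitutions read from fd 3, which shares one file offset with the parent shell, each dd and head picks up right where the last one left off; nothing ever re-scans the file from the start.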

