[SOLVED] Convert length-indicated variable length record file to LF-terminated
I have files containing variable length records without linefeeds that I want to convert to a standard linefeed x'0a' terminated file format.
Each "record" begins with a two byte length field followed by two null bytes followed by the record data. There is no LF to indicate the end of each line, just another length field and record data.
Here is an example of the first three records of a file in hex format. I've separated the four-byte record length descriptor from the record data for clarity, but in the file there are no terminating linefeeds, and all the data is a single stream. x'1200' in the first two bytes of the first record indicates a length of 18 (decimal) bytes. x'0900' in the second record indicates 9 bytes of data. x'1700' in the third indicates 23 bytes of data.
You can see that the length in the record descriptor doesn't include the length of the descriptor, just the data that follows.
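For experimenting, a small file in this layout can be fabricated with bash's printf. The data bytes here are made-up filler; only the four-byte length fields follow the format described above:

```shell
# Hypothetical sample: three records of 18, 9, and 23 data bytes, each preceded
# by a little-endian two-byte length plus two null bytes, with no terminators.
printf '\x12\x00\x00\x00%s' 'ABCDEFGHIJKLMNOPQR'      >  sample.bin  # 18 bytes
printf '\x09\x00\x00\x00%s' '123456789'               >> sample.bin  #  9 bytes
printf '\x17\x00\x00\x00%s' 'abcdefghijklmnopqrstuvw' >> sample.bin  # 23 bytes
xxd sample.bin   # inspect the single unbroken stream
```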
So what I'd like to do is read the file, copy only the data bytes that follow each length field for the indicated number of bytes, then add a linefeed character x'0A', write that to an output file, and continue with the next record until the end of the file.
In this example, I'd expect to end up with the following (hex format):
Note that the four byte record length descriptor fields are gone, and each line is terminated with a linefeed character.
Is it possible to accomplish this in a bash script? I'm not really looking for a working script, just an idea of some commands or techniques to get me started in the right direction.
What language do you prefer? I suggest C, but you could also try Java or Perl or .... It would be really hard using only bash.
You can find an interesting discussion here: http://unix.stackexchange.com/questi...y-file-content
pan64, thank you for the link. I am going to try using dd to strip off the length bytes so I can read the following data bytes and then write them out a line at a time. If I can't get it to work in a bash script, I'll try a programming language.
Based on the examples at the link you posted, pan64, my script is almost doing what I want.
Code:
#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
# where the first byte is the length of the data that follows, not counting the length
# field itself. There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record
# (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while true ; do
    #
    # Get the length byte
    ((count=1)) # count of bytes to read
    dd if="$1" bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
    (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
    strlen=$((0x$(<tmp1 xxd -ps)))
    #
    # Copy the data for strlen bytes
    ((count=strlen)) # count of bytes to read
    ((skip+=4)) # start reading at plus 4 offset
    dd if="$1" bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
    tmp1ct=$(<tmp1 wc -c)
    (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
    echo -e "\n" >>tmp1 # add a newline at the end of the line
    cat tmp1 >>"$2"
    #
    ((skip=skip+count)) # increment past end of input data already processed
done
rm tmp1
This script extracts the data portion of each input record correctly, but I'm having trouble with the inserted linefeed characters in the output file. If I omit the echo statement that adds the \n linefeed to the temporary file just before I append it to the output file on the following cat statement, I get no linefeeds at all between each record. But if I use the echo statement, I get double linefeed characters x'0A0A' at the end of each line.
How can I get just one x'0A' at the end of each record?
I haven't figured out why using the echo statement gives me the double linefeed, but I got around it by replacing the echo with a dd command with conv=unblock. The conv=unblock causes dd to add the x'0A' to the end of the record, apparently.
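The double linefeed has a simple explanation: echo appends its own trailing newline after printing its arguments, so echo -e "\n" emits two x'0A' bytes. A bare printf '\n' emits exactly one, and could have replaced the echo without the conv=unblock workaround:

```shell
# echo -e "\n" outputs the expanded \n PLUS echo's own trailing newline: 2 bytes
echo -e "\n" | xxd -ps    # 0a0a
# printf emits exactly what the format string says: 1 byte
printf '\n'  | xxd -ps    # 0a
```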
Here is the current script, which does exactly what I wanted to do.
Code:
#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
# where the first byte is the length of the data that follows, not counting the length
# field itself. There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record
# (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while true ; do
    #
    # Get the length byte
    ((count=1)) # count of bytes to read
    dd if="$1" bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
    (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
    strlen=$((0x$(<tmp1 xxd -ps)))
    #
    # Copy the data for strlen bytes
    ((count=strlen)) # count of bytes to read
    ((skip+=4)) # start reading at plus 4 offset
    dd if="$1" bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
    tmp1ct=$(<tmp1 wc -c)
    (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
    dd if=tmp1 bs=$count cbs=$count conv=unblock of=tmp2 2>/dev/null
    cat tmp2 >>"$2"
    #
    ((skip=skip+count)) # increment past end of input data already processed
done
rm tmp1 tmp2
I should mention that the way I've coded this (with many thanks to the writer of the example I worked from and to pan64 for pointing it out to me), it works for one-byte length codes only, i.e., up to x'ff' (255 decimal) bytes. That's OK for my current situation since none of my records exceed 255 bytes. I'll try to revise it to work with two-byte lengths, since I may need to handle records of up to 32767 bytes at some point.
I cannot complete it right now, but here is an idea:
nums=$(od -j $skip -A n -N 4 -t d1)
will return 4 decimal bytes at the given address (skip).
you can use something like this:
echo "$nums 256 * + 256 * + 256 * + p" | dc
to calculate the length (but it depends on the byte order also)
Even though the length descriptor field is four bytes long, the length should be a two byte value. The second two bytes of the descriptor are probably always zero.
Considering the system of origin for this data, I suspect it is a signed 16 bit binary number with a maximum value of x'7FFF' = 32767 decimal. It's possible that it could be an unsigned 16 bit binary number, x'FFFF' = 65535, but I doubt that.
The first record in my sample has a descriptor of x'12000000'. The data that follows the descriptor is indeed 18 decimal bytes long. x'1200' is little-endian byte order. On the big-endian system it came from, it would have been x'0016', because the length in the 4-byte RDW (record descriptor word) of a variable-blocked dataset includes the RDW itself, hence x'0016' rather than x'0012'. But this file was processed with zip on the mainframe, so I believe the endianness was changed, along with the data lengths.
It is probably safe to treat it as either a 16 bit or 32 bit unsigned number.
When I run your example above, it hangs before the echo is executed. I need to read up on the od command.
Oh, yes, I had forgotten the filename:
nums=$(od -j $skip -A n -N 4 -t d1 filename)
For 2 bytes you need to use -N 2, I think, and
echo "$nums 256 * + p" | dc
After messing around with it a lot, I never could get the od and dc commands to work right for all the lengths I encountered. But I figured out how to read the little-endian 16-bit integer from the file a byte at a time with xxd, concatenate the bytes in the order bash expects for a hex value 0xnnnn (which is not little-endian), and convert that to a decimal value for use in the dd commands.
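The byte-swap trick in isolation, using a hypothetical length field of x'1200':

```shell
printf '\x12\x00' > len.bin           # hypothetical little-endian length field
byte0=$(xxd -l 1 -ps -s 0 len.bin)    # low-order byte:  12
byte1=$(xxd -l 1 -ps -s 1 len.bin)    # high-order byte: 00
# bash hex literals are written high byte first, so concatenate in reverse order
reclen=$((0x$byte1$byte0))
echo "$reclen"                        # 18
```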
$1 is the input file name. The data is in EBCDIC and has the record descriptor length fields and no line terminators. The script is driven by a find command, so as to convert all the files in a directory and concatenate them together.
Code:
((skip=0))
filelen=$(<"$1" wc -c)
while [[ $skip -lt $filelen ]] ; do
    byte0=$(xxd -l 1 -ps -c 1 -s $skip "$1")
    ((skip+=1))
    byte1=$(xxd -l 1 -ps -c 1 -s $skip "$1")
    reclen=$((0x$byte1$byte0))
    ((skip+=3)) # start reading after the four byte length field
    dd if="$1" bs=1 skip=$skip count=$reclen of=tmp1 2>/dev/null
    dd if=tmp1 bs=$reclen cbs=$reclen conv=ascii,unblock of=tmp2 2>/dev/null
    cat tmp2 >>"$outfile" # $outfile is set by the calling find-driven script
    # Increment past the copied data to point to the next length field
    ((skip=skip+reclen))
done
rm tmp1 tmp2
exit
The above code converts the files to variable-length ASCII records with standard LF line terminators.
I ran a test on 145 files, and aside from being dreadfully slow, it worked like a champ. I suspect the slowness comes mostly from spawning two xxd processes and two dd processes for every record, each of which has to open the input file and seek to the current offset. Since I have many directories needing conversion that contain thousands of files each, I suppose it's time to start working on a C program to do this. Hopefully I can make that run much more quickly. Still, it's good to know how to deal with the null bytes and the little-endian integer format in bash.
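As a middle ground before reaching for C, a sketch that reads each file only once: hex-dump the whole file with xxd, then slice each record out of the hex string with bash substring expansion. It assumes the same four-byte little-endian descriptors as above, and skips the EBCDIC-to-ASCII step for brevity (that would still need dd conv=ascii or iconv):

```shell
#!/bin/bash
# Single-read sketch: $1 is the input file, $2 the output file.
infile=$1 outfile=$2
hex=$(xxd -ps "$infile" | tr -d '\n')   # whole file as one hex string
i=0
while (( i < ${#hex} )); do
    b0=${hex:i:2} b1=${hex:i+2:2}       # two little-endian length bytes
    reclen=$((0x$b1$b0))
    printf '%s' "${hex:i+8:reclen*2}" | xxd -r -ps >>"$outfile"
    printf '\n' >>"$outfile"
    (( i += 8 + reclen*2 ))             # descriptor + data, counted in hex digits
done
```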