LinuxQuestions.org
Old 11-26-2012, 01:20 AM   #1
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Convert length-indicated variable length record file to LF-terminated


I have files containing variable length records without linefeeds that I want to convert to a standard linefeed x'0a' terminated file format.

Each "record" begins with a two byte length field followed by two null bytes followed by the record data. There is no LF to indicate the end of each line, just another length field and record data.

Here are the first three records of a file in hex format. I've separated the four byte record length descriptor from the record data for clarity, but in the file there are no terminating linefeeds, and all the data is a single stream. 1200 in the first two bytes of the first record indicates a length of 18 (decimal) bytes. 0900 in the second record indicates 9 bytes of data. 1700 in the third indicates 23 bytes of data.

Code:
12000000 F0F0F0F1F0F0F0F0615C40D9C5E7E7405C61 
09000000 F0F0F0F2F0F0F0F040 
17000000 F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C3
In reality, it looks like this in the file:

Code:
12000000F0F0F0F1F0F0F0F0615C40D9C5E7E7405C6109000000F0F0F0F2F0F0F0F04017000000F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C3
You can see that the length in the record descriptor doesn't include the length of the descriptor, just the data that follows.

So what I'd like to do is read the file, copy only the data bytes that follow each length field for the indicated number of bytes, then add a linefeed character x'0A', write that to an output file, and continue with the next record until the end of the file.

In this example, I'd expect to end up with the following (hex format):

Code:
F0F0F0F1F0F0F0F0615C40D9C5E7E7405C610A
F0F0F0F2F0F0F0F0400A
F0F0F0F3F0F0F0F0C184849985A2A240C9E2D7C5E7C5C30A
Note that the four byte record length descriptor fields are gone, and each line is terminated with a linefeed character.

Is it possible to accomplish this in a bash script? I'm not really looking for a working script, just an idea of some commands or techniques to get me started in the right direction.

Last edited by Z038; 11-26-2012 at 01:26 AM.
 
Old 11-26-2012, 07:57 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,629

What language do you prefer? I suggest using C, but you could also try Java or Perl or .... It would be really hard using only bash.
You can find an interesting discussion here: http://unix.stackexchange.com/questi...y-file-content
 
Old 11-26-2012, 10:11 AM   #3
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
pan64, thank you for the link. I am going to try using dd to strip off the length bytes so I can read the following data bytes and then write them out a line at a time. If I can't get it to work in a bash script, I'll try a programming language.

Last edited by Z038; 11-26-2012 at 10:12 AM.
 
Old 11-26-2012, 01:06 PM   #4
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
Based on the examples at the link you posted, pan64, my script is almost doing what I want.

Code:
#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
#    where the first byte is the length of the data that follows, not counting the length  
#    field itself.  There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record  
#    (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while ( true ) ; do
  #
  # Get the length byte
  ((count=1)) # count of bytes to read
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
  strlen=$((0x$(<tmp1 xxd -ps)))  
  #
  # Copy the data for strlen bytes
  ((count=strlen)) # count of bytes to read
  ((skip+=4))      # start reading at plus 4 offset
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  tmp1ct=$(<tmp1 wc -c)
  (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
  echo -e "\n" >>tmp1 # add a newline at the end of the line
  cat tmp1 >>$2
  #
  ((skip=skip+count))  # increment past end of input data already processed
done 
rm tmp1
This script extracts the data portion of each input record correctly, but I'm having trouble with the inserted linefeed characters in the output file. If I omit the echo statement that adds the \n linefeed to the temporary file just before I append it to the output file on the following cat statement, I get no linefeeds at all between each record. But if I use the echo statement, I get double linefeed characters x'0A0A' at the end of each line.

How can I get just one x'0A' at the end of each record?
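
On the double linefeed: echo already appends a newline to whatever it prints, so echo -e "\n" writes two x'0A' bytes, one from the \n escape and one from echo itself. A minimal sketch of a way to append exactly one linefeed (the helper name append_lf is only illustrative and is not part of the script above):

```shell
# append_lf is a hypothetical helper name used only for this illustration.
append_lf() {
  # printf writes exactly the bytes named in its format string: a single x'0A'.
  printf '\n' >>"$1"
}

demo=$(mktemp)
printf 'record data' >"$demo"   # 11 bytes, no terminator
append_lf "$demo"
wc -c <"$demo"                  # byte count is now 11 data bytes plus 1 linefeed
rm -f "$demo"
```

A bare echo with no arguments (echo >>tmp1) should also add just one linefeed.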

Last edited by Z038; 11-26-2012 at 01:07 PM.
 
Old 11-26-2012, 03:15 PM   #5
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
I haven't figured out why using the echo statement gives me the double linefeed, but I got around it by replacing the echo with a dd command with conv=unblock. The conv=unblock causes dd to add the x'0A' to the end of the record, apparently.
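
For reference, conv=unblock makes dd treat its input as fixed-length records of cbs bytes, drop any trailing spaces from each record, and append a newline. With cbs equal to the record length, that gives the record plus one x'0A'. A small sketch of that behavior (unblock_demo is a made-up name for illustration):

```shell
# unblock_demo is a hypothetical name; it only demonstrates conv=unblock.
unblock_demo() {
  # Six input bytes with cbs=3: dd emits two 3-byte records,
  # each followed by a newline.
  printf 'ABCDEF' | dd cbs=3 conv=unblock 2>/dev/null
}
unblock_demo
```

One side effect to keep in mind: because trailing spaces are removed, a record whose data ends in spaces comes out shorter than its length field says.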

Here is the current script, which does exactly what I wanted to do.

Code:
#!/bin/bash
#
# Arguments:
# $1 is the input file name - The file has a four byte field at the start of each record
#    where the first byte is the length of the data that follows, not counting the length  
#    field itself.  There is no linefeed at the end of each record.
# $2 is the output file name - The file consists of the data portion of each input record  
#    (length field omitted) with a linefeed x'0A' added to the end of each record.
#
((skip=0)) # read bytes at this offset
while ( true ) ; do
  #
  # Get the length byte
  ((count=1)) # count of bytes to read
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  (( $(<tmp1 wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
  strlen=$((0x$(<tmp1 xxd -ps)))  
  #
  # Copy the data for strlen bytes
  ((count=strlen)) # count of bytes to read
  ((skip+=4))      # start reading at plus 4 offset
  dd if=$1 bs=1 skip=$skip count=$count of=tmp1 2>/dev/null
  tmp1ct=$(<tmp1 wc -c)
  (( tmp1ct != count )) && { echo "ERROR: Data length ($tmp1ct) does not equal ($count) at offset ($skip)" ; break ; }
  dd if=tmp1 bs=$count cbs=$count conv=unblock of=tmp2 2>/dev/null
  cat tmp2 >>$2
  #
  ((skip=skip+count))  # increment past end of input data already processed
done 
rm tmp1 tmp2
 
Old 11-26-2012, 03:28 PM   #6
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
I should mention that the way I've coded this (with many thanks to the writer of the example I worked from and to pan64 for pointing it out to me), it works for one-byte length codes only, i.e., up to x'ff' (255 decimal) bytes. That's OK for my current situation since none of my records exceed 255 bytes. I'll try to revise it to work with 2 byte lengths since I may need to handle up to 32767 byte records at some point.

Last edited by Z038; 11-26-2012 at 03:29 PM.
 
Old 11-27-2012, 02:21 AM   #7
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,629

I cannot complete it right now, but here is an idea:
nums=$(od -j $skip -A n -N 4 -t d1)
will return the 4 bytes at the given offset (skip) as decimal numbers.
You can then use something like this:
echo "$nums 256 * + 256 * + 256 * + p" | dc
to calculate the length (but the result also depends on the byte order).
 
Old 11-27-2012, 08:11 PM   #8
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
Even though the length descriptor field is four bytes long, the length should be a two byte value. The second two bytes of the descriptor are probably always zero.

Considering the system of origin for this data, I suspect it is a signed 16 bit binary number with a maximum value of x'7FFF' = 32767 decimal. It's possible that it could be an unsigned 16 bit binary number, x'FFFF' = 65535, but I doubt that.

The first record in my sample has a descriptor of '12000000'. The data that follows the descriptor is indeed 18 decimal bytes long. x'1200' is little-endian byte order. On the big-endian system it came from, it would be x'0016', in big-endian format. The length would include the length of the 4 byte RDW (record descriptor word) for a variable blocked dataset, hence x'0016' instead of x'0012'. But this file was processed with zip on the mainframe, so I believe the endianness was changed, along with the data lengths.
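
The arithmetic behind that, as a quick check: x'0016' is 22 decimal, and subtracting the 4 byte RDW leaves the 18 data bytes seen in the sample. A tiny sketch (rdw_data_len is a hypothetical helper name, and it assumes the RDW length is passed as a big-endian hex string):

```shell
# rdw_data_len is a hypothetical helper: takes an RDW length as big-endian hex
# and returns the data length with the 4 byte RDW itself subtracted.
rdw_data_len() {
  echo $(( 0x$1 - 4 ))
}
rdw_data_len 0016   # prints 18
```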

It is probably safe to treat it as either a 16 bit or 32 bit unsigned number.

When I run your example above, it hangs before the echo is executed. I need to read up on the od command.

Last edited by Z038; 11-27-2012 at 08:15 PM.
 
Old 11-28-2012, 12:36 AM   #9
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,629

Oh, yes, I forgot the filename:
nums=$(od -j $skip -A n -N 4 -t d1 filename)
For 2 bytes you need to use -N 2, I think, and
echo "$nums 256 * + p" | dc
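
A possible pitfall with -t d1: od prints each byte as a signed value, so any byte of x'80' or above comes out negative, and dc reads the leading - as a subtraction rather than a sign. A hedged sketch that uses -t u1 (unsigned bytes) and shell arithmetic instead of dc. The function name read_len16 and its argument order are assumptions for illustration, and it assumes the low-order byte comes first, as in the sample data:

```shell
# read_len16 is a hypothetical helper: read_len16 FILE OFFSET
# Prints the 2-byte little-endian length stored at byte OFFSET of FILE.
read_len16() {
  # -t u1 prints each byte as an unsigned decimal number (0..255).
  bytes=$(od -A n -t u1 -j "$2" -N 2 "$1")
  set -- $bytes                 # $1 = low-order byte, $2 = high-order byte
  [ $# -eq 2 ] || return 1      # fewer than two bytes left in the file
  echo $(( $1 + 256 * $2 ))
}
```

Called as read_len16 "$1" "$skip", it would return the record length for the descriptor at the current offset.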
 
Old 11-28-2012, 11:49 PM   #10
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
That works, both the 2 byte and the 4 byte version.

Thank you. That will allow me to handle arbitrarily long variable length records.
 
Old 11-29-2012, 11:59 PM   #11
Z038
Member
 
Registered: Jan 2006
Location: Dallas
Distribution: Slackware
Posts: 910

Original Poster
After messing around with it a lot, I never could get the od and dc commands to work right for all the lengths I encountered, but I figured out how to read the little-endian 16 bit integer from the file a byte at a time with xxd, concatenate the bytes in the normal order that bash expects (which is not little-endian for a hex value 0xnnnn), and convert to a decimal value for use in the dd commands.
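
To make the byte swap concrete, here is a small sketch of the conversion on its own (le16_from_hex is a hypothetical helper name; its arguments are the two length bytes in file order, as hex strings):

```shell
# le16_from_hex is a hypothetical helper: arguments are the two bytes in
# file order (low byte first), each as two hex digits.
le16_from_hex() {
  # A 0x constant is written high byte first, so the two bytes are swapped.
  echo $(( 0x$2$1 ))
}
le16_from_hex 12 00   # descriptor 1200 from the first sample record: prints 18
le16_from_hex ff 7f   # x'7FFF': prints 32767
```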

$1 is the input file name. The data is in EBCDIC and has the record descriptor length fields and no line terminators. The script is driven by a find command to convert all the files in a directory and concatenate them together.

Code:
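# $outfile is not set in this excerpt; it is assumed to be defined elsewhere.
((skip=0))
filelen=$(wc -c <"$1")
while [[ $skip -lt $filelen ]] ; do
  # Read the two length bytes as hex, low-order byte first
  byte0=$(xxd -l 1 -ps -c 1 -s $skip "$1")
  ((skip+=1))
  byte1=$(xxd -l 1 -ps -c 1 -s $skip "$1")
  # Swap the bytes so the hex constant reads high-order byte first
  reclen=$((0x$byte1$byte0))
  ((skip+=3))      # start reading after the four byte length field
  dd if="$1" bs=1 skip=$skip count=$reclen of=tmp1 2>/dev/null
  # Convert EBCDIC to ASCII and append the linefeed
  dd if=tmp1 bs=$reclen cbs=$reclen conv=ascii,unblock of=tmp2 2>/dev/null
  cat tmp2 >>"$outfile"
  # Increment past the copied data to point to the next length field
  ((skip=skip+reclen))
done
rm tmp1 tmp2
exit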
The above code converts the files to variable-length ASCII records with standard LF line terminators.

I ran a test on 145 files, and aside from being dreadfully slow, it worked like a champ. I suspect the slowness is from the dd command having to pass over the data in each file repeatedly in order to get to the next record. Since I have many directories needing conversion, each with thousands of files, I suppose it's time to start working on a C program to do this. Hopefully I can make that run much more quickly. Still, it's good to know how to deal with the null bytes and little-endian integer format in bash.
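
One way to cut the per-record work while staying in the shell might be to open the input once and let each dd read from the shared file position, so there is no skip, no bs=1 and no temporary file. This is an untimed sketch, not a drop-in replacement: vlr_to_lf is a hypothetical name, it assumes a regular input file with the low-order length byte first, and it copies the data bytes unchanged (conv=ascii could be added to the second dd for EBCDIC input).

```shell
# vlr_to_lf is a hypothetical name; usage: vlr_to_lf INPUT OUTPUT
vlr_to_lf() {
  src=$1
  dst=$2
  : >"$dst"                      # start with an empty output file
  while :; do
    # Read the 4-byte descriptor at the current file position as unsigned bytes.
    desc=$(dd bs=4 count=1 2>/dev/null | od -A n -t u1)
    set -- $desc
    [ $# -eq 4 ] || break        # end of file, or a truncated descriptor
    len=$(( $1 + 256 * $2 ))     # low-order byte first
    # Copy exactly len data bytes, then terminate the record with one LF.
    [ "$len" -gt 0 ] && dd bs="$len" count=1 2>/dev/null >>"$dst"
    printf '\n' >>"$dst"
  done <"$src"
}
```

Because every dd inherits the loop's standard input, each one picks up where the previous one stopped.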

Last edited by Z038; 11-30-2012 at 12:03 AM.
 
  

