Linux - Newbie
This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-tos, this is the place!
Are the blank lines all in one or a few groups, or randomly scattered through the file?
How much space do you have free, and how much of the data is to be kept?
Hi.
The data size is huge. I cannot use awk, since awk needs to redirect the output to a different file.
The size of the disk is 30 GB and the size of my data file is around 22 GB.
I need to delete all the lines whose length is zero (the blank lines).
####SAMPLE INPUT FILE######
abc
xyz
abc
fgh
bng
bjh
OK, what have you done/tried so far? And 30 GB of disk for a system that generates 22 GB files seems very slim, especially these days, when 1 TB disks are less than $75.
I suggest you look at sed, and try looking up some examples. This may work:
Code:
sed -i -e '/^$/d' filename.txt
...that removes blank lines from a file, in place.
Is an empty line defined by having nothing on it, as in TB0ne's example, or should we also expect lines that contain only invisible characters (spaces or tabs)?
Also, the following statement is in error:
Quote:
i cannot use awk since awk need to redirect the output to a different file.
As you could do the following:
Code:
awk '/./{print > FILENAME}' yourfile
It may be interesting to note that sed -i actually creates a temp file, so this may cause an issue as well.
I don't think sed will work. The -i option also creates a temporary file behind the scenes.
If space weren't an issue it would be trivial, with a whole host of options you could choose from. But as it is, I personally can't think of any solution that doesn't require either a temporary file or loading a copy into RAM (which is how most text editors work, I believe). With so little free space you'd need something that could edit the contents directly. You may need to custom-create something in a serious programming language for that.
My best suggestion, if possible, would be to get some kind of external storage to work with. An extra hard disk or a USB drive large enough to handle temporary copies that large.
Good catch, David. Not sure if it will work, but surely ANY sort of file processing will create some means of working with the file, wouldn't it? Not even sure that awk would work, but if it creates a temp file, perhaps it would be in the /tmp directory, which may be on a different partition.
Regardless, the OP is in a bad spot...a 22GB file will have to be processed somehow. OP, you could also try using vi and (in command mode)
Code:
:g/^$/ d
But, since vi creates a swap file by default, you'd have to start it with the "-n" switch, to disable that.
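So the whole session would be something like this (just a sketch; keep in mind that, if I remember right, without a swap file vim has to hold the entire buffer in memory, so a 22GB file would need at least that much free RAM):
Code:
vi -n filename.txt
:g/^$/d
:wq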
Quote:
Not sure if it will work, but surely ANY sort of file processing will create some means of working with the file, wouldn't it?
This is what worried me when I saw the question; there's a lot of hidden background space needed for just about any tool.
As above, you could write a program in e.g. Perl or C that used a combination of tell + seek to move lines back up through the file, then truncate it in situ.
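Something along those lines might look like this (a rough, untested C sketch; it assumes plain newline-terminated text, is not fast, and, like any in-place edit, would leave the file corrupted if interrupted part-way, so only try it on data you can afford to lose):
Code:
#define _FILE_OFFSET_BITS 64   /* large-file support, needed for a ~22 GB file */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Remove empty lines from a file in place: keep separate read and write
 * offsets, copy each non-empty line back toward the start of the file,
 * and truncate the file to the compacted length at the end. Needs no
 * temporary file and no more memory than one line buffer. */
int main(int argc, char *argv[])
{
    char buf[65536];
    off_t rpos = 0, wpos = 0;
    int at_bol = 1;                 /* are we at the beginning of a line? */

    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    FILE *fp = fopen(argv[1], "r+");
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);
        int blank = at_bol && buf[0] == '\n';   /* an empty line is just "\n" */

        rpos += (off_t)len;
        at_bol = (buf[len - 1] == '\n');
        if (blank)
            continue;               /* skip it: don't copy, don't advance wpos */
        fseeko(fp, wpos, SEEK_SET); /* jump back and write the kept text */
        fwrite(buf, 1, len, fp);
        wpos += (off_t)len;
        fseeko(fp, rpos, SEEK_SET); /* resume reading where we left off */
    }
    fflush(fp);
    if (ftruncate(fileno(fp), wpos) != 0)   /* cut the file to its new length */
        perror("ftruncate");
    fclose(fp);
    return 0;
}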
Here's a bit of a hack solution I scripted that may fit the requirements.
It uses tail to grab a fixed chunk off the end of the file, and then truncate to shorten the file by that same amount. Then it processes each chunk separately and reassembles the final file.
I tested it on files up to a few MiB in size, and it appears to work ok. But I can't promise it's completely safe, since it has to operate destructively on the original, without backup. It's also not fast; it seems to be taking about 1 second per MiB on my system.
Code:
#!/bin/bash
infile="infile.txt"
outfile="outfile.txt"
blocksize=1M
count=0
# process a "blocksize" block of text from the end of the file during each
# iteration of the loop, remove all the blank lines from it, and store in a temp
# file. Remove an equivalent amount from the original file at the same time.
while [[ -s $infile ]]; do
# read in a block of text, and truncate the original file by the same amount.
# pads each end with an "x" also, which will be removed later.
# this ensures that any newlines at the ends of the block are preserved.
block=$( printf 'x' ; tail -c "$blocksize" "$infile" ; printf 'x' )
truncate --size=-"$blocksize" "$infile"
# increment the block counter; it is used in the temp file names below.
(( count++ ))
# "squeeze" all newlines in the block, and print it to a tempfile.
# then remove the characters at the ends and print the result to
# a temporary file, with $count in its name.
# unwanted newlines from creeping back in.
block=$( tr -s '\n' <<<"$block" )
block=${block#x}
block=${block%x}
printf '%s' "$block" > "$outfile.$count"
done
# now cat all the tempfiles back together.
# since we started from the bottom we work
# from highest number to lowest.
while (( count )); do
cat "$outfile.$count" >> "$outfile"
rm "$outfile.$count"
(( count-- ))
done
# Finish by adding a final newline back to the file (comment out if not desired).
echo >> "$outfile"
exit 0
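To try it, you would save it as something like squeeze_blanks.sh (the name is arbitrary), set infile and outfile at the top to your real file names, and then run:
Code:
chmod +x squeeze_blanks.sh
./squeeze_blanks.sh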
I still say the best solution would be to simply get more disk space of some kind. It looks like you can get cheap 32GB thumbdrives for under US$20 now, for example.
Very nice, DavidTheH....and I agree about the additional space, too. A system that generates 22GB files certainly needs more than 30 GB of drive space.
dd can edit a file in-place, but I don't recommend using it on valuable data; if it were interrupted it would leave a corrupted file.
The test file:
1095073792 bytes with 66722 blank lines.
This overwrites the old file with the smaller, filtered data; at first, however, the file keeps its old size. To cut the file down to its new, smaller size we need to know the precise number of bytes copied by dd.
Code:
# Some dd versions tell you the number of bytes
tr -s '\n' < file | dd of=file bs=1M conv=notrunc
0+140231 records in
0+140231 records out
1095007070 bytes .... # Number of bytes through the pipe (from dd)
# Otherwise, in Bash try this
tr -s '\n' < file | tee >(wc -c >&2) | dd of=file bs=1M conv=notrunc
1095007070 # Number of bytes through the pipe (from wc)
0+133329 records in
0+133329 records out
# Truncating the file to the new size:
dd if=/dev/null seek=1095007070 of=file bs=1
0+0 records in
0+0 records out
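As an aside, if the GNU coreutils truncate command is available (it was used in David's script above), that last step could presumably be done with it instead of dd:
Code:
truncate --size=1095007070 file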
When I tried grail's awk command on a 1GB test file, the result was a 4KB file instead of the approximately 1GB file I was expecting. Also, the command terminated so quickly that it couldn't have gone through the full 1GB of data. A possible explanation is that awk reads in the first filesystem IO block of 4KB from the file, modifies it, writes it back to the file, and then truncates the file at that point. So when awk tries to read in the second IO block, it thinks it has reached the end of the file and quits.
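If anyone wants to see the effect for themselves on a throwaway file, something like this should show it (the file name and size here are only illustrative, and the exact size of the result will depend on the awk implementation and its buffer size):
Code:
seq 1 100000 | sed G > testfile    # "sed G" double-spaces the output, adding blank lines
ls -l testfile
awk '/./{print > FILENAME}' testfile
ls -l testfile                     # typically much smaller than the original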