LinuxQuestions.org


babag 05-19-2008 03:26 PM

c++ - parsing windows textfile: how to strip extra characters?
 
i've run into a problem reading a windows-generated textfile
onto my linux (mandriva 2007.1) system using c++. it took me
a long time and lots of help from the good folks here, but
i've finally figured out that the issue i've run into seems
to be one of extra, hidden characters in the original text file.

i started out by processing one variable read from the textfile
and had a lot of problems. i finally got around them by using
substr to parse only the first three characters of the line
read into my variable. that made things work.

my thinking is, however, that every line in the text file
probably has this same issue. that argues for addressing it
not at the individual variable level but at the file level:
either when the text file is first read into my program, or
by somehow preprocessing the text file before it is read.

so, there i have two ideas to pursue: preprocessing the text
file, or processing it as it is read.

i'm using vector to read the text file. how would i strip extra
characters at that stage?

alternately, how would i strip the extra characters before the
text file comes into the script?

the program is below.

thanks,
BabaG
Code:

#include <fstream>
#include <iostream>
#include <iomanip>
#include <string>
#include <vector>
#include <cassert>

using namespace std;

int main()
{
  int count = 0;

  ifstream infile("file_to_be_parsed.txt");

  if (!infile)
  {
      cerr << "Could not open file." << endl;

      return 1;
  }

  vector<string> ScriptVariables;
  string line;

  while (getline(infile, line))
  {
      ScriptVariables.push_back(line);
  }

  infile.close();

// lots of variables assigned from text file
// this is the one that's been a problem in another thread

  string capformat = ScriptVariables[8];

// perform operations

  int cr2W = 4368;
  int cr2H = 2912;

  int nefW = 3872;
  int nefH = 2592;

  double CtrX = 0;
  double CtrY = 0;

  string capformatTrimmed = capformat.substr(0,3);

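  if (capformatTrimmed == "cr2")
      {
      // assign to the outer CtrX/CtrY; redeclaring them with "double" here
      // would create new local variables and leave the outer ones at 0
      CtrX = cr2W/2.0;
      CtrY = cr2H/2.0;
      }
  else if (capformatTrimmed == "nef")
      {
      CtrX = nefW/2.0;
      CtrY = nefH/2.0;
      }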
  else
      {
        cout << "something is wrong with cr2/nef line." << endl;
      }

  cout << CtrX << endl;
  cout << CtrY << endl;

  return 0;
}


ntubski 05-19-2008 05:40 PM

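Your file probably has DOS line endings. dos2unix <filename> should fix it. If that's not installed, sed -i 's/\r$//' <filename> (GNU sed) should work too. Note that sed works one line at a time with the newline already removed, so the pattern has to match the '\r' at the end of the line rather than '\r\n'.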

From inside your program, you can remove '\r' characters from the string, using standard C++ string functions:
Code:

while (getline(infile, line))
  {
      // find the first carriage return, if any, and keep only what precedes it
      string::size_type cr_idx = line.find('\r', 0);
      if (cr_idx != string::npos) {
        ScriptVariables.push_back(line.substr(0, cr_idx));
      } else {
        ScriptVariables.push_back(line);
      }
  }


babag 05-19-2008 06:17 PM

great! thanks, man. will try as soon as i get back in front of the box
that has this stuff on it.

this program is for processing a bunch of files which have been moved
over from a windows box to a linux box. in that move i'll be also moving
the ScriptVariables.txt file. should be simple enough to run dos2unix
as a part of the bash script that moves all the files.

thanks again,
BabaG

daniel.santos 05-19-2008 06:21 PM

CRLF shouldn't be the problem
 
since you are creating your ifstream object without "ifstream::binary", it should open the file in text mode, which will automatically translate CRLF sequences into the native format, which on Linux, would be CR, although you could certainly test this theory with a bit of debug output, trying printing the value of each character and comparing?
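
One way to do the debug output suggested above, as a rough sketch: print the numeric value of every character in a line, so that anything hidden shows up. The function name dumpBytes is invented for this example. A carriage return would appear as 0d and a line feed as 0a.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render each byte of `s` as a two-digit hex value,
// separated by spaces, so hidden characters such as '\r' (0d) become visible.
std::string dumpBytes(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        if (i != 0)
            out << ' ';
        // Cast through unsigned char so bytes above 0x7f do not sign-extend.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

For the problem line in the original program, something like cout << dumpBytes(ScriptVariables[8]) << endl; would show exactly which characters were read.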

osor 05-19-2008 07:57 PM

Quote:

Originally Posted by daniel.santos (Post 3158479)
since you are creating your ifstream object without "ifstream::binary", it should open the file in text mode, which will automatically translate CRLF sequences into the native format, which on Linux, would be CR, although you could certainly test this theory with a bit of debug output, trying printing the value of each character and comparing?

Close, but no cigar. Since the file is opened in text mode, ‘\n’ is automatically translated to and from the line-terminating format native to the machine running the code. On Linux, this is just LF, whereas on Windows it is CRLF. So on Linux, a getline (which reads until a ‘\n’ is encountered) will only match the LF, and if the file happens to have a CR, it is simply left as the last character of the string that getline returns (it is the second-to-last character of the line in the file, just before the LF). On Windows, on the other hand, a getline will match CRLF as ‘\n’ and the expected behavior will occur.
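
The behavior described here can be illustrated with a small sketch. It assumes a std::istringstream as a stand-in for the file, which performs no line-ending translation (the same as text mode on Linux). The helper name firstLine is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: read the first line of `fileContents` with getline.
// The string stream does no CRLF translation, so only the LF is consumed
// as the delimiter and any CR before it stays in the returned string.
std::string firstLine(const std::string& fileContents)
{
    std::istringstream in(fileContents);
    std::string line;
    std::getline(in, line);
    return line;
}
```

With the input "cr2\r\nnef\r\n", firstLine returns the four-character string "cr2\r", which does not compare equal to "cr2". That is why the comparison in the original program only succeeded after substr(0,3) was applied.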


All times are GMT -5.