LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Gap spaces with URL's in incoming batch files (https://www.linuxquestions.org/questions/linux-newbie-8/gap-spaces-with-urls-in-incoming-batch-files-4175650057/)

soph75 03-12-2019 05:29 PM

Gap spaces with URL's in incoming batch files
 
Hello, I'm running into an issue with incoming data using oracle SQL Loader utility in Linux server where some URL's are breaking or putting space then putting in the next line(See samples below). I need some input with a script or a command to identify all the gaps within the offended URL's that was tied to in the previous line then reattach them. I appreciate all the help.



Data Samples
==========
2019-03-08 10:05:51~,~104799~,~104799_650_27893416~,~3937565~,~MB~,~1225801.4393d~,~2019-03-05 18:42:36~,~{AC1903BE-01694EBA53C4-1C1E-0E4BB4C2}~,~51.148.67.167~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...~Cancún,
Mexico*~,~~,~R~,~Windows 7~,~Personal computer


2019-03-08 14:03:56~,~104799~,~104799_672_27893416~,~8192687~,~MB~,~1225801.a5290~,~2019-03-05 17:15:19~,~{AC1903BE-01694EBA53C4-792D-670F55C8}~,~178.197.233.123~,~Mobile Safari 12.0~,~https://www.intervalworld.com/web/cs...~,~Countryside and Safari, South
Africa~,~~,~R~,~iOS~,~Smartphon



Thanks,
Soph

berndbausch 03-12-2019 07:12 PM

From this output, it is impossible to know where a URL ends. For example, it looks like “and Safari” is part of the URL, but what about the comma and South Africa? Or does the tilde after “South Africa” belong to the URL?

To approach this question, one would need to know how is this output generated. Do the URLs come from a database field? If so, access the database and convert the string in that field to an HTML-safe string.

soph75 03-13-2019 01:53 PM

This is not an issue with URL structure... the tilde comma are field separators in control file to load an oracle table. The data must be fixed in the source prior appending the load but it's not coming in one line format.

URL 1 ==> https://www.intervalworld.com/web/cs...cute;n,Mexico*
URL 2 ==> ~https://www.intervalworld.com/web/cs...~,~Countryside and Safari, South Africa


<code>
2019-03-11 23:15:09~,~104799~,~104799_706_27893416~,~2575383~,~MB~,~1225801.319dd~,~2019-03-05 20:33:33~,~~,~174.214.33.47~,~Mobile Safari~,~~,~~,~~,~V~,~iOS~,~Smartphone
2019-03-12 06:13:33~,~104799~,~104799_276_27893416~,~3551959~,~MB~,~1225801.31a5b~,~2019-03-05 20:07:12~,~~,~97.84.173.49~,~unknown~,~~,~~,~~,~O~,~unknown~,~unknown
2019-03-12 13:02:53~,~104799~,~104799_619_27893416~,~2511880~,~MB~,~1225801.31c68~,~2019-03-05 18:23:09~,~~,~45.20.108.102~,~Apple Mail~,~~,~~,~~,~V~,~OS X~,~Personal computer
2019-03-12 20:10:53~,~104799~,~104799_654_27893416~,~8724550~,~MB~,~1225801.31d49~,~2019-03-05 18:48:51~,~~,~58.95.164.188~,~IE 10.0~,~~,~~,~~,~V~,~Windows~,~Personal computer
2019-03-12 10:22:26~,~104799~,~104799_706_27893416~,~2248496~,~MB~,~1225801.31dd5~,~2019-03-05 18:55:19~,~~,~174.226.15.147~,~unknown~,~~,~~,~~,~V~,~unknown~,~unknown
2019-03-12 18:41:47~,~104799~,~104799_617_27893416~,~5982090~,~MB~,~1225801.32060~,~2019-03-05 18:05:59~,~~,~66.249.88.167~,~Gmail image proxy~,~~,~~,~~,~O~,~Linux~,~Other
2019-03-12 18:41:57~,~104799~,~104799_617_27893416~,~5982090~,~MB~,~1225801.32060~,~2019-03-05 18:05:59~,~~,~66.249.88.161~,~Gmail image proxy~,~~,~~,~~,~V~,~Linux~,~Other
2019-03-12 11:33:14~,~104799~,~104799_705_27893416~,~5782600~,~MB~,~1225801.32074~,~2019-03-05 18:06:02~,~~,~73.27.90.108~,~Chrome 58.0~,~~,~~,~~,~V~,~Windows~,~Personal computer
2019-03-12 10:37:26~,~104799~,~104799_706_27893416~,~8908667~,~MB~,~1225801.328d1~,~2019-03-05 19:52:34~,~~,~4.79.163.189~,~Chrome Mobile 72.0~,~~,~~,~~,~V~,~Android~,~Smartphone
2019-03-12 10:37:58~,~104799~,~104799_706_27893416~,~8908667~,~MB~,~1225801.328d1~,~2019-03-05 19:52:34~,~~,~4.79.163.189~,~Chrome Mobile 72.0~,~~,~~,~~,~V~,~Android~,~Smartphone
2019-03-12 15:00:35~,~104799~,~104799_706_27893416~,~8908667~,~MB~,~1225801.328d1~,~2019-03-05 19:52:34~,~{AC1903BE-01694EBA5390-5F2B-0C91962F}~,~96.2.247.112~,~Chrome Mobile 63.0~,~https://www.intervalworld.com/web/cs...ntent~,~Spring to the Bahamas With Savings and Free Airport
Transfers!~,~~,~R~,~Android~,~Smartphone
</code>

scasey 03-13-2019 03:20 PM

OP If you would put your output in [code] tags, it would preserve the formatting so we could see exactly what the data looks like...please. You can edit your previous posts to do that.

How are you looking at these data?
What is the difference between the two files? ...that is what happened to make them different?
How are they generated?
Do you have control of their generation?

You're asking for help in fixing the problem, but you're not sharing anything about what you think is causing the problem. :)

soph75 03-13-2019 04:12 PM

The data is coming from a feedback loop mechanism where a campaign was originally deployed from "E-Mail Marketing system" - Then an ETL process is kicked off to capture the feedback via a flat file using Oracle SQL Loader utility to load a target table. Not sure if the sample was properly attached in the previous but that's a good sample.

How are you looking at these data?
vi command to open the flat file.

What is the difference between the two files? ...that is what happened to make them different?
There's only one file.

How are they generated?
SHLIB_PATH=;sqlldr userid=$$USER[Output]/$$PASSWORD[Output]@$$SERVER[Output] control=$$INSTALLROOTDIR/EM/metadata/trans.ldr log=trans.log bad=trans.bad data=$$INSTALLROOTDIR/EM/logs/trans.log

Do you have control of their generation?
No control of the file generation.

soph75 03-13-2019 04:28 PM

I need a script or command to fix the trans.log file - In sample bad line as you can see it's inputting space before "Spain~,~~,~R~,~iOS~,~Smartphone".



Sample Bad line
============
https://www.intervalworld.com/web/cs...te=~,~Balearic Islands,
Spain~,~~,~R~,~iOS~,~Smartphone


Desired output
===========
https://www.intervalworld.com/web/cs...te=~,~Balearic Islands, Spain~,~~,~R~,~iOS~,~Smartphone

scasey 03-13-2019 04:39 PM

Quote:

Originally Posted by soph75 (Post 5973543)
The data is coming from a feedback loop mechanism where a campaign was originally deployed from "E-Mail Marketing system" - Then an ETL process is kicked off to capture the feedback via a flat file using Oracle SQL Loader utility to load a target table. Not sure if the sample was properly attached in the previous but that's a good sample.

How are you looking at these data?
vi command to open the flat file.

What is the difference between the two files? ...that is what happened to make them different?
There's only one file.

How are they generated?
SHLIB_PATH=;sqlldr userid=$$USER[Output]/$$PASSWORD[Output]@$$SERVER[Output] control=$$INSTALLROOTDIR/EM/metadata/trans.ldr log=trans.log bad=trans.bad data=$$INSTALLROOTDIR/EM/logs/trans.log

Do you have control of their generation?
No control of the file generation.

Thanks for trying with the code tags, but they are square brackets, not < >, or use the # button after highlighting what you want to include.

I don't use vi much, but I'm pretty sure that long lines get wrapped. That also happens with more and less, so it's possible that what you're seeing (and pasting) is not what's actually in the file.

try opening with nano, which doesn't wrap lines by default.
Or use cat -E which will put a $ at the end of each line, so wrapped lines look like
Code:

The directory needs to be writable by the web server user.  For example, if the web server
user is <TT>httpd</TT> and your document root is <tt>/home/httpd/htdocs</tt>, you could create the
directory like this:$

(emphasis added)
Although, when I pasted that from my terminal screen, it ended up in the code tags as all one line...I had to fudge it to make it look like it did in the terminal.

My point is that I'm not sure we know that the lines are broken up as you've posted them.
Do the loads into the database fail?

soph75 03-14-2019 12:10 PM

Thanks for cat -E command - The snippet below shows that the bad file error is breaking at ( Greece*~,~~,~R~,~Windows~,~Personal computer$) which belongs to previous line block (2019-03-12 02:12:46~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA53C4-66CE-58468261}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...oDate=~,~Crete and Corfu,$)

Output:
2019-03-12 00:09:14~,~104799~,~104799_652_27893416~,~6784503~,~MB~,~1225801.289b6~,~2019-03-05 18:19:08~,~~,~138.217.176.176~,~Thunderbird 60.5~,~~,~~,~~,~V~,~Windows~,~Personal computer$
2019-03-12 02:12:46~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA53C4-66CE-58468261}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...oDate=~,~Crete and Corfu,$
Greece*~,~~,~R~,~Windows~,~Personal computer$
2019-03-12 02:10:33~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~~,~121.220.144.115~,~Outlook 2010~,~~,~~,~~,~V~,~Windows 8~,~Personal computer$
2019-03-12 02:10:56~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA5390-7819-3BD1C3A6}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...6toDate=~,~Las Vegas, Nevada~,~~,~R~,~Windows~,~Personal computer$


The ETL load is failing every time it encounters records ( Greece*~,~~,~R~,~Windows~,~Personal computer$) since the control file doesn't align with number of fields in the output.

scasey 03-14-2019 02:37 PM

Quote:

Originally Posted by soph75 (Post 5973834)
Thanks for cat -E command - The snippet below shows that the bad file error is breaking at ( Greece*~,~~,~R~,~Windows~,~Personal computer$) which belongs to previous line block (2019-03-12 02:12:46~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA53C4-66CE-58468261}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...oDate=~,~Crete and Corfu,$)

Output:
2019-03-12 00:09:14~,~104799~,~104799_652_27893416~,~6784503~,~MB~,~1225801.289b6~,~2019-03-05 18:19:08~,~~,~138.217.176.176~,~Thunderbird 60.5~,~~,~~,~~,~V~,~Windows~,~Personal computer$
2019-03-12 02:12:46~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA53C4-66CE-58468261}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...oDate=~,~Crete and Corfu,$
Greece*~,~~,~R~,~Windows~,~Personal computer$
2019-03-12 02:10:33~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~~,~121.220.144.115~,~Outlook 2010~,~~,~~,~~,~V~,~Windows 8~,~Personal computer$
2019-03-12 02:10:56~,~104799~,~104799_652_27893416~,~6508427~,~MB~,~1225801.289fa~,~2019-03-05 18:19:20~,~{AC1903BE-01694EBA5390-7819-3BD1C3A6}~,~121.220.144.115~,~Chrome 72.0~,~https://www.intervalworld.com/web/cs...6toDate=~,~Las Vegas, Nevada~,~~,~R~,~Windows~,~Personal computer$


The ETL load is failing every time it encounters records ( Greece*~,~~,~R~,~Windows~,~Personal computer$) since the control file doesn't align with number of fields in the output.

It would be helpful to both us and you if you'd use [code] tags when posting output or file contents, please. Besides showing exactly what's in the file, the URLs wouldn't be changed into links, which confuses things even more.

I'd go back to the ETL process. I'm not sure what they're trying to accomplish using ~,~ as delimiters, since the delimiter still contains a comma, so it becomes a challenge when the data also contains a comma, unless each field is quoted...which might also be an issue.

Although it is interesting the problem doesn't appear to always occur when there's a comma in the field.

In any event, if the ETL process is breaking the record across lines, then that's where the problem needs to be fixed. Something is broken in the Transform.

The last ETL shop I worked in used the pipe character (|) as a delimiter, which mostly worked as long as someone didn't enter a pipe in the data...that didn't happen often, but when it did, we'd have to open the file in a text editor and remove the bogus pipe...manually...and we'd have to get the source data changed. Also manually.

I never understood why they just didn't use tab for a delimiter. Our data was all coming from forms, and it's nearly impossible to key a tab in a form, since pressing tab just move the cursor to the next field. But that's an off-topic rant, for which I apologize.

Please. Post your data in code tags so we can see what's actually there. Thank you.


All times are GMT -5. The time now is 09:50 AM.