LinuxQuestions.org


TheChronicScribbler 01-22-2015 07:16 AM

HTML source obtained through sockets in C showing unreadable characters
 
I wrote the following code to retrieve HTML source and write it to both a file and the terminal for viewing. But the output in both the terminal and the file shows a few unreadable characters.



int main(int argc, char **argv)
{
struct addrinfo hints;
struct addrinfo *results;
int ret, sockfd;
char buffer[512], resource[512];
FILE *outfile;

if (argc != 3) {
fprintf(stderr, "Not enough arguments to go forward!!\n");
return 1;
}


hints.ai_family = AF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
hints.ai_flags = AI_PASSIVE;


if ((ret = getaddrinfo(argv[1], argv[2], &hints, &results)) != 0) {
fprintf(stderr, "getaddrinfo() error\n");
return 2;
}

if ((sockfd = socket(results->ai_family, results->ai_socktype, results->ai_protocol)) < 0) {
fprintf(stderr, "Socket not made\n");
return 4;
}

if(connect(sockfd, results->ai_addr, results->ai_addrlen) < 0) {
fprintf(stderr, "Connection not established\n");
return 5;
}

printf("The News Feed URL : ");
scanf("%s", resource);
sprintf(buffer, "GET %s HTTP/1.1\nHost:%s\n\n", resource, argv[1]);

printf("%s", buffer);

if ((ret = write(sockfd, buffer, strlen(buffer))) < 0) {
fprintf(stderr, "Write Failed\n");
return 3;
}

printf("Request Sent\n");

outfile = fopen("rssout.txt", "w");

while(1) {
if((ret = read(sockfd, buffer, sizeof(buffer))) <= 0) {
printf("Read Error OR Connection Closed\n");
break;
}

fprintf(outfile, "%s", buffer);

}

fclose(outfile);
close(sockfd);
printf("\n\nALL IS WELL THAT ENDS WELL\n\n");
freeaddrinfo(results);


rssfeed();

return 0;
}


For example: </authhá{¹or>

What is the problem? How do I correct it?

sundialsvcs 01-22-2015 07:33 AM

The characters are probably Unicode (multi-byte characters), and maybe your terminal-window settings are not set to display those characters properly.

See for example http://earthwithsun.com/questions/55...rtual-terminal ...

TheChronicScribbler 01-22-2015 07:45 AM

But even when I write directly into a file and open it with gedit, it still shows these unreadable characters. Also, I tried copying the HTML source directly to the terminal (to check whether these characters are supported), and they displayed fine. The problem seems to appear only when I use sockets to get the source data.

rtmistler 01-22-2015 08:11 AM

Please consider editing your first post to place your code within [code][/code] tags. There's a link in my signature which shows information on how to do that if you aren't sure. It makes the code more readable and helps people to assist you.

As for what these characters are, one thing I'd do is compile with debugging on, enter the debugger, set a breakpoint at an appropriate place, and examine the characters. There you can see the hex or binary data as it is stored in memory, and you can use something like an ASCII table reference to determine what the characters are and whether they're control characters or something else.

Assuming you're using C and GCC, here's a very brief example of my point:

A general sample program
Code:

#include <stdio.h>
#include <string.h>

static char data_1[8] = "01234567";
static char data_2[8] = "\r\n\r\n\r\n\r\n";

int main(int argc, char **argv)
{
    char my_array[8];

    memset(my_array, 0, sizeof(my_array));

    memcpy(my_array, data_1, sizeof(my_array));

    printf("First iteration: %s\n", my_array);

    memcpy(my_array, data_2, sizeof(my_array));

    printf("Second iteration: %s\n", my_array);

    return -1;
}

What this code does is start with an uninitialized array, clear it completely, copy the first static string "01234567" into the array and print it, then copy the next static string, which is a few carriage returns and line feeds, into the array and print that, then exit. (Strictly speaking, the 8-byte arrays leave no room for a terminating NUL, so printing them with %s is not guaranteed to stop after 8 characters; %.8s would be the safe form.) I compile it using the -ggdb flag and then run it under gdb to examine it. You can use the same commands and techniques to debug your own program.
Code:

$ gcc -ggdb -o sample sample.c
$ gdb sample
GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-linux-gnu".
For bug reporting instructions, please see:
<http://bugs.launchpad.net/gdb-linaro/>...
Reading symbols from /home/testcode/sample...done.
(gdb) b sample.c:13
Breakpoint 1 at 0x8048461: file sample.c, line 13.
(gdb) r
Starting program: /home/testcode/sample

Breakpoint 1, main (argc=1, argv=0xbffff3a4) at sample.c:13
13            memcpy(my_array, data_1, sizeof(my_array));
(gdb) p my_array
$1 = "\000\000\000\000\000\000\000"
(gdb) x/8b my_array
0xbffff2f4:        0        0        0        0        0        0        0        0
(gdb) s
15            printf("First iteration: %s\n", my_array);
(gdb) s
First iteration: 01234567
17            memcpy(my_array, data_2, sizeof(my_array));
(gdb) p my_array
$2 = "01234567"
(gdb) x/8b my_array
0xbffff2f4:        48        49        50        51        52        53        54        55
(gdb) s
19            printf("Second iteration: %s\n", my_array);
(gdb) p my_array
$3 = "\r\n\r\n\r\n\r\n"
(gdb) x/8b my_array
0xbffff2f4:        13        10        13        10        13        10        13        10
(gdb) s
Second iteration:




21            return -1;
(gdb)
22        }
(gdb) quit

So you can see that in what I call the second iteration, the characters are printed but not visible; the data is there when you examine memory. Similarly, while the array was memset to all zeros, the string was empty, so printing it would show nothing. And if the data started with a 0x00 but continued further, printing the string would still show nothing, because the NUL terminator sits at the start of the string.
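
To make the 0x00 point concrete, here is a small sketch. The helper name printable_prefix_len is made up for illustration and is not from any library. It reports how many bytes a %s-style print would actually emit from a buffer: everything up to the first 0x00, no matter how much data follows it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes a "%s"-style print would emit from
   buf, i.e. everything before the first 0x00, or all len bytes if the
   buffer contains no 0x00 at all. */
size_t printable_prefix_len(const char *buf, size_t len)
{
    assert(buf != NULL);

    const char *nul = memchr(buf, '\0', len);

    return nul ? (size_t)(nul - buf) : len;
}
```

If this returns less than the number of bytes you know you received, the rest of the data is in memory but a string print will never show it. If it returns the full length, the buffer holds no terminator at all, and a plain %s print would run past the end of the data.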

Another technique, if you don't like the debugger or are running a system with a variety of co-dependent programs where inline debugging is difficult, is to use a logger or console output. Anticipate that the data in the array may not all be visible or printable, and have a log/output utility that converts everything to hex. For instance, "A" is 0x41, which prints as a capital A, so no big deal. But carriage return is 0x0d and you don't "see" that: you see a newline, or worse, the output returns to the start of the line and starts overwriting what was already printed. If you just say "for zero to N-1, print each character in hex", you'll see what the entire buffer contains and can debug from that output.
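
A minimal sketch of that "print each character in hex" idea might look like the following. The names hex_format and hex_log are hypothetical and chosen only for this example, and the 16-bytes-per-line layout is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format len bytes of buf into dst as space-separated
   two-digit hex values. dst_size must be at least 3 * len (two digits plus
   a separator or the final NUL per byte). Returns the number of characters
   written, not counting the terminating NUL. */
size_t hex_format(char *dst, size_t dst_size, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t used = 0;
    size_t i;

    assert(dst != NULL && dst_size > 0);
    assert(dst_size >= 3 * len);

    dst[0] = '\0';
    for (i = 0; i < len; i++) {
        used += (size_t)snprintf(dst + used, dst_size - used, "%s%02x",
                                 i ? " " : "", (unsigned)p[i]);
    }
    return used;
}

/* Hypothetical helper: log len bytes of buf to out in hex, 16 per line. */
void hex_log(FILE *out, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    char line[16 * 3];
    size_t off;

    for (off = 0; off < len; off += 16) {
        size_t n = (len - off < 16) ? len - off : 16;

        hex_format(line, sizeof line, p + off, n);
        fprintf(out, "%s\n", line);
    }
}
```

In the original program you could call something like hex_log(stderr, buffer, ret) right after each successful read(), passing the byte count that read() returned rather than sizeof(buffer), so that only the bytes actually received are dumped.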

NevemTeve 01-22-2015 09:08 AM

Just download the same file with wget, and compare the two files.

SoftSprocket 01-22-2015 09:45 AM

This:
Code:

while(1) {
if((ret = read(sockfd, buffer, sizeof(buffer))) <= 0) {
printf("Read Error OR Connection Closed\n");
break;       
}

doesn't make sense as written. You keep looping and overwriting your buffer until the connection is closed, and read() does not NUL-terminate what it stores, so when you print the buffer with "%s" there could be anything in it. Fix that first.

Off the top of my head:
Code:


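#include <errno.h>
#include <stdio.h>
#include <unistd.h>

char buffer[512];

ssize_t num_read;                   // read() returns ssize_t; -1 signals an error
size_t num_to_read = sizeof buffer;
char* pbuf = buffer;

// Keep reading until the buffer is full, EOF is reached, or an error occurs.
while (num_to_read > 0) {
    num_read = read (sockfd, pbuf, num_to_read);

    if (num_read == 0) {
        printf ("EOF\n");
        break;
    }

    if (num_read < 0) {
        if (errno == EINTR) {
            continue;               // interrupted by a signal, try again
        }

        perror ("read");
        // error handling here
        break;
    }

    num_to_read -= (size_t) num_read;
    pbuf += num_read;
}

// Only the first (pbuf - buffer) bytes of buffer hold valid data.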
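
Building on that, here is a rough sketch of how the receive loop could hand the data to the output file without ever treating the buffer as a C string. The function name copy_fd_to_stream is made up for this example. It assumes a connected descriptor such as sockfd and an already opened FILE such as outfile from the original program. The key difference from fprintf(outfile, "%s", buffer) is that fwrite() is told exactly how many bytes read() returned.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy everything readable from fd to out until EOF.
   Returns the total number of bytes copied, or -1 on a read or write
   error. */
long copy_fd_to_stream(int fd, FILE *out)
{
    char buffer[512];
    long total = 0;

    assert(out != NULL);

    for (;;) {
        ssize_t n = read(fd, buffer, sizeof buffer);

        if (n == 0) {
            break;                  /* peer closed the connection */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal, try again */
            }
            return -1;
        }

        /* Write exactly the n bytes received. The buffer is not
           NUL-terminated, so "%s" must not be used on it. */
        if (fwrite(buffer, 1, (size_t)n, out) != (size_t)n) {
            return -1;
        }
        total += n;
    }
    return total;
}
```

With this in place, the while(1) loop in the original main() could be replaced by a single call along the lines of copy_fd_to_stream(sockfd, outfile), followed by a check for a negative return value.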



All times are GMT -5.