LinuxQuestions.org
Old 07-15-2015, 02:51 AM   #1
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Rep: Reputation: Disabled
intermittent problem with ssh and apache


I have an embedded Ubuntu 14.04 image that I access over ssh, and it also runs apache serving a site.

Sometimes, apparently at random, I get a strange issue in which any page I try to access on apache returns a Forbidden error.

Simultaneously, if I am connected over ssh, I am unable to execute any command, even if I am in 'sudo su' root mode. All commands return 'No such file or directory' even if I specify the full command path. Interestingly, a few commands such as pwd and echo do work; others such as ls, reboot, cat, ping, ... all give that error.

If I am not already connected over ssh, trying to connect doesn't work.

It seems as if all of a sudden there is a general loss of permissions. I'm not sure if it eventually recovers automatically, as I always end up doing a power cycle.

I know this is quite generic but that is all I know. syslog doesn't seem to indicate anything abnormal.
 
Old 07-15-2015, 03:36 AM   #2
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
pwd and echo are built into the shell, whereas cat, ls, ping and so on are executables residing in files. That fact could be the start of a clue. Something may be wrong with the root directory. Or the loader, typically residing at /lib/ldblabla.so or perhaps under /lib64, is inaccessible. Or the loader's config and cache, ld.so.conf and ld.so.cache, are corrupted.

It might help if you could provide:
  • precise commands and error messages
  • context: when it happens, does it go away on its own? Do you need to log out? Reboot?
  • do you log on, work normally, then suddenly commands don't work?
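
If a shell is still open while the fault is present, the builtins that keep working are enough for some basic inspection. One relevant fact: running a binary that exists can still fail with 'No such file or directory' when the loader named inside the binary cannot be found. The sketch below is only an illustration; bls and bcat are hypothetical helper names that stand in for ls and cat using nothing but globbing and the read and printf builtins.

```shell
# List a directory using only globbing and the echo builtin (no ls needed).
# $1 = directory (default: current directory)
bls() {
    for entry in "${1:-.}"/*; do
        if [ -e "$entry" ]; then
            echo "$entry"
        fi
    done
}

# Print a file using only the read and printf builtins (no cat needed).
# $1 = file to print
bcat() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

For example, `type pwd ls` reports which names are builtins, `bls /lib` lists the directory where the loader usually lives, and `bcat /etc/ld.so.conf` prints the loader config. On a 32-bit ARM hard-float Ubuntu image the loader is normally /lib/ld-linux-armhf.so.3, so `[ -e /lib/ld-linux-armhf.so.3 ] && echo present` is a quick check; that path is an assumption about this particular image.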
 
Old 07-15-2015, 04:54 AM   #3
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
Hi Bernd, thanks for your reply.

To put you in the picture, this is a custom-designed i.MX53 board running a 32-bit Ubuntu 14.04.1 LTS image and a 3.16.2-armv7 kernel. The board has a LAN interface and also a pppd connection over a cell modem. I am accessing it using an ssh connection over the LAN interface. The board is constantly running in a semi-remote location and I only have network (LAN interface) access to it.

So basically the context in which this happens is that I have this running 24/7 and sometimes (to me it appears randomly) I lose access to it. It seems to happen every couple of weeks, but sometimes more often. I can ping it fine, and I can tell some custom processes are still running because I can see it poll an external modbus device, but when I try to access any page via the webserver I get a Forbidden error:

Quote:
Forbidden

You don't have permission to access / on this server.

Apache/2.4.7 (Ubuntu) Server at 192.168.1.20 Port 443

I sometimes leave an ssh connection active for a few days. In that case, when the issue happens, all commands return 'No such file or directory', as I said previously. That is the only error message I get for any command I try. Good point that pwd and echo are built into the shell. I will check the loader's cache; maybe it is being corrupted, as you say.

If I happen not to be connected over ssh when this happens, I simply cannot connect; the attempt fails with this error:

Quote:
$ ssh ubuntu@192.168.1.20
Read from socket failed: Connection reset by peer
To get the system back into normal operation I need to get someone to power cycle it. It sometimes takes a few days until someone can do that, and the system has never recovered during that time, except once when it seemed to recover by itself after 2 days (and uptime showed it had not rebooted). I'm not sure whether some other factor was involved in that particular case; in general it doesn't seem to recover by itself.

I cannot really figure out a sequence of events that leads to the issue. At the moment I have it in this condition, but unfortunately I have just lost the ssh connection to it (I was connected via TeamViewer to a Windows PC on the remote network and ssh'd to the board from there, and stupid Windows rebooted to install updates!!!)

I have a suspicion (although not a solid one) that the pppd connection has something to do with this. Meanwhile I have the 'same' image running on an identical local board with an identical setup, and I have never managed to replicate the issue there. The images are not exactly identical, but both started off from the same image and have had the same changes from there onwards. Also, I don't have any indication that it is hardware related.

After the power cycle it works perfectly normally for several days without any problem. I can log in over ssh and everything is normal, then it just happens all of a sudden. Normally it happens while I am not using it: I leave it working fine and the next day I find ssh still connected but unusable, or I can no longer connect over ssh.

Thanks!

Last edited by grgsaliba; 07-15-2015 at 04:58 AM. Reason: added further answer
 
Old 07-15-2015, 06:43 AM   #4
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
That's an interesting troubleshooting challenge. I wonder what happens to your ssh connection. Does it close the connection because it can't spawn a shell?

A few general ideas, most related to logging.
When you can't connect, use ssh -vvv. That produces an ocean of debug messages and perhaps a clue as to what's wrong with the ssh server.
On the server side, you can run a second (or third, ...) sshd in parallel, listening on a different port, with debug output enabled (the server-side flag is -d, up to -ddd; -vvv is for the client) and its messages collected in a log file. When you can't connect, do all the sshds refuse the connection?
How about cranking up the Apache log level?
I suppose you have checked whether there is a time correlation between all these problems. I see three problem domains: commands not working, unable to connect, web server.
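
A second sshd with debug logging could be started along these lines. This is a sketch under assumptions: port 2222, the log and pid file paths, and the helper name debug_sshd_cmd are placeholders, and the -E option needs OpenSSH 6.3 or later. The helper only prints the command so it can be reviewed before being run as root.

```shell
# Build the command line for an extra sshd instance that logs at DEBUG3.
# $1 = port (default 2222), $2 = log file (default /tmp/sshd-debug.log)
# The pid file path is a placeholder chosen to avoid clobbering the main sshd's.
debug_sshd_cmd() {
    port=${1:-2222}
    log=${2:-/tmp/sshd-debug.log}
    echo "/usr/sbin/sshd -p $port -o LogLevel=DEBUG3 -o PidFile=/var/run/sshd-debug.pid -E $log"
}

# Print the command for review; nothing is started here.
debug_sshd_cmd 2222 /var/log/sshd-debug.log
```

sshd has to be invoked by its absolute path. Once the extra instance is running, `ssh -vvv -p 2222 ubuntu@192.168.1.20` from the client exercises it, and the log file shows the server's side of the exchange.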
 
Old 07-15-2015, 06:52 AM   #5
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
ssh -vvv log:

Quote:
$ ssh -vvv ubuntu@192.168.1.20
OpenSSH_6.7p1, OpenSSL 1.0.1k 8 Jan 2015
debug2: ssh_connect: needpriv 0
debug1: Connecting to 192.168.1.20 [192.168.1.20] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/George/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.7
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.6.1p1 Ubuntu-2ubuntu2
debug1: match: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 pat OpenSSH_6.6.1* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug3: load_hostkeys: loading entries for host "192.168.1.20" from file "/home/George/.ssh/known_hosts"
debug3: load_hostkeys: found key type ECDSA in file /home/George/.ssh/known_hosts:2
debug3: load_hostkeys: loaded 1 keys
debug3: order_hostkeyalgs: prefer hostkeyalgs: ecdsa-sha2-nistp256-cert-v01@openssh...01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521
debug1: SSH2_MSG_KEXINIT sent
Read from socket failed: Connection reset by peer
With regard to time correlation, all 3 issues happen at the same time; I am sure of that.
 
Old 07-15-2015, 07:13 AM   #6
berndbausch
LQ Addict
 
Registered: Nov 2013
Location: Tokyo
Distribution: Mostly Ubuntu and Centos
Posts: 6,316

Rep: Reputation: 2002
Well at least we can see some negotiation going on between client and server. I am pretty certain the sshd gives up when it can't exec a program like the shell. Perhaps the sshd's error message will provide some info.
 
Old 07-15-2015, 07:23 AM   #7
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
Yes, I think sshd is unable to exec, just like I am unable to run commands when I happen to be connected over ssh. Both apache and sshd are still running. Apache even serves the SSL certificate properly and produces the Forbidden error message. The response time is also very fast, so there isn't some process hogging the resources. I don't think the issue is caused by sshd or apache; these are just side effects. It would seem to be related to some form of random corruption, maybe memory corruption. However, I have no indication of memory hardware issues, and the system was pretty much idle in the past couple of days, so there wasn't much going on. The system has 2GB RAM and at the moment it was only using around 128MB with little variation. There are also no other issues indicating memory problems even when the system is loaded.

I'll check apache log when I get access to it and post some snippets. It should log the reason why it forbids access.
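
Since syslog stays silent while the fault is present, one option is to have a shell that was started beforehand notice the failure and record some state by itself. The sketch below is only an illustration: exec_ok and snapshot are hypothetical helper names, the list of /proc files is a guess at what might be informative, and /tmp/exec-failure.log is a placeholder path. The snapshot uses builtins only, so it does not depend on being able to start another program.

```shell
# Return 0 if an external program can still be started, non-zero otherwise.
# $1 = program to try (default /bin/true)
exec_ok() {
    "${1:-/bin/true}" >/dev/null 2>&1
}

# Append the contents of a few /proc files to a log, using builtins only,
# so the snapshot still works when nothing can be exec'd.
# $1 = log file to append to
snapshot() {
    for f in /proc/uptime /proc/loadavg /proc/mounts /proc/net/route; do
        [ -r "$f" ] || continue
        printf '%s\n' "--- $f"
        while IFS= read -r line; do
            printf '%s\n' "$line"
        done < "$f"
    done >> "$1"
}

# Example: probe once and record the state if exec is broken.
if ! exec_ok; then
    snapshot /tmp/exec-failure.log
fi
```

The probe has to run from a shell that is already alive when the fault starts, for instance an ssh session left open, because a new shell cannot be spawned once exec fails.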
 
Old 07-15-2015, 09:16 AM   #8
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
I managed to have the board power cycled. Here are the interesting points:

apache2 error.log just before the issue occurred (I think):

Quote:
ppp0: error fetching interface information: Device not found
ppp0: error fetching interface information: Device not found
ppp0: error fetching interface information: Device not found
ppp0: error fetching interface information: Device not found
[Tue Jul 14 16:00:06.282292 2015] [ssl:warn] [pid 1083] AH01909: RSA certificate configured for 127.0.1.1:443 does NOT include an ID which matches the server name
[Tue Jul 14 16:00:06.853672 2015] [ssl:warn] [pid 1091] AH01909: RSA certificate configured for 127.0.1.1:443 does NOT include an ID which matches the server name
[Tue Jul 14 16:00:06.909609 2015] [mpm_prefork:notice] [pid 1091] AH00163: Apache/2.4.7 (Ubuntu) PHP/5.5.9-1ubuntu4.4 OpenSSL/1.0.1f configured -- resuming normal operations
[Tue Jul 14 16:00:06.909838 2015] [core:notice] [pid 1091] AH00094: Command line: '/usr/sbin/apache2'
Not sure why those ppp0 lines are there.

access.log has no entry for the duration of the issue.

The last activity in syslog before the issue occurred is also related to pppd, but everything seems normal:

Quote:
Jul 14 15:55:21 arm pppd[1278]: pppd 2.4.5 started by root, uid 0
Jul 14 15:55:22 arm chat[1289]: timeout set to 30 seconds
Jul 14 15:55:22 arm chat[1289]: abort on (\nBUSY\r)
Jul 14 15:55:22 arm chat[1289]: abort on (\nERROR\r)
Jul 14 15:55:22 arm chat[1289]: abort on (\nNO ANSWER\r)
Jul 14 15:55:22 arm chat[1289]: abort on (\nNO CARRIER\r)
Jul 14 15:55:22 arm chat[1289]: abort on (\nNO DAILTONE\r)
Jul 14 15:55:22 arm chat[1289]: abort on (\nRING\r\n\r\nRING\r)
Jul 14 15:55:22 arm chat[1289]: send (AT^M)
Jul 14 15:55:22 arm chat[1289]: expect (OK)
Jul 14 15:55:22 arm chat[1289]: ^M
Jul 14 15:55:22 arm chat[1289]: OK
Jul 14 15:55:22 arm chat[1289]: -- got it
Jul 14 15:55:22 arm chat[1289]: send (AT+CGDCONT=1,"IP","sp.telus.com"^M)
Jul 14 15:55:22 arm chat[1289]: timeout set to 30 seconds
Jul 14 15:55:22 arm chat[1289]: expect (OK)
Jul 14 15:55:22 arm chat[1289]: ^M
Jul 14 15:55:22 arm chat[1289]: ^M
Jul 14 15:55:22 arm chat[1289]: OK
Jul 14 15:55:22 arm chat[1289]: -- got it
Jul 14 15:55:22 arm chat[1289]: send (ATD*99***1#^M)
Jul 14 15:55:22 arm chat[1289]: expect (CONNECT)
Jul 14 15:55:22 arm chat[1289]: ^M
Jul 14 15:55:22 arm chat[1289]: ^M
Jul 14 15:55:22 arm chat[1289]: CONNECT
Jul 14 15:55:22 arm chat[1289]: -- got it
Jul 14 15:55:22 arm chat[1289]: send (^M)
Jul 14 15:55:22 arm pppd[1278]: Script /usr/sbin/chat -vV -f /etc/ppp/chat-HSPA910CF-nopin finished (pid 1288), status = 0x0
Jul 14 15:55:22 arm pppd[1278]: Serial connection established.
Jul 14 15:55:22 arm pppd[1278]: using channel 12
Jul 14 15:55:22 arm pppd[1278]: Using interface ppp0
Jul 14 15:55:22 arm pppd[1278]: Connect: ppp0 <--> /dev/mux0
Jul 14 15:55:23 arm pppd[1278]: rcvd [LCP ConfReq id=0x1 <asyncmap 0x0> <auth pap> <magic 0x96a24aa4> <pcomp> <accomp>]
Jul 14 15:55:23 arm pppd[1278]: sent [LCP ConfReq id=0x1 <asyncmap 0x0> <magic 0xf7d60737> <pcomp> <accomp>]
Jul 14 15:55:23 arm pppd[1278]: No auth is possible
Jul 14 15:55:23 arm pppd[1278]: sent [LCP ConfRej id=0x1 <auth pap>]
Jul 14 15:55:23 arm pppd[1278]: rcvd [LCP ConfAck id=0x1 <asyncmap 0x0> <magic 0xf7d60737> <pcomp> <accomp>]
Jul 14 15:55:23 arm pppd[1278]: rcvd [LCP ConfReq id=0x2 <asyncmap 0x0> <magic 0x96a24aa4> <pcomp> <accomp>]
Jul 14 15:55:23 arm pppd[1278]: sent [LCP ConfAck id=0x2 <asyncmap 0x0> <magic 0x96a24aa4> <pcomp> <accomp>]
Jul 14 15:55:23 arm pppd[1278]: sent [LCP EchoReq id=0x0 magic=0xf7d60737]
Jul 14 15:55:23 arm pppd[1278]: sent [CCP ConfReq id=0x1 <deflate 15> <deflate(old#) 15> <bsd v1 15>]
Jul 14 15:55:23 arm pppd[1278]: sent [IPCP ConfReq id=0x1 <addr 0.0.0.0> <ms-dns1 0.0.0.0> <ms-dns2 0.0.0.0>]
Jul 14 15:55:23 arm pppd[1278]: rcvd [LCP EchoRep id=0x0 magic=0x96a24aa4]
Jul 14 15:55:23 arm pppd[1278]: rcvd [LCP ProtRej id=0x3 80 fd 01 01 00 0f 1a 04 78 00 18 04 78 00 15]
Jul 14 15:55:23 arm pppd[1278]: Protocol-Reject for 'Compression Control Protocol' (0x80fd) received
Jul 14 15:55:23 arm pppd[1278]: rcvd [IPCP ConfNak id=0x1 <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:23 arm pppd[1278]: sent [IPCP ConfReq id=0x2 <addr 0.0.0.0> <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:24 arm pppd[1278]: rcvd [IPCP ConfNak id=0x2 <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:24 arm pppd[1278]: sent [IPCP ConfReq id=0x3 <addr 0.0.0.0> <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:25 arm pppd[1278]: rcvd [IPCP ConfNak id=0x3 <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:25 arm pppd[1278]: sent [IPCP ConfReq id=0x4 <addr 0.0.0.0> <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:26 arm vnstatd[1029]: Interface "ppp0" enabled.
Jul 14 15:55:26 arm pppd[1278]: rcvd [IPCP ConfNak id=0x4 <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:26 arm pppd[1278]: sent [IPCP ConfReq id=0x5 <addr 0.0.0.0> <ms-dns1 10.11.12.13> <ms-dns2 10.11.12.14>]
Jul 14 15:55:26 arm pppd[1278]: rcvd [IPCP ConfReq id=0x1]
Jul 14 15:55:26 arm pppd[1278]: sent [IPCP ConfNak id=0x1 <addr 0.0.0.0>]
Jul 14 15:55:26 arm pppd[1278]: rcvd [IPCP ConfNak id=0x5 <addr 10.116.1.173> <ms-dns1 209.91.107.11> <ms-dns2 209.121.225.11>]
Jul 14 15:55:26 arm pppd[1278]: sent [IPCP ConfReq id=0x6 <addr 10.116.1.173> <ms-dns1 209.91.107.11> <ms-dns2 209.121.225.11>]
Jul 14 15:55:26 arm pppd[1278]: rcvd [IPCP ConfReq id=0x2 <addr 10.116.1.173>]
Jul 14 15:55:26 arm pppd[1278]: sent [IPCP ConfAck id=0x2 <addr 10.116.1.173>]
Jul 14 15:55:26 arm pppd[1278]: rcvd [IPCP ConfAck id=0x6 <addr 10.116.1.173> <ms-dns1 209.91.107.11> <ms-dns2 209.121.225.11>]
Jul 14 15:55:26 arm pppd[1278]: replacing old default route to eth0 [192.168.1.254]
Jul 14 15:55:26 arm pppd[1278]: local IP address 10.116.1.173
Jul 14 15:55:26 arm pppd[1278]: remote IP address 10.116.1.173
Jul 14 15:55:26 arm pppd[1278]: primary DNS address 209.91.107.11
Jul 14 15:55:26 arm pppd[1278]: secondary DNS address 209.121.225.11
Jul 14 15:55:26 arm pppd[1278]: Script /etc/ppp/ip-up started (pid 1377)
Jul 14 15:55:28 arm ntpd[4170]: Listen normally on 17 ppp0 10.116.1.173 UDP 123
Jul 14 15:55:28 arm ntpd[4170]: 206.108.0.131 interface 192.168.1.20 -> 10.116.1.173
Jul 14 15:55:28 arm ntpd[4170]: 91.189.89.199 interface 192.168.1.20 -> 10.116.1.173
Jul 14 15:55:28 arm ntpd[4170]: 199.182.221.110 interface 192.168.1.20 -> 10.116.1.173
Jul 14 15:55:28 arm ntpd[4170]: 173.243.192.18 interface 192.168.1.20 -> 10.116.1.173
Jul 14 15:55:28 arm ntpd[4170]: 208.73.56.29 interface 192.168.1.20 -> 10.116.1.173
Jul 14 15:55:28 arm ntpd[4170]: peers refreshed
Jul 14 15:55:28 arm ntpd[4170]: new interface(s) found: waking up resolver
Jul 14 15:55:29 arm postmulti[1447]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postmulti[1451]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix[1462]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/master[1467]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/postsuper[1468]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/master[1004]: reload -- version 2.11.0, configuration /etc/postfix
Jul 14 15:55:29 arm postfix/master[1004]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/postsuper[1470]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/qmgr[1473]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/pickup[1474]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm postfix/sendmail[1485]: warning: /etc/postfix/main.cf, line 111: overriding earlier entry: relayhost=
Jul 14 15:55:29 arm pppd[1278]: Script /etc/ppp/ip-up finished (pid 1377), status = 0x0
There is no other entry in syslog until the system is power cycled.

Also nothing interesting in auth.log
 
Old 07-21-2015, 10:00 AM   #9
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
I am getting more convinced it is somehow related to the pppd connection. The device has become more unstable, locking up in this state more often. I did a test in which I disabled the pppd connection, and so far it hasn't reproduced the issue. I cannot imagine how the pppd connection can create such an issue though...
 
Old 07-22-2015, 08:41 AM   #10
grgsaliba
LQ Newbie
 
Registered: Jan 2013
Posts: 15

Original Poster
Rep: Reputation: Disabled
Digging deeper into the issue, it seems that I only get this strange lock-up when I set the 'replacedefaultroute' option in pppd. So it might simply be related to a packet routing issue. Still, even in this mode LAN traffic coming from the subnet should be routed properly, and the issue only seems to appear at some point during normal operation.
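
To see which interface currently holds the default route (the syslog excerpt shows pppd "replacing old default route to eth0"), the kernel's routing table can be read from /proc/net/route without running any external program. This is a minimal sketch; default_route_ifaces is a hypothetical helper name, and on a healthy system `ip route show` gives the same information more readably.

```shell
# Print the interface and metric of every default route found in a
# /proc/net/route-style table.
# $1 = table file (default /proc/net/route)
default_route_ifaces() {
    while read -r iface dest gateway flags refcnt use metric mask rest; do
        # A default route has destination and mask both equal to 00000000.
        # The header line never matches, so it is skipped automatically.
        if [ "$dest" = "00000000" ] && [ "$mask" = "00000000" ]; then
            echo "$iface metric=$metric"
        fi
    done < "${1:-/proc/net/route}"
}
```

Calling `default_route_ifaces` with no argument while the fault is present would show whether ppp0 or eth0 owns the default route at that moment.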
 
  


All times are GMT -5.
