LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Old 05-21-2020, 12:40 PM   #1
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Rep: Reputation: Disabled
Troubleshooting hardware issue


We have a user who is running Postgres and claims the /data mount point fails.

dmesg shows the errors below. What other logs can I view? I have looked at syslog too.

How do I view actual timestamps in dmesg? This is Ubuntu. Thanks in advance, TT.

[10001.245337] sd 2:2:0:0: [sda] tag#0 task abort called for scmd(000000004710019e)
[10001.245341] sd 2:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 06 4a 0e 05 c0 00 00 02 00 00 00
[10001.245343] sd 2:2:0:0: task abort: FAILED scmd(000000004710019e)
[10001.261332] sd 2:2:0:0: target reset called for scmd(000000004710019e)
[10001.261334] sd 2:2:0:0: [sda] tag#0 megasas: target reset FAILED!!
[10001.261337] sd 2:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
SCSI command pointer: (000000004710019e) SCSI host state: 5 SCSI
 
Old 05-21-2020, 12:49 PM   #2
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,925

Rep: Reputation: 3303
The dmesg -T option converts the seconds-since-boot timestamps to human-readable time.
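
As a minimal sketch of what that conversion amounts to: the bracketed number in each dmesg line is seconds since boot, so adding it to the boot time gives wall-clock time. The function name `dmesg_stamp_to_date` is made up for illustration, and the sketch assumes GNU `date` (as shipped on Ubuntu). Like `dmesg -T` itself, the result can be off if the machine has been suspended and resumed since boot.

```shell
# dmesg_stamp_to_date BOOT_EPOCH STAMP   (hypothetical helper)
#   BOOT_EPOCH: boot time as seconds since 1970-01-01 UTC
#   STAMP:      the bracketed dmesg value, e.g. 10001.245337
# Prints the matching wall-clock time in UTC (requires GNU date).
dmesg_stamp_to_date() {
    boot_epoch=$1
    stamp_secs=${2%%.*}    # drop the fractional part of the stamp
    date -u -d "@$((boot_epoch + stamp_secs))" '+%Y-%m-%d %H:%M:%S'
}

# The boot time can be derived from the current time and /proc/uptime:
#   boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
#   dmesg_stamp_to_date "$boot" 10001.245337
```

In day-to-day use `sudo dmesg -T` is all that is needed; the function only shows where the numbers come from.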

Do we assume the postgresql database is on /data?
Separate drive, partition or RAID?

What type of filesystem? Filesystem errors?

Any errors specific to that partition etc?

Last edited by michaelk; 05-21-2020 at 12:55 PM.
 
3 members found this post helpful.
Old 05-21-2020, 01:17 PM   #3
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by michaelk
The dmesg -T option converts seconds to human readable time.

Do we assume the postgresql database is on /data?

What type of filesystem? Filesystem errors?

Any errors specific to that partition etc?
From what I remember, the RAID 1 contains the OS. The data drive is RAID 5. I am not familiar with PostgreSQL but will look into it.

The user reported that /data doesn't unmount, but he gets errors when trying to read or write files. For example, running `ls /data` gives

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error


/dev/sda1 on /data type ext4

On the fstab:

UUID=c1d5f933-9dd6-4f40-a476-4b6dd911ecee /data ext4 defaults,rw,x-gvfs-show,nofail,x-systemd.device-timeout=30s 0 2


disk allocation:


Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev
tmpfs            26G  3.6M   26G   1% /run
/dev/nvme0n1p2  1.8T  1.7T 1003M 100% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
/dev/sda1        48T   21T   24T  47% /data
tmpfs            26G     0   26G   0% /run/user/1003
tmpfs            26G     0   26G   0% /run/user/1000

Last edited by trackstar2000; 05-21-2020 at 01:39 PM.
 
Old 05-21-2020, 03:23 PM   #4
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Original Poster
Rep: Reputation: Disabled
I notice that the root filesystem (/) is 100% full, with about 1 GB left. This is where the home directories reside too.

admin1@lambda:/home$ whereis postgresql
postgresql: /usr/lib/postgresql /etc/postgresql /usr/include/postgresql /usr/share/postgresql



When the user runs this, it craps out: $ sudo /etc/init.d/postgresql start


Afterwards the user can't "ls /data", which is on a different device:

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error

Last edited by trackstar2000; 05-21-2020 at 03:24 PM.
 
Old 05-21-2020, 03:35 PM   #5
michaelk
Moderator
 
Registered: Aug 2002
Posts: 19,925

Rep: Reputation: 3303
The remaining space in / is reserved, which is 5% by default, and is not accessible to regular users. That might be why it isn't starting. /data is another problem.
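
To check the reserved-space theory, `tune2fs -l` prints the reserved block count and the block size of an ext2/3/4 filesystem, and multiplying the two gives the reserved space. This is a rough sketch that assumes / is ext4 (the thread only confirms that for /data); the helper name `reserved_mib` is invented here.

```shell
# reserved_mib (hypothetical helper): read `tune2fs -l` output on stdin
# and print the space reserved for root, in MiB.
reserved_mib() {
    awk -F: '
        /^Reserved block count:/ { reserved = $2 + 0 }
        /^Block size:/           { bsize = $2 + 0 }
        END { printf "%d\n", reserved * bsize / 1048576 }
    '
}

# Usage on the root device shown in the df output (needs root):
#   sudo tune2fs -l /dev/nvme0n1p2 | reserved_mib
```

Comparing that figure with the "Avail" column of df shows whether the last free space really is only the root-reserved portion.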
 
Old 05-21-2020, 03:52 PM   #6
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by michaelk
The remaining space in / is reserved which is 5% by default and not accessible by the regular user. Might be why it isn't starting. /data is another problem.
He says it runs but won't finish. The user's words are below.


$ sudo /etc/init.d/postgresql start

This starts the long db recovery process, which does a lot of read/writes onto the array, and the array has been failing before it finishes.
 
Old 05-21-2020, 05:02 PM   #7
rnturn
Senior Member
 
Registered: Jan 2003
Location: Illinois (SW Chicago 'burbs)
Distribution: Currently: openSUSE, Raspbian, Slackware. Formerly: CentOS, MacOS, Red Hat. Other: Solaris, Tru64
Posts: 1,866

Rep: Reputation: 268
Quote:
Originally Posted by trackstar2000

disk allocation:

Code:
Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev
tmpfs            26G  3.6M   26G   1% /run
/dev/nvme0n1p2  1.8T  1.7T 1003M 100% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
/dev/sda1        48T   21T   24T  47% /data
tmpfs            26G     0   26G   0% /run/user/1003
tmpfs            26G     0   26G   0% /run/user/1000
Sorry if this is a little long-winded. (You can tell it's late in the day and not much is going on.)

Those error messages from dmesg in the initial post look downright ominous. What events do earlier log files show for that device? Perhaps this is a problem that has been building up in severity for a while.

The first thing I'd tackle is the root filesystem space problem. Working on the "/data" problem is going to be difficult when the root filesystem is full.

You'll need to log in as root and clean out anything that is not absolutely necessary. Move things onto an external drive -- at least temporarily -- if you have one with free space so you have some "breathing room" that will allow you to do some cleanup.
  • Check the contents of /var/log for old, no-longer-needed log files and either delete them or compress them (e.g. "bzip2 -v9 log-file") to free up space. Be careful, though: you wouldn't want to delete any log files that might contain useful information regarding the "/data" mount point failure.
  • If old log files are getting out of hand (even when compressed), look at "logrotate(8)" to keep them under control. You'll need to evaluate your needs for any log file you manage with that utility, but I'd rather spend the time deciding what logs to keep and for how long than sit through future extended downtimes cleaning up the root filesystem.
  • If you have debugging enabled for any services, consider turning that off if you don't need it; it makes log files larger than normal and you probably don't need that debugging information any more. Watch out for PostgreSQL logging. You can log the database activity in nauseating detail, listing each and every SQL statement (and even more), and those logs can be huge.
  • Since the filesystem also contains /home -- not a good idea, IMHO, as a single user can bring the system to its knees with a runaway program that consumes all available disk space (you'll need to add storage to remedy this configuration) -- dive into the users' home directories (as root) and look for large log files, tar archives, etc., that are uncompressed. They really don't have to exist on disk in an uncompressed state; you can easily inspect and extract files from a compressed log file or tar archive, for example, using command line switches and pipes, or inspect them using a GUI tool like Ark.
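
The hunt for space hogs in /var/log and /home described in the list above can be sketched as follows. `largest_files` is a made-up helper name; it relies on GNU find's `-printf`, and `-xdev` keeps the search from wandering onto other mounts such as /data.

```shell
# largest_files DIR [N] (hypothetical helper): list the N largest regular
# files under DIR (default 10), biggest first, as "bytes<TAB>path".
largest_files() {
    find "$1" -xdev -type f -printf '%s\t%p\n' 2>/dev/null |
        sort -nr |
        head -n "${2:-10}"
}

# Usage, from a root shell so unreadable directories are not skipped:
#   largest_files /var/log 20
#   largest_files /home 20
```

For anything already compressed, the contents can be checked without unpacking to disk, for example `tar -tjf archive.tar.bz2` to list a bzip2-compressed tar archive or `bzcat log-file.bz2 | less` to page through a compressed log (both file names are placeholders).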

Once you get "/" cleaned up, you'll have a stable platform on which to do some debugging as a non-root user (without disk space problems getting in your way). I'd try (as root) mounting that "/data" filesystem by hand and watching the "/var/log/messages" log file ("/var/log/syslog" on Ubuntu) for anything related to the filesystem when you mount it. If the mount succeeds, can you then access anything from that mount point? If that works, then I suspect that, when the root filesystem was at 100%, the PostgreSQL startup was unable to complete because it was trying to log to a device that had no disk space accessible to the PostgreSQL user/owner.

(Now I'm thinking I need to look at my accumulation of nightly psql export files. :/ )
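
While mounting by hand, it helps to filter the kernel log down to lines that mention the device. A rough sketch, with `disk_errors` as a hypothetical helper and the keyword list taken from the messages quoted in the first post (it is not an exhaustive list of kernel disk errors):

```shell
# disk_errors DEV (hypothetical helper): read kernel log text on stdin and
# print only the lines that mention DEV together with a failure keyword.
disk_errors() {
    grep -F -- "$1" | grep -Ei 'abort|reset|timeout|i/o error'
}

# Usage:
#   sudo dmesg -T | disk_errors sda
#   zcat -f /var/log/syslog* | disk_errors sda   # rotated logs as well
```

Some of the quoted lines carry only the SCSI address and not the device name, so passing `2:2:0:0` instead of `sda` would catch those too.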

Hope some of this is helpful and... good luck.
 
Old 05-21-2020, 08:03 PM   #8
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Original Poster
Rep: Reputation: Disabled

Your feedback, and everyone else's, is very much appreciated. In our work environment it's really tough. These professors want to manage their own machines, but when they have issues, they want us to help them out. So this machine basically belongs to JohnDoe.

I already asked him to clean up the / filesystem. Now we have about 70 GB free on that disk. I just ran (sudo /etc/init.d/postgresql start) and am keeping my eyes on it. Will keep you guys updated.

TT

Last edited by trackstar2000; 05-21-2020 at 08:24 PM.
 
Old 05-26-2020, 05:24 PM   #9
trackstar2000
Member
 
Registered: Apr 2013
Posts: 70

Original Poster
Rep: Reputation: Disabled
Quick update:

From what I have been told and have researched, the dude keeps most of the 21 TB of data in /data/postgresql and then uses available space on the NVMe disk (where the disk space ran out) as a cache for more frequently accessed data.

Dude is using the computer as a deep learning machine (Lambda machine). Recalling from the invoice, /data is on a software RAID. What command would verify that?
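
On the question of verifying software RAID: Linux md arrays are listed in /proc/mdstat, so a quick check is to look for `mdN :` lines there. The helper name `md_arrays` below is invented for this sketch. Note also that the `megasas` prefix in the dmesg output from the first post comes from the megaraid_sas kernel driver for MegaRAID SAS controllers, which would point to a hardware RAID controller rather than md; treat that as a lead to confirm, not a conclusion.

```shell
# md_arrays (hypothetical helper): read /proc/mdstat-style text on stdin
# and print the name of every md (software RAID) array, one per line.
md_arrays() {
    awk '$1 ~ /^md/ && $2 == ":" { print $1 }'
}

# Usage:
#   md_arrays < /proc/mdstat             # no output means no md arrays
#   lsblk -o NAME,TYPE,SIZE,MOUNTPOINT   # TYPE reads raid1, raid5, ... for md devices
```

If /data sits directly on /dev/sda1 with no md device underneath it in the lsblk tree, the RAID is being done somewhere other than Linux software RAID.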


Thanks,
TT

Last edited by trackstar2000; 05-26-2020 at 05:31 PM.
 
Old 05-26-2020, 10:06 PM   #10
JJJCR
Senior Member
 
Registered: Apr 2010
Posts: 1,703

Rep: Reputation: 288
You can try to find files larger than, say, 500 GB.

find /data/postgresql/ -type f -size +500G

Or you can check which directories use the most space.

du -sh /data/* | sort -hr
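
A variation on the du approach, as a sketch: summarize each first-level subdirectory (hidden ones included) and stay on one filesystem. `dir_usage` is a name made up here; `-d 1` and `-x` are GNU du options.

```shell
# dir_usage DIR (hypothetical helper): disk usage in KiB of DIR and of each
# first-level subdirectory, largest first; -x stays on DIR's filesystem.
dir_usage() {
    du -x -k -d 1 "$1" | sort -nr
}

# Usage, from a root shell:
#   dir_usage /data/postgresql | head -n 15
```

On a filesystem that is throwing I/O errors, du reports them on stderr as it goes, which is itself a hint about which directories are affected.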

Last edited by JJJCR; 05-26-2020 at 10:07 PM. Reason: edit
 
  

