LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Troubleshooting hardware issue (https://www.linuxquestions.org/questions/linux-newbie-8/troubleshooting-hardware-issue-4175675666/)

trackstar2000 05-21-2020 12:40 PM

Troubleshooting hardware issue
 
We have a user running Postgres who claims the /data mount point fails.

dmesg shows the error below. What other logs can I view? I have looked at syslog too.

How do I view the actual time in the dmesg output? This is Ubuntu. Thanks ahead, TT.

[10001.245337] sd 2:2:0:0: [sda] tag#0 task abort called for scmd(000000004710019e)
[10001.245341] sd 2:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 06 4a 0e 05 c0 00 00 02 00 00 00
[10001.245343] sd 2:2:0:0: task abort: FAILED scmd(000000004710019e)
[10001.261332] sd 2:2:0:0: target reset called for scmd(000000004710019e)
[10001.261334] sd 2:2:0:0: [sda] tag#0 megasas: target reset FAILED!!
[10001.261337] sd 2:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout
SCSI command pointer: (000000004710019e) SCSI host state: 5 SCSI

michaelk 05-21-2020 12:49 PM

The dmesg -T option converts seconds to human readable time.

Do we assume the postgresql database is on /data?
Separate drive, partition or RAID?

What type of filesystem? Filesystem errors?

Any errors specific to that partition etc?
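
To add to the dmesg -T tip: the bracketed prefix in the log is seconds since boot, and you can also do the conversion by hand when -T isn't available. A rough sketch, assuming GNU date and /proc/uptime (the stamp value is taken from the log above; there is some drift if the clock was adjusted after boot):

```shell
#!/bin/sh
# `dmesg -T` converts the [seconds.micros] prefix to wall-clock time:
#   dmesg -T | grep sda
# Doing the same conversion by hand:
stamp=10001                              # whole seconds from the log line
now=$(date +%s)                          # current time, seconds since epoch
up=$(cut -d. -f1 /proc/uptime)           # current uptime, whole seconds
boot=$((now - up))                       # approximate boot time (epoch)
date -d "@$((boot + stamp))"             # wall-clock time of the dmesg event
```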

trackstar2000 05-21-2020 01:17 PM

Quote:

Originally Posted by michaelk (Post 6125778)
The dmesg -T option converts seconds to human readable time.

Do we assume the postgresql database is on /data?

What type of filesystem? Filesystem errors?

Any errors specific to that partition etc?

From what I remember, the RAID 1 array contains the OS. The data drive is RAID 5. I am not familiar with PostgreSQL but will look into it.

The user reported that /data is still mounted, but he gets errors when trying to read/write files. For example, running `ls /data` gives

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error


/dev/sda1 on /data type ext4

On the fstab:

UUID=c1d5f933-9dd6-4f40-a476-4b6dd911ecee /data ext4 defaults,rw,x-gvfs-show,nofail,x-systemd.device-timeout=30s 0 2


disk allocation:


Filesystem Size Used Avail Use% Mounted on
udev 126G 0 126G 0% /dev
tmpfs 26G 3.6M 26G 1% /run
/dev/nvme0n1p2 1.8T 1.7T 1003M 100% /
tmpfs 126G 0 126G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 126G 0 126G 0% /sys/fs/cgroup
/dev/nvme0n1p1 511M 6.1M 505M 2% /boot/efi
/dev/sda1 48T 21T 24T 47% /data
tmpfs 26G 0 26G 0% /run/user/1003
tmpfs 26G 0 26G 0% /run/user/1000

trackstar2000 05-21-2020 03:23 PM

I notice that the / filesystem below is 100% full, with about 1 GB left. This is where the home directories reside too.

admin1@lambda:/home$ whereis postgresql
postgresql: /usr/lib/postgresql /etc/postgresql /usr/include/postgresql /usr/share/postgresql



When the user runs this, it craps out: $ sudo /etc/init.d/postgresql start


Afterwards the user can't "ls /data", even though it is on a different device:

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error
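
Input/output errors on every path under a mount point usually mean the kernel has offlined the device after failed resets, rather than a filesystem-level problem. A quick way to check, assuming the sda device name from the dmesg output (column output varies by lsblk version):

```shell
#!/bin/sh
# If the kernel offlined sda after the failed target/controller resets,
# every access under /data returns EIO until the device comes back.
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT       # is sda still listed with its mount?
# Per-device kernel state ("running" is healthy, "offline" is not):
#   cat /sys/block/sda/device/state
# Recent controller messages:
#   dmesg | tail -50
```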

michaelk 05-21-2020 03:35 PM

The remaining space in / is reserved which is 5% by default and not accessible by the regular user. Might be why it isn't starting. /data is another problem.
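
To put the reserved-space point in numbers: ext4 reserves 5% of blocks for root by default, which on the 1.8T root filesystem above is a sizeable chunk. A sketch (the tune2fs commands are the standard way to inspect or change the reserve; the device name is taken from the df output earlier in the thread):

```shell
#!/bin/sh
# ext4 reserves 5% of blocks for root by default; on a ~1.8 TiB filesystem
# that is space df counts as unavailable to regular users.
size_gib=1843                       # ~1.8 TiB expressed in GiB
reserved=$((size_gib * 5 / 100))
echo "${reserved} GiB reserved for root"   # prints "92 GiB reserved for root"
# To inspect or shrink the reserve (run on the real box, not here):
#   sudo tune2fs -l /dev/nvme0n1p2 | grep -i 'reserved block'
#   sudo tune2fs -m 1 /dev/nvme0n1p2     # cut the reserve to 1%
```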

trackstar2000 05-21-2020 03:52 PM

Quote:

Originally Posted by michaelk (Post 6125825)
The remaining space in / is reserved which is 5% by default and not accessible by the regular user. Might be why it isn't starting. /data is another problem.

He says it runs but won't finish. In the user's words below :(.


$ sudo /etc/init.d/postgresql start

This starts the long db recovery process, which does a lot of read/writes onto the array, and the array has been failing before it finishes.

rnturn 05-21-2020 05:02 PM

Quote:

Originally Posted by trackstar2000 (Post 6125782)

disk allocation:

Code:

Filesystem      Size  Used Avail Use% Mounted on
udev            126G    0  126G  0% /dev
tmpfs            26G  3.6M  26G  1% /run
/dev/nvme0n1p2  1.8T  1.7T 1003M 100% /
tmpfs          126G    0  126G  0% /dev/shm
tmpfs          5.0M    0  5.0M  0% /run/lock
tmpfs          126G    0  126G  0% /sys/fs/cgroup
/dev/nvme0n1p1  511M  6.1M  505M  2% /boot/efi
/dev/sda1        48T  21T  24T  47% /data
tmpfs            26G    0  26G  0% /run/user/1003
tmpfs            26G    0  26G  0% /run/user/1000


Sorry if this is a little long-winded. (You can tell it's late in the day and not much is going on.)

Those error messages from dmesg in the initial post look downright ominous. What events do earlier log files show for that device? Perhaps this is a problem that has been building in severity for a while.

The first thing I'd tackle is the root filesystem space problem. Working on the "/data" problem is going to be difficult when the root filesystem is full.

You'll need to log in as root and clean out anything that is not absolutely necessary. Move things onto an external drive -- at least temporarily -- if you have one with free space so you have some "breathing room" that will allow you to do some cleanup.
  • Check the contents of /var/log for old, no-longer-needed log files and either delete them or compress them (e.g. "bzip2 -v9 log-file") to free up space. Be careful, though: you wouldn't want to delete any log files that might contain useful information regarding the "/data" mount point failure.
  • If old log files are getting out of hand (even when compressed), look at logrotate(8) to keep them under control. You'll need to decide, for each log you manage with that utility, what to keep and for how long, but that evaluation is time better spent up front than on future extended downtimes cleaning up the root filesystem.
  • If you have debugging enabled for any services, consider turning it off if you don't need it; it makes log files larger than normal and you probably don't need that debugging information any more. Watch out for PostgreSQL logging in particular: it can log database activity in nauseating detail, down to each and every SQL statement (and more), and those logs can be huge.
  • Since the filesystem also contains /home -- not a good idea, IMHO, as a single user can bring the system to its knees with a runaway program that consumes all available disk space (you'll need to add storage to remedy this) -- dive into the users' home directories (as root) and look for large, uncompressed log files, tar archives, etc. They really don't have to exist on disk uncompressed; you can easily inspect and extract files from a compressed tar archive using command-line switches and pipes, or with a GUI tool like Ark.

Once you get "/" cleaned up and you've got a stable platform on which to do some debugging (without disk space problems getting in your way), I'd try (as root) mounting that "/data" filesystem by hand and watching "/var/log/messages" for anything related to the filesystem when you mount it. If the mount succeeds, can you then access anything from that mount point? If so, then I suspect that, when the root filesystem was at 100%, the PostgreSQL startup was unable to complete because it was trying to log to a device with no disk space accessible to the PostgreSQL user/owner.

(Now I think I need to look at my accumulation of nightly psql export files. :/ )

Hope some of this is helpful and... good luck.
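
Before deleting anything during the cleanup above, it helps to rank directories by usage first. A sketch: the commented line is the invocation you'd use against the real root filesystem (-x keeps du on one filesystem so /data isn't scanned); the runnable part demonstrates the same pipeline on a scratch directory:

```shell
#!/bin/sh
# On the real box, rank the biggest directories under /:
#   sudo du -xh --max-depth=2 / 2>/dev/null | sort -hr | head -20
# Demonstration of the du | sort pipeline on a scratch directory:
d=$(mktemp -d)
mkdir -p "$d/big" "$d/small"
dd if=/dev/zero of="$d/big/blob"   bs=1024 count=2048 2>/dev/null  # ~2 MiB
dd if=/dev/zero of="$d/small/blob" bs=1024 count=16   2>/dev/null  # ~16 KiB
du -sk "$d/big" "$d/small" | sort -nr    # largest directory first
rm -rf "$d"
```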

trackstar2000 05-21-2020 08:03 PM

Quote:

Originally Posted by rnturn (Post 6125858)
Sorry if this is a little long-winded. ...snip...

Hope some of this is helpful and... good luck.


Your feedback, and everyone else's, is very much appreciated. In our work environment it's really tough: these professors want to manage their own machines, but when they have issues they want us to help them out. So this machine basically belongs to JohnDoe :(.

I already asked him to clean up the / filesystem. Now we have about 70GB free on that disk. I just ran (sudo /etc/init.d/postgresql start) and am keeping my eyes on it. Will keep you guys updated.

TT

trackstar2000 05-26-2020 05:24 PM

Quick update:

From what I've been told and researched, the user keeps most of the 21 TB of data in /data/postgresql and then uses available space on the NVMe disk (where the disk space ran out) as a cache for more frequently accessed data.

He is using the computer as a deep learning machine (Lambda machine). Recalling from the invoice, /data is on a software RAID. What command can I use to verify that?


Thanks,
TT
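
On verifying software RAID: Linux software RAID (md) shows up in /proc/mdstat and in lsblk's device tree, while the megasas lines in the first post's dmesg suggest a hardware MegaRAID controller, which mdadm won't see. A sketch (the /dev/md0 array name is illustrative):

```shell
#!/bin/sh
# Software RAID (md): arrays, member disks and sync status live here; an
# empty or missing file means no md arrays are configured:
cat /proc/mdstat 2>/dev/null || echo "no md subsystem loaded"
# Per-array detail (array name is hypothetical):
#   sudo mdadm --detail /dev/md0
# Block-device tree; md members show TYPE raid1/raid5 etc.:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
# Hardware MegaRAID arrays are managed by the controller's own tool
# (storcli/MegaCli) rather than mdadm.
```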

JJJCR 05-26-2020 10:06 PM

You can try to find files larger than, say, 500 GB:

find /data/postgresql/ -type f -size +500G

Or you can see which directories use the most space (note the glob, so du reports each directory rather than a single total):

du -sh /data/* | sort -hr

