Troubleshooting hardware issue
We have a user running PostgreSQL who reports that the /data mount point fails.
dmesg shows the errors below. What other logs can I look at? I have already checked syslog. Also, how do I get actual wall-clock times out of dmesg? This is Ubuntu. Thanks in advance, TT.

[10001.245337] sd 2:2:0:0: [sda] tag#0 task abort called for scmd(000000004710019e)
[10001.245341] sd 2:2:0:0: [sda] tag#0 CDB: Write(16) 8a 00 00 00 00 06 4a 0e 05 c0 00 00 02 00 00 00
[10001.245343] sd 2:2:0:0: task abort: FAILED scmd(000000004710019e)
[10001.261332] sd 2:2:0:0: target reset called for scmd(000000004710019e)
[10001.261334] sd 2:2:0:0: [sda] tag#0 megasas: target reset FAILED!!
[10001.261337] sd 2:2:0:0: [sda] tag#0 Controller reset is requested due to IO timeout SCSI command pointer: (000000004710019e) SCSI host state: 5 SCSI
The `dmesg -T` option converts the seconds-since-boot timestamps to human-readable wall-clock time.
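For example (a sketch; the 10001-second offset is taken from the log above, and `uptime -s` is one way to recover the boot time if you want to do the conversion by hand):

```shell
# Human-readable kernel log timestamps:
sudo dmesg -T | grep sda

# The same conversion by hand: dmesg prints seconds since boot,
# so add the offset to the boot time.
boot_epoch=$(date -d "$(uptime -s)" +%s)
event_epoch=$(( boot_epoch + 10001 ))   # 10001.245337 from the log above
date -d "@$event_epoch"
```

One caveat: `dmesg -T` derives wall-clock times from the current boot time, so the timestamps can drift after suspend/resume.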
Do we know the PostgreSQL database is on /data? Is that a separate drive, a partition, or a RAID array? What filesystem type? Are there filesystem errors, or any errors specific to that partition?
Quote:
The user reported that /data doesn't get unmounted, but they get errors when trying to read or write files. For example, running `ls /data` gives:

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error

The mount:

/dev/sda1 on /data type ext4

The fstab entry:

UUID=c1d5f933-9dd6-4f40-a476-4b6dd911ecee /data ext4 defaults,rw,x-gvfs-show,nofail,x-systemd.device-timeout=30s 0 2

Disk allocation:

Filesystem      Size  Used Avail Use% Mounted on
udev            126G     0  126G   0% /dev
tmpfs            26G  3.6M   26G   1% /run
/dev/nvme0n1p2  1.8T  1.7T 1003M 100% /
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
/dev/sda1        48T   21T   24T  47% /data
tmpfs            26G     0   26G   0% /run/user/1003
tmpfs            26G     0   26G   0% /run/user/1000
I notice that the root filesystem below is 100% full, with only about 1 GB left. This is also where the home directories reside.
admin1@lambda:/home$ whereis postgresql
postgresql: /usr/lib/postgresql /etc/postgresql /usr/include/postgresql /usr/share/postgresql

When the user runs this, it craps out:

$ sudo /etc/init.d/postgresql start

Afterwards the user can't even `ls /data`, which is on a different device:

ls: cannot access '/data/tweets_corona': Input/output error
ls: cannot access '/data/files.pushshift.io': Input/output error
ls: cannot access '/data/SUN397': Input/output error
The remaining space in / is the reserved area, which is 5% by default and not accessible to regular users. That might be why PostgreSQL isn't starting. /data is a separate problem.
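That reserve can be inspected (and, as a judgement call, shrunk) with `tune2fs`; the device name below is the root filesystem from the df output earlier in the thread:

```shell
# Roughly 5% of the 1.8 TB root filesystem is invisible to
# non-root users:
echo $(( 1800 * 5 / 100 ))   # ~90 (GB)

# Inspect the actual reserved block count:
sudo tune2fs -l /dev/nvme0n1p2 | grep -i 'reserved block count'

# Optionally shrink the reserve to 1% to free space for users:
# sudo tune2fs -m 1 /dev/nvme0n1p2
```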
Quote:
$ sudo /etc/init.d/postgresql start

This starts the long database recovery process, which does a lot of reads and writes on the array, and the array has been failing before the recovery finishes.
Quote:
Those error messages from dmesg in the initial post look downright ominous. What events do earlier log files show for that device? Perhaps this is a problem that has been building up in severity for a while.

The first thing I'd tackle is the root filesystem space problem. Working on the "/data" problem is going to be difficult while the root filesystem is full. You'll need to log in as root and clean out anything that is not absolutely necessary. Move things onto an external drive, at least temporarily, if you have one with free space, so you have some "breathing room" that will allow you to do some cleanup.
Once you get "/" cleaned up, you'll have a stable platform on which to do some debugging as a non-root user, without disk space problems getting in your way. I'd then try (as root) mounting that "/data" filesystem by hand while watching the "/var/log/messages" log file for anything related to the filesystem. If the mount succeeds, can you then access anything under that mount point? If that works, then I suspect that, when the root filesystem was at 100%, the PostgreSQL startup was unable to complete because it was trying to log to a device with no disk space accessible to the PostgreSQL user/owner. (Now I'm thinking I need to look at my own accumulation of nightly psql export files. :/ ) Hope some of this is helpful, and... good luck.
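A sketch of that hand-mount exercise, using the device and mount point from earlier in the thread (on Ubuntu the kernel log typically lands in /var/log/syslog or the journal rather than /var/log/messages):

```shell
# Terminal 1: follow kernel/syslog messages while you work
sudo tail -f /var/log/syslog        # or: sudo journalctl -kf

# Terminal 2: mount by hand, then test basic read and write access
sudo mount /dev/sda1 /data
ls /data
touch /data/.write_test && rm /data/.write_test
```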
Quote:
Your feedback, and everyone else's, is very much appreciated. In our work environment it's really tough: these professors want to manage their own machines, but when they have issues, they want us to help them out. So this machine basically belongs to JohnDoe :(. I already asked him to clean up the / filesystem, and now we have about 70 GB free on that disk. I just ran `sudo /etc/init.d/postgresql start` and am keeping an eye on it. Will keep you guys updated. TT
Quick update:
From what I've been told and have researched, the user keeps most of the 21 TB of data in /data/postgresql and then uses available space on the NVMe disk (where the disk space ran out) as a cache for more frequently accessed data. He is using the computer as a deep-learning machine (a Lambda machine). Recalling from the invoice, /data is on a software RAID. What command can verify that? Thanks, TT
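For Linux software (md) RAID, these are the usual checks (`/dev/md0` is a guess at the array name). One caveat: the `megasas` lines in the first post come from the MegaRAID driver, which suggests a hardware RAID controller; in that case the kernel sees a single /dev/sda and /proc/mdstat will show no arrays:

```shell
# Software (md) RAID status, if any arrays exist:
cat /proc/mdstat
sudo mdadm --detail /dev/md0        # array name is a guess

# Block-device layout overview (also shows md members):
lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINT
```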
You can try to find files larger than, say, 500 GB:

find /data/postgresql/ -type f -size +500G

Or check which directories are using the most space (note the wildcard, so each subdirectory is listed separately):

du -sh /data/* | sort -hr
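A safe way to try those commands first is on a throwaway directory (the paths below are made up for the demo; substitute /data/postgresql and +500G in practice):

```shell
# Build a scratch tree with one big and one small file
mkdir -p /tmp/du_demo/big /tmp/du_demo/small
dd if=/dev/zero of=/tmp/du_demo/big/blob bs=1M count=5 status=none
echo hi > /tmp/du_demo/small/note.txt

# Files over 1 MB (stand-in for +500G):
find /tmp/du_demo -type f -size +1M

# Per-directory usage, largest first:
du -sh /tmp/du_demo/* | sort -hr

rm -rf /tmp/du_demo
```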