SATA link down

MealTicket · 04-02-2014, 03:06 PM

Hi,

Looking for help from some gurus.

We have a system based on the Freescale iMX53 processor with one SATA port. We have 2 boards out of 70 that have a SATA issue. I've gone through u-boot source and now the ata driver source.

To debug, I have sprinkled printk statements all over libata-core.c and ahci_platform.c and still can't figure out why the SATA link is down. I am by no means a C expert, which explains why the source appears complex to me. I may be looking in the wrong spots anyway.

I have diffed a good board and bad board dmesg log.
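For anyone repeating the comparison, here is a minimal sketch of diffing two saved dmesg logs. It assumes the logs were saved to files and that lines may carry a "[    1.234567] " timestamp prefix, which would otherwise make every line differ. The function names strip_ts and diff_dmesg are made up for illustration.

```shell
# strip_ts FILE
# Prints FILE with a leading "[ seconds.micros] " timestamp removed from each line.
# Lines without such a prefix pass through unchanged.
strip_ts() {
    sed 's/^\[ *[0-9][0-9]*\.[0-9][0-9]*\] //' "$@"
}

# diff_dmesg GOOD_LOG BAD_LOG
# Writes timestamp-free copies next to the inputs (suffix .nots) and diffs them.
# Like diff itself, returns 0 if identical and 1 if the logs differ.
diff_dmesg() {
    strip_ts "$1" > "$1.nots" || return 2
    strip_ts "$2" > "$2.nots" || return 2
    diff "$1.nots" "$2.nots"
}
```

Usage would be something like: diff_dmesg good_board.log bad_board.log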

I'm hoping someone can look at the below diff snippet and supply a clue.

Here is one line early on in dmesg that I think will lead to the answer...

good board: ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 28
bad board_: ata1: SATA max UDMA/133 mmio [mem 0x10000000-0x10000fff] port 0x100 irq 28

Later on in dmesg this is the result...

good board: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
bad board_: ata1: SATA link down (SStatus 0 SControl 300)

Any help would be greatly appreciated.
Thanks
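The SStatus values in the two "SATA link" lines above can be split into their bit fields, which makes the difference easier to read. The sketch below assumes the value is printed in hex without a 0x prefix, and uses the standard SStatus layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8). The function name decode_sstatus is made up for illustration.

```shell
# decode_sstatus HEXVALUE
# Splits a SATA SStatus value (hex, no 0x prefix) into its three fields:
#   DET (bits 3:0)  device detection; 3 = device present and PHY communication established
#   SPD (bits 7:4)  negotiated speed; 1 = Gen1 (1.5 Gbps)
#   IPM (bits 11:8) interface power state; 1 = active
# A value of 0 in every field means no device detected and no PHY communication.
decode_sstatus() {
    sstatus_val=$(( 0x$1 ))
    sstatus_det=$(( sstatus_val & 0xF ))
    sstatus_spd=$(( (sstatus_val >> 4) & 0xF ))
    sstatus_ipm=$(( (sstatus_val >> 8) & 0xF ))
    echo "DET=$sstatus_det SPD=$sstatus_spd IPM=$sstatus_ipm"
}
```

With the good board's value, decode_sstatus 113 prints DET=3 SPD=1 IPM=1. The bad board's 0 decodes to all zeros, meaning the PHY never saw the drive at all.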

brs332 · 04-09-2014, 07:01 AM

Funny, I've been working on this exact problem over the past few days.

My workaround is to check if the device node is present (/dev/sda in my case) and rescan the bus if it's not (by writing to /sys/class/scsi_host/host0/scan). You might need something more robust ...

I've not seen a failure after the bus re-scan but I literally just tested this overnight.

More details on why this is occurring can be found by compiling in verbose logging. The irq_stat message overwrites the mmio status message when the connection status change is handled, which is why they're different. Something about the handling of the connect causes the ATA error handling code to fiddle with the device, which in turn either lets enough time pass or moves the device into another state. So I figured maybe I could rescan and catch the device; hence my workaround.

I don't care to get much deeper into it as I don't think there's anyone out there who really cares for a fix (old kernel, and Freescale has moved on). But if there is, maybe this information will help them.

Brian.

MealTicket · 04-09-2014, 08:46 AM

Thanks for the post. Much appreciated.

After I initially posted I delved further into the ata driver (libata-core.c and ahci.c) and narrowed it down to an IRQ not being handled, or something along those lines. Haven't had time to go any further. I did whip out the BDI3000 GDB JTAG debugger, but it took me a while to set up so I went on to other things.

Couple questions for you...

In my case /sys/class/scsi_host/host0/scan is just a writable file...

--w------- 1 root root 4096 Jan 16 06:45 scan

Do you add code to it and make it executable? If so can you PLEASE post up what you did?

Also, do you mean building the kernel with verbose logging? How do you do that?

Thanks for your time.

brs332 · 04-09-2014, 09:05 AM

Yeah, it looks like a funny interaction with the iMX53. I also suspect the other 68 devices you have may eventually exhibit the problem. On one of mine it takes hours of rebooting to get in this state. I wish I had a device that reproduced it reliably as then I might actually *fix* it. But I don't so I'm not <grin>.

/sys/class/scsi_host/host0/scan is only writable. By writing to it you can ask the scsi subsystem to re-scan channels or targets or LUNs. Using "-" is a wild-card, so

echo "- - -" > /sys/class/scsi_host/host0/scan

will re-scan the entirety. This may or may not be what you want to happen. I put a check for /dev/sda in an init.d script and if it's not there I echo to /sys/.../scan. Depending on your application you should think about putting this before the mount(s) occur.
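A minimal sketch of the check described above, assuming the device node and scan file paths from this thread. The function name rescan_if_missing is made up, and both paths are parameters so it can be tried against ordinary files first.

```shell
# rescan_if_missing [DEVICE] [SCAN_FILE]
# If DEVICE exists, prints "present" and does nothing else.
# Otherwise writes the wildcard "- - -" (all channels, targets, LUNs)
# to SCAN_FILE and prints "rescanned".
# Defaults are the paths used in this thread; adjust for your system.
rescan_if_missing() {
    rescan_dev=${1:-/dev/sda}
    rescan_file=${2:-/sys/class/scsi_host/host0/scan}
    if [ -e "$rescan_dev" ]; then
        echo "present"
        return 0
    fi
    echo "- - -" > "$rescan_file" || return 1
    echo "rescanned"
}
```

In an init.d script this would be called before any mounts that depend on the drive. Note that the device node may take a moment to appear after the rescan.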

So far it's always worked. But I'm testing testing testing as we type.

I added verbose logging by editing include/linux/libata.h (notice the #undef's; specifically, change ATA_DEBUG and ATA_VERBOSE_DEBUG to #define) and then rebuilding the kernel. There may be "better" ways to do this but I just needed information and that was my quickest path to get it.
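The header edit can also be scripted. This is only a sketch: it assumes the header contains lines starting with "#undef ATA_DEBUG" and "#undef ATA_VERBOSE_DEBUG", so check your own kernel tree before relying on it. The function name enable_libata_debug is made up for illustration.

```shell
# enable_libata_debug HEADER
# Changes "#undef ATA_DEBUG" and "#undef ATA_VERBOSE_DEBUG" to "#define"
# in HEADER, leaving every other line (including other #undef's) untouched.
# The original file is kept as HEADER.orig.
enable_libata_debug() {
    libata_hdr=$1
    cp "$libata_hdr" "$libata_hdr.orig" || return 1
    sed 's/^#undef\([[:space:]]\{1,\}ATA_\(VERBOSE_\)\{0,1\}DEBUG\)/#define\1/' \
        "$libata_hdr.orig" > "$libata_hdr"
}
```

Usage from the top of the kernel tree would be: enable_libata_debug include/linux/libata.h, followed by a kernel rebuild. Copy the .orig file back to undo it.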

Once again, YMMV. This works for my needs - it's not as elegant as one might like ...

Brian.

MealTicket · 04-09-2014, 12:02 PM

Thanks a bunch for taking time to post. Much appreciated.

Unfortunately, running echo "- - -" > /sys/.../scan after bootup doesn't enable the device. I have not tried it during boot in init.d.

Here's the diff when I run echo "- - -" > /sys/.../scan in command line after the system has booted...

Code:

BAD BOARD                                             GOOD BOARD
test# echo "- - -" > /sys/class/scsi_host/host0/scan  test# echo "- - -" > /sys/class/scsi_host/host0/scan
ata_port_schedule_eh: port EH scheduled               ata_port_schedule_eh: port EH scheduled
ata_scsi_error: ENTER                                 ata_scsi_error: ENTER
ata_sff_flush_pio_task: ENTER                         ata_sff_flush_pio_task: ENTER
ata1: ata_sff_flush_pio_task: EXIT                    ata1: ata_sff_flush_pio_task: EXIT
ata_eh_link_autopsy: ENTER                            ata_eh_link_autopsy: ENTER
ata_eh_link_autopsy: EXIT                             ata_eh_link_autopsy: EXIT
ata_eh_recover: ENTER                                 ata_eh_recover: ENTER
__ata_port_freeze: ata1 port frozen                   __ata_port_freeze: ata1 port frozen
ata1: hard resetting link                             ata1: hard resetting link
ahci_hardreset: ENTER                                 ahci_hardreset: ENTER
sata_link_hardreset: ENTER                            sata_link_hardreset: ENTER
sata_link_hardreset: EXIT, rc=0                       sata_link_hardreset: EXIT, rc=0
ahci_hardreset: EXIT, rc=0, class=0                   ata_dev_classify: found ATA device by sig
ata_eh_thaw_port: ata1 port thawed                    ahci_hardreset: EXIT, rc=0, class=1
ata_std_postreset: ENTER                              ata_eh_thaw_port: ata1 port thawed
ata1: SATA link down (SStatus 0 SControl 300)         ata_std_postreset: ENTER
                                                      ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)

BTW extra thanks for pointing me to enable debugging in include/linux/libata.h. Sooooo much easier and more thorough than the 100 printk's I had.

This is what I get at boot with all the debugging on....

Code:

GOOD BOARD                                                                    BAD BOARD
SCSI Media Changer driver v0.25                                               SCSI Media Changer driver v0.25
ata_host_alloc: ENTER                                                         ata_host_alloc: ENTER
ata_port_alloc: ENTER                                                         ata_port_alloc: ENTER
ahci: SSS flag set, parallel bus scan disabled                                ahci: SSS flag set, parallel bus scan disabled
ahci_port_init: PORT_SCR_ERR 0x0                                              ahci_port_init: PORT_SCR_ERR 0x0
ahci_port_init: PORT_IRQ_STAT 0x0                                             ahci_port_init: PORT_IRQ_STAT 0x0
ahci_init_controller: HOST_CTL 0x80000000                                     ahci_init_controller: HOST_CTL 0x80000000
ahci_init_controller: HOST_CTL 0x80000002                                     ahci_init_controller: HOST_CTL 0x80000002
ahci ahci.0: AHCI 0001.0100 32 slots 1 ports 3 Gbps 0x1 impl platform mode    ahci ahci.0: AHCI 0001.0100 32 slots 1 ports 3 Gbps 0x1 impl platform mode
ahci ahci.0: flags: ncq sntf stag pm led clo only pmp pio slum part ccc       ahci ahci.0: flags: ncq sntf stag pm led clo only pmp pio slum part ccc
__ata_port_freeze: ata4294967295 port frozen                                  __ata_port_freeze: ata4294967295 port frozen
ahci_interrupt: ENTER                                                         ata1: SATA max UDMA/133 mmio [mem 0x10000000-0x10000fff] port 0x100 irq 28
__ata_port_freeze: ata4294967295 port frozen
ahci_interrupt: port 0
ahci_interrupt: EXIT
scsi0 : ahci
ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 28

So the bad board never enters ahci_interrupt.
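One way to cross-check this without the debug build is to look at the interrupt counter for the SATA IRQ (28 in the logs above) in /proc/interrupts. A rough sketch follows; the function name irq_count is made up, and the file argument exists only so it can be tried on a saved copy.

```shell
# irq_count IRQ [FILE]
# Prints the total count for IRQ from a /proc/interrupts-style file,
# summing the per-CPU columns that follow the "IRQ:" label.
# Returns non-zero if the IRQ is not listed at all.
irq_count() {
    awk -v irq="$1:" '
        $1 == irq {
            total = 0
            # Count columns end at the first non-numeric field (controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            print total
            found = 1
        }
        END { if (!found) exit 1 }
    ' "${2:-/proc/interrupts}"
}
```

If irq_count 28 stays at 0 on the bad board while the good board shows a non-zero count, that would agree with the handler never being entered.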

This is good progress. I will update if I ever find a solution. Actually, if you can post up the script for init.d so I can test this at bootup that would be awesome.

I've made more progress in a couple hours, thanks to you, than I made in 3 days.

Then the real question will be why this is happening on just 2 of 70 boards?

brs332 · 04-09-2014, 02:02 PM

Try putting one or more of the other devices in a reboot loop and see what happens. We see this sporadically in the lab, some devices seem more susceptible than others.

Below is what I put in /etc/init.d/S96rebooter.sh. This is a reboot loop, and I use a U-Boot environment variable to stop it. /sbin/ubootenv comes with my OS (DigiEL); I'm not sure what Freescale provides. However, if you don't do something like this you'll end up needing to edit/remove /etc/init.d/S96rebooter.sh another way, which might be hard depending on your setup.

So, when /dev/sda doesn't exist (you could grep for the log messages too - whatever) I echo to scan and then drop in to a shell. Thus far when I find the device sitting at a prompt I see the results of the re-scan and *so far* it's been successful.
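The "grep for the log messages" alternative could look roughly like this. It is a sketch that assumes the "ataN: SATA link up/down" wording seen in the logs earlier in this thread. The function name sata_link_state is made up for illustration.

```shell
# sata_link_state PORT
# Reads kernel log text on stdin and prints "up", "down" or "unknown"
# based on the last "PORT: SATA link ..." line it finds.
sata_link_state() {
    link_line=$(grep "$1: SATA link " | tail -n 1)
    case $link_line in
        *"SATA link up"*)   echo up ;;
        *"SATA link down"*) echo down ;;
        *)                  echo unknown ;;
    esac
}
```

Typical use would be: dmesg | sata_link_state ata1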

It's interesting that your device isn't reporting "online" and so ata_dev_classify isn't being called. From what I can tell from your log, sata_link_hardreset is kicking out early. However, the code from there isn't instrumented well so it's hard to tell if the iMX is reporting a bad state or if something in the processing went off the rails. That would be the place I'd start adding some printk's.

Next step is the 5K page datasheet <grin> ...

Brian.

--------------------

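#!/bin/bash

# Reboot loop for catching the failure: keep rebooting while /dev/sda *exists*,
# and stop to rescan as soon as it is missing.

# Setting the U-Boot environment variable "reboot" stops the loop.
reboot=$(/sbin/ubootenv --print reboot)

if [ -e /dev/sda ]
then
    # Drive was detected: reboot again unless the stop variable is set.
    if [ "X$reboot" = "X" ]
    then
        /sbin/reboot
    fi
else
    # Drive is missing: ask the SCSI layer to scan channel 0, target 0, LUN 0.
    /bin/echo 0 0 0 > /sys/class/scsi_host/host0/scan
fi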

MealTicket · 04-09-2014, 05:03 PM

Brian, thanks again for your response. You've helped me quite a bit.

Thanks to your script I was able to go into a boot loop when sda is found (on a good board) and issue a write to /sys/.../scan when it wasn't found (on a bad board). Unfortunately, since writing to /sys/.../scan does not work from the command line, it did not work from init.d either, since init.d is late in the bootup game.

My bug appears more deviant.

Once again I'm going through the driver source, but I think I'll have to consult a knowledge expert on the driver or get the BDI3000 GDB debug tool working.

I tried to play with the timing in sata_link_hardreset to no avail. I'll have to try some more things.

I'm pretty sure this line in ahci_platform.c is significant:

Code:

rc = ata_host_activate(host, irq, ahci_interrupt, IRQF_SHARED, &ahci_sht);

ahci_interrupt is a function defined in libahci.c, and in the above line it's passed in as a parameter. For some reason ahci_interrupt is not being called.

On a side note I am able to verify the sata drive is recognized in U-Boot...

Code:

Hit any key to stop autoboot:  0
testConsole>  sata info

SATA device 0: Model: SATADOM D150QH Firm: 120925 Ser#: 20131112AABB00000004
            Type: Hard Disk
            Capacity: 3825.7 MB = 3.7 GB (7835184 x 512)

There's a blue light on the sata drive. It stays blue all the time on a good board, but on a bad board it turns off when the kernel starts loading. Sounds like a driver issue.

brs332 · 04-10-2014, 07:09 AM

Yeah, sorry - looks like the root of our problems is different. I'll post anything new I learn here. Drop me a PM if you want to kick some ideas around.

Brian.

_solid_ · 09-10-2014, 10:44 AM

MealTicket, did you ever find a solution to this? I have a bunch of boards (~5% of a batch) that exhibit the exact same problem and I've been trying to solve this one for a while now. There seems to be some hardware variation that makes the 2.6.35 driver not behave correctly. I have tested against U-Boot and another OS (QNX) and the SATA does work on these boards, but not in Linux.

My next step is to build up a newer kernel and try that. Otherwise I'll start debugging the current SATA driver.

manjunathjoshi · 10-23-2017, 07:19 AM

I recently debugged the same issue and I believe it has something to do with an unstable internal reference clock and suspend/resume of the SATA link.

Please check whether the patch at the link below works.

https://community.nxp.com/thread/360029