We have a system based on the Freescale iMX53 processor with one SATA port. We have 2 boards out of 70 that have a SATA issue. I've gone through u-boot source and now the ata driver source.
To debug, I have sprinkled printk statements all over libata-core.c and ahci_platform.c and still can't figure out why the SATA link is down. I am by no means a C expert, which explains why the source appears complex to me. I may be looking in the wrong spots anyway.
I have diffed a good board and bad board dmesg log.
I'm hoping someone can look at the below diff snippet and supply a clue.
Here is one line early on in dmesg that I think will lead to the answer...
good board: ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 28
bad board_: ata1: SATA max UDMA/133 mmio [mem 0x10000000-0x10000fff] port 0x100 irq 28
Later on in dmesg this is the result...
good board: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
bad board_: ata1: SATA link down (SStatus 0 SControl 300)
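For anyone reading those two lines: libata prints SStatus in hex, and the SATA spec splits it into three 4-bit fields (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8). A minimal sketch that decodes them; the function name `decode_sstatus` is just made up for illustration:

```shell
#!/bin/sh
# Decode the hex SStatus value from libata's "SATA link up/down" message.
# Field layout per the SATA spec: bits 3:0 = DET, 7:4 = SPD, 11:8 = IPM.
# decode_sstatus is a hypothetical helper name, not a kernel or distro tool.
decode_sstatus() {
    val=$((0x$1))                 # argument is hex without the 0x prefix
    det=$((val & 0xf))            # device detection / PHY state
    spd=$(((val >> 4) & 0xf))     # negotiated speed generation
    ipm=$(((val >> 8) & 0xf))     # interface power management state
    echo "DET=$det SPD=$spd IPM=$ipm"
}

decode_sstatus 113   # good board
decode_sstatus 0     # bad board
```

On the good board that gives DET=3 (device detected and PHY communication established), SPD=1 (Gen1, 1.5 Gbps) and IPM=1 (interface active). On the bad board all three fields are 0, meaning the controller sees no device and no PHY communication at all.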
Funny, I've been working on this exact problem over the past few days.
My workaround is to check whether the device node is present (/dev/sda in my case) and rescan the bus if it's not (by writing to /sys/class/scsi_host/host0/scan). You might need something more robust ...
I've not seen a failure after the bus re-scan, but I literally just tested this overnight.
More details on why this is occurring can be found by compiling in verbose logging. The irq_stat message overwrites the mmio status message when the connection status change is handled, which is why the two boards log different lines. Something about the handling of the connect event causes the ATA error-handling code to fiddle with the device, which in turn either lets enough time pass or puts the device into another state. So I figured maybe I could rescan and catch the device; hence my workaround.
I don't care to get much deeper into it, as I don't think there's anyone out there who really cares for a fix (old kernel, and Freescale has moved on). But if there is, maybe this information will help them.
After I initially posted I delved further into the ata driver (libata-core.c and ahci.c) and narrowed it down to an IRQ not being handled, or something along those lines. Haven't had time to go any further. I did whip out the BDI3000 GDB JTAG debugger, but it took me a while to set up, so I went on to other things.
Couple questions for you...
In my case /sys/class/scsi_host/host0/scan is just a writable file...
--w------- 1 root root 4096 Jan 16 06:45 scan
Do you add code to it and make it executable? If so can you PLEASE post up what you did?
Also, do you mean building the kernel with verbose logging? How do you do that?
Yeah, it looks like a funny interaction with the iMX53. I also suspect the other 68 devices you have may eventually exhibit the problem. On one of mine it takes hours of rebooting to get in this state. I wish I had a device that reproduced it reliably as then I might actually *fix* it. But I don't so I'm not <grin>.
/sys/class/scsi_host/host0/scan is only writable. By writing to it you can ask the scsi subsystem to re-scan channels or targets or LUNs. Using "-" is a wild-card, so
echo "- - -" > /sys/class/scsi_host/host0/scan
will re-scan everything on that host. This may or may not be what you want to happen. I put a check for /dev/sda in an init.d script, and if it's not there I echo to /sys/.../scan. Depending on your application, you should think about putting this before the mount(s) occur.
So far it's always worked. But I'm testing testing testing as we type.
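A minimal sketch of that check, assuming a plain POSIX shell. The function name `rescan_if_missing` and the `RESCAN_WAIT` variable are invented for illustration, and the paths are passed in as arguments so it can be tried against scratch files first:

```shell
#!/bin/sh
# rescan_if_missing DEVICE SCANFILE
# If DEVICE does not exist, write the wildcard rescan request to SCANFILE
# and poll for DEVICE to appear. Returns 0 if DEVICE exists, 1 otherwise.
# Hypothetical helper; RESCAN_WAIT (seconds, default 5) is also made up.
rescan_if_missing() {
    dev=$1
    scan=$2

    # Nothing to do when the device node is already there.
    if [ -e "$dev" ]; then
        return 0
    fi

    # "- - -" = all channels, all targets, all LUNs on this host.
    echo "- - -" > "$scan"

    # Give the SCSI layer a little time to create the node.
    i=0
    while [ "$i" -lt "${RESCAN_WAIT:-5}" ]; do
        if [ -e "$dev" ]; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}
```

From an init script this would be called as `rescan_if_missing /dev/sda /sys/class/scsi_host/host0/scan`, ahead of any mounts that depend on the disk. The return code lets the caller decide what to do if the drive still hasn't shown up.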
I added verbose logging by editing include/linux/libata.h (notice the #undef lines; specifically, change ATA_DEBUG and ATA_VERBOSE_DEBUG from #undef to #define). Then rebuild your kernel. There may be "better" ways to do this, but I just needed information and that was my quickest path to get it.
Once again, YMMV. This works for my needs - it's not as elegant as one might like ...
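If you'd rather script the header change than edit it by hand, here is a rough sketch. It assumes the header contains lines starting exactly with `#undef ATA_DEBUG` and `#undef ATA_VERBOSE_DEBUG` (check your tree first) and that the build host has GNU sed for `-i`. The function name `enable_ata_debug` is made up:

```shell
#!/bin/sh
# enable_ata_debug HEADER
# Flip the two libata debug switches from #undef to #define in place.
# Assumes GNU sed (-i) and that the lines begin with "#undef ATA_...";
# enable_ata_debug is a hypothetical helper name.
enable_ata_debug() {
    sed -i \
        -e 's/^#undef ATA_VERBOSE_DEBUG/#define ATA_VERBOSE_DEBUG/' \
        -e 's/^#undef ATA_DEBUG/#define ATA_DEBUG/' \
        "$1"
}
```

Run it from the top of the kernel tree as `enable_ata_debug include/linux/libata.h`, then rebuild. Any other `#undef` lines in the header are left alone.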
Thanks a bunch for taking time to post. Much appreciated.
Unfortunately, running the echo to /sys/.../scan after boot-up doesn't enable the device. I have not tried it during boot in init.d.
Here's the diff when I run echo "- - -" > /sys/.../scan in command line after the system has booted...
Code:
BAD BOARD                                             | GOOD BOARD
test# echo "- - -" > /sys/class/scsi_host/host0/scan  | test# echo "- - -" > /sys/class/scsi_host/host0/scan
ata_port_schedule_eh: port EH scheduled               | ata_port_schedule_eh: port EH scheduled
ata_scsi_error: ENTER                                 | ata_scsi_error: ENTER
ata_sff_flush_pio_task: ENTER                         | ata_sff_flush_pio_task: ENTER
ata1: ata_sff_flush_pio_task: EXIT                    | ata1: ata_sff_flush_pio_task: EXIT
ata_eh_link_autopsy: ENTER                            | ata_eh_link_autopsy: ENTER
ata_eh_link_autopsy: EXIT                             | ata_eh_link_autopsy: EXIT
ata_eh_recover: ENTER                                 | ata_eh_recover: ENTER
__ata_port_freeze: ata1 port frozen                   | __ata_port_freeze: ata1 port frozen
ata1: hard resetting link                             | ata1: hard resetting link
ahci_hardreset: ENTER                                 | ahci_hardreset: ENTER
sata_link_hardreset: ENTER                            | sata_link_hardreset: ENTER
sata_link_hardreset: EXIT, rc=0                       | sata_link_hardreset: EXIT, rc=0
ahci_hardreset: EXIT, rc=0, class=0                   | ata_dev_classify: found ATA device by sig
ata_eh_thaw_port: ata1 port thawed                    | ahci_hardreset: EXIT, rc=0, class=1
ata_std_postreset: ENTER                              | ata_eh_thaw_port: ata1 port thawed
ata1: SATA link down (SStatus 0 SControl 300)         | ata_std_postreset: ENTER
                                                      | ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
BTW extra thanks for pointing me to enable debugging in include/linux/libata.h. Sooooo much easier and more thorough than the 100 printk's I had.
This is what I get at boot with all the debugging on....
Code:
GOOD BOARD                                                                     | BAD BOARD
SCSI Media Changer driver v0.25                                                | SCSI Media Changer driver v0.25
ata_host_alloc: ENTER                                                          | ata_host_alloc: ENTER
ata_port_alloc: ENTER                                                          | ata_port_alloc: ENTER
ahci: SSS flag set, parallel bus scan disabled                                 | ahci: SSS flag set, parallel bus scan disabled
ahci_port_init: PORT_SCR_ERR 0x0                                               | ahci_port_init: PORT_SCR_ERR 0x0
ahci_port_init: PORT_IRQ_STAT 0x0                                              | ahci_port_init: PORT_IRQ_STAT 0x0
ahci_init_controller: HOST_CTL 0x80000000                                      | ahci_init_controller: HOST_CTL 0x80000000
ahci_init_controller: HOST_CTL 0x80000002                                      | ahci_init_controller: HOST_CTL 0x80000002
ahci ahci.0: AHCI 0001.0100 32 slots 1 ports 3 Gbps 0x1 impl platform mode     | ahci ahci.0: AHCI 0001.0100 32 slots 1 ports 3 Gbps 0x1 impl platform mode
ahci ahci.0: flags: ncq sntf stag pm led clo only pmp pio slum part ccc        | ahci ahci.0: flags: ncq sntf stag pm led clo only pmp pio slum part ccc
__ata_port_freeze: ata4294967295 port frozen                                   | __ata_port_freeze: ata4294967295 port frozen
ahci_interrupt: ENTER                                                          | ata1: SATA max UDMA/133 mmio [mem 0x10000000-0x10000fff] port 0x100 irq 28
__ata_port_freeze: ata4294967295 port frozen                                   |
ahci_interrupt: port 0                                                         |
ahci_interrupt: EXIT                                                           |
scsi0 : ahci                                                                   |
ata1: SATA max UDMA/133 irq_stat 0x00400040, connection status changed irq 28  |
So on the bad board it never enters ahci_interrupt.
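One way to cross-check that without the debug build is to look at the counter for irq 28 in /proc/interrupts on both boards. A small sketch, with `irq_count` being a made-up helper name; it sums the per-CPU columns for the given IRQ number:

```shell
#!/bin/sh
# irq_count IRQ [FILE]
# Print the total number of times IRQ has fired, summed over all CPU
# columns. FILE defaults to /proc/interrupts. Exits non-zero if the IRQ
# is not listed. irq_count is a hypothetical helper name.
irq_count() {
    awk -v irq="$1:" '
        $1 == irq {
            total = 0
            # Count columns are the leading all-digit fields after the label.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print total
            found = 1
        }
        END { if (!found) exit 1 }
    ' "${2:-/proc/interrupts}"
}
```

Running `irq_count 28` shortly after boot on each board should show whether the AHCI interrupt ever fired. A count stuck at 0 on the bad board would fit with ahci_interrupt never being entered.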
This is good progress. I will update if I ever find a solution. Actually, if you can post up the script for init.d so I can test this at bootup that would be awesome.
I've made more progress in a couple hours, thanks to you, than I made in 3 days.
Then the real question will be why this is happening on just 2 of 70 boards?
Last edited by MealTicket; 04-09-2014 at 12:07 PM.
Try putting one or more of the other devices in a reboot loop and see what happens. We see this sporadically in the lab, some devices seem more susceptible than others.
Below is what I put in /etc/init.d/S96rebooter.sh. This is a reboot loop, and I use a U-Boot environment variable to stop it. /sbin/ubootenv comes with my OS (DigiEL); I'm not sure what Freescale provides. However, if you don't do something like this you'll end up needing to edit or remove /etc/init.d/S96rebooter.sh another way, which might be hard depending on your setup.
So, when /dev/sda doesn't exist (you could grep for the log messages too - whatever) I echo to scan and then drop in to a shell. Thus far when I find the device sitting at a prompt I see the results of the re-scan and *so far* it's been successful.
It's interesting that your device isn't reporting "online" and so ata_dev_classify isn't being called. From what I can tell from your log, sata_link_hardreset is kicking out early. However, the code from there isn't instrumented well so it's hard to tell if the iMX is reporting a bad state or if something in the processing went off the rails. That would be the place I'd start adding some printk's.
Next step is the 5K page datasheet <grin> ...
Brian.
--------------------
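#!/bin/bash
# Reboot the device if /dev/sda *exists* (we want to recreate and catch failures)

# If ubootenv prints anything for the "reboot" variable, the loop stops.
reboot=$(/sbin/ubootenv --print reboot)

if [ -e /dev/sda ]
then
    # Disk was detected: reboot again unless the stop variable is set.
    if [ "X$reboot" = "X" ]
    then
        /sbin/reboot
    fi
else
    # Disk is missing: ask the SCSI layer to rescan channel 0, target 0, LUN 0.
    /bin/echo 0 0 0 > /sys/class/scsi_host/host0/scan
fi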
Last edited by brs332; 04-09-2014 at 02:06 PM.
Reason: Remove 2>&1 which isn't particularly good when echo'ing to /sys <sheepish grin>
Brian, thanks again for your response. You've helped me quite a bit.
Thanks to your script I was able to go into a boot loop when sda is found (on a good board) and issue a write to /sys/../scan when it wasn't found (on a bad board). Unfortunately, since writing to /sys/../scan does not work from the command line, it did not work from init.d either, since init.d runs late in the boot-up game.
My bug appears more deviant.
Once again I'm going through the driver source, but I think I'll have to consult a knowledge expert on the driver or get the BDI3000 GDB debug tool working.
I tried to play with the timing in sata_link_hardreset to no avail. I'll have to try some more things.
I'm pretty sure this line is significant in ahci_platform.c..
ahci_interrupt is a function defined in libahci.c, and in the above line it's passed in as a parameter. For some reason ahci_interrupt is not being called.
On a side note I am able to verify the sata drive is recognized in U-Boot...
Code:
Hit any key to stop autoboot: 0
testConsole> sata info
SATA device 0: Model: SATADOM D150QH Firm: 120925 Ser#: 20131112AABB00000004
Type: Hard Disk
Capacity: 3825.7 MB = 3.7 GB (7835184 x 512)
There's a blue light on the sata drive. It stays blue all the time on a good board, but on a bad board it turns off when the kernel starts loading. Sounds like a driver issue.
Yeah, sorry - looks like the root of our problems is different. I'll post anything new I learn here. Drop me a PM if you want to kick some ideas around.
MealTicket, did you ever find a solution to this? I have a bunch of boards (~5% of a batch) that exhibit the exact same problem, and I've been trying to solve this one for a while now. There seems to be some hardware variation that makes the 2.6.35 driver not behave correctly. I have tested against U-Boot and another OS (QNX), and SATA does work on these boards, but not in Linux.
My next step is to build up a newer kernel and try that. Otherwise I'll start debugging the current SATA driver.
I recently debugged the same issue, and I believe it has something to do with an unstable internal reference clock and the suspend/resume of the SATA link.