I am using a BeagleBoard to run my code.
As you all know, the BeagleBoard uses the TI OMAP35xx architecture.
My issue is this: I have written the code in two different ways.
One is plain C; the other uses NEON intrinsics to parallelize the code.
I am using the following toolchain switches for compilation:
CORTEX=-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions
Both versions work fine, but their execution times are almost the same.
When I run the same source under other profilers, there is a significant improvement in the cycle count (i.e. a time difference).
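To make the comparison concrete, here is a minimal sketch of the kind of two-version setup described above. It is not my actual code: the kernel (element-wise float addition) and the names add_c, add_neon and time_kernel are hypothetical and chosen only for illustration. The NEON path is guarded so the file still builds where NEON is not enabled, and timing uses the standard clock() call, which is only a rough measure.

```c
#include <stddef.h>
#include <time.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* assumption: fall back to scalar code without NEON */
#endif

/* Plain C version (hypothetical kernel): dst[i] = a[i] + b[i]. */
void add_c(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* NEON intrinsics version: four floats per iteration, scalar loop for the tail. */
void add_neon(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add and store 4 results */
    }
#endif
    for (; i < n; i++)                          /* remaining elements */
        dst[i] = a[i] + b[i];
}

typedef void (*add_fn)(float *, const float *, const float *, size_t);

/* CPU time in seconds for `reps` calls of `fn` on the same buffers. */
double time_kernel(add_fn fn, float *dst, const float *a, const float *b,
                   size_t n, int reps)
{
    int r;
    clock_t t0 = clock();
    for (r = 0; r < reps; r++)
        fn(dst, a, b, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_kernel(add_c, ...) and time_kernel(add_neon, ...) on the same buffers with the same repetition count gives the two times being compared.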
The BeagleBoard is configured with Advanced SIMD (NEON) enabled and runs Angstrom Linux as the OS.
Why am I not getting better performance?
Please answer!
The NEON intrinsics can be found at the following link: