

i.MX353 ARM11-based flight computer interfacing with CC


38 replies to this topic

#21 dankers

dankers

    Janitor

  • Administrators
  • 5124 posts
  • Country: Australia


Posted 24 November 2011 - 02:13 PM

Kenn Sebesta, on 24 November 2011 - 12:47 PM, said:

From a perspective of raw performance, that's probably a fair claim. However, from the perspective of actually getting things done, quickly, right, the first time, without bugs, the FPU is an incredible boon.

With the Pro we needed two CPUs: one to run the EKF and the other for everything else. The old INS ran the filter at 150 Hz and 80% CPU. The F4 runs the same EKF at 3.3 kHz using the FPU, which seems like a great deal more raw performance to me.

#22 Kenn Sebesta

Kenn Sebesta

    Controls Master!

  • Members
  • 896 posts
  • Country: Luxembourg


Posted 24 November 2011 - 02:36 PM

dankers, on 24 November 2011 - 02:13 PM, said:

With the Pro we needed two CPUs: one to run the EKF and the other for everything else. The old INS ran the filter at 150 Hz and 80% CPU. The F4 runs the same EKF at 3.3 kHz using the FPU, which seems like a great deal more raw performance to me.

Peabody124 can correct me if I'm wrong, but I believe this is because the EKF was not written in fixed-point and thus was never optimized to its fullest potential on the original STM. Quite the contrary, in fact. Floating-point operations on a processor without an FPU have to be done in software, which takes orders of magnitude longer. The worst is the square-root operation, which is not deterministic and can take far in excess of 100 instructions. So what robert b was saying about not needing the FPU was factually correct, since with proper analysis we could write a fixed-point version of the EKF that would have been much, much faster.
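
To make the fixed-point argument concrete, here is a minimal Python sketch (not code from the original INS) of Q16.16 arithmetic with an integer-only square root that always runs the same number of loop iterations. The Q16.16 format and all function names are assumptions chosen for illustration; a real fixed-point EKF would also need the range and precision analysis mentioned above, and Python's big integers say nothing about actual MCU timing.

```python
FRAC_BITS = 16                 # Q16.16: 16 integer bits, 16 fractional bits (assumed format)
ONE = 1 << FRAC_BITS

def to_fixed(x):
    """Convert a float to Q16.16 (rounded to nearest)."""
    return int(round(x * ONE))

def to_float(q):
    """Convert Q16.16 back to a float, for inspection only."""
    return q / ONE

def fx_mul(a, b):
    """Multiply two Q16.16 values; the product has 32 fractional bits, so shift back."""
    return (a * b) >> FRAC_BITS

def isqrt_fixed_steps(n, bits=48):
    """Integer square root of a non-negative n < 2**bits, one result bit per
    iteration, so the loop count (bits // 2) does not depend on the value of n."""
    result = 0
    bit = 1 << (bits - 2)      # highest power of four that fits in `bits` bits
    while bit:
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result

def fx_sqrt(a):
    """Square root of a non-negative Q16.16 value, returned as Q16.16."""
    # sqrt(a / 2**16) * 2**16 == sqrt(a * 2**16)
    return isqrt_fixed_steps(a << FRAC_BITS)
```

For example, `fx_sqrt(to_fixed(4.0))` gives the Q16.16 encoding of 2.0, using only shifts, adds, and compares.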

Personally, though, I'd say that by the same argument we could claim that we don't need compilers either, since we can do everything in assembly. In other words, you can pry my FPU and compiler from my cold, dead hands. I'm not giving either up now that I have them!

P.S. Of course, the speed bump from 72MHz to 168MHz couldn't hurt either!
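
As a rough back-of-the-envelope check on the figures quoted in this exchange, the short Python sketch below separates the clock-speed contribution from everything else. It assumes, which the posts do not state, that the 150 Hz and 3.3 kHz rates are directly comparable (same filter, similar CPU share); the variable names are illustrative.

```python
old_rate_hz = 150.0      # EKF rate on the old INS (from the posts above)
new_rate_hz = 3300.0     # EKF rate on the F4 (from the posts above)
old_clock_mhz = 72.0     # clock of the original STM
new_clock_mhz = 168.0    # clock of the F4

total_speedup = new_rate_hz / old_rate_hz        # overall throughput gain
clock_speedup = new_clock_mhz / old_clock_mhz    # gain explained by clock alone
other_speedup = total_speedup / clock_speedup    # FPU, core and compiler effects combined

print(f"total {total_speedup:.1f}x, clock {clock_speedup:.2f}x, other {other_speedup:.1f}x")
```

Under those assumptions the clock accounts for a bit more than 2x of a roughly 22x gain, so most of the difference comes from somewhere other than the raw MHz.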

#23 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 24 November 2011 - 07:36 PM

I know a person who has written an XML parser in assembler :)
Lately I have seen an XML parser implemented as part of a chip.
So why not create a peripheral that deals with the sensor fusion?
Then there is nothing left to do except set up the DMA to move the sensor data to the peripheral.
In the end you read either quaternions or Euler angles from the peripheral.
To save current, you cut the power domain for the FPU...

#24 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 25 November 2011 - 11:34 AM

I started working with an LPC1769 as the target platform.
Of interest is the raw sensor fusion performance using canned data (I have no I2C driver at the moment ...).
I ended up with a frequency above 10 kHz for the Madgwick implementation.

I tend to use a proportional gain for the error correction.
Doing so, I will run the sensor fusion algorithm at a higher frequency than the motor PID loop to reduce the noise from the proportional gain.
It will be interesting to see the behaviour in the air.

Lots of power to play with. For sure an FPU will raise the raw sensor fusion frequency to another level.
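
The structure described here, a proportional error-correction term updated faster than the control loop reads it, can be sketched in a few lines of Python. This is a deliberately simplified one-axis estimator, not the Madgwick algorithm and not robert b's code; the gain `kp`, the rates, and the function names are assumptions for illustration, and the sketch shows only the rate relationship, not the noise behaviour.

```python
def fuse_step(angle, gyro_rate, accel_angle, kp, dt):
    """One estimator update: integrate the gyro rate plus a proportional
    correction that pulls the estimate toward the accelerometer angle."""
    error = accel_angle - angle
    return angle + (gyro_rate + kp * error) * dt

def run(fusion_hz=10000, control_hz=500, seconds=5.0, kp=2.0):
    """Run the fusion at fusion_hz and sample it for the control loop at control_hz."""
    steps_per_control = fusion_hz // control_hz   # fusion updates per PID update
    dt = 1.0 / fusion_hz
    angle = 0.0
    samples = []                                  # what the PID loop would see
    for i in range(int(seconds * fusion_hz)):
        # Canned data: stationary vehicle tilted by 10 degrees.
        angle = fuse_step(angle, gyro_rate=0.0, accel_angle=10.0, kp=kp, dt=dt)
        if (i + 1) % steps_per_control == 0:
            samples.append(angle)
    return samples
```

With the default values the estimator performs 20 fusion updates for every value handed to the control loop, and the estimate settles toward the 10 degree reference.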

#25 Corvus Corax

Corvus Corax

    Master of Fixed Wing Flight Control

  • Administrators
  • 527 posts
  • Location: Stuttgart, Germany
  • Country: Germany


Posted 30 November 2011 - 03:36 PM

The only disadvantage I see with a "single board solution" is that the restrictions that apply to the INS

(placed aligned with the airframe - vibration free - away from high-current lines because of the mags - close to the center of gravity)

now get combined with the requirements of the IO controller

(has all the servo cables routed to / through it - must be near the other electronics - must be near the main power supply - might have to be crammed in at weird angles due to cable/connector-induced space limitations)

As one can easily see, those placement requirements are in conflict.

Then, on the pro side of placement, there is only one board instead of two (or rather two instead of three, as there's still the GPS).


The other effect is that we have to make absolutely sure the EKF thread runs at full real-time priority and doesn't ever skip a beat, even if "high level functions" (guidance, flight planning, mapping, telemetry) take up CPU resources and/or hardware IO takes up IO (memory, DMA) resources.

With a single CPU you don't have simultaneous processing of EKF and stabilization anymore; they must run interleaved.

If you run the EKF at 3 kHz, do you also want to preempt threads at 3 kHz? Because otherwise the EKF will run 10 cycles at 5 kHz, then skip 8 cycles, then run another 20 cycles, then skip 5, ...

Have these scheduling issues been taken into account? Does FreeRTOS offer true real-time scheduling (run thread A every 0.5 milliseconds until it yields)?

#26 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 02 December 2011 - 10:55 AM

Corvus Corax, on 30 November 2011 - 03:36 PM, said:

If you run the EKF at 3 kHz, do you also want to preempt threads at 3 kHz?

There is probably no point in running it that fast, apart from a performance comparison with the old board. We can just run it at 500 Hz and have a lot of performance left for other tasks.
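
A quick estimate of what "a lot of performance left" could mean, as a Python sketch. It assumes that the 3.3 kHz figure quoted earlier is the flat-out rate with the CPU fully occupied by the EKF, which the thread does not actually confirm; the variable names are illustrative.

```python
max_rate_hz = 3300.0     # EKF rate reported for the F4, assumed to be the flat-out rate
target_rate_hz = 500.0   # rate proposed above

time_per_update_us = 1e6 / max_rate_hz       # time for one EKF update
cpu_share = target_rate_hz / max_rate_hz     # fraction of CPU used at the target rate

print(f"{time_per_update_us:.0f} us per update, {cpu_share:.0%} CPU at {target_rate_hz:.0f} Hz")
```

Under that assumption one update takes about 303 us, so a 500 Hz EKF would use roughly 15% of the CPU.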

Corvus Corax, on 30 November 2011 - 03:36 PM, said:

Does FreeRTOS offer true real-time scheduling (run thread A every 0.5 milliseconds until it yields)?

I hope so - otherwise the 'R' in RTOS would be fake ;-)

#27 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 02 December 2011 - 11:11 AM

Corvus Corax, on 30 November 2011 - 03:36 PM, said:

Have these scheduling issues been taken into account? Does FreeRTOS offer true real-time scheduling (run thread A every 0.5 milliseconds until it yields)?

Looks like the right way to do this in FreeRTOS is to run the task at high priority and call vTaskDelayUntil at the end of each loop.
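
Why an absolute "delay until" matters can be shown with a small timing model in Python. It does not call FreeRTOS; it only mimics the difference between a relative delay (in the style of vTaskDelay) and a delay anchored to the previous wake time (in the style of vTaskDelayUntil). The function names are hypothetical, times are in microseconds, and the model assumes the task body always finishes within one period and is never preempted.

```python
def wake_times_relative(period_us, work_us, n):
    """Loop that runs its body, then sleeps for a full period (relative delay)."""
    t, wakes = 0, []
    for _ in range(n):
        wakes.append(t)
        t += work_us      # the task body runs
        t += period_us    # the delay only starts counting once the body is done
    return wakes

def wake_times_absolute(period_us, work_us, n):
    """Loop that sleeps until previous wake time + period (absolute delay)."""
    previous_wake, wakes = 0, []
    for _ in range(n):
        wakes.append(previous_wake)
        # The body still takes work_us, but the next wake-up is anchored
        # to the previous wake time, so the execution time does not accumulate.
        previous_wake += period_us
    return wakes

print(wake_times_relative(2000, 300, 4))   # [0, 2300, 4600, 6900]
print(wake_times_absolute(2000, 300, 4))   # [0, 2000, 4000, 6000]
```

With a 2000 us period (500 Hz) and 300 us of work, the relative version drifts by 300 us per cycle, while the anchored version holds its rate.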

#28 peabody124

peabody124

    Crash Dummy

  • Administrators
  • 4110 posts
  • Location: Houston, TX
  • Country: United States


Posted 03 December 2011 - 12:21 AM

Yeah, the time granularity of FreeRTOS the way we run it is 1 ms. Multiple tasks might run within that 1 ms, though, if they are good about getting out of the way instead of waiting to be preempted.
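
The effect of a 1 ms granularity on a faster periodic task can be illustrated with a small Python model. It assumes the task is only ever woken on tick boundaries (not by an interrupt or a queue), which is a simplification; the function name and the microsecond units are illustrative.

```python
def first_runnable_tick(period_us, tick_us, n):
    """For iterations due at k * period_us, return the tick boundary (in us)
    at which each one can first be scheduled if wake-ups only happen on ticks."""
    times = []
    for k in range(n):
        due = k * period_us
        ticks = -(-due // tick_us)    # ceiling division
        times.append(ticks * tick_us)
    return times

print(first_runnable_tick(333, 1000, 6))    # [0, 1000, 1000, 1000, 2000, 2000]
print(first_runnable_tick(2000, 1000, 3))   # [0, 2000, 4000]
```

A task that wants to run about every 333 us ends up with its iterations bunched at each tick, whereas a 500 Hz task lands exactly on tick boundaries.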

#29 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 11 December 2011 - 09:05 PM

Task switching on a system with a floating-point unit is more complicated.
The registers of the floating-point unit have to be saved as well.
There is support for context switches in the ARM Cortex design,
but I have no experience with an M4 or with tweaking the RTOS to be aware of
the floating-point unit.

You can save time by not doing so ;-)
That means no context switch can be done while a floating-point operation is in progress.
Are the requirements so well known that such a thing could be realized?
Forcing a context switch with vTaskDelay...
Robert

#30 peabody124

peabody124

    Crash Dummy

  • Administrators
  • 4110 posts
  • Location: Houston, TX
  • Country: United States


Posted 11 December 2011 - 10:25 PM

Zippe has already taken care of that in FreeRTOS.

#31 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 11 December 2011 - 10:25 PM

robert b, on 11 December 2011 - 09:05 PM, said:

Task switching on a system with a floating-point unit is more complicated.
The registers of the floating-point unit have to be saved as well.

AFAIK, this is handled by the hardware. The OS doesn't even have to care whether there's an FPU or not. Only the crypto hardware requires additional actions during context switching and interrupt servicing.

robert b, on 11 December 2011 - 09:05 PM, said:

You can save time by not doing so ;-)
That means no context switch can be done while a floating-point operation is in progress.

No big problem, because most FPU instructions are completed within 1-3 CPU cycles. I really don't see much reason to avoid using the FPU.

#32 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 12 December 2011 - 12:31 PM

Werner,
  • The Cortex-M4 with hardware floating point needs an additional 136 bytes on the stack for storing the VFP registers. This means the total size of the full context store for a Cortex-M4 with FP is 200 bytes.
  • Cortex-M4 tasks that do not use floating-point arithmetic do not store the additional VFP registers on context save, so they do not need the additional 136 bytes on the stack. The full context store for tasks with no floating-point calculations is still 64 bytes.
This is not done automatically.
The design decision is whether or not to have context switches that include the complete FP state. The cost is described above.
It comes down to the latency of the system. Maybe it is possible to define FPU tasks within FreeRTOS for which the FPU registers are saved.
Eagerly waiting for the NXP M4s, which are dual-core - they come in January 2012.
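
The byte counts quoted above can be reproduced from register counts. The Python sketch below shows one decomposition that adds up to 64, 136, and 200 bytes; the breakdown is my reading of the Cortex-M4 register set, not something taken from the port's documentation, and `stack_cost` is a hypothetical helper.

```python
WORD = 4  # bytes per 32-bit register

# Integer context: R0-R12, LR, PC, xPSR (the stack pointer itself lives in the task control block).
integer_regs = 13 + 3
# Floating-point context: S0-S31, FPSCR, plus one reserved word that keeps the frame 8-byte aligned.
fp_regs = 32 + 1 + 1

integer_bytes = integer_regs * WORD
fp_bytes = fp_regs * WORD
full_bytes = integer_bytes + fp_bytes

def stack_cost(n_tasks, n_fp_tasks):
    """Total bytes of saved context if only n_fp_tasks of n_tasks carry FP state."""
    return n_tasks * integer_bytes + n_fp_tasks * fp_bytes

print(integer_bytes, fp_bytes, full_bytes)   # 64 136 200
```

For example, ten tasks of which two use floating point would need 912 bytes of saved context in total, compared with 2000 bytes if every task carried the full FP state.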

#33 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 12 December 2011 - 08:44 PM

I did some research ... there is no official FreeRTOS GCC port for the M4.
But there is a port at https://github.com/t...eeRTOS_ARM_CM4F
The readme is interesting.
The additional time needed for the store/restore is too big, from my point of view.
I would opt for completing the FPU instructions before a context switch.

#34 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 12 December 2011 - 09:21 PM

robert b, on 12 December 2011 - 08:44 PM, said:

The additional time needed for the store/restore is too big, from my point of view.

I haven't verified those numbers, but AFAIK the RTOS is configured to do task switching every 1 ms, so what is wrong with adding 400 ns to every task switch? That is less than 0.05% overhead. Btw, all registers are automatically pushed to the stack on every interrupt, except for the integer registers R4-R11. These are handled by the RTOS. But the rest, R0-R3, R12 and all the FPU registers, are pushed by the MCU. Of course this doesn't mean that it takes no time, but it doesn't hurt too badly. A way to avoid this is to activate a feature called "lazy-save". It only reserves stack space for the FPU registers during interrupts, but doesn't push them immediately. That is done later, if one of the registers is actually used.
Well, I still cannot see the potential problem. We are running at 168 MHz, not 8 or 16, so a few cycles more or less are really no reason to worry, or to avoid such a nice feature as the FPU, IMHO.
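
The overhead figure is easy to check. The Python sketch below takes the 400 ns and 1 ms values from the post as given (they are unverified there as well) and also converts the delay to CPU cycles at 168 MHz; the function names are illustrative.

```python
def switch_overhead(extra_ns, switches_per_second):
    """Fraction of CPU time spent on the extra FPU save/restore."""
    return extra_ns * 1e-9 * switches_per_second

def ns_to_cycles(ns, clock_mhz):
    """Convert a duration in nanoseconds to CPU cycles at the given clock."""
    return ns * clock_mhz / 1000.0

print(f"{switch_overhead(400, 1000):.2%} at 1 kHz, {ns_to_cycles(400, 168):.1f} cycles")
```

At one switch per millisecond this comes to 0.04%, or about 67 cycles per switch. Even at ten times that switch rate the share stays below half a percent.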

#35 peabody124

peabody124

    Crash Dummy

  • Administrators
  • 4110 posts
  • Location: Houston, TX
  • Country: United States


Posted 12 December 2011 - 09:47 PM

Just to be clear, it's preemptive switching at 1 ms.  However, normally we complete more tasks than that within a time slice as they cooperatively GTFO (vTaskDelay, wait on queue, etc).  And I don't think robert b was suggesting not to use the FPU, but merely that you put any floating point operations within a critical section so that no task switching occurs then.

However, I'm not sure I like that option as the EKF will occupy the FPU for a decent amount of time.  I wouldn't want to constrain ourselves to _not_ preempting that unless we find the switching cost is really hitting our performance.

#36 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 12 December 2011 - 09:59 PM

peabody124, on 12 December 2011 - 09:47 PM, said:

And I don't think robert b was suggesting not to use the FPU, but merely that you put any floating point operations within a critical section so that no task switching occurs then.

Hmm - but why? The push of the FPU registers will happen anyway, and that's most likely the biggest overhead (400 ns if that article is right). Waiting for an FPU instruction to complete - well... is there REALLY any FPU instruction that takes so long that this becomes necessary?

#37 robert b

robert b

    Advanced Member

  • Members
  • 100 posts
  • Location: Schneisingen
  • Country: Switzerland

Posted 12 December 2011 - 10:13 PM

The latency is not just the delay caused by the additional registers to be saved - it's probably a bit more.
I am not proposing to shut down the FPU - just making sure the FPU job can be done within the designated time slice.
In this case the cheap register save and restore can be used.
I also would not propose using critical sections to prevent a task switch, just forcing a switch before the end of the time slice.

At the end of every project we run out of CPU cycles.
Everyone is stealing them - because there are enough to do so.
Our project leader is already dreaming of having multicore MCUs.

#38 naiiawah

naiiawah

    Core Developer

  • Members
  • 309 posts
  • Location: Northwest USA
  • Country: United States


Posted 13 December 2011 - 04:11 AM

To deal with the FP overhead on a context switch, how about this as a possible solution? We have two types of threads: ones that are allowed to do FP ops and ones that are not. Add a flag to the thread create call to allow FP, and store that flag in the Thread Control Block (TCB). When an ISR or thread context switch happens, the handler just has to look at the current TCB to see if FP is used on that thread and only save the FPU regs if the flag is set. Most threads, which don't use the FPU, would not incur the overhead.

Comments?  Criticisms?
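
A minimal Python sketch of the proposal, to make the control flow explicit. Everything here is hypothetical: `Tcb`, `create_task`, and `context_save_bytes` are illustrative names rather than FreeRTOS APIs, and the byte counts are simply the 64 and 136 byte figures quoted earlier in the thread.

```python
from dataclasses import dataclass

INTEGER_CONTEXT_BYTES = 64   # figure quoted earlier in the thread
FP_CONTEXT_BYTES = 136       # figure quoted earlier in the thread

@dataclass
class Tcb:
    """Hypothetical task control block carrying the proposed per-task flag."""
    name: str
    uses_fpu: bool = False

def create_task(name, allow_fp=False):
    """Hypothetical create call: the flag is stored in the TCB at creation time."""
    return Tcb(name=name, uses_fpu=allow_fp)

def context_save_bytes(tcb):
    """What the switch handler would save for the outgoing task."""
    if tcb.uses_fpu:
        return INTEGER_CONTEXT_BYTES + FP_CONTEXT_BYTES
    return INTEGER_CONTEXT_BYTES

tasks = [create_task("ekf", allow_fp=True), create_task("telemetry"), create_task("actuator")]
saved = {t.name: context_save_bytes(t) for t in tasks}
print(saved)   # {'ekf': 200, 'telemetry': 64, 'actuator': 64}
```

The cost of the check is a single flag test per switch; the catch is that a task created without the flag must never execute an FP instruction, which this sketch does not enforce.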

#39 D-Lite

D-Lite

    Core Team

  • Members
  • 968 posts
  • Country: Germany


Posted 13 December 2011 - 09:46 AM

robert b, on 12 December 2011 - 10:13 PM, said:

At the end of every project we run out of CPU cycles.
Everyone is stealing them - because there are enough to do so.

That is true for sure. Having a fast MCU shouldn't mean that this power gets wasted blindly; I really agree with that.

naiiawah, on 13 December 2011 - 04:11 AM, said:

When an ISR or thread context switch happens, the handler just has to look at the current TCB to see if FP is used on that thread and only save the FPU regs if the flag is set.

It's really much easier - just set the "lazy-save" flag and the MCU will act (somewhat) like that. I still claim (until someone proves me wrong ;-) that the scheduler has nothing to do with storing the FPU registers. From the Cortex-M4 User Guide:


2.3.7. Exception entry and return

....
....

When the processor takes an exception, unless the exception is a tail-chained or a late-arriving
exception, the processor pushes information onto the current stack. This operation is referred to
as stacking and the structure of eight data words is referred to as the stack frame.
When using floating-point routines the Cortex-M4 processor automatically stacks the architected
floating-point state on exception entry.

I see it this way: as long as we cannot quantify the impact and clearly see that we have a problem here, there's no need to look for a solution. High-frequency ISRs like USART handlers could, for example, be an area worth looking at, much more than context switches. But again, using the "lazy-save" feature and not using any FPU instructions in the handler should be enough to do the trick.
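
The behaviour described for "lazy-save" can be captured in a small Python model. It counts stack words only and ignores cycle timing. The frame sizes (8 words for the basic frame, 18 more for S0-S15, FPSCR and a reserved word) reflect my understanding of the Cortex-M4 exception frame, and the function name and parameters are illustrative.

```python
BASIC_FRAME_WORDS = 8    # R0-R3, R12, LR, PC, xPSR
FP_FRAME_WORDS = 18      # S0-S15, FPSCR and one reserved word

def exception_entry(fp_context_active, lazy, handler_uses_fpu):
    """Return (words_reserved, words_written) on the interrupted code's stack.
    Simplified model: it counts words only and ignores cycle timing."""
    if not fp_context_active:
        # The interrupted code has not used the FPU: basic frame only.
        return BASIC_FRAME_WORDS, BASIC_FRAME_WORDS
    reserved = BASIC_FRAME_WORDS + FP_FRAME_WORDS
    if lazy and not handler_uses_fpu:
        # Space is set aside, but the FP registers are never actually stored.
        return reserved, BASIC_FRAME_WORDS
    return reserved, reserved

print(exception_entry(True, lazy=True, handler_uses_fpu=False))   # (26, 8)
print(exception_entry(True, lazy=False, handler_uses_fpu=False))  # (26, 26)
```

In this model a USART-style handler that never touches the FPU pays for the stack space but not for the register writes, which matches the argument made above.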