
Design Process : Official Assembly Language
===========================================
[options="autowidth"]
|====================================
|*State* | _Dropped_
|*Date* | _2008-08-01_
|*Proposed by* | PercivalTiglao
|====================================
Official Assembly Language
--------------------------
I describe here an optimization that might have to be taken into account at
the design level. At the very least, we should design our code with
auto-vectorization in mind. At the most, we can choose to write parts of our
code manually in assembly language and vectorize them by hand using x86 SSE
instructions or PowerPC AltiVec instructions. By keeping these instructions
in mind, we can readily achieve a large increase in speed.
Description
~~~~~~~~~~~
While the C / C++ core should be designed as efficiently and as portably as
possible, nominating an official assembly language or an official platform can
open new routes for optimization. For example, the x86 SSE instruction set
can add / subtract 16 bytes in parallel (interpreted as 8-bit, 16-bit, 32-bit,
or 64-bit integers, or 32-bit/64-bit floats), and some instructions support
masks, blending, dot products, and other operations specifically designed for
media processing. While specific assembly-level optimizations should be
deferred for now, structuring our code to encourage a style of programming
suitable for SSE optimization would make Lumiera significantly faster in the
long run. At the very least, we should structure our innermost loops so that
they are suitable for gcc's auto-vectorization.
The problem is that we will be splitting up our code. Bugs may appear on
platforms where the assembly-specific code paths are used, or conversely the
C/C++ code may have bugs that the assembly code does not. In effect we would
be maintaining one more implementation of the same code. Remember though: we
don't have to write assembly now, we just leave enough room in the design to
add assembly-level libraries somewhere in our code later.
Tasks
^^^^^
* Choose an ``Official'' assembly language / platform.
* Review the SIMD instructions available for that assembly language.
** For example, the Pentium 2 supports MMX instructions.
** The Pentium 3 supports MMX and SSE instructions.
** Early Pentium 4s support MMX, SSE, and SSE2 instructions.
** The Core Duo supports up to SSE4 instructions.
** AMD announced SSE5 instructions to come in 2009.
* Consider SIMD instructions while designing the Render Nodes and Effects
architecture.
* Write the whole application in C/C++ / Lua while leaving sections to optimize
in assembly later. (Probably simple tasks or a library written in C)
* Rewrite these sections in Assembly using only instructions we agreed upon.
Pros
^^^^
Assuming we go all the way with an official assembly language / platform...
* Significantly faster render and previews. (Even when using a high-level
library like http://www.pixelglow.com/macstl/valarray/[macstl valarray], we
can get 3.6x -- 16.2x the speed in our inner loop. We can probably expect
greater if we hand-optimize the assembly)
Cons
^^^^
* Earlier architectures of that family will be significantly slower or unsupported
* Other architectures will rely on C / C++ port instead of optimized assembly
* Redundant Code
Alternatives
^^^^^^^^^^^^
* We only consider auto-vectorization -- GCC attempts to convert trivial
loops into common SSE patterns. Newer or higher-level instructions may not be
supported by GCC. This is enabled
https://gcc.gnu.org/projects/tree-ssa/vectorization.html[in GCC 4.3 with
specific compiler flags]
* We can consider assembly but we don't officially support it -- We leave the
holes there for people to patch up later. Unofficial ports may come up, and
maybe a few years down the line we can reconsider assembly and start to
reimplement it down the road.
* Find a SIMD library for C/C++ -- Intel's ICC and
https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Vector-Extensions.html[GCC] both
have non-standard extensions to C that roughly translate to these
instructions. There is also the
http://www.pixelglow.com/macstl/valarray/[macstl valarray library] mentioned
earlier. Depending on the library, the extensions can be platform specific.
* Write in a language suitable for auto-vectorization -- Perhaps there exist
some vector-based languages? Fortran might be one, but I don't really know.
Rationale
~~~~~~~~~
I think this is one of those few cases where the design can evolve in a way
that makes this kind of optimization impossible. As long as we keep this
optimization available for the future, we should be good.
Comments
--------
I have to admit that I don't know too much about SSE instructions aside from
the fact that they can operate on 128 bits at once in parallel and that there
are some cache tricks involved when using them (you can move data in from
memory without bringing in the whole cache line). Nonetheless, keeping these
assembly-level instructions in mind will ease optimization of this video
editor. Some of the instructions are high-level enough that they may affect
design decisions. Considering them now, while we are still in the early stages
of development, might prove advantageous. Optimize early? Definitely not.
However, if we don't consider this means of optimization, we may design
ourselves into a situation where this kind of optimization becomes
impossible.
I don't think we should change any major design decisions to allow for
vectorization. At most, we design a utility library that can be easily
optimized using SIMD instructions. Render Nodes and Effects can use this
library. When this library is optimized, then all Render Nodes and Effects
can be optimized as well.
PercivalTiglao:: '2008-08-01T16:12:11Z'
The Lumiera core (backend, proc, gui) doesn't do any number crunching.
Any actual media processing will be delegated to plugins (lib-Gavl, effects, encoders).
I think we don't need any highly assembler/vector optimized code in the core (well, let's see).
These plugins and libraries are somewhat out of our scope, and that's good so: the people
working on them know better than we do how to optimize this stuff.
It might even be worthwhile to test whether, when we leave all vectorization out
of the core, the plugins can use the vector registers better and we gain overall
performance!
ct:: '2008-08-03T02:27:14Z'
Another idea about a probably worthwhile optimization: GCC can instrument
code for profiling, do arc profiling, and then build the code a second time
with feedback from what it learned during the profile runs. This mostly
affects branch prediction and can give a reasonable performance boost. If
someone likes challenges, prepare the build system to do this: +
--
. build it with `-fprofile-arcs`
. profile it by running _carefully selected_ benchmarks and tests.
. rebuild it again this time with `-fbranch-probabilities`
. PROFIT
--
ct:: '2008-08-03T02:27:14Z'
I've discussed general ideas around, and I agree now that ``core Lumiera''
is not the place to think of these kinds of optimizations.
So I'll just move this RfC to dropped.
PercivalTiglao:: '2008-08-04T18:33:58Z'
''''
Back to link:/x/DesignProcess.html[Lumiera Design Process overview]