Some sections of the Lumiera website document meeting minutes, discussion protocols and design proposals from the early days of the project; these pages were initially authored in the »Moin Moin Wiki« operated by Cehteh on pipapo.org at that time; this wiki backed the first publications of the »Cinelerra-3« initiative, which turned into the Lumiera project eventually. Some years later, those pages were transliterated into Asciidoc semi-automatically, resulting in a lot of broken markup and links. This is a long standing maintenance problem problem plaguing the Lumiera website, since those breakages cause a lot of warnings and flood the logs of any linkchecker run.
165 lines
7.1 KiB
Text
165 lines
7.1 KiB
Text
Design Process : Official Assembly Language
|
|
===========================================
|
|
|
|
[options="autowidth"]
|
|
|====================================
|
|
|*State* | _Dropped_
|
|
|*Date* | _2008-08-01_
|
|
|*Proposed by* | PercivalTiglao
|
|
|====================================
|
|
|
|
|
|
Official Assembly Language
|
|
--------------------------
|
|
|
|
I describe here an optimization that might have to be be taken into account at
|
|
the design level. At very least, we should design our code with
|
|
auto-vectorization in mind. At the most, we can choose to manually write parts
|
|
of our code in assembly language and manually vectorize it using x86 SSE
|
|
Instructions or PowerPC AltiVec instructions. By keeping these instructions
|
|
in mind, we can easily achieve a large increase in speed.
|
|
|
|
|
|
Description
|
|
~~~~~~~~~~~
|
|
|
|
While the C / C++ core should be designed efficiently and as portable as
|
|
possible, nominating an official assembly language or an official platform can
|
|
create new routes for optimization. For example, the x86 SSE instruction set
|
|
can add / subtract 16 bytes in parallel (interpreted as 8-bit, 16-bit, 32-bit,
|
|
or 64-bit integers, or 32-bit/64-bit floats), with some instructions supporting
|
|
masks, blending, dot products, and other various instructions specifically
|
|
designed for media processing. While the specific assembly level optimizations
|
|
should be ignored for now, structuring our code in such a way to encourage a
|
|
style of programming suitable for SSE Optimization would make Lumiera
|
|
significantly faster in the long run. At very least, we should structure our
|
|
innermost loop in such a way that it is suitable for gcc's auto-vectorization.
|
|
|
|
The problem is that we will be splitting up our code. Bugs may appear on some
|
|
platforms where assembly-specific commands are, or perhaps the C/C++ code would
|
|
have bugs that the assembly code does not. We will be maintaining one more
|
|
codebase for the same set of code. Remember though, we don't have to do
|
|
assembly language now, we just leave enough room in the design to add
|
|
assembly-level libraries somewhere in our code.
|
|
|
|
|
|
Tasks
|
|
^^^^^
|
|
* Choose an ``Official'' assembly language / platform.
|
|
* Review the SIMD instructions avaliable for that assembly language.
|
|
** For example, the Pentium 2 supports MMX instructions.
|
|
** Pentium 3 supports MMX and SSE Instructions.
|
|
** Early Pentium4s support MMX, SSE, and SSE2 instructions.
|
|
** Core Duo supports upto SSE4 instructions.
|
|
** AMD announced SSE5 instructions to come in 2009.
|
|
* Consider SIMD instructions while designing the Render Nodes and Effects
|
|
architecture.
|
|
* Write the whole application in C/C++ / Lua while leaving sections to optimize
|
|
in assembly later. (Probably simple tasks or a library written in C)
|
|
* Rewrite these sections in Assembly using only instructions we agreed upon.
|
|
|
|
|
|
Pros
|
|
^^^^
|
|
Assuming we go all the way with an official assembly language / platform...
|
|
|
|
* Significantly faster render and previews. (Even when using a high-level
|
|
library like http://www.pixelglow.com/macstl/valarray/[macstl valarray], we
|
|
can get 3.6x -- 16.2x the speed in our inner loop. We can probably expect
|
|
greater if we hand-optimize the assembly)
|
|
|
|
|
|
Cons
|
|
^^^^
|
|
* Earlier architectures of that family will be significantly slower or unsupported
|
|
* Other architectures will rely on C / C++ port instead of optimized assembly
|
|
* Redundant Code
|
|
|
|
|
|
Alternatives
|
|
^^^^^^^^^^^^
|
|
|
|
* We only consider auto-vectorization -- GCC is attempting to convert trivial
|
|
loops into common SSE patterns. Newer or Higher level instructions may not be
|
|
supported by GCC. This is turned on
|
|
https://gcc.gnu.org/projects/tree-ssa/vectorization.html[in GCC4.3 with
|
|
specific compiler flags]
|
|
* We can consider assembly but we don't officially support it -- We leave the
|
|
holes there for people to patch up later. Unofficial ports may come up, and
|
|
maybe a few years down the line we can reconsider assembly and start to
|
|
reimplement it down the road.
|
|
* Find a SIMD library for C/C++ -- Intel's ICC and
|
|
https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Vector-Extensions.html[GCC] both
|
|
have non-standard extensions to C that roughly translate to these
|
|
instructions. There is also the
|
|
http://www.pixelglow.com/macstl/valarray/[macstl valarray library] mentioned
|
|
earlier. Depending on the library, the extensions can be platform specific.
|
|
* Write in a language suitable for auto-vectorization -- Maybe there exists
|
|
some vector-based languages? Fortran might be one, but I don't really know.
|
|
|
|
|
|
Rationale
|
|
~~~~~~~~~
|
|
|
|
I think this is one of those few cases where the design can evolve in a way
|
|
that makes this kind of optimization impossible. As long as we try to keep this
|
|
optimization avaliable in the future, then we should be good.
|
|
|
|
|
|
Comments
|
|
--------
|
|
|
|
I have to admit that I don't know too much about SSE instructions aside from
|
|
the fact that they can operate on 128-bits at once in parallel and there are
|
|
some cache tricks involved when using them. (you can move data in from memory
|
|
without bringing in the whole cache line). Nonetheless, keeping these
|
|
assembly level instructions in mind will ease optimization of this Video
|
|
Editor. Some of the instructions are high-level enough that they may effect
|
|
design decisions. Considering them now while we are still in early stages of
|
|
development might prove to be advantageous. Optimize early? Definitely not.
|
|
However, if we don't consider this means of optimization, we may design
|
|
ourselves into a situation where this kind of optimization becomes
|
|
impossible.
|
|
|
|
I don't think we should change any major design decisions to allow for
|
|
vectorization. At most, we design a utility library that can be easily
|
|
optimized using SIMD instructions. Render Nodes and Effects can use this
|
|
library. When this library is optimized, then all Render Nodes and Effects
|
|
can be optimized as well.
|
|
|
|
PercivalTiglao:: '2008-08-01T16:12:11Z'
|
|
|
|
The Lumiera core (backend, proc, gui) doesn't do any number crunching.
|
|
Any actual media processing will be delegated to plugins (lib-Gavl, effects, encoders).
|
|
I think we don't need any highly assembler/vector optimized code in the core (well, lets see).
|
|
This plugins and libraries are somewhat out of our scope and that's good so, the people working
|
|
on it know better than we how to optimize this stuff.
|
|
|
|
It might be even worthwhile to try if when we leave all vectorization out,
|
|
if then the plugins can use the vector registers better and we gain overall
|
|
performance!
|
|
|
|
ct:: '2008-08-03T02:27:14Z'
|
|
|
|
Another idea about a probably worthwhile optimization: GCC can instrument
|
|
code for profiling and then do arc profiling and build it a second time
|
|
with feedback what it learned from the profile runs, this mostly affects
|
|
branch prediction and can give a reasonable performance boost. If someone
|
|
likes challenges, prepare the build system to do this: +
|
|
--
|
|
. build it with `-fprofile-arcs`
|
|
. profile it by running _carefully selected_ benchmarks and tests.
|
|
. rebuild it again this time with `-fbranch-probabilities`
|
|
. PROFIT
|
|
--
|
|
|
|
ct:: '2008-08-03T02:27:14Z'
|
|
|
|
I've discussed general ideas around, and I agree now that ``core Lumiera''
|
|
is not the place to think of these kinds of optimizations.
|
|
So I'll just move this RfC to dropped.
|
|
|
|
PercivalTiglao:: '2008-08-04T18:33:58Z'
|
|
|
|
''''
|
|
Back to link:/x/DesignProcess.html[Lumiera Design Process overview]
|