- use HTTPS - avoid redirects - supply Archive.org snapshots for old resources
161 lines
7.2 KiB
Text
161 lines
7.2 KiB
Text
Design Process : Official Assembly Language
|
|
===========================================
|
|
|
|
[grid="all"]
|
|
`------------`-----------------------
|
|
*State* _Dropped_
|
|
*Date* _2008-08-01_
|
|
*Proposed by* link:PercivalTiglao[]
|
|
-------------------------------------
|
|
|
|
|
|
Official Assembly Language
|
|
--------------------------
|
|
|
|
I describe here an optimization that might have to be be taken into account at
|
|
the design level. At very least, we should design our code with
|
|
auto-vectorization in mind. At the most, we can choose to manually write parts
|
|
of our code in assembly language and manually vectorize it using x86 SSE
|
|
Instructions or !PowerPC !AltiVec instructions. By keeping these instructions
|
|
in mind, we can easily achieve a large increase in speed.
|
|
|
|
|
|
Description
|
|
~~~~~~~~~~~
|
|
|
|
While the C / C++ core should be designed efficiently and as portable as
|
|
possible, nominating an official assembly language or an official platform can
|
|
create new routes for optimization. For example, the x86 SSE instruction set
|
|
can add / subtract 16 bytes in parallel (interpreted as 8-bit, 16-bit, 32-bit,
|
|
or 64-bit integers, or 32-bit/64-bit floats), with some instructions supporting
|
|
masks, blending, dot products, and other various instructions specifically
|
|
designed for media processing. While the specific assembly level optimizations
|
|
should be ignored for now, structuring our code in such a way to encourage a
|
|
style of programming suitable for SSE Optimization would make Lumiera
|
|
significantly faster in the long run. At very least, we should structure our
|
|
innermost loop in such a way that it is suitable for gcc's auto-vectorization.
|
|
|
|
The problem is that we will be splitting up our code. Bugs may appear on some
|
|
platforms where assembly-specific commands are, or perhaps the C/C++ code would
|
|
have bugs that the assembly code does not. We will be maintaining one more
|
|
codebase for the same set of code. Remember though, we don't have to do
|
|
assembly language now, we just leave enough room in the design to add
|
|
assembly-level libraries somewhere in our code.
|
|
|
|
|
|
Tasks
|
|
~~~~~
|
|
|
|
* Choose an "Official" assembly language / platform.
|
|
* Review the SIMD instructions avaliable for that assembly language.
|
|
* For example, the Pentium 2 supports MMX instructions. Pentium 3 supports MMX
|
|
and SSE Instructions. Early Pentium4s support MMX, SSE, and SSE2
|
|
instructions. Core Duo supports upto SSE4 instructions. AMD announced SSE5
|
|
instructions to come in 2009.
|
|
* Consider SIMD instructions while designing the Render Nodes and Effects
|
|
architecture.
|
|
* Write the whole application in C/C++ / Lua while leaving sections to optimize
|
|
in assembly later. (Probably simple tasks or a library written in C)
|
|
* Rewrite these sections in Assembly using only instructions we agreed upon.
|
|
|
|
|
|
Pros
|
|
~~~~
|
|
|
|
Assuming we go all the way with an official assembly language / platform...
|
|
|
|
* Significantly faster render and previews. (Even when using a high-level
|
|
library like http://www.pixelglow.com/macstl/valarray/[macstl valarray], we
|
|
can get 3.6x -- 16.2x the speed in our inner loop. We can probably expect
|
|
greater if we hand-optimize the assembly)
|
|
|
|
|
|
Cons
|
|
~~~~
|
|
|
|
* Earlier architectures of that family will be significantly slower or
|
|
unsupported
|
|
* Other architectures will rely on C / C++ port instead of optimized assembly
|
|
* Redundant Code
|
|
|
|
|
|
Alternatives
|
|
^^^^^^^^^^^^
|
|
|
|
* We only consider auto-vectorization -- GCC is attempting to convert trivial
|
|
loops into common SSE patterns. Newer or Higher level instructions may not be
|
|
supported by GCC. This is turned on
|
|
https://gcc.gnu.org/projects/tree-ssa/vectorization.html[in GCC4.3 with
|
|
specific compiler flags]
|
|
* We can consider assembly but we don't officially support it -- We leave the
|
|
holes there for people to patch up later. Unofficial ports may come up, and
|
|
maybe a few years down the line we can reconsider assembly and start to
|
|
reimplement it down the road.
|
|
* Find a SIMD library for C/C++ -- Intel's ICC and
|
|
https://gcc.gnu.org/onlinedocs/gcc-3.4.6/gcc/Vector-Extensions.html[GCC] both
|
|
have non-standard extensions to C that roughly translate to these
|
|
instructions. There is also the
|
|
http://www.pixelglow.com/macstl/valarray/[macstl valarray library] mentioned
|
|
earlier. Depending on the library, the extensions can be platform specific.
|
|
* Write in a language suitable for auto-vectorization -- Maybe there exists
|
|
some vector-based languages? Fortran might be one, but I don't really know.
|
|
|
|
|
|
Rationale
|
|
~~~~~~~~~
|
|
|
|
I think this is one of those few cases where the design can evolve in a way
|
|
that makes this kind of optimization impossible. As long as we try to keep this
|
|
optimization avaliable in the future, then we should be good.
|
|
|
|
|
|
Comments
|
|
--------
|
|
|
|
* I have to admit that I don't know too much about SSE instructions aside from
|
|
the fact that they can operate on 128-bits at once in parallel and there are
|
|
some cache tricks involved when using them. (you can move data in from memory
|
|
without bringing in the whole cache line). Nonetheless, keeping these
|
|
assembly level instructions in mind will ease optimization of this Video
|
|
Editor. Some of the instructions are high-level enough that they may effect
|
|
design decisions. Considering them now while we are still in early stages of
|
|
development might prove to be advantagous. Optimize early? Definitely not.
|
|
However, if we don't consider this means of optimization, we may design
|
|
ourselves into a situation where this kind of optimization becomes
|
|
impossible.
|
|
|
|
* I don't think we should change any major design decisions to allow for
|
|
vectorization. At most, we design a utility library that can be easily
|
|
optimized using SIMD instructions. Render Nodes and Effects can use this
|
|
library. When this library is optimized, then all Render Nodes and Effects
|
|
can be optimized as well. -- link:PercivalTiglao[]
|
|
[[DateTime(2008-08-01T16:12:11Z)]]
|
|
|
|
* Uhm, the Lumiera core (backend, proc, gui) doesn't do any numbercrunching.
|
|
This is all delegated to plugins (libgavl, effects, encoders). I think we
|
|
don't need any highly assembler/vector optimized code in the core (well, lets
|
|
see). This plugins and libraries are somewhat out of our scope and thats good
|
|
so, the people working on it know better than we how to optimize this stuff.
|
|
It might be even worthwile to try if when we leave all vectorization out, if
|
|
then the plugins can use the vector registers better and we gain overall
|
|
performance!
|
|
-- link:ct[] [[DateTime(2008-08-03T02:27:14Z)]]
|
|
|
|
* Another idea about a probably worthwhile optimization: gcc can instumentate
|
|
code for profileing and then do arc profileing and build it a second time
|
|
with feedback what it learnd from the profile runs, this mostly affects
|
|
branch prediction and can give a reasonable performance boost. If somone
|
|
likes challenges, prepare the build system to do this:
|
|
. build it with -fprofile-arcs
|
|
. profile it by running ''carefully'' selected benchmarks and tests.
|
|
. rebuild it again this time with -fbranch-probabilities
|
|
. PROFIT
|
|
-- link:ct[] [[DateTime(2008-08-03T02:27:14Z)]]
|
|
|
|
* I've discussed general ideas around, and I agree now that "core Lumiera" is
|
|
not the place to think of these kinds of optimizations. So I'll just move
|
|
this over to dropped. -- link:PercivalTiglao[]
|
|
[[DateTime(2008-08-04T18:33:58Z)]]
|
|
|
|
''''
|
|
Back to link:/documentation/devel/rfc.html[Lumiera Design Process overview]
|