Merge with x264.git #1

Open
wants to merge 1,730 commits into
base: master

funman
Contributor

@funman funman commented Mar 6, 2017

This merges to the new x264 repo

Changes can be checked with git diff HEAD^

Fiona Glaser and others added 30 commits February 4, 2012 07:18
TBM and BMI1 are supported by Trinity/Piledriver.
The others (and BMI1) will probably appear in Intel's upcoming Haswell.
Also update x86inc with AVX2 stuff.
Broke register preservation in x264_cpu_cpuid and x264_cpu_xgetbv.
Did not cause any problems.
Broke if the first macroblock in the slice exceeded the set slice-max-size.
BGR/BGRA input was correct.
x264_cavlc_init needs to be stack-aligned now.
Some x264 asm assumed that the high 32 bits of registers containing "int" values would be zero.
This is almost always the case, and it seems to work with gcc, but it is *not* guaranteed by the ABI.
As a result, it breaks with some other compilers, like Clang, that take advantage of this in optimizations.
Accordingly, fix all x86 code by using intptr_t instead of int or using movsxd where necessary.
Also add a checkasm hack to detect when assembly functions incorrectly assume that 32-bit integers are zero-extended to 64-bit.
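To illustrate why the distinction matters, here is a minimal C sketch (not taken from x264; both helper names are hypothetical) comparing the value seen when the upper 32 bits of a register are assumed to be zero with the value produced by an explicit sign extension, which is what `movsxd` or an `intptr_t` parameter provides. The two agree for non-negative inputs and diverge for negative ones, such as a negative stride. In the failing case described above, the upper bits may hold arbitrary garbage rather than zero.

```c
#include <stdint.h>

/* Hypothetical helper: the 64-bit value obtained if the upper 32 bits of the
 * register are assumed to be zero (a zero extension of the 32-bit int). */
static inline int64_t offset_zero_extended(int stride)
{
    return (int64_t)(uint64_t)(uint32_t)stride;
}

/* Hypothetical helper: the 64-bit value obtained by explicit sign extension,
 * as movsxd does, or as the compiler does when the parameter is intptr_t. */
static inline int64_t offset_sign_extended(int stride)
{
    return (int64_t)stride;
}
```

With a stride of -16, the first helper yields 4294967280 and the second yields -16, so pointer arithmetic based on the first would land far outside the buffer.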
Not necessary for x264, as -m amd64 already does the right thing, but used by external users of x86inc.
Not necessary with the CAVLC lookup table for zero run codes.
Helps avoid VBV predictors going nuts with very low-cost MBs.
One particular case this fixes is zero-cost MBs: adaptive quantization decreases the QP a lot, but (before this patch), no cost penalty gets factored in for this, because anything times zero is zero.
Required for row re-encoding.
Extremely accurate, possibly 100% so (I can't get it to fail even with difficult VBVs).
Does not yet support rows split on slice boundaries (occurs often with slice-max-size/mbs).
Still inaccurate with sliced threads, but better than before.
Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible.
This means we don't actually have to add new functions to make it work.
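As a rough illustration, the C function below (a hypothetical name, not x264 code) models what `tzcnt` computes for a 32-bit operand. On CPUs without BMI1 the `rep` prefix is ignored and the same bytes execute as `bsf`, which gives the same result for any non-zero input. The two differ only for a zero input, where `tzcnt` returns the operand size and `bsf` leaves the destination undefined.

```c
#include <stdint.h>

/* Model of tzcnt on a 32-bit operand: the number of trailing zero bits,
 * or 32 when the input is zero. bsf matches this for all non-zero inputs. */
static inline int ctz32_model(uint32_t x)
{
    if (x == 0)
        return 32;          /* tzcnt result for zero; bsf is undefined here */
    int n = 0;
    while (!(x & 1)) {      /* shift right until the lowest set bit is at bit 0 */
        x >>= 1;
        n++;
    }
    return n;
}
```

Code that only calls the instruction on non-zero values can therefore use the same encoding on both kinds of CPU.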
Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes).
Won't affect much, since we don't use ALIGN much.
Fully reconstruct frames even without dump-yuv.
Lowers encoding latency around 14% in sliced threads mode with preset superfast.
Additionally, even if there is no waiting time between frames, this improves parallelism, because hpel+deblock are done during the (singlethreaded) lookahead.
For ease of debugging, dump-yuv forces all of the threads to wait and finish instead of setting b_full_recon.
Regression in r2183.
Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower.
Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.
Was using qp instead of qscale; could cause NaNs (not to mention less accurate results).
The code does, in fact, handle CAVLC+8x8dct correctly already.
MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport.
This does not fix x264 cli being unable to use a shared library built by ICL however.
MasterNobody and others added 29 commits January 8, 2014 11:15
It's an old stand-alone application that isn't relevant to x264.
Also update AUTHORS file and my e-mail address in the headers of various files.
We don't need to wastefully allocate quant tables above QP_MAX_SPEC; they're
never used.
Assembly based on code by Henrik Gramner and Loren Merritt.
Work around yasm's inefficiency with handling large numbers of variables
in the global scope.
Android NDK does not expose sched_getaffinity.
Actually allocate less (instead of just initialize less) and fix comments.
Fixes possible corruption with MBAFF+sliced threads.
Fixes an issue with too many forced non-skips in mbaff+cavlc, as well as
non-deterministic output with mbaff+cavlc+sliced-threads.
The full details of the return values of encoder_encode and encoder_headers
were mistakenly removed a while ago; re-add them.
The H.264 spec says it shouldn't be set in these cases.
For when --frame-packing is set.
Makes it easier to detect typos.
Emulation requires a temporary register if arguments 1 and 4 are the same; this
doesn't obey the semantics of the original instruction, so we can't emulate
that in x86inc.

ffmpeg has an x86util emulation for that case; I'll add it if x264's asm ever
needs it.

Also add pmacsdql emulation.
If the stack is known to be at least 32-byte aligned we can safely store ymm
registers on the stack without doing manual alignment.

Change ALLOC_STACK to always align the stack before allocating stack space for
consistency. Previously alignment would occur either before or after allocating
stack space depending on whether manual alignment was required or not.
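The manual alignment mentioned here is typically done by masking the low bits of the stack pointer. The C sketch below shows that arithmetic only; it is not the x86inc macro itself, the function name is hypothetical, and it assumes the alignment is a power of two.

```c
#include <stdint.h>

/* Round an address down to a multiple of `alignment`.
 * Assumption: `alignment` is a power of two (e.g. 32 for ymm stores). */
static inline uintptr_t align_down(uintptr_t addr, uintptr_t alignment)
{
    return addr & ~(alignment - 1);
}
```

For example, `align_down(0x103F, 32)` gives `0x1020`. When the incoming stack is already known to be 32-byte aligned, this step changes nothing and can be skipped.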
Reduce the number of registers used from 7 to 6.
Reduce the number of vector registers used by the AVX2 implementation from 8 to 7.
Multiply fps_factor by 1/256 once per frame instead of once per macroblock row.
Use mova instead of movu for dst since it's guaranteed to be aligned.
Some cosmetics.
About 5.6x faster than C on Haswell.
checkasm --bench on a cortex-a9:
var_8x16_c: 4306
var_8x16_neon: 791
checkasm --bench on a cortex-a9:
var2_8x16_c: 5677
var2_8x16_neon: 1421
4% faster on main/medium, 15% faster on baseline/superfast on a cortex-a9.
Move the second core part of macroblock tree into an assembly function;
SIMD-optimize roughly half of it (for x86). Roughly 25-65% faster mbtree,
depending on content.

Slightly change how mbtree handles the tradeoff between range and precision
for propagation.

Overall a slight (but mostly negligible) effect on SSIM and ~2% faster.
@kierank
Owner

kierank commented Feb 16, 2019

I did this manually to the staging branch.
