Merge with x264.git #1

funman · 2017-03-06T11:05:18Z

This merges to the new x264 repo

Changes can be checked with git diff HEAD^

Also remove unused AVX cruft.

TBM and BMI1 are supported by Trinity/Piledriver. The others (and BMI1) will probably appear in Intel's upcoming Haswell. Also update x86inc with AVX2 stuff.

Broke register preservation in x264_cpu_cpuid and x264_cpu_xgetbv. Did not cause any problems.

Broke if the first macroblock in the slice exceeded the set slice-max-size.

BGR/BGRA input was correct.

x264_cavlc_init needs to be stack-aligned now.

Some x264 asm assumed that the high 32 bits of registers containing "int" values would be zero. This is almost always the case, and it seems to work with gcc, but it is *not* guaranteed by the ABI. As a result, it breaks with some other compilers, like Clang, that take advantage of this in optimizations. Accordingly, fix all x86 code by using intptr_t instead of int or using movsxd where necessary. Also add a checkasm hack to detect when assembly functions incorrectly assume that 32-bit integers are zero-extended to 64-bit.
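A minimal C sketch of why the assumption is unsafe, using a negative stride as the example. The function names are hypothetical and are not part of x264. It only contrasts explicit sign extension (what a cast to intptr_t or a movsxd does) with the "upper bits are zero" assumption; in real asm the upper bits may hold arbitrary garbage rather than zero.

```c
#include <stdint.h>

/* Explicit widening: a cast from int to intptr_t sign-extends,
 * which is what movsxd does in assembly. */
static intptr_t widen_stride(int stride)
{
    return (intptr_t)stride;
}

/* What the old asm effectively assumed: the 64-bit register holds
 * the 32-bit value with the upper half cleared. */
static uint64_t assume_zero_extended(int stride)
{
    return (uint64_t)(uint32_t)stride;
}
```

For non-negative values both functions agree, which is why the bug stayed hidden for so long. For a stride of -16 the first returns -16 while the second returns 0xFFFFFFF0, a large positive offset.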

Not necessary for x264, as -m amd64 already does the right thing, but used by external users of x86inc.

Not necessary with the CAVLC lookup table for zero run codes.

Helps avoid VBV predictors going nuts with very low-cost MBs. One particular case this fixes is zero-cost MBs: adaptive quantization decreases the QP a lot, but (before this patch), no cost penalty gets factored in for this, because anything times zero is zero.

Required for row re-encoding.

Extremely accurate, possibly 100% so (I can't get it to fail even with difficult VBVs). Does not yet support rows split on slice boundaries (occurs often with slice-max-size/mbs). Still inaccurate with sliced threads, but better than before.

Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible. This means we don't actually have to add new functions to make it work.
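For reference, a portable C sketch of the operation both instructions compute. The function name is hypothetical. The zero-input case is where the two differ: tzcnt returns the operand size, while bsf leaves its destination undefined, so code that must also run on CPUs that decode the instruction as bsf should only rely on the result for nonzero input.

```c
#include <stdint.h>

/* Reference trailing-zero count for a 32-bit value.
 * Returns 32 for x == 0, matching tzcnt semantics; bsf gives
 * no defined result for that input. */
static int ctz32_ref(uint32_t x)
{
    int n = 0;
    if (x == 0)
        return 32;
    while (!(x & 1)) {
        x >>= 1;
        n++;
    }
    return n;
}
```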

Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes). Won't affect much, since we don't use ALIGN much.

Fully reconstruct frames even without dump-yuv.

Lowers encoding latency around 14% in sliced threads mode with preset superfast. Additionally, even if there is no waiting time between frames, this improves parallelism, because hpel+deblock are done during the (singlethreaded) lookahead. For ease of debugging, dump-yuv forces all of the threads to wait and finish instead of setting b_full_recon.

Regression in r2183. Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower. Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.

Was using qp instead of qscale; could cause NaNs (not to mention less accurate results).

The code does, in fact, handle CAVLC+8x8dct correctly already.

MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport. This does not fix x264 cli being unable to use a shared library built by ICL however.

It's an old stand-alone application that isn't relevant to x264.

Also update AUTHORS file and my e-mail address in the headers of various files.

We don't need to wastefully allocate quant tables above QP_MAX_SPEC; they're never used.

Assembly based on code by Henrik Gramner and Loren Merritt.

Work around yasm's inefficiency with handling large numbers of variables in the global scope.

Android NDK does not expose sched_getaffinity.

Actually allocate less (instead of just initialize less) and fix comments.

Probably a regression in r2178.

Fixes possible corruption with MBAFF+sliced threads.

Fixes an issue with too many forced non-skips in mbaff+cavlc, as well as non-deterministic output with mbaff+cavlc+sliced-threads.

The full details of the return values of encoder_encode and encoder_headers were mistakenly removed a while ago; re-add them.

The H.264 spec says it shouldn't be set in these cases.

For when --frame-packing is set.

Makes it easier to detect typos.

Emulation requires a temporary register if arguments 1 and 4 are the same; this doesn't obey the semantics of the original instruction, so we can't emulate that in x86inc. ffmpeg has an x86util emulation for that case; I'll add it if x264's asm ever needs it. Also add pmacsdql emulation.

If the stack is known to be at least 32-byte aligned we can safely store ymm registers on the stack without doing manual alignment. Change ALLOC_STACK to always align the stack before allocating stack space for consistency. Previously alignment would occur either before or after allocating stack space depending on whether manual alignment was required or not.
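The "align first, then allocate" order can be sketched as plain address arithmetic in C. This is only an illustration with hypothetical helper names. The real ALLOC_STACK macro in x86inc also has to keep the original stack pointer for restoring, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr down to a multiple of align (must be a power of two). */
static uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    assert(align && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Align the stack pointer first, then reserve space below it.
 * The size is padded up to a multiple of align so the resulting
 * pointer is still aligned (the stack grows downward). */
static uintptr_t alloc_stack(uintptr_t sp, uintptr_t size, uintptr_t align)
{
    uintptr_t padded = (size + align - 1) & ~(align - 1);
    return align_down(sp, align) - padded;
}
```

With sp = 0x1003F, size = 40 and align = 32, the pointer is first aligned to 0x10020 and then moved down by 64 bytes to 0xFFE0, which is still 32-byte aligned and therefore safe for ymm stores.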

Reduce the number of registers used from 7 to 6. Reduce the number of vector registers used by the AVX2 implementation from 8 to 7. Multiply fps_factor by 1/256 once per frame instead of once per macroblock row. Use mova instead of movu for dst since it's guaranteed to be aligned. Some cosmetics.

About 5.6x faster than C on Haswell.

checkasm --bench on a cortex-a9: var_8x16_c: 4306 var_8x16_neon: 791

checkasm --bench on a cortex-a9: var2_8x16_c: 5677 var2_8x16_neon: 1421

4% faster on main/medium, 15% faster on baseline/superfast on a cortex-a9.

Move the second core part of macroblock tree into an assembly function; SIMD-optimize roughly half of it (for x86). Roughly 25-65% faster mbtree, depending on content. Slightly change how mbtree handles the tradeoff between range and precision for propagation. Overall a slight (but mostly negligible) effect on SSIM and ~2% faster.

kierank · 2019-02-16T11:10:40Z

I did this manually to staging branch

Fiona Glaser and others added 30 commits February 4, 2012 07:18

Clean up and optimize weightp, plus enable SSSE3 weight on SB/BDZ

6d7c5ef

Also remove unused AVX cruft.

Minor asm optimizations/cleanup

04c3819

x86inc: add TAIL_CALL macro to abstract a common asm idiom

e0581e0

TBM, AVX2, FMA3, BMI1, and BMI2 CPU detection support

ae289e6

TBM and BMI1 are supported by Trinity/Piledriver. The others (and BMI1) will probably appear in Intel's upcoming Haswell. Also update x86inc with AVX2 stuff.

Fix regression in r2141

a37a424

Broke register preservation in x264_cpu_cpuid and x264_cpu_xgetbv. Did not cause any problems.

Fix interlaced + extremal slice-max-size

282c3cf

Broke if the first macroblock in the slice exceeded the set slice-max-size.

Fix RGB colorspace input

0fc5acc

BGR/BGRA input was correct.

Add error handling for out-of-tree build

10e1ba5

ICL: fix out of tree building and resource file usage on Windows

38a26cd

Fix rare overflow in 10-bit intra_satd_x3_16x16 asm

0a36950

Fix possible alignment crash when linking from MSVC

d52d0b1

x264_cavlc_init needs to be stack-aligned now.

x86inc: support yasm -f win64

3a5f2fe

Not necessary for x264, as -m amd64 already does the right thing, but used by external users of x86inc.

Export PSNR/SSIM in x264 API

9da19fb

Remove explicit run calculation from coeff_level_run

1b31a10

Not necessary with the CAVLC lookup table for zero run codes.

Abstract bitstream backup/restore functions

bc473dd

Required for row re-encoding.

Minor asm changes

92b0bd9

BMI1 decimate functions

42db5e6

Intel was nice enough to make tzcnt equal to "rep bsf", which is backwards-compatible. This means we don't actually have to add new functions to make it work.

x86inc: switch to amdnops

5b2c62a

Recent AMD CPUs' instruction decoders choke horribly on extremely long nops (i.e. with 4 prefixes). Won't affect much, since we don't use ALIGN much.

Add full-recon API option

90408ec

Fully reconstruct frames even without dump-yuv.

Fix clobbering of mutex/cvs

e046ba7

Regression in r2183. Bizarrely seemed to work on many platforms, but crashed on win64 and may have been slower. Only affected sliced threads during encoding, but could cause crashes on x264 encoder close even without sliced threads.

Fix sliced-threads ratecontrol bug

bca4127

Was using qp instead of qscale; could cause NaNs (not to mention less accurate results).

Fix comment in deblock.c

065fec2

The code does, in fact, handle CAVLC+8x8dct correctly already.

Fix frame input colorspace check

fff12b1

Fix intra-refresh + hrd

52f7a14

ICL/MSVS: Fix shared library generation and usage

70877e3

MSVS requires exported variables to be declared with the DATA keyword, and requires that imported variables be declared with dllimport. This does not fix x264 cli being unable to use a shared library built by ICL however.

configure: correct use of RC variable and add --extra-rcflags

62d7007

MasterNobody and others added 29 commits January 8, 2014 11:15

Use 8x16c wrappers with x86 asm functions for 4:2:2 with high bit depth

7664014

Remove tools/xyuv.c

02697d5

It's an old stand-alone application that isn't relevant to x264.

Bump dates to 2014

807aeaa

Also update AUTHORS file and my e-mail address in the headers of various files.

Avoid some unnecessary memory loads in macroblock_encode

8be6600

Fix quantization factor allocation

e2a9662

We don't need to wastefully allocate quant tables above QP_MAX_SPEC; they're never used.

v210 input support

41227fa

Assembly based on code by Henrik Gramner and Loren Merritt.

Add support for AVC-Intra Class 200

dd6a303

x86inc: speed up compilation with yasm

42d2519

Work around yasm's inefficiency with handling large numbers of variables in the global scope.

Fix build with Android NDK

0d668be

Android NDK does not expose sched_getaffinity.

Really fix quantization factor allocation

ee8d5e4

Actually allocate less (instead of just initialize less) and fix comments.

Fix checkasm --bench output when nop_cycles is too large

48dbfa2

Fix corruption with CAVLC overflow handling in MBAFF+main profile

19dddbc

Probably a regression in r2178.

Fix memory overwrite in x264_deblock_h_chroma_mbaff_sse2

850c8c5

Fixes possible corruption with MBAFF+sliced threads.

mbaff: fix mb_field_decoding_flag tracking and simplify allow skip check

8b821ec

Fixes an issue with too many forced non-skips in mbaff+cavlc, as well as non-deterministic output with mbaff+cavlc+sliced-threads.

Fix pointer cast warning for 64-bit builds

de01d88

x264.h: fix documentation

b7a50c1

The full details of the return values of encoder_encode and encoder_headers were mistakenly removed a while ago; re-add them.

Don't set chroma_loc_info_present_flag for non-4:2:0

f35e3fc

The H.264 spec says it shouldn't be set in these cases.

Write 3D metadata when outputting Matroska

0bb3b2e

For when --frame-packing is set.

x86: Pass -Worphan-labels to yasm

8596dd3

Makes it easier to detect typos.

x86inc: free up variable name "n" in global namespace

974f2e7

x86: SSE2 and SSSE3 plane_copy_deinterleave_rgb

a90ea34

About 5.6x faster than C on Haswell.

arm: implement x264_pixel_var_8x16_neon

6683612

checkasm --bench on a cortex-a9: var_8x16_c: 4306 var_8x16_neon: 791

arm: implement x264_pixel_var2_8x16_neon

ac8f2e8

checkasm --bench on a cortex-a9: var2_8x16_c: 5677 var2_8x16_neon: 1421

arm: use available neon functions for intra_sa8d/sad/satd_x3

00a00cc

4% faster on main/medium, 15% faster on baseline/superfast on a cortex-a9.

Merge commit 'b3fb718404d6cce9c82987ea2909cda5072d040c'

01f973d
