Intrinsic functions, SSE, AVX…

Normally, “intrinsics” refers to functions that are built in, i.e. standard library functions for which the compiler can (and often will) generate code inline instead of calling an actual function in the library. For example, a call like memset(array1, 0, 10) could be compiled for x86 as something like:

 mov ecx, 10
 xor eax, eax
 mov edi, offset FLAT:array1
 rep stosb

Intrinsics like this are purely an optimization. “Needing” intrinsics would most likely be a situation where the compiler supports intrinsics that let you generate code that the compiler can’t (or usually won’t) generate directly. For an obvious example, quite a few compilers for x86 have “MMX Intrinsics” that let you use “functions” that are really just direct representations of MMX instructions.

Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in its Pentium III series of central processing units (CPUs), shortly after the appearance of Advanced Micro Devices’ (AMD’s) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.
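
As a minimal sketch of "the same operation on multiple data objects", the function below adds two float arrays four elements at a time with the SSE intrinsic `_mm_add_ps`. The function name `add_floats_sse` and the tail handling are illustrative assumptions, not code from any particular project.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_floats_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    /* Main loop: one ADDPS instruction adds four single-precision floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch works for any pointer; code that controls its own buffer alignment could use `_mm_load_ps` and `_mm_store_ps` instead.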

Intel’s first IA-32 SIMD effort was the MMX instruction set. MMX had two main problems: it re-used the existing x87 floating point registers, making the CPUs unable to work on both floating point and SIMD data at the same time, and it only worked on integers. SSE floating point instructions operate on a new independent register set, the XMM registers, and SSE also adds a few integer instructions that work on MMX registers.

SSE was subsequently expanded by Intel to SSE2, SSE3, SSSE3, and SSE4. Because it supports floating point math, it had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.

SSE originally added eight new 128-bit registers known as XMM0 through XMM7.

SSE used only a single data type for XMM registers:

  • four 32-bit single-precision floating point numbers

SSE2 would later expand the usage of the XMM registers to include:

  • two 64-bit double-precision floating point numbers or
  • two 64-bit integers or
  • four 32-bit integers or
  • eight 16-bit short integers or
  • sixteen 8-bit bytes or characters.
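
To illustrate that the lane width is chosen by the instruction rather than by the register, here is a small sketch that adds the same 16 bytes once as sixteen 8-bit lanes and once as four 32-bit lanes. The helper names `add_as_bytes` and `add_as_dwords` are hypothetical; the intrinsics are standard SSE2 ones from `<emmintrin.h>`.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: add two 16-byte blocks as sixteen 8-bit lanes. */
void add_as_bytes(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PADDB: each byte wraps around independently. */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Hypothetical helper: add the same 16 bytes as four 32-bit lanes. */
void add_as_dwords(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PADDD: carries propagate across the four bytes of each lane. */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

With every byte of `a` set to 0xFF and every byte of `b` set to 0x01, the byte-wise version produces sixteen 0x00 bytes, while the 32-bit version produces the bytes 00 01 01 01 in each lane (x86 is little-endian), because the carry moves from one byte to the next.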

Accelerating Compute-Intensive Workloads with Intel® AVX-256/AVX-512 / …

Intel AVX is a set of instruction set extensions to the x86 instruction set architecture, which enables higher computing performance with more efficient operations and data types. Intel AVX-256 and AVX-512 are part of the Intel Advanced Vector Extensions family, providing wider vector length for more efficient operations and greater parallelism. AVX-256 and AVX-512 allow for faster execution of compute-intensive workloads such as video encoding, image enhancement, 3D modeling, and scientific calculations.

Here are some examples of how Intel® AVX-256 and AVX-512 can be used to improve performance in compute-intensive workloads:

  1. Image Manipulation:

AVX-256 can be used to accelerate the processing of images and other media files, such as scaling and rotating images. Code example:

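// AVX-256 code for image scaling: multiply 16 packed 16-bit samples by zoom_factor
// (256-bit integer operations require AVX2)
__m256i vscale = _mm256_set1_epi16(zoom_factor);                // broadcast the factor to all 16 lanes
__m256i vin = _mm256_load_si256((const __m256i *)&input_data);  // load needs 32-byte alignment
__m256i vresult = _mm256_mullo_epi16(vin, vscale);              // keeps the low 16 bits of each product
_mm256_store_si256((__m256i *)&output_data, vresult);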

  2. Video Encoding/Decoding:

AVX-256 and AVX-512 both offer improved performance for encoding and decoding videos, allowing for faster video streaming. Code example:

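// AVX-512 code for video encode/decode: flag the bytes that are greater than zero
__m512i vdata = _mm512_load_si512(&video_stream);            // load needs 64-byte alignment
__m512i vzero = _mm512_setzero_si512();
__mmask64 vmask = _mm512_cmpgt_epi8_mask(vdata, vzero);      // AVX-512BW: one mask bit per signed byte
__m512i vect = _mm512_movm_epi8(vmask);                      // expand the mask to 0xFF / 0x00 bytes
_mm512_store_si512(&encoded_video_stream, vect);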

  3. Cryptography:

AVX-256 and AVX-512 can speed up the bulk data operations of symmetric key cryptography, such as XORing a key block into the data, which helps when securely transmitting large amounts of data over the internet. Code example:

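// AVX-512 code for symmetric key cryptography: XOR 64 bytes of data with 64 bytes of key
__m512i vkey = _mm512_load_si512(&key_data);       // integer loads; both need 64-byte alignment
__m512i vdata = _mm512_load_si512(&input_data);
__m512i vresult = _mm512_xor_si512(vkey, vdata);   // bitwise XOR of all 512 bits
_mm512_store_si512(&encrypted_data, vresult);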

  4. Scientific Calculations:

AVX-256 and AVX-512 allow developers to efficiently work with large data sets, speeding up computationally intensive operations such as linear algebra and machine learning algorithms. Code example:

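// AVX-512 code for fast linear algebra calculations: eight fused multiply-adds at once
__m512d vzero = _mm512_setzero_pd();                        // accumulator starts at zero
__m512d vdata1 = _mm512_load_pd(&matrix_a);                 // 8 doubles, 64-byte aligned
__m512d vdata2 = _mm512_load_pd(&matrix_b);
__m512d vresult = _mm512_fmadd_pd(vdata1, vdata2, vzero);   // vdata1 * vdata2 + vzero
_mm512_store_pd(&matrix_result, vresult);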
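
The snippet above assumes a CPU with AVX-512. Since not every machine has the wider vector extensions, a common pattern is to detect them at run time and fall back to scalar code. The following is a minimal sketch of that pattern for a 256-bit fused multiply-add, assuming GCC or Clang on x86 (the `target` attribute and `__builtin_cpu_supports` are compiler extensions); the function names are hypothetical.

```c
#include <stddef.h>
#include <immintrin.h>  /* AVX and FMA intrinsics */

/* 256-bit path; the attribute enables AVX and FMA for this function only. */
__attribute__((target("avx,fma")))
static void fma_avx(const double *a, const double *b, const double *c,
                    double *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vc = _mm256_loadu_pd(c + i);
        /* Four a*b+c results per instruction. */
        _mm256_storeu_pd(out + i, _mm256_fmadd_pd(va, vb, vc));
    }
    for (; i < n; ++i)          /* scalar tail */
        out[i] = a[i] * b[i] + c[i];
}

/* Portable fallback for CPUs without AVX or FMA. */
static void fma_scalar(const double *a, const double *b, const double *c,
                       double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}

/* Hypothetical entry point: out[i] = a[i] * b[i] + c[i]. */
void fma_arrays(const double *a, const double *b, const double *c,
                double *out, size_t n)
{
    if (__builtin_cpu_supports("avx") && __builtin_cpu_supports("fma"))
        fma_avx(a, b, c, out, n);
    else
        fma_scalar(a, b, c, out, n);
}
```

A fused multiply-add rounds once instead of twice, so the two paths can differ in the last bit for some inputs; code that needs bit-identical results across machines has to account for that.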

My work at Intopix, the JPG-XS API.

Intopix JPG-XS is an API that allows developers to access the latest technology of Intopix’s JPEG XS codecs. The API provides low latency image and video compression, supporting high quality and efficient encoding of 4K UHD, HDR, and 360 degree videos. It also offers a wide range of features such as error resilience, scalable coding, and clean decoding. JPG-XS also supports accelerated decoding on multiple hardware platforms, including Intel AVX-256 and AVX-512 instruction sets.

This project explores how the wavelet transform can be optimized using the Intel AVX-256, AVX-512 and other AVX-family instruction sets. Wavelet transforms are used in many applications such as image and audio processing, and our goal is to improve their performance by taking advantage of the 256-bit and 512-bit vector lengths offered by Intel AVX technology. We researched which performance optimizations the AVX-256 and AVX-512 instructions make possible and what results they can achieve. We also developed several wavelet transform algorithms optimized for AVX instructions and implemented them in a demonstration program to show their increased performance. We hope that our findings will be useful for anyone interested in optimizing wavelet transform applications with Intel AVX instruction sets.
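
To give a feel for why wavelet transforms vectorize well, here is a generic sketch of the predict step of the reversible 5/3 (LeGall) lifting scheme, a common integer wavelet. It is not the Intopix implementation, which is not shown in this post; it uses 128-bit SSE2 instructions for portability, and the function name and the deinterleaved `even`/`odd` layout are assumptions made for the example. The same loop widens naturally to 256-bit or 512-bit registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: high[i] = odd[i] - floor((even[i] + even[i+1]) / 2).
   even must hold n + 1 samples (the caller appends the boundary sample);
   odd and high hold n samples each. */
void lift53_predict(const int32_t *even, const int32_t *odd,
                    int32_t *high, size_t n)
{
    size_t i = 0;

    /* Four high-pass coefficients per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128i e0 = _mm_loadu_si128((const __m128i *)(even + i));
        __m128i e1 = _mm_loadu_si128((const __m128i *)(even + i + 1));
        __m128i od = _mm_loadu_si128((const __m128i *)(odd + i));
        /* Arithmetic shift right by 1 gives the floored average. */
        __m128i avg = _mm_srai_epi32(_mm_add_epi32(e0, e1), 1);
        _mm_storeu_si128((__m128i *)(high + i), _mm_sub_epi32(od, avg));
    }

    /* Scalar tail; assumes >> on negative values is an arithmetic shift,
       as it is with GCC and Clang. */
    for (; i < n; ++i)
        high[i] = odd[i] - ((even[i] + even[i + 1]) >> 1);
}
```

Every output depends only on a small, fixed neighbourhood of inputs, with no branches and no dependency between neighbouring outputs, which is the access pattern that wide vector registers handle best.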

Final Word

This mission was short but fun, and we were lucky to have an incredible team of highly motivated and talented people working under the Intopix banner.