{"id":785,"date":"2022-12-19T12:56:00","date_gmt":"2022-12-19T12:56:00","guid":{"rendered":"http:\/\/imalogic.com\/blog\/?p=785"},"modified":"2022-12-19T19:55:45","modified_gmt":"2022-12-19T19:55:45","slug":"intrinsic-functions-sse-avx","status":"publish","type":"post","link":"https:\/\/imalogic.com\/blog\/2022\/12\/19\/intrinsic-functions-sse-avx\/","title":{"rendered":"Intrinsic functions, SSE, AVX&#8230;"},"content":{"rendered":"<body>\n<p>Normally, \u201cintrinsics\u201d refers to functions that are built-in \u2014 i.e. most standard library functions that the compiler can\/will generate inline instead of calling an actual function in the library. For example, a call like:\u00a0<code>memset(array1, 0, 10)<\/code>\u00a0could be compiled for an x86 as something like:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> mov ecx, 10\n xor eax, eax\n mov edi, offset FLAT:array1\n rep stosb<\/code><\/pre>\n\n\n\n<p>Intrinsics like this are purely an optimization. \u201cNeeding\u201d intrinsics would most likely be a situation where the compiler supports intrinsics that let you generate code that the compiler can\u2019t (or usually won\u2019t) generate directly. 
For an obvious example, quite a few compilers for x86 have \u201cMMX Intrinsics\u201d that let you use \u201cfunctions\u201d that are really just direct representations of MMX instructions.<\/p>\n\n\n\n<p><strong>Streaming SIMD Extensions<\/strong>\u00a0(<strong>SSE<\/strong>) is a single instruction, multiple data (<a href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\">SIMD<\/a>)\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Instruction_set\">instruction set<\/a>\u00a0extension to the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/X86\">x86<\/a>\u00a0architecture, designed by\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Intel\">Intel<\/a>\u00a0and introduced in 1999 in their\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Pentium_III\">Pentium III<\/a>\u00a0series of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Central_processing_unit\">Central processing units<\/a>\u00a0(CPUs) shortly after the appearance of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Advanced_Micro_Devices\">Advanced Micro Devices<\/a>\u00a0(AMD\u2019s)\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/3DNow!\">3DNow!<\/a>. SSE contains 70 new instructions, most of which work on\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Single_precision\">single precision<\/a>\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Floating_point\">floating point<\/a>\u00a0data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Digital_signal_processing\">digital signal processing<\/a>\u00a0and\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Graphics_processing\">graphics processing<\/a>.<\/p>\n\n\n\n<p>Intel\u2019s first\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/IA-32\">IA-32<\/a>\u00a0SIMD effort was the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/MMX_(instruction_set)\">MMX<\/a>\u00a0instruction set. 
MMX had two main problems: it re-used the existing\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/X87\">x87<\/a>\u00a0floating point registers, making the CPUs unable to work on both floating point and SIMD data at the same time, and it only worked on\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Integers\">integers<\/a>. SSE floating point instructions operate on a new independent register set, the XMM registers, and SSE also adds a few integer instructions that work on MMX registers.<\/p>\n\n\n\n<p>SSE was subsequently expanded by Intel to\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/SSE2\">SSE2<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/SSE3\">SSE3<\/a>,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/SSSE3\">SSSE3<\/a>, and\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/SSE4\">SSE4<\/a>. Because SSE supports floating point math, it had wider applications than MMX and became more popular. The addition of integer support in SSE2 made MMX largely\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Redundant_code\">redundant<\/a>, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">SSE originally\u2026<\/h2>\n\n\n\n<p>SSE originally added eight new 128-bit registers known as\u00a0<code>XMM0<\/code>\u00a0through\u00a0<code>XMM7<\/code>. 
<\/p>\n\n\n\n<p>SSE used only a single data type for XMM registers:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>four 32-bit\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Single-precision\">single-precision<\/a>\u00a0floating point numbers<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/SSE2\">SSE2<\/a>\u00a0would later expand the usage of the XMM registers to include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>two 64-bit\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Double-precision\">double-precision<\/a>\u00a0floating point numbers or<\/li>\n\n\n\n<li>two 64-bit integers or<\/li>\n\n\n\n<li>four 32-bit integers or<\/li>\n\n\n\n<li>eight 16-bit short integers or<\/li>\n\n\n\n<li>sixteen 8-bit bytes or characters.<\/li>\n<\/ul>\n\n\n\n<h1 class=\"wp-block-heading\">Accelerating Compute-Intensive Workloads with Intel\u00ae AVX-256\/AVX-512 \/ \u2026<\/h1>\n\n\n\n<p>Intel AVX is a set of extensions to the x86 instruction set architecture that enables higher computing performance with more efficient operations and data types. Intel AVX-256 (the 256-bit AVX and AVX2 extensions) and AVX-512 are part of the Intel Advanced Vector Extensions family, providing wider vector lengths for more efficient operations and greater parallelism. AVX-256 and AVX-512 allow for faster execution of compute-intensive workloads such as video encoding, image enhancement, 3D modeling, and scientific calculations.<\/p>\n\n\n\n<p>Here are some examples of how Intel\u00ae AVX-256 and AVX-512 can be used to improve performance in compute-intensive workloads:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Image Manipulation:<\/strong><\/li>\n<\/ol>\n\n\n\n<p>AVX-256 can be used to accelerate the processing of images and other media files, such as scaling and rotating images. 
Code example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ AVX-256 (AVX2) code for image scaling<br>\/\/ Multiplies sixteen 16-bit samples by zoom_factor, keeping the low 16 bits of each product<br>__m256i vscale = _mm256_set_epi16(zoom_factor, zoom_factor, \u2026);<br>__m256i vin = _mm256_load_si256(&amp;input_data);<br>__m256i vresult = _mm256_mullo_epi16(vin, vscale);<br>_mm256_store_si256(&amp;output_data, vresult);<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li><strong>Video Encoding\/Decoding:<\/strong><\/li>\n<\/ol>\n\n\n\n<p>AVX-256 and AVX-512 both offer improved performance for encoding and decoding videos, allowing for faster video streaming. Code example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ AVX-512 code for video encode\/decode<br>__m512i vect;<br>__m512i vdata= _mm512_load_si512(&amp;video_stream);<br>__m512i vzero= _mm512_set_epi32(0, 0, 0, 0, 0, 0, \u2026);<br>\/\/ AVX-512BW: the byte compare returns a 64-bit mask, expanded here back into 0xFF\/0x00 bytes<br>vect = _mm512_movm_epi8(_mm512_cmpgt_epi8_mask(vdata, vzero));<br>_mm512_store_si512(&amp;encoded_video_stream, vect);<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li><strong>Cryptography:<\/strong><\/li>\n<\/ol>\n\n\n\n<p>AVX-256 and AVX-512 enable faster symmetric key cryptography methods, making it easier to securely transmit data over the internet. Code example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ AVX-512 code for symmetric key cryptography (XOR of 64 data bytes with 64 key bytes)<br>__m512i vkey = _mm512_load_si512(&amp;key_data);<br>__m512i vdata = _mm512_load_si512(&amp;input_data);<br>__m512i vresult = _mm512_xor_si512(vkey, vdata);<br>_mm512_store_si512(&amp;encrypted_data, vresult);<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\">\n<li><strong>Scientific Calculations:<\/strong><\/li>\n<\/ol>\n\n\n\n<p>AVX-256 and AVX-512 allow developers to efficiently work with large data sets, speeding up computationally intensive operations such as linear algebra and machine learning algorithms. 
Code example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\/\/ AVX-512 code for fast linear algebra calculations<br>\/\/ Fused multiply-add on eight doubles: vresult = vdata1 * vdata2 + vzero<br>__m512d vzero = _mm512_setzero_pd();<br>__m512d vdata1 = _mm512_load_pd(&amp;matrix_a);<br>__m512d vdata2 = _mm512_load_pd(&amp;matrix_b);<br>__m512d vresult = _mm512_fmadd_pd(vdata1, vdata2, vzero);<br>_mm512_store_pd(&amp;matrix_result, vresult);<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">My work at Intopix, the JPG-XS API.<\/h2>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignleft size-full\"><a href=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2018\/09\/intoPIX_logo_198.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"283\" height=\"198\" data-attachment-id=\"651\" data-permalink=\"https:\/\/imalogic.com\/blog\/references\/intopix_logo_198\/\" data-orig-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2018\/09\/intoPIX_logo_198.png?fit=283%2C198&amp;ssl=1\" data-orig-size=\"283,198\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"intoPIX_logo_198\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2018\/09\/intoPIX_logo_198.png?fit=283%2C198&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2018\/09\/intoPIX_logo_198.png?resize=283%2C198&#038;ssl=1\" alt=\"\" class=\"wp-image-651\" loading=\"lazy\"><\/a><\/figure>\n<\/div>\n\n\n<p>Intopix JPG-XS is an API that allows developers to access the latest technology of Intopix\u2019s JPEG XS codecs. 
The API provides low latency image and video compression, supporting high quality and efficient encoding of 4K UHD, HDR, and 360 degree videos. It also offers a wide range of features such as error resilience, scalable coding, and clean decoding. JPG-XS also supports accelerated decoding on multiple hardware platforms, including Intel AVX-256 and AVX-512 instruction sets.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright size-large is-resized\"><a href=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" data-attachment-id=\"1053\" data-permalink=\"https:\/\/imalogic.com\/blog\/2022\/12\/19\/intrinsic-functions-sse-avx\/fasttico-xs-sdk-v2-0-0-product-release-sept-2020-1\/\" data-orig-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?fit=1600%2C1037&amp;ssl=1\" data-orig-size=\"1600,1037\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?fit=810%2C525&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=256%2C165&#038;ssl=1\" alt=\"\" class=\"wp-image-1053\" width=\"256\" height=\"165\" 
loading=\"lazy\" srcset=\"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=1024%2C664&amp;ssl=1 1024w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=300%2C194&amp;ssl=1 300w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=768%2C498&amp;ssl=1 768w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=1536%2C996&amp;ssl=1 1536w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?resize=360%2C230&amp;ssl=1 360w, https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2022\/12\/FastTICO-XS-SDK-v2.0.0-Product-release-Sept-2020-1.jpg?w=1600&amp;ssl=1 1600w\" sizes=\"auto, (max-width: 256px) 100vw, 256px\" \/><\/a><\/figure>\n<\/div>\n\n\n<p>This project is an exploration into how the Wavelet transform algorithm can be optimized using the Intel AVX-256, AVX-512, and AVX-XXX instruction sets. Wavelet transforms are used in many applications such as image and audio processing, and our goal is to find ways to improve their performance by taking advantage of the 256-bit and 512-bit vector lengths offered by Intel AVX technology. We have researched which performance optimizations are available with Intel AVX-256 and AVX-512 instructions and what results they can achieve. We have also developed various Wavelet transform algorithms optimized for AVX instructions and have implemented them in a demonstration program to show their increased performance. 
We hope that our findings will be useful for anyone interested in optimizing their Wavelet transform applications using Intel AVX instruction sets.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Final Word<\/h2>\n\n\n\n<p><em>This mission was short but fun, and we were lucky to have an incredible team of highly motivated and talented people working under the Intopix banner.<\/em><\/p>\n<\/body>","protected":false},"excerpt":{"rendered":"<p>Normally, \u201cintrinsics\u201d refers to functions that are built-in \u2014 i.e. most standard library functions that the compiler can\/will generate inline<\/p>\n","protected":false},"author":1,"featured_media":786,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[7,66,8,6],"tags":[91,24,90,88],"class_list":["post-785","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-coding","category-computer-graphics","category-embedded","category-signal-processing","tag-avx","tag-c","tag-intrinsic","tag-optimisation"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/imalogic.com\/blog\/wp-content\/uploads\/2019\/10\/images.jpg?fit=275%2C183&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8J21V-cF","jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/785","targetHints":{"allow":["GET"]}}],"collection":[{"hr
ef":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/comments?post=785"}],"version-history":[{"count":1,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/785\/revisions"}],"predecessor-version":[{"id":1062,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/posts\/785\/revisions\/1062"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/media\/786"}],"wp:attachment":[{"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/media?parent=785"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/categories?post=785"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/imalogic.com\/blog\/wp-json\/wp\/v2\/tags?post=785"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}