Release Notes: Rolling

This document outlines changes introduced to the Intel® software for general-purpose GPU capabilities in rolling releases. As the software includes several different projects, the changes for each release are grouped by project.

Support for each rolling release continues only until the next rolling release becomes available, with no updates provided for previous rolling releases. Therefore, we recommend upgrading to the latest rolling release as soon as it becomes available. To install packages for the latest rolling release, refer to the installation guide for your distribution. For a list of packages published on repositories.intel.com/gpu for each release and operating system, see Provided Packages.


2025-04-29

The 2507.17 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, 9.4, and 9.5

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Improvements

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Updated the Graphics Micro Controller (GuC) to version 70.44.1.

  • Resolved a hang detection issue on Intel Data Center GPU Max Series by re-enabling GPU hang checks. Hang detection now only logs a warning message without terminating the application.

Intel GPU Firmware

Updated the Graphics Micro Controller (GuC) to version 70.44.1.


2025-03-18

The 2507.12 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, 9.4, and 9.5

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Features

General

Incorporated the latest security updates to address recent vulnerabilities, enhance protection, and ensure greater system reliability.

Intel® Graphics Driver Backports for Linux* OS (i915)

Added support for the HBM_REPLACE bit to signal High Bandwidth Memory (HBM) health status and its transition to the REPLACE state. This enhancement enables the driver to detect the bit and prevent loading when the state changes to REPLACE, while also reporting the issue and prompting HBM replacement.

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Started handling page fault events in the Xe debugger.

  • Added support for the cl_khr_expect_assume OpenCL extension that introduces mechanisms to supply the compiler with information that can enhance the performance of certain kernels.

  • Implemented the Level Zero zeKernelGetBinaryExp API that allows retrieving kernel binary program data.

  • Added support for shared system Unified Shared Memory (USM) allocation in appendLaunchKernel.

  • Implemented enhancements to the Unified Shared Memory (USM) reuse mechanism, including the introduction of a USM reuse cleaner that efficiently manages system and local memory across different reuse strategies, as well as an extension of the USM reuse limit infrastructure.

  • Improved cache management by supporting whitelisted includes.

  • Added support for handling new Reliability, Availability, and Serviceability (RAS) errors in Sysman.

  • Implemented alignment of host Unified Shared Memory (USM) to 2MB on discrete devices when the allocated size exceeds 2MB.
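As an illustration of the 2MB host USM alignment rule in the last item above, the following is a minimal sketch and not the runtime's actual code. The names chooseHostUsmAlignment and alignUp, as well as the 4 KB fallback alignment for smaller allocations, are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 2 MB expressed in bytes.
constexpr std::size_t kTwoMegabytes = 2u * 1024u * 1024u;
// Assumed fallback alignment (one 4 KB page) for allocations of 2 MB or less.
constexpr std::size_t kPageSize = 4096u;

// Picks an alignment: 2 MB when the requested size exceeds 2 MB,
// and the assumed page-size fallback otherwise.
inline std::size_t chooseHostUsmAlignment(std::size_t size) {
    return size > kTwoMegabytes ? kTwoMegabytes : kPageSize;
}

// Rounds an address up to the next multiple of a power-of-two alignment.
inline std::uintptr_t alignUp(std::uintptr_t address, std::size_t alignment) {
    // The bit trick below is only valid for non-zero powers of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}
```

In this sketch, an allocation of exactly 2MB keeps the fallback alignment, because the rule applies only when the size exceeds 2MB.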

Intel® Graphics Compiler

  • Modified the pass threshold to optimize the i64 multiplication performance.

  • Introduced Panther Lake support.

  • Improved the vectorizer to support vector emission of ftrunc instructions.

  • Enabled the IndVarSimplification pass to improve performance.

  • Enabled access to the Workload Management and Thread Programming (WMTP) SIP kernel for the Xe3 core and introduced a default WMTP SIP configuration per Shared Local Memory (SLM) for Xe2.

  • Added a more aggressive late rescheduling phase to the CodeLoopSinking pass and an option to disable the maximum sinking heuristic in the presence of 2D block reads.

  • Improved the InlineHelper LLVM utility.

  • Implemented the MergeAllocas pass and enabled allocation merging prior to the split asynchronous pass.

  • Enabled the emission of vectorized floating-point addition (FADD) instructions, allowing the VISA emitter to process them efficiently.

  • Implemented nested 3D resource loop unrolling.

Intel® oneAPI Level Zero

Upgraded specification to version 1.12.15.

Intel® Video Processing Library

  • Introduced support for Intel® VPL API 2.14, which adds new quality and speed settings for AI-powered video frame interpolation. This update also expands algorithm and mode selection options for AI-based super resolution and adds support for High Efficiency Video Coding (HEVC) level 8.5 decoding.

  • Improved compatibility with Python 3.12 development environments.

Intel® Video Processing Library Tools

  • Integrated screen content coding tools for AV1 into sample_encode.

  • Added a new GTK renderer option to sample_decode and sample_multi_transcode.

  • Introduced a new -fullscreen option for GTK in sample_decode and sample_multi_transcode. Users can now toggle full screen using Ctrl+f and exit with Esc.

  • Enhanced support for Python 3.12 development environments.

Changes

General

Updated the signing key for KMD prebuilds to enhance security and ensure continued reliability. The new key, valid for one year, will be used to sign all new releases. To ensure compatibility with these updates while maintaining the secure boot functionality, you need to download and install a new Distinguished Encoding Rules (DER) certificate.

Intel® Graphics Compiler

  • Lowered bfloat ceil and floor intrinsics.

  • Refactored parameters in vc-lits in lit-config for LLVM 16 to not link the initializeGenX function.

  • Increased the early recompilation threshold for default General Register File (GRF) to 500.

  • Enabled the EnableWaveShuffleIndexSinking registry key by default.

  • Enabled the WaveAllJointReduction pass by default.

  • Added an extra assertion check to the SIMDInfo offset.

Improvements

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Introduced page fault handling improvements.

  • Fixed an issue causing CSC hardware errors.

  • Removed unnecessary lockdep debugging checks from Global Graphics Translation Table (GGTT) updates.

  • Fixed timeout issues by preserving Translation Lookaside Buffer (TLB) seqno when splitting clear pages.

  • Fixed issues causing compilation errors on kernel 6.6 and later.

  • Fixed an issue where prefetch was attempted on empty objects.

  • Fixed an issue where pid_task() could fail if the target process had already exited.

  • Implemented a workaround for Arctic Sound-M (ATS-M) and introduced support for the G8 power state to reduce idle power consumption.

  • Modified the logic to avoid calling pm_qos_request a second time on an existing request during breadcrumb reset.

  • Disabled C-states for breadcrumb interrupts to reduce Direct Memory Access (DMA) latency.

  • Cleaned up incomplete shmemfs obj->base.filp on failed swapout.

  • Hardcoded memory health status in sysfs to prevent breakage.

  • Implemented flushing of freed objects before reporting available memory to stabilize the reported memory levels.

  • Modified implementation to retry eviction only when it is blocked by active or locked objects, aiming to reduce response time.

  • Optimized Virtual Memory Area (VMA) prefetch by short-circuiting redundant operations.

  • Corrected Compressed Color Surface (CCS) copies for Single Root I/O Virtualization (SR-IOV) save and restore.

  • Restricted shmem flags to a valid set for swapin to resolve a page fault issue.

  • Modified the implementation to repeat the Translation Lookaside Buffer (TLB) flush invalidation request, resolving the issue with the failing High-Performance Linpack (HPL) benchmark.

  • Removed early unlocked unbind from object free to avoid race conditions between lockless unbinding and eviction of non-persistent VMAs.

  • Introduced changes to protect i915_drm_client_fini from early shutdown.

  • Started supporting compilation with CONFIG_PAGE_TABLE_ISOLATION to fix a compilation issue on RHEL.

  • Optimized the unbind step in the GT IFR flow by skipping context runtime updates when the device is quiesced. This change reduces the execution time.

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Disabled implicit callback conversion for wait events to resolve the wait operation hang issues.

  • Added the missing callback event cache flush to fix zeEventHostSynchronize hangs.

  • Fixed an issue with reporting EU counts for multi-slice platforms.

  • Fixed an issue where ZE_AFFINITY_MASK was not working when ZE_FLAT_DEVICE_HIERARCHY was set to COMBINED in OpenCL.

  • Implemented shared allocations to preserve reference timestamps and introduced a flag in the Inter-Process Communication (IPC) pool data to verify if the mapped timestamp flag is set.

  • Fixed an issue where event_profiling::command_start returned an incorrect result.

  • Set stateless addressing mode for buffers that are neither bindful nor bindless.

  • Started retrieving the minimal offset size for region barrier.

  • Fixed the scope of the result variable in initDriver to resolve an issue where it was defined in a narrower scope, causing the initialization result to be improperly discarded.

  • Started returning rawDataSize as zero when the readIoStream call fails.

  • Resolved issues with parsing and setting the Level Zero debugger bitmask.

  • Fixed performance issues on Battlemage GPUs.

  • Ensured memory residency by setting the vmbind user fence when making memory resident.

  • Prevented crashes due to over-allocation by introducing a defer backing flag to Graphics Execution Manager (GEM) create input/output control, ensuring memory is resident before locking.

Intel® Graphics Compiler

  • Fixed regression chart dump for General Register File (GRF) configurations with more than 128 registers.

  • Resolved issues in the vectorizer, improving its stability and performance.

  • Disabled the TrivialLocalMemoryOpsElimination pass from the pipeline.

  • Resized G4_Declare’s row size for atomic operations to prevent out-of-bounds (OOB) issues.

  • Fixed issues in VISA parser for Load Store Cache (LSC) 2D block operations to allow mixed register and immediate AddrX and AddrY operands for 2D block load and store instructions.

  • Fixed value tracker handling of GetElementPtr (GEP) instructions with zero indices by treating them as bitcasts to prevent confusion in kernel usage.

  • Implemented dynamic optimization threshold adjustment for the depressurizer based on the number and size of General Register File (GRF) registers.

  • Fixed incorrect instruction placement in the rollback functionality of the CodeLoopSinking pass.

  • Stopped using the %sp and %fp predefined variables.

  • Implemented dedicated logic for handling discards in DynamicRayManagementPass to prevent crashes.

  • Fixed the direct address destination restriction on SIMD32.

  • Improved the alignment calculation in constant coalescing and started supporting additional load and store intrinsics in SynchronizationObjectCoalescing.

  • Fixed an incorrect condition check in isRegionInvariant of WIAnalysis.

  • Stopped removing Built-in Function (BiF) module prebuilt stamp files to avoid redundant recompilations when CMake files are updated.

  • Fixed an issue with constant folding prevention inside loops.

  • Added intrinsic cache to KernelDebugInfo and prevented indirect access to Software Scoreboard (SWSB).

  • Reduced the number of atomics hitting the same cache line by performing atomic predication.

  • Enabled End of Thread (EOT) to participate in the Software Scoreboard (SWSB) token assignment.

  • Fixed a channel mask issue in the src0 length for RenderTargetDataPayload, where the Alpha channel was incorrectly controlled.

Intel® oneAPI Level Zero

  • Addressed and fixed potential memory leaks.

  • Fixed issues in the generation of pkg-config files.

  • Corrected code generation for libddi table queries.

  • Corrected validation layer’s parameter checker for extensions.

Intel® Video Processing Library

Fixed the bootstrap process to support Debian distributions that do not define the ID_LIKE property.

Intel® Video Processing Library Tools

Fixed the bootstrap process to support Debian distributions that do not define the ID_LIKE property.


2025-02-07

The 2506.18 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, 9.4, and 9.5

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Features

General

  • Introduced support for RHEL 9.5.

  • Incorporated the latest security updates to address recent vulnerabilities, enhance protection, and ensure greater system reliability.

Intel® Graphics Compiler

  • Developed the SPV_INTEL_subgroup_matrix_multiply_accumulate SPIR-V extension to enable support for DPAS operations at a lower abstraction level compared to the joint matrix.

  • Introduced support for the SPV_INTEL_maximum_registers SPIR-V extension that adds literal-based and ID-based execution modes for specifying the maximum number of registers for an entry point.

  • Added support for a 3-channel image format to SYCL bindless images to enable access to the bindless texture hardware.

  • Modified the vectorizer to support vector emission of fmul instructions.

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Implemented the VF engine utilization API.

  • Added input/output control helper functions to perform mmap and unmap operations, acquire and release the GPU range, allocate a user pointer, and synchronize the userptr allocation.

  • Exposed new counter-based events and added the default mode for zexCounterBasedEventCreate2.

  • Introduced support for physical host memory.

  • Updated Level Zero metrics to align with v1.11 headers.

  • Started specifying the cache level when reserving a region.

  • Added GPU and memory power domain support for getEnergyCounter.

  • Added support for three channels in Level Zero.

  • Introduced support for zeInitDrivers that combines driver initialization and retrieval functionality. Updated the GTPIN initialization logic to execute only when pCount is greater than 0 and the driver handle is non-null. Additionally, removed the unused ze_init_flags_t flag from all driverInit functions.

  • Enabled counter-based allocation peer sharing to support in-order command lists in multi-GPU event scenarios.

  • Added support for two Xe-eudebug interfaces within a single binary. The new EuDebugInterface class encapsulates eudebug functionality, with CMake flags to control Xe-eudebug and prelim uAPI support.

  • Added a root device flag check for multi-device scenarios, so that APIs using root device handles can now validate this flag and handle failures gracefully.

  • Added a new uAPI macro in the engine module to fetch the configuration of the total ticks.

  • Added Platform Monitoring Technology (PMT) counter offset values for Battlemage.

Intel® Graphics Memory Management Library

  • Enabled IsCpuCacheable on Linux to improve performance.

  • Enabled the R10G10B10_XR_BIAS_A2_UNORM format for display to support 10-bit color and HDR rendering with improved visual quality.

  • Added the Media Video Processing (VP) performance tags that can help with optimization and debugging.

Intel® Graphics System Controller Firmware Update Library

Enabled the logging of data traffic in the trace mode.

Intel® ME TEE Library

  • Added a 32-bit release preset in CMake.

  • Added getters for maxMsgLen and protocolVer.

Intel® Media Driver for VAAPI

  • Introduced upstream encoding support for Battlemage.

  • Added support for AV1 encoding with ARGB input.

Intel® Metrics Discovery Application Programming Interface

  • Added support for half-full Observability Architecture (OA) buffer interrupt in i915.

  • Added CoreFrequencyMHz details and MaxCount global symbols for Xe2 and Xe3 platforms.

  • Added support for GpuCoreClocks symbols in read equations.

  • Introduced a global symbol that indicates GPU frequency override state.

  • Added return code handling for the following functions: AddInformationSet, SetSnapshotReportReadEquation, SetOverflowFunction, AddDefaultMetrics, AddStartRegisterSet, CreateMetricsFromPrototypes, and RefreshConfigRegisters.

Intel® oneAPI Level Zero

  • Added event deadlock detection within the validation layer.

  • Started logging the full path of loaded libraries in traces for better debugging.

  • Added result passing to validation checkers at the epilogue stage.

Changes

Intel® Graphics Compiler

  • Replaced deprecated i64 calls to llvm.smax, llvm.smin, llvm.umax, and llvm.umin with icmp and select.

  • Migrated PrivateMemoryUsageAnalysis to opaque pointers. The original pass used getNonOpaquePtrEltTy to get the element type of pointer arguments. The new approach examines the uses of each pointer argument to see if they interact with structure types.

  • Replaced deprecated std::is_pod with updated type traits.

  • Restored the fptrunc functionality to the vectorizer.
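To illustrate the first item in this list, the sketch below shows in C++ how a 64-bit minimum or maximum can be expressed as a comparison followed by a selection, which is the same shape as an icmp and select pair. It is a hedged illustration only and is not the compiler's code. The function names smaxViaSelect and uminViaSelect are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Signed 64-bit maximum written as "compare, then select".
// Roughly corresponds to: %c = icmp sgt i64 %a, %b
//                         %r = select i1 %c, i64 %a, i64 %b
inline std::int64_t smaxViaSelect(std::int64_t a, std::int64_t b) {
    const bool aIsGreater = a > b;                 // signed comparison (icmp sgt)
    const std::int64_t result = aIsGreater ? a : b; // select
    assert(result >= a && result >= b);            // postcondition of a maximum
    return result;
}

// Unsigned 64-bit minimum written the same way, using an unsigned
// comparison (icmp ult) instead of a signed one.
inline std::uint64_t uminViaSelect(std::uint64_t a, std::uint64_t b) {
    const bool aIsLess = a < b;                    // unsigned comparison (icmp ult)
    const std::uint64_t result = aIsLess ? a : b;  // select
    assert(result <= a && result <= b);            // postcondition of a minimum
    return result;
}
```

The signed and unsigned variants differ only in the comparison predicate, so smin and umax follow the same pattern with the comparison reversed.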

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Adjusted the depth configuration for all Xe2+ platforms and moved the depth limitation from the release helper to image_hw.

  • Started handling DRM_XE_TOPO_SIMD16_EU_PER_DSS in Xe non-preliminary path.

  • Simplified CacheRegion reservation tracking by replacing the dynamic unordered_map with a static array, leveraging the small, known maximum number of reservations and unique CacheRegion values. Additionally, added helper code for array-indexing using the CacheRegion enum and started using a level-specific name for CacheInfo instances.

  • Enabled programmable metrics by default.

  • Started calling flushMonitorFence on the Blitter Command Streamer (BCS) Command Stream Receiver (CSR) and ensured the global fence is always resident. Additionally, enabled Ultra Low Latency Scheduling (ULLS) on the copy engine for Battlemage.

  • Separated power handle creation from power limit support.

  • Started reusing staging buffers from other Command Stream Receivers (CSR).

  • Standardized naming and structure alignment across multiple components, such as DESTINATION_SURFACE_TYPE, CFE_STATE, STATE_BASE_ADDRESS, thread group batch size, DISPATCH_WALKER, and RENDER_SURFACE_STATE, to conform to the latest specifications, with optimizations such as extracted PostSyncType for simplified integration.
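The CacheRegion change in this list replaces a dynamic map with a fixed-size array indexed by an enum. The following is a minimal sketch of that general technique under stated assumptions: the enumerator names, the ReservationTracker class, and its methods are hypothetical and do not reflect the runtime's actual types.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical enum: the real CacheRegion values are not listed in these notes.
// The trailing Count enumerator gives the array its size.
enum class CacheRegion : std::size_t { Region0 = 0, Region1, Region2, Count };

// Helper that converts an enumerator into an array index.
constexpr std::size_t toIndex(CacheRegion region) {
    return static_cast<std::size_t>(region);
}

// Tracks one reservation size per region in a static array.
// A size of zero means "not reserved".
class ReservationTracker {
  public:
    void reserve(CacheRegion region, std::size_t bytes) {
        assert(region != CacheRegion::Count); // Count is a size marker, not a region
        reservedBytes[toIndex(region)] = bytes;
    }

    void release(CacheRegion region) {
        assert(region != CacheRegion::Count);
        reservedBytes[toIndex(region)] = 0;
    }

    bool isReserved(CacheRegion region) const {
        return reservedBytes[toIndex(region)] != 0;
    }

    std::size_t reservedSize(CacheRegion region) const {
        return reservedBytes[toIndex(region)];
    }

  private:
    // Value-initialized, so every region starts as "not reserved".
    std::array<std::size_t, toIndex(CacheRegion::Count)> reservedBytes{};
};
```

Because the set of keys is small and known at compile time, the array avoids hashing and heap allocation, which is the trade-off the item above describes.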

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Started using sg upper bound for incrementing partial maps.

  • Serialized Page Table Entry (PTE) updates for BLT offload and enhanced error handling for BLT clear submissions, along with the serialization of hardware fences.

Intel® Metrics Discovery Application Programming Interface

  • Removed all refresh metric sets for Alchemist G10 and G11.

  • Discontinued media support for Alchemist.

  • Removed unsupported media API masks.

  • Moved platform versions to the platform index map.

  • Updated Direct Rendering Manager (DRM) headers to version prelim v2.0-rc27.

Metrics Library for Metrics Discovery API

Updated Direct Rendering Manager (DRM) headers to version prelim v2.0-rc27.

Improvements

Intel® Graphics Compiler

  • Introduced the generate_local_id flag for some User Mode Driver (UMD) use cases to resolve performance issues.

  • Introduced ray tracing and registry flag improvements to enable Bindless Thread Dispatch (BTD) for synchronized dispatch rays.

  • Fixed a memory leak in BiFManager.

  • Fixed an issue causing incorrect GPU results after passing a null pointer to the GPU before usage. The fix introduces the BufferBoundsChecking and MinimumValidAddress flags to the release mode and supports handling generic address space in the MinimumValidAddressChecking pass.

  • Added missing intrinsic functions to the TypesLegalizationPass.

  • Fixed an issue where fast math flags were unavailable in some scenarios in the ScalarizeFunction pass.

  • Fixed an issue that caused metadata to be unavailable in certain scenarios in the ScalarizeFunction pass.

  • Fixed incorrect Intermediate Representation (IR) after rollback application in the vector shuffle rescheduling CodeLoopSinking pass. Additionally, weakened the conditions for creating a candidate to reenable vector shuffle scheduling.

  • Replaced functions that used joint matrix arguments with updated functions to ensure proper cleanup.

  • Added scalarization for the fshl operation to fix an issue where the process of swapping the higher and lower 32-bit sections of 64-bit data in vectors was not working correctly.

  • Fixed an issue with an incorrect Virtual Instruction Set Architecture (VISA) preemption option.

  • Added ray tracing intrinsic payload accessor for updating payload data to optimize ray tracing stack footprint.

  • Optimized WaveShuffleIndex sinking.

  • Fixed an issue where CustomLoopOpt did not ensure the floating point type, which caused type mismatch.

  • Fixed indirect access detection for divergent execution paths.

  • Fixed a crash that occurred during shader compilation by ensuring the coalescing engine correctly marks the start of the payload’s lifetime.

  • Extended unroll optimization to new core to improve performance.

  • Started emitting DW_OP_stack_value only at the end of SIMD32 expressions to resolve a nullptr dereference issue. The implementation adapts debugging information to handle SIMD32-to-SIMD16 splits, marking the merge point in IGC::DbgVariable.

  • Fixed an issue with dst and src overlapping in HWConformity.

  • Merged multiple consecutive WaveAll operations into a joint reduction tree to optimize joint reduction.

  • Created a new optimized pattern for WaveShuffle.

  • Improved getGenISARange in DebugInfoPass to significantly reduce execution time by streamlining instruction iteration and minimizing map lookups.

  • Extended the capability of wavePrefix to improve performance.

  • Enabled the EnableGEPLSRMulExpr flag by default to fix performance issues.

  • Fixed accumulator registers save and restore syntax in Xe2 SIP.

  • Modified Scalar Evolution (SCEV) analysis responsible for caching ZExt expressions to improve the compilation time and performance.

  • Introduced a new metadata field for disabling the memory fence optimization that treats Untyped Global Memory (UGM) fences as synchronizing Shared Local Memory (SLM). This change fixes performance issues.

  • Increased the available per-thread scratch size by removing the unused maxPerThreadScratchSpace method from DriverInfo.

  • Fixed an issue where the debug registry keys were incorrectly marked.

  • Added the missing registry key check for the EnableGVN key.

  • Resolved issues in shader debugging code.

  • Fixed the kernel argument alignment issues.
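The fshl item in this list concerns swapping the upper and lower 32-bit halves of 64-bit vector elements. As a hedged illustration, the sketch below models a 64-bit funnel shift left following the semantics of the LLVM llvm.fshl intrinsic and applies it to each element separately, which is what scalarization means here. The function names fshl64, swapHalves, and swapHalvesScalarized are hypothetical and this is not the compiler's implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Funnel shift left on 64-bit values: conceptually concatenate hi:lo into
// a 128-bit value, shift left by (shift mod 64), and keep the upper 64 bits.
inline std::uint64_t fshl64(std::uint64_t hi, std::uint64_t lo, unsigned shift) {
    shift %= 64u;
    if (shift == 0) {
        return hi; // avoids an undefined 64-bit shift in the expression below
    }
    return (hi << shift) | (lo >> (64u - shift));
}

// With both operands equal, a funnel shift is a rotate. Rotating a 64-bit
// value by 32 swaps its upper and lower 32-bit halves.
inline std::uint64_t swapHalves(std::uint64_t value) {
    return fshl64(value, value, 32u);
}

// Scalarized form: the vector operation is applied one element at a time.
template <std::size_t N>
std::array<std::uint64_t, N> swapHalvesScalarized(const std::array<std::uint64_t, N>& input) {
    std::array<std::uint64_t, N> output{};
    for (std::size_t i = 0; i < N; ++i) {
        output[i] = swapHalves(input[i]);
    }
    return output;
}
```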

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Improved user space notification for page faults by enabling event synchronization GPU status checks by default and adding an option to force GPU status checks via a dedicated key.

  • Fixed an issue with duplicating calculations when printing timestamps was enabled. The fix adds the PrintCalculatedTimestamps debug flag for printing timestamps in Level Zero paths. Additionally, PrintTimestampPacketContents adds logging for Level Zero paths, and ForceUseOnlyGlobalTimestamps forces the use of a global timestamp.

  • Added a debug flag to override region count.

  • Started clearing standalone timestamps prior to submission.

  • Enabled the Tile64 optimization flag to fix functional issues.

  • Enabled zesInit for new platforms when a legacy path is initialized.

  • Started calculating stack count for synchronization render target to fix functional issues that blocked SIMD32 compilation of syncRT shaders. Additionally, removed the number of render target stacks from the capability table.

  • Ensured the appropriate Command Stream Receiver (CSR) is selected for submission when copy offload is not allowed.

  • Resolved an issue where an incorrect timestamp was returned. Now, when EnableGlobalTimestampViaSubmission is set, zeDeviceGetGlobalTimestamps uses the immediate command submission method to get the correct GPU time.

  • Added a debug breakpoint to handle eviction failures, preventing devices from entering an error state. Additionally, removed destroyed allocations from eviction lists.

  • Aligned thread group count to DSS on all platforms.

  • Stopped including performance counters in timestamp profiling to improve performance.

  • Fixed an issue where OpenCL did not expose tiles as devices with combined hierarchy.

  • Added the missing AUB polls on synchronization points.

  • Started checking standalone counter-based (CB) event completion for profiling to fix an issue with performance checks.

  • Fixed reporting the number of Xe cores per cluster.

  • Appended the recorded command list into immediate to resolve relaxed ordering and stalling command checks. Additionally, added an immediate command list append API to the reported extensions list.

  • Eliminated an overhead caused by using the submission method for zeDeviceGetGlobalTimestamps.

  • Enabled immediate binding for make resident to fix memory reporting issues.

  • Fixed 64-bit row and slice pitch for built-ins in the Level Zero heapless mode.

  • Fixed the discrepancy of implicit arguments buffer allocation and programming in the OpenCL path.

  • Fixed an issue with mirroring the module debug area write across tiles on Xe.

  • Adjusted limiting device Unified Shared Memory (USM) reuse and stopped reserving vector for allocation information when reuse is disabled.

  • Fixed an issue with PCI and memory timestamp units in the Platform Monitoring Technology (PMT) telemetry.

  • Started tracking Unified Shared Memory (USM) reuse usage when multiple OpenCL contexts are used to fix a memory leak.

  • Started assigning the Command Stream Receiver (CSR) once for the staging image write, instead of assigning it separately for each sub-copy.

  • Started using a release helper to get the correct ftrXe2Compression value and disabled the Xe2 compression through the release helper.

  • Added microsecond resolution for timeouts to fix a camera driver compatibility issue.

  • Started checking for nullptr before dereferencing.

  • Optimized bind information in the input/output control helper Xe to store only the userptr values and their corresponding GPU virtual addresses.

  • Fixed an issue where the returned number of VF engine statistics was incorrect. Additionally, implemented a check for local memory removal during the VF handle creation.

  • Started gracefully handling cases where the SIP version header is greater than 3.

  • Improved the behavior of notifyNReports.

  • Started allocating resources by KMD on Battlemage to improve performance.

  • Introduced staging reads and enabled image writes through staging chunks for improved performance.

  • Improved Xe2 allocation with KMD, enhanced inOrder counter signaling via pipe control during data cache flush for immediate command lists, and added debug flags for event signal visibility, including AbortHostSyncOnNonHostVisibleEvent and ForceHostSignalScope for host event synchronization management.

  • Enabled direct submission on Battlemage.

  • Enabled copy through staging buffers on Xe2 and timestamp reuse.

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Resolved the thundering herd problem in ct_receive by waking only the specific receiving process through ct_request. This prevents waking unrelated processes and avoids inefficient iteration, especially during concurrent page faults.

  • Resolved issues related to map_pages() and iotlb_sync_map() functions.

  • Implemented changes to ensure that all blocking send operations are awakened and canceled if completion tracking (CT) fences are disabled during an ongoing send operation.

  • Fixed an issue causing node hangs when applications were profiled using VTune. The issue was addressed by initializing chunk->policy for shmem allocations.

  • Changed the intel_fbdev_restore_mode return type from void to int to meet the fbdev client registration API requirement introduced in kernel 6.12.

  • Fixed a node reboot issue that occurred due to a general protection fault. The issue was addressed by protecting the acquisition of ce->timeline in signal_irq_work.

  • Deferred ct_receive from the ct_send_nb path to prevent deadlock caused by calling handlers under spinlocks. The patch removes ct_receive from the non-blocking send path to reduce latency, allowing the caller to handle scheduling of ct_receive for backlog clearing.

  • Enabled backport support for the 6.12 kernel.

Intel® Graphics Memory Management Library

  • Added a new media compression mode to resolve assertion issues.

  • Adjusted the BaseWidth to improve support of the RGB24 format.

  • Updated the reserved Page Attribute Table (PAT) index of the cache element to optimize cache behavior.

Intel® ME TEE Library

Added missing driver logic for fixed clients to the Unified Extensible Firmware Interface (UEFI).

Intel® Media Driver for VAAPI

  • Fixed AV1 decoding corruption caused by invalid reference frames.

  • Fixed an AVC decoding hang issue that occurred when the output surface buffer was insufficient.

  • Fixed an AV1 Bitrate Control (BRC) encoding mismatch issue by ensuring the correct frame type.

  • Fixed an AV1 multi-tile group BRC encoding issue.

  • Fixed a page fault issue in AV1 encoding related to the macroblock coded buffer.

  • Fixed corruption in UYVY to RGB32 color space conversion (CSC) output.

  • Corrected an incorrect RGB mask order in video processing.

  • Fixed an R8G8 resource allocation failure.

Intel® Metrics Discovery Application Programming Interface

  • Corrected performance capabilities in alignment with i915 performance revision.

  • Resolved a copy engine count issue.

  • Fixed truncation of symbol names in equations by enabling dynamic allocation instead of the previous fixed size of 32 characters.

  • Optimized the size of global symbol byte arrays.

Intel® oneAPI Level Zero

  • Started removing the static result in InitDrivers when the first initialization fails.

  • Switched to using relative paths for event deadlock detection in third-party headers.

  • Disconnected zeInitDrivers and zeDriverGet.

  • Addressed issues with backward compatibility regarding Get*ProcAddrTable usage.

  • Added the missing zeKernelGetExp API and header updates.

  • Fixed zeInit compatibility when zeInitDrivers is undefined.

Intel® Video Processing Library GPU Runtime

  • Fixed an AV1 decoding issue that caused frame synchronization errors and corruption.

  • Fixed a VP9 encoding issue that led to corruption on consecutive key frames.

  • Enhanced video processing to improve composition output quality.


2024-12-23

The 2448.13 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, and 9.4

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Features

Intel CM Compiler

  • Introduced support for arbitrary SIMD in sampler intrinsics.

  • Added the cm_rsqrt implementation, which maps directly to the SPIR-V OpenCL RSqrt intrinsic.

  • Started supporting saturation in the rsqrt built-in.

  • Introduced support for full-width r0 access in the cm_get_r0 intrinsic.

  • Introduced a compiler option to enable cost analysis information.

  • Introduced the -vc-use-bindless-buffers and -vc-use-bindless-images options to enable bindless accesses. These options allow VC to generate bindless buffers and images, replacing stateful BTI-based ones.

Intel® Graphics Compiler

  • Improved IGCVectorizer by adding a new vectorization pattern with the FPTrunc instruction and the ability to merge incompletely scalarized vector paths across multiple basic blocks.

  • Implemented SPV_INTEL_2d_block_io.

  • Added descriptive error messages for recoverable errors in the Vector Compiler (VC).

  • Extended the cl_intel_subgroup_2d_block_io extension with more variants by adding support for the following built-ins: intel_sub_group_2d_block_read_8b_16r16x4c, intel_sub_group_2d_block_read_8b_32r16x4c, and intel_sub_group_2d_block_prefetch_8b_16r16x4c.

  • Improved Virtual Instruction Set Architecture (VISA) APIs to support general nbarrier.

  • Added support for specifying the last caller-saved General Register File (GRF) using the VISA -lastCallerSavedGRF option.

  • Optimized performance handling by introducing a third retry stage to drop to SIMD16/SIMD8 when PTSS is exhausted, improving resource efficiency.

  • Allowed building BiFs without the OPAQUE_ARG argument.

  • Implemented debug information to support subroutines.

  • Optimized non-uniform indexed Resource Loops to improve performance.

  • Removed unnecessary TGM fences between stores in compute shader to improve performance.

  • Allowed using OpenCL Clang with an older Low-Level Virtual Machine (LLVM) version.

  • Enhanced Low-Level Virtual Machine (LLVM) interim mode by replacing the environment variable control with a new IGC_OPTION__LLVM_INTERIM option for better integration within CMake files.

  • Added the ability to set the IGC_LLVM_INTERIM mode, which allows creating the IGC_LLVM_TRUNK_REVISION definition based on the -D option or ENV variable.

  • Introduced support for opaque pointers in newer Low-Level Virtual Machine (LLVM) versions.

  • Introduced kernel performance metrics.

  • Started emitting a predefined runtime symbol in the ZEBinary’s symbol table.

  • Added support for bindless memory access.

  • Introduced an intrinsic for optimization fence in the Vector Compiler (VC).

  • Introduced the DisablePHIScalarization option that allows skipping PHI nodes scalarization in the ScalarizeFunction pass.

  • Added the ability to disable the ScalarizeFunction pass in the OpenCL pipeline.

  • Improved compilation time for WIAnalysis.

  • Implemented the SeparateSpillAndScratch helper function to improve performance.

  • Implemented bundle conflict reduction for two-source instructions for OpenCL to improve performance.

  • Enabled default support for illegal integer types in GetElementPtr Loop Strength Reduction (LSR), improving handling of SCEV expressions with arbitrary integer widths.

  • Added the capability to enable the use of automatic immediate offset for 2D block intrinsics.

  • Introduced a registry key to allow disabling emulation for floating-point 64-to-16-bit conversions.

  • Added bundle conflict reduction (BCR) support for 5-source Dot Product Accumulate Systolic (DPAS) instructions. Previously, BCR only supported 3-source instructions.

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Added a debug key for setting MaxSubSlicesSupported.

  • Implemented a debug flag to manage the direct submission semaphore mode.

  • Implemented a debug flag that allows changing the ULLS Blitter Command Streamer (BCS) timeout.

  • Added new functions to the dispatch table to support mutation of kernel Instruction Set Architecture (ISA).

  • Added a class for multi-device metrics to calculate the report format.

  • Enabled global bindless allocator on Battlemage.

  • Added an EnableCompatibilityMode flag to support binary compatibility across multiple hardware targets.

  • Added support for 2D block load and store extension queries.

  • Introduced a new forward-compatibility model for zeinfo to emit an error whenever an unknown attribute is encountered.

  • Added a getter method for accessing device node information from the Direct Rendering Manager (DRM).

  • Added the missing event scope flags for Command Buffer (CB) handling.

  • Added support for system memory in virtual memory functions.

  • Added new parameters to the GuC System Information (SysInfo) Blob for enhanced functionality.

  • Extended zeDeviceGetProperties with additional device properties, such as module_id and server_type tokens.

  • Enhanced log messages to support setting message severity using environment variables.

  • Enabled Walker Mid-Thread Preemption (WMTP) on Lunar Lake.

  • Added compilation of heapless built-ins with images.

  • Started using heapless built-ins for images.

  • Added support for custom compiler backends, which allows loading different versions of the backend compiler based on the underlying device.

  • Started using the sysInfo helper for detecting the memory type.

  • Started checking the peak power support using the escape call.

  • Added support for 3-channel configuration in the image format descriptor.

  • Updated the General Register File (GRF) register implementation.

  • Implemented error handling to trigger when the OA buffer overflows.

  • Introduced support for programmable metric group for Lunar Lake and Battlemage.

  • Enhanced kernel parameter configuration by adding support for passing additional enqueue and zebin parameters. This enables features such as quantum dispatch and quantum size specification directly within zebin for better encapsulation and OpenCL compatibility.

  • Added support for the custom allocator in work partition allocation.

  • Added idle Command Stream Receiver (CSR) detection and improved timeout handling in the ULLS controller.

Intel® Graphics Driver Backports for Linux* OS (i915)

Enabled backport support for the 6.11 kernel.

Intel® Graphics System Controller Firmware Update Library

  • Added timestamps to logs.

  • Implemented a library API and a CLI command for reading the firmware status register.

  • Introduced a new error message to notify users about device iterator failures.

Improvements

Intel CM Compiler

Resolved synchronization issues between IGC and LLVM16.

Intel® Graphics Compiler

  • Fixed an issue with broken DPAS instructions.

  • Resolved assertions with Opaque Pointers in the Joint Matrix resolution pass.

  • Fixed an issue with the DW_AT_enum_class support in DWARF.

  • Resolved an issue with missing sign-extension when accessing BVH data in device memory.

  • Fixed an issue where an incorrect result was returned when using stack calls.

  • Started generating block loads and stores automatically for contiguous memory blocks.

  • Fixed an issue causing SYCL code to fail on Intel Data Center GPU Max Series.

  • Resolved an issue with the get_coordinate function in Vector Neural Network Instructions (VNNI) matrix B to ensure it returns coordinates from the original matrix instead of the VNNIed matrix when the matrix B is packed.

  • Fixed an issue with indirect addressing for insertelement in SIMD32.

  • Updated the joint matrix to resolve an issue where runtime matrix dimensions caused a failure for the 32x64x16 combination.

  • Resolved an issue with handling negative indices in getelementptr instructions to ensure correct results for kernels when optimizations are enabled.

  • Fixed the bfloat constant handling for the fcmp instruction.

  • Removed SPIR-V support from float AtomicCompareExchange.

  • Fixed instruction cast in CustomLoopOpt.

  • Copied all data from the shadow memory to the global memory.

  • Corrected the list of platforms supporting large constant buffers.

  • Corrected the use of disallowed conditional modifiers in logic instructions.

  • Preserved -0.0 in floor and ceil emulations to maintain correct behavior for the extended instruction implementations of floor and ceil.

  • Added checks to dp4WithIdentityMatrix in CustomSafeOptPass to ensure it matches the correct patterns.

  • Fixed an out-of-bounds crash in InlineHelper.

  • Set vISA_autoLoadLocalID during overriding visaasm files to ensure the expected value, enabling overrides without unexpected perturbations.

  • Introduced per-pixel offset support to the sampler.

  • Marked the TimeStatsCounter pass as not modifying Intermediate Representation (IR) to prevent unnecessary analysis restarts and avoid potential code generation differences.

  • Extended the CustomSafeOptPass to support i64 datatype in certain operations.

  • Fixed an issue where the SIP surface could not be accessed in bindless and bindful modes.

  • Enhanced instruction scheduling to reduce register spills on the default path.

  • Enabled the IndVarSimplification pass.

  • Enabled ShrinkArrayAllocaPass in fast-compilation modes.

  • Resolved an issue causing super-linear runtime scaling on Battlemage during the upsampling and convolution kernel process.

  • Optimized compilation time of the HandleSpirvDecorationMetadata pass.

  • Enabled the -dynamicAddrForExDescInLscSend option by default.

  • Stopped emitting mov instructions from undefined PHI incoming edges.

  • Fixed an issue where GroupNonUniformBroadcast from the ballot_group SYCL non-uniform group caused failures when used with sycl::half and short types.

  • Extended nbarrier support to include producer-only and consumer-only modes, in addition to the existing producer-consumer mode.

  • Fixed an issue where the Bus Interface field was incorrectly displayed as N/A.

  • Reduced usage of pointer element types by replacing calls to getNonOpaquePtrEltTy with element type information obtained through other methods.

  • Updated LITs to use the --opaque-pointers option for automatic conversion between typed and opaque pointers.

  • Replaced selectWalkOrder with selectWalkOrderInPass.
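The signed-zero item above (preserving -0.0 in floor and ceil emulations) is easy to get wrong, so a small sketch may help. This is not the compiler's implementation: it assumes an emulation built on truncation toward zero, and the function names are hypothetical. IEEE 754 expects floor(-0.0) and ceil(-0.5) to both yield -0.0, which a naive truncation-based emulation loses.

```python
import math


def floor_emulated(x: float) -> float:
    """Floor built from truncation, keeping the sign of a zero result."""
    if not math.isfinite(x):
        return x  # NaN and infinities pass through unchanged
    t = float(int(x))            # truncation toward zero; drops the sign of -0.0
    r = t - 1.0 if t > x else t  # step down for negative non-integers
    # A zero result takes the sign of the input, so -0.0 stays -0.0.
    return math.copysign(0.0, x) if r == 0.0 else r


def ceil_emulated(x: float) -> float:
    """Ceil built from truncation, keeping the sign of a zero result."""
    if not math.isfinite(x):
        return x
    t = float(int(x))
    r = t + 1.0 if t < x else t  # step up for positive non-integers
    # ceil of any value in (-1, 0) is -0.0, not +0.0.
    return math.copysign(0.0, x) if r == 0.0 else r
```

Because `-0.0 == 0.0` compares equal, the sign has to be checked with `math.copysign(1.0, value)` rather than with `==`.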

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Added missing in-order signalling in appendCopyImageBlit.

  • Corrected page alignment calculation in getPageAlignedSizeRequired.

  • Updated the logic for calculating the available thread count to eliminate the use of magic numbers, making it dependent on the General Register File (GRF) size.

  • Stopped using the event fence wait in the TBX mode.

  • Improved CPU caching handling for Unified Shared Memory (USM) allocations to prevent data corruption during memory migration.

  • Fixed a platform teardown issue to ensure release calls return success.

  • Fixed an issue with Blender crashes when using the Sysman API.

  • Corrected stateless kernel compilation for Intel Data Center GPU Max Series.

  • Enabled immediate binding for the make resident functionality.

  • Optimized thread group and count alignment with DSS size for kernels utilizing SLM.

  • Fixed an issue with using incorrect names for heapless built-ins.

  • Optimized semaphore handling to avoid redundant paging fence programming when the semaphore is already set with the same value.

  • Removed defaultProfilingTimerResolution from RuntimeCapabilityTable.

  • Started using releaseHelper to determine MTP and MSAA enablement.

  • Corrected the total SLM size.

  • Fixed a ZesInit switch crash to ensure proper error handling for invalid core device handles.

  • Removed the INLINE_DATA structure.

  • Corrected read and write memory bandwidth counters.

  • Fixed stackSizePerRay to ensure proper stackID allocation.

  • Fixed an issue where internal options were not correctly appended during recompilation from SPIR-V in the createFromNativeBinary path.

  • Added the stateless suffix to names of stateless built-in functions.

  • Fixed a race condition in the debugger where VMBind events could occasionally go unacknowledged.

  • Fixed an issue with freeing allocations in the AUB mode.

  • Fixed the timerResolution value in the zesDeviceGetProperties function.

  • Added infrastructure to limit the maximum memory usage for device Unified Shared Memory (USM) reuse.

  • Added polling for Software First In, First Out (FIFO) queue.

  • Disabled copy offload for Device-to-Device (D2D) operations.

  • Added configurable timeout debug variables to the debugger.

  • Introduced in-order relaxed ordering handling for signalEvent.

  • Changed the error message for GPU page faults to match the message from the GDB output.

  • Set a maximum limit of 256MB for the allocation reuse mechanism.

  • Added a tail index to handle the First In, First Out (FIFO) wrap-around case.

  • Renamed getBindlessDebugSipKernel to getDebugSipKernel and added the ability to select between bindless and heapless SIP modes based on the configuration.

  • Removed csStallOnly to ensure proper setting of the RTCF flag.

  • Modified page fault event handling to make it generic.

  • Removed engineInstanced and Engineinstancedtype.

  • Moved all Sysman constants into the Sysman namespace.

  • Improved GPU virtual address management by prioritizing new address assignment over reuse.

  • Fixed an issue with incorrect Procyon PhotoEdit score.
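One item in the list above adds a tail index to handle the FIFO wrap-around case. The sketch below shows the general ring-buffer technique only. It is not the runtime's code, and the class and method names are hypothetical.

```python
class RingFifo:
    """Fixed-capacity FIFO whose head and tail indices wrap around."""

    def __init__(self, capacity: int) -> None:
        self._slots = [None] * capacity
        self._head = 0   # index of the next slot to read
        self._tail = 0   # index of the next slot to write
        self._count = 0  # number of occupied slots

    def push(self, item) -> bool:
        if self._count == len(self._slots):
            return False  # full; the caller has to retry later
        self._slots[self._tail] = item
        # The modulo makes the tail wrap back to slot 0 at the end.
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self):
        if self._count == 0:
            return None  # empty
        item = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return item
```

Tracking the tail separately from the head is what lets a write land in slot 0 again after the last slot has been used, as long as the reader has already freed it.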

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Implemented GPU error capture during the splitting of cleared backing stores.

  • Initiated earlier shrinking of all system memory objects to mark pages as dirty and preserve their contents across hibernation.

  • Resolved an issue where the Guided Matrix Multiplication Race Condition Sample hung instead of crashing due to driver errors.

  • Prevented eviction of overlapping VM_BINDs to comply with ppGTT rules.

  • Fixed HBM diagnostics logging on Intel Data Center GPU Max Series.

  • Started tracking the duration of user stalls during TLB invalidation.

  • Optimized the userptr task placement for improved task-local performance in LAMMPS benchmarks.

  • Adjusted the starting point of put_page_range to align with the compound_head, ensuring correct iteration across adjacent compound pages.

  • Fixed a performance issue in addr_range and id sysfs calls by optimizing kobject attribute handling on RHEL8.x systems.

  • Improved handling of engine reset failures to prevent deadlocks during GT resets and G2H notifications.

  • Started storing the current UID for core dumps and displaying active clients in sysrq-G.

  • Started including the page fault address for CAT errors.

  • Restored RPM ownership to core to fix device runtime-pm drop in PSB builds.

  • Fixed issues causing performance drops.

  • Improved error resilience for iommu mappings.

  • Ensured ENABLE_PG is cleared on unbind/unload to prevent reset failures in power-saving mode.

  • Fixed an issue involving the uninitialized use of the domain variable.

Intel® Graphics Memory Management Library

Fixed a One Definition Rule (ODR) issue with Link-Time Optimization (LTO) compilation.

Intel® Graphics System Controller Firmware Update Library

  • Stopped suppressing ME TEE error messages in the debug mode.

  • Fixed incorrect device enumeration.

Intel® oneAPI Level Zero

  • Corrected and regenerated loader files to address incorrect reordering.

  • Ensured backward compatibility in DDI structures for experimental features.


2024-11-12

The 2441.21 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, and 9.4

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Improvements

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

Fixed a page fault issue that occurred on Intel Data Center GPU Max Series.

Intel® Metrics Discovery Application Programming Interface

  • Resolved an issue where the IoMeasurementInformationCount parameter was incorrectly set to 0 until ReadIoStream was called.

  • Disabled Wno-extern-c-compat for all compilers except clang to prevent compilation errors with newer compilers. The FORTIFY_SOURCE level is now set to 2 for all compilers when optimizations are enabled, and to 3 for modern compilers.

Metrics Library for Metrics Discovery API

  • Removed the MI_REPORT_PERF_COUNT command for OAG queries.

  • Removed the unnecessary timestamp storage for OAG queries on XeHP+.

  • Started clearing the command streamer ID before MMIO_TRG starts.

  • Stopped writing driver or user markers when the value is 0.

  • Fixed the slice and unslice frequency for OAR and OAC queries.


2024-10-31

The 2441.19 release supports the following operating systems:

  • Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, and 9.4

  • Ubuntu 22.04 and 24.04

  • SUSE Linux Enterprise (SLES): 15 SP4, 15 SP5, and 15 SP6

Features

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Upgraded the Core API to version 1.9, enabling the immediate execution of existing command lists by converting them into immediate command lists.

  • Added support for querying the number of L3 cache banks to the preliminary version of the Xe architecture. This allows software to retrieve information about how many banks of L3 cache are available on the GPU.

  • Enabled bindless mode that allows shaders to directly access resources like textures, buffers, and other memory objects without the need for explicit binding to specific slots in the traditional way. This provides more flexibility and efficiency, particularly for workloads with large numbers of resources.

  • Started using the global bindless allocator in the Xe² HPG architecture to manage the assignment of global pointers to resources in the bindless mode.

  • Added an API that retrieves the kernel binary from a kernel.

  • Updated kernel residency management to save the position of the kernel internal container when allocation can change.

  • Added an input/output control helper function to get fence address and set external context.

  • Added implementation of standalone CB events without pool allocation supporting all features for regular events.

  • Added support for running SIMD16 operations on Execution Units (EUs) contained within each Dual-SubSlice (DSS) on the preliminary version of the Xe architecture.

  • Implemented the kernel trace functionality to support the metric group type.

  • Added additional parameters to the Graphics Microcontroller (GuC) to support the region allocation logic.

  • Implemented setErrorDescription in os_interface/linux for drm_buffer_object, drm_memory_manager, and ioctl_helper_prelim.

  • Added a debug flag to disable walker splitting for copy operations.

  • Updated the Level Zero Core version to 1.6.

Intel® Graphics Compiler

  • Improved Conditional Coverage (CCOV) for the RemoveLoopDependency pass.

  • Added the Inverse Multiply (INVM) and Reciprocal Square Root Multiply (RSQTM) intrinsics.

  • Added the ScalarAliasBBSizeThreshold flag to control the maximum size of Basic Block (BB) for which scalar to vector aliasing applies.

  • Added the CEncoder::CopyWithImplicitConversion function to allow generating MOV instructions for src and dst combinations with different type sizes.

  • Introduced an additional step in the compiler build process to split the BiF module into smaller sections, reducing the time required for materializing and inlining the module.

  • Improved parsing of igc_opts to avoid a misleading error message.

  • Enabled the removeUnusedSLM option.

  • Extended the cl_intel_subgroup_2d_block_io extension to support the following built-ins: intel_sub_group_2d_block_read_8b_8r16x4c, intel_sub_group_2d_block_prefetch_8b_8r16x4c, and intel_sub_group_2d_block_read_transpose_32b_32r8x1c.

  • Implemented AtomicPullSWWalkWrapperLoopImplementationPass.

  • Added a registry key to disable coalescing memory fences and extend control over the SynchronizationObjectCoalescing pass.

  • Added support for DIStringType in DWARF.

  • Enabled GenISAIntrinsics on LLVM16.

  • Enabled denormal number support for systolic operations in the Vector Compiler (VC).

  • Added Xe2 Battlemage WMTP SIP support.

  • Introduced a new GenISA intrinsic, WaveClusteredInterleave, that combines two wave reductions: WaveClustered and WaveInterleave.

  • Introduced support for copy sign intrinsic in the Vector Compiler (VC).

  • Added fast exits in the pattern match.

  • Added TGM fence workaround for Xe2.

  • Enabled loads rescheduling in CodeLoopSinking.

  • Enabled loop sinking of 2D block loads and shuffle patterns.

  • Added joint matrix support for the 32x32x16 combination for DG2.

  • Added early constant loading for Dot Product Accumulate Systolic (DPAS) in the Vector Compiler (VC). DPAS operations do not support immediate operands, so the compiler must first move the constants into registers.

  • Implemented an inverse square root built-in function.

  • Added a new version of the GTSystemInfo interface with a new SLMSizeInKb field.

  • Added a global_barrier implementation using atomic instructions.

  • Added a GPUVA for ubertilesmap in RayDispatchGlobals.

  • Introduced Loop Cost Expression (LCE) support.

  • Introduced support for LIT opaque pointers.

  • Introduced a flag to disable dynamic RQ management.

  • Introduced invm and rsqtm math functions to support double-precision inverse square root calculations.

  • Enhanced the cl_intel_subgroup_2d_block_io extension by adding the following built-in functions: intel_sub_group_2d_block_read_8b_16r16x4c, intel_sub_group_2d_block_read_8b_32r16x4c, and intel_sub_group_2d_block_prefetch_8b_16r16x4c.

  • Introduced the DisablePHIScalarization option in the ScalarizeFunction pass to enable skipping of PHI node scalarization.

  • Added an option to disable the ScalarizeFunction pass in the OpenCL pipeline.

Intel® Media Driver for VAAPI

  • Added full support for the Lunar Lake platform in the upstream.

  • Introduced initial support for the Battlemage platform in the upstream.

Intel® Video Processing Library GPU Runtime

  • Enabled full support for the Lunar Lake platform.

  • Added initial support for the Battlemage platform.

  • Introduced support for the Y210 format in media copy operations.

  • Added a check for AV1 decoding bitdepth changes when parsing SPS syntax to prevent decoding issues.

  • Aligned the default decode frame rate to 30fps.

  • Added support for MFX_EXTBUFF_VIDEO_SIGNAL_INFO in AV1 decoding to retrieve video signal information.

  • Enabled dynamic decode frame rate by parsing frame rate data from the AV1 bitstream.

  • Improved reference frame patterns in pyramid cases.

  • Enabled block size selection for VP9 encoding segmentation.

Intel® XPU Manager and XPU System Management Interface

  • Added the ability to display the date in the dump command when using the --date parameter.

  • Added security consolidation.

  • Upgraded the vGPU parameters.

  • Improved GPU diagnostics when Single Root I/O Virtualization (SR-IOV) is enabled.

  • Improved the GPU diagnostics configuration file.

  • Introduced support for Ubuntu 24.04 and SLES 15 SP6.

  • Added the ability to display the version of the intel-i915-dkms package.

  • Improved GPU memory throughput reporting.

  • Improved Peripheral Component Interconnect Express (PCIe) downgrading checking.

  • Introduced support for the Xe link all-to-all throughput test in a single GPU diagnostics test.

Improvements

Intel CM Compiler

  • Implemented translation of cm_ieee_frem to the SPIR-V FRem instruction.

  • Added support for arbitrary SIMD in sampler intrinsics.

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver

  • Fixed an issue that caused compilation and kernel creation to fail when buffer stateful access was detected.

  • Fixed Blitter Command Streamer (BCS) Memory Mapped Input/Output (MMIO) remapping.

  • Limited the number of multiple queue sets for Compute Command Streamer (CCS) modes to prevent resource replication overhead and maintain flexibility for future optimizations in queue-level operations.

  • Added the missing core properties to the zesInit path.

  • Updated the debug control register to support per-context debugging and unlock threads before resuming them.

  • Modified a kernel helper method to pass kernel data directly. Additionally, added a command list engine group type and local dispatch getters.

  • Removed excessive input and output control calls.

  • Restricted alignment checks to debug builds.

  • Corrected program header generation for shared Instruction Set Architecture (ISA) allocation to fix an issue where running a benchmark resulted in an exception.

  • Modified Ultra Low Latency Submission (ULLS) to avoid programming the NP state when heapless state initialization is enabled.

  • Set the control block in the buffer object handle wrapper to null after deletion.

  • Created a new allocation type for syncBuffer and started using gpuAddress when patching the synchronization buffer.

  • Added support for programming the partition offset for the secondary Command Stream Receiver (CSR).

  • Started skipping uninitialized Command Stream Receivers (CSRs) during downloads for the Toolbox Interface (TBX).

  • Unified the Class of Service (CLOS) support on Intel Data Center GPU Max Series and added a helper for matching all platforms except the selected ones.

  • Enabled efficient 64-bit addressing by default.

  • Disabled per-context VMs when e64 is enabled to support debugging in the heapless mode without per-context VMs.

  • Fixed an issue with returning incorrect maxMemAllocSize.

  • Reserved the HP context on one of the BCS engines, preventing the allocation of any other context on it.

  • Fixed an issue with the MaxMemoryBandwidth unit conversion in Platform Monitoring Technology (PMT).

  • Fixed the Platform Monitoring Technology (PMT) temperature offset.

  • Started initializing debugger before creating engines.

  • Disabled Unified Shared Memory (USM) host recycle to avoid a page fault issue.

  • Changed WMTP enablement on Xe2 platforms to rely on the compiler product helper instead of disabling WMTP based on the ftr flag.

  • Fixed a null pointer dereference.

  • Fixed an issue where debugger initialization failed to disable debugging when the kernel lacked debugging support.

  • Started setting the XE_EXEC_QUEUE_SET_PROPERTY_EUDEBUG debug extension property when debugging is enabled.

  • Fixed an issue with copy offload Memory Mapped Input/Output (MMIO) programming.

  • Removed checks for Xe VM bind flags support as the bind immediate and readonly flags are already available in upstream.

  • Fixed issues where threads were not resuming in the debugger.

  • Started using a stateless kernel when appending memory fill on Xe HPC and later.

  • Redesigned the Unified Shared Memory (USM) allocation reuse mechanism by adding dedicated pools for different allocation size ranges. Additional reused allocations now create their own pools. Also started trimming device USM pools when an allocation fails.

  • Started using the preliminary prefix for the Xe debug sysfs file name and added a new debug SIP kernel type for the heapless mode.

  • Stopped creating a preemption allocation for the secondary Command Stream Receiver (CSR). Instead, the work partition allocation is now inherited from the primary root CSR.

  • Enabled the immediate flag when the resident boolean is true for vm_bind.

  • Added a workaround for an issue with reporting incorrect DRM system information.

  • Enabled indirect detection helpers.

  • Unified the IoctlHelper setup across i915 and Xe.

  • Improved handling for in-order relaxed ordering of the signalEvent to address timeout issues.
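The USM reuse redesign in the list above (dedicated pools per allocation size range, with trimming when an allocation fails) can be sketched in a few lines. This is an illustration only: the class name, the bucket bounds, and the use of plain strings in place of device allocations are all assumptions, not the runtime's design.

```python
import bisect


class SizeBucketedReusePool:
    """Caches freed allocations in one pool per size range."""

    def __init__(self, upper_bounds):
        # Each bound is the largest size (in bytes) served by that pool.
        self._bounds = sorted(upper_bounds)
        self._pools = [[] for _ in self._bounds]

    def _bucket(self, size):
        i = bisect.bisect_left(self._bounds, size)
        return i if i < len(self._bounds) else None

    def release(self, size, handle) -> bool:
        """Keep a freed allocation for reuse; False if it is too large to cache."""
        i = self._bucket(size)
        if i is None:
            return False
        self._pools[i].append((size, handle))
        return True

    def acquire(self, size):
        """Return a cached allocation that fits the request, or None."""
        i = self._bucket(size)
        if i is None:
            return None
        pool = self._pools[i]
        for idx, (cached_size, handle) in enumerate(pool):
            if cached_size >= size:
                del pool[idx]
                return handle
        return None

    def trim(self) -> int:
        """Drop every cached allocation, for example after a failed allocation."""
        freed = sum(len(pool) for pool in self._pools)
        for pool in self._pools:
            pool.clear()
        return freed
```

Keeping separate pools per size range means a small request never consumes a large cached allocation, and a lookup only scans entries of comparable size.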

Intel® Graphics Compiler

  • Extended the set of platforms on which dst and src overlap is avoided, to prevent deadlocks caused by SBID dependencies.

  • Started using the BSS mode to access the memory from SIP.

  • Initialized address register to avoid unaligned cross General Register File (GRF) access.

  • Modified the vector decomposer in the Vector Compiler (VC) to operate on 32-byte chunks, making it more efficient.

  • Reduced compilation time.

  • Moved the strlit buffer from stack to heap when parsing Virtual Instruction Set Architecture (VISA) text.

  • Disabled EnableSeparateScratchWA to remove the workaround for setting the size to the maximum of slot0 and slot1 when spill/fill and private memory were separated.

  • Updated the GTSystemInfo CIF interface version to 4, which allows passing SLMSizeInKb required to get the proper SIP kernel.

  • Updated the artifact publish method.

  • Removed obsolete functions from the ComputeShaderBase files and changed the implementation to use selectWalkOrderInPass instead.

  • Fixed marking built-in functions.

  • Converted Intel® Graphics Compiler LITs for opaque pointer support.

  • Fixed metadata for linearized arguments in the Vector Compiler (VC).

  • Added support for higher CS SIMD32 spill threshold on small shaders.

  • Emulated immediate X/Y offset for untyped 2D block messages.

  • Updated the indirect detection version to 5.

  • Resolved an issue with the promotion of built-in boolean values.

  • Disabled EnableSeparateScratchWA as it is no longer required.

  • Optimized the ldrawvector_indexed intrinsic to ldraw_indexed.

  • Corrected the destination type for intel_sub_group_2d_block_read_8b_8r16x4c to uchar.

  • Fixed an issue with the intel_get_eu_id built-in.

  • Improved building on Ubuntu 24.04.

  • Enabled selective scalarization.

  • Modified the indirect stateless detection mechanism to include accesses through pointers loaded by the GenISA load intrinsic, including bindless accesses.

  • Fixed an assertion in the Vector Compiler (VC) constant argument handling.

  • Expanded numCallerSaveEntries debug information to uint32_t to support larger kernels.

  • Fixed synchronization offset variable to use the third integer of syncBuffer as the offset variable.

  • Set the default value of JointMatrixLoadStore to 2 for systems without 2D block load/stores support.

  • Reduced usage of pointer element types.

  • Enhanced loop sinking for inttoptr and similar instructions, ensuring they are consistently sunk into loops to enable further optimizations, even when immediate pressure reduction is not apparent.

  • Refactored kernel cost model to ensure that all child loops are kept in program order.

  • Corrected Architected Register File (ARF) restrictions to prevent assertion failures prior to the HWConformity pass.

  • Resolved an issue with handling the General Register File (GRF) per input for HS and DS in SIMD32 by modifying GetMaxInputPushed instead of PushAnalysis.

  • Set the correct DST alignment for memory copy from global constants.

  • Fixed payload coalescing for SIMD32 by calculating subVar using the full relative offset through multiplication, rather than relying solely on the index.

  • Replaced the selectWalkOrder utilities with new functions to use a new SComputeShaderWalkOrder structure.

  • Updated the default ray tracing tile size to a getRayTracingTileYDim2D value that depends on the platform.

  • Updated debug registry keys related to sample_d emulation.

  • Fixed the 2D loads rescheduling rollback functionality.

  • Fixed a race condition in the atomics-based global_barrier implementation.

  • Corrected the WRITE_ALWAYS_INTERFERE instructions to consider the dep type during calculateDependence so that its dependency to other instructions can be properly handled.

  • Optimized 1D storage for large joint matrices, so that when the number of columns is a multiple of the subgroup size, it can store columns and subgroup size elements in a single operation.

  • Updated the syntax for load and store commands while reading from and writing to Shared Local Memory (SLM) in the XEHPG architecture.

  • Implemented an error report for cases when intel_reqd_sub_group_size requirements cannot be met.

  • Set the default value of SIP index to 0.

  • Removed calls to getBasicBlockList for end, begin, splice, and insert cases.

  • Removed Dot Product Accumulate Systolic (DPAS) from Intel Data Center GPU Max Series.

  • Added a fix for int16 to ensure that the appropriate number of GRFs is allocated.

  • Fixed a signed-zero issue for IGCVectorizer.

  • Implemented an early memcpy optimization pass to replace memcpy with memset when the source points to zero-initialized memory. This improves alignment analysis and keeps optimization consistent with the legacy SPIR-V (Standard Portable Intermediate Representation) translator.

  • Fixed an issue with the GenISA_WaveShuffleIndex intrinsic that occurred when src and dst were the same variables.

  • Started using vector copy instead of the oword block load on a specific device.

  • Started handling more than one General Register File (GRF) for one input.

  • Fixed kernel metadata for the legacy binary format in the Vector Compiler (VC).

  • Started porting Intel® Graphics Compiler code to LLVM16.

  • Simplified the DWARF operations used for unwinding in debugging information.

  • Removed an unnecessary mov instruction in DPAS8x8 splitting for src2 alignment in cases when src0 and dst are the same.

  • Converted Intel® Graphics Compiler LITs for opaque pointer support.

  • Fixed issues with handling GenISA_WaveBroadcast.

  • Fixed the code loop sinking rollback functionality.

  • Optimized continuous load and store instructions.

  • Legalized the vector fneg instructions.

  • Implemented a method for retrieving a resource’s dim type, ensuring that the type is created if it does not already exist.

  • Improved AA result queries for RT address space analysis.

  • Fixed an issue with flag register footprint tracking.

  • Resolved an issue that produced an incorrect result for the accumulator’s 32x64 checked_store.

  • Refactored the CS SelectWalkOrder pass.

  • Fixed an issue where Single Instruction Multiple Data (SIMD) size changed with StackOverflowDetection.

  • Fixed a barrier call in global_barrier.

  • Implemented vectorization for Dot Product Accumulate Systolic (DPAS).

  • Stopped using the L1 cache option to tune latency for scheduling.

  • Refactored cache control.

  • Fixed uniformity analysis for the GenISA_WaveBroadcast sub_group_broadcast function.

  • Added missing legalization for some patterns.

  • Updated accumulator registers save and restore syntax in Xe2 SIP.

  • Introduced support for the i64 type in the SubGroupReductionPattern pass.

  • Removed getNonOpaquePtrEltTy from CMPacketize.

  • Started using llvm::cast instead of llvm::dyn_cast in Decompose2DBlockFuncs.cpp.

  • Resolved an issue with type mismatch when calling ushort OpenCL built-ins for the bfloat16 types.

  • Extended the functionality of the CodeLoopSinking pass to support multi-instruction candidates, vector shuffle patterns, and 2D block load sinking.

  • Fixed metadata for stateful images and samplers in the Vector Compiler (VC).

  • Fixed incorrect fence scope generated for a group barrier.

  • Fixed issues in Decompose2DBlockFuncs hoisting functions for 2D block decomposition.

  • Cleaned up virtual register alignment checks in the Vector Compiler (VC).

  • Fixed the Single Instruction Multiple Data (SIMD) size for R0 copy to account for varying General Register File (GRF) sizes across platforms, ensuring the correct SIMD size for UD mov.

  • Optimized the 64-bit GenISA sequence for WaveInterleave by moving data as a 32-bit type.

  • Enabled merging of incompletely scalarized vector paths across multiple basic blocks.
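
The early memcpy optimization listed above (replacing memcpy with memset when the source points to zero-initialized memory) can be pictured with a small toy model. This is only an illustrative sketch, not the Intel® Graphics Compiler implementation: the Memcpy and Memset records, the replace_zero_memcpy function, and the zero_globals map are hypothetical names invented for this example, and the size check is an assumption about when such a rewrite would be safe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memcpy:
    """Toy instruction: copy `size` bytes from `src` to `dst`."""
    dst: str
    src: str
    size: int


@dataclass(frozen=True)
class Memset:
    """Toy instruction: fill `size` bytes at `dst` with `value`."""
    dst: str
    value: int
    size: int


def replace_zero_memcpy(instructions, zero_globals):
    """Rewrite memcpy as memset(0) when the source is known to be all zeros.

    `zero_globals` maps the name of a zero-initialized buffer to its size in
    bytes (hypothetical bookkeeping for this sketch). A copy is rewritten only
    if it reads no more bytes than the zero-initialized buffer holds.
    """
    rewritten = []
    for inst in instructions:
        if (
            isinstance(inst, Memcpy)
            and inst.src in zero_globals
            and inst.size <= zero_globals[inst.src]
        ):
            # The copied bytes are all zero, so a fill is equivalent.
            rewritten.append(Memset(inst.dst, 0, inst.size))
        else:
            rewritten.append(inst)
    return rewritten
```

In this model, a 64-byte copy from a 64-byte zero-initialized buffer becomes a 64-byte zero fill, while copies from other sources, or copies larger than the known-zero region, are left unchanged.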

Intel® Graphics Driver Backports for Linux* OS (i915)

  • Disabled per-CPU page table allocations, enforcing allocation on each operation as a baseline.

  • Fixed an issue where HPL failed on Intel® Data Center GPU Max Series.

  • Added an additional rcu_barrier call on cache release to ensure that objects are freed before the module unload completes.

  • Fixed a Multi Die Fabric Interconnect (MDFI) training issue that occurred during reboot.

  • Improved error reporting on Intel® Data Center GPU Max Series.

  • Fixed an off-by-one error that left the last entry uninitialized after a get_user_pages failure.

  • Improved error handling for clearing shared memory pages.

  • Introduced support for saving and restoring Single Root I/O Virtualization (SR-IOV) virtual functions.

  • Enabled render power gate when RC6 is disabled on Intel® Data Center GPU Max Series.
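
The off-by-one fix listed above concerns cleanup after a partial get_user_pages failure. The following simulation shows the general class of bug rather than the driver code itself: pin_pages, pin_one, and the UNINIT marker are hypothetical names, and the details of the real kernel path are not described in these notes.

```python
UNINIT = "UNINIT"  # stands in for uninitialized memory in this simulation


def pin_pages(addresses, pin_one):
    """Try to pin every address; on failure, reset all unpinned slots.

    `pin_one` is a hypothetical callback that returns a page handle or
    raises OSError. Returns (slots, pinned_count).
    """
    n = len(addresses)
    slots = [UNINIT] * n
    pinned = 0
    try:
        for i, addr in enumerate(addresses):
            slots[i] = pin_one(addr)
            pinned += 1
    except OSError:
        # Reset every slot from the failing index through the end.
        # An off-by-one bound such as range(pinned, n - 1) would leave
        # the final slot holding UNINIT.
        for j in range(pinned, n):
            slots[j] = None
    return slots, pinned
```

With four addresses and a failure on the third, the first two slots hold page handles and the last two are both reset, so no slot is left in the uninitialized state.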

Intel® Graphics Memory Management Library

  • Added a check to verify if BaseWidth and BaseHeight are set to zero, enhancing error handling and ensuring proper configuration.

  • Implemented caching demotion for Xe2.

Intel® ME TEE Library

  • Fixed an issue that caused multiple connections to be opened for different clients.

  • Updated libmei to version 1.6.3 to resolve issues with traffic dumping in release mode and with building on 32-bit Linux.

Intel® Media Driver for VAAPI

  • Resolved an intermittent crash that occurred during AVC decoding when compression was enabled.

  • Fixed incorrect maximum resolution in MPEG2 decoding on the Tiger Lake platform.

  • Corrected the watchdog timer on Meteor Lake to prevent media resets during high-bitrate bitstream decoding.

  • Fixed a driver crash that occurred when AV1 encoding used more than 10 tile groups.

  • Added block size selection and capability reporting for VP9 encoding segmentation.

  • Improved the de-interlace video processing filter to correctly report capabilities.

  • Enhanced report key handling to prevent multi-process read/write conflicts.