Faster AI, Real-Time Graphics, & Smarter HPC: Intel Developer Tools 2025.1!
March 27, 2025 | Intel® Software Development Tools | AI Frameworks and Tools
Intel's latest generation of AI Tools and Frameworks, together with the Intel® oneAPI Toolkits 2025.1, has been released. Check out how we help developers build AI and HPC applications faster and with confidence.
Turbo-Charge Performance, Productivity, and Code Quality Across Software Projects
- From the latest optimizations for Intel® Xeon® 6 processors, AI inference and training acceleration with Intel® oneAPI Deep Neural Network Library (oneDNN), and deep learning inference profiling on Intel® Core™ Ultra processors to the latest contributions and FlexAttention support in PyTorch* on Intel processors, the latest release has you covered.
- Use enhanced SYCL* interoperability with Vulkan* and Microsoft DirectX* 12 for real-time graphics, visual AI, and rendering performance.
- Ensure code quality for compute kernels running on CPUs and GPUs with sanitizers and expanded code coverage. Take advantage of the latest C/C++, Fortran, SYCL*, and OpenMP* support with our LLVM-based compiler technology.
- Improve the efficiency of GPU-initiated communications in your multi-node distributed configuration for massive dataset computations with the new Intel® SHMEM library.
- Meet your programming and workload needs by combining these cross-disciplinary tools, libraries, and popular frameworks based on and powered by oneAPI.
Find out More, Download and Explore
Solutions for AI:
AI Frameworks and Tools | AI Tools Selector
Solutions for General Compute:
oneAPI Overview | Explore Developer Toolkits
Optimize AI Performance from Data Center to PC:
Achieve significant performance improvements with PyTorch* 2.6 and other leading deep learning frameworks on the latest Intel CPU and GPU platforms:
- Developers and researchers seeking to fine-tune, perform inference, and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics are now able to install PyTorch directly with binary releases for Windows, Linux*, and Windows Subsystem for Linux 2.
- FlexAttention support for x86 CPUs was added through the TorchInductor CPP backend. This extends the existing CPP template capabilities to support broad attention variants (for example, PagedAttention, which is critical for LLM inference) based on the existing FlexAttention API, optimizing performance on CPUs.
- Float16 support on x86 CPUs was first introduced in PyTorch 2.5 as a prototype feature. It has since been improved for both eager mode and torch.compile with the Inductor backend, and is now released in beta for broader adoption (see the short sketch after this list).
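To make the float16 and torch.compile path above concrete, here is a minimal, hedged sketch. It assumes a PyTorch 2.6 (or newer) build installed from the Intel-enabled binaries; the model, tensor sizes, and the xpu fallback logic are illustrative only.

```python
import torch

# Minimal sketch, assuming a PyTorch 2.6+ build with Intel GPU (xpu) support.
# Falls back to the CPU float16 path (beta) when no Intel GPU is available.
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 256),
).to(device=device, dtype=torch.float16).eval()

compiled = torch.compile(model)  # TorchInductor backend

x = torch.randn(8, 1024, device=device, dtype=torch.float16)
with torch.no_grad():
    y = compiled(x)
print(y.shape, y.dtype)
```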
Leverage the latest Intel® processors with oneDNN optimizations for deep learning inference:
- Experience enhanced matrix multiplication and convolution performance with oneDNN, leveraging the power of Intel® Xeon® processor architectures equipped with Intel® Advanced Matrix Extensions (Intel® AMX) instruction set.
- Unlock improved performance for AI applications on Intel® Arc™ Graphics, maximizing capabilities of Intel Core Ultra processors (Series 2) and Intel Arc B-series discrete graphics.
- Increase both speed and efficiency for Gated Multi-Layer Perceptron (Gated MLP) and Scaled Dot-Product Attention (SDPA) with an implicit causal mask, and optimize AI models further with support for int8 or int4 compressed key and value through the oneDNN Graph API.
Remove AI PC performance bottlenecks and streamline distributed deep learning workloads with Intel® VTune™ Profiler:
- Identify performance bottlenecks in AI workloads that call DirectML or WinML APIs.
- Improve analysis of distributed deep learning and Python workloads and pinpoint the most time-consuming code sections and critical code paths for Python 3.12.
Deliver real-time visual AI experiences for gaming, graphics, and digital content creation using C++ with SYCL:
- Intel® oneAPI DPC++/C++ Compiler and Intel® DPC++ Compatibility Tool deliver enhanced SYCL interoperability with Vulkan and DirectX 12.
- This enables the sharing of image texture map data directly from the GPU, eliminating extra image copying between CPU and GPU, ensuring seamless performance in image processing and advanced rendering applications, and boosting content creation productivity.
Improve performance, productivity, and code quality for HPC and accelerated compute:
More code stability, security, and productivity with Intel® Compilers:
- The Intel® oneAPI DPC++/C++ Compiler and Intel® Fortran Compiler extend CPU MemorySanitizer support to the device side, including GPUs. This enables developers to easily detect and troubleshoot issues in both CPU and device code, ensuring more reliable and robust applications.
- The Intel oneAPI DPC++/C++ Compiler now supports compiler caching to significantly speed up build times. By caching previous compilations and reusing them, you benefit from faster iterations and more efficient workflows, letting you focus on writing high-quality code rather than waiting for builds.
- The Intel oneAPI DPC++/C++ Compiler's code coverage tool now includes GPU support and enhanced CPU coverage for applications using C/C++, SYCL, and OpenMP*.
- The Intel® Fortran Compiler expands its OpenMP 6.0 standard support by introducing the WORKDISTRIBUTE construct to efficiently distribute work across threads and the INTERCHANGE construct to reorder loops in a loop nest, boosting parallel performance and code optimization.
- It also moves forward with even more Fortran 2023 support; increased standard conformance adds code flexibility by ensuring consistent kind types for integer arguments in the SYSTEM_CLOCK intrinsic and allowing PUBLIC NAMELIST groups to include PRIVATE variables.
More efficient accelerated parallel application issue resolution using the Intel® Distribution for GDB*:
- More precise debug-thread control and an enhanced debugging experience with default settings for scheduler-locking in the VS Code* IDE when debugging on a Linux* machine.
- Efficient debugging of applications targeting AI PCs and compute kernel GPU offload, with support for the latest Intel® Core™ Ultra processors (Series 2) and Intel® Arc™ B-Series Graphics on both Linux* and Windows*.
→ Try the new software development tools versions now for free in a hosted Jupyter notebook on Intel® Tiber™ AI Cloud and start exploring the possibilities.
More efficient GPU-initiated communications with MPI and the new Intel® SHMEM library product release:
- Target multi-node accelerator devices and hosts with OpenSHMEM 1.5-compliant features, including point-to-point Remote Memory Access (RMA), OpenSHMEM 1.6 strided RMA operations, Atomic Memory Operations (AMO), Signaling, Memory Ordering, Teams, Collectives, and Synchronization.
- Simplify distributed multi-node SYCL* device access with Intel SHMEM SYCL queue-ordered RMA and SYCL host USM access using a symmetric heap API.
- Intel® MPI Library now also supports device-initiated MPI-RMA functions on supported GPUs in advance of the emerging MPI standard.
AI Frameworks and Tools | Explore Developer Toolkits
Get the 2025.0 AI Tools and Optimized Frameworks
December 19, 2024 | AI Frameworks and Tools
Take GenAI Productivity, Acceleration, and Scaling to the Next Level
Today, Intel released its 2025.0 AI Tools and optimized frameworks, driving AI efficiency and scaling forward through a strong focus on GenAI quantization and inference. The tools support open standards, with numerous contributions to the PyTorch* and TensorFlow* ecosystems, as well as optimizations for the latest Intel® Xeon® 6 Processor with P-Cores, Intel® Arc™ B-Series Graphics, Intel GPUs and built-in accelerators.
By embracing open source AI frameworks and deep learning models, our latest generation AI tools make it easy for developers to take their existing codebase and scale it to support the latest Intel technology without losing backward compatibility.
A rich set of libraries and utilities powered by oneAPI forms a reliable, flexible, expandable, and highly optimized foundation for quickly creating or migrating highly performant AI workloads.
Embrace Open AI Frameworks
- PyTorch* Optimizations from Intel are upstreamed to PyTorch 2.5 with native out-of-the-box support for Intel® Xeon® 6 and Intel® Core™ Ultra processors, Intel® Data Center GPU Max Series, and client GPUs. The latest in CPU performance for PyTorch inference is now available in PyTorch 2.5, with support for torch.compile including optimizations for the current processor generation and the ability to compile TorchInductor output with the Intel® oneAPI DPC++/C++ Compiler for increased performance. Intel® Extension for PyTorch adds customized kernel support for Large Language Model (LLM) optimization and integrates Intel® Extension for DeepSpeed*.
- TensorFlow* Optimizations from Intel extend the official TensorFlow capabilities, using the Intel® oneAPI Deep Neural Network Library (oneDNN) to boost performance and scalability. TensorFlow workloads can run with performance optimizations and efficiency on the latest Intel CPUs and GPUs.
- The JAX* Python* library has been integrated into the AI Tools, enabling efficient and flexible numerical computation with automatic differentiation. This inclusion enhances the product's capabilities for machine learning, scientific computing, and optimization tasks, leveraging JAX's high-performance just-in-time compilation. Seamlessly run JAX models on Intel® Data Center GPUs with Intel® Extension for OpenXLA*, an Intel-optimized PyPI package based on the PJRT plugin mechanism (a short sketch follows this list).
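For orientation, a minimal, hedged sketch of running a jitted JAX function is shown below. It assumes jax and the intel-extension-for-openxla PJRT plugin are installed; how the Intel GPU devices are named in jax.devices() may vary by version.

```python
import jax
import jax.numpy as jnp

# Minimal sketch, assuming the intel-extension-for-openxla PJRT plugin is installed;
# jax.devices() should then list Intel GPU devices alongside the CPU.
print(jax.devices())

@jax.jit  # just-in-time compiled through XLA / OpenXLA
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((512, 512), dtype=jnp.float32)
x = jnp.ones((8, 512), dtype=jnp.float32)
print(predict(w, x).shape)
```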
Optimize Model Size and Performance
- Intel® Neural Compressor performs model optimization to reduce the model size and increase the speed of deep learning inference for deployment on CPUs or GPUs.
- New features include a Transformers-like quantization API for weight-only quantization on LLMs, which offers a one-stop experience for quantization and inference on Intel hardware, and INT4 quantization of visual language models (VLMs), like LLaVA, Phi-3-vision, and Qwen-VL, with the AutoRound algorithm.
- Additional improvements for Intel Neural Compressor include support for loading and converting AWQ-format INT4 models for PyTorch inference in the Transformers-like API, AutoRound-format export for INT4 models, and per-channel INT8 post-training quantization for PyTorch 2 Export (PT2E). A hedged sketch of the weight-only quantization flow follows this list.
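The following is a hedged sketch of what the Transformers-like weight-only INT4 flow can look like. The import path, AutoRoundConfig class, and model ID are assumptions based on the description above; check the Intel Neural Compressor documentation for the exact interface in your installed version.

```python
# Hedged sketch of weight-only (INT4) quantization with a Transformers-like API.
# The import path, config class, and model name are illustrative assumptions only.
from neural_compressor.transformers import AutoModelForCausalLM, AutoRoundConfig

quant_config = AutoRoundConfig(bits=4)   # AutoRound-based INT4 weight-only quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model ID
    quantization_config=quant_config,
)
model.save_pretrained("llama2-7b-int4")  # export the quantized model for later inference
```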
Boost GenAI Developer Productivity
- AI Tools Containers have been redefined based on the main Python framework or library they support. This reduces container size and simplifies usage, ensuring a more streamlined and efficient environment for developers and users alike.
- Intel® oneAPI Deep Neural Network Library (oneDNN) maximizes efficiency and performance with tailored optimizations for the latest Intel platforms—spanning server, desktop, and mobile—including significantly faster performance for large language models (LLMs) and Scaled Dot-Product Attention subgraphs.
- Intel® oneAPI Data Analytics Library (oneDAL) enables calculation of SHAP (SHapley Additive exPlanations) values for binary classification models, which are required for explainability of random forest (RF) algorithms.
- Intel® oneAPI Collective Communications Library (oneCCL) enables workloads to scale and perform even better. Important enhancements have been made to key collectives such as AllGather, AllReduce, and Reduce-Scatter. Enhancements to oneCCL's Key-Value store improve communication among ranks, allowing workloads to scale up to an even larger number of nodes.
- Intel® Extension for Scikit-learn* adds new interfaces for covariance and statistics algorithms and introduces sparse data support for the LogisticRegression algorithm (see the sketch after this list).
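As a small illustration of the scikit-learn extension in use, here is a hedged sketch with sparse input. It assumes scikit-learn-intelex, scikit-learn, and SciPy are installed; the dataset shape is illustrative, and which estimators and input formats are accelerated depends on the installed version.

```python
from sklearnex import patch_sklearn
patch_sklearn()  # route supported scikit-learn estimators to Intel-optimized kernels

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

# Illustrative sparse dataset (CSR format) with a binary target.
X = sparse_random(10_000, 100, density=0.05, format="csr", random_state=0)
y = np.random.default_rng(0).integers(0, 2, size=10_000)

clf = LogisticRegression(max_iter=200).fit(X, y)
print(clf.score(X, y))
```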
The 2025.0 Intel® Software Development Tools Are Here, Marking the 5th Anniversary of oneAPI
November 17, 2024 | Intel® Software Development Tools
The highly productive development stack for AI, HPC, and open accelerated computing
Today, Intel released its 2025.0 developer tools—all powered by oneAPI—marking the 5th anniversary of the oneAPI programming model with expanded performance optimizations and open-standards coverage to support the latest innovations in multiarchitecture, hardware-agnostic software development and deployment, edge to cloud.
3 Key Benefits
- More Performance on Intel Platforms – Achieve up to 3x higher GenAI performance on 6th Gen Intel® Xeon® processors (P-cores) with oneDNN, Intel-optimized AI frameworks, and Intel® AMX1; achieve up to 2.5x better HPCG performance with MRDIMM2 and oneMKL; develop high-performance AI on the PC—including LLM development—with optimized tools to unlock the power of Intel® Core™ Ultra processors (Series 2); and improve security and encryption with Intel® Cryptography Primitives Library.
- More Access to Industry-Standard Tools – Get even more from your existing development workflows using industry-leading AI frameworks and performance libraries with even more built-in Intel optimizations, including native support for PyTorch 2.5 on CPUs and GPUs; achieve optimal performance across CPU, GPU, and AI accelerators from the latest LLMs—Llama 3.2, Qwen2, Phi-3, and more—with Intel AI tools; and streamline your software setup with our toolkit selector to install full kits or right-sized sub-bundles.
- More Hardware Choices – Enjoy increased multi-vendor, multiarchitecture support, including faster CUDA*-to-SYCL* migration with the Intel® DPC++ Compatibility Tool that auto-migrates over 100 APIs used by popular AI, HPC, and rendering apps; achieve near-native performance on CPU and GPU for numeric compute with Intel® Distribution for Python; get 4x speedup of GPU kernels for algorithms with oneDPL; and gain future system flexibility and prevent lock-in through cross-hardware AI-acceleration libraries, including Triton, JAX, and OpenXLA*.
The Nuts & Bolts
Here's the collection for those interested in diving into the component-level details.
Compilers
- Intel oneAPI DPC++/C++ Compiler adds optimizations tailored for Intel® Xeon® 6 processors and Intel® Core™ Ultra processors, enables dynamic execution and flexible programming for Intel GPUs with new SYCL Bindless Textures support, streamlines development with new LLVM sanitizers to detect and troubleshoot device code issues, and enhances OpenMP standards conformance for 5.x and 6.0, plus adds a more user-friendly optimization report that includes OpenMP offloading details.
- Intel® Fortran Compiler adds several enhancements, including Fortran 2023 standard features such as the AT Edit Descriptor for cleaner output, conditional TEAMS construct execution with the new IF clause for OpenMP 6.0, support for arrays of co-arrays, and a “standard-semantics” option to precisely control application standards compliance; it also updates the Fortran Developer Guide and reference documentation with refreshed code samples and added coverage of Fortran 2018 and 2023 language features.
Performance Libraries
- Intel® oneAPI Math Kernel Library (oneMKL) introduces performance optimizations across multiple domains—BLAS, LAPACK, FFT, and others—for developers targeting Xeon 6 processors with P-cores. It also adds significant improvements for HPC workload execution using single-precision 3D real in-place FFT on Intel® Data Center GPU Max Series and makes available new distribution models and data types for RNG using SYCL device API.
- Intel® oneAPI Data Analytics Library (oneDAL) enables calculation of SHAP (SHapley Additive exPlanations) values for binary classification models, which are required for explainability of random forest (RF) algorithms.
- Intel® oneAPI Deep Neural Network Library (oneDNN) maximizes efficiency and performance with tailored optimizations for the latest Intel® platforms—spanning server, desktop, and mobile—including significantly faster performance for large language models (LLMs) and Scaled Dot-Product Attention subgraphs.
- Intel® oneAPI Threading Building Blocks (oneTBB) improves scalability for task_group, flow_graph, and parallel_for_each so multi-threaded applications run faster; introduces try_put_and_wait experimental API for faster results using oneTBB flow graph to process overlapping messages on a shared graph.
- Intel® oneAPI Collective Communications Library (oneCCL) improves workload performance and scalability with enhancements to Key-Value store, which allows workloads to scale up to an even larger number of nodes, and performance improvements to key collectives such as Allgather, Allreduce, and Reduce-scatter.
- Intel® MPI Library offers a full MPI 4.0 implementation, including partitioned communication, improved error handling, and Fortran 2008 support; and improves scale-out/scale-up performance on both Xeon 6 processors with P-core pinning and Intel GPUs via optimizations for MPI_Allreduce.
- Intel® oneAPI DPC++ Library (oneDPL) accelerates GPU kernels up to 4x3 for algorithms including reduce, scan and many other functions. Range-based algorithms with over 20 new C++20 standard ranges and views accelerate highly parallel code execution on multiarchitecture devices.
- Intel® Integrated Performance Primitives (Intel® IPP) adds CET-enabled protection (Control-flow Enforcement Technology), cutting-edge, hardware-enforced security measures that safeguard software against attacks and exploitation risks.
- Intel® Cryptography Primitives Library (formerly Intel® IPP Cryptography) enables developers to dispatch on Xeon 6 processors, turbocharging RSA encryption (2k, 3k, 4k) with multi-buffer capabilities and hashing with an enhanced SM3 algorithm.
Analyzers & Debuggers
- Intel® DPC++ Compatibility Tool saves time and effort when migrating CUDA code and CMake build script to SYCL via auto-migration of more APIs used by popular AI, HPC, and rendering applications; migrated code is easy to comprehend with SYCLcompat, easy to debug with CodePin, and runs performantly on NVIDIA GPUs.
- Intel® VTune™ Profiler adds support for Intel Xeon 6 processors with P-cores and Core Ultra processors (Series 2), plus profiling support for Python 3.11, improving productivity with the ability to focus Python profiling on areas of interest and control performance data collection with Intel® ITT APIs.
- Intel® Advisor increases developers’ ability to identify bottlenecks, optimize code, and achieve peak performance on the latest Intel platforms; introduces a more adaptable kernel-matching mechanism—flexible kernel matching and XCG integration—to identify and analyze code regions relevant to specific optimization goals.
- Intel® Distribution for GDB* rebases to GDB 15, staying current and aligned with the latest enhancements supporting effective application debug; adds support for Core Ultra processors (Series 2) on Windows*; and enhances developer experience, both on the command line and when using Microsoft* Visual Studio and Visual Studio Code*, by boosting the debugger performance and refining the user interface.
AI & ML Tools, Frameworks, and Accelerated Python
- Intel® Distribution for Python* provides drop-in, near-native performance on CPU and GPU for numeric compute; Data Parallel Extension for Python (dpnp) and Data Parallel Control (dpctl) expand compatibility, adding NumPy 2.0 support in the runtime and providing asynchronous execution of offloaded operations (see the sketch after this list).
- Intel AI Tools latest release ensures current and future GenAI foundation models—Llama 3.2, Qwen2, Phi-3 family, and more—perform optimally across Intel CPUs, GPUs, and AI accelerators.
- Triton (open source GPU programming for neural networks) enables developers to achieve peak performance and kernel efficiency on Intel GPUs thanks to it being fully optimized for Intel Core Ultra and Data Center GPU Max Series processors and available upstream in stock PyTorch.
- Native support for PyTorch 2.5 is available on Intel's Data Center GPUs, Core Ultra processors, and client GPUs, where it can be used to develop on Windows with out-of-the-box support for Intel® Arc™ Graphics and Intel® Iris® Xe Graphics GPUs.
- Simplify enterprise GenAI adoption and reduce the time to production of hardened, trusted solutions by adopting the open platform project, OPEA, part of LF AI & Data. Now at release 1.0, OPEA continues to gain momentum with over 40 partners, including AMD, BONC, ByteDance, MongoDB, and Rivos.
- Seamlessly run JAX models on Intel® Data Center GPU Max and Flex with Intel® Extension for OpenXLA*, an Intel-optimized PyPI package based on PJRT plugin mechanism.
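To ground the accelerated Python stack mentioned in the first bullet, here is a minimal, hedged sketch using dpnp and dpctl; it assumes both packages are installed and a SYCL-capable device (CPU or Intel GPU) is available.

```python
import dpctl
import dpnp as np  # NumPy-like API whose arrays live on SYCL devices

# Minimal sketch, assuming a SYCL device is available; the output shows which
# device dpctl selects by default (CPU or GPU).
print(dpctl.select_default_device())

x = np.arange(1_000_000, dtype=np.float32)
y = np.sqrt(x) + 1.0   # executed on the selected device
print(y.sum())
```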
Footnotes
1 See [9A2] at intel.com/processorclaims: Intel® Xeon® 6. Results may vary.
2 See [9H10] at intel.com/processorclaims: Intel® Xeon® 6. Results may vary.
3 See oneDPL product page
oneAPI Turns 5!
November 17, 2024 | What is oneAPI?, oneAPI Developer Page
Happy 5th Anniversary to the open, standards-based, multiarchitecture programming initiative for accelerator architectures
Launched at Supercomputing 2019, the oneAPI initiative not only fostered permanent change in how the global developer ecosystem approaches heterogeneous programming, it has also become the foundation for building, optimizing, and deploying high-performance software that can run on any vendor architecture.
With hundreds of contributors, over 4.3 million installations, and 6.8 million developers using it via Intel® Software and AI Tools (explore the 2025.0 release), oneAPI is arguably one of the most prominent programming standards, a point further underscored by its adoption in 2023 by the Unified Acceleration (UXL) Foundation, hosted by the Linux Foundation. UXL’s mission: to deliver an open-standard accelerator programming model that simplifies development of performant, cross-platform applications. It marks yet another critical step in driving innovation, with oneAPI as a key component.
All that in just 5 years. (Imagine what the next 5 will bring.)
If you haven’t tried oneAPI, you can get the gist of it here and download the 2025.0 tools here.
Celebrating oneAPI’s 5th Anniversary – What the Ecosystem is Saying
The 5th Anniversary of oneAPI is an opportunity to recognize both the technical depth of the ecosystem, which enables applications to run on different hardware, and how it succeeds in forming a community around HPC, AI, API standards, and portable applications.
oneAPI has revolutionized the way we approach heterogeneous computing by enabling seamless development across architectures. Its open, unified programming model has accelerated innovation in fields from AI to HPC, unlocking new potential for researchers and developers alike. Happy 5th Anniversary to oneAPI!
Intel's commitment to their oneAPI software stack is a testament to their developer-focused, open-standards commitment. As oneAPI celebrates its 5th Anniversary, it provides comprehensive and performant implementations of OpenMP and SYCL for CPUs and GPUs, bolstered by an ecosystem of library and tools to make the most of Intel processors.
Celebrating 5 years of oneAPI. In ExaHyPE, oneAPI has been instrumental in implementing the numerical compute kernels for hyperbolic equation systems, making a huge difference in performance, with SYCL providing the ideal abstraction and agnosticism for exploring these variations. This versatility enabled our team, together with Intel engineers, to publish three distinct design paradigms for our kernels.
Happy 5th Anniversary, oneAPI! We’ve been partners since the private beta program in 2019. We are currently exploring energy-efficient solutions for simulations in material science and data analysis in bioinformatics with different accelerators. For that, the components of oneAPI, its compilers with backends for various GPUs and FPGAs, oneMKL, and the performance tools VTune Profiler and Advisor, are absolutely critical.
GROMACS was an early adopter of SYCL as a performance-portability backend, leveraging it to run on multi-vendor GPUs. Over the years, we’ve observed significant improvements in the SYCL standard and the growth of its community. This underscores the importance of open standards in computational research to drive innovation and collaboration. We look forward to continued SYCL development, which will enable enhancements in software performance and increase programmer productivity.
See other testimonials:
Announcing General Availability of Object Storage on Intel® Tiber™ AI Cloud
October 17, 2024 | Intel® Tiber™ AI Cloud
Today Intel announced the availability of a new object storage service on its AI Cloud, providing scalable, durable, and cost-effective data storage that meets the demanding requirements of modern data and AI workloads.
It’s built on the powerful and open source MinIO platform, which is compatible with the S3 API (AWS’ Simple Storage Service), ensuring easy integration with existing applications and tools.
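Because the service speaks the S3 API, existing S3 tooling should work against it. Below is a hedged sketch using boto3; the endpoint URL, bucket name, and credentials are placeholders, not real values published by the service.

```python
import boto3

# Hedged sketch: the endpoint, bucket, and credentials below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstorage.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.upload_file("checkpoint.bin", "my-bucket", "models/checkpoint.bin")
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="models/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```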
Customer benefits include:
- Scalability & flexibility – Can handle massive data storage needs, whether gigabytes or petabytes, to ensure your storage infrastructure grows with your business.
- Performance – Optimized for fast data access and retrieval, ensuring data is always accessible and can be processed quickly, including AI/ML workloads.
- Cost-effective storage – Enables businesses of all sizes to store vast amounts of data without breaking the bank.
- Robust security – Incorporates encryption at rest and in transit and includes robust access controls.
- Easy integration – Is purpose-built to integrate seamlessly with your existing workflows and applications spanning backup and recovery, data archiving, data lake use, and more.
- Enhanced data management – Manage your data efficiently with features like versioning, lifecycle policies, and metadata management.
Inflection AI Launches Enterprise AI Running on Intel® Gaudi® 3 and Intel® Tiber™ AI Cloud
October 7, 2024 | Inflection AI-Intel collaboration, Intel® Tiber™ AI Cloud
New collaboration delivers turnkey AI-powered platform to drive high-impact results for enterprises
Today Inflection AI and Intel announced a collaboration to accelerate the adoption and impact of AI for the world’s largest enterprises. Inflection AI is launching Inflection 3.0, an industry-first, enterprise-grade AI platform, delivering empathetic, conversational and employee-friendly AI capabilities—powered by Intel® Gaudi® 3 accelerators on Intel® Tiber™ AI Cloud—that provides the control, customization, and scalability required for complex, large-scale deployments.
“Together, we’re giving enterprise customers ultimate control over their AI,” said Markus Flierl, CVP of Intel Tiber Cloud Services. “By integrating Inflection AI with Intel Tiber AI Cloud and Gaudi 3, we are providing an open ecosystem of software, price and performance, and scalability, unlocking the critical roadblocks to enterprise AI adoption and the secure, purpose-built, employee-specific, and culture-oriented AI tools customers need.”
Why it matters
Building an AI platform is complex, requiring extensive infrastructure; time to develop, train, and fine-tune models; and a multitude of engineers, data scientists, and application developers.
With Inflection 3.0, enterprise customers now have access to a complete AI platform that supercharges their employees with a virtual AI co-worker trained on their company data, policies and culture. And running it on Gaudi 3 in the Intel Tiber AI cloud offers high performance, robust software and efficiency, ultimately delivering industry-leading performance, speed and scalability in a cost-effective way for high-impact results.
Intel Launches Xeon 6 and Gaudi 3, Enabling the Next Generation of AI Solutions
September 24, 2024 | Xeon 6 with P-Cores, Gaudi 3 AI Accelerator
Today, Intel launched Intel® Xeon® 6 processors with Performance cores (P-cores) and Intel® Gaudi® 3 AI accelerators, bolstering the company’s commitment to deliver powerful AI systems with optimal performance-per-watt and lower TCO.
Highlights of these two major updates to Intel’s AI-focused data center portfolio include:
- Intel Xeon 6 with P-cores is designed to handle compute-intensive workloads with exceptional efficiency, delivering twice the performance of its predecessor1. It features increased core count, double the memory bandwidth, and AI acceleration capabilities embedded in every core.
- Intel Gaudi 3 AI Accelerator is specifically optimized for large-scale generative AI, boasting 64 Tensor processor cores and 8 matrix multiplication engines to accelerate deep neural network computations. It includes 128 gigabytes of HBM2e memory for training and inference and 24 ports of 200-Gigabit Ethernet for scalable networking, and it offers up to 20% more throughput and 2x price/performance vs. NVIDIA H100 for inference of Llama 2 70B2.
Seekr Launches Self-Service AI Enterprise Platform on Intel
September 4, 2024 | SeekrFlow, Intel® Tiber Developer Cloud
Deploy trusted AI with Seekr at a superior price-performance running on Intel® Tiber™ Developer Cloud
Today Seekr announced its enterprise-ready platform, SeekrFlow, is now available in the Intel® Tiber™ Developer Cloud, running on high-performance, cost-efficient Intel® Gaudi® AI accelerators.
SeekrFlow is a complete end-to-end platform for training, validating, deploying, and scaling trusted enterprise AI applications, reducing the cost and complexity of AI adoption and lessening hallucinations.
Why it matters
In short, customer advantage.
By using Intel’s cloud for developing and deploying AI at scale while also leveraging the power of SeekrFlow to run Trusted AI—and doing this all in one place—customers gain excellent price-performance, access to Intel CPUs, GPUs and AI accelerators, and flexibility with an open AI software stack.
Deliver AI Faster on Next-Gen Intel® Core™ Ultra AI PCs
September 3, 2024 | Jumpstart AI Development, Develop for the AI PC
Today Intel introduced next-gen Intel® Core™ Ultra processors (code-named Lunar Lake), revealing breakthroughs in efficiency, compute, and AI performance in the latest AI PCs.
ISVs, developers, AI engineers, and data scientists can take advantage of the client platform’s AI horsepower for their work—AI PCs are great for developing and optimizing models, applications, and solutions.
- Simplify and accelerate AI training and inference using open source foundational models, optimized frameworks like PyTorch and TensorFlow, and the OpenVINO™ toolkit (see the sketch after this list).
- Tap into the AI PC’s cutting-edge capabilities such as Intel® AVX-512 and Intel® AI Boost by leveraging Intel® Software Development Tools to gain performance and development productivity.
- Port your existing CPU/GPU code using oneAPI heterogeneous programming and optimize it to run faster while drawing up to 40% less power.
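To show what targeting the AI PC's accelerators can look like, here is a hedged OpenVINO sketch; the model path is a placeholder, and which devices appear (CPU, GPU, NPU) depends on the machine, drivers, and OpenVINO version.

```python
import openvino as ov

core = ov.Core()
print(core.available_devices)  # for example, ['CPU', 'GPU', 'NPU'] on an AI PC

# Placeholder model path; any OpenVINO IR or ONNX model can be used here.
model = core.read_model("model.xml")
target = "NPU" if "NPU" in core.available_devices else "CPU"
compiled = core.compile_model(model, target)
print(compiled.inputs)
```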
Before the end of 2024, Intel Core Ultra processor-based platforms with integrated software development kits (SDKs) will also be available in Intel® Tiber Developer Cloud.
AI Everywhere: 2024.2 Intel® Software Development & AI Tools Are Here
August 9, 2024 | Intel® Software Development Tools, Intel® Tiber™ Developer Cloud
The fast path to performant, production-ready AI
The latest release of Intel’s oneAPI and oneAPI-powered AI tools is tuned to help developers more easily deliver high-performance AI applications (and HPC, too) with faster time-to-solution, increased hardware choice, and improved reliability. And for building and deploying AI in a production cloud environment, check out new hardware and services in Intel® Tiber™ Developer Cloud.
3 Key Benefits
- Faster, More Responsive AI – Achieve up to 2x higher GenAI performance on upcoming Intel® Xeon® 6 processors (P-cores) with oneDNN, Intel-optimized AI frameworks, and Intel® AMX3, and up to 1.6x better performance for workloads including analytics and media (with Xeon 6 E-cores)4. Experience improved LLM inference throughput and scalability on AI PCs – including upcoming client processors (codenamed Lunar Lake) for unmatched future-ready AI compute, and 3.5x AI throughput over the previous generation5. The tools support 500+ models such as Llama 3.1 and Phi-3. Deploy and scale production AI on a managed, cost-efficient infrastructure with Intel Tiber Developer Cloud.
- Greater Choice & Control – Maximize performance for AI and HPC workloads on all Intel CPUs and GPUs through continued upstream optimizations to industry-standard AI frameworks. Run and deploy PyTorch 2.4 on Intel GPUs with minimal coding efforts for easier deployment on ubiquitous hardware. Increase application efficiency and control through optimizations in oneMKL, oneTBB, and oneDPL and enhanced SYCL* Graph capabilities in Intel® oneAPI DPC++/C++ Compiler. This release introduces broad tools support for Xeon 6 (E-cores and upcoming P-cores) and Lunar Lake processors for accelerating AI, technical, enterprise, and graphics compute workloads.
- Simplified Code Optimization – Speed up AI training and inference performance with Intel® VTune™ Profiler’s platform-aware optimizations, wider framework support, and support for new hardware, including processors codenamed Grand Ridge. For easier CUDA* code porting to SYCL*, automatically migrate 100+ more CUDA APIs with the Intel® DPC++ Compatibility Tool; and pinpoint inconsistencies in CUDA-to-SYCL code migration using CodePin instrumentation.
The Nuts & Bolts
For those interested in diving into component-level details, here’s the collection. Foundational tools are bundled in the Intel® oneAPI Base Toolkit and Intel® HPC Toolkit. For AI tools, get just what you need with the AI Tools Selector.
Compilers
- Intel oneAPI DPC++/C++ Compiler includes enhanced SYCL Graph capabilities featuring pause/resume support for better control and increased performance tuning; delivers more SYCL performance on Windows* with default context enabled; and introduces SPIR-V support and OpenCL™ query support with the latest release of the kernel compiler for greater compute kernel flexibility and optimization.
- Intel® Fortran Compiler adds integer overflow control options (-fstrict-overflow, /Qstrict-overflow[-], and -fno-strict-overflow) to ensure correct functionality; expands conformance enhancements for the latest OpenMP standards, including 5.x and 6.0, for increased thread-usage control and more powerful loop optimizations; and adds OpenMP runtime library extensions for memory management, performance, and efficiency.
Libraries
- Intel® Distribution for Python* adds sorting and summing functions to the Data Parallel Control Library for improved productivity; and provides a new family of cumulative and improved linear algebra functions to Data Parallel Extension for NumPy* for increased performance.
- Intel® oneAPI Deep Neural Network Library (oneDNN) delivers production-quality optimizations that increase performance on Intel’s AI-enhanced client processors and server platforms, and boosts AI workload efficiency with support for int8 and int4 weight decompression in matmul, which accelerates LLMs for faster insights and results.
- Intel® oneAPI Math Kernel Library (oneMKL) introduces enhanced performance of 2D and 3D real and complex FFT targeted for Intel® Max Series GPUs.
- Intel® oneAPI Data Analytics Library (oneDAL) extends sparsity functions across its algorithms by adding DPC++ sparse gemm and gemv primitives and sparsity support for the logloss function primitive.
- Intel® oneAPI DPC++ Library (oneDPL) adds new C++ Standard Template Library inclusive_scan algorithm extension, which enables developers to write parallel programs for multiarchitecture devices and improves existing algorithms on Intel and other vendor GPUs.
- Intel® oneAPI Collective Communications Library (oneCCL) introduces multiple enhancements that improve system resources utilization such as memory and I/O for even better performance.
- Intel® oneAPI Threading Building Blocks (oneTBB) optimizes thread and multi-thread synchronization, which reduces startup latency on 5th Gen Intel Xeon processors and speeds OpenVINO™ toolkit performance up to 4x on ARM CPUs, including Apple Mac*; enhanced parallel_reduce improves data movement to avoid extra copying.
- Intel® Integrated Performance Primitives (Intel® IPP) adds an optimization patch for zlib 1.3.1 to improve compression ratio and throughput in data-compression tasks, and adds accelerated image-processing capabilities on select color-conversion functions using Intel® AVX-512 VNNI on Intel GPUs.
- Intel® IPP Cryptography expands security across government agencies and the private sector, including NIST FIPS 140-3 compliance, and enhances data protection with optimized LMS post-quantum crypto algorithm for single buffer implementation. It also optimizes AES-GCM performance on Intel Xeon and Intel® Core™ Ultra processors via a simplified new code sample, and streamlines development with Clang 16.0 compiler support for Linux*.
- Intel® MPI Library increases application performance on machines with multiple Network Interface Cards by enabling developers to pin specific threads to individual NICs; and adds optimizations for GPU-aware broadcasts, RMA peer-to-peer device-initiated communications, intranode thread-splits, and Infiniband* tuning for 5th Gen Intel Xeon processors.
AI & ML Tools & Frameworks
- PyTorch* 2.4 now provides initial support for Intel® Max Series GPUs, which brings Intel GPUs and the SYCL* software stack into the official PyTorch stack to help further accelerate AI workloads.
- Intel Extension for PyTorch* provides better CPU performance tuning for Bert_Large and Stable Diffusion using FP16 optimizations in eager mode (see the sketch after this list). Popular LLM models are optimized for Intel GPUs using weight-only quantization (WOQ) to reduce the amount of memory access while still improving performance and without losing accuracy.
- Intel Neural Compressor improves INT8 and INT4 LLM model performance using SmoothQuant and WOQ algorithms in more than 15 popular LLM quantization recipes. Take advantage of in-place mode in WOQ to reduce memory footprint when running the quantization process. Improve model accuracy with AutoRound, a low-bit quantization method for LLM inference that fine-tunes rounding values and min-max values of weights in fewer steps. New Wanda and DSNOT pruning algorithms for PyTorch LLMs help improve performance during AI inferencing, while the SNIP algorithm enables scaling models across multiple cards or nodes (CPU).
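A minimal, hedged sketch of the eager-mode FP16 flow described above follows. It assumes torch and intel_extension_for_pytorch are installed; the toy model stands in for workloads such as Bert_Large or Stable Diffusion, and FP16 support on CPU (in ipex.optimize and autocast) depends on the installed versions.

```python
import torch
import intel_extension_for_pytorch as ipex

# Hedged sketch: a toy block stands in for real models; FP16 eager-mode support
# on CPU depends on the PyTorch and IPEX versions installed.
class ToyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(768, 3072)
        self.fc2 = torch.nn.Linear(3072, 768)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))

model = ToyBlock().eval()
model = ipex.optimize(model, dtype=torch.float16)  # eager-mode FP16 optimizations

x = torch.randn(8, 768)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.float16):
    y = model(x)
print(y.shape, y.dtype)
```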
Analysis, Debug and Code Migration Tools
- Intel® VTune™ Profiler enables deeper insights into sub-optimal oneCCL communication, adds support for .NET 8, and supports upcoming processors codenamed Grand Ridge. A technical preview feature allows developers to get a high-level view of potential bottlenecks in software performance analysis before exploring top-down microarchitecture metrics for deeper analysis.
- Intel® DPC++ Compatibility Tool accelerates visual AI and imaging applications on multivendor GPUs via option-enabled migration to SYCL* image API extension; auto-compares kernel run logs and reports differences for migrated SYCL code; and can migrate 126 commonly-used CUDA APIs.
- Intel® Distribution for GDB* supports Core Ultra processors on Windows*; adds a Variables Watch window in VS Code* to monitor and analyze variables and enhance application stability faster and more efficiently; and expands Control-flow Enforcement Technology (CET) support to strengthen application security.
Get deeper details with a developer’s perspective on new features in this blog and in tools release notes.
Build & Deploy AI Solutions at Scale in Intel Tiber Developer Cloud
Develop and deploy AI models, applications, and production workloads on the latest Intel architecture using an open software stack that’s built on oneAPI and includes popular foundational models and optimized tools and frameworks.
New hardware and services—access:
- Virtual machines with Intel® Max Series GPUs
- GenAI Jupyter notebooks with Intel® Gaudi® 2 accelerators
- Intel® Kubernetes Service with container deployment via K8s APIs
- Intel Xeon 6 preproduction systems in the preview environment
Intel® Gaudi® 2 Enables a Lower Cost Alternative for AI Compute and GenAI
June 12, 2024 | Intel® Gaudi® 2 AI Accelerator, Intel® Tiber™ Developer Cloud
Today, MLCommons published results of its industry AI performance benchmark: MLPerf Training v4.0. Intel’s results illustrate the choice Intel Gaudi 2 AI accelerators offer to enterprises and customers.
Intel submitted results on a large Gaudi 2 system (1,024 Gaudi 2 accelerators) trained in the Intel Tiber Developer Cloud to demonstrate the AI accelerator’s performance and scalability—it can handily train 70B-175B parameter LLMs—as well as the Tiber Developer Cloud’s capacity for efficiently training MLPerf’s GPT-3 175B-parameter benchmark model1.
Results
Gaudi 2 continues to be the only MLPerf-benchmarked alternative to the NVIDIA H100 for AI compute. Trained in the Tiber Developer Cloud, Intel’s GPT-3 result, a time-to-train (TTT) of 66.9 minutes on an AI system of 1,024 Gaudi 2 accelerators, proves strong Gaudi 2 scaling performance on ultra-large LLMs within a developer cloud environment1.
The benchmark suite also featured a new measurement: fine-tuning the Llama 2 70B parameter model using LoRA (Low-Rank Adaptation, a fine-tuning method for large language and diffusion models). Intel’s submission achieved TTT of 78.1 minutes on eight Gaudi 2 accelerators.
How Gaudi provides AI value to customers
High costs have priced too many enterprises out of the market. Intel Gaudi is starting to change that. At Computex, Intel announced that a standard AI kit including eight Intel Gaudi 2 accelerators with a universal baseboard (UBB) offered to system providers at $65,000 is estimated to be one-third the cost of comparable competitive platforms. A kit including eight Intel Gaudi 3 accelerators with a UBB lists at $125,000, estimated to be two-thirds the cost of comparable competitive platforms2.
The value of Intel Tiber Developer Cloud
Intel’s cloud provides enterprise customers a unique, managed, and cost-efficient platform to develop and deploy AI models, applications, and solutions—from single nodes to large cluster-level compute capacity. This platform increases access to Gaudi for AI compute needs—in the Tiber Developer Cloud, Intel makes its accelerators, CPUs, GPUs, an open AI software stack, and other services easily accessible. Learn more.
More resources
1 MLPerf's GPT-3 measurement is conducted on a 1% representative slice of the entire model as determined by the participating companies who collectively devise the MLCommons benchmark.
2 Pricing guidance for cards and systems is for modeling purposes only. Please consult your original equipment manufacturer (OEM) of choice for final pricing. Results may vary based upon volumes and lead times.
For workloads and configurations, visit MLCommons.org. Results may vary.
More than 500 AI Models Run Optimized on Intel® Core™ Ultra Processors
May 1, 2024 | Intel® Core™ Ultra Processor family
Intel builds the PC industry’s most robust AI PC toolchain
Today, Intel announced it has surpassed 500 pre-trained AI models running optimized on new Intel® Core™ Ultra processors, the industry’s premier AI PC processor available in the market.
The models span more than 20 categories of local AI inferencing: large language, diffusion, super resolution, object detection, image classification and segmentation, and computer vision, among others. They include Phi-2, Mistral, Llama, BERT, Whisper, and Stable Diffusion 1.5.
This is a landmark moment for Intel’s efforts to nurture and support the AI PC transformation—the Intel Core Ultra processor is the fastest growing AI PC processor to date; it features new AI experiences, immersive graphics, and optimal battery life; and it’s the most robust platform for AI PC development, with more AI models, frameworks, and runtimes enabled than any other processor vendor.
All 500 models can be deployed across CPU, GPU, and NPU. They are available across popular industry sources such as OpenVINO Model Zoo, Hugging Face, ONNX Model Zoo, and PyTorch.
Additional resources
Canonical Ubuntu* 24.04 LTS Release Optimized by Intel® Technology
April 25, 2024 | Ubuntu 24.04 LTS, Intel® QAT, Intel® TDX
Today, Canonical announced the release of Ubuntu* 24.04 LTS (codenamed Noble Numbat). This 10th Long Term Supported release merges advancements in performance engineering and confidential computing, including integration of Intel® QuickAssist Technology (Intel® QAT) for workload acceleration on CPU and support for Intel® Trust Domain Extensions (Intel® TDX) to strengthen confidential computing in private data centers.
“Ubuntu is a natural fit to enable the most advanced Intel features. Canonical and Intel have a shared philosophy of enabling performance and security at scale across platforms.”
Release Highlights
- Performance-engineering tools – Includes the latest Linux* 6.8 kernel with improved syscall performance, nested KVM support on ppc64el, features to reduce kernel task scheduling delays, and frame pointers enabled by default on all 64-bit architectures for more complete CPU and off-CPU profiling.
- Intel® QAT integration – Enables accelerated encryption and compression, reduces CPU utilization, and improves networking and storage application performance on 4th Gen and newer Intel® Xeon® Scalable processors.
- Intel® TDX support – The release seamlessly supports the extensions on both the host and guest sides, with no changes required to the application layer, greatly simplifying the porting and migration of existing workloads to a confidential computing environment.
- Increased developer productivity – Includes Python* 3.12, Ruby 3.2, PHP 8.3, and Go 1.22, with additional focus dedicated to the developer experience for .NET, Java, and Rust.
Learn more
Download Ubuntu 24.04 LTS
Noble Numbat Deep Dive
About Canonical
Canonical, the publisher of Ubuntu, provides open source security, support, and services. Its portfolio covers critical systems, from the smallest devices to the largest clouds, from the kernel to containers, from databases to AI.
Seekr Grows AI Business with Big Cost Savings on Intel® Tiber™ Developer Cloud
April 10, 2024 | Intel® Tiber™ Developer Cloud
Trustworthy AI for content evaluation and generation at reduced costs
Named one of the most innovative companies of 2024 by Fast Company, Seekr is using the Intel® Tiber™ Developer Cloud1 to build, train, and deploy advanced LLMs on cost-effective clusters running on the latest Intel hardware and software, including Intel® Gaudi® 2 AI accelerators. This strategic collaboration to accelerate AI helps Seekr meet the enormous demand for compute capacity while reducing its cloud costs and increasing workload performance.
Solution overview at a glance
Two of Seekr’s popular products, Flow and Align, help customers leverage AI to deploy and optimize their content and advertising strategies and to train, build, and manage the entire LLM pipeline using scalable and composable workflows.
This takes immense compute capacity which, historically, would require a significant infrastructure investment and considerable cloud costs.
By moving their production workloads from on-premises infrastructure to Intel Tiber Developer Cloud, Seekr is now able to employ the power and capacity of Intel hardware and software technologies—including thousands of Intel Gaudi 2 cards—to build its LLMs, and do so at a fraction of the price and with exceptionally high performance.
Read the case study (includes benchmarks)
About Seekr
Seekr builds large language models (LLMs) that identify, score, and generate reliable content at scale; the company’s goal is to make the Internet safer and more valuable to use while solving its customers’ need for brand trust. Its customers include Moderna, SimpliSafe, Babbel, Constant Contact, and Indeed.
1 Formerly “Intel® Developer Cloud”; now part of the Intel® Tiber™ portfolio of enterprise business solutions.
Intel Vision 2024 Unveils Depth & Breadth of Open, Secure, Enterprise AI
April 9, 2024
At Intel Vision 2024, Intel CEO Pat Gelsinger introduced new strategies, next-gen products and portfolios, customers, and collaborations spanning the AI continuum.
Topping the list is Intel® Tiber™, a rich portfolio of complementary business solutions to streamline deployment of enterprise software and services across AI, cloud, edge, and trust and security; and the Intel® Gaudi® 3 accelerator, bringing more performance, openness, and choice to enterprise GenAI.
More than 20 customers showcased their leading AI solutions running on Intel® architecture, with LLM/LVM platform providers Landing.ai, Roboflow, and Seekr demonstrating how they use Intel Gaudi 2 accelerators on the Intel® Tiber™ Developer Cloud to develop, fine-tune, and deploy their production-level solutions.
Specific to collaborations, Intel announced them with Google Cloud, Thales, and Cohesity, each of which is leveraging Intel’s confidential computing capabilities—including Intel® Trust Domain Extensions (Intel® TDX), Intel® Software Guard Extensions (Intel® SGX), and Intel® Tiber™ Trust Services1 attestation service—in their cloud instances.
A lot more was revealed, including formation of the Open Platform for Enterprise AI and Intel’s expanded AI roadmap inclusive of 6th Gen Intel® Xeon® processors with E- and P-cores and silicon for client, edge, and connectivity.
“We’re seeing incredible customer momentum and demonstrating how Intel’s open, scalable systems, powered by Intel Gaudi, Xeon, Core Ultra processors, Ethernet-enabled networking, and open software, unleash AI today and tomorrow, bringing AI everywhere for enterprises.”
Highlights
Intel Tiber portfolio of business solutions simplifies the deployment of enterprise software and services, including for AI, making it easier for customers to find complementary solutions that fit their needs, accelerate innovation, and unlock greater value without compromising on security, compliance, or performance. Full rollout is planned in the 3rd quarter of 2024. Explore Intel Tiber now.
Intel Gaudi 3 AI accelerator promises 4x more compute and a 1.5x increase in memory bandwidth over Gaudi 2 and is projected to outperform NVIDIA H100 by an average of 50% on inference and 60% on power efficiency for LLaMa 7B and 70B and Falcon 180B LLMs. It will be available in the 2nd quarter of 2024, including in the Intel Developer Cloud.
Intel Tiber Developer Cloud’s latest release includes new hardware and services that boost compute capacity, including bare metal as a service (BMaaS) options that host large-scale clusters of Gaudi 2 accelerators and Intel® Max Series GPUs, VMs running on Gaudi 2, storage as a service (StaaS) including file storage, and Intel® Kubernetes Service for cloud-native AI workloads.
Find out how Seekr used Intel Developer Cloud to deploy a trustworthy LLM for content generation and evaluation at scale.
Confidential computing collaborations with Thales and Cohesity increase trust and security and decrease risk for enterprise customers.
- Thales, a leading global tech and security provider, announced a data security solution comprising its own CipherTrust Data Security Platform on Google Cloud Platform for end-to-end data protection and Intel Tiber Trust Services for confidential computing and trusted cloud-independent attestation. This will give enterprises additional controls to protect data at rest, in transit, and in use.
- Cohesity, a leader in AI-powered data security and management, announced the addition of confidential computing capabilities to Cohesity Data Cloud. The solution leverages its Fort Knox cyber vault service for data-in-use encryption, in tandem with Intel SGX and Intel Tiber Trust Services to reduce the risk posed by bad actors accessing data while it’s being processed in main memory. This is critical for regulated industries such as financial services, healthcare, and government.
Explore more
- Intel’s Enterprise Software Portfolio
- Intel Tiber Developer Cloud
- Intel® Confidential Computing Solutions
- Intel TDX
- Intel SGX
1 Formerly Intel® Trust Authority
Just Released: Intel® Software Development Tools 2024.1
March 28, 2024 | Intel® Software Development Tools
Accelerate code with confidence on the world’s first SYCL 2020-conformant toolchain
The 2024.1 Intel® Software Development Tools are now available and include a major milestone for accelerated computing: Intel® oneAPI DPC++/C++ Compiler has become the first compiler to adopt the full SYCL 2020 specification.
Why is this important?
Having a SYCL 2020-conformant compiler means developers can have confidence that their code is future-proof—it’s portable and reliably performant across the diversity of existing and future-emergent architectures and hardware targets, including GPUs.
“SYCL 2020 enables productive heterogeneous computing today, providing the necessary controls to write high-performance parallel software for the complex reality of today’s software and hardware. Intel’s commitment to supporting open standards is again showcased as they become a SYCL 2020 Khronos Adopter.”
Key Benefits
- Code with Confidence & Build Faster – Optimize parallelization for higher performance and productivity in modern C++ code via the Intel oneAPI DPC++/C++ Compiler, now with full SYCL 2020 conformance; explore new multiarchitecture features across AI, HPC, and distributed computing; and access relevant AI Tools faster and more easily with an expanded set of web-based selector options.
- Accelerate AI Workloads & Lower Compute Costs – Achieve performance improvements on new Intel CPUs and GPUs, including up to 14x with oneDNN on 5th Gen Intel® Xeon® Scalable processors1; 10x to 100x out-of-the-box acceleration of popular deep learning frameworks and libraries such as PyTorch* and TensorFlow*2; and faster gradient boosting inference across XGBoost, LightGBM, and CatBoost. Perform parallel computations at reduced cost with Intel® Extension for Scikit-learn* algorithms.
- Increase Innovation & Expand Deployment – Tune once and deploy universally with more efficient code offload using SYCL Graph, now available on multiple SYCL backends in the Intel oneAPI DPC++/C++ Compiler; ease CUDA-to-SYCL migration of more CUDA APIs in the Intel® DPC++ Compatibility Tool; and explore time savings in a CodePin Tech Preview (new SYCLomatic feature) to auto-capture test vectors and start validation immediately after migration. Codeplay adds new support and capabilities to its oneAPI plugins for NVIDIA and AMD GPUs.
The Nuts & Bolts
For those of you interested in diving into the component-level details, here’s the collection.
Compilers
- Intel oneAPI DPC++/C++ Compiler is the first compiler to achieve SYCL 2020 conformance, giving developers confidence that their SYCL code is portable and reliably performs on the diversity of current and emergent GPUs. Enhanced SYCL Graph allows for seamless integration of multi-threaded work and thread-safe functions with applications and is now available on multiple SYCL backends, enabling tune-once-deploy-anywhere capability. Expanded conformance to OpenMP 5.0, 5.1, 5.2, and TR12 language standards enables increased performance.
- Intel® Fortran Compiler adds more Fortran 2023 language features including improved compatibility and interoperability between C and Fortran code, simplified trigonometric calculations, and predefined data types to improve code portability and ensure consistent behavior; makes OpenMP offload programming more productive; and increases compiler stability.
Performance Libraries
- Intel® oneAPI Math Kernel Library (oneMKL) introduces new optimizations and functionalities to reduce the data transfer between Intel GPUs and the host CPU, enables the ability to reproduce results of BLAS level 3 operations on Intel GPUs from run-to-run through CNR, and streamlines CUDA-to-SYCL porting via the addition of CUDA-equivalent functions.
- Intel® oneAPI Data Analytics Library (oneDAL) enables gradient boosting inference acceleration across XGBoost*, LightGBM*, and CatBoost* without sacrificing accuracy; improves clustering by adding sparse K-Means support to automatically identify a subset of the features used in clustering observations.
- Intel® oneAPI Deep Neural Network Library (oneDNN) adds support for GPT-Q to improve LLM performance, fp8 data type in primitives and Graph API, fp16 and bf16 scale and shift arguments for layer normalization, and opt-in deterministic mode to guarantee results are bitwise identical between runs in a fixed environment.
- Intel® oneAPI DPC++ Library (oneDPL) adds a specialized sort algorithm to improve app performance on Intel GPUs, adds transform_if variant with mask input for stencil computation needs, and extends C++ STL style programming with histogram algorithms to accelerate AI and scientific computing.
- Intel® oneAPI Collective Communications Library (oneCCL) optimizes all key communication patterns to speed up message passing in a memory-efficient manner and improve inference performance.
- Intel® Integrated Performance Primitives expands features and support for quantum computing, cybersecurity, and data compression, including XMSS post-quantum hash-based cryptographic algorithm (tech preview), FIPS 140-3 compliance, and updated LZ4 lossless data compression algorithm for faster data transfer and reduced storage requirements in large data-intensive applications.
- Intel® MPI Library adds new features to improve application performance and programming productivity, including GPU RMA for more efficient access to remote memory and MPI 4.0 support for Persistent Collectives and Large Counts.
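As a rough illustration of the C++ STL-style programming model oneDPL extends (noted in the oneDPL bullet above), the sketch below sorts a SYCL buffer on the device selected by a default queue. It is a generic example with illustrative data, not a demonstration of the new specialized sort path itself, and it assumes a oneDPL installation alongside a SYCL compiler (e.g., `icpx -fsycl`).

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data = {5, 3, 9, 1, 7, 2, 8};
    {
        // Wrap host data in a SYCL buffer so the algorithm can run on the device.
        sycl::buffer<int> buf(data);
        // Device execution policy built from a SYCL queue (default device selection).
        auto policy = oneapi::dpl::execution::make_device_policy(sycl::queue{});
        // STL-style algorithm dispatched to the device chosen by the policy.
        oneapi::dpl::sort(policy, oneapi::dpl::begin(buf), oneapi::dpl::end(buf));
    }  // buffer destruction copies the sorted data back into `data`
    for (int v : data) std::printf("%d ", v);  // expected: 1 2 3 5 7 8 9
    std::printf("\n");
}
```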
AI & ML Tools & Frameworks
- Intel® Distribution for Python* expands the ability to develop more future-proof code, including Data Parallel Control (dpctl) library’s 100% conformance to the Python Array API standard and support for NVIDIA devices; Data Parallel Extension for NumPy* enhancements for linear algebra, data manipulation, statistics, data types, plus extended support for keyword arguments; and Data Parallel Extension for Numba* improvements to kernel launch times.
- Intel Extension for Scikit-learn reduces computational costs on GPUs by computing only on changed pieces of a dataset via Incremental Covariance and by performing parallel GPU computations through SPMD interfaces.
- Intel® Distribution of Modin* delivers significant enhancements in security and performance, including a robust security solution that ensures proactive identification and remediation of data asset vulnerabilities, and performance fixes to optimize asynchronous execution. (Note: in the 2024.2 release, developers will be able to access Modin through upstream channels.)
Analyzers & Debuggers
- Intel® VTune™ Profiler expands the ability to identify and understand the reasons for implicit USM data movements between host and GPU that cause performance inefficiencies in SYCL applications; adds support for .NET 8, Ubuntu* 23.10, and FreeBSD* 14.0.
- Intel® Distribution for GDB* rebases to GDB 14, staying current and aligned with the latest application debug enhancements; enables the ability to monitor and troubleshoot memory access issues in real time; and adds large General Purpose Register File debug mode support for more comprehensive debugging and optimization of GPU-accelerated applications.
Rendering & Ray Tracing
- Intel® Embree adds enhanced error reporting for the SYCL platform and driver to smooth the transition to cross-architecture code, and improves stability, security, and performance.
- Intel® Open Image Denoise fully supports multi-vendor denoising across all platforms: x86 and ARM CPUs (including ARM support on Windows*, Linux*, and macOS*) and Intel, NVIDIA, AMD, and Apple GPUs.
More Resources
- Intel Compiler First to Achieve SYCL 2020 Conformance
- A Dev's Take on the 2024.1 Release
- Download Codeplay oneAPI plugins: NVIDIA GPUs | AMD GPUs
Footnotes
1 Performance Index: 5th Gen Intel Xeon Scalable Processors
2 Software AI accelerators: AI performance boost for free
Gaudi and Xeon Advance Inference Performance for Generative AI
March 27, 2024 | Intel® Developer Cloud, MLCommons
Newest MLPerf results for Intel® Gaudi 2 accelerators and 5th Gen Intel® Xeon® processors demonstrate Intel is raising the bar for GenAI performance.
Today, MLCommons published results of the industry-standard MLPerf v4.0 benchmark for inference, inclusive of Intel’s submissions for its Gaudi 2 accelerators and 5th Gen Intel Xeon Scalable processors with Intel® AMX.
As the only benchmarked alternative to NVIDIA H100* for large language and multimodal models, Gaudi 2 offers compelling price/performance, which is important when gauging total cost of ownership. On the CPU side, Intel remains the only server CPU vendor to submit MLPerf results (and Xeon is the host CPU for many accelerator submissions).
Get the details and results here.
Try them in the Intel® Developer Cloud
You can evaluate 5th Gen Xeon and Gaudi 2 in the Intel Developer Cloud, including running small- and large-scale training (LLM or generative AI), running inference production workloads at scale, and managing AI compute resources. Explore the subscription options and sign up for an account here.
Intel Open Sources Continuous Profiler Solution, Automating Always-On CPU Performance Analysis
March 11, 2024 | Intel® Granulate™ Cloud Optimization Software
A continuous, autonomous way to find runtime efficiencies and simplify code optimization.
Today, Intel released the Continuous Profiler optimization agent as open source, another example of the company’s open ecosystem approach to catalyze innovation and boost productivity for developers.
As its name indicates, Continuous Profiler keeps perpetual oversight of CPU utilization, offering developers, performance engineers, and DevOps teams an always-on, autonomous way to identify application and workload runtime inefficiencies.
How it works
It combines output from multiple sampling profilers into a single flame graph: a unified visualization of where the CPU is spending its time and, in particular, where high latency or errors occur in the code (sketched below).
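To make the idea of folding samples into a flame graph concrete, here is a generic, illustrative C++ sketch of how sampled call stacks are typically aggregated into the "folded stacks" text format that flame-graph tooling consumes. The frame names are hypothetical, and this is a sketch of the general technique, not Continuous Profiler's actual implementation.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Each sample is a call stack captured by a sampling profiler,
    // outermost frame first (hypothetical frame names).
    std::vector<std::vector<std::string>> samples = {
        {"main", "handle_request", "parse_json"},
        {"main", "handle_request", "parse_json"},
        {"main", "handle_request", "query_db"},
    };

    // Fold identical stacks together and count how often each was seen;
    // the counts become the widths of the boxes in the flame graph.
    std::map<std::string, unsigned> folded;
    for (const auto& stack : samples) {
        std::string key;
        for (size_t i = 0; i < stack.size(); ++i)
            key += (i ? ";" : "") + stack[i];
        ++folded[key];
    }

    // Emit the folded-stacks format, e.g. "main;handle_request;parse_json 2"
    for (const auto& [stack, count] : folded)
        std::cout << stack << ' ' << count << '\n';
}
```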
Why you want it
Continuous Profiler comes with numerous unique features to help teams find and fix performance errors and smooth deployment, is compatible with Intel Granulate’s continuous optimization services, can be deployed cluster-wide in minutes, and supports a range of programming languages without requiring code changes.
Additionally, it’s SOC2-certified and held to Intel's high security standards, ensuring reliability and trust in its deployment, and is used by global companies including Snap Inc. (portfolio includes Snapchat and Bitmoji), ironSource (app business platform), and ShareChat (social networking platform).
Learn more
Intel® Software at KubeCon Europe 2024
February 29, 2024 | Intel® Software @ KubeCon Europe 2024
Intel’s Enterprise Software Portfolio enables K8s scalability for enterprise applications
Meet Intel enterprise software experts at KubeCon Europe 2024 (March 19-22) and discover how you can streamline and scale deployments, reduce Kubernetes costs, and achieve end-to-end security for data.
Plus, attend the session Above the Clouds with American Airlines to learn how one of the world’s top airlines achieved 23% cost reductions for their largest cloud-based workloads using Intel® Granulate™ software.
Why Intel Enterprise Software for K8s?
Because Intel’s Enterprise Software portfolio is purpose-built to accelerate cloud-native applications and solutions efficiently and at scale, paving a faster way to AI. That means you can run production-level Kubernetes workloads the right way: easier to manage, secure, and efficiently scalable.
In a nutshell, you get:
- Optimized performance with reduced costs
- Better models with streamlined workflow
- Confidential computing that’s safe, secure, and compliant
Stop by Booth #J17 to have a conversation about the depth and breadth of Intel’s enterprise software solutions.
More resources
Prediction Guard Offers Customers LLM Reliability and Security via Intel® Developer Cloud
February 22, 2024 | Intel® Developer Cloud
AI startup Prediction Guard is now hosting its LLM API in the secure, private environment of Intel Developer Cloud, taking advantage of Intel’s resilient computing resources to deliver peak performance and consistency in cloud operations for its customers’ GenAI applications.
Prediction Guard’s AI platform enables enterprises to harness the full potential of large language models while mitigating security and trust issues such as hallucinations, harmful outputs, and prompt injections.
By moving to Intel Developer Cloud, the company can offer its customers significant and reliable computing power as well as the latest AI hardware acceleration, libraries, and frameworks: it’s currently leveraging Intel® Gaudi® 2 AI accelerators, the Intel/Hugging Face collaborative Optimum Habana library, and Intel extensions for PyTorch and Transformers.
“For certain models, following our move to Intel Gaudi 2, we have seen our costs decrease while throughput has increased by 2x.”
Learn more
Prediction Guard is part of Intel® Liftoff for Startups, a free program for early-stage AI and machine learning startups that helps them innovate and scale across their entrepreneurial journey.
New Survey Unpacks the State of Cloud Optimization for 2024
February 20, 2024 | Intel® Granulate™ software
A newly released global survey conducted by the Intel® Granulate™ cloud-optimization team assessed key trends and strategies in cloud computing among DevOps, Data Engineering, and IT leaders at 413 organizations spanning multiple industries.
Among the findings, the #1 and #2 priorities for the majority of organizations (over 2/3) were cloud cost reduction and application performance improvement. And yet, 54% do not have a team dedicated to cloud-based workload optimization.
Get the report today to learn more trends, including:
- Cloud optimization priorities and objectives
- Assessment of current optimization efforts
- The most costly and difficult-to-optimize cloud-based workloads
- Optimization tools used in the tech stack
- Innovations for 2024
Download the report →
Request a demo →
American Airlines Achieves 23% Cost Reductions for Cloud Workloads using Intel® Granulate™
January 29, 2024 | Intel® Granulate™ Cloud Optimization Software
American Airlines (AA) partnered with Intel Granulate to optimize its most challenging workloads, which were stored in a Databricks data lake, and to rein in an untenable data-management price tag.
After deploying the Intel Granulate solution, which delivers autonomous and continuous optimization with no code changes or development efforts required, AA was able to free up engineering teams to process and analyze data at optimal pace and scale, run job clusters with 37% fewer resources, and reduce costs across all clusters by 23%.
Read the case study →
Request a demo →
Intel, the Intel logo, and Granulate are trademarks of Intel Corporation or its subsidiaries.
Now Available: the First Open Source Release of Intel® SHMEM
January 10, 2024 | Intel® SHMEM [GitHub]
V1.0.0 of this open source library extends the OpenSHMEM programming model to support Intel® Data Center GPUs using the SYCL cross-platform C++ programming environment.
OpenSHMEM (SHared MEMory) is a parallel programming library interface standard that enables Single Program Multiple Data (SPMD) programming of distributed memory systems: users write a single program and run many copies of it across a supercomputer or cluster of computers.
Intel® SHMEM is a C++ library that enables applications to use OpenSHMEM communication APIs with device kernels implemented in SYCL. It implements a Partitioned Global Address Space (PGAS) programming model and includes a subset of host-initiated operations in the current OpenSHMEM standard and new device-initiated operations callable directly from GPU kernels.
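For readers new to the model, the sketch below is a minimal host-side OpenSHMEM program (a ring-style one-sided put) using only routines defined by the OpenSHMEM standard. It illustrates the SPMD and symmetric-heap concepts described above rather than Intel SHMEM's SYCL device-initiated API, which is covered in the spec and blog linked below.

```cpp
// Every processing element (PE) runs this same program; allocations on the
// symmetric heap are remotely accessible by other PEs (PGAS model).
#include <shmem.h>
#include <cstdio>

int main() {
    shmem_init();
    int me = shmem_my_pe();    // this PE's rank
    int npes = shmem_n_pes();  // total number of PEs

    // Symmetric allocation: the same remotely accessible buffer exists on every PE.
    int* dest = static_cast<int*>(shmem_malloc(sizeof(int)));
    *dest = -1;
    shmem_barrier_all();

    // One-sided put: write my rank into the buffer on the next PE (ring pattern).
    int val = me;
    shmem_int_put(dest, &val, 1, (me + 1) % npes);
    shmem_barrier_all();

    std::printf("PE %d of %d received %d\n", me, npes, *dest);
    shmem_free(dest);
    shmem_finalize();
}
```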
Feature Highlights
- Supports the Intel® Data Center GPU Max Series
- Device and host API support for OpenSHMEM 1.5-compliant point-to-point RMA, Atomic Memory Operations, Signaling, Memory Ordering, and Synchronization Operations
- Device and host API support for OpenSHMEM collective operations
- Device API support for SYCL work-group and sub-group level extensions of Remote Memory Access, Signaling, Collective, Memory Ordering, and Synchronization Operations
- Support for C++ template function routines that replace the C11 _Generic selection routines from the OpenSHMEM spec
- GPU RDMA support when configured with Sandia OpenSHMEM with suitable Libfabric providers for high-performance networking services
- Choice of device memory or USM for the SHMEM Symmetric Heap
Read the blog for all the details
(written by three senior software engineers at Intel)
More resources
- Complete Intel SHMEM spec
- OpenSHMEM standard [PDF]