jagomart
digital resources
picture1_High Performance Python Pdf 189298 | Dace Python


 134x       Filetype PDF       File size 3.12 MB       Source: htor.inf.ethz.ch


File: High Performance Python Pdf 189298 | Dace Python
productivity portability performance data centric python alexandros nikolaos ziogas timo schneider tal ben nun alexandru calotoiu tiziano de matteis johannes de fine licht luca lavarini and torsten hoefler department of ...

icon picture PDF Filetype PDF | Posted on 03 Feb 2023 | 2 years ago
Partial capture of text on file.
                           Productivity, Portability, Performance: Data-Centric Python
                              Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De
                                                    Matteis, Johannes de Fine Licht, Luca Lavarini, and Torsten Hoefler
                                                                             Department of Computer Science, ETH Zurich
                                                                                                       Switzerland
                    ABSTRACT
                    Python has become the de facto language for scientific computing.
                    ProgramminginPythonishighlyproductive,mainlyduetoitsrich
                    science-oriented software ecosystem built around the NumPy mod-
                    ule.Asaresult,thedemandforPythonsupportinHighPerformance
                    Computing(HPC)hasskyrocketed. However, the Python language
                    itself does not necessarily offer high performance. In this work, we
                    present a workflow that retains Python’s high productivity while
                    achieving portable performance across different architectures. The
                    workflow’s key features are HPC-oriented language extensions and
                    a set of automatic optimizations powered by a data-centric inter-
                    mediate representation. We show performance results and scaling
                    across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to
                    23,328 cores), with 2.47x and 3.75x speedups over previous-best
                    solutions, first-ever Xilinx and Intel FPGA results of annotated                                                                                                1.7x
                    Python, and up to 93.16% scaling efficiency on 512 nodes.
                                                                                                                                  Figure 1: Data-Centric Python overview.
                    CCSCONCEPTS
                    · Softwareanditsengineering→Parallelprogramminglan-                                                Asaresult, the scientific community now pushes to also make
                    guages; Distributed programming languages; Data flow lan-                                       Python the language for writing high-performance code. For most
                    guages; Source code generation.                                                                 scientific applications, NumPy arrays [36] provide the core data
                    KEYWORDS                                                                                        structures, interfaces, and BLAS and LAPACK library interoper-
                    Data-Centric, High Performance Computing, Python, NumPy                                         ability. NumPy is optimized to provide efficient data structures
                    ACMReferenceFormat:                                                                             and fast library implementations for many common operations.
                    Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru                              However, the performance benefits of NumPy are tied to optimized
                    Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, and                        methodcalls and vectorized array operations, both of which evap-
                    Torsten Hoefler. 2021. Productivity, Portability, Performance: Data-Centric                     orate in larger scientific codes that do not adhere to these con-
                    Python . In The International Conference for High Performance Computing,                        straints. Therefore, there is a significant performance gap between
                    Networking, Storage and Analysis (SC ’21), November 14ś19, 2021, St. Louis,                     the numpythonic code-style and general code written by domain
                    MO, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/                             scientists.
                    3458817.3476176                                                                                    InHPC,thethreePs(Productivity,Portability,Performance)
                    1 INTRODUCTION                                                                                  aredrivingrecentdevelopmentsininfrastructureandprogramming
                                                                                                                    model research to ensure sustainability [14, 53]. In Python, this
                    Python is the language to write scientific code [30]. The capabil-                              drive has resulted in many optimized domain-specific libraries and
                    ity to write and maintain Python code with ease, coupled with a                                 frameworks [2, 17, 46, 61, 72]. Simultaneously, the diversity of the
                    vast number of domain-specific frameworks and libraries such as                                 hardware landscape motivated the creation of interface libraries,
                    SciPy, Matplotlib, scikit-learn [62], or pandas [74], leads to high                             such as CuPy [56], which provides replacements to NumPy opera-
                    productivity. It also promotes collaboration with reproducible sci-                             tions for NVIDIA and AMD GPUs, and MPI4PY [22], which offers
                    entific workflows shared using Jupyter notebooks [42]. Therefore,                               direct MPI bindings. Lower-level interfaces, such as Cython [12],
                    numerousscientific fields, ranging from machine learning [2, 61]                                promisehighperformanceatthecostofwritingcodethatresembles
                    to climate [72] and quantum transport [75] have already adopted                                 C, and lazy evaluation of array operations [10, 43, 61] that enable
                    Python as their language of choice for new developments.                                        high-performance runtime systems. Furthermore, a variety of JIT
                                                                                                                    compilers [17, 34, 45] address the performance degradation result-
                    SC’21, November 14ś19, 2021, St. Louis, MO, USA                                                 ing from the interpreter. Last but not least, runtime systems [10],
                    ©2021Association for Computing Machinery.                                                       distributed tasking (Dask) [23], and remote procedure calls [52] fur-
                    This is the author’s version of the work. It is posted here for your personal use. Not
                    for redistribution. The definitive Version of Record was published in The International         ther scale Python to distributed systems. Despite the abundance of
                    Conference for High Performance Computing, Networking, Storage and Analysis (SC ’21),           choices, Python still struggles: while each approach works towards
                    November 14ś19, 2021, St. Louis, MO, USA, https://doi.org/10.1145/3458817.3476176.              one or more of the Ps, none of them supports all at the same time.
               SC’21, November 14ś19, 2021, St. Louis, MO, USA                                                                              Ziogas et al.
                 WeproposeawaytobridgethegapbetweenthethreePsfor                        Python is an imperative language and, therefore, not designed
               Python programming using a data-centric paradigm. In particular,      to express data movement. Its terseness makes the process of un-
               weempowerPythonuserswithanautomaticoptimizationandspe-                derstanding dataflow difficult, even when comparing to other lan-
               cialization toolbox, which spans the entire Python/HPC ecosystem      guages like C and FORTRAN, as the types of variables in Python
               (Fig. 1) Ð from the code, through communication distribution, to      code cannot be statically deduced.
               hardware mapping. At the core of the toolbox, we use the State-          Below, we define high-performance Python programs, discuss
               ful Dataflow multiGraphs (SDFG) [13] data-centric intermediate        the decorators that we must add to Python code to make the
               representation, which enables these optimizations in the form of      dataflow analyzable, and then detail how we translate them into
               multi-level data movement transformations. With a data-centric        the SDFG data-centric intermediate representation.
               representation, as opposed to library bindings, all data dependen-
               cies and potential overlap are inferred statically from the code, and 2.1    HighPerformancePython
               interpreter overhead is mitigated. Compared with implicit and lazy    Ourapproachsupports a large subset of the Python language that
               evaluation approaches, we also provide a set of extensions that       is important for HPC applications. The focus lies on NumPy ar-
               give power users complete control over parallelism and partition-     rays [36] and operations on such arrays. In addition to the low-
               ing schemes, using pythonic principles and syntax (e.g., returning    overhead data structures NumPy offers, it is central to many frame-
               łlocal viewž objects from global data, but allowing users to operate  works focused on scientific computing, e.g., SciPy, pandas, Mat-
               onthełglobal viewž as well).                                          plotlib.Asopposedtolazyevaluationapproaches,high-performance
                 Wedemonstrateawidevarietyofbenchmarksusingtheauto-                  Python must take control flow into account to auto-parallelize and
               matic toolbox over annotated Python code, both on individual          avoid interpreter overhead. This tradeoff between performance
               nodes and the Piz Daint supercomputer. For the former, we show        and productivity is necessary because Python features such as co-
               that it consistently outperforms other automatic approaches on        routines are not statically analyzable and have to be parsed as
               multicore CPUs and GPUs, and for the first time show automatic        łblack-boxesž. To combat some of these Python quirks, we propose
               Python HPCcompilation results for both Xilinx and Intel FPGAs,        to augmentthelanguagewithanalyzableconstructsusefulforHPC.
               whichvastly differ in architecture and programming language. In
               distributed-memoryenvironments,weshowhighscalingefficiency            2.2    AnnotatingPython
               andabsoluteperformancecomparedwithdistributedtasking.Thus,            TheData-Centric (DaCe) Python frontend parses Python code and
               werealize all three Ps within a single system.                        convertsittoSDFGsonaper-functionbasis.Thefrontendwillparse
                 Thepapermakesthefollowingcontributions:                             only the Python functions that have been annotated explicitly by
                   • (Productivity) Definition of high-performance Python, a         theuserwiththe@dace.programdecorator.DaCeprogramscanthen
                     methodology to translate it to a data-centric IR, and exten-    be called like any Python function and perform Just-in-Time (JIT)
                     sions to improve said conversion via explicit annotation.       compilation.
                   • (Portability)AsetofautomaticoptimizationsforCPU,GPU                Static symbolic typing. To enable Ahead-of-Time (AOT) com-
                     andFPGA,outperformingthebestpriorapproachesby2.47×              pilation, which is of key importance for FPGAs and for reusing
                     onCPUand3.75×onGPUonaverage(geometricmean[1]).                  programs across different inputs, SDFGs should be statically typed.
                   • (Performance)AutomaticimplicitMPItransformationsand             Therefore, the function argument data types are given as type an-
                     communicationoptimizations,aswellasexplicitdistribution         notations, providing the required information as shown below:
                     management, with the former scaling to 512 nodes with up
                     to 93.16% efficiency.                                           N = dace.symbol()
                                                                                     @dace.program
                                                                                     def jacobi_1d(TSTEPS: dace.int32,
               2 DATA-CENTRICPYTHON                                                                    A: dace.float64[N],
               Thecentral tenet of our approach is that understanding and opti-                        B: dace.float64[N]):
               mizing data movement is the key to portable, high-performance              for t in range(1, TSTEPS):
               code. In a data-centric programming paradigm, three governing                   B[1:-1] = 0.33333 * (A[:-2]+A[1:-1]+A[2:])
               principles guide development and execution:                                     A[1:-1] = 0.33333 * (B[:-2]+B[1:-1]+B[2:])
                  (1) Data containers must be separate from computations.            ThePythonmethodjacobi_1dhasthreearguments; TSTEPS is a 32-
                  (2) Data movement must be explicit, both from data containers      bit integer scalar, while A and B are double precision floating-point
                     to computations and to other data containers.                   vectors of length N. The symbolic size N, defined with dace.symbol,
                  (3) Control flow dependencies must be minimized, they shall        indicates that the vector sizes can be dynamic (but equal). All sub-
                     only define execution order if no implicit dataflow is given.   sets are then symbolically defined (e.g., the subset B[1:-1] becomes
                                                                                     B[1:N-1], and symbolic manipulation can then be performed in
                 In the context of SDFGs, examples of data containers are arrays     subsequent data-centric transformations.
               andscalar data, which have a NumPy-compatible data type, such            Parametric parallelism. An important feature that has no direct
               as int32 or float64.                                                  expressioninPythonisaloopthatcanruninparallel.Ourapproach
                                                                                                                                              Productivity, Portability, Performance: Data-Centric Python                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      SC’21, November 14ś19, 2021, St. Louis, MO, USA
                                                                                                                                                                 Python                                                                                                                                                                                                                                                                                            SDFGEquivalent                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         A[0: , 0: ]                                                                                                                                                                                                                                     C[0:, 0: ]
                                                                                                                                                                 Declarations and Types                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |              alpha                                                                                                                                                                                                                                                                                                           |
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ,                    ∈0.. − 1 ∧  ∈ 0. .  − 1                                                                                                                                                                                                                                                     ,                  ∈0.. − 1 ∧  ∈ 0. .  − 1
                                                                                                                                                                 Primitive data types                                                                                                                                                                                                                                                                              Scalar data container                                                                                                                                                                                                                                                                                                                                                                                                                                                                             alpha                                                                                                                               A[, ]                                                                                                                                                                                                                                              C[, ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     out = inp1 * inp2                                                                                                                                                                                                                                                                                                                                       out = inp
                                                                                                                                                                 NumPyarray                                                                                                                                                                                                                                                                                        Array data container                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               tmp0[, ]                                                                                                                                                                                                                                                                                                             alpha (+)
                                                                                                                                                                 Assignments                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            tmp0[0:, 0: ]                                                                                                                                                                                                                                                                                                alpha (+)
                                                                                                                                                                 Assignments                                                                                                                                                                                                                                                                                       Tasklet or map scope with incoming                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   andoutgoingmemletsforread/writ-                                                                                                                                                                                                                                                                                                                                                                  (a) Element-wise array operation (b) Augmented assignment with
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ten operands                                                                                                                                                                                                                                                                                                                                                                                      tmp0 = alpha * A.                                                                                                                                                                                                                                                                                                           WCR.
                                                                                                                                                                 Array subscript                                                                                                                                                                                                                                                                                   Memlet(dataflow edge)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Figure 2: SDFG representations
                                                                                                                                                                 Statements
                                                                                                                                                                 Branching (if)                                                                                                                                                                                                                                                                                    Branch conditions on state transi-
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   tion edges                                                                                                                                                                                                                                                                                                                                                                                      Our first pass traverses the Python AST to simplify the expres-
                                                                                                                                                                 Iteration (for)                                                                                                                                                                                                                                                                                   Conditions, increments on state                                                                                                                                                                                                                                                                                                                                                                 sions into individual steps, similar to static single assignment [7],
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   transition edges                                                                                                                                                                                                                                                                                                                                                                                respecting order of operations:
                                                                                                                                                                 Control-flow                                                                                                                                                                                                                                                                                      Edgetocontrolstructureorfunction                                                                                                                                                                                                                                                                                                                                                                                                                    tmp0 = alpha * A
                                                                                                                                                                (break, continue, return)                                                                                                                                                                                                                                                                          exit state                                                                                                                                                                                                                                                                                                                                                                                                                                          tmp1 = tmp0 @ B
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       tmp2 = beta * C
                                                                                                                                                                 Functions                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             C = tmp1 + tmp2
                                                                                                                                                                 Functioncalls(withsource)and                                                                                                                                                                                                                                                                      Nested SDFG for content, memlets                                                                                                                                                                                                                                                                                                                                                                                           The first step in the above code multiplies each element of A
                                                                                                                                                                 decorator for argument types                                                                                                                                                                                                                                                                      reduce shape of inputs and outputs                                                                                                                                                                                                                                                                                                                                                              with alpha. SDFGs view data containers separately from the com-
                                                                                                                                                                 External/Library calls                                                                                                                                                                                                                                                                            Tasklet with callback or Library                                                                                                                                                                                                                                                                                                                                                                putations the data are part of, as per the first data-centric tenet.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Node                                                                                                                                                                                                                                                                                                                                                                                            These containers are represented by oval Access nodes. In the first
                                                                                                                                             Table 1: Mapping of Python syntax and constructs to SDFG.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             statement, these refer to tmp0, alpha, and A (see Fig. 2a).
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              In SDFGs,connectionstodatacontainersarecalledmemlets,and
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   they describe the data movement Ð the edge direction indicates
                                                                                                                                              supports explicit parallelism declaration through map scopes, simi-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  whetheritisreadorwritten,anditscontentsrefertothepartofthe
                                                                                                                                              larly to an N-dimensional parallel for. There are two ways to                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        data container that is accessed. Computations consume/produce
                                                                                                                                              take advantage of this feature. DaCe provides the dace.map iterator,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 memlets and can be divided into multiple types:
                                                                                                                                             which can be used in Python code as a substitute to the Python                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     (1) Stateless computations (Tasklets, shown as octagons), e.g.,
                                                                                                                                              built-in range iterator and generates a map scope when parsed:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       representing scalar assignments such as a = 1.
                                                                                                                                               for i, j in dace.map[0:M, 0:N]:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  (2) Calls to external libraries (Library Nodes, folded rectangles),
                                                                                                                                                                                                 A[i, j] = B[j, i]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 that represent calls to functions that are not in the list of
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   functions decorated with dace.program. Matrix-matrix mul-
                                                                                                                                             Alternatively, the DaCe framework provides a LoopToMap transfor-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      tiply is a common and important operation tmp1 = tmp0 @ B
                                                                                                                                              mation that detects for-loops in the IR, whose iterations can be                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     andis contained in a Library Node called MatMul.
                                                                                                                                              executed safely in parallel (using symbolic affine expression analy-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              (3) Calls to other SDFGs (Nested SDFGs, rectangles), which rep-
                                                                                                                                              sis), and converts them to map scopes automatically.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 resent calls to functions decorated with dace.program.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                (4) Maps are a particular type of Nested SDFGs matching the
                                                                                                                                              2.3                                                              FromPythontoDaCe                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    language augmentation previously discussed in Section 2.2
                                                                                                                                             WeturntopresenttheSDFGintermediaterepresentation(IR)anda                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              andexpress that the content can be processed in parallel.
                                                                                                                                              novel data-centric Python translation procedure in tandem. While                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Element-wise array operations automatically yield Map scopes.
                                                                                                                                              previous work [13] converted a restricted, low-level Python def-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Similarly, assignments to arrays yield a Map scope, containing a
                                                                                                                                              inition of the SDFG IR, here we aim to cover the majority of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Taskletwiththeelement-wiseassignment.Augmentedassignments
                                                                                                                                              Python/NumPylanguageconstructsviastatic analysis and fallback                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        such as C += 1 are a special case where the output is also an input
                                                                                                                                              for unsupported features. We summarize the equivalence between                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       whennodataracesaredetected.
                                                                                                                                              Python constructs and SDFG counterparts in Table 1, and present                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Parallel maps can be augmented to express how Write-Conflict
                                                                                                                                              the generation of an SDFG from a Python program using the gemm                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Resolution (WCR)shoulddeterminethevalueofdatawhenmultiple
                                                                                                                                              kernel as an example:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                sourceswritetoitconcurrently.Ifdataracesarefound,theoutgoing
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   edges are marked as dashed. E.g., the following program requires
                                                                                                                                                @dace.program                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     WCR(SDFGrepresentationisshowninFig.2b):
                                                                                                                                               def gemm(alpha, beta, C, A, B):
                                                                                                                                                                                                 C[:] = alpha * A @ B + beta * C                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     for i, j in dace.map[0:NI, 0:NJ]:
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       alpha += C[i, j]
                                                                                                                                                                 SC’21, November 14ś19, 2021, St. Louis, MO, USA                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Ziogas et al.
                                                                                                                                                                                               Byconnecting the inputs and outputs of computations with the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         A[0:, 0: ]
                                                                                                                                                                 containers explicitly, the data-centric representation of a program                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |             alpha                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ,                     ∈0.. − 1 ∧  ∈ 0. .  − 1                                                                                                                                                                                                                                                                                                                                                                            alpha                                                                                                                             A[0:, 0: ]
                                                                                                                                                                 can be created. To represent control dependencies, we encapsulate                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     alpha                                                                                                                              A[, ]                                                                                                                                                                                                                                ,             |       ∈0.. − 1 ∧  ∈ 0. .  − 1
                                                                                                                                                                 codedrivenbypuredatadependenciesinStates(brightblueregions)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          out = inp1 * inp2                                                                                                                                                                                                                                                                                                                                                                                                    alpha                                                                                                                              A[, ]
                                                                                                                                                                 in the SDFG. The states are connected with State Transition Edges                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     tmp0[, ]                                                                                                                                       StateFusion                                                                                                                                                                                  out = inp1 * inp2
                                                                                                                                                                (in blue) that can be annotatedwithcontrolflow,representingloops,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         tmp0[0:, 0: ]                                                                                                                                                                                                                                                                                                                                                                                    tmp0[, ]
                                                                                                                                                                 branch conditions, and state machines in general. The following                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                tmp0[0:, 0: ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                 Python program is represented by the SDFG in Fig. 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  tmp0[0:, 0: ]                                                                                                      B[0:, 0: ]                                                                                                                                                                                                         tmp0[0:, 0: ]                                                                                                                                                    B[0:, 0: ]
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   MatMult
                                                                                                                                                                   for i in range(NI):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         MatMult                                                                                                                                                                                                                                                                                                                                                                                                                                                          tmp1[0:, 0: ]
                                                                                                                                                                                                                           C[i] += 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       tmp1[0:, 0: ]                                                                                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                        =0                                                                                                                                                                                      +=1                                                                                                                                                                                                                                                                                                            C[]                                                                                                                                                                                                                   Figure4:Statefusionoftmp0 = alpha * Aandtmp1 = tmp0 @ B.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      out = inp + 1
                                                                                                                                                                                                                                ≥                                                                                                                                                                                       <                                                                                                                                                                                                                                                                                                       C[]                                                                                                                                                                                                                  coarseningpasshappensautomaticallyaspartofourproposedtool-
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     box, one can also apply transformations manually and separately,
                                                                                                                                                                 Figure 3: SDFG representation of a Python for-loop. There                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          without changing the original Python source code (we color such
                                                                                                                                                                 are twostates (left: guard, right: body) connected by control                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    łperformance engineering codesž in cyan):
                                                                                                                                                                 flow(state transition) edges that define the loop range.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       sdfg = gemm.to_sdfg()
                                                                                                                                                                TheaboveloopalsoisanexcellentusecasefortheLoopToMaptrans-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              sdfg.apply(StateFusion)
                                                                                                                                                                 formation (Section 2.2) since its iterations are independent.
                                                                                                                                                                                               In the conversion to the SDFG IR, we also replace calls to library                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    2.5                                                                       PythonRestrictions
                                                                                                                                                                 functions (e.g., np.linalg.solve) and object methods (A.view())
                                                                                                                                                                with custom subgraphs or Library Nodes, which users can extend                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Somefeatures available in Python are incompatible with our defi-
                                                                                                                                                                 for other libraries and object types via a decorated function. Fol-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 nition of high performance Python, and are discussed below. This
                                                                                                                                                                 lowing this initial conversion, the resulting SDFG contains a state                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 doesnotexcludeprogramsusingthefullfeaturesetofPythonfrom
                                                                                                                                                                 per statement and a nested SDFG per function call.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  analysis, but calls to functions containing unsupported features
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    will not benefit from our optimization. The restricted features are:
                                                                                                                                                                 2.4                                                                       DataflowOptimization                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       (1) Python containers (lists, sets, dictionaries, etc.) other than
                                                                                                                                                                Thedirect translation to SDFGs creates a control-centric version                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              NumPy arrays as arguments, as they are represented by
                                                                                                                                                                 of the code, which matches the Python semantics but does not                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 linked lists and difficult to analyze from a dataflow view.
                                                                                                                                                                 contain dataflow beyond a single statement (similarly to compilers’                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Note that this does not preclude internal containers (which
                                                                                                                                                          -O0). To mitigate this, we run a pass of IR graph transformations                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   can be analyzed) and list comprehensions.
                                                                                                                                                                 that coarsens the dataflow, exposing a true data-centric view of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 (2) Dynamictyping:fixedstructsareallowedinSDFGs,sofields
                                                                                                                                                                 code (similar to -O1). The transformations include redundant copy                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            can be transformed and their class methods into functions
                                                                                                                                                                 removals, inlining Nested SDFGs, and others (14 in total), which                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             decorated with the dace.program decorator. However, dy-
                                                                                                                                                                 onlymodifyorremoveelementsinthegraph,suchthattheycannot                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      namic changes to classes or dynamic types are unsupported
                                                                                                                                                                 be applied indefinitely.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     as their structure cannot be statically derived.
                                                                                                                                                                                               Tounderstandthispass,weshowcaseonetransformationÐstate                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 (3) Control-dependent variable state (i.e., no scoping rules), e.g.,
                                                                                                                                                                 fusion Ð which allows merging two states where the result does                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               the following valid Python:
                                                                                                                                                                 not produce data races. For example, the two states containing the
                                                                                                                                                                 assignments below can be merged:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              x = ...
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                if x > 5:
                                                                                                                                                                   tmp0 = alpha * A                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     y = np.ndarray([5, 6], dtype=np.float32)
                                                                                                                                                                   tmp1 = tmp0 @ B                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             # use y (will raise exception if x <= 5)
                                                                                                                                                                 Internally, the transformation matches a subgraph pattern of two                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     (4) Recursion. This is a limitation of the data-centric program-
                                                                                                                                                                 states connected together and compares source and sink Access                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                mingmodel[13], as recursion is a control-centric concept
                                                                                                                                                                 nodes using symbolic set intersection. If no data dependency con-                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            that is not portable across platforms.
                                                                                                                                                                 straints are violated, Access nodes are either fused (if they point to
                                                                                                                                                                 the same memory, see Fig. 4) or set side by side, creating multiple                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               After performing the full translation and coarsening, the re-
                                                                                                                                                                 connected components that can run in parallel.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      sulting gemm kernel can be seen in Figure 5 (left-hand side). This
                                                                                                                                                                                               All transformations in the DaCe framework follow the same                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             data-centric representation can now use the SDFG IR capabilities
                                                                                                                                                                 infrastructure, either matching a subgraph pattern or allowing                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      to further optimize and map the original Python code to different
                                                                                                                                                                 the user to choose an arbitrary subgraph [13]. While the dataflow                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   architectures and distributed systems.
The words contained in this file might help you see if this file matches what you are looking for:

...Productivity portability performance data centric python alexandros nikolaos ziogas timo schneider tal ben nun alexandru calotoiu tiziano de matteis johannes fine licht luca lavarini and torsten hoefler department of computer science eth zurich switzerland abstract has become the facto language for scientific computing programminginpythonishighlyproductive mainlyduetoitsrich oriented software ecosystem built around numpy mod ule asaresult thedemandforpythonsupportinhighperformance hpc hasskyrocketed however itself does not necessarily offer high in this work we present a workflow that retains s while achieving portable across different architectures key features are extensions set automatic optimizations powered by inter mediate representation show results scaling cpu gpu fpga piz daint supercomputer up to cores with x speedups over previous best solutions first ever xilinx intel annotated efficiency on nodes figure overview ccsconcepts softwareanditsengineering parallelprogramminglan ...

no reviews yet
Please Login to review.