Productivity, Portability, Performance: Data-Centric Python

Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, and Torsten Hoefler
Department of Computer Science, ETH Zurich, Switzerland

ABSTRACT

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.

Figure 1: Data-Centric Python overview.

CCS CONCEPTS

• Software and its engineering → Parallel programming languages; Distributed programming languages; Data flow languages; Source code generation.

KEYWORDS

Data-Centric, High Performance Computing, Python, NumPy

ACM Reference Format:
Alexandros Nikolaos Ziogas, Timo Schneider, Tal Ben-Nun, Alexandru Calotoiu, Tiziano De Matteis, Johannes de Fine Licht, Luca Lavarini, and Torsten Hoefler. 2021. Productivity, Portability, Performance: Data-Centric Python. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21), November 14–19, 2021, St. Louis, MO, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3458817.3476176

1 INTRODUCTION

Python is the language to write scientific code [30]. The capability to write and maintain Python code with ease, coupled with a vast number of domain-specific frameworks and libraries such as SciPy, Matplotlib, scikit-learn [62], or pandas [74], leads to high productivity. It also promotes collaboration with reproducible scientific workflows shared using Jupyter notebooks [42]. Therefore, numerous scientific fields, ranging from machine learning [2, 61] to climate [72] and quantum transport [75], have already adopted Python as their language of choice for new developments.

As a result, the scientific community now pushes to also make Python the language for writing high-performance code. For most scientific applications, NumPy arrays [36] provide the core data structures, interfaces, and BLAS and LAPACK library interoperability. NumPy is optimized to provide efficient data structures and fast library implementations for many common operations. However, the performance benefits of NumPy are tied to optimized method calls and vectorized array operations, both of which evaporate in larger scientific codes that do not adhere to these constraints. Therefore, there is a significant performance gap between the numpythonic code style and general code written by domain scientists.

In HPC, the three Ps (Productivity, Portability, Performance) are driving recent developments in infrastructure and programming model research to ensure sustainability [14, 53]. In Python, this drive has resulted in many optimized domain-specific libraries and frameworks [2, 17, 46, 61, 72]. Simultaneously, the diversity of the hardware landscape motivated the creation of interface libraries, such as CuPy [56], which provides replacements for NumPy operations on NVIDIA and AMD GPUs, and MPI4PY [22], which offers direct MPI bindings. Lower-level interfaces, such as Cython [12], promise high performance at the cost of writing code that resembles C, while lazy evaluation of array operations [10, 43, 61] enables high-performance runtime systems. Furthermore, a variety of JIT compilers [17, 34, 45] address the performance degradation resulting from the interpreter. Last but not least, runtime systems [10], distributed tasking (Dask) [23], and remote procedure calls [52] further scale Python to distributed systems. Despite the abundance of choices, Python still struggles: while each approach works towards one or more of the Ps, none of them supports all at the same time.
We propose a way to bridge the gap between the three Ps for Python programming using a data-centric paradigm. In particular, we empower Python users with an automatic optimization and specialization toolbox, which spans the entire Python/HPC ecosystem (Fig. 1), from the code through communication distribution to hardware mapping. At the core of the toolbox, we use the Stateful Dataflow multiGraphs (SDFG) [13] data-centric intermediate representation, which enables these optimizations in the form of multi-level data movement transformations. With a data-centric representation, as opposed to library bindings, all data dependencies and potential overlap are inferred statically from the code, and interpreter overhead is mitigated. Compared with implicit and lazy evaluation approaches, we also provide a set of extensions that give power users complete control over parallelism and partitioning schemes, using pythonic principles and syntax (e.g., returning "local view" objects from global data, but allowing users to operate on the "global view" as well).

We demonstrate a wide variety of benchmarks using the automatic toolbox over annotated Python code, both on individual nodes and on the Piz Daint supercomputer. For the former, we show that it consistently outperforms other automatic approaches on multicore CPUs and GPUs, and for the first time show automatic Python HPC compilation results for both Xilinx and Intel FPGAs, which vastly differ in architecture and programming language. In distributed-memory environments, we show high scaling efficiency and absolute performance compared with distributed tasking. Thus, we realize all three Ps within a single system.

The paper makes the following contributions:

• (Productivity) Definition of high-performance Python, a methodology to translate it to a data-centric IR, and extensions to improve said conversion via explicit annotation.
• (Portability) A set of automatic optimizations for CPU, GPU, and FPGA, outperforming the best prior approaches by 2.47× on CPU and 3.75× on GPU on average (geometric mean [1]).
• (Performance) Automatic implicit MPI transformations and communication optimizations, as well as explicit distribution management, with the former scaling to 512 nodes with up to 93.16% efficiency.
2 DATA-CENTRIC PYTHON

The central tenet of our approach is that understanding and optimizing data movement is the key to portable, high-performance code. In a data-centric programming paradigm, three governing principles guide development and execution:

(1) Data containers must be separate from computations.
(2) Data movement must be explicit, both from data containers to computations and to other data containers.
(3) Control flow dependencies must be minimized; they shall only define execution order if no implicit dataflow is given.

In the context of SDFGs, examples of data containers are arrays and scalar data, which have a NumPy-compatible data type, such as int32 or float64.

Python is an imperative language and, therefore, not designed to express data movement. Its terseness makes the process of understanding dataflow difficult, even when compared to other languages like C and FORTRAN, as the types of variables in Python code cannot be statically deduced.

Below, we define high-performance Python programs, discuss the decorators that we must add to Python code to make the dataflow analyzable, and then detail how we translate them into the SDFG data-centric intermediate representation.

2.1 High Performance Python

Our approach supports a large subset of the Python language that is important for HPC applications. The focus lies on NumPy arrays [36] and operations on such arrays. In addition to the low-overhead data structures NumPy offers, it is central to many frameworks focused on scientific computing, e.g., SciPy, pandas, Matplotlib. As opposed to lazy evaluation approaches, high-performance Python must take control flow into account to auto-parallelize and avoid interpreter overhead. This tradeoff between performance and productivity is necessary because Python features such as coroutines are not statically analyzable and have to be parsed as "black boxes". To combat some of these Python quirks, we propose to augment the language with analyzable constructs useful for HPC.

2.2 Annotating Python

The Data-Centric (DaCe) Python frontend parses Python code and converts it to SDFGs on a per-function basis. The frontend will parse only the Python functions that have been annotated explicitly by the user with the @dace.program decorator. DaCe programs can then be called like any Python function and perform Just-in-Time (JIT) compilation.

Static symbolic typing. To enable Ahead-of-Time (AOT) compilation, which is of key importance for FPGAs and for reusing programs across different inputs, SDFGs should be statically typed. Therefore, the function argument data types are given as type annotations, providing the required information as shown below:

    N = dace.symbol()

    @dace.program
    def jacobi_1d(TSTEPS: dace.int32,
                  A: dace.float64[N],
                  B: dace.float64[N]):
        for t in range(1, TSTEPS):
            B[1:-1] = 0.33333 * (A[:-2] + A[1:-1] + A[2:])
            A[1:-1] = 0.33333 * (B[:-2] + B[1:-1] + B[2:])

The Python method jacobi_1d has three arguments: TSTEPS is a 32-bit integer scalar, while A and B are double-precision floating-point vectors of length N. The symbolic size N, defined with dace.symbol, indicates that the vector sizes can be dynamic (but equal). All subsets are then symbolically defined (e.g., the subset B[1:-1] becomes B[1:N-1]), and symbolic manipulation can then be performed in subsequent data-centric transformations.
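As a usage sketch (the array size and step count here are illustrative assumptions, not taken from the paper), the decorated program is invoked like a regular Python function; on the first call, the frontend deduces the symbolic size N from the shapes of the NumPy arguments and JIT-compiles the program:

    import numpy as np

    A = np.random.rand(1024)  # N is deduced as 1024 from the argument shapes
    B = np.random.rand(1024)
    jacobi_1d(100, A, B)      # first call triggers JIT compilation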
Parametric parallelism. An important feature that has no direct expression in Python is a loop that can run in parallel. Our approach supports explicit parallelism declaration through map scopes, similar to an N-dimensional parallel for. There are two ways to take advantage of this feature. DaCe provides the dace.map iterator, which can be used in Python code as a substitute for the Python built-in range iterator and generates a map scope when parsed:

    for i, j in dace.map[0:M, 0:N]:
        A[i, j] = B[j, i]

Alternatively, the DaCe framework provides a LoopToMap transformation that detects for-loops in the IR whose iterations can be executed safely in parallel (using symbolic affine expression analysis) and converts them to map scopes automatically.
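For illustration, a minimal sketch of the second path follows; the import path and the apply_transformations_repeated call reflect the DaCe framework's public API to the best of our knowledge and should be treated as assumptions. A sequential loop nest with independent iterations is parsed as-is and then parallelized by LoopToMap:

    import dace
    from dace.transformation.interstate import LoopToMap

    M = dace.symbol('M')
    N = dace.symbol('N')

    @dace.program
    def transpose(A: dace.float64[M, N], B: dace.float64[N, M]):
        # Written with the built-in range: parsed as sequential loops.
        for i in range(M):
            for j in range(N):
                A[i, j] = B[j, i]

    # LoopToMap proves the iterations independent (symbolic affine
    # analysis) and converts the loops into parallel map scopes.
    sdfg = transpose.to_sdfg()
    sdfg.apply_transformations_repeated(LoopToMap)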
2.3 From Python to DaCe

We turn to present the SDFG intermediate representation (IR) and a novel data-centric Python translation procedure in tandem. While previous work [13] converted a restricted, low-level Python definition of the SDFG IR, here we aim to cover the majority of the Python/NumPy language constructs via static analysis and fallback for unsupported features. We summarize the equivalence between Python constructs and SDFG counterparts in Table 1, and present the generation of an SDFG from a Python program using the gemm kernel as an example:

    @dace.program
    def gemm(alpha, beta, C, A, B):
        C[:] = alpha * A @ B + beta * C

Table 1: Mapping of Python syntax and constructs to SDFG.

    Declarations and Types:
      Primitive data types -> Scalar data container
      NumPy array -> Array data container
    Assignments:
      Assignment -> Tasklet or map scope with incoming and outgoing memlets for read/written operands
      Array subscript -> Memlet (dataflow edge)
    Statements:
      Branching (if) -> Branch conditions on state transition edges
      Iteration (for) -> Conditions, increments on state transition edges
      Control flow (break, continue, return) -> Edge to control structure or function exit state
    Functions:
      Function calls (with source) and decorator for argument types -> Nested SDFG for content; memlets reduce shape of inputs and outputs
      External/library calls -> Tasklet with callback or Library Node

Our first pass traverses the Python AST to simplify the expressions into individual steps, similar to static single assignment [7], respecting order of operations:

    tmp0 = alpha * A
    tmp1 = tmp0 @ B
    tmp2 = beta * C
    C = tmp1 + tmp2

The first step in the above code multiplies each element of A with alpha. SDFGs view data containers separately from the computations the data are part of, as per the first data-centric tenet. These containers are represented by oval Access nodes. In the first statement, these refer to tmp0, alpha, and A (see Fig. 2a).

Figure 2: SDFG representations of (a) the element-wise array operation tmp0 = alpha * A and (b) an augmented assignment with WCR.

In SDFGs, connections to data containers are called memlets, and they describe the data movement: the edge direction indicates whether it is read or written, and its contents refer to the part of the data container that is accessed. Computations consume/produce memlets and can be divided into multiple types:

(1) Stateless computations (Tasklets, shown as octagons), e.g., representing scalar assignments such as a = 1.
(2) Calls to external libraries (Library Nodes, folded rectangles), which represent calls to functions that are not in the list of functions decorated with dace.program. Matrix-matrix multiplication is a common and important operation (tmp1 = tmp0 @ B) and is contained in a Library Node called MatMul.
(3) Calls to other SDFGs (Nested SDFGs, rectangles), which represent calls to functions decorated with dace.program.
(4) Maps, a particular type of Nested SDFG matching the language augmentation previously discussed in Section 2.2, which express that their content can be processed in parallel.

Element-wise array operations automatically yield Map scopes. Similarly, assignments to arrays yield a Map scope containing a Tasklet with the element-wise assignment. Augmented assignments such as C += 1 are a special case where the output is also an input when no data races are detected.

Parallel maps can be augmented to express how Write-Conflict Resolution (WCR) should determine the value of data when multiple sources write to it concurrently. If data races are found, the outgoing edges are marked as dashed. For example, the following program requires WCR (its SDFG representation is shown in Fig. 2b):

    for i, j in dace.map[0:NI, 0:NJ]:
        alpha += C[i, j]
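Relating these constructs back to gemm: the first simplified statement, tmp0 = alpha * A, is an element-wise operation and therefore yields a Map scope (Fig. 2a). The following sketch spells out the equivalent explicit form using the dace.map syntax of Section 2.2; the symbolic sizes NI and NK and the hand-written temporary are illustrative assumptions (the frontend normally allocates the temporary internally):

    import dace

    NI = dace.symbol('NI')
    NK = dace.symbol('NK')

    @dace.program
    def scale(alpha: dace.float64,
              A: dace.float64[NI, NK],
              tmp0: dace.float64[NI, NK]):
        # One tasklet per (i, k) element; no write conflicts, so no WCR.
        for i, k in dace.map[0:NI, 0:NK]:
            tmp0[i, k] = alpha * A[i, k]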
This lowing this initial conversion, the resulting SDFG contains a state doesnotexcludeprogramsusingthefullfeaturesetofPythonfrom per statement and a nested SDFG per function call. analysis, but calls to functions containing unsupported features will not benefit from our optimization. The restricted features are: 2.4 DataflowOptimization (1) Python containers (lists, sets, dictionaries, etc.) other than Thedirect translation to SDFGs creates a control-centric version NumPy arrays as arguments, as they are represented by of the code, which matches the Python semantics but does not linked lists and difficult to analyze from a dataflow view. contain dataflow beyond a single statement (similarly to compilers’ Note that this does not preclude internal containers (which -O0). To mitigate this, we run a pass of IR graph transformations can be analyzed) and list comprehensions. that coarsens the dataflow, exposing a true data-centric view of the (2) Dynamictyping:fixedstructsareallowedinSDFGs,sofields code (similar to -O1). The transformations include redundant copy can be transformed and their class methods into functions removals, inlining Nested SDFGs, and others (14 in total), which decorated with the dace.program decorator. However, dy- onlymodifyorremoveelementsinthegraph,suchthattheycannot namic changes to classes or dynamic types are unsupported be applied indefinitely. as their structure cannot be statically derived. Tounderstandthispass,weshowcaseonetransformationÐstate (3) Control-dependent variable state (i.e., no scoping rules), e.g., fusion Ð which allows merging two states where the result does the following valid Python: not produce data races. For example, the two states containing the assignments below can be merged: x = ... if x > 5: tmp0 = alpha * A y = np.ndarray([5, 6], dtype=np.float32) tmp1 = tmp0 @ B # use y (will raise exception if x <= 5) Internally, the transformation matches a subgraph pattern of two (4) Recursion. This is a limitation of the data-centric program- states connected together and compares source and sink Access mingmodel[13], as recursion is a control-centric concept nodes using symbolic set intersection. If no data dependency con- that is not portable across platforms. straints are violated, Access nodes are either fused (if they point to the same memory, see Fig. 4) or set side by side, creating multiple After performing the full translation and coarsening, the re- connected components that can run in parallel. sulting gemm kernel can be seen in Figure 5 (left-hand side). This All transformations in the DaCe framework follow the same data-centric representation can now use the SDFG IR capabilities infrastructure, either matching a subgraph pattern or allowing to further optimize and map the original Python code to different the user to choose an arbitrary subgraph [13]. While the dataflow architectures and distributed systems.