A Comprehensive Guide to Snowpark – Part 2
Background:
We’re glad to have you back for the second part of our extensive Snowpark guide. In our previous blog, we delved into the different execution modes of Snowpark code and obtained a more profound comprehension of the inner workings of Snowpark execution. In this article, our attention will shift towards exploring the constraints of language-specific Stored Procedures (SPs) and immersing ourselves in the remarkable solution offered by User-Defined (Table) Functions (UD(T)Fs) to unlock Snowpark’s potential.
Limitations of SPs
Snowpark SPs have certain limitations that should be considered:
- Firstly, SPs are primarily designed to function on a single node. During the execution of an SP, the Snowpark DataFrames and SQL statements are transmitted to the Snowflake SQL engine, facilitating performance improvements and scale-out advantages. However, the execution of language-specific constructs within the SP is limited to the single node in the Snowflake virtual warehouse. Consequently, if the entire code runs on a Python Virtual Machine (PVM) of a single node, it becomes necessary to transfer all the data to that node for processing. This can result in notable performance difficulties, especially when handling extensive datasets.
- Moreover, SPs lack support for concurrent queries and process creation. As a result, each statement or query within an SP executes sequentially, regardless of their interdependence.
- Another constraint pertains to memory usage. SPs have memory limitations, and if the code requires more memory than what has been allocated, the execution may fail. Thus, carefully managing the code’s memory requirements is crucial for successful execution.
So, now you might be wondering if there is a way to run your language-specific code in Snowflake, taking advantage of all the nodes of your virtual warehouse. UD(T)Fs are there to help you out.
How are UD(T)Fs solving the SPs problem?
Fig 11: UD(T)F Execution Flow
Figure 11 illustrates the execution flow of UD(T)Fs:
- Python employs a Global Interpreter Lock (GIL) to ensure the sequential execution of threads, preventing the simultaneous utilization of multiple processor cores in a multi-threaded environment.
- To overcome the limitations of the GIL, Snowflake employs multiple interpreter processes for executing Python functions within a query. Each process operates an independent Python interpreter instance, enabling parallel execution of threads across different processes. For instance, when dealing with a million-row table like “my_table,” Snowflake may generate multiple interpreter processes to parallelize computations. Each process would concurrently execute the UDF function on a subset of rows, with the results merged to produce the outcome.
- The platform initializes the Python interpreter when executing a Python function in Snowflake. This initialization process entails configuring the Python environment, importing modules, and other necessary steps for executing Python code.
- After the Python interpreter initialization, Snowflake forks additional processes to execute the Python function. This approach reduces the time required to initialize the interpreter for each function’s execution.
- Upon completion of the query, packages and imports remain stored in the local cache. However, the sandbox and interpreters are cleaned up. Subsequent executions of the same query or another query employing the same packages and imports will benefit from faster execution as the initialization steps are avoided.
- In summary, Snowflake strives to leverage the full processing capability of the warehouse when running queries with Python or Java UDFs. This involves redistributing rows among nodes in the warehouse to parallelize resource-intensive computations.
- Furthermore, Snowflake collects historical statistics and assesses the performance of UDFs (whether fast or slow) to optimize workload distribution among workers, aiming to achieve the fastest possible query execution.
UD(T)Fs address the SPs problem by allowing distributed execution on multiple nodes. When a UDTF is invoked in a query, Snowflake automatically partitions the input data and distributes it among the nodes in the warehouse. Each node then independently executes the UDTF on its assigned data subset. This parallel execution allows multiple nodes to process different input portions simultaneously, significantly improving performance.
Once the UDTF execution is complete on each node, the intermediate results are combined or merged to produce the final output. Snowflake handles this merging process transparently, providing a unified result set to the query. By leveraging the distributed computing capabilities of the Snowflake data warehouse, UD(T)Fs provide a scalable and performant solution to the SPs problem. They allow for the efficient utilization of multiple nodes and enable parallel processing, thereby overcoming the limitations of executing code on a single node.
Summary of Snowpark objects
Snowpark best practices
- Utilizing Snowpark DataFrame instead of Pandas DataFrame offers significant performance boost. It is considered a best practice to maximize the use of Snowpark DataFrames whenever possible.
- Vectorized UDFs offer a performance boost for numerical computations, making them ideal for enhancing performance in such scenarios. However, for non-numeric computations, using vectorized UDFs may lead to performance degradation, so it’s best to stick with non-vectorized UDFs for optimal performance.
- It is recommended to minimize data retrieval directly into language-specific stored procedures. SPs execute on a single node within the virtual warehouse, limiting computational resources. SPs are most effective for controlling flow, acting as the main script, initializing objects, and maintaining state. Data-intensive computations within SPs should use the DataFrame API or separate UD(T)Fs for improved performance.
- Caching Snowpark DataFrames can optimize performance by avoiding repetitive transformations. However, excessive use should be avoided, as it may lead to performance issues. Selectively applying cache operations and considering query frequency helps prevent disk spillage and unnecessary processing time.
- While the Snowpark client API bears similarities to the Pyspark API, it’s crucial to acknowledge that Snowpark generates an underlying SQL query. Therefore, when utilizing Snowpark, it is advisable to consider and adhere to SQL best practices.
CONCLUSION
In the second part of our Snowpark guide, we examined the limitations of Snowpark Stored Procedures (SPs) and the solution offered by User-Defined (Table) Functions (UD(T)Fs) to overcome these limitations. We looked into the execution flow of UD(T)Fs, which involves Snowflake using multiple interpreter processes to execute Python functions concurrently. Furthermore, we shared best practices for Snowpark usage. By adhering to these practices and harnessing the distributed computing capabilities of Snowpark, users can fully unleash Snowpark’s potential for scalable, high-performance, and efficient data processing and analytics.
References:
- Snowflake. (2023). Snowpark – Snowflake Documentation. Retrieved from https://docs.snowflake.com/en/developer-guide/snowpark/index
- Snowflake. (2022, January 10). Operationalizing Snowpark Python – Part One. Medium. Retrieved from https://medium.com/snowflake/operationalizing-snowpark-python-part-one-892fcb3abba1
- Snowflake Developers, Jan5, 2023, Snowpark Java & Python UDFs DEMO | Under The Hood, YouTube. Retrieved from https://www.youtube.com/watch?v=tT0jCX_Bjok
More from Poorna Chand Addala
Background Snowpark is a new developer framework of Snowflake composed of a client-side library…
Latest Blogs
Introduction Artificial Intelligence (AI) is transforming industries and redefining possibilities…
Introduction The evolution of artificial intelligence (AI) has been a remarkable journey,…