From SAS to PySpark: Scintilla’s AI-Driven Transformation
Enterprises worldwide are emphasizing flexibility, scalability, and cost-effectiveness to stay resilient and relevant. Disruptive advances in cloud computing and artificial intelligence are driving organizations to embrace change while keeping customer satisfaction a top priority. Yet many organizations face a significant gap between customer expectations and their operational landscapes, products, or services. In response, they are re-evaluating their ecosystems, including technology, business processes, services, and products, and making difficult strategic decisions, such as decommissioning long-standing solutions in favor of more forward-looking offerings. These changes aim to improve customer reach, strengthen data transformation journeys, and enable future business use cases at a fraction of the cost.
SAS, renowned for its statistical, analytical, and domain-specific solutions, has been widely used across industries. However, limitations around proprietary licensing, interoperability, and integration, combined with constantly improving cloud and AI offerings, are leading organizations to explore technology stacks beyond SAS. The modern requirement across industries is to move SAS processes to non-SAS platforms, optimizing cost and technological efficiency. This transition brings the challenge of accurately converting SAS code for platforms such as PySpark, Snowflake, and Databricks.
SAS modernization gaining ground
We have completed several prototypes and successful real-world SAS modernization projects, navigating the challenges inherent to such migrations. Our specialization includes SAS modernization to Databricks and PySpark-based platforms. To master the conversion process and overcome its challenges, we designed a solution: a migration approach for transforming an existing SAS ecosystem to a completely different stack, be it PySpark, Databricks, or Snowflake. We have also created accelerators for automated and accurate code conversion.
Guiding several organizations through their modernization journeys has also given us insight into how things work and what works best. A Prudent Markets study indicates a 33.9% CAGR in the adoption of Apache Spark. The verdict is clear: PySpark has emerged as a widely accepted alternative to SAS owing to its capabilities and open-source architecture. Our proprietary accelerator, Scintilla, is a dependable companion on the SAS modernization journey, streamlining processes, speeding up code conversion, and ensuring efficiency and quality.
Accelerating SAS modernization with Scintilla
Scintilla is our flagship accelerator, designed to streamline and speed up code conversion during SAS modernization. It is a pattern-driven converter that learns from past conversion exercises to improve each subsequent iteration. The accelerator also unlocks the benefits of modern big data processing and analysis.
A smart analyzer included in Scintilla summarizes SAS coding standards and simplifies the complexities of moving workloads. Powered by generative AI (Gen AI), the accelerator offers a versatile approach that is speedy and accurate. Its capabilities include:
- SAS Code Analysis and Lineage Assessment Reports
- SAS Code Transpilation to PySpark
- PySpark Code Optimization
- PySpark Code Analysis and Documentation
- Synthetic Data Generation
- Test Case Generation
Integrating Gen AI and LLM with Scintilla
Experts and enthusiasts worldwide are keen to explore, adapt, and create applications based on Gen AI for better and faster outcomes. Well, so are we. We infused Scintilla with Gen AI and LLM capabilities for SAS to PySpark code transpilation. During our research, we found that the proprietary LLMs and open-source foundation models we evaluated understood logic and pseudocode well. However, they could not handle code conversion tasks accurately, especially for complex SAS code. Given these limitations, the models could not be used as-is for complex SAS to PySpark migration. It was therefore crucial to tune the models, starting by selecting an appropriate LLM as the baseline. We identified the following tools and methods to optimize and speed up the code conversion process:
- Efficient model for SAS code conversion and documentation: Google's Gemini Pro or Gemini Flash
- Suitable tuning methods: prompt engineering, few-shot tuning
- Training methods evaluated: RAG (retrieval-augmented generation)
- Training tools evaluated: Vertex AI, Google Colab
Based on these results, we finalized our choices and proceeded to tune the identified LLM, creating a distilled (child) model for internal evaluation.
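The few-shot tuning approach mentioned above can be sketched in pure Python. This is a simplified, hypothetical illustration (the example pairs, the `build_conversion_prompt` helper, and the DataFrame names are invented for this sketch, not Scintilla's actual prompts): each verified SAS-to-PySpark pair is prepended to the conversion request sent to the model.

```python
# Hypothetical few-shot prompt construction for SAS-to-PySpark
# conversion; the example pairs below are illustrative only.
FEW_SHOT_EXAMPLES = [
    (
        "PROC SORT DATA=work.sales OUT=work.sorted; BY region; RUN;",
        'sorted_df = sales_df.orderBy("region")',
    ),
    (
        "DATA work.high; SET work.sales; WHERE amount > 100; RUN;",
        'high_df = sales_df.filter(F.col("amount") > 100)',
    ),
]

def build_conversion_prompt(sas_code: str) -> str:
    """Assemble a few-shot prompt from verified conversion pairs."""
    parts = ["Convert the following SAS code to PySpark.", ""]
    for sas, pyspark in FEW_SHOT_EXAMPLES:
        parts += [f"SAS:\n{sas}", f"PySpark:\n{pyspark}", ""]
    parts += [f"SAS:\n{sas_code}", "PySpark:"]
    return "\n".join(parts)

prompt = build_conversion_prompt(
    "PROC MEANS DATA=work.sales; VAR amount; CLASS region; RUN;"
)
```

Because the prompt grows with each example pair, in practice only the pairs most similar to the input construct would be selected, keeping token costs down.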
Scintilla now delivers enhanced conversion and documentation results with this newly fine-tuned model and its underlying artifacts. The integration of Scintilla and the LLM was seamless, fitting perfectly into Scintilla’s code conversion and analysis processes.
Advantages of enhancing Scintilla with Gen AI and LLM
Besides strengthening the tool and accelerating the code conversion process, integrating with Gen AI and LLM:
- Reduces and optimizes the overall cost of communicating with the LLM
- Improves the accuracy of Scintilla's output, specifically for complex SAS code snippets
- Strengthens Scintilla's analytical capabilities and generates accurate documentation with less effort
- Reduces the effort and SAS/PySpark expertise required for manual remediation
Key components of Scintilla
The following modules simplify the SAS code conversion process and make modernization faster.
UI module
This vibrant user interface is the gateway for all activities and process flows related to SAS to PySpark code conversion, documentation, and Gen AI and LLM tasks. It is where the magic begins and ends.
Assessment module
SAS code is meticulously parsed and analyzed at the block level in this module to generate comprehensive assessment reports.
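To illustrate what block-level parsing can look like, here is a minimal, hypothetical sketch (not Scintilla's actual parser): a regular expression splits a SAS program into its DATA and PROC steps so each block can be assessed separately.

```python
import re

# Illustrative sketch: isolate top-level DATA/PROC steps, each of
# which ends with a RUN; or QUIT; statement.
BLOCK_RE = re.compile(
    r"\b(DATA|PROC\s+\w+)\b.*?\b(?:RUN|QUIT)\s*;",
    re.IGNORECASE | re.DOTALL,
)

def split_sas_blocks(program: str):
    """Return each top-level DATA/PROC step found in the program."""
    return [m.group(0).strip() for m in BLOCK_RE.finditer(program)]

sample = """
DATA work.sales; SET raw.sales; amount = qty * price; RUN;
PROC SORT DATA=work.sales; BY region; RUN;
PROC MEANS DATA=work.sales; VAR amount; RUN;
"""
blocks = split_sas_blocks(sample)
print(len(blocks))  # → 3
```

A production parser would also handle macros, nested DO blocks, and comments; the sketch only shows the block-isolation idea that assessment reports build on.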
Transpiler module
The heart of conversion, this utility transforms SAS code into clean, integrated, and syntactically accurate PySpark code. It combines native components with cutting-edge LLM integrations to ensure seamless transitions.
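A minimal, hypothetical sketch of pattern-driven transpilation follows. The rules, the `dataset_name` helper, and the supported SAS patterns are invented for illustration; the actual Transpiler module combines a far larger rule base with LLM fallbacks for constructs no rule matches.

```python
import re

def dataset_name(sas_ref: str) -> str:
    """Map a two-level SAS name like work.sales to a DataFrame variable."""
    return sas_ref.split(".")[-1].lower() + "_df"

# Two illustrative rules; a real rule base would cover many more
# PROC and DATA step patterns.
SORT_RE = re.compile(
    r"PROC\s+SORT\s+DATA=(\S+)\s*;\s*BY\s+(\w+)\s*;\s*RUN\s*;", re.I)
WHERE_RE = re.compile(
    r"DATA\s+(\S+)\s*;\s*SET\s+(\S+)\s*;\s*WHERE\s+(\w+)\s*>\s*(\d+)\s*;\s*RUN\s*;",
    re.I)

def transpile(sas: str) -> str:
    """Translate a supported SAS step into equivalent PySpark code."""
    if m := SORT_RE.match(sas.strip()):
        df = dataset_name(m.group(1))
        return f'{df} = {df}.orderBy("{m.group(2)}")'
    if m := WHERE_RE.match(sas.strip()):
        out, src = dataset_name(m.group(1)), dataset_name(m.group(2))
        return f'{out} = {src}.filter(F.col("{m.group(3)}") > {m.group(4)})'
    raise ValueError("unsupported SAS pattern; route to the LLM fallback")

print(transpile("PROC SORT DATA=work.sales; BY region; RUN;"))
# → sales_df = sales_df.orderBy("region")
```

The final `ValueError` branch stands in for the hand-off to the AI repository: anything the deterministic rules cannot translate is escalated to the fine-tuned LLM.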
Core repository
This essential module houses a specialized code dictionary developed by our experts. Working alongside the LLM, it facilitates the conversion of SAS to PySpark code and continuously evolves to enhance accuracy and capability.
AI Repository powered by Google Gemini Pro LLM
This module fine-tunes the LLM to handle complex SAS logic that may be difficult for the core repository, ensuring precise PySpark code generation. For this purpose, it leverages a sophisticated code dictionary we developed in house.
Code Optimizer powered by Google Gemini Pro LLM
This module uses the code dictionary to fine-tune the LLM, optimizing the PySpark code generated by the Transpiler module for peak performance.
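One kind of rule-based optimization pass can be sketched in pure Python. This is an invented illustration, not the production optimizer: it fuses consecutive `.filter()` calls in generated code into a single combined predicate. (Spark's Catalyst optimizer often performs such fusion at execution time anyway; the sketch only shows source-level rewriting of generated code.)

```python
def _match_paren(code: str, open_idx: int) -> int:
    """Return the index of the ')' matching the '(' at open_idx."""
    depth = 0
    for i in range(open_idx, len(code)):
        if code[i] == "(":
            depth += 1
        elif code[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses")

def fuse_chained_filters(code: str) -> str:
    """Rewrite .filter(a).filter(b) as .filter((a) & (b))."""
    i = code.find(".filter(")
    while i != -1:
        open1 = i + len(".filter")
        close1 = _match_paren(code, open1)
        if code[close1 + 1:].startswith(".filter("):
            open2 = close1 + 1 + len(".filter")
            close2 = _match_paren(code, open2)
            a = code[open1 + 1:close1]
            b = code[open2 + 1:close2]
            code = code[:i] + f".filter(({a}) & ({b}))" + code[close2 + 1:]
            i = code.find(".filter(", i)  # re-check for longer chains
        else:
            i = code.find(".filter(", close1)
    return code

snippet = 'df = sales_df.filter(F.col("amount") > 100).filter(F.col("region") == "EU")'
print(fuse_chained_filters(snippet))
# → df = sales_df.filter((F.col("amount") > 100) & (F.col("region") == "EU"))
```

The balanced-parenthesis scan is what keeps the rewrite safe when filter conditions themselves contain parentheses, which a naive regex would mangle.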
TechWriter powered by Google Gemini Pro LLM
This module analyzes both SAS and PySpark code to produce detailed technical documentation. Enhanced by LLM, it offers superior code analysis and reduces the need for manual intervention.
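The kind of structural summary a documentation module could extract before an LLM writes the narrative can be sketched as follows. This is a hypothetical, simplified example (the `summarize_pyspark` helper and the patterns it recognizes are invented for illustration):

```python
import re

def summarize_pyspark(code: str) -> dict:
    """Collect inputs, transformations, and outputs from a PySpark script."""
    return {
        "inputs": re.findall(r'spark\.read\.\w+\("([^"]+)"\)', code),
        "transformations": sorted(set(
            re.findall(r"\.(filter|join|groupBy|agg|orderBy|withColumn)\(", code))),
        "outputs": re.findall(r'\.write\.\w+\("([^"]+)"\)', code),
    }

script = '''
sales_df = spark.read.parquet("/data/sales")
summary = sales_df.filter(F.col("amount") > 0).groupBy("region").agg(F.sum("amount"))
summary.write.parquet("/data/summary")
'''
info = summarize_pyspark(script)
print(info["transformations"])  # → ['agg', 'filter', 'groupBy']
```

Feeding such a structured inventory of sources, transformations, and sinks to the LLM, alongside the raw SAS and PySpark code, is one way to anchor generated documentation in verifiable facts rather than free-form summarization.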
Test case generation using Gen AI
This module uses the LLM to generate test cases for the optimized PySpark code produced by the Code Optimizer module.
Synthetic data generation
This module generates synthetic data using the dbldatagen library. Data can be generated in two ways: from sample data or from a schema alone. The output can also be customized with minimum and maximum values and other parameters.
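The schema-driven idea with min/max bounds can be illustrated with a pure-Python sketch (the module itself uses dbldatagen on Spark; the schema, column names, and `generate_rows` helper here are invented for illustration):

```python
import random

# Hypothetical column specification mirroring the kind of min/max and
# value-list constraints the dbldatagen-based module supports.
SCHEMA = {
    "order_id": {"type": "int", "min": 1, "max": 10_000},
    "amount": {"type": "float", "min": 5.0, "max": 500.0},
    "region": {"type": "choice", "values": ["NA", "EU", "APAC"]},
}

def generate_rows(schema: dict, n: int, seed: int = 42):
    """Generate n rows of synthetic data constrained by the schema."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    rows = []
    for _ in range(n):
        row = {}
        for col, spec in schema.items():
            if spec["type"] == "int":
                row[col] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                row[col] = round(rng.uniform(spec["min"], spec["max"]), 2)
            else:
                row[col] = rng.choice(spec["values"])
        rows.append(row)
    return rows

rows = generate_rows(SCHEMA, 5)
print(len(rows))  # → 5
```

On Spark, dbldatagen expresses the same constraints declaratively and builds a distributed DataFrame, which is what makes schema-only generation practical at scale.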
Conclusion
Scintilla’s innovative integration of generative AI algorithms and models has transformed the code conversion process, making it more efficient and accurate. Each module, from the vibrant UI to the meticulous TechWriter, plays a crucial role in this transformation. With the seamless integration of cutting-edge LLM technology, Scintilla simplifies SAS to PySpark migration and ensures high-quality documentation and optimized code performance. This comprehensive approach positions Scintilla as a powerful tool for modern data engineering needs.
More from Ramesh Vanteru