In the space of data analysis and database management, generating accurate SQL queries is a critical skill. However, even experienced data professionals can sometimes struggle with complex queries, especially when dealing with nested aggregations or intricate table relationships. This challenge has led to the development of AI-powered SQL generators, which promise to simplify the query-writing process.
Our journey begins with a seemingly simple question: "What is the percentage of orders that have more than one product?" While this query might appear straightforward at first glance, it requires careful consideration of data structure and SQL syntax to execute correctly.
In this blog post, we'll explore how two popular AI SQL generators - SQLCoder2 and SQLCoder-34B - attempted to tackle this problem, and why their solutions fell short. We'll then delve into the process of custom training a model to generate the correct query, highlighting the importance of tailored datasets and fine-tuning in achieving accurate results.
By sharing this experience, we aim to shed light on the current capabilities and limitations of AI in SQL generation, as well as demonstrate a practical approach to improving these tools for specific use cases. Whether you're a data analyst, a machine learning enthusiast, or simply curious about the intersection of AI and database querying, this post will provide valuable insights into the challenges and potential solutions in this field.
Join us as we unravel the complexities of SQL query generation and explore how custom AI training can lead to more reliable and efficient database interactions.
To address our question about the percentage of orders with more than one product, we turned to two popular AI-powered SQL generators: SQLCoder2 and SQLCoder-34B. Both models are known for their ability to generate SQL queries from natural language prompts. Let's examine how they performed with our specific task.
We presented SQLCoder2 and SQLCoder-34B with the following prompt:
What is the percentage of orders that have more than one product?
This straightforward question aimed to test the models' ability to interpret the request and generate an appropriate SQL query.
SQLCoder2 responded with the following SQL query:
SELECT CAST(COUNT(DISTINCT CASE WHEN COUNT(shopify_orderlineitem.id) > 1 THEN shopify_order.id END) AS FLOAT) / NULLIF(COUNT(DISTINCT shopify_order.id), 0) AS proportion
FROM shopify_order
JOIN shopify_orderlineitem ON shopify_order.id = shopify_orderlineitem.order_id;
Upon executing this query, we encountered the following error:
SQL Error [42803]: ERROR: aggregate function calls cannot be nested
This error indicates that SQLCoder2 tried to nest one COUNT inside another, which standard SQL does not allow: the per-order line-item count would have to be computed in a subquery or derived table before it could be aggregated again.
Next, we tried SQLCoder-34B, which produced the following query:
SELECT CAST(COUNT(DISTINCT o.id) filter (
WHERE COUNT(DISTINCT oi.product_id) > 1) AS FLOAT) / NULLIF(COUNT(DISTINCT o.id), 0) AS proportion
FROM shopify_order o
JOIN shopify_orderlineitem oi ON o.id = oi.order_id
GROUP BY o.id;
Unfortunately, this query also resulted in an error:
SQL Error [42803]: ERROR: aggregate functions are not allowed in FILTER
The error shows that SQLCoder-34B placed an aggregate function inside a FILTER clause, which PostgreSQL does not allow; once again, the per-order product count needs to be computed at a lower query level before it can be used as a condition.
The attempts by both SQLCoder2 and SQLCoder-34B reveal several limitations in their current implementations:
1. Both models understood the intent of the question but tried to combine per-order aggregation and overall aggregation at a single query level.
2. Neither model restructured the logic into a subquery, derived table, or CTE, which is what this kind of percentage calculation requires.
3. Both produced queries that fail with syntax-level errors, so the problem is not merely inefficiency but outright invalid SQL.
These limitations underscore the challenges in creating AI models that can generate correct and efficient SQL queries for a wide range of scenarios. They also highlight the potential benefits of custom training for specific use cases and database schemas.
In the next section, we'll explore how we addressed these limitations through custom training, resulting in a more accurate and efficient solution to our original question.
After encountering limitations with existing SQL generators, we decided to take a custom approach to solve our specific query challenge. This involved preparing a tailored dataset and fine-tuning a model to better understand and generate the required SQL query.
We created a custom dataset centered on our original prompt: "What is the percentage of orders that have more than one product?" Using GPT-3.5, we generated many permutations of this question, along with similar questions that require counting and percentage calculations across multiple tables. This approach allowed us to quickly produce a relevant, varied dataset tailored to our specific use case.
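The exact prompts we used are not reproduced here, but a minimal sketch of this kind of generation loop, assuming the OpenAI Python client, the gpt-3.5-turbo model, and an illustrative prompt wording, might look like this:

# Sketch: generating question/SQL training pairs with GPT-3.5.
# The prompt wording and helper names are illustrative assumptions,
# not the exact ones we used.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SEED_QUESTION = "What is the percentage of orders that have more than one product?"

def generate_pairs(n_variants: int = 20) -> list[dict]:
    prompt = (
        f"Rephrase the question '{SEED_QUESTION}' in {n_variants} different ways, "
        "and for each rephrasing write the PostgreSQL query that answers it using "
        "the tables shopify_order(id) and shopify_orderlineitem(id, order_id, product_id). "
        "Return a JSON list of objects with 'question' and 'sql' keys."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

pairs = generate_pairs()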
Our final dataset consisted of 196 entries, each containing a natural language question paired with its corresponding SQL query. We then split this dataset into training and test sets.
This split provides a substantial amount of data for training while reserving a significant portion for testing, helping to ensure the model's ability to generalize to new, unseen examples.
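Since the exact proportions are not shown above, here is a minimal sketch of how such a split can be made with the Hugging Face datasets library; the 80/20 ratio below is an assumption for illustration only.

# Sketch: turning the 196 question/SQL pairs into train and test splits.
# The 80/20 ratio is an illustrative assumption.
from datasets import Dataset

# pairs is the list of {"question": ..., "sql": ...} dicts built earlier.
dataset = Dataset.from_list(pairs)
splits = dataset.train_test_split(test_size=0.2, seed=42)

train_ds = splits["train"]  # used for fine-tuning
test_ds = splits["test"]    # held out to check generalization
print(len(train_ds), len(test_ds))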
The training process took a total of 43 minutes. This relatively short training time demonstrates the efficiency of fine-tuning an existing model compared to training from scratch, making it a practical approach for customizing AI models for specific tasks.
To assess the quality of our training, we plotted the training loss against the validation loss over the course of the training process. This visualization helps us understand how well the model is learning and whether it's overfitting or underfitting the data.
Our analysis of the plot revealed that the model achieved a good fit: the training loss decreased steadily and then stabilized, the validation loss followed a similar downward trend, and the gap between the two curves stayed small throughout training.
This good fit gives us confidence that our custom-trained model has effectively learned to generate SQL queries for our specific use case without memorizing the training data or failing to capture important patterns.
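The plot itself is not reproduced here, but as a rough sketch, the two curves can be read out of a Hugging Face Trainer's log history after a run like the one sketched later in the training-methodology section; the trainer variable below stands for that assumed Trainer object.

# Sketch: plotting training vs. validation loss from a Trainer's log history.
import matplotlib.pyplot as plt

history = trainer.state.log_history  # list of dicts logged during training

train_points = [(h["step"], h["loss"]) for h in history if "loss" in h]
eval_points = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train_points), label="training loss")
plt.plot(*zip(*eval_points), label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()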
By taking this custom training approach, we were able to address the limitations of existing models and create a solution tailored to our specific SQL generation needs. The next section will reveal the results of this training and the SQL query our model was able to produce.
After our custom training process, we achieved a significant improvement in the model's ability to generate the correct SQL query for our specific problem. Let's examine the successful query and analyze its structure and functionality.
The custom-trained model produced the following SQL query:
SELECT
  (COUNT(*) * 100.0 / (SELECT COUNT(*) FROM shopify_order))
FROM shopify_order
WHERE id IN (
  SELECT order_id
  FROM shopify_orderlineitem
  GROUP BY order_id
  HAVING COUNT(*) > 1
);
1. Overall Structure: an outer query computes the percentage, a scalar subquery counts all orders for the denominator, and a WHERE ... IN subquery identifies the orders that qualify for the numerator.
2. Key Features: the inner subquery groups line items by order_id and uses HAVING COUNT(*) > 1 to keep only multi-product orders, while multiplying by 100.0 forces floating-point division and expresses the result as a percentage.
3. Functionality: the query counts orders with more than one line item and divides that count by the total number of orders, answering the original question directly.
4. Improvements: unlike the earlier attempts, each aggregation happens at its own query level, so no aggregate functions are nested and none appear inside a FILTER clause.
5. Potential Optimization: the IN subquery could be rewritten as a JOIN against a derived table or a CTE, which may perform better on large tables, and aliasing the output column would make the result easier to read.
This query successfully solves the problem, showing the model's improved SQL comprehension and ability to handle complex queries without common errors.
Our training methodology focused on fine-tuning an existing language model to specialize in SQL query generation. We employed a transfer learning approach, leveraging a pre-trained model as our starting point and then adapting it to our specific task using a custom dataset. This method allows us to benefit from the broad knowledge captured in the original model while tailoring its capabilities to our particular use case.
We utilized a robust stack of modern machine-learning tools and libraries for this project; a sketch showing how they fit together follows the list:
1. PyTorch: As our primary deep learning framework, PyTorch provided the foundation for model manipulation and training.
2. Hugging Face Transformers: This library offered easy access to state-of-the-art pre-trained models and tools for fine-tuning.
3. PEFT (Parameter-Efficient Fine-Tuning): A library from Hugging Face that implements various efficient fine-tuning methods, including LoRA.
4. Datasets: Another Hugging Face library used for efficient data handling and preprocessing.
5. Accelerate: This library was used to enable distributed training and mixed precision, optimizing our training process.
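As a rough sketch of how the first pieces of this stack fit together, the snippet below loads a base model and tokenizes the question/SQL pairs from the earlier splits. The repo id (we assume SQLCoder2 via defog/sqlcoder2), the prompt template, and the sequence length are illustrative assumptions rather than our exact configuration; the LoRA and training setup is sketched in the next section.

# Sketch: loading the base model and tokenizing the dataset.
# The repo id, prompt template, and max_length are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "defog/sqlcoder2"  # assumption: SQLCoder2 as the starting checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed later for padded batches

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16)

def format_example(example):
    # Pair each natural-language question with its target SQL in one training text.
    text = f"### Question:\n{example['question']}\n### SQL:\n{example['sql']}"
    tokens = tokenizer(text, truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror the inputs
    return tokens

# train_ds / test_ds are the splits from the earlier dataset sketch.
tokenized_train = train_ds.map(format_example, remove_columns=train_ds.column_names)
tokenized_test = test_ds.map(format_example, remove_columns=test_ds.column_names)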
LoRA (Low-Rank Adaptation) is an efficient fine-tuning technique introduced in 2021 by Microsoft researchers. It addresses the challenge of adapting large language models to specific tasks without retraining all of their parameters, which is computationally expensive and prone to overfitting.
Key Concepts:
1. Low-Rank Matrices: LoRA adds pairs of rank-decomposition matrices alongside selected weight matrices of the original model. One matrix in each pair is initialized to zero, so the pair's product starts as a zero update and the model initially behaves exactly like the pre-trained one.
2. Trainable Parameters: During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen. This significantly reduces the number of trainable parameters.
3. Adaptability: Despite the reduced parameter count, LoRA can adapt the model to new tasks effectively, often matching or exceeding the performance of full fine-tuning.
For more information, see the original paper, "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021). A minimal configuration sketch follows.
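Continuing from the model and tokenized datasets sketched above, here is a minimal PEFT/LoRA training setup. The rank, alpha, target module names, and training arguments are illustrative assumptions, not our exact settings.

# Sketch: wrapping the base model with LoRA adapters and fine-tuning only those adapters.
# All hyperparameter values here are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

lora_config = LoraConfig(
    r=16,                                 # rank of the decomposition matrices
    lora_alpha=32,                        # scaling factor for the low-rank update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: adjust to the base model's module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable; base weights stay frozen

training_args = TrainingArguments(
    output_dir="sqlcoder2-lora",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()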
Custom datasets are crucial for training AI models on domain-specific tasks. In our case, a tailored dataset focusing on complex SQL queries with nested aggregations significantly improved the model's performance. This approach ensures that the AI learns from relevant examples, leading to more accurate and contextually appropriate outputs.
Fine-tuning pre-trained models, rather than training from scratch, offers several advantages:
1. Reduced training time (our process took only 43 minutes)
2. Lower computational resource requirements
3. Ability to leverage the general knowledge of large language models while adapting to specific tasks
4. Improved performance on niche problems that general models might struggle with
These lessons underscore the value of customization in AI development, especially for specialized applications like SQL query generation. By combining the power of existing models with targeted fine-tuning, we can create more effective and efficient AI tools for specific data analysis needs.
As we reflect on our journey of fine-tuning an AI model for SQL query generation, it's clear that while we've made significant progress, there's still room for improvement and expansion. This section explores potential enhancements to our model and discusses how our approach could be applied to other SQL-related tasks.
1. Expanding the training dataset: adding more query patterns, schemas, and edge cases would help the model generalize beyond percentage-style questions.
2. Fine-tuning hyperparameters: experimenting with the learning rate, LoRA rank, and number of epochs could yield further accuracy gains.
3. Incorporating user feedback: corrections from analysts who run the generated queries could be fed back in as new training examples.
Beyond improving our own model, the same fine-tuning approach could be applied to other SQL-related tasks:
1. Query optimization: rewriting slow queries into more efficient forms.
2. Schema design: suggesting table structures, keys, and indexes from a plain-language description of the data.
3. SQL to natural language: explaining existing queries in plain English for non-technical stakeholders.
4. Data cleaning and preprocessing: generating queries that detect and fix inconsistent or duplicate records.
5. Temporal and spatial data queries: handling date/time arithmetic and geographic functions that general models often get wrong.
6. Security and access control: generating role and permission statements, or flagging queries that touch sensitive data.
By pursuing these improvements and exploring new applications, we can continue to push the boundaries of AI-assisted SQL query generation.
While our custom-trained model successfully generated the correct SQL query, it's important to acknowledge the limitations of this approach:
1. Dataset Specificity: The model was trained on a custom dataset tailored to this particular problem. This means it may not generalize well to significantly different SQL query tasks without additional training.
2. Time and Resource Intensive: Creating a custom dataset and training the model required significant time and computational resources. This approach may not be feasible for every SQL query challenge, especially in time-sensitive situations.
3. Limited Scope: Our solution addresses a specific type of SQL query. It may not be equipped to handle a wide range of SQL operations or complex database schemas without further training.
4. Maintenance and Updates: As database structures or requirements change, the model may need to be retrained or fine-tuned, which requires ongoing effort and expertise.
5. Dependency on Initial Models: Our approach is built upon existing models (SQLCoder2). The quality and limitations of these base models influence the final output, even after custom training.
We started with a challenging task: generating an SQL query to find the percentage of orders with multiple products. Off-the-shelf AI models like SQLCoder2 and SQLCoder-34B couldn't solve this problem correctly. By creating a custom dataset and fine-tuning a model, we successfully generated an accurate SQL query for our specific need.
This experience shows that while AI models for SQL generation are promising, they still have limitations. Custom training can significantly improve their performance for specific tasks. As AI continues to evolve, we can expect more accurate and versatile SQL generation tools. However, human expertise remains crucial in training these models and verifying their outputs. The future of database query generation likely involves a collaboration between AI assistance and human knowledge, leading to more efficient and accurate data analysis processes.