
Talking to your SQL database: Text-to-SQL automation in practice

Markus Pernpointner

published on January 27, 2026

Why Text-to-SQL matters for data-driven organizations

Over the last few years, the capabilities of large language models (LLMs) have increased steadily, and their range of applications has broadened. This versatility was achieved by fine-tuning large foundation models to specific tasks (e.g. instruction following) or domains (e.g. legal texts). These models exhibit excellent performance in producing structured outputs such as code, JSON or SQL. The focus of this article is a business example using Text-to-SQL.

In modern organizations, vast amounts of business and proprietary data are stored in relational databases, and accessing them efficiently often requires trained staff with a good understanding of data structures and data management principles. Many colleagues who need to work with this data lack sufficient SQL skills, which makes efficient data use more difficult. A typical business scenario based on information from relational databases may look like this: the marketing department of a company needs last year's sales figures for product A across all branches in Hamburg and Berlin.

This is where Text-to-SQL systems help bridge the gap between queries phrased in natural language and the extraction of information from relational databases. The quality that LLMs currently achieve makes SQL generation a reliable process (see reference [1] for a timely review). In business contexts like the one described above, we therefore see high relevance for Text-to-SQL systems, which combine a user-friendly natural language interface with access to a wide range of existing databases, including data warehouses and data marts.

In this blog post, we summarize our findings from developing and testing a standard Text-to-SQL use case built on a sales database. We describe the building blocks that make up the complete system, dive deeper into the corresponding prompt engineering and outline a typical conversational flow. It turned out that using a capable LLM for SQL generation in the user application allows for interaction with the database at various levels:

Interaction levels in LLM-based Text-to-SQL systems

Direct interaction with database schema and raw data

L0: The immediate level. The user prompt refers to questions concerning the database schema or raw content. Examples: "Show me the current database schema" or "Return all values of the column 'brand'".

Natural language queries and SQL generation

L1: The query level. This is the conversational level most important for the use case, where concrete information retrieval tasks based on the schema are issued. Example: "Show me all product names in the 'cosmetics' category that customers from 'Hamburg' bought last month". By combining the information from the database schema with the query terms, the LLM is able to generate the corresponding SQL statement, which an orchestrator subsequently forwards to the query engine. In the resulting query string for our product A example, the tables/columns 'customer', 'category', 'product.name', 'customer.address' and 'sales.date' will be addressed together with the required joins and quantifiers.
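To make this concrete, the generated SQL for the cosmetics example might look roughly like the following minimal sketch; the table and column names here are illustrative and do not match the actual Contoso schema:

```python
# Hypothetical T-SQL as the LLM might generate it for the cosmetics example.
# All table and column names are illustrative, not the actual Contoso schema.
generated_sql = """
SELECT DISTINCT p.name
FROM product AS p
JOIN sales AS s    ON s.product_id = p.id
JOIN customer AS c ON c.id         = s.customer_id
WHERE p.category = 'cosmetics'
  AND c.address LIKE '%Hamburg%'
  AND s.date >= DATEADD(month, -1, GETDATE())
"""
print(generated_sql)  # the orchestrator would forward this string to the query engine
```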

Context-aware and referential queries

L2: The referential level. A key advantage of conversational memory lies in its ability to let users reference previous queries easily, eliminating the need to re-enter the entire query when only slight adjustments are made. Let’s get back to the example above: the same information with respect to category and time frame is needed, but for the Berlin branch. Entering “Repeat the last query for our Berlin branch” will do the job. Depending on the length of the conversation window and the complexity of the queries, a chat with the database at the SQL level is possible.

L2-type communication also happens as soon as the user gives feedback to steer or correct generated SQL. Because it has seen various database dialects, the LLM adapts well to the specifics of a database system, but it can still make mistakes during generation. Once a problem is detected by strict syntax and format checking, a correction prompt can be provided and enriched with linter output: "In your last query there is the following problem: <linter output>. Try to correct it". This approach worked well in the rare cases where syntax or formatting issues occurred. In a production setting, users should not be exposed to SQL correction playback. During development, however, a careful analysis of linter outputs helped to shape the system prompt, the schema linking strategy and the postprocessing steps. These factors gain importance with more advanced database schemas or highly complex queries.
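A minimal sketch of how such a correction prompt can be assembled from linter findings, using SQLFluff's simple Python API (the dialect name is assumed to match the target MS SQL Server instance):

```python
import sqlfluff

def build_correction_prompt(sql: str) -> str | None:
    """Lint generated SQL and turn the findings into an L2 correction prompt.

    sqlfluff.lint() returns a list of violation records; we use only the
    stable 'code' and 'description' fields here.
    """
    violations = sqlfluff.lint(sql, dialect="tsql")
    if not violations:
        return None  # SQL passed all checks, nothing to correct
    findings = "\n".join(f"- {v['code']}: {v['description']}" for v in violations)
    return (
        "In your last query there is the following problem:\n"
        f"{findings}\n"
        "Try to correct it."
    )
```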

To summarize: by using capable LLMs in SQL-driven business cases, information becomes easily accessible through a Text-to-SQL system, leading to faster insight generation, higher data autonomy and better use of resources.

Constraints and workflow of our demonstration case

Our goal is to demonstrate how a robust and extensible Text-to-SQL workflow can be set up, from user query through SQL generation and verification to delivery of the corresponding database results. In actual customer use cases we typically face the following constraints:

  • The LLM is accessed via an API; neither self-hosting nor fine-tuning is considered. Customers generally do not have the hardware to further refine models, and they want to become productive quickly.
  • For an arbitrary customer database we do not have gold standards or ground truths at hand. This implies that, at the beginning, no general validations relying on Execution Accuracy (EXE), Exact Set Matching Accuracy (ESM) or other metrics are performed (see reference [2] for a detailed analysis of these quality metrics). Validations can come in later, when there is a need to generate reference SQL queries and data. We enable later testing by generating, for each user prompt, an abstract syntax tree (AST) from the SQL query together with corresponding reference data from the customer database.
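A minimal sketch of this idea, using sqlglot for the AST; the storage backend (a JSON lines file) and the function shape are assumptions for illustration:

```python
import json
import sqlglot

def store_reference(nl_prompt: str, sql: str, rows: list[tuple]) -> dict:
    """Persist an (NL prompt, AST, result) triple for later regression tests.

    The AST's canonical SQL serves as a normalized key, so a future system
    version can be checked against today's behaviour even without an
    upfront gold standard.
    """
    ast = sqlglot.parse_one(sql, read="tsql")
    record = {
        "prompt": nl_prompt,
        "canonical_sql": ast.sql(dialect="tsql"),  # normalized form of the AST
        "reference_rows": rows,                    # data fetched from the customer DB
    }
    # Storage backend is an assumption; a JSON lines file stands in here.
    with open("reference_cases.jsonl", "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, default=str) + "\n")
    return record
```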

The figure below outlines the workflow.

Workflow of an enterprise Text-to-SQL system showing user interaction, LLM-based SQL generation, validation steps and database query execution.

End-to-end workflow from natural language query to validated SQL execution

The user on the left side of the figure above interacts with the system through a dialogue window where prompts are entered and results are played back. The conversational memory tracks user prompts and generated SQL so that an L2-level conversation is possible. For each query, the LLM receives the system prompt together with the conversation that has taken place so far. Specific information such as the database schema, database type and SQL dialect is inserted into the system prompt at system startup. In our implementation, validation messages from the syntax checkers and linters are played back as well, for control and debugging purposes. Once the SQL has passed all checks, the AST and the retrieved data are stored to allow for additional verification against a customer gold standard. After longer conversations it is advisable to clear the conversation memory to avoid any bias that may have built up over a chain of queries. Since context lengths are limited, we maintain a conversation memory window of length 20 (a sketch of this windowing follows after the component list below). For our business case demonstration the following components were used:

  • Chainlit as UI and orchestrator
  • GPT-4o-mini as the generating LLM
  • Dockerized MS SQL Server as a common SQL engine together with required database drivers
  • Microsoft Contoso data set as a good starting point with realistic, anonymized customer data
  • SQLglot as a strict, SQL dialect-aware syntax checker and AST generator
  • SQLFluff as a powerful SQL dialect-aware linter, useful for detecting formatting problems and other hidden issues even in syntactically correct SQL statements.
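Returning to the conversation memory mentioned above, a minimal sketch of the windowing (message format follows the OpenAI chat convention; the system prompt is kept outside the window so it never falls out of context):

```python
from collections import deque

# Sliding window over the last 20 conversation turns; the system prompt
# (with schema, database type and SQL dialect inserted at startup) is
# kept separately so it is always present.
WINDOW_SIZE = 20
history: deque[dict] = deque(maxlen=WINDOW_SIZE)

def build_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    """Assemble the message list for the next LLM call."""
    history.append({"role": "user", "content": user_prompt})
    return [{"role": "system", "content": system_prompt}, *history]

def remember_answer(generated_sql: str) -> None:
    """Track generated SQL so L2 references like 'the last query' resolve."""
    history.append({"role": "assistant", "content": generated_sql})

def clear_memory() -> None:
    """Reset after long conversations to avoid accumulated bias."""
    history.clear()
```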

Prompt Engineering for Reliable SQL Generation

As mentioned above, fine-tuning a model is out of scope if one wants to get into production quickly. Given the capabilities of current LLMs, the model can instead be steered by thoughtful prompt engineering following a few core principles relevant to SQL generation. As described in [1], prompt engineering can be divided into three stages: preprocessing, inference and postprocessing. In constructing our system prompt for high SQL quality we largely adhere to these principles, although some aspects described in [1] do not apply to our production case. We briefly sketch our construction principles for the LLM system prompt.

Preprocessing: Rules, schema linking and structured outputs

  • Preprocessing stage: in this part of the prompt, hints about the expected query structure or descriptions of its abstract representation are placed. We do not provide these; instead we list a set of rules the generator must obey in order to arrive at good SQL derived from natural language. Furthermore, we request the answer as structured output so that we gain some insight into the creation process itself. For this purpose we provide an example JSON file to be used by the LLM each time an answer is generated. The template JSON structure reads as follows; the placeholders are filled by the LLM:
Example of a structured JSON output generated by an LLM for a Text-to-SQL query.

Example of LLM-generated structured output for SQL generation
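Since the figure may not render in every medium, a template of this kind could look roughly as follows; the key names are illustrative and not the exact fields from our system:

```python
# Illustrative JSON template handed to the LLM in the system prompt;
# all key names are assumptions chosen for this sketch.
OUTPUT_TEMPLATE = """{
  "sql": "<the generated SQL statement>",
  "tables_used": ["<tables referenced by the query>"],
  "assumptions": "<assumptions made when interpreting the question>",
  "explanation": "<short reasoning behind the chosen query structure>"
}"""

SYSTEM_PROMPT_SUFFIX = (
    "Always answer with a single JSON object following this template:\n"
    + OUTPUT_TEMPLATE
)
```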

  • The second component of preprocessing is schema linking. At system startup we query the database schema and insert it into the prompt template after whitespace correction and some formatting. It is advisable to explain to the LLM the individual components of the schema with respect to function (table or column) and data type. We found that further elaboration on the database schema was not necessary for good results, though our study is not exhaustive. More advanced approaches described in the literature can be included as soon as deficiencies in the generated SQL are observed. We take the position that placing sufficient trust in the LLM’s capabilities reduces the risk of overloading the system prompt with numerous fine-grained rules, which may otherwise lead to unwanted side effects.
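A sketch of this startup step; the connection string and output formatting are assumptions, while the INFORMATION_SCHEMA views are standard on MS SQL Server:

```python
import pyodbc

def load_schema_description(conn_str: str) -> str:
    """Read tables, columns and data types and format them for the prompt.

    A minimal sketch using the standard INFORMATION_SCHEMA views of
    MS SQL Server; real systems may add keys, comments or sample values.
    """
    query = """
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
        FROM INFORMATION_SCHEMA.COLUMNS
        ORDER BY TABLE_NAME, ORDINAL_POSITION
    """
    with pyodbc.connect(conn_str) as conn:
        rows = conn.cursor().execute(query).fetchall()
    lines = [f"table {t}: column {c} ({d})" for t, c, d in rows]
    return "Database schema:\n" + "\n".join(lines)
```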

Inference: Controlling SQL generation

  • Inference stage: chain-of-thought (CoT) and decompositional workflows are mentioned in the literature as best practices to guide the LLM through the SQL generation step. As already mentioned, we set up a list of rules about what to do and what to avoid during the construction process. A good rule set is obtained by starting the generation process in a completely unrestricted manner and carefully analyzing the outcomes for a wide range of queries. In this way, common errors and repeated pitfalls can be detected and excluded. This procedure is not strictly identical to a chain-of-thought process but overlaps with it. More elaborate CoT or decomposition techniques may become relevant once queries gain complexity in combination with large databases.
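The rule list itself is plain natural language inside the system prompt. The following rules illustrate the kind of instructions we mean; they are examples, not our exact production rule set:

```python
# Example rules of the kind distilled from analyzing unrestricted runs;
# illustrative only, not the production rule set.
GENERATION_RULES = """
Rules for SQL generation:
1. Use only tables and columns that appear in the schema above.
2. Generate T-SQL for MS SQL Server; do not use other dialects.
3. Qualify every column with its table alias when joins are involved.
4. Never generate INSERT, UPDATE, DELETE or DDL statements.
5. If the question is ambiguous, state your assumption in the JSON output.
"""
```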

Postprocessing: Error handling, consistency and cost control

  • Postprocessing stage: in this part, self-correction and consistency are the main building blocks. By analyzing semantic and formatting issues reported by SQLFluff, we derived guardrails that were included in the system prompt. In some systems, self-correction loops are executed automatically under the hood until SQL correctness is established. We have not implemented this approach due to cost concerns, as multiple LLM calls are invoked in the background. Instead, in the current implementation we propagate occurring issues directly to the user, allowing for L2 correction prompts. By having issues exposed repeatedly, we were able to continuously improve the rule set and harden the system prompt (done offline). During an active conversation, the LLM remembers issues once it has received a correction prompt and can avoid such errors for the remainder of the information retrieval chat. Another promising approach we will pursue in the future is the self-consistency method described in [3]. Here, multiple SQL strings are generated from a natural language query with the temperature set to T > 0 in the LLM call. The set of SQL candidates is then ranked by the LLM, and the one with the highest rank is forwarded to the query engine. This approach also incurs higher cost due to multiple LLM inference calls. For more complex scenarios, self-consistency has proved very efficient in practice but should be accompanied by careful cost monitoring.
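A sketch of the candidate generation and ranking steps for this self-consistency variant; the OpenAI client usage and model name reflect our setup, while the ranking prompt and the naive answer parsing are simplifying assumptions:

```python
from openai import OpenAI

client = OpenAI()  # API key is taken from the environment

def generate_candidates(messages: list[dict], n: int = 5) -> list[str]:
    """Sample several SQL candidates at temperature > 0 (self-consistency [3])."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.7,  # T > 0 so the candidates actually differ
        n=n,              # n samples => n times the inference cost
    )
    return [choice.message.content for choice in response.choices]

def rank_candidates(candidates: list[str]) -> str:
    """Ask the LLM to rank the candidates and return the best one.

    The ranking prompt is a simplification; [3] describes more elaborate
    consistency-based selection schemes.
    """
    numbered = "\n".join(f"{i}: {sql}" for i, sql in enumerate(candidates))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Rank these SQL candidates by correctness and "
                       f"return only the number of the best one:\n{numbered}",
        }],
        temperature=0,
    )
    # Naive parsing for the sketch; production code would validate this.
    return candidates[int(response.choices[0].message.content.strip())]
```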

Exemplifying a user/system interaction using Contoso data

L0 conversation for getting acquainted with the database:

As stated above, level 0 refers to queries that directly relate to the database structure or raw table content and are sometimes needed to reveal specific details of the database:

L0 interaction in a Text-to-SQL system showing a user request for the database schema, LLM-generated SQL query and resulting schema output.

Example of a level-0 interaction retrieving database schema information via natural language

By maintaining the conversation history, a short follow-up question is possible and will automatically generate the extended query internally (the output order of the tuples is arbitrary):

Context-aware L0 interaction in a Text-to-SQL system extending a previous schema query to include column data types.

Follow-up schema request using conversational context in a Text-to-SQL workflow

L1/L2 queries in an actual use case scenario:

As an example of a real query, we want to find customers who made a purchase in a specific store and within a certain time period.

L1 interaction in a Text-to-SQL system showing a natural language query translated into a SQL statement and returning database query results.

Example of a level-1 interaction translating a business question into SQL

Now we issue a follow-up query to find the same customers in Hamburg, assuming that the LLM can deduce the missing information from the conversation memory:

L2 interaction in a Text-to-SQL system where a referential follow-up query leads to a semantically incorrect SQL statement and a database error.

Context-based follow-up query resulting in a semantic SQL error

Interestingly, an error in the produced query (despite its syntactical correctness) was observed, resulting from the assumption that the ‘Store’ table has a column ‘City’, which is actually not the case. Here, the LLM’s internal knowledge that ‘Hamburg’ is (with high likelihood) a city competes with, and overrides, the information in the schema that no ‘City’ column exists in the ‘Store’ table. From a database perspective it seems a bit inconsistent to store data points related to ‘Hamburg’ in the column ‘Store.State’, but strictly speaking, ‘Hamburg’ is a city-state in Germany, which may justify this choice.

Having localized the problem, we can straightforwardly produce results by explicitly stating in an L2-type conversation that 'Hamburg' is a store state (note the German-sounding customer names in the database results, which supports the plausibility of the data):

L2 interaction in a Text-to-SQL system where a referential follow-up query is corrected based on user feedback, resulting in a valid SQL statement and database results.

Correcting a context-based query using conversational feedback

In summary, the examples shown are not very complex, but they demonstrate that natural language interaction with a SQL database can be realized in a short time and without serious impediments. Internally, we also ran successful benchmarks with more complex queries where, for example, nested SELECTs were generated from more elaborate natural language requests.

Key takeaways for production-ready Text-to-SQL systems

  • Prompt structure strongly affects SQL quality.
    Following established prompting rules detailed in the literature substantially influences the correctness and completeness of generated SQL statements. To be most comprehensible to the LLM, the database schema should be provided in structured form. It is helpful, albeit not mandatory, to provide structured examples of queries; any other requirements can be formulated in natural language. If structured JSON output is required, providing an example JSON is necessary to define the expected key/value pairs.
  • Syntax validation tools are indispensable.
    When user queries lack precision, syntactically incorrect or inferior SQL code may be generated. For these situations, syntax, semantic and format checking is required. If one seeks automation through self-correction, syntactically incorrect SQL statements are fed back to the LLM together with their error descriptions until validation succeeds. An alternative is the self-consistency method, where multiple SQL candidates are ranked.
  • Conversational context improves both usability and correctness.
    Context allows users to reference previous queries, repeat complex SQL generation with minor changes and supports error detection and correction during development and runtime.
  • Software engineering rigor is essential in GenAI-based data systems.
    Even in LLM-driven applications, engineering discipline supports scalability, adaptability and long-term maintainability.

Outlook and Next Steps

Looking ahead, we plan to

  • include open-source models and evaluate their Text-to-SQL capabilities in combination with arbitrary customer databases, with cost reduction in mind.
  • leverage a framework such as vLLM (https://docs.vllm.ai/en/v0.6.2/) for self-hosting suitable models of sufficient quality on customer premises. This is becoming increasingly relevant with respect to data privacy and sovereignty.
  • extend the schema linking and postprocessing approaches to tackle even larger and structurally different data storage schemes, e.g. star schemas where fact tables are directly connected to dimension tables, snowflake schemas, or data vaults with hubs, links and satellites.
  • support scenarios where multiple databases come into play. This will require more effort on the preprocessing side and dynamic schema generation to keep the LLM focused on the actual query structure.
  • adapt the service so that it can be seamlessly integrated into an MCP (Model Context Protocol) setting.

Conclusion

Our text-to-SQL example project demonstrates that natural language querying can be made practical and reliable when combined with sound engineering principles. It bridges the gap between human intent and structured data, enabling organizations to unlock the full value of their databases through conversational interfaces.

References

[1] L. Shi et al., A Survey on Employing Large Language Models for Text-to-SQL Tasks, arXiv:2407.15186 (2025).

[2] B. Ascoli et al., ETM: Modern Insights into Perspective on Text-to-SQL Evaluation in the Age of Large Language Models, arXiv:2407.07313 (2025).

[3] X. Wang et al., Self-Consistency Improves Chain of Thought Reasoning in Language Models, arXiv:2203.11171 (2022).

 


Markus Pernpointner
Senior Data Scientist
