Preparing the data primarily involves curating realistic (input, output, code) tuples. The training data we curated was sourced from community discussions on help.mulesoft.com and from the DataWeave documentation on docs.mulesoft.com.
A data-extraction pipeline was built to mine the community forums, which contain questions and their answers. A forum post typically includes the input and expected output in the question, and the correct code for the transformation in the answer.
We use a combination of rule-based approaches and GPT-3.5 to extract the input and expected output from the question and the code from the answer. The previously mentioned command-line tool is then used to verify that the extracted code compiles correctly and that it generates the right output for the given input, by matching the generated output against the expected output. Successful (input, output, code) tuples are added to the training dataset used to fine-tune a large language model.
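The verification step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the source does not name the command-line tool or its flags, so the `dw run` invocation below is a hypothetical placeholder, and output comparison is shown with a simple JSON-aware normalization.

```python
import json
import subprocess

def normalize(text: str):
    """Parse output as JSON when possible so that formatting
    differences (whitespace, key spacing) do not cause false mismatches."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text.strip()

def verify_tuple(code_path: str, input_path: str, expected_output: str,
                 cli: str = "dw") -> str:
    """Compile and run an extracted DataWeave script against its input,
    then compare the generated output with the expected output.
    Returns "compile_error", "match", or "mismatch".

    NOTE: the CLI name and arguments here are assumptions for
    illustration; substitute the real tool's invocation."""
    result = subprocess.run(
        [cli, "run", "-i", f"payload={input_path}", "-f", code_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return "compile_error"
    if normalize(result.stdout) == normalize(expected_output):
        return "match"
    return "mismatch"
```

Normalizing both sides before comparison avoids rejecting correct code whose serialized output differs only cosmetically from the expected output.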
Tuples for which the generated output matches the expected output are separated into a set named the matched set. The remaining tuples, for which some output is generated upon successful compilation but does not match, are added to a set named the unmatched set. These two sets are used during different phases of fine-tuning the models.

From the results above, the models perform similarly in the zero-shot setting. In the one-shot and two-shot settings, the Salesforce XGen 4K models outperform the BLOOM-7B1 model, and also outperform Claude 1.3 and Claude 2 from above, but not GPT-3.5 or GPT-4.
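The matched/unmatched split can be expressed as a simple partition over the verified tuples. The status labels and tuple layout below are assumptions for illustration; the source only specifies the two resulting sets and that compile failures are excluded from both.

```python
def partition_tuples(verified):
    """Split verified (input, output, code, status) tuples into the
    'matched' set (generated output equals expected output) and the
    'unmatched' set (code compiled and produced some output, but it
    differed from the expected output).

    Tuples whose code failed to compile belong to neither set."""
    matched, unmatched = [], []
    for inp, out, code, status in verified:
        if status == "match":
            matched.append((inp, out, code))
        elif status == "mismatch":
            unmatched.append((inp, out, code))
    return matched, unmatched
```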
The fine-tuned models also tend to have marginally higher compilation percentages but lower pass@k, suggesting that fine-tuning has taught them the syntax of the language to a good extent, while they still struggle to generate the right code for the implicitly specified task. The effectiveness of fine-tuning is further underscored by the fact that the models had <1% pass@1 before fine-tuning.
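For reference, the pass@k numbers discussed above are conventionally computed with the unbiased estimator introduced in the Codex paper (Chen et al., 2021); the source does not state which estimator it used, so this is shown as the standard formulation, not necessarily the one in the evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations passes,
    given that c of the n generations pass all tests.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k
        # generations must include at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 generations of which c = 5 pass, pass@1 is 0.5, matching the intuition that a single random sample passes half the time.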