
Using models out-of-the-box

Posted: Wed May 28, 2025 4:57 am
by MasudIbne756
We run the generated code using the DataWeave command-line tool, and the exit code of the process running the code is used to determine whether or not the code compiled successfully.
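
For reference, a minimal sketch of such a harness in Python is shown below. The CLI invocation is an assumption for illustration; the exact flags vary by DataWeave CLI version, so treat the command line here as a placeholder rather than the actual invocation used:

import subprocess

def generation_passes(script_path, input_path):
    # Run the generated script through the DataWeave CLI.
    # NOTE: the flags below are illustrative placeholders;
    # check your CLI version's help for the real invocation.
    try:
        result = subprocess.run(
            ["dw", "-i", "payload", input_path, "-f", script_path],
            capture_output=True,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        return False
    # A zero exit code is treated as "compiled and ran successfully".
    return result.returncode == 0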

The first approach we tried for DataWeave code generation was to prompt pre-trained large language models in zero-shot, one-shot, and two-shot settings.
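
Concretely, sampling multiple generations from a chat model looks roughly like the sketch below (using the pre-v1 openai Python SDK that was current in mid-2023; the prompt contents are placeholders, and the Anthropic models would be queried through their own API):

import openai  # assumes OPENAI_API_KEY is set in the environment

def sample_generations(prompt, model="gpt-4", temperature=0.2, n=20):
    # Request n independent completions at the given temperature.
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        n=n,
    )
    return [choice.message.content for choice in response.choices]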

We evaluated the following models:

gpt-3.5-turbo from OpenAI
GPT-4 from OpenAI
Claude 1.3 from Anthropic
Claude 2 from Anthropic

For one-shot evaluation, the example in the prompt was chosen such that the input and output formats (JSON, XML, CSV, etc.) of the example matched the input and output formats of the test instance.
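
The selection itself is mechanical; a sketch follows (the field names are hypothetical):

def pick_example(library, in_fmt, out_fmt):
    # Return a library example whose formats match the test instance,
    # e.g. in_fmt="json", out_fmt="xml". None means fall back to zero-shot.
    for ex in library:
        if ex["input_format"] == in_fmt and ex["output_format"] == out_fmt:
            return ex
    return None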

For each model and each setting, we tried different prompt styles, including OpenAI's suggestions of using triple backticks, XML-like tags, and more. We'll discuss the best results we obtained for each model below.
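
To illustrate the two delimiter styles, a rough sketch (not the exact prompts used):

def wrap(label, text, style="backticks"):
    # Delimit a prompt section with triple backticks or XML-like tags.
    if style == "backticks":
        return f"{label}:\n```\n{text}\n```"
    return f"<{label}>\n{text}\n</{label}>"

prompt = "\n\n".join([
    "Write a DataWeave script that transforms the input to the output.",
    wrap("input", '{"name": "jane"}', style="xml"),
    wrap("output", "<name>jane</name>", style="xml"),
])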

The results for the models at two temperature values (t=0.2 and t=0.8) are shown in the tables below. For all of these experiments, we use n=20. These results were obtained from the models in early July 2023 on a benchmark dataset with 66 input-output examples.

Analyzing the results

Overall, we see that GPT-4 produces the most reliable code, with an average pass@1 of 0.434 in the two-shot setting. The models also perform better in a one-shot setting than in a zero-shot setting, but there is no significant improvement in the two-shot setting over the one-shot setting.
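
For readers unfamiliar with the metric: the post does not define pass@1, but assuming it follows the standard unbiased estimator from the Codex paper (Chen et al., 2021), with n=20 samples per task it can be computed as:

from math import comb

def pass_at_k(n, c, k):
    # Unbiased estimator: n samples drawn per task, c of them correct.
    # pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 9 of n=20 samples pass => pass@1 = 0.45
print(pass_at_k(20, 9, 1))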

We also see that GPT-3.5 and GPT-4 outperform Claude 1.3 and Claude 2 on this task. The difference in performance is most evident in compilation percentages, suggesting that GPT-3.5 and GPT-4 achieve higher performance by virtue of generating more compilable code. GPT-3.5 and Claude 2 achieve similar pass@1 on the code that compiles, suggesting they have similar abilities in generating the right code for the implicitly specified task, while GPT-4 exceeds both GPT-3.5 and Claude 2 in this regard.

Contrary to other reports, we found improved performance in the June 2023 version of GPT-4 over the March 2023 version; the March 2023 version of GPT-4 produced results similar to what GPT-3.5 produces. Intuitively, this shows a difference in behavior between a popular language like Python and a low-resource language like DataWeave.

The report states that GPT-4's generations become more verbose and contain more comments. A plausible explanation for this is the variety of tasks (like code summarization, code-comment generation, etc.) and data (comments present in the Python code seen by the model) that GPT-4 might be used and trained for in the case of Python. Such variety is uncommon in a language like DataWeave, possibly explaining the absence of such effects.