Experimenting with ChatGPT's Vulnerability Volcano and Prompt Party Tricks

Wait 5 sec.

Table of LinksAbstract and I. IntroductionII. Related WorkIII. Technical BackgroundIV. Systematic Security Vulnerability Discovery of Code Generation ModelsV. ExperimentsVI. DiscussionVII. Conclusion, Acknowledgments, and References\AppendixA. Details of Code Language ModelsB. Finding Security Vulnerabilities in GitHub CopilotC. Other Baselines Using ChatGPTD. Effect of Different Number of Few-shot ExamplesE. Effectiveness in Generating Specific Vulnerabilities for C CodesF. Security Vulnerability Results after Fuzzy Code DeduplicationG. Detailed Results of Transferability of the Generated Nonsecure PromptsH. Details of Generating non-secure prompts DatasetI. Detailed Results of Evaluating CodeLMs using Non-secure DatasetJ. Effect of Sampling TemperatureK. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable CodesL. Qualitative Examples Generated by CodeGen and ChatGPTM. Qualitative Examples Generated by GitHub CopilotV. EXPERIMENTSIn this section, we present the results of our experimental evaluation. First, we explain the details of the experimental setup. Then, we provide the results of finding the models’ security vulnerabilities and study the efficiency and scalability of the proposed approach. We also investigate the transferability of the generated non-secure prompts across the different models. Furthermore, utilizing our method, we provide a diverse set of 280 prompts for various types of security vulnerabilities. This dataset allows us to benchmark state-of-the-art models.\A. SetupWe start with an overview of the setup, including the details of the models, few-shot prompts, sampling strategies, and the CodeQL settings.\1) Code Generation Models: To evaluate our proposed approach, we concentrate on two models: CodeGen with 6 billion parameters [6] and the ChatGPT model [4], [52] We provide the details of each model in Appendix A. In addition to these models, we also provide the results for the GitHub Copilot AI programming assistant [9] in Appendix B.\In the generation procedure of non-secure prompt and codes for the ChatGPT model, following previous work by Ouyang et al. [3], we employ an instruction to guide the model toward generating the desired codes. For example, in the generation procedure of Python non-secure prompts given the few-shot or one-shot input, we instruct ChatGPT as follows:\‘‘You are a Python developer. Given the provided code examples, complete the following code. Define a function and write the task description as a Python comment. Provide comprehensive comments. Do not write any explanation. Only write Python codes.’’\Using the above instruction, we prime ChatGPT to generate Python prompts with comprehensive comments, and we also instruct the model to output the code for the desired language.\In the Python code generation procedure, we instruct ChatGPT using the following instruction:\‘‘You are a Python developer. Complete the following Python code:’’.\We provide our observations about other potential ChatGPT baselines (e.g., can we generate secure codes by instructing ChatGPT to generate ‘‘secure’’ codes?) in Appendix C.\We conduct the experiments for the CodeGen model using two NVIDIA 40GB Ampere A100 GPUs. To run the experiments on ChatGPT, we use the OpenAI API [52] to query the model. In the generation process, we consider generating up to 25 and 150 tokens for non-secure prompts and code, respectively. We use nucleus sampling to sample k non-secure prompts from CodeGen. Using each k sampled non-secure prompts, we sample k ′ completion of the given input nonsecure prompts. For the ChatGPT model, we also set the number of samples for generating non-secure prompts and code to k and k ′ , respectively. In total, we sample k × k ′ completed codes. For both models, we set the sampling temperature to 0.6, where the temperature describes the randomness of the model’s output and its variance. The higher the temperature, the more random the output. Note that we use the sampling temperature employed in previous code generation works [6], [5]. In Appendix J, we provide detailed results of the effect of different sampling temperatures in generating non-secure prompts.\2) Constructing Few-shot Prompts: We use the few-shot setting in FS-Code and FS-Prompt to guide the models to generate the desired output. Previous work has shown that the optimal number for the few-shot prompting is between two and ten examples [1], [53]. Due to the difficulty in accessing potential security vulnerability code examples, we set the number to four in all of our experiments for FS-Code and FS-Prompt. Note that three out of four of these examples are used as demonstration examples, and one of them is the targeted code. We analyze the effect of using different numbers of few-shot examples in Appendix D.\To construct each few-shot prompt, we use a set of four examples for each CWEs in Table I. The examples in the fewshot prompts are separated using a special tag (###). It has been shown that the order of examples affects the output [51]. To generate a diverse set of non-secure prompts, we construct five few-shot prompts with four examples by randomly shuffling the order of examples. Note that each of the examples contains at least one security vulnerability of the targeted CWE. Using the five constructed few-shot prompts, we can sample 5×k×k ′ completed codes from each model.\3) CWEs and CodeQL Settings: By default, CodeQL provides queries to discover 29 different CWEs in Python and 35 in C. In this work, we generate non-secure prompts and codes for 13 different CWEs, listed in Table I. However, we analyzed the generated code to detect all supported CWEs for Python and C code. We summarize all CWEs that are not in the list in Table I but are found during the analysis as Other\B. EvaluationIn the following, we present the evaluation results and discuss the main insights of these results.\1) Generating Codes with Security Vulnerabilities: We evaluate our different approaches for finding vulnerable codes that are generated by the CodeGen and ChatGPT models. We examine the performance of our FS-Code, FS-Prompt, and OSPrompt in terms of quality and quantity. For this evaluation, we use five different few-shot prompts by permuting the examples’ order. We provide the details of constructing these five few-shot prompts using four code examples in Section V-A. Note that in one-shot prompts for OS-Prompt, we use one example in each one-shot prompt, followed by importing relevant libraries. In total, using each few-shot prompt or one-shot prompt, we sample the top five non-secure prompts, and each sampled non-secure prompt is used as input to sample the top five code completions. Therefore, using five few-shot or one-shot prompts, we sample 5 × 5 × 5 (125) complete codes from CodeGen and ChatGPT models.\a) Effectiveness in Generating Specific Vulnerabilities: Figure 3 shows the percentage of vulnerable Python codes that are generated by CodeGen (Figure 3a, Figure 3b, and Figure 3c) and ChatGPT (Figure 3d, Figure 3e, and Figure 3f) using our three few-shot prompting approaches (We also provide the percentage of vulnerable C codes in Appendix E). We removed duplicates and codes with syntax errors. The x-axis refers to the CWEs that have been detected in the sampled codes, and the y-axis refers to the CWEs that have been used to generate non-secure prompts. These non-secure prompts are used to generate the codes. Other refers to detected CWEs that are not listed in Table I and are not considered in our evaluation. The results in Figure 3 show the percentage of the generated code samples that contain at least one security vulnerability. The high numbers on the diagonal show our approaches’ effectiveness in finding code with targeted vulnerabilities, especially for ChatGPT. For CodeGen, the diagonal is less distinct. However, we can still find a reasonably large number of vulnerabilities for all three few-shot sampling approaches. Furthermore, the results in Figure 3 show how effective the approximated inverse of the models are in finding the targeted type of security vulnerabilities. Overall, we find that our FS-Code approach (Figure 3a and Figure 3d) performs better in comparison to FSPrompt (Figure 3b and Figure 3e) and OS-Prompt (Figure 3c and Figure 3f). For example, Figure 3d shows that FS-Code finds higher percentages of CWE-020, CWE-079, and CWE-94 vulnerabilities for ChatGPT models in comparison to our other approaches (FS-Prompt and OS-Prompt).\The main goal of approximating the inversion of the model is to generate the code with the targeted vulnerability. However, our experiments show that our FS-Code approach can also partially reconstruct the targeted code in many examples. We provide the detailed results in Appendix K.\b) Quantitative Comparison of Different Prompting Techniques: Table II and Table III provide the quantitative results of our approaches. The tables show the absolute numbers of vulnerable codes found by FS-Code, FS-Prompt, and OSPrompt for both models. Additionally, we present the results obtained by using only the initial few first lines of vulnerable code examples as non-secure prompts, referring to them as CVE-prompts (We use directly the first few lines as the nonsecure prompt to complete the code). We employ the non-secure prompts from vulnerable code examples to sample the same number of code completions. Table II presents the results for the codes generated by CodeGen, and Table III for the codes generated by ChatGPT. Columns 2 to 13 provide the number of vulnerable Python codes, and columns 14 to 19 provide the number of vulnerable C codes. In Table II Other refers to the number of codes that contain other CWEs that are not considered separately in our evaluation. The Total columns provide the sum of all vulnerable codes for Python and C.\In Table II and Table III, we observe that our best performing method (FS-Code) found 124 and 501 vulnerable Python codes that are generated by CodeGen and ChatGPT, respectively. In general, the results in Table III show that our approaches found more vulnerable codes that are generated by ChatGPT in comparison to CodeGen (Table II). One reason for that could be related to the capability of the ChatGPT model to generate more complex codes compared to CodeGen [6]. Another reason might be related to the code datasets used in the model’s training procedure. Furthermore, Table II and Table III show that FS-Code performs better in finding codes with different CWEs in comparison to FS-Prompt and OSPrompt. For example, in Table III, we can observe that FS-Code find more vulnerable codes that contain CWE-020, CWE-094 for Python codes, and CWE-190 for C codes. This shows the advantage of employing vulnerable codes in our few-shot prompting approach. For the remaining experiments, we use FS-Code as our best-performing approach. Tables II and III show that CVE-prompts were unable to generate any vulnerable codes of certain specific types. For instance, in Table II, we observe that CVE-prompts could not generate any vulnerable codes with types CWE-079, CWE-117, and CWE-601. This indicates that to examine the security weaknesses that can generated by these models, we cannot solely rely on a handful of vulnerable code samples\2) Finding Security Vulnerabilities of Models on Large Scale: Next, we evaluate the scalability of our FS-Code approach in finding vulnerable codes that could be generated by the CodeGen and ChatGPT models. We investigate if our approach can find a larger number of vulnerable codes by increasing the number of sampled non-secure prompts and code completions. To evaluate this, we set k = 15 (number of sampled non-secure prompts) and k ′ = 15 (number of sampled codes given each non-secure prompts). Using five few-shot prompts, we generate 1125 (15×15×5) codes using each model and then remove all duplicate codes. Figure 4 provides the results for the number of codes with different CWEs versus the number of samples. Figure 4a and Figure 4b provide Python codes results in ten different CWEs, and Figure 4c and Figure 4d provide C codes result for four different CWEs.\Figure 4 shows that, in general, by sampling more code samples, we can find more vulnerable codes that are generated by CodeGen and ChatGPT models. For example, Figure 4a shows that with sampling more codes, CodeGen generates a significant number of vulnerable codes for CWE-022 and CWE-079. In Figure 4a and Figure 4b, we also observe that generating more codes has less effect in finding more codes with specific vulnerabilities (e.g., CWE-020 and CWE-094). Furthermore, Figure 4 shows an almost linear growth for CWE-022 (Figure 4b), CWE-079 (Figure 4b), and CWE-787 (Figure 4d). This is mainly due to the nature of these CWEs: For example, CWE-787 refers to writing out-of-bounds of a defined array or allocated memory; this is a very prevalent issue in C and can happen in many program writing scenarios. We also qualified the provided results in Figure 4 by employing fuzzy matching to drop near duplicate codes. However, we did not observe a significant change in the effect of sampling the codes on finding the number of vulnerable codes. We provide more details and results in Appendix F.\a) Qualitative Examples: Listing 4 and Listing 5 provide two examples of vulnerable code generated by CodeGen and ChatGPT, respectively. Listing 4 shows C code that contains an integer overflow vulnerability (CWE-190). Listing 5 provides Python code that contains a cross-site scripting vulnerability (CWE-079). In Listing 4, lines 1 to 12 are used as the nonsecure prompt, and the rest of the code example is the CodeGen completion for the given non-secure prompt. The code contains a multiplication in lines 27 and 34 that potentially overflows on a 32-bit platform. Since the result controls an allocation size, this vulnerability could lead to a heap buffer overflow. In Listing 5, lines 1 to 4 are the non-secure prompt, and the rest of the code is the output of ChatGPT given the non-secure prompt. The web application copies user input into page content (lines 15 and 17) without prior sanitization, which enables Cross-Site Scripting (XSS). We provide more generated vulnerable Python and C codes in Appendix L.\3) Transferability of the Generated Non-secure Prompts: In the previous experiments, we generated the non-secure prompts and completed codes using the same model. Here we investigate if the generated non-secure prompts are transferable across different models. For example, we want to answer whether the non-secure prompts generated by ChatGPT can lead the CodeGen model to generate vulnerable codes. For this experiment, we collect a set of “promising” non-secure prompts generated with the CodeGen and ChatGPT models in Section V-B2. We consider a non-secure prompt promising if it at least leads the model to generate one vulnerable code sample. After deduplication, we collected 544 of the non-secure prompts generated by the CodeGen model and 601 non-secure prompts that the ChatGPT model generated. All the prompts were generated using our FS-Code approach.\To examine the transferability of the promising non-secure prompts, we use CodeGen to complete the non-secure prompts that ChatGPT generates. Furthermore, we use ChatGPT to complete the non-secure prompts that CodeGen generates. Table IV and Table V provide results of generated Python and C codes, respectively. These vulnerable codes are generated by CodeGen and ChatGPT models using the promising nonsecure prompts that are generated by CodeGen and ChatGPT models. We sample k ′ = 5 for each of the given non-secure prompts. In Table IV and Table V, #Code refers to the number of the generated codes, and #Vul refers to the number of codes that contain at least one vulnerability. Table IV and Table V show that Python and C non-secure prompts that we sampled from CodeGen are transferable to the ChatGPT model and vice versa. Specifically, the non-secure prompts\ \ \that we sampled from one model generate a high number of vulnerable codes in the other model. For example, in Table IV, we observe that the generated Python non-secure prompts by CodeGen leads ChatGPT to generate 617 vulnerable codes. We also observe that, in most of the cases, the non-secure prompts lead to generating more vulnerable codes on the same model compared to the other model. For example, in Table IV non-secure prompts generated by ChatGPT lead ChatGPT to generate 1659 vulnerable codes, while it only generates 707 vulnerable codes on the CodeGen model. Furthermore, Table IV shows that the non-secure prompts of ChatGPT models can generate a higher fraction of vulnerabilities for CodeGen (707/2050 = 0.34) in comparison to CodeGen’s non-secure prompts (466/1545 = 0.30). In general, the results show that the sampled non-secure prompts of different programming languages are transferable across different models and can be employed to evaluate the other model in generating codes with particular security issues. We provide the detailed results of Table IV and Table V per CWEs in Appendix G.\C. CodeLM Security BenchmarkIn Section V-B3, we show that non-secure prompts are transferable across different models. Building on this finding, we leverage our FS-Code approach to generate a collection of non-secure prompts using a set of state-of-the-art models. This dataset serves as a benchmark to evaluate and compare code language models. In the following, we first provide the details of the non-secure prompt dataset. Using this dataset, we assess and compare vulnerabilities among five different state-of-the-art code language models. We provide the details of these models in Appendix A.\1) Non-secure Prompts Dataset: We generate the dataset of non-secure prompts by using our FS-Code approach and employing two state-of-the-art code models GPT-4 [54] and Code Llama-34B [12]. We generate 50 prompts for each CWE, 25 are generated by GPT-4 [54] and 25 by Code Llama34B [12]. To generate diverse prompts, we set the temperature of each model to 1.0. We provide more details in Appendix H. Given the 50 generated prompts per CWE, through a defined\ \ \ \ \ \procedure, we select 20 non-secure prompts as the instances of our dataset. This results in a total of 280 non-secure prompts, with 200 designed for Python and 80 for C. Details of the selection procedure are outlined below.\a) Non-secure Prompts Selection: We select 20 deduplicated prompts out of 50 generated prompts: A prompt generated by GPT-4 [54] is considered “promising” if it leads GPT-4 [54] to generate at least one vulnerable code. For generating the codes using the non-secure prompts, we use a setting of k ′ = 5, resulting in the generation of 250 codes per CWE (50 × 5).\2) Evaluating CodeLMS using Non-secure Prompts Dataset: We utilize our custom non-secure prompts dataset as a benchmark to assess and evaluate different code language models. Table VI presents the number of vulnerable codes generated using the non-secure prompts of our dataset. These codes were generated by different instruction-tuned and pretrained code models. Here, we present the initial results of evaluating the security weaknesses of the code language models. As a service for the community, we will launch a website at the\ \ \time of publication for ranking the security of models inspired by the “Big Code Models Leaderboard” [55], which will regularly report the security evaluations of the state-of-the-art code models. Furthermore, to avoid intentional or unintentional overfitting to the provided non-secure prompts, we can regularly update them using our FS-Code approach and the selection approach described above.\In Table VI, we provide the results of the security weaknesses that can be generated with five different code language models using our proposed dataset. Among the evaluated models, Code Llama-13B [12], WizardCoder [56], and ChatGPT are instruction-tuned, while CodeGen [6] and StarCoder [24] are the base models (only pre-trained). Table Table VI presents the total number of vulnerable Python and C codes for various CWEs. In this table, top-1 indicates the number of generated vulnerable codes among the top-ranked outputs of the model, while top-5 represents the number of generated vulnerable codes among the top 5 outputs of the models. We provide the detailed results per CWE in Appendix I. To generate the codes for each non-secure prompt, we adhere to the “Big Code Models Leaderboard” [55] with the following settings: a maximum token limit of 512, a top-p value of 0.95 (The parameter of nucleus sampling [50]), and a temperature setting of 0.2.\Table VI demonstrates that CodeGen-6B produces a lower number of vulnerable Python and C codes in comparison to other models. However, when selecting a model for a specific application, we recommend considering both the performance with respect to correctness and our security benchmark results. For example, CodeGen-6B and ChatGPT have comparable results in generating vulnerable Python codes. However, as per Liu et al. [57], CodeGen-6B achieves a performance score of only 29.3 on the HumanEval benchmark [5], while ChatGPT’s performance excels at 73.2 (Here, we report pass@1 performance of the models in HumanEval benchmark. For more details, please refer to Liu et al. [57]). Furthermore, in Table VI, we note that Code Llama-13B produces fewer vulnerable codes than StarCoder-7B, while, as per [55], Code Llama-13B has exhibited superior performance in the HumanEval benchmark compared to StarCoder-7B (Code Llama-13B scored 50.60, whereas StarCoder-7B scored only 28.37). For a comprehensive comparison of these models, it is also helpful to analyze the number of vulnerable code instances generated for each type of vulnerability. Detailed results can be found in Appendix I.\:::infoAuthors:(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security (hossein.hajipour@cispa.de);(2) Keno Hassler, CISPA Helmholtz Center for Information Security (keno.hassler@cispa.de);(3) Thorsten Holz, CISPA Helmholtz Center for Information Security (holz@cispa.de);(4) Lea Schonherr, CISPA Helmholtz Center for Information Security (schoenherr@cispa.de);(5) Mario Fritz, CISPA Helmholtz Center for Information Security (fritz@cispa.de).::::::infoThis paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.:::\