├── AI_ETHICS.md
├── CODEOWNERS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE.txt
├── README.md
├── SECURITY.md
├── assets
│   ├── codegen_logo.png
│   └── two.gif
├── codegen1
│   ├── LICENSE.txt
│   ├── README.md
│   ├── benchmark
│   │   ├── README.md
│   │   ├── mtpb.jsonl
│   │   ├── mtpb_exec.py
│   │   └── mtpb_sample.py
│   ├── jaxformer
│   │   └── hf
│   │       ├── codegen
│   │       │   ├── configuration_codegen.py
│   │       │   └── modeling_codegen.py
│   │       ├── sample.py
│   │       └── train_deepspeed.py
│   └── requirements.txt
├── codegen2
│   ├── LICENSE
│   ├── README.md
│   ├── requirements.txt
│   └── sample.py
└── codegen25
    ├── LICENSE
    ├── README.md
    ├── requirements.txt
    └── sample.py
/AI_ETHICS.md:
--------------------------------------------------------------------------------
1 | ## Ethics disclaimer for Salesforce AI models, data, code
2 |
3 | This release is for research purposes only in support of an academic
4 | paper. Our models, datasets, and code are not specifically designed or
5 | evaluated for all downstream purposes. We strongly recommend users
6 | evaluate and address potential concerns related to accuracy, safety, and
7 | fairness before deploying this model. We encourage users to consider the
8 | common limitations of AI, comply with applicable laws, and leverage best
9 | practices when selecting use cases, particularly for high-risk scenarios
10 | where errors or misuse could significantly impact people’s lives, rights,
11 | or safety. For further guidance on use cases, refer to our standard
12 | [AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf)
13 | and [AI AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ai-acceptable-use-policy.pdf).
14 |
--------------------------------------------------------------------------------
/CODEOWNERS:
--------------------------------------------------------------------------------
1 | # Comment line immediately above ownership line is reserved for related other information. Please be careful while editing.
2 | #ECCN:Open Source
3 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Salesforce Open Source Community Code of Conduct
2 |
3 | ## About the Code of Conduct
4 |
5 | Equality is a core value at Salesforce. We believe a diverse and inclusive
6 | community fosters innovation and creativity, and are committed to building a
7 | culture where everyone feels included.
8 |
9 | Salesforce open-source projects are committed to providing a friendly, safe, and
10 | welcoming environment for all, regardless of gender identity and expression,
11 | sexual orientation, disability, physical appearance, body size, ethnicity, nationality,
12 | race, age, religion, level of experience, education, socioeconomic status, or
13 | other similar personal characteristics.
14 |
15 | The goal of this code of conduct is to specify a baseline standard of behavior so
16 | that people with different social values and communication styles can work
17 | together effectively, productively, and respectfully in our open source community.
18 | It also establishes a mechanism for reporting issues and resolving conflicts.
19 |
20 | All questions and reports of abusive, harassing, or otherwise unacceptable behavior
21 | in a Salesforce open-source project may be reported by contacting the Salesforce
22 | Open Source Conduct Committee at ossconduct@salesforce.com.
23 |
24 | ## Our Pledge
25 |
26 | In the interest of fostering an open and welcoming environment, we as
27 | contributors and maintainers pledge to make participation in our project and
28 | our community a harassment-free experience for everyone, regardless of gender
29 | identity and expression, sexual orientation, disability, physical appearance,
30 | body size, ethnicity, nationality, race, age, religion, level of experience, education,
31 | socioeconomic status, or other similar personal characteristics.
32 |
33 | ## Our Standards
34 |
35 | Examples of behavior that contributes to creating a positive environment
36 | include:
37 |
38 | * Using welcoming and inclusive language
39 | * Being respectful of differing viewpoints and experiences
40 | * Gracefully accepting constructive criticism
41 | * Focusing on what is best for the community
42 | * Showing empathy toward other community members
43 |
44 | Examples of unacceptable behavior by participants include:
45 |
46 | * The use of sexualized language or imagery and unwelcome sexual attention or
47 | advances
48 | * Personal attacks, insulting/derogatory comments, or trolling
49 | * Public or private harassment
50 | * Publishing, or threatening to publish, others' private information—such as
51 | a physical or electronic address—without explicit permission
52 | * Other conduct which could reasonably be considered inappropriate in a
53 | professional setting
54 | * Advocating for or encouraging any of the above behaviors
55 |
56 | ## Our Responsibilities
57 |
58 | Project maintainers are responsible for clarifying the standards of acceptable
59 | behavior and are expected to take appropriate and fair corrective action in
60 | response to any instances of unacceptable behavior.
61 |
62 | Project maintainers have the right and responsibility to remove, edit, or
63 | reject comments, commits, code, wiki edits, issues, and other contributions
64 | that are not aligned with this Code of Conduct, or to ban temporarily or
65 | permanently any contributor for other behaviors that they deem inappropriate,
66 | threatening, offensive, or harmful.
67 |
68 | ## Scope
69 |
70 | This Code of Conduct applies both within project spaces and in public spaces
71 | when an individual is representing the project or its community. Examples of
72 | representing a project or community include using an official project email
73 | address, posting via an official social media account, or acting as an appointed
74 | representative at an online or offline event. Representation of a project may be
75 | further defined and clarified by project maintainers.
76 |
77 | ## Enforcement
78 |
79 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
80 | reported by contacting the Salesforce Open Source Conduct Committee
81 | at ossconduct@salesforce.com. All complaints will be reviewed and investigated
82 | and will result in a response that is deemed necessary and appropriate to the
83 | circumstances. The committee is obligated to maintain confidentiality with
84 | regard to the reporter of an incident. Further details of specific enforcement
85 | policies may be posted separately.
86 |
87 | Project maintainers who do not follow or enforce the Code of Conduct in good
88 | faith may face temporary or permanent repercussions as determined by other
89 | members of the project's leadership and the Salesforce Open Source Conduct
90 | Committee.
91 |
92 | ## Attribution
93 |
94 | This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
95 | version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html.
96 | It includes adaptations and additions from [Go Community Code of Conduct][golang-coc],
97 | [CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].
98 |
99 | This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].
100 |
101 | [contributor-covenant-home]: https://www.contributor-covenant.org/
102 | [golang-coc]: https://golang.org/conduct
103 | [cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
104 | [microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
105 | [cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/
106 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing Guide For CodeGen
2 |
3 | This page describes the operational governance model of this project, as well as the recommendations and requirements for contributing to CodeGen. We strive to follow these guidelines as closely as possible. As always, thanks for contributing – we hope they make the process easier and shed some light on our approach and processes.
4 |
5 | # Governance Model
6 |
7 | ## Published but not supported
8 |
9 | We are open sourcing this project because it may contain useful or interesting code/concepts that we wish to share with the larger open source community. Although occasional work may be done on it, we will not be looking for or soliciting contributions.
10 |
11 | # Issues, requests & ideas
12 |
13 | Use the GitHub Issues page to submit issues and enhancement requests and to discuss ideas.
14 |
15 | ### Bug Reports and Fixes
16 | - If you find a bug, please search for it in the [Issues](https://github.com/salesforce/CodeGen/issues), and if it isn't already tracked,
17 | [create a new issue](https://github.com/salesforce/CodeGen/issues/new). Fill out the "Bug Report" section of the issue template. Even if an Issue is closed, feel free to comment and add details; it will still
18 | be reviewed.
19 | - Issues that have already been identified as a bug (note: able to reproduce) will be labelled `bug`.
20 | - If you'd like to submit a fix for a bug, [send a Pull Request](#creating-a-pull-request) and mention the Issue number.
21 | - Include tests that isolate the bug and verify that it is fixed.
22 |
23 | ### New Features
24 | - If you'd like to add new functionality to this project, describe the problem you want to solve in a [new Issue](https://github.com/salesforce/CodeGen/issues/new).
25 | - Issues that have been identified as a feature request will be labelled `enhancement`.
26 | - If you'd like to implement the new feature, please wait for feedback from the project
27 | maintainers before spending too much time writing the code. In some cases, `enhancement`s may
28 | not align well with the project objectives at the time.
29 |
30 | ### Tests, Documentation, Miscellaneous
31 | - If you'd like to improve the tests, make the documentation clearer, offer an
32 | alternative implementation that may have advantages over the way something is currently
33 | done, or make any other change, we would be happy to hear about it!
34 | - If it's a trivial change, go ahead and [send a Pull Request](#creating-a-pull-request) with the changes you have in mind.
35 | - If not, [open an Issue](https://github.com/salesforce/CodeGen/issues/new) to discuss the idea first.
36 |
37 | If you're new to our project and looking for some way to make your first contribution, look for
38 | Issues labelled `good first contribution`.
39 |
40 | # Contribution Checklist
41 |
42 | - [x] Clean, simple, well styled code
43 | - [x] Commits should be atomic and messages must be descriptive. Related issues should be mentioned by Issue number.
44 | - [x] Comments
45 | - Module-level & function-level comments.
46 | - Comments on complex blocks of code or algorithms (include references to sources).
47 | - [x] Dependencies
48 | - Minimize number of dependencies.
49 | - Prefer Apache 2.0 licenses.
50 | - [x] Reviews
51 | - Changes must be approved via peer code review
52 |
53 | # Creating a Pull Request
54 |
55 | 1. **Ensure the bug/feature was not already reported** by searching on GitHub under Issues. If none exists, create a new issue so that other contributors can keep track of what you are trying to add/fix and offer suggestions (or let you know if there is already an effort in progress).
56 | 2. **Fork** the repo and **clone** your fork to your machine.
57 | 3. **Create** a new branch to contain your work (e.g. `git checkout -b fix-issue-11`).
58 | 4. **Commit** changes to your own branch.
59 | 5. **Push** your work back up to your fork (e.g. `git push origin fix-issue-11`).
60 | 6. **Submit** a Pull Request against the `main` branch and refer to the issue(s) you are fixing. Try not to pollute your pull request with unintended changes. Keep it simple and small.
61 | 7. **Sign** the Salesforce CLA (you will be prompted to do so when submitting the Pull Request).
62 |
63 | > **NOTE**: Be sure to [sync your fork](https://help.github.com/articles/syncing-a-fork/) before making a pull request.
64 |
65 | # Code of Conduct
66 | Please follow our [Code of Conduct](CODE_OF_CONDUCT.md).
67 |
68 | # License
69 | By contributing your code, you agree to license your contribution under the terms of our project [LICENSE](LICENSE.txt) and to sign the [Salesforce CLA](https://cla.salesforce.com/sign-cla).
70 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Apache License Version 2.0
2 |
3 | Copyright (c) 2023 Salesforce, Inc.
4 | All rights reserved.
5 |
6 | Apache License
7 | Version 2.0, January 2004
8 | http://www.apache.org/licenses/
9 |
10 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
11 |
12 | 1. Definitions.
13 |
14 | "License" shall mean the terms and conditions for use, reproduction,
15 | and distribution as defined by Sections 1 through 9 of this document.
16 |
17 | "Licensor" shall mean the copyright owner or entity authorized by
18 | the copyright owner that is granting the License.
19 |
20 | "Legal Entity" shall mean the union of the acting entity and all
21 | other entities that control, are controlled by, or are under common
22 | control with that entity. For the purposes of this definition,
23 | "control" means (i) the power, direct or indirect, to cause the
24 | direction or management of such entity, whether by contract or
25 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
26 | outstanding shares, or (iii) beneficial ownership of such entity.
27 |
28 | "You" (or "Your") shall mean an individual or Legal Entity
29 | exercising permissions granted by this License.
30 |
31 | "Source" form shall mean the preferred form for making modifications,
32 | including but not limited to software source code, documentation
33 | source, and configuration files.
34 |
35 | "Object" form shall mean any form resulting from mechanical
36 | transformation or translation of a Source form, including but
37 | not limited to compiled object code, generated documentation,
38 | and conversions to other media types.
39 |
40 | "Work" shall mean the work of authorship, whether in Source or
41 | Object form, made available under the License, as indicated by a
42 | copyright notice that is included in or attached to the work
43 | (an example is provided in the Appendix below).
44 |
45 | "Derivative Works" shall mean any work, whether in Source or Object
46 | form, that is based on (or derived from) the Work and for which the
47 | editorial revisions, annotations, elaborations, or other modifications
48 | represent, as a whole, an original work of authorship. For the purposes
49 | of this License, Derivative Works shall not include works that remain
50 | separable from, or merely link (or bind by name) to the interfaces of,
51 | the Work and Derivative Works thereof.
52 |
53 | "Contribution" shall mean any work of authorship, including
54 | the original version of the Work and any modifications or additions
55 | to that Work or Derivative Works thereof, that is intentionally
56 | submitted to Licensor for inclusion in the Work by the copyright owner
57 | or by an individual or Legal Entity authorized to submit on behalf of
58 | the copyright owner. For the purposes of this definition, "submitted"
59 | means any form of electronic, verbal, or written communication sent
60 | to the Licensor or its representatives, including but not limited to
61 | communication on electronic mailing lists, source code control systems,
62 | and issue tracking systems that are managed by, or on behalf of, the
63 | Licensor for the purpose of discussing and improving the Work, but
64 | excluding communication that is conspicuously marked or otherwise
65 | designated in writing by the copyright owner as "Not a Contribution."
66 |
67 | "Contributor" shall mean Licensor and any individual or Legal Entity
68 | on behalf of whom a Contribution has been received by Licensor and
69 | subsequently incorporated within the Work.
70 |
71 | 2. Grant of Copyright License. Subject to the terms and conditions of
72 | this License, each Contributor hereby grants to You a perpetual,
73 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
74 | copyright license to reproduce, prepare Derivative Works of,
75 | publicly display, publicly perform, sublicense, and distribute the
76 | Work and such Derivative Works in Source or Object form.
77 |
78 | 3. Grant of Patent License. Subject to the terms and conditions of
79 | this License, each Contributor hereby grants to You a perpetual,
80 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
81 | (except as stated in this section) patent license to make, have made,
82 | use, offer to sell, sell, import, and otherwise transfer the Work,
83 | where such license applies only to those patent claims licensable
84 | by such Contributor that are necessarily infringed by their
85 | Contribution(s) alone or by combination of their Contribution(s)
86 | with the Work to which such Contribution(s) was submitted. If You
87 | institute patent litigation against any entity (including a
88 | cross-claim or counterclaim in a lawsuit) alleging that the Work
89 | or a Contribution incorporated within the Work constitutes direct
90 | or contributory patent infringement, then any patent licenses
91 | granted to You under this License for that Work shall terminate
92 | as of the date such litigation is filed.
93 |
94 | 4. Redistribution. You may reproduce and distribute copies of the
95 | Work or Derivative Works thereof in any medium, with or without
96 | modifications, and in Source or Object form, provided that You
97 | meet the following conditions:
98 |
99 | (a) You must give any other recipients of the Work or
100 | Derivative Works a copy of this License; and
101 |
102 | (b) You must cause any modified files to carry prominent notices
103 | stating that You changed the files; and
104 |
105 | (c) You must retain, in the Source form of any Derivative Works
106 | that You distribute, all copyright, patent, trademark, and
107 | attribution notices from the Source form of the Work,
108 | excluding those notices that do not pertain to any part of
109 | the Derivative Works; and
110 |
111 | (d) If the Work includes a "NOTICE" text file as part of its
112 | distribution, then any Derivative Works that You distribute must
113 | include a readable copy of the attribution notices contained
114 | within such NOTICE file, excluding those notices that do not
115 | pertain to any part of the Derivative Works, in at least one
116 | of the following places: within a NOTICE text file distributed
117 | as part of the Derivative Works; within the Source form or
118 | documentation, if provided along with the Derivative Works; or,
119 | within a display generated by the Derivative Works, if and
120 | wherever such third-party notices normally appear. The contents
121 | of the NOTICE file are for informational purposes only and
122 | do not modify the License. You may add Your own attribution
123 | notices within Derivative Works that You distribute, alongside
124 | or as an addendum to the NOTICE text from the Work, provided
125 | that such additional attribution notices cannot be construed
126 | as modifying the License.
127 |
128 | You may add Your own copyright statement to Your modifications and
129 | may provide additional or different license terms and conditions
130 | for use, reproduction, or distribution of Your modifications, or
131 | for any such Derivative Works as a whole, provided Your use,
132 | reproduction, and distribution of the Work otherwise complies with
133 | the conditions stated in this License.
134 |
135 | 5. Submission of Contributions. Unless You explicitly state otherwise,
136 | any Contribution intentionally submitted for inclusion in the Work
137 | by You to the Licensor shall be under the terms and conditions of
138 | this License, without any additional terms or conditions.
139 | Notwithstanding the above, nothing herein shall supersede or modify
140 | the terms of any separate license agreement you may have executed
141 | with Licensor regarding such Contributions.
142 |
143 | 6. Trademarks. This License does not grant permission to use the trade
144 | names, trademarks, service marks, or product names of the Licensor,
145 | except as required for reasonable and customary use in describing the
146 | origin of the Work and reproducing the content of the NOTICE file.
147 |
148 | 7. Disclaimer of Warranty. Unless required by applicable law or
149 | agreed to in writing, Licensor provides the Work (and each
150 | Contributor provides its Contributions) on an "AS IS" BASIS,
151 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
152 | implied, including, without limitation, any warranties or conditions
153 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
154 | PARTICULAR PURPOSE. You are solely responsible for determining the
155 | appropriateness of using or redistributing the Work and assume any
156 | risks associated with Your exercise of permissions under this License.
157 |
158 | 8. Limitation of Liability. In no event and under no legal theory,
159 | whether in tort (including negligence), contract, or otherwise,
160 | unless required by applicable law (such as deliberate and grossly
161 | negligent acts) or agreed to in writing, shall any Contributor be
162 | liable to You for damages, including any direct, indirect, special,
163 | incidental, or consequential damages of any character arising as a
164 | result of this License or out of the use or inability to use the
165 | Work (including but not limited to damages for loss of goodwill,
166 | work stoppage, computer failure or malfunction, or any and all
167 | other commercial damages or losses), even if such Contributor
168 | has been advised of the possibility of such damages.
169 |
170 | 9. Accepting Warranty or Additional Liability. While redistributing
171 | the Work or Derivative Works thereof, You may choose to offer,
172 | and charge a fee for, acceptance of support, warranty, indemnity,
173 | or other liability obligations and/or rights consistent with this
174 | License. However, in accepting such obligations, You may act only
175 | on Your own behalf and on Your sole responsibility, not on behalf
176 | of any other Contributor, and only if You agree to indemnify,
177 | defend, and hold each Contributor harmless for any liability
178 | incurred by, or claims asserted against, such Contributor by reason
179 | of your accepting any such warranty or additional liability.
180 |
181 | END OF TERMS AND CONDITIONS
182 |
183 | APPENDIX: How to apply the Apache License to your work.
184 |
185 | To apply the Apache License to your work, attach the following
186 | boilerplate notice, with the fields enclosed by brackets "{}"
187 | replaced with your own identifying information. (Don't include
188 | the brackets!) The text should be enclosed in the appropriate
189 | comment syntax for the file format. We also recommend that a
190 | file or class name and description of purpose be included on the
191 | same "printed page" as the copyright notice for easier
192 | identification within third-party archives.
193 |
194 | Copyright {yyyy} {name of copyright owner}
195 |
196 | Licensed under the Apache License, Version 2.0 (the "License");
197 | you may not use this file except in compliance with the License.
198 | You may obtain a copy of the License at
199 |
200 | http://www.apache.org/licenses/LICENSE-2.0
201 |
202 | Unless required by applicable law or agreed to in writing, software
203 | distributed under the License is distributed on an "AS IS" BASIS,
204 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
205 | See the License for the specific language governing permissions and
206 | limitations under the License.
207 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | # CodeGen
6 | Official release for the **CodeGen1** and **CodeGen2** models (`350M`, `1B`, `3B`, `7B`, `16B`) for **Program Synthesis** by [Salesforce AI Research](https://www.salesforceairesearch.com/).
7 |
8 |
9 |
10 |
11 |
12 | ## News
13 |
14 | **July 2023**
15 |
16 | [**CodeGen2.5**](https://github.com/salesforce/CodeGen/tree/main/codegen25) released, outperforming 16B-parameter models with only 7B parameters.
17 |
18 | **May 2023**
19 |
20 | **CodeGen2.0** released with strong infill sampling capability.
21 |
22 | **March 2022**
23 |
24 | **CodeGen1.0** released, on par with OpenAI Codex at the time.
25 |
26 | ## Publications
27 |
28 | [CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
29 | [Erik Nijkamp](https://enijkamp.github.io/)\*, [Bo Pang](https://scholar.google.com/citations?user=s9fNEVEAAAAJ&hl=en)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Lifu Tu](https://home.ttic.edu/~lifu/), [Huan Wang](https://scholar.google.com/citations?user=7NpTttkAAAAJ&hl=en), [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en)
30 | ICLR, 2023
31 |
32 | [CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
33 | [Erik Nijkamp](https://enijkamp.github.io/)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en)
34 | ICLR, 2023
35 |
36 | ## Usage
37 |
38 | The models are available on the [Hugging Face Hub](https://huggingface.co/models?search=salesforce+codegen).
39 |
40 | **CodeGen1.0**
41 |
42 | ```python
43 | import torch
44 | from transformers import AutoTokenizer, AutoModelForCausalLM
45 |
46 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
47 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
48 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
49 | sample = model.generate(**inputs, max_length=128)
50 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
51 | ```
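The `truncate_before_pattern` argument above cuts the decoded completion at the first match of any of the given regexes (applied in multiline mode), so sampling stops spilling into a new prompt or docstring. As a rough pure-Python sketch of that behavior (`truncate_before` is a hypothetical helper for illustration, not part of `transformers`):

```python
import re

def truncate_before(text: str, patterns: list[str]) -> str:
    # Find the earliest match of any pattern (multiline mode, like the
    # tokenizer's truncate_before_pattern) and cut the text there.
    cut = len(text)
    for pat in patterns:
        m = re.search(pat, text, flags=re.MULTILINE)
        if m:
            cut = min(cut, m.start())
    return text[:cut]

completion = 'def hello():\n    print("hello world")\n\n\n# next prompt'
# Keeps only the first function, dropping the trailing "# next prompt" line.
print(truncate_before(completion, [r"\n\n^#", "^'''", "\n\n\n"]))
```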
52 |
53 | **CodeGen2.0**
54 |
55 | ```python
56 | import torch
57 | from transformers import AutoTokenizer, AutoModelForCausalLM
58 |
59 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
60 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")
61 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
62 | sample = model.generate(**inputs, max_length=128)
63 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
64 | ```
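CodeGen2.0 also supports infill sampling (see News above). According to the `Salesforce/codegen2-7B` model card (not this README), an infill prompt replaces the missing span with a mask sentinel and appends a sentinel suffix; a minimal sketch of building such a prompt (`make_infill_prompt` is a hypothetical helper, and the sentinel strings are taken from the model card):

```python
# Infill prompt format as described on the Salesforce/codegen2-7B model
# card: the span to fill becomes "<mask_1>", and the prompt ends with
# "<|endoftext|><sep><mask_1>" so the model emits the missing code.
def make_infill_prompt(prefix: str, suffix: str) -> str:
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

# `prompt` can then be tokenized and passed to model.generate() as in
# the snippet above.
prompt = make_infill_prompt("def hello():\n    ", '\n    return name\n')
```

The model card suggests truncating the generation at the first `<eom>` token to recover only the infilled span.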
65 |
66 | **CodeGen2.5**
67 |
68 | ```python
69 | import torch
70 | from transformers import AutoTokenizer, AutoModelForCausalLM
71 |
72 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
73 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")
74 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
75 | sample = model.generate(**inputs, max_length=128)
76 | print(tokenizer.decode(sample[0]))
77 | ```
78 |
79 | ## Training
80 |
81 | The Jaxformer library for data pre-processing, training, and fine-tuning of the CodeGen models can be found here:
82 |
83 | https://github.com/salesforce/jaxformer
84 |
85 | ## Citation
86 | If you find our code or paper useful, please cite the paper:
87 | ```bibtex
88 | @article{nijkamp2022codegen,
89 | title={CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis},
90 | author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
91 | journal={ICLR},
92 | year={2023}
93 | }
94 |
95 | @article{nijkamp2023codegen2,
96 | title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
97 | author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
98 | journal={ICLR},
99 | year={2023}
100 | }
101 | ```
102 |
103 | ## Ethics disclaimer for Salesforce AI models, data, code
104 |
105 | This release is for research purposes only in support of an academic
106 | paper. Our models, datasets, and code are not specifically designed or
107 | evaluated for all downstream purposes. We strongly recommend users
108 | evaluate and address potential concerns related to accuracy, safety, and
109 | fairness before deploying this model. We encourage users to consider the
110 | common limitations of AI, comply with applicable laws, and leverage best
111 | practices when selecting use cases, particularly for high-risk scenarios
112 | where errors or misuse could significantly impact people’s lives, rights,
113 | or safety. For further guidance on use cases, refer to our standard
114 | [AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf)
115 | and [AI AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ai-acceptable-use-policy.pdf).
116 |
--------------------------------------------------------------------------------
/SECURITY.md:
--------------------------------------------------------------------------------
1 | ## Security
2 |
3 | Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
4 | as soon as it is discovered. This library limits its runtime dependencies in
5 | order to reduce the total cost of ownership as much as possible, but all consumers
6 | should remain vigilant and have their security stakeholders review all third-party
7 | products (3PP) like this one and their dependencies.
8 |
--------------------------------------------------------------------------------
/assets/codegen_logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salesforce/CodeGen/533a6cc52b4dbd46286cfb12f492390f379ed8a1/assets/codegen_logo.png
--------------------------------------------------------------------------------
/assets/two.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/salesforce/CodeGen/533a6cc52b4dbd46286cfb12f492390f379ed8a1/assets/two.gif
--------------------------------------------------------------------------------
/codegen1/LICENSE.txt:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2022, Salesforce.com, Inc.
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
7 |
8 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
9 |
10 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
11 |
12 | 3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
13 |
14 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
15 |
--------------------------------------------------------------------------------
/codegen1/README.md:
--------------------------------------------------------------------------------
1 | # CodeGen1
2 |
3 | Official research release for the **CodeGen1** models (`2B`, `6B`, `16B`) for **Program Synthesis** as presented in ICLR 2023:
4 |
5 | *Title*: [CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis](https://arxiv.org/abs/2203.13474)
6 |
7 | *Authors*: [Erik Nijkamp](https://enijkamp.github.io/)\*, [Bo Pang](https://scholar.google.com/citations?user=s9fNEVEAAAAJ&hl=en)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Lifu Tu](https://home.ttic.edu/~lifu/), [Huan Wang](https://scholar.google.com/citations?user=7NpTttkAAAAJ&hl=en), [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en) (* indicates equal contribution)
8 |
9 | ## Hugging Face Integration
10 |
11 | The models are available on the [Hugging Face Hub](https://huggingface.co/models?search=salesforce+codegen).
12 |
13 | ## Sampling
14 |
15 | Program synthesis in the form of auto-regressive sampling can be performed as follows:
16 |
17 | ```python
18 | import torch
19 | from transformers import AutoTokenizer, AutoModelForCausalLM
20 |
21 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
22 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-mono")
23 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
24 | sample = model.generate(**inputs, max_length=128)
25 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
26 | ```
27 |
28 | ## Citation
29 |
30 | ```bibtex
31 | @article{nijkamp2022codegen,
32 | title={CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis},
33 | author={Nijkamp, Erik and Pang, Bo and Hayashi, Hiroaki and Tu, Lifu and Wang, Huan and Zhou, Yingbo and Savarese, Silvio and Xiong, Caiming},
34 | journal={ICLR},
35 | year={2023}
36 | }
37 | ```
38 |
--------------------------------------------------------------------------------
/codegen1/benchmark/README.md:
--------------------------------------------------------------------------------
1 | # Multi-Turn Programming Benchmark
2 |
3 |
4 | ## Format
5 |
6 | Each line is a problem expressed in JSON format, consisting of the following fields:
7 |
8 | * `id`: Problem ID
9 | * `name`: Problem name (cf. Appendix D)
10 | * `description`: A short description of the problem
11 | * `category`: Manually labeled problem category
12 | * `prompts`: A list of templated prompt strings, one per turn of the problem.
13 | * `inputs`: A list consisting of 5 test case inputs. Each test case is a key-value table mapping the variables (used in the templated prompt) to actual values.
14 | * `outputs`: A list consisting of 5 test case outputs. Each test case is an expected output value of the program.
15 | * `max_gen_length`: Maximum number of tokens allotted to each turn of the problem. The value is mostly 128, since a single turn rarely requires many lines of code, but we set a higher limit for problems where longer generations are expected.
16 |
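A problem line can be loaded and its prompt templates filled in with one of its test cases as sketched below (the helper names `load_problems` and `instantiate` are illustrative, not part of the benchmark code):

```python
import json

def load_problems(path):
    """Yield one problem dict per non-empty line of the JSONL file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def instantiate(problem, case_idx):
    """Render a problem's multi-turn prompts for one test case.

    Template variables such as {a1} are substituted from the test-case
    table; literal braces appear escaped as {{ }} in the data.
    """
    case = problem["inputs"][case_idx]
    return [p.format(**case) for p in problem["prompts"]]

# Example on an abridged problem line (id 3, "Convert time"):
raw = '{"prompts": ["Print the total seconds of {a1} hours and {a2} minutes."], "inputs": [{"a1": 2, "a2": 13}]}'
problem = json.loads(raw)
print(instantiate(problem, 0))
# ['Print the total seconds of 2 hours and 13 minutes.']
```

Each rendered turn is then fed to the model in sequence, with generation per turn capped at `max_gen_length` tokens.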
--------------------------------------------------------------------------------
/codegen1/benchmark/mtpb.jsonl:
--------------------------------------------------------------------------------
1 | {"prompts": ["Assign the string \"{A}\" to a variable named \"my_string\".", "Lowercase the given string \"my_string\".", "Assign the distinct characters of the string to a variable named \"chars\".", "Sort these characters in alphabetical order.", "Print the resulting list of characters."], "inputs": [{"A": "abcde"}, {"A": "abcdecadeCADE"}, {"A": "aaaaAAAAaaaa"}, {"A": "Jerry jERRY JeRRRY"}, {"A": "ddddc"}], "outputs": [["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "e"], ["a"], [" ", "e", "j", "r", "y"], ["c", "d"]], "max_gen_length": 128, "category": "string", "name": "Sandwich string", "description": "Append a string in the middle of another string.", "id": "1"}
2 | {"prompts": ["Define a list of integers named \"numbers\" with the values {numbers}.", "Calculate the sum of the elements in variable \"numbers\" and store the result to variable \"total\".", "Divide each element of the list by the total and multiply by 100, store the result to variable \"normalized\".", "Convert each element in variable \"normalized\" into a formatted string with single decimal point and store the result into \"formatted\".", "Print the variable \"formatted\"."], "inputs": [{"numbers": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {"numbers": [56, 97, 19, 57, 69]}, {"numbers": []}, {"numbers": [1]}, {"numbers": [10000, 1]}], "outputs": [["1.8", "3.6", "5.5", "7.3", "9.1", "10.9", "12.7", "14.5", "16.4", "18.2"], ["18.8", "32.6", "6.4", "19.1", "23.2"], [], ["100.0"], ["100.0", "0.0"]], "max_gen_length": 128, "category": "math", "name": "Normalize integer list", "description": "Normalize a list of positive integers and print formatted percentages.", "id": "2"}
3 | {"prompts": ["Write a function that takes an integer minutes and converts it to seconds.", "Write a function that takes an integer hours and converts it to seconds.", "Print the total seconds of {a1} hours and {a2} minutes."], "inputs": [{"a1": 2, "a2": 13}, {"a1": 1, "a2": 2}, {"a1": 32, "a2": 32}, {"a1": 0, "a2": 32}, {"a1": 1, "a2": 1}], "outputs": [7980, 3720, 117120, 1920, 3660], "max_gen_length": 128, "category": "math", "name": "Convert time", "description": "Convert units of time.", "id": "3"}
4 | {"prompts": ["Implement a function which returns the n-th Fibonacci number.", "Implement a function that computes the square of an integer argument.", "Print out the square of {a1}-th Fibonacci number."], "inputs": [{"a1": 1}, {"a1": 2}, {"a1": 3}, {"a1": 4}, {"a1": 10}], "outputs": [1, 1, 4, 9, 3025], "max_gen_length": 128, "category": "math", "name": "Squared Fibonacci", "description": "Print the squared fibonacci numbers.", "id": "4"}
5 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Count the number of negative numbers in the list as \"n_neg\".", "Count the number of positive numbers in the list as \"n_pos\".", "Print out the larger number of those two."], "inputs": [{"A": "[1,2,3,4]"}, {"A": "[-1,2,3,4]"}, {"A": "[-1,-2,-3,-4]"}, {"A": "[-1000000, 1, 2]"}, {"A": "[-1, 0.2, 0.3, 0.4]"}], "outputs": [4, 3, 4, 2, 3], "max_gen_length": 128, "category": "array", "name": "Count negative numbers", "description": "Count negative numbers in a given list.", "id": "5"}
6 | {"prompts": ["Import the pandas library.", "Create a dataframe with a column labeled \"Yes\" with values [{a1}, {a2}] and a column named \"No\" with values [{a3}, {a4}].", "Compute the mean per column and store the value in a variable named means.", "Print the variable means."], "inputs": [{"a1": "50", "a2": "21", "a3": "131", "a4": "2"}, {"a1": "-10", "a2": "10", "a3": "-20", "a4": "20"}, {"a1": "1", "a2": "2", "a3": "3", "a4": "4"}, {"a1": "-1", "a2": "-2", "a3": "-3", "a4": "-4"}, {"a1": "-10", "a2": "-20", "a3": "-30", "a4": "-40"}], "outputs": [[35.5, 66.5], [0.0, 0.0], [1.5, 3.5], [-1.5, -3.5], [-15.0, -35.0]], "max_gen_length": 128, "category": "data science", "name": "Pandas mean", "description": "Construct and compute the mean of a pandas df.", "id": "6"}
7 | {"prompts": ["Write a function that returns a number, for numbers multiple of {a1} print \"fizz\" instead of a number, for numbers multiple of {a2} print \"buzz\", for numbers which are multiples of both {a1} and {a2} \"fizzbuzz\".", "Create a list of integers ranging from {a3} to {a4}.", "Call the written function for each element in the list and store the result as \"new_list\".", "Print out the list \"new_list\"."], "inputs": [{"a1": 3, "a2": 5, "a3": 0, "a4": 4}, {"a1": 5, "a2": 3, "a3": 0, "a4": 9}, {"a1": 9, "a2": 3, "a3": 0, "a4": 2}, {"a1": 2, "a2": 4, "a3": 0, "a4": 7}, {"a1": 2, "a2": 4, "a3": 4, "a4": 7}], "outputs": [["fizzbuzz", 1, 2, "fizz", 4], ["fizzbuzz", 1, 2, "buzz", 4, "fizz", "buzz", 7, 8, "buzz"], ["fizzbuzz", 1, 2], ["fizzbuzz", 1, "fizz", 3, "fizzbuzz", 5, "fizz", 7], ["fizzbuzz", 5, "fizz", 7]], "max_gen_length": 128, "category": "algorithm", "name": "Fizz buzz", "description": "Solve the fizz buzz problem.", "id": "7"}
8 | {"prompts": ["Write a function that can take a string and return a list of word bigrams as pairs.", "Assign the string \"{a1}\" to a variable named sentence.", "Print out the bi-grams for the variable named sentence."], "inputs": [{"a1": "Have free hours and love children? Drive kids to school, soccer practice and other activities."}, {"a1": "Hello World Foo Bar"}, {"a1": "AA BB CC"}, {"a1": "abc de"}, {"a1": "AB CD EF"}], "outputs": [[["Have", "free"], ["free", "hours"], ["hours", "and"], ["and", "love"], ["love", "children?"], ["children?", "Drive"], ["Drive", "kids"], ["kids", "to"], ["to", "school,"], ["school,", "soccer"], ["soccer", "practice"], ["practice", "and"], ["and", "other"], ["other", "activities."]], [["Hello", "World"], ["World", "Foo"], ["Foo", "Bar"]], [["AA", "BB"], ["BB", "CC"]], [["abc", "de"]], [["AB", "CD"], ["CD", "EF"]]], "max_gen_length": 128, "category": "string", "name": "Bi-grams", "description": "Print the bi-grams of a sentence.", "id": "8"}
9 | {"prompts": ["Assign the names [\"Kevin\", \"John\", \"Mike\", \"Mitch\"] as keys and corresponding notes [{a1}, {a2}, {a3}, {a4}] as values to a dictionary named \"my_notes\".", "Create a function that takes a dictionary of objects like {{ \"name\": \"John\", \"notes\": [3, 5, 4] }} and returns a dictionary of objects like {{ \"name\": \"John\", \"top_note\": 5 }}.", "For each name in the dictionary get the top_note and store the pairs of names and top_notes as \"my_list\".", "Find the name with the highest top_note and assign it to \"top_name\".", "Print the variable top_name."], "inputs": [{"a1": [3, 5, 4], "a2": [3, 1, 1], "a3": [1, 2, 3], "a4": [0, 4, 4]}, {"a1": [0], "a2": [1], "a3": [2], "a4": [3]}, {"a1": [0, 7], "a2": [1, 9], "a3": [2, 7], "a4": [3, 6]}, {"a1": [-1], "a2": [-1], "a3": [1], "a4": [-1]}, {"a1": [0], "a2": [10000], "a3": [1000], "a4": [9999]}], "outputs": ["Kevin", "Mitch", "Mike", "Mike", "John"], "max_gen_length": 128, "category": "dict", "name": "Top note", "description": "Print name with top note out of a dict.", "id": "9"}
10 | {"prompts": ["Create a function that will take a HEX number and returns the binary equivalent (as a string). E.g., to_binary(0xFF) = \"11111111\".", "Create a function that will take the output of the above function and return the HEX number. E.g., to_hex(\"11111111\") = 0xFF.", "Assign the value {a1} to a variable named \"my_hex\".", "Convert the variable \"my_hex\" into the binary equivalent as string named \"my_binary\".", "Convert \"my_binary\" back to a HEX number named \"result\".", "Print the result."], "inputs": [{"a1": "0xFF"}, {"a1": "0xAA"}, {"a1": "0xAF"}, {"a1": "0x12"}, {"a1": "0xAA"}], "outputs": [255, 170, 175, 18, 170], "max_gen_length": 128, "category": "math", "name": "Hex to binary", "description": "Hex to binary and reverse.", "id": "10"}
11 | {"prompts": ["Assign the keys {a1} and values {a2} to a dictionary named \"my_dict\".", "Write a function \"invert\" that inverts the keys and values of a dictionary. E.g., invert({{ \"z\": \"q\", \"w\": \"f\" }}) = {{ \"q\": \"z\", \"f\": \"w\" }}.", "Write a function \"is_inverted\" that takes two dicts as arguments and returns a boolean which indicates if the second dict is an inversion of the first dict argument.", "Create a new variable \"my_dict2\" and initialize it with {a3} \"my_dict\".", "Print a boolean value indicating if \"my_dict2\" is the inverted dictionary of \"my_dict\"."], "inputs": [{"a1": "[\"a\", \"b\"]", "a2": "[1, 2]", "a3": ""}, {"a1": "[\"a\", \"b\"]", "a2": "[1, 2]", "a3": "inverted"}, {"a1": "[\"a\", \"b\", \"c\"]", "a2": "[1, 2, -1]", "a3": ""}, {"a1": "[\"a\", \"b\", \"c\"]", "a2": "[1, 2, -1]", "a3": "inverted"}, {"a1": "[\"1\"]", "a2": "[1]", "a3": ""}], "outputs": [false, true, false, true, false], "max_gen_length": 128, "category": "dict", "name": "Invert dict", "description": "Detect inversion of dict.", "id": "11"}
12 | {"prompts": ["Defines class named \"Player\" that takes the following four arguments for a particular football player: name, age, height, weight.", "Also, create three functions for the class that returns the following strings: (1) get_age() returns \"{{name}} is age {{age}}\", (2) get_height() returns \"{{name}} is {{height}} cm\", (3) get_weight() returns \"{{name}} weighs {{weight}} kg\".", "Create an object named \"player\" with name \"{a1}\", age {a2}, height {a3}, weight {a4}.", "Call the getter for the {a5} of the player and print the result."], "inputs": [{"a1": "David Jones", "a2": 25, "a3": 175, "a4": 75, "a5": "age"}, {"a1": "Paul Smith", "a2": 50, "a3": 160, "a4": 60, "a5": "weight"}, {"a1": "Paul Smith", "a2": 50, "a3": 160, "a4": 60, "a5": "height"}, {"a1": "Herr Schmidth Gold", "a2": 50, "a3": 210, "a4": 60, "a5": "height"}, {"a1": "Paul Smith", "a2": 5, "a3": 160, "a4": 60, "a5": "age"}], "outputs": ["David Jones is age 25", "Paul Smith weighs 60 kg", "Paul Smith is 160 cm", "Herr Schmidth Gold is 210 cm", "Paul Smith is age 5"], "max_gen_length": 128, "category": "class", "name": "Class definition", "description": "Create POJO class.", "id": "12"}
13 | {"prompts": ["Create a function \"num_len\" that takes a number num and returns its length. E.g., number_length(5000) = 4.", "Initialize a last \"my_list\" with the values {a1}", "Print the longest number in this list."], "inputs": [{"a1": "[1, 2, 3, 12]"}, {"a1": "[-123, 2, 3, 12]"}, {"a1": "[1]"}, {"a1": "[-12, 1]"}, {"a1": "[1, 22, 333, 4444, -55555]"}], "outputs": [12, -123, 1, -12, -55555], "max_gen_length": 128, "category": "math", "name": "Longest number", "description": "Print longest number.", "id": "13"}
14 | {"prompts": ["Import the class LinearRegression from sklearn.", "Import math.", "Assign integers ranging from 0 to 10 (inclusive) to \"x\".", "Define a function \"f\" that multiplies a input argument by 2.", "Create a numpy array of numbers \"y\" by applying f to each element of x.", "Initialize a linear regression model.", "Fit the model to input x and output y (reshape both arguments with reshape(-1, 1)).", "Predict a variable \"x_hat\" at x=[[{a1}]] using the fitted model.", "Apply ceil() to the predicted value and print it as an integer."], "inputs": [{"a1": "1"}, {"a1": "2"}, {"a1": "3"}, {"a1": "4"}, {"a1": "5"}], "outputs": [2, 4, 6, 8, 10], "max_gen_length": 128, "category": "data science", "name": "Linear regression", "description": "Fit linear regression model with specified function and sk-learn.", "id": "14"}
15 | {"prompts": ["Create a function encrypt that takes a string as an argument and returns a string encrypted with the alphabet being rotated. The alphabet should be rotated in a manner such that the letters shift down by two places. For example: encrypt('hi') returns 'jk', encrypt('asdfghjkl') returns 'cufhijlmn', encrypt('gf') returns 'ih'.", "Create a function decrypt that decodes the encrypted string from encrypt() back into the original text.", "Assign \"{a1}\" to a variable named \"original_text\".", "Call the function encrypt with original_text as argument and assign the result to a variable named 'encrypted_text'.", "Call the function decrypt with encrypted_text as argument and assign the result to a variable named 'restored_text'.", "Create a list named \"my_result\" containing restored_text and encrypted_text as elements.", "Print the list."], "inputs": [{"a1": "hi"}, {"a1": "asdfghjkl"}, {"a1": "gf"}, {"a1": "Hello World"}, {"a1": "This is a LONG string for our encryption algOrithm."}], "outputs": [["hi", "jk"], ["asdfghjkl", "cufhijlmn"], ["gf", "ih"], ["Hello World", "Hgnnq Wqtnf"], ["This is a LONG string for our encryption algOrithm.", "Tjku ku c LONG uvtkpi hqt qwt gpetarvkqp cniOtkvjo."]], "max_gen_length": 128, "category": "algorithm", "name": "Encrypt and decrypt", "description": "Rotate alphabet for encryption. Write a function for decryption (inverse of encrypt()). Concat should give identity function.", "id": "15"}
16 | {"prompts": ["Defines a class \"Person\" which takes name and id as constructor arguments.", "Extend the class with a function __hash__ which uses the {a1} property as hash value.", "Extend the class with a function __eq__ which returns true, if the hash value of the passed object and self are identical.", "Create a list \"persons\" with instances of Person and names \"Person A\", \"Person B\", \"Person {a3}\" and ids {a2}.", "Create a set \"unique_persons\" of this list.", "Print the number of elements in the set."], "inputs": [{"a1": "id", "a2": "1, 2, 2", "a3": "C"}, {"a1": "name", "a2": "1, 2, 2", "a3": "C"}, {"a1": "id", "a2": "2, 2, 2", "a3": "C"}, {"a1": "id", "a2": "1, 2, 3", "a3": "C"}, {"a1": "name", "a2": "1, 1, 1", "a3": "B"}], "outputs": [2, 3, 1, 3, 2], "max_gen_length": 128, "category": "class", "name": "Compare object equivalence", "description": "Implement a class with __hash__ and obtain a count unique objects.", "id": "16"}
17 | {"prompts": ["Python got drunk and the built-in functions str() and int() are acting odd: \n# str(4) = 4\n# str(\"4\") = 4\n# int(\"4\") = \"4\"\n# int(4) = \"4\".", "Create a function called int_to_str() that converts integers into strings. E.g., int_to_str(4) = \"4\".", "Create a function called str_to_int() that converts integers into strings. E.g., str_to_int(\"4\") = 4.", "Create a list named \"my_result\" with elements int_to_str({a1}) and str_to_int(\"{a1}\").", "Print the list."], "inputs": [{"a1": "29348"}, {"a1": "1"}, {"a1": "123"}, {"a1": "2344"}, {"a1": "-1"}], "outputs": [[29348, "29348"], [1, "1"], [123, "123"], [2344, "2344"], [-1, "-1"]], "max_gen_length": 128, "category": "string", "name": "Drunken python", "description": "Overload built-in functions, and write functions which correct drunken functions.", "id": "17"}
18 | {"prompts": ["Initialize dictionary of Morse codes named 'chars_to_dots' with values ['A': '.-', 'B': '-...', 'C': '-.-.', 'D': '-..', 'E': '.', 'F': '..-.','G': '--.', 'H': '....', 'I': '..', 'J': '.---', 'K': '-.-', 'L': '.-..','M': '--', 'N': '-.', 'O': '---', 'P': '.--.', 'Q': '--.-', 'R': '.-.','S': '...', 'T': '-', 'U': '..-', 'V': '...-', 'W': '.--', 'X': '-..-','Y': '-.--', 'Z': '--..', ' ': ' ', '0': '-----','1': '.----', '2': '..---', '3': '...--', '4': '....-', '5': '.....','6': '-....', '7': '--...', '8': '---..', '9': '----.','&': '.-...', \"'\": '.----.', '@': '.--.-.', ')': '-.--.-', '(': '-.--.',':': '---...', ',': '--..--', '=': '-...-', '!': '-.-.--', '.': '.-.-.-','-': '-....-', '+': '.-.-.', '\"': '.-..-.', '?': '..--..', '/': '-..-.']", "Create a function named 'encode_morse' that takes a string as an argument and returns the Morse code equivalent.", "Create a function named 'decode_morse' that takes a Morse code as an argument and returns the decodes string.", "Encode '{a1}' to morse code and assign the result to 'morse_code'.", "Decode the variable named 'morse_code' to a string named 'decoded_text'.", "Print the variable named 'decoded_text'."], "inputs": [{"a1": "Hello World"}, {"a1": "Hello Foo"}, {"a1": "Hello WORLD"}, {"a1": "foo BAR"}, {"a1": "This is a long string"}], "outputs": ["HELLO WORLD", "HELLO FOO", "HELLO WORLD", "FOO BAR", "THIS IS A LONG STRING"], "max_gen_length": 512, "category": "algorithm", "name": "Morse code", "description": "Encode a string into morse code given its conversion rule.", "id": "18"}
19 | {"prompts": ["Initialize a list of integers with {a1} and a variable named target with a value of {a2}.", "Implement a function \"two_sum\" solving two sum problem given a list of integers and a target argument.", "Run the function and print out the result."], "inputs": [{"a1": "[0,1,2,3]", "a2": "4"}, {"a1": "[1, 11, 111]", "a2": "122"}, {"a1": "[-1, 0, 2, 4]", "a2": "3"}, {"a1": "[10, 20, 30, 40]", "a2": "70"}, {"a1": "[-1, -1, 123, -123]", "a2": "0"}], "outputs": [[1, 3], [1, 2], [0, 3], [2, 3], [2, 3]], "max_gen_length": 128, "category": "algorithm", "name": "Two-sum", "description": "Implement the two-sum problem on a given input pair.", "id": "19"}
20 | {"prompts": ["Implement a function to sample n points from a bivariate normal distribution with mean (x_mean, y_mean) and standard deviation (x_std, y_std).", "Call the function to sample 100 points named points1 centered at ({a1}, {a1}) with standard deviation (1, 1).", "Call the function to sample 100 points named points2 centered at (-{a1}, -{a1}) with standard deviation (1, 1).", "Concatenate these data points.", "Implement the k-means clustering algorithm with n iterations and the centroids as return value.", "Run the algorithm on the points for 100 iterations with 2 clusters and assign the result to \"my_centroids\".", "Assign the centroid with negative coordinates to c1 and the one with positive coordinates to c2.Round the coordinates element-wise to the nearest integers and print the two centroids c1, c2 in the format of \"(x1, y1), (x2, y2)\"."], "inputs": [{"a1": 10}, {"a1": 20}, {"a1": 30}, {"a1": 40}, {"a1": 50}], "outputs": ["(-10, -10), (10, 10)", "(-20, -20), (20, 20)", "(-30, -30), (30, 30)", "(-40, -40), (40, 40)", "(-50, -50), (50, 50)"], "max_gen_length": 256, "category": "data science", "name": "k-means", "description": "Implement and run k-means on sampled points.", "id": "20"}
21 | {"prompts": ["Define a list of integers named \"elements\" with values {numbers}.", "Calculate the sum of the even numbers of the list and store the result to variable \"even\".", "Calculate the sum of the odd numbers in the same list and store the result to \"odd\".", "Create a list named \"my_result\" containing the variables even and odd.", "Print the list."], "inputs": [{"numbers": [1]}, {"numbers": [2e+100, 5e+100, -11, 10]}, {"numbers": []}, {"numbers": [-5, 1, 6, -25, -36, 6]}, {"numbers": [73, 4, 14, 95, 69, 57, 82, 4, 75, 50, 91, 4, 83, 89, 61, 67, 53, 54, 48, 10]}], "outputs": [[0, 1], [7e+100, -11], [0, 0], [-24, -29], [270, 813]], "max_gen_length": 128, "category": "math", "name": "Even odd sum", "description": "Print the sum of even and odd numbers in an array.", "id": "21"}
22 | {"prompts": ["Define a list named \"elements\" with the values {lst}.", "Count the number of zeros in variable elements and store the value into variable \"zero_count\".", "Scan through the list in order and remove all the zeros, store the result into variable \"non_zero\".", "Merge the variable non_zero and a new list containing \"zero_count\" 0s and store the result to \"result\". Print the variable \"result\"."], "inputs": [{"lst": ["a", "b", "c", "d", "e", "f", "g"]}, {"lst": ["a", 0, 0, "b", "c", "d", 0, 1, 0, 1, 0, 3, 0, 1, 9, 0, 0, 0, 0, 9]}, {"lst": [0]}, {"lst": [-1, 0, 1e-05, 0, 1e-30, 0]}, {"lst": [0, 1, null, 2, false, 1, 0]}], "outputs": [["a", "b", "c", "d", "e", "f", "g"], ["a", "b", "c", "d", 1, 1, 3, 1, 9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0], [-1, 1e-05, 1e-30, 0, 0, 0], [1, null, 2, false, 1, 0, 0]], "max_gen_length": 128, "category": "array", "name": "Shift zeros", "description": "Move all the zeroes in a list to the right.", "id": "22"}
23 | {"prompts": ["Import numpy and initialize a numpy array named X with values {array}.", "Write a function that can take a numpy array and return an array of same size consisting of samples with replacement from the input.", "Call the function {n} times and stack the arrays into a new 2d array named \"samples\".", "Calculate the mean of each element in variable \"sample\" and store the result to \"mean\".", "Compute the 2.5 and 97.5 percentile of the variable mean and store the values into a new list named \"percentile\".", "Print the variable \"percentile\"."], "inputs": [{"array": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], "n": 1000}, {"array": "consisting of 1000 randomly sampled integers ranging from 0 to 10", "n": 1000}, {"array": "consisting of 1000 randomly sampled integers ranging from 0 to 10", "n": 10000}, {"array": "consisting of 1000 uniformly sampled floats in [0, 1)", "n": 1000}, {"array": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "n": 100}], "outputs": [[20.525, 28.575], [4.8025, 5.1975], [4.8025, 5.1975], [0.4825, 0.5175], [1, 1]], "max_gen_length": 128, "category": "data science", "name": "Bootstrap 95% CI", "description": "Define an array, sample N times, calculate means, calculate the percentile.", "id": "23"}
24 | {"prompts": ["Given two positive integers {a} and {b}, store the even single digits between a and b (inclusive) as \"my_digits\".", "Assign the sum of the even digits to the variable \"result\".", "Print the resulting number as integer."], "inputs": [{"a": 8, "b": 2}, {"a": 2, "b": 8}, {"a": 2, "b": 6}, {"a": 132, "b": 6}, {"a": 17, "b": 96}], "outputs": [20, 20, 12, 14, 0], "max_gen_length": 128, "category": "math", "name": "Sum even digits", "description": "Sum even digits between two numbers.", "id": "24"}
25 | {"prompts": ["Find the maximum element in the list {A} and assign it to variable \"my_max\".", "Find the minimum element in the same list.", "Compute the different between \"my_max\" and the minimum element.", "Print the difference"], "inputs": [{"A": [0, 4]}, {"A": [4, 0]}, {"A": [0]}, {"A": [0, 7, 6]}, {"A": [2, 4, 7, 20, 6]}], "outputs": [4, 4, 0, 7, 18], "max_gen_length": 128, "category": "array", "name": "Min-max diff", "description": "Compute the difference between maximum and minimum numbers in a list.", "id": "25"}
26 | {"prompts": ["Assign the string \"{A}\" to a variable named \"my_string\".", "Lowercase the given string \"my_string\".", "Assign the distinct characters of the string to a variable named \"chars\".", "Sort these characters in alphabetical order.", "Print the resulting list of characters."], "inputs": [{"A": "abcde"}, {"A": "abcdecadeCADE"}, {"A": "aaaaAAAAaaaa"}, {"A": "Jerry jERRY JeRRRY"}, {"A": "ddddc"}], "outputs": [["a", "b", "c", "d", "e"], ["a", "b", "c", "d", "e"], ["a"], [" ", "e", "j", "r", "y"], ["c", "d"]], "max_gen_length": 128, "category": "string", "name": "Distinct chars", "description": "Print the sorted, case-insensitive unique characters of a string.", "id": "26"}
27 | {"prompts": ["Create two variables \"a\" and \"b\" for the strings \"{A}\" and \"{B}\", respectively.", "Define a function \"len_str\" that returns the length of a string.", "Assign the length of each string to a seperate variable.", "Assign the longer string to the variable \"result\".", "Print the resulting string."], "inputs": [{"A": "abcde", "B": "ab"}, {"A": "ab", "B": "abcde"}, {"A": "a", "B": "aa"}, {"A": "aaaaaaaaaa", "B": "cdeee"}, {"A": "f", "B": "gg"}], "outputs": ["abcde", "abcde", "aa", "aaaaaaaaaa", "gg"], "max_gen_length": 128, "category": "string", "name": "Longer string", "description": "Compare and print the longer string given two strings.", "id": "27"}
28 | {"prompts": ["Assign the positive floating point number {A} to a variable \"f\".", "Compute the integer part of the number as variable \"a\".", "Assign the digits of the fractional part of the floating point number to an integer variable \"b\".", "Add them together and print the result."], "inputs": [{"A": 17.82}, {"A": 1.1}, {"A": 1000000.0000001}, {"A": 0.0101}, {"A": 100.5}], "outputs": [99, 2, 1000001, 101, 105], "max_gen_length": 128, "category": "math", "name": "Sum float digits", "description": "Sum numbers before an after the decimal point of a float.", "id": "28"}
29 | {"prompts": ["Assign the string value {s} to a variable \"my_string\".", "Lowercase the defined string.", "Count the number of vowels", "Print out the number"], "inputs": [{"s": "CelebrAtion"}, {"s": "PaLm"}, {"s": "PrEdictiOn"}, {"s": ""}, {"s": "ABC"}], "outputs": [5, 1, 4, 0, 1], "max_gen_length": 128, "category": "string", "name": "Count vowels", "description": "Count the number of vowels in a string.", "id": "29"}
30 | {"prompts": ["Assign the positive integer {n} to a variable \"f\".", "Create a list from 1 to \"f\" (inclusive).", "Create and initialize a variable named \"factorial\".", "Compute the product of all the values in the list and assign the product to \"factorial\".", "Print out the variable \"factorial\"."], "inputs": [{"n": 2}, {"n": 4}, {"n": 10}, {"n": 1}, {"n": 5}], "outputs": [2, 24, 3628800, 1, 120], "max_gen_length": 128, "category": "math", "name": "Factorial", "description": "Compute the factorial of n.", "id": "30"}
31 | {"prompts": ["Given two positive integers, {a} {b}, which are the lengths of two edges of a triangle, compute the sum of the two edges and store it in a variable \"two-edges\".", "Compute the maximum length of the third edge by substracting 1 from \"two-edges\" and store the value in a variable \"maximum-edge\".", "Compute the minimum length of the third edge and store the value in a variable \"minimum-edge\".", "Assign value of maximum-edge and minimum-edge to a tuple named \"my_tuple\".", "Print the variable \"my_tuple\"."], "inputs": [{"a": 8, "b": 9}, {"a": 5, "b": 7}, {"a": 9, "b": 2}, {"a": 1, "b": 1}, {"a": 1000, "b": 1000}], "outputs": [[17, 2], [11, 3], [10, 8], [1, 1], [1999, 1]], "max_gen_length": 128, "category": "math", "name": "Max edge triangle", "description": "Finds the maximum range of a triangle's third edge.", "id": "31"}
32 | {"prompts": ["Compute factorial", "Implement a function to compute the remainder when dividing a number by 10", "Print out the remainder when dividing the factorial of {n} by 10"], "inputs": [{"n": 2}, {"n": 4}, {"n": 10}, {"n": 1}, {"n": 5}], "outputs": [2, 4, 0, 1, 0], "max_gen_length": 128, "category": "math", "name": "Factorial and remainder", "description": "Compute the factorial and its remainder when divided.", "id": "32"}
33 | {"prompts": ["Given a positive integer {n} and create a variable named \"n\" with this value", "Compute the the total sum of internal angles in degrees of a regular-polygon with \"n\" sides", "Convert the angle from degrees to radians", "Round the angle to have two decimal digits", "Print out the angle"], "inputs": [{"n": 3}, {"n": 4}, {"n": 1000}, {"n": 10}, {"n": 100}], "outputs": [3.14, 6.28, 3135.31, 25.13, 307.88], "max_gen_length": 128, "category": "math", "name": "Sum polygon angles", "description": "Sum the angles in a polygon.", "id": "33"}
34 | {"prompts": ["Assign two strings {s1} and {s2} to the variable named s1 and the variable named s2 respectively", "Convert s1 and s2 to integers", "Compute the sum of the two integers and store it as the variable s", "Print out the variable s"], "inputs": [{"s1": "111", "s2": "222"}, {"s1": "2", "s2": "4"}, {"s1": "0", "s2": "12"}, {"s1": "50", "s2": "100"}, {"s1": "10000", "s2": "1"}], "outputs": [333, 6, 12, 150, 10001], "max_gen_length": 128, "category": "string", "name": "Sum string numbers", "description": "Add together two numbers represented in string.", "id": "34"}
35 | {"prompts": ["Initialize the variable named lst with an integer list {l}.", "Find the maximum of the variable lst and assign it to a variable named ma.", "Find the minimum of the variable lst and assign to a variable named mi.", "Create a list from mi and ma (inclusive).", "Print the sum of this list."], "inputs": [{"l": [4, 3, 8, 2]}, {"l": [17, 16, 15, 10, 11, 12]}, {"l": [1, 2]}, {"l": [10]}, {"l": [1, 100]}], "outputs": [35, 108, 3, 10, 5050], "max_gen_length": 128, "category": "array", "name": "Min-max sum", "description": "Sum the range from the minimum to the maximum of a list.", "id": "35"}
36 | {"prompts": ["Implement a function to return the characters shared between two words.", "Implement a function to find the number of vowels in a string.", "Find the shared characters of {s1} and {s2}, concatenate them into a string, and assign it to a variable named s.", "Print the number of vowels in the variable s"], "inputs": [{"s1": "meaty", "s2": "apple"}, {"s1": "fan", "s2": "forsook"}, {"s1": "spout", "s2": "shout"}, {"s1": "happiness", "s2": "fitness"}, {"s1": "code", "s2": "fork"}], "outputs": [2, 0, 2, 2, 1], "max_gen_length": 128, "category": "string", "name": "Vowel overlap", "description": "Find the number of overlapping vowels of two words.", "id": "36"}
37 | {"prompts": ["Given a list of integers {l}, assign the list to a variable named lst1.", "Find the negative numbers of the list and assign them to a new variable named lst2", "Compute the sum of numbers in lst2", "Print out the sum"], "inputs": [{"l": [-1, -2, 0, 1, 5]}, {"l": [5, 2, 0, 5, 10]}, {"l": [-100, -20, -3, 0, 0]}, {"l": [-23, -2, -5, 1000, 23, -10, -100, -10]}, {"l": [5, 1000, 0, 1, 0, 0, 0, 1, 1]}], "outputs": [-3, 0, -123, -150, 0], "max_gen_length": 128, "category": "math", "name": "Sum neg", "description": "Sum of negative numbers in a list.", "id": "37"}
38 | {"prompts": ["Import the pandas library.", "Read a dataframe \"df\" from the csv file located in \"./datasets/mlbootcamp5_train.csv\".", "Group by the column \"gender\" and assign the value counts for \"{a1}\" to a variable named \"my_counts\".", "Assign the attribute \"values\" of this variable to a new variable named \"plain_list\".", "Print the maximum element of this list."], "inputs": [{"a1": "alco"}, {"a1": "age"}, {"a1": "smoke"}, {"a1": "active"}, {"a1": "weight"}], "outputs": [44369, 25, 44717, 36516, 2770], "max_gen_length": 128, "category": "data science", "name": "Load dataset", "description": "Load from a file and print statistics.", "id": "38"}
39 | {"prompts": ["Define a string named 's' with the value '{s}'.", "Import re and compile a regular expression that matches comma and period and store the result to variable 'pattern'", "Use the variable 'pattern' to substitute all the commas and periods in the string 's' and store the result to variable 's2'", "Split the string 's2' into a list of words with a space and store the result to variable 'words'", "Print a list of integers consisting of the length of each word in 'words'"], "inputs": [{"s": "Hello, World!"}, {"s": "Raising Skinny Elephants Is Utterly Boring"}, {"s": "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "}, {"s": ",."}, {"s": "Wow! Is this a real sentence?"}], "outputs": [[5, 6], [7, 6, 9, 2, 7, 6], [5, 5, 5, 3, 4, 11, 10, 4, 3, 2, 7, 6, 10, 2, 6, 2, 6, 5, 6], [], [4, 2, 4, 1, 4, 9]], "category": "string", "name": "Char length list", "description": "Return a list of non-punctuation character lengths of a list of strings.", "id": "39"}
40 | {"prompts": ["Create a variable named 's' with the value '{s}'.", "Lowercase the variable 's' and store the result to variable 's2'.", "Import re and compile a regular expression that matches a sharp symbol followed by three hexadecimal digits (0-9, a-f), store the result to variable 'pattern3'.", "Compile a regular expression that matches a sharp symbol followed by six hexadecimal digits (0-9, a-f), store the result to variable 'pattern6'.", "Print True if the variable 's2' matches either of the variables 'pattern3' or 'pattern6', False otherwise."], "inputs": [{"s": "#FFF"}, {"s": "#egacea"}, {"s": "#12"}, {"s": "123456"}, {"s": "#ffb600"}], "outputs": [true, false, false, false, true], "category": "math", "name": "Hex to RGB", "description": "Check whether a string is a valid three- or six-digit hex color code.", "id": "40"}
41 | {"prompts": ["Create a function called 'count_values' that takes a list of integers and returns a hash map of the number of times each integer appears in the list.", "Apply the function 'count_values' to the list '{lst}' and store the result to variable 'counts'.", "Print the integer with maximum count in the hash map 'counts', if the count is larger than half of the length of the list, otherwise print 'None'."], "inputs": [{"lst": [1, 1, 2, 2, 2, 2]}, {"lst": []}, {"lst": [100, 100, 0]}, {"lst": [0, 0, 0, 0, 0, 1, 1, 1, 1]}, {"lst": [1, 2, 3, 4, 5, 6, 6, 6, 6, 6]}], "outputs": [2, null, 100, 0, null], "category": "array", "name": "Majority vote", "description": "Check if a certain element is the majority of a given list.", "id": "41"}
42 | {"prompts": ["Import datetime and initialize a datetime object named 'today' with {month}/{day}/{year} (month/day/year).", "Add 7 days to the variable 'today' and store the result to variable 'week'.", "Print 'week' in the format '%m/%d/%Y'."], "inputs": [{"year": 1990, "month": 1, "day": 28}, {"year": 2000, "month": 2, "day": 26}, {"year": 2022, "month": 12, "day": 28}, {"year": 1274, "month": 11, "day": 5}, {"year": 1600, "month": 7, "day": 30}], "outputs": ["02/04/1990", "03/04/2000", "01/04/2023", "11/12/1274", "08/06/1600"], "category": "string", "name": "Week later", "description": "Print the formatted date of a week later given a date.", "id": "42"}
43 | {"prompts": ["Create a function named 'word_weight' that takes a string as input and returns the sum of ASCII values of each letter in the string.", "Given a list of strings named 'words' with the value {words}, apply the function 'word_weight' to each word and store the result to variable 'weights'.", "Print 'True' if the sorted 'weights' is the same as the original 'weights', otherwise 'False'."], "inputs": [{"words": ["apple", "banana", "carrot"]}, {"words": ["I'll", "see", "trees."]}, {"words": ["a...", "b?", "c!", "d"]}, {"words": ["", "a", "A"]}, {"words": ["ABC", "ghijklmno", "def"]}], "outputs": [true, true, true, false, false], "category": "math", "name": "Sorted word weights", "description": "Calculate the sum of ASCII values of each word and check if the list is sorted.", "id": "43"}
44 | {"prompts": ["Create a function named 'is_palindrome' that takes an integer as input and returns whether the integer is a palindrome, by comparing the stringified integer and its reversed string.", "Create a function named 'descent' that takes an integer as input, adds each pair of adjacent digits together, and returns the result.", "Define an integer variable named 'base' with the value {n}.", "While the variable 'base' is not a single digit, apply the function 'is_palindrome' on 'base' and break if 'base' is a palindrome. Otherwise, apply the function 'descent' to the variable 'base' and store the result to variable 'base'.", "Print 'False' if the variable 'base' is a single digit, otherwise print 'True'."], "inputs": [{"n": 123456}, {"n": 1234}, {"n": 123212}, {"n": 11211230}, {"n": 1112212124000131}], "outputs": [false, false, true, true, true], "category": "string", "name": "Create Palindrome", "description": "Sum pairs of adjacent digits until the number is a palindrome.", "id": "44"}
45 | {"prompts": ["Define a string variable named 'input' with the value '{input}', as well as an empty list named 'stack'.", "Iterating over variable 'input', if the current character is '@' and 'stack' is not empty, pop the last element from 'stack', otherwise append the character to 'stack'.", "Print the joined string from 'stack'."], "inputs": [{"input": "he@@l@hel@llo"}, {"input": "@@@@"}, {"input": "si@@@t boy"}, {"input": "a@b@c@d@e@f@g@h@i@jkl"}, {"input": "hello @@world"}], "outputs": ["hello", "", "t boy", "jkl", "hello world"], "category": "string", "name": "Simulate Backspace", "description": "Apply the backspace characters in a string and print the modified string.", "id": "45"}
46 | {"prompts": ["Import the pandas library.", "Import the function train_test_split from sklearn.model_selection. Read the dataframe \"df\" from the csv file './datasets/melb_data.csv'.", "Assign the attribute \"Price\" to the target variable \"y\".", "Drop the column \"Price\" from the dataframe on axis 1 and assign the result to a variable named \"melb_predictors\".", "From \"melb_predictors\", exclude columns of dtype \"object\" and name the result \"X\".", "Divide data into training and validation subsets x_train, x_valid, y_train, y_valid with train set size of {a1}%, test set size of {a2}%, random_state=0.", "Print the sum of the first column of x_train and the sum of y_train. Use the format \"{{:.1f}} {{:.1f}}\"."], "inputs": [{"a1": 80, "a2": 20}, {"a1": 50, "a2": 50}, {"a1": 20, "a2": 80}, {"a1": 10, "a2": 90}, {"a1": 90, "a2": 10}], "outputs": ["31956.0 14607789799.0", "20086.0 14607789799.0", "7995.0 14607789799.0", "3948.0 14607789799.0", "35891.0 14607789799.0"], "max_gen_length": 128, "category": "data science", "name": "Pandas DF manipulation", "description": "Manipulate a pandas dataframe and split into train and test set.", "id": "46"}
47 | {"prompts": ["Create a variable named lst1 with value {l}", "Find the minimum and maximum of lst1 and assign them to variables a and b respectively", "Create a list from a to b (inclusive) and assign it to variable named lst2", "Find the elements that are in lst2 but not in lst1", "Print the sum of these elements"], "inputs": [{"l": [1, 3, 5, 7, 10]}, {"l": [10, 7, 5, 3, 1]}, {"l": [10, 20, 30, 40, 50, 60]}, {"l": [-100, 100]}, {"l": [-5, -10, 0, 10]}], "outputs": [29, 29, 1575, 0, 5], "max_gen_length": 128, "category": "array", "name": "Sum non-overlap range", "description": "Sum the integers in a (min, max) range that don't appear in a list .", "id": "47"}
48 | {"prompts": ["Initialize the variable named lst1 with a list {l}.", "Create a function called num_in_str() to check whether a string contains a number.", "Call the function num_in_str() to find strings in lst1 that have numbers and assign them to a list named lst2", "Print out lst2"], "inputs": [{"l": ["1a", "a", "2b", "b"]}, {"l": ["abc", "abc10"]}, {"l": ["abc", "ab10c", "a10bc", "bcd"]}, {"l": ["this is a test", "test1"]}, {"l": ["t0t", "11", "0"]}], "outputs": [["1a", "2b"], ["abc10"], ["ab10c", "a10bc"], ["test1"], ["t0t", "11", "0"]], "max_gen_length": 256, "category": "array", "name": "Detect digits", "description": "Find if a string contains digits.", "id": "48"}
49 | {"prompts": ["Define a function \"a\" that multiplies an integer argument by {a1} and returns the result.", "Define a function \"b\" that multiplies an integer argument by {a2} and returns the result.", "Define a function \"c\" that multiplies an integer argument by {a3} and returns the result.", "Create a list named \"abc\" which contains the three functions in order of definition.", "Assign the integer {a4} to a variable \"my_init\".", "Apply the first function of the list to \"my_init\" and name the result \"my_result\".", "For each subsequent function in the list, take the result of the previous function as input argument and assign the result to \"my_result\".", "Print the variable named \"my_result\"."], "inputs": [{"a1": "2", "a2": "2", "a3": "2", "a4": "1"}, {"a1": "1", "a2": "1", "a3": "2", "a4": "1"}, {"a1": "2", "a2": "2", "a3": "2", "a4": "2"}, {"a1": "-2", "a2": "2", "a3": "2", "a4": "1"}, {"a1": "-2", "a2": "-2", "a3": "2", "a4": "1"}], "outputs": [8, 2, 16, -8, 8], "max_gen_length": 128, "category": "math", "name": "Cascading functions", "description": "Sequentially invoke function objects in a list.", "id": "49"}
50 | {"prompts": ["Create a function \"to_plural\" that takes a list of words in the singular form and returns a set of those words in the plural form, adding an \"s\" to the end of the words if they appear more than once in the list. E.g., to_plural([\"cow\", \"pig\", \"cow\", \"cow\"]) = {{\"cows\", \"pig\"}}, to_plural([\"table\", \"table\", \"table\"]) = {{\"tables\"}}.", "Create a function \"is_plural\" which returns True if the word passed as argument is in plural form.", "Assign {a1} to a variable named \"words\".", "Apply the function that returns plural forms to the variable \"words\" and name the result \"words_plural\".", "Define a boolean \"contains_plural\" and apply \"is_plural\" to each element of \"words_plural\" to detect if at least one word is in plural form.", "Print out whether or not \"words_plural\" contains a word in plural as boolean."], "inputs": [{"a1": "[\"chair\", \"pencil\", \"arm\", \"arm\"]"}, {"a1": "[\"arm\", \"arm\", \"arm\", \"arm\"]"}, {"a1": "[\"chair\", \"arm\", \"pencil\", \"arm\"]"}, {"a1": "[\"chair\", \"pencil\", \"arm\"]"}, {"a1": "[\"chair\", \"pencil\", \"table\"]"}], "outputs": [true, true, true, false, false], "max_gen_length": 128, "category": "dict", "name": "Pluralize duplicates", "description": "Pluralize duplicated words in a list.", "id": "50"}
51 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_relative_altitude\".", "Compute all the prefix sums of the list (0 is the first element) and store them as my_net_altitude.", "Find the largest number in the list my_net_altitude and print it out."], "inputs": [{"A": "[1,2,3,4]"}, {"A": "[-1,2,3,4]"}, {"A": "[-1,-2,-3,-4]"}, {"A": "[-1000000, 1, 2]"}, {"A": "[-5, 1, 5, 0, -10]"}], "outputs": [10, 8, 0, 0, 1], "max_gen_length": 128, "category": "array", "name": "Highest altitude", "description": "Given relative altitudes, find the highest altitude.", "id": "51"}
52 | {"prompts": ["Assign the list of words \"{A}\" to a variable named \"my_sentences\".", "Assign an integer \"{K}\" to a variable named \"k\".", "Truncate the list such that it contains k words and store it as truncated_list.", "Print out the variable truncated_list."], "inputs": [{"A": ["hello", "world"], "K": 1}, {"A": ["hello", "how", "are", "you", "Jim"], "K": 4}, {"A": ["China", "is", "a", "large", "country"], "K": 1}, {"A": ["yes", "yes", "yes", "yes", "yes"], "K": 4}, {"A": ["what", "is", "your", "name"], "K": 1}], "outputs": [["hello"], ["hello", "how", "are", "you"], ["China"], ["yes", "yes", "yes", "yes"], ["what"]], "max_gen_length": 128, "category": "array", "name": "Truncate words", "description": "Truncate a sentence so that it contains k words.", "id": "52"}
53 | {"prompts": ["Assign the list of integers \"{A}\" to a variable named \"my_numbers\".", "Count the frequencies of the integers in my_numbers.", "Find the integer whose frequency is 1 and store it as one_time.", "Print out the variable one_time."], "inputs": [{"A": [1, 2, 2, 2]}, {"A": [-1, 4, 4, 4, 4, 4]}, {"A": [-1, -4, 8, -4, 8]}, {"A": [-1000000, 1, 1]}, {"A": "[10000, 2, 2, 2,2,2]"}], "outputs": [1, -1, -1, -1000000, 10000], "max_gen_length": 128, "category": "array", "name": "Single element", "description": "Find the element that appears one time in an array.", "id": "53"}
54 | {"prompts": ["Assign the list of integers \"{A}\" to a variable named \"my_numbers\".", "Assign an integer \"{Val}\" to a variable named \"val\".", "Remove all occurrences of val in my_numbers and store the resulting list as remove_numbers.", "Print out the variable remove_numbers."], "inputs": [{"A": [1, 2, 2, 2], "Val": 2}, {"A": [-1, 4, 4, 4, 4, 4], "Val": 4}, {"A": [-1, -4, 8, -4, 8], "Val": -1}, {"A": [-1000000, 1, 1], "Val": 1}, {"A": "[10000, 2, 2, 2,2,2]", "Val": 2}], "outputs": [[1], [-1], [-4, 8, -4, 8], [-1000000], [10000]], "max_gen_length": 128, "category": "array", "name": "Remove elements", "description": "Remove all the occurrences of an element in an array.", "id": "54"}
55 | {"prompts": ["Assign the list of integers \"{A}\" to a variable named \"my_numbers\".", "Assign an integer \"{Val}\" to a variable named \"val\".", "Sum all the numbers in my_numbers and store as sum_numbers.", "Check whether the sum_numbers is equal to val. If yes, return \"True\", otherwise return \"False\"."], "inputs": [{"A": [1, 2, 2, 2], "Val": 2}, {"A": [-1, 5], "Val": 4}, {"A": [-1, -1, -1, -1, 1], "Val": -5}, {"A": [-1000000, 1, 1], "Val": 1}, {"A": "[10000, 2, 2, 2,2,2]", "Val": 2}], "outputs": ["False", "True", "False", "False", "False"], "max_gen_length": 128, "category": "array", "name": "Check array sum", "description": "Check whether the sum of an array is equal to a given value.", "id": "55"}
56 | {"prompts": ["Assign a sorted list \"{A}\" to a variable named \"my_numbers1\".", "Assign a sorted list \"{B}\" to a variable named \"my_numbers2\".", "Merge the two sorted lists into a new sorted list and store as new_list.", "Print the sorted new_list."], "inputs": [{"A": [1, 2, 2, 2], "B": [3, 4]}, {"A": [-1, 5], "B": [1, 2]}, {"A": [-1, -1, -1, -1, 1], "B": [-1, 8]}, {"A": [-1000000, 1, 1], "B": [1, 6]}, {"A": "[2, 2,2,2,2, 10000]", "B": [-2, -1]}], "outputs": [[1, 2, 2, 2, 3, 4], [-1, 1, 2, 5], [-1, -1, -1, -1, -1, 1, 8], [-1000000, 1, 1, 1, 6], [-2, -1, 2, 2, 2, 2, 2, 10000]], "max_gen_length": 128, "category": "algorithm", "name": "Merge sorted lists", "description": "Merge two sorted lists into one.", "id": "56"}
57 | {"prompts": ["Assign an integer array \"{A}\" to a variable named \"my_array\".", "Find the contiguous subarray of my_array with the largest sum and store as max_subarray.", "Compute the sum of max_subarray and store as sum_subarray.", "Print out the variable sum_subarray."], "inputs": [{"A": [1]}, {"A": [-1, 5]}, {"A": [-1, -1, -1, -1, 1]}, {"A": [-1000000, 1, 1]}, {"A": "[2, 2,2,2,2, 10000]"}], "outputs": [1, 5, 1, 2, 10010], "max_gen_length": 128, "category": "algorithm", "name": "Maximum subarray", "description": "Find the max contiguous subarray and return the sum.", "id": "57"}
58 | {"prompts": ["Assign the positive number \"{A}\" to a variable named \"my_number\".", "Compute the square root of the number and store as square_root.", "Compute the largest integer not larger than square_root and store as largest_square_root.", "Print the integer largest_square_root."], "inputs": [{"A": 1}, {"A": 5}, {"A": 101}, {"A": 30}, {"A": 10000}], "outputs": [1, 2, 10, 5, 100], "max_gen_length": 128, "category": "algorithm", "name": "Max square root integer", "description": "Compute the largest integer not larger than the square root of a positive number.", "id": "58"}
59 | {"prompts": ["Assign the list of words \"{A}\" to a variable named \"my_words\".", "Count the length of each word in the list and store the result as a dictionary word_count.", "Find the element with the largest length in the dictionary word_count and store it as longest_word.", "Print the variable longest_word."], "inputs": [{"A": ["Hello", "word"]}, {"A": ["a", "good", "place"]}, {"A": ["the", "last", "word", "in", "the", "sentence"]}, {"A": ["good"]}, {"A": ["There", "will", "be", "a", "joy"]}], "outputs": ["Hello", "place", "sentence", "good", "There"], "max_gen_length": 128, "category": "algorithm", "name": "Longest word", "description": "Find the longest word in a word list.", "id": "59"}
60 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Return the elements that appear exactly once in the above list and store them as my_uniques.", "Compute the sum of the list my_uniques and print it out."], "inputs": [{"A": "[1,2,3]"}, {"A": "[1,1,1,1]"}, {"A": "[-1,-2,-3,-3]"}, {"A": "[-1000000, 1, 1, 2,2,3,3,3,3]"}, {"A": "[-5, 1, -5, 0, -10]"}], "outputs": [6, 0, -3, -1000000, -9], "max_gen_length": 128, "category": "algorithm", "name": "Sum unique elements", "description": "Sum all the unique numbers in a list.", "id": "60"}
61 | {"prompts": ["Assign the matrix \"{A}\" to a variable named \"my_matrix\".", "Find the diagonal elements of my_matrix and store them as diag_elements.", "Print out the sum of the variable diag_elements."], "inputs": [{"A": [[3, 2], [2, 3]]}, {"A": [[3, 2, 5], [2, 3, 5]]}, {"A": [1]}, {"A": [[30000, 30000, 1], [30000, 30000, 1], [30000, 30000, 1]]}, {"A": [5, 5, 5, 5, 5, 0]}], "outputs": [6, 6, 1, 60001, 5], "max_gen_length": 128, "category": "data science", "name": "Diagonal sum", "description": "Compute the diagonal sum of a matrix.", "id": "61"}
62 | {"prompts": ["Assign the matrix \"{A}\" to a variable named \"my_matrix\".", "Assign the number \"{T}\" to a variable named \"t\".", "Compute the condition number of my_matrix and store as result.", "Check whether the result is smaller than t. If yes, return \"True\", otherwise return \"False\"."], "inputs": [{"A": [[3, 2], [2, 3]], "T": 1}, {"A": [[3, 2, 5], [2, 3, 5]], "T": -1}, {"A": [[1, 5]], "T": 2}, {"A": [[30000, 30000, 1], [30000, 30000, 1], [30000, 30000, 1]], "T": 100}, {"A": [[5, 5, 5, 5, 5, 0]], "T": 0.5}], "outputs": ["False", "False", "True", "False", "False"], "max_gen_length": 128, "category": "data science", "name": "Matrix condition number", "description": "Check whether the condition number of a matrix is less than a threshold.", "id": "62"}
63 | {"prompts": ["Assign the matrix \"{A}\" to a variable named \"a\".", "Assign the matrix \"{B}\" to a variable named \"b\".", "Compute the multiplication of two matrices and store as result.", "Compute the sum of the result and print it out."], "inputs": [{"A": [[3, 2], [2, 3]], "B": [[3, 2], [2, 3]]}, {"A": [[3, 2, 5], [2, 3, 5]], "B": [[1, 0], [0, 1], [2, -2]]}, {"A": [[1, 5, 67, -1]], "B": [[-1], [0], [0], [-1]]}, {"A": [[30000, 30000, 1], [30000, 30000, 1], [30000, 30000, 1]], "B": [[1, 0, 6], [0, 1, 5], [0, 1, 4]]}, {"A": [[5, 5, 5, 5, 5, 0]], "B": [[-1], [-1], [-1], [-1], [-1], [1000]]}], "outputs": [50, 10, 0, 1170015, -25], "max_gen_length": 128, "category": "data science", "name": "Matrix multiplication sum", "description": "Compute matrix multiplication sum of two matrices.", "id": "63"}
64 | {"prompts": ["Assign the matrix \"{A}\" to a variable named \"a\".", "Assign the matrix \"{B}\" to a variable named \"b\".", "Implement a function that computes the determinant of a matrix.", "Check whether the determinant of matrix a is larger than that of matrix b. If yes, print \"True\", otherwise print \"False\"."], "inputs": [{"A": [[3, 2], [2, 3]], "B": [[3, 2], [2, 2]]}, {"A": [[3, 2, 5], [2, 3, 5], [3, 5, 6]], "B": [[3, 2], [2, -3]]}, {"A": [[1, 5, 67, -1], [2, 3, 6, 7], [2, 3, 6, 7], [2, 3, 6, 7]], "B": [[0, 0], [1, 4]]}, {"A": [[30000, 30000, 1], [30000, 30000, 1], [30000, 30000, 1]], "B": [[30000, 30000, 30000], [30000, 1, 1], [30000, 30000, 1]]}, {"A": [[1, 0, 6], [0, 1, 5], [0, 1, 4]], "B": [[1, 0], [0, 1]]}], "outputs": ["True", "True", "False", "False", "False"], "max_gen_length": 128, "category": "data science", "name": "Matrix determinant", "description": "Compare two matrix determinants.", "id": "64"}
65 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Implement a function that computes the exponential output of a list.", "Implement a function that computes the summation of a list.", "Implement a function that computes the log of a number.", "Print out the log of the sum of exponentials of my_numbers."], "inputs": [{"A": [1, 3, 2, 2]}, {"A": [1000, 1000, 1000]}, {"A": [0, 0.2, 0.4, -0.2]}, {"A": [1, 0, 0, 1, 3, 2, 0, 0.2]}, {"A": [0, 3, 1, 3, 2, 2, -0.2, 0.2]}], "outputs": [3.6265233750364456, 1001.0986122886682, 1.5111541217815447, 3.6144941975988285, 4.106068918955366], "max_gen_length": 128, "category": "data science", "name": "Log-sum-exp", "description": "Compute the log of the sum of exponentials of the input.", "id": "65"}
66 | {"prompts": ["Assign the list of points \"{A}\" to a variable named \"my_points\".", "Assign the integer \"{K}\" to a variable named \"k\".", "Implement a function that computes the distance between a point and the origin (0,0).", "Implement a function that computes the k closest points in an array to the origin and store as result.", "Compute the k closest points in my_points and print them out."], "inputs": [{"A": [[1, 3], [2, 2]], "K": 1}, {"A": [[0, 0], [1, 4], [-4, 6], [7, -1]], "K": 1}, {"A": [[0, 0], [1, 4], [-4, 6], [7, -1]], "K": 2}, {"A": [[1, 0], [0, 1], [3, 2], [0, 0.2], [0.4, -0.2]], "K": 2}, {"A": [[0, 3], [1, 3], [2, 2], [-0.2, 0.2], [0.5, 0.5], [1, -0.5], [2, -0.5], [2, 1]], "K": 1}], "outputs": [[2, 2], [0, 0], [[0, 0], [1, 4]], [[0, 0.2], [0.4, -0.2]], [-0.2, 0.2]], "max_gen_length": 128, "category": "array", "name": "K nearest points", "description": "Find the k nearest points to the origin.", "id": "66"}
67 | {"prompts": ["Implement a function called LCP() to find the longest common prefix of two strings", "Initialize a variable named lst1 with a list {l1}.", "Apply the function LCP() recursively to lst1", "Print the longest common prefix of the strings in lst1"], "inputs": [{"l1": ["apple", "ape", "april"]}, {"l1": ["crazy", "car"]}, {"l1": ["small", "smart", "smile"]}, {"l1": ["inbox", "income", "input", "insight"]}, {"l1": ["come", "combine", "continue", "compute"]}], "outputs": ["ap", "c", "sm", "in", "co"], "max_gen_length": 256, "category": "algorithm", "name": "Longest common prefix", "description": "Find the longest common prefix of two strings.", "id": "67"}
68 | {"prompts": ["Assign a list {lst1} to a variable named lst1", "Create a frequency table of elements in lst1", "Find the elements with frequency larger than 1 and assign them to a list lst2", "Print out lst2"], "inputs": [{"lst1": [2, 3, 1, 2, 3]}, {"lst1": ["a", "c", "b", "a"]}, {"lst1": [3, 3, 1, 1]}, {"lst1": ["d", "c", "d", "c", "e", "a"]}, {"lst1": [1, 2, 3]}], "outputs": [[2, 3], ["a"], [3, 1], ["d", "c"], []], "max_gen_length": 256, "category": "array", "name": "Duplicate elements", "description": "Find duplicates in a list.", "id": "68"}
69 | {"prompts": ["Initialize a variable named w1 with a string '{w}'", "Get the first non-repeating character in w1", "Find its corresponding index and assign it to n1", "Print out n1"], "inputs": [{"w": "popular"}, {"w": "crunchy"}, {"w": "barbados"}, {"w": "alphabet"}, {"w": "science"}], "outputs": [1, 1, 2, 1, 0], "max_gen_length": 256, "category": "algorithm", "name": "First unique character", "description": "Find the first non-repeating character in a string.", "id": "69"}
70 | {"prompts": ["Assign a sentence '{s1}' to a variable named sentence1.", "Assign a sentence '{s2}' to a variable named sentence2.", "Split sentence1 into words and assign them to words1.", "Split sentence2 into words and assign them to words2.", "Find the words that appear once in both words1 and words2 and assign them to uncommon_words.", "Print uncommon_words."], "inputs": [{"s1": "Geeks for Geeks", "s2": "Learning from Geeks for Geeks"}, {"s1": "apple banana mango", "s2": "banana fruits mango"}, {"s1": "Seaborg spent most of his career as an educator and research scientist at the University of California, Berkeley.", "s2": "Seaborg spent most of his career as an educator and research scientist at the University of California, Los Angeles."}, {"s1": "Seaborg was the principal or co-discoverer of ten elements.", "s2": "Seaborg was the principal or co-discoverer of ten elements."}, {"s1": "Heavy rainfall began in earnest around 8 April.", "s2": "rainfall began in earnest around 8 April."}], "outputs": [["Learning", "from"], ["apple", "fruits"], ["Berkeley", "Los", "Angeles"], [], ["Heavy"]], "max_gen_length": 256, "category": "algorithm", "name": "Uncommon words", "description": "Find uncommon words in two sentences.", "id": "70"}
71 | {"prompts": ["Assign a sentence '{s1}' to a variable named sentence1.", "Split sentence1 into words and assign them to words1.", "Remove punctuation in words1.", "Compute the average word length in words1 and assign it avg.", "Print avg."], "inputs": [{"s1": "Hi all, my name is Tom...I am originally from Australia."}, {"s1": "I need to work very hard to learn more about algorithms in Python!"}, {"s1": "It received critical acclaim and continues to be praised by commentators."}, {"s1": "The Minute Man was intended to be placed on a local boulder by the town of Concord."}, {"s1": "During the height of the Cold War, teams from the Soviet Union and the United States independently created rutherfordium and dubnium."}], "outputs": [4.5, 4.076923076923077, 5.636363636363637, 3.8823529411764706, 5.285714285714286], "max_gen_length": 256, "category": "algorithm", "name": "Average words length", "description": "Compute the average word length of a sentence.", "id": "71"}
72 | {"prompts": ["Assign strings {w1} and {w2} to variables w1 and w2 respectively", "Lower-case w1 and w2", "Count the frequency of letters in w1 and w2 and assign them to f1 and f2", "Print if f1 is equal to f2"], "inputs": [{"w1": "find", "w2": "ding"}, {"w1": "rat", "w2": "car"}, {"w1": "open", "w2": "book"}, {"w1": "fried", "w2": "fired"}, {"w1": "listen", "w2": "silent"}], "outputs": [false, false, false, true, true], "max_gen_length": 256, "category": "string", "name": "Compare char freq", "description": "Compare the character frequencies in two strings.", "id": "72"}
73 | {"prompts": ["Assign a string {w} to a variable named w1", "Concatenate the elements in w1 from end to beginning and assign it to w2", "Print w2"], "inputs": [{"w": "abc"}, {"w": "ape"}, {"w": "geeksforgeeks"}, {"w": "apple"}, {"w": "april"}], "outputs": ["cba", "epa", "skeegrofskeeg", "elppa", "lirpa"], "max_gen_length": 256, "category": "string", "name": "Reverse string", "description": "Reverse a string.", "id": "73"}
74 | {"prompts": ["Assign a natural number {n} to a variable named num", "Create a list from 1 to num and assign it to a variable lst1", "Compute the sum of the squares of the numbers in lst1 and assign it to n1", "Compute the sum of the numbers in lst1 and assign its square to n2", "Print out the difference between n1 and n2"], "inputs": [{"n": 12}, {"n": 2}, {"n": 10}, {"n": 5}, {"n": 100}], "outputs": [-5434, -4, -2640, -170, -25164150], "max_gen_length": 256, "category": "math", "name": "Square Sum diff", "description": "Calculate the difference between the sum of squares and the squared sum.", "id": "74"}
75 | {"prompts": ["Assign a list {lst1} to a variable named vec1", "Assign a list {lst2} to a variable named vec2", "Normalize vec1", "Normalize vec2", "Compute the dot product of vec1 and vec2", "Print out the dot product"], "inputs": [{"lst1": [0.3, 1.0, 2.0], "lst2": [1.0, 2.0, 3.0]}, {"lst1": [10.0, 20.0, 30.0], "lst2": [0.1, 0.2, 0.3]}, {"lst1": [1.1, 2.1, 3.1], "lst2": [10.1, 20.2, 30.3]}, {"lst1": [1.0, 2.0], "lst2": [0.1, 0.2]}, {"lst1": [5.3, 1.1, 2.6, 1.2, 10.2], "lst2": [1.3, 2.5, 3.7, 4.8, 5.9]}], "outputs": [0.9832301408945487, 0.9999999999999999, 0.9998592903536574, 0.9999999999999999, 0.8032876127853769], "max_gen_length": 256, "category": "math", "name": "Cosine sim", "description": "Compute the cosine similarity between two vectors.", "id": "75"}
76 | {"prompts": ["Assign a list {lst1} to a variable named vec1", "Assign a list {lst2} to a variable named vec2", "Assign a list {lst3} to a variable named vec3", "Convert vec1, vec2, and vec3 to numpy arrays", "Implement a function called dist() to compute the distance between two vectors", "Compute the distance between vec1 and vec2 and assign it to d1", "Compute the distance between vec1 and vec3 and assign it to d2", "Print out whether d1 is larger than d2"], "inputs": [{"lst1": [0.0, 0.0, 0.0], "lst2": [1.0, 2.0, 3.0], "lst3": [0.1, 0.2, 0.3]}, {"lst1": [0.0, 0.0, 0.0], "lst2": [10.0, 20.0, 30.0], "lst3": [0.1, 0.2, 0.3]}, {"lst1": [0.0, 0.0, 0.0], "lst2": [1.1, 2.1, 3.1], "lst3": [10.1, 20.2, 30.3]}, {"lst1": [0.0, 0.0, 0.0, 0.0], "lst2": [-1.0, -2.0, -3.0, -10.0], "lst3": [0.1, 0.2, 0.3, 0.2]}, {"lst1": [0.0, 0.0], "lst2": [1.0, 2.0], "lst3": [0.1, 0.2]}], "outputs": [true, true, false, true, true], "max_gen_length": 256, "category": "math", "name": "Vector distance", "description": "Compare vector distances to the origin.", "id": "76"}
77 | {"prompts": ["Initialize a variable named lst1 with a list {l1}.", "Initialize a variable named lst2 with a list {l2}.", "Create a function called std() to compute the standard deviation given a list of numbers.", "Call the function std() to calculate standard deviations for lst1 and lst2.", "Print out the smaller standard deviation."], "inputs": [{"l1": [1, 1, 1, 1, 1], "l2": [1, 2, 3, 4, 5]}, {"l1": [-1, -1, 1, 1], "l2": [100, 1, -100]}, {"l1": [-100, -10, 5, 5, -10], "l2": [100, 50, 20, -100]}, {"l1": [20, 1, 50, 6], "l2": [-100]}, {"l1": [5, 6, 9, 100], "l2": [-100, -100, -100, -100, -100]}], "outputs": [0.0, 1.0, 39.57, 0.0, 0.0], "max_gen_length": 256, "category": "data science", "name": "Compare standard deviations", "description": "Find the smaller standard deviation given two lists.", "id": "77"}
78 | {"prompts": ["Initialize a variable named lst1 with a list {l1}.", "Initialize a variable named lst2 with a list {l2}.", "Create a function called mean() to compute the mean given a list of numbers.", "Call the function mean() to calculate means for lst1 and lst2.", "Print out the smaller mean."], "inputs": [{"l1": [1, 1, 1, 1, 1], "l2": [1, 2, 3, 4, 5]}, {"l1": [-1, -1, 1, 1], "l2": [100, 1, -100]}, {"l1": [-100, -10, 5, 5, -10], "l2": [100, 50, 20, -100]}, {"l1": [20, 1, 50, 6], "l2": [-100]}, {"l1": [5, 6, 9, 100], "l2": [-100, -100, -100, -100, -100]}], "outputs": [1.0, 0.0, -22.0, -100.0, -100.0], "max_gen_length": 256, "category": "data science", "name": "Compare means", "description": "Find the smaller mean given two lists.", "id": "78"}
79 | {"prompts": ["Initialize a variable named lst1 with a list {l1}.", "Compute the mean and the standard deviation for lst1 and assign it variable avg and sd, respectively", "Compute the coeffeicient of variation", "Print out the coefficient of variation"], "inputs": [{"l1": [1, 1, 1, 1, 1]}, {"l1": [-100, -10, 5, 5, -10]}, {"l1": [-1, 1, -10, 10, 2, 3, 5]}, {"l1": [-5, 7, -3, -4, 9, 10, -1, 11]}, {"l1": [20, 1, 50, 6]}], "outputs": [0.0, -1.7987599034008526, 3.9749213828703582, 2.140872096444188, 0.9906801321840804], "max_gen_length": 256, "category": "data science", "name": "Coefficient of variation", "description": "Compute coefficient of variation given a list.", "id": "79"}
80 | {"prompts": ["Initialize a variable named lst1 with a list {l1}.", "Get the absolute value of every element in lst1 and assign to a lst2", "Compute the sum of lst2 and assign to l1", "Print out l1"], "inputs": [{"l1": [0, 0]}, {"l1": [1, 1]}, {"l1": [-1, 1, -100, 100]}, {"l1": [0, 0, 59, 1, 40]}, {"l1": [-50, -10, 40, 200, 1000]}], "outputs": [0, 2, 202, 100, 1300], "max_gen_length": 256, "category": "data science", "name": "L1 norm", "description": "Compute the L1 norm given a list.", "id": "80"}
81 | {"prompts": ["Assigns a list {lst1} to a variable named lst1", "Compute the sample mean of lst1", "Compute the sample standard deviation of lst1", "Compute the z-statistic to test whether its mean is 0", "Print out the z-statistic"], "inputs": [{"lst1": [0.3, 1.0, 2.0, -2.0, 4.0, -5.0]}, {"lst1": [1.3, 5.0, 2.1, -2.4, 4.1, 5.1]}, {"lst1": [1.3, 15.0, 2.9]}, {"lst1": [0.3, -1.0, -2.0, 5.0, 1.0, 5.1]}, {"lst1": [10.3, 12.0, 20.0, 21.0, 40.0, 5.0, 10.0, 20.0, 23.0, 15.0]}], "outputs": [0.017307532290566904, 0.9670745372626464, 1.046418644730305, 0.5092873663524808, 1.8989720877738328], "max_gen_length": 256, "category": "data science", "name": "Z-statistic", "description": "Compute z-statistic given a list.", "id": "81"}
82 | {"prompts": ["Assign a list {lst} to named lst1", "Separate lst1 into two lists, lst_pos and lst_neg which contain all the positive numbers and all the negative numbers repsectively", "Concatenate lst_pos and lst_neg and assign it lst2", "Print out lst2"], "inputs": [{"lst": [3, -3, 2, -2]}, {"lst": [-5, 7, -3, -4, 9, 10, -1, 11]}, {"lst": [-1000, 11]}, {"lst": [9, -10, 8, 2, -77, -50, 11, 6]}, {"lst": [-50, -70, -30, 4, 3, -100, 1]}], "outputs": [[3, 2, -3, -2], [7, 9, 10, 11, -5, -3, -4, -1], [11, -1000], [9, 8, 2, 11, 6, -10, -77, -50], [4, 3, 1, -50, -70, -30, -100]], "max_gen_length": 256, "category": "array", "name": "Move all negative elements to end", "description": "Move all negative elements in a list to the end.", "id": "82"}
83 | {"prompts": ["Initialize a variable named w with a string {w}", "Lower every character in w", "Replace every alphabetical characters in w with ''", "Print out the new word after substitution"], "inputs": [{"w": "2a4B"}, {"w": "br2ace"}, {"w": "100"}, {"w": "3g4lc"}, {"w": "12Apple0"}], "outputs": ["24", "2", "100", "34", "120"], "max_gen_length": 256, "category": "string", "name": "Remove alphabetical characters", "description": "Remove alphabetical characters in a string.", "id": "83"}
84 | {"prompts": ["Import and initialize a numpy array \"X\" with the values {X}.", "Calculate the dot product between all rows and store the result to \"Xn\", where (i, j) element stores the dot product between i-th and j-th row of \"X\".", "Set the diagonal elements of \"Xn\" to 0.", "Print out the maximum value (cast as a float) in \"Xn\"."], "inputs": [{"X": [[0.884, 0.209], [0.067, 0.381], [0.503, 0.821], [0.306, 0.592], [0.417, 0.519]]}, {"X": [[2, 2], [1, 0], [0, 4], [2, 4], [1, 1], [0, 3], [1, 0], [1, 0], [1, 3], [0, 1]]}, {"X": [[1, 0, 3], [4, 3, 4], [4, 1, 2], [0, 1, 0], [3, 3, 2]]}, {"X": [[1.022, -0.668], [-1.082, 0.063], [-0.181, 0.841], [0.891, 1.533], [1.195, -1.69]]}, {"X": [[-8, 2, -3], [2, -10, -5], [-5, 5, -8], [-3, 2, -2], [3, 6, 2]]}], "outputs": [0.6399499999999999, 16, 29, 2.35021, 74], "max_gen_length": 128, "category": "data science", "name": "Largest norm", "description": "Find the largest norm among n-dimensional points.", "id": "84"}
85 | {"prompts": ["Initialize numpy arrays \"pred\" with the values {pred}, \"y\" with the values {y}.", "Compare the equivalence of two arrays and store the results as \"match\".", "Assign the boolean array for whether \"y\" is greater than 0 to a variable \"non_zero\".", "Perform the logical \"AND\" operation between \"match\" and \"non_zero\", store the result as \"correct\".", "Compute the precision by dividing the number of True values in \"correct\" by that in \"pred\", and store as \"prec\".", "Compute the recall by dividing the number of True values in \"correct\" by the number of actual non-zero values in \"y\", and store the result as \"rec\".", "Calculate the harmonic mean between \"prec\" and \"rec\" and print out the value."], "inputs": [{"pred": [1, 1, 1, 1, 1, 0, 1, 0, 0, 0], "y": [0, 1, 1, 0, 1, 0, 0, 0, 0, 1]}, {"pred": [0, 1, 1, 1, 1, 0, 1, 0, 0, 0], "y": [0, 1, 1, 0, 1, 0, 0, 0, 0, 0]}, {"pred": [0, 1, 0, 0, 0], "y": [0, 1, 0, 0, 0]}, {"pred": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], "y": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}, {"pred": [0, 1, 0, 0, 0], "y": [0, 1, 1, 1, 1]}], "outputs": [0.6, 0.74999999999, 1.0, 0.1818181818182, 0.4], "max_gen_length": 128, "category": "data science", "name": "F1 score", "description": "Given two arrays (pred, gold), calculate the F1 score.", "id": "85"}
86 | {"prompts": ["Initialize a string named \"concat\" with {x}.", "Import the regex module and define a pattern \"pat\" that matches capital alphabets that can be referenced as a group.", "Find all the matches in \"concat\" with \"pat\", and insert an additional whitespace before the matched character with, then store the result to \"result\".", "Print out \"result\"."], "inputs": [{"x": "ACapitalLetterWords"}, {"x": "camelCaseMethod"}, {"x": "ABCDE"}, {"x": "splitDB"}, {"x": "donotsplitanything"}], "outputs": [" A Capital Letter Words", "camel Case Method", " A B C D E", "split D B", "donotsplitanything"], "max_gen_length": 128, "category": "string", "name": "Add Space", "description": "Add spaces before capital letters.", "id": "86"}
87 | {"prompts": ["Initialize a list \"x\" with the values {x}.", "Assuming the normal distribution, calculate mean and standard deviation of \"x\" using numpy, store the results to \"mean\" and \"std\".", "Find the values in x that are either smaller than mean - 2 * std or larger than mean + 2 * std, and store the results to \"results\".", "Sort \"results\" in ascending order and print it out."], "inputs": [{"x": [0, 0, 0, 0, 100]}, {"x": [-100, 0, 1, 2, 3, 4, -1, -2, -10, 45, 120]}, {"x": [3, -1, 0, 3, -3, 5, -2, 0, 0, -3, 1, -4, 4, -7, -1, -1, 1, -1, -2, -3]}, {"x": [0, 0, 4, 0, 3, 0, 0, -7, -2, 1, 1, -1, -7, -3, 1, 2, 0, -1, 4, 4]}, {"x": [4, -2, -2, -2, 0, 2, 3, -3, -3, 4, 1, 0, 2, 1, 4, -2, 2, -5, -4, 3, 0, 0, -2, -1, -1, 0, -2, 1, 1, -3]}], "outputs": [[100], [-100, 120], [-7], [-7, -7], [-5]], "max_gen_length": 128, "category": "data science", "name": "Remove outlier", "description": "Remove data points in the tail (2sigma) of normal distribution.", "id": "87"}
88 | {"prompts": ["Initialize a list \"x\" with the values {x}", "Obtain a list of unique elements in x and sort them, store the results to \"vocab\".", "Create a hash map from the values of \"vocab\" to their indices and store the result to \"v2i\".", "Initialize a numpy array of zeros named \"features\" whose row size is the length of x and column size is the length of \"index\", with a data type of int.", "For each element in x, assign 1 to (i, j) location of features, where i is the index of current element and j is the mapped value of the current element using \"v2i\".", "Print out \"features\"."], "inputs": [{"x": [4, 2, 3, 1, 0, 3, 3, 3, 2, 1]}, {"x": [0, 1, 2]}, {"x": [1, 1, 1, 1, 1]}, {"x": [0, 0, 0, 0, 0]}, {"x": [0, 0, 1, 1]}], "outputs": [[[0, 0, 0, 0, 1], [0, 0, 1, 0, 0], [0, 0, 0, 1, 0], [0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [0, 1, 0, 0, 0]], [[1, 0, 0], [0, 1, 0], [0, 0, 1]], [[1], [1], [1], [1], [1]], [[1], [1], [1], [1], [1]], [[1, 0], [1, 0], [0, 1], [0, 1]]], "max_gen_length": 128, "category": "data science", "name": "Convert to categorical", "description": "Convert values into categorical variables.", "id": "88"}
89 | {"prompts": ["Initialize a variable \"x\" with {x}.", "Apply the function \"{fun}\" to each element in the list and store the results to \"mapped\".", "Convert each element in \"mapped\" into string.", "Define \"results\" with a dictionary whose keys are the unique values in \"mapped\" and values are empty lists.", "Looping over the zip of \"mapped\" and \"x\", append the value in \"x\" to the value of \"results\" using the value in \"mapped\" as the key.", "Print out the \"results\"."], "inputs": [{"fun": "len", "x": ["a", "b", "c"]}, {"fun": "len", "x": ["apple", "banana", "orange", "peach"]}, {"fun": "type", "x": [1, 2, 3, "a", "b", "c"]}, {"fun": "len", "x": [[1, 2, 3], "a", "b", "c"]}, {"fun": "str", "x": [1, 2, 3, "1", "2", "3"]}], "outputs": [{"1": ["a", "b", "c"]}, {"5": ["apple", "peach"], "6": ["banana", "orange"]}, {"int": [1, 2, 3], "str": ["a", "b", "c"]}, {"1": ["a", "b", "c"], "3": [[1, 2, 3]]}, {"2": [2, "2"], "3": [3, "3"], "1": [1, "1"]}], "max_gen_length": 128, "category": "array", "name": "Group by key", "description": "Group items in an array using a provided function.", "id": "89"}
90 | {"prompts": ["Initialize a variable \"best\" with -1, \"array\" with {array}", "Assign the first element of \"array\" to a variable named \"minimum\".", "In a for loop over \"array\" starting from the second element, do 1) update \"best\" when the element minus \"minimum\" is larger than \"best\", and 2) update \"minimum\" with the value of element if it is smaller than \"minimum\".", "Print out \"best\"."], "inputs": [{"array": [1, 2, 3, 4, 5]}, {"array": [5, 2, 3, 4, 0]}, {"array": [12, 7, 8, 5, 9, 5, 14, 9, 8, 9]}, {"array": [1, 10, 1, 10, 0]}, {"array": [1, 2, 3, 2, 1]}], "outputs": [4, 2, 9, 9, 2], "max_gen_length": 128, "category": "array", "name": "Max stock profit", "description": "Given an array of \"prices\", find the max profit.", "id": "90"}
91 | {"prompts": ["Initialize a variable \"target\" with {target}, a variable \"nums\" with {nums}, and \"result\" with an empty list.", "Enumerating over \"nums\", compare each element with \"target\" and add its index position to \" result\" if they are equivalent.", "Print out the sum of elements in \"result\"."], "inputs": [{"target": 1, "nums": [1, 2, 1, 2, 1]}, {"target": 1, "nums": [0, 0, 0]}, {"target": 1, "nums": [1.1, 2, 3, 2, 1]}, {"target": "1", "nums": [1, 2, 3, 2, 1]}, {"target": "1", "nums": [1, "1", 2, "1"]}], "outputs": [6, 0, 4, 0, 4], "max_gen_length": 128, "category": "array", "name": "Sum positions", "description": "Sum of all position indices where a value appear.", "id": "91"}
92 | {"prompts": ["Initialize a variable \"nums\" with {nums} and a variable \"N\" with {N}.", "Initialize a variable \"all_nums\" which is a set of numbers between 1 and N.", "Subtract the set of numbers in \"nums\" from \"all_nums\", and store the result to \"diff\"", "Pop the only element in \"diff\" print it out."], "inputs": [{"nums": [1, 3, 4], "N": 4}, {"nums": [1, 2, 3, 4], "N": 5}, {"nums": [4, 3, 9, 7, 8, 5, 2, 1, 10], "N": 10}, {"nums": [6, 15, 13, 2, 14, 17, 7, 16, 11, 9, 3, 10, 8, 5, 12, 1, 20, 4, 19], "N": 20}, {"nums": [], "N": 1}], "outputs": [2, 5, 6, 18, 1], "max_gen_length": 128, "category": "array", "name": "Find missing num", "description": "Find a missing number given a list and a max number.", "id": "92"}
93 | {"prompts": ["Assign {x} to a variable named \"X\".", "Initialize a variable named \"common\" with a set of unique elements in the first index of \"X\".", "Iterating over \"X\", update \"common\" with an intersection of \"common\" and the set of unique elements in the current index of \"X\"", "Cast \"common\" as a list and print it out."], "inputs": [{"x": [[1, 2, 3, 4, 5], [0, 1, 3, 5, 7], [0, 2, 3, 4, 5]]}, {"x": [[1, 1], [1, 1]]}, {"x": [[1, 2, 3], [2, 3, 4], [3, 4, 5]]}, {"x": [[1, 12, 56, 21, 5], [21, 2, 6, 11, 7], [5, 7, 13, 8, 21], [5, 21, -5, 6, 8]]}, {"x": [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]}], "outputs": [[3, 5], [1], [3], [21], [1, 2, 3, 4, 5]], "max_gen_length": 128, "category": "array", "name": "Common num in matrix", "description": "Common numbers among rows in a matrix.", "id": "93"}
94 | {"prompts": ["Initialize a variable \"start\" with {start}, and \"seq\" with a list containing {start}.", "While the value is not 1, perform the following: if \"start\" is an even number, divide by 2, otherwise multiply by 3 and add 1, then store the number to \"start\" as well as appending to \"seq\".", "Store the sum of all numbers in \"seq\" to \"results\".", "Print out the \"results\"."], "inputs": [{"start": 1}, {"start": 9}, {"start": 27}, {"start": 28}, {"start": 123456789}], "outputs": [1, 55, 101440, 330, 1266590663], "max_gen_length": 128, "category": "algorithm", "name": "Sum Collatz", "description": "Obtain the sum of Collatz sequence starting from given number.", "id": "94"}
95 | {"prompts": ["Define a variable \"pos\" with \"{start}\", \"swap\" with {swap}.", "Write a function \"move\" that takes two strings x and y as input, and replace any appearance of x in y with an empty string, then return y.", "For each element in \"swap\", if it contains \"pos\", call \"move\" on \"pos\" and the current element and store the result to \"pos\".", "Print out \"pos\"."], "inputs": [{"start": "A", "swap": ["AB", "BC", "CA", "BC", "AC"]}, {"start": "B", "swap": ["AC", "CA"]}, {"start": "C", "swap": ["AB", "BC", "CA", "BC", "AC", "AB", "CA", "BC", "AC", "BA"]}, {"start": "C", "swap": ["AB", "AC"]}, {"start": "A", "swap": []}], "outputs": ["C", "B", "B", "A", "A"], "max_gen_length": 128, "category": "algorithm", "name": "Cup swap", "description": "Name the location of a \"ball\" after cup swapping.", "id": "95"}
96 | {"prompts": ["Initialize a variable \"stack\" with an empty list, and \"num\" with {x} as a string.", "For each chracter in \"num\", append the character to \"stack\".", "Assign an empty string to a variable \"result\", and concatenate characters popped from the last element of \"stack\" to \"result\" until \"stack\" is empty.", "Cast \"result\" as integer and print it out."], "inputs": [{"x": 123}, {"x": 123456789}, {"x": 100}, {"x": 0}, {"x": 1230}], "outputs": [321, 987654321, 1, 0, 321], "max_gen_length": 128, "category": "algorithm", "name": "Reverse digits", "description": "Reverse digits in a number with a stack.", "id": "96"}
97 | {"prompts": ["Assign {x} to a variable \"arrows\", then concatenate all the strings in \"arrows\" and store the result to \"joined_arrow\".", "Count the numbers of left-facing arrow and right-facing arrow and store the results to \"left\" and \"right\", respectively.", "If \"right\" is larger than \"left\", print out the string that consists of (right - left) right-facing arrows.", "Otherwise, print out the string that consists of (left - right) left-facing arrows."], "inputs": [{"x": ["<<", ">>>"]}, {"x": ["<<<", ">>"]}, {"x": ["<<", ">>", "<<", ">>>", ">>>"]}, {"x": ["<<", ">>"]}, {"x": ["<<<<<<<<<<<<", ">"]}], "outputs": [">", "<", ">>>>", "", "<<<<<<<<<<<"], "max_gen_length": 128, "category": "algorithm", "name": "Calculate arrows", "description": "Calculate arrowheads left and right.", "id": "97"}
98 | {"prompts": ["Initialize an array \"array\" with {x}.", "Calculate the difference of maximum and minimum values in \"array\" and store the value to \"diff\".", "Check if \"diff\" is included in \"array\" and store the boolean value to \"result\".", "Print out \"result\""], "inputs": [{"x": [1, 2, 3, 4, 5, 6, 8]}, {"x": [1, 7, 8]}, {"x": [10]}, {"x": [0, 1]}, {"x": [1000, 2, 3, 4, 5, 6, 1000000]}], "outputs": [false, true, false, true, false], "max_gen_length": 128, "category": "algorithm", "name": "Check interval num ", "description": "Check if the interval (max-min) is included in a list.", "id": "98"}
99 | {"prompts": ["Initialize a variable \"original\" with \"{x}\"", "Import OrderedDict from collections module, then initalize a variable \"dic\" with an OrderedDict with letters in \"original\" as keys and 0 as the value for each key.", "Iterating over each character in \"original\", increment the value in \"dic\" whose key is the character.", "Initialize an empty string to a variable \"result\", then iterate over items in \"dic\" and append the key and the value as strings to \"result\".", "Print out \"result\"."], "inputs": [{"x": "aabbddcc"}, {"x": "abc"}, {"x": "zzzzzyyyyyxxxxxa"}, {"x": "aaa"}, {"x": ""}], "outputs": ["a2b2d2c2", "a1b1c1", "z5y5x5a1", "a3", ""], "max_gen_length": 128, "category": "string", "name": "Length encoding", "description": "Encode a string by converting repeated chars with counts.", "id": "99"}
100 | {"prompts": ["Import re and define a regular expression that matches an email address.", "Search for an email address in \"{x}\" and store the first match to a variable \"address\".", "Remove the substring starting from the @ symbol from \"address\".", "Replace non-alphabetical symbols with a whitespace in \"address\".", "Print out \"address\"."], "inputs": [{"x": "abc@example.com."}, {"x": "a.b.c@example.com test."}, {"x": "a1b2c3.d4e_f6@example.com."}, {"x": "abc@example.com test. def@abc.def."}, {"x": "example@@example.com test, example_email@abc.io ."}], "outputs": ["abc", "a b c", "a b c d e f ", "abc", "example email"], "max_gen_length": 128, "category": "string", "name": "Convert email", "description": "Use regex to match email addresses and remove special chars.", "id": "100"}
101 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Implement a function that returns the distinct elements of a list.", "Compute the distinct elements of my_numbers and store as unique_list.", "Print out the second largest element in unique_list. If the second largest does not exit, print out the maximum."], "inputs": [{"A": [1, 3, 2, 2]}, {"A": [1000, 1000, 1000]}, {"A": [0, 0.2, 0.4, -0.2]}, {"A": [3, 3, 3, 2, 2, 1]}, {"A": [0, 3, 1, 3, 2, 2, -0.2, 0.2]}], "outputs": [2, 1000, 0.2, 2, 2], "max_gen_length": 128, "category": "array", "name": "Second largest", "description": "Print out the second largest element in an array.", "id": "101"}
102 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Implement a function that returns the prefix sum of a list as an array.", "Compute the prefix sum of my_numbers and store as prefix_sum_list.", "Print out the largest element in prefix_sum_list. "], "inputs": [{"A": [1, 3, 2, 2]}, {"A": [3, -3, -3]}, {"A": [0, 0.2, 0.4, -0.2]}, {"A": [3, 3, 3, -2, 2, 1]}, {"A": [-0.2, 5, -0.2]}], "outputs": [8, 3, 0.6, 10, 4.8], "max_gen_length": 128, "category": "array", "name": "Largest prefix sum", "description": "Return the largest prefix sum in an array.", "id": "102"}
103 | {"prompts": ["Assign the list of numbers \"{A}\" to a variable named \"my_numbers\".", "Count the distances from each element in my_number to 0. .", "Find the closest number to 0 in my_number and store as closest_number.", "Print out the distance from closest_number to 0. "], "inputs": [{"A": [1, 3, 2, 2]}, {"A": [3, -3, -3]}, {"A": [0, 0.2, 0.4, -0.2]}, {"A": [3, 3, 3, -2, 2, 1]}, {"A": [-0.2, 5, -0.2]}], "outputs": [1, 3, 0, 1, 0.2], "max_gen_length": 128, "category": "array", "name": "Closest element to zero", "description": "Find the element which is the cloest to 0 and print the distance.", "id": "103"}
104 | {"prompts": ["Assign the string \"{A}\" to a variable named \"my_string\".", "Implement a function that checks whether a string only contains unique characters.", "Find the longest substring of my_string that contains only unique characters and store as result_substring.", "Print out the length of result_substring."], "inputs": [{"A": "acc"}, {"A": "accccccccccccccccccccc"}, {"A": "abcdef"}, {"A": "acdeffce"}, {"A": "aaaaaaaaaaaaa"}], "outputs": [2, 2, 6, 5, 1], "max_gen_length": 128, "category": "string", "name": "Consecutive unique char", "description": "Find the max length contiguous subarray with unique characters.", "id": "104"}
105 | {"prompts": ["Assign a string \"{A}\" to a variable named \"my_string\".", "Find the repeated characters in the my_string.", "Count the frequency of these repeated characters.", "Print out the length of most frequent character."], "inputs": [{"A": "abadb"}, {"A": "aaaaaaaa"}, {"A": "caaaaaaaaaaaa"}, {"A": "cccccaaaaa"}, {"A": "abcde"}], "outputs": [2, 8, 12, 5, 0], "max_gen_length": 128, "category": "string", "name": "Highest frequency char", "description": "Obtain the frequency of the most frequent character.", "id": "105"}
106 | {"prompts": ["Assign a string \"{A}\" to a variable named \"my_string\".", "Implement a function that checks whether a string is a palindrome.", "Find all substrings of my_string which is a palindrome and store as a list.", "Print out the length of longest palindrome in the above list."], "inputs": [{"A": "a"}, {"A": "abcba"}, {"A": "caaa"}, {"A": "cccccaaaaa"}, {"A": "abcde"}], "outputs": [1, 5, 3, 5, 1], "max_gen_length": 128, "category": "string", "name": "Longest palindrome", "description": "Find the length of longest palindrome substring.", "id": "106"}
107 | {"prompts": ["Assign an integer \"{A}\" to a variable named \"my_integer\".", "Implement a function that checks whether an integer is a prime number.", "Find all prime numbers that are less than my_integer and store as prime_result.", "Print out the length of prime_result."], "inputs": [{"A": 10}, {"A": 0}, {"A": 1}, {"A": 100}, {"A": 17}], "outputs": [4, 0, 0, 25, 6], "max_gen_length": 128, "category": "algorithm", "name": "Count primes", "description": "Calcuate prime numbers in a range.", "id": "107"}
108 | {"prompts": ["Assign an array \"{A}\" to a variable named \"my_array\".", "Assign a positive integer \"{K}\" to a variable named \"k\".", "Implement a function that rotates one array to the right by 1 step.", "Rotate my_array k steps and store as rotated_result.", "Print out rotated_result."], "inputs": [{"A": [1, 2, 3, 4, 5], "K": 3}, {"A": [-1, 30, 50, 3], "K": 2}, {"A": [2, 3, 5, -30], "K": 1}, {"A": [1, 2, 0, 4], "K": 0}, {"A": [2, 3, 4], "K": 8}], "outputs": [[3, 4, 5, 1, 2], [50, 3, -1, 30], [-30, 2, 3, 5], [1, 2, 0, 4], [3, 4, 2]], "max_gen_length": 128, "category": "algorithm", "name": "Rotate array", "description": "Rotate an array to the right k steps.", "id": "108"}
109 | {"prompts": ["Assign an array \"{A}\" to a variable named \"my_array\".", "Compute the sum of my_array and store as my_sum.", "Implement a function that checks whether one subset of an array \"{A}\" is equal to my_sum/2.", "Print out the function output when the above array is my_array."], "inputs": [{"A": [1, 2, 3, 4, 5]}, {"A": [1, 5, 11, 5]}, {"A": [1, 2, 3, 5]}, {"A": [1, 2, 0, 4]}, {"A": [2, 3, 4, 3]}], "outputs": ["False", "True", "False", "False", "True"], "max_gen_length": 128, "category": "algorithm", "name": "Partition equal sets", "description": "Check whether one array can be divided into two subsets which have equal sums.", "id": "109"}
110 | {"prompts": ["Assign a non-negative integer \"{A}\" to a variable named \"my_number\".", "Compute the square root of my_number and store as root_number.", "Implement a function that only returns the integer part of a float number.", "Print out the integer part of root_number."], "inputs": [{"A": 2}, {"A": 5}, {"A": 101}, {"A": 8}, {"A": 226}], "outputs": [1, 2, 10, 2, 15], "max_gen_length": 128, "category": "math", "name": "Square root integer", "description": "Compute the integer part of square root.", "id": "110"}
111 | {"prompts": ["Assign a non-negative integer \"{A}\" to a variable named \"my_number\".", "Plus my_number by 1 and store as plus_number.", "Implement a function that only returns the digits of an integer as a list.", "Print out the digits of plus_number."], "inputs": [{"A": 2}, {"A": 5}, {"A": 101}, {"A": 2345}, {"A": 229}], "outputs": [[3], [6], [1, 0, 2], [2, 3, 4, 6], [2, 3, 0]], "max_gen_length": 128, "category": "math", "name": "Plus 1", "description": "Return the digits after an interger is plused by 1.", "id": "111"}
112 | {"prompts": ["Assign a non-negative integer \"{A}\" to a variable named \"my_number\".", "Implement a function that computes the square sum of two integers.", "Implement a function that checks one number is the sum of two square numbers.", "Print out \"True\" if my_number is the sum of two square numbers. Otherwise, print \"False\"."], "inputs": [{"A": 2}, {"A": 5}, {"A": 101}, {"A": 3}, {"A": 7}], "outputs": ["True", "True", "True", "False", "False"], "max_gen_length": 128, "category": "math", "name": "Check square sum", "description": "Check whether one integer is a sum of two square numbers.", "id": "112"}
113 | {"prompts": ["Assign an array \"{A}\" to a variable named \"my_array\".", "Implement a function that computes standard deviation of an array.", "Calculate the standard deviation of my_array and store as result.", "Print out \"True\" if result is less than 1. Otherwise, print \"False\"."], "inputs": [{"A": [14, 8, 11, 10]}, {"A": [3, 3, 3, 4]}, {"A": [1, 1, 1, 1, 1, 101]}, {"A": [1, 2, 3, 4, 5, 6, 7]}, {"A": [1, 0, 1, 0]}], "outputs": ["False", "True", "False", "False", "True"], "max_gen_length": 128, "category": "data science", "name": "Comare std. dev.", "description": "Determine whether standard deviation is less than 1.", "id": "113"}
114 | {"prompts": ["Assign the matrix \"{A}\" to a variable named \"my_matrix\".", "Calculate the number of rows of my_matrix and store as row_number.", "Calculate the number of columns of my_matrix and store as column_number.", "Calculate the sum of row_number and column_number and print the result."], "inputs": [{"A": [[3, 2], [2, 3]]}, {"A": [[3, 2, 5], [2, 3, 5]]}, {"A": [[1]]}, {"A": [[30000, 30000, 1], [30000, 30000, 1], [30000, 30000, 1]]}, {"A": [[5, 5, 5, 5, 5, 0]]}], "outputs": [4, 5, 2, 6, 7], "max_gen_length": 128, "category": "data science", "name": "Matrix size", "description": "Calculate the sum of row and column numbers.", "id": "114"}
115 | {"prompts": ["Assign the array \"{A}\" to a variable named \"my_array\".", "Calculate the mean of my_array and store as mean_number.", "Calculate the median of my_array and store as median_number.", "Calculate the difference between mean_number and median_number and print the result."], "inputs": [{"A": [3, 2, 2, 3]}, {"A": [3, 2, 5, 2, 3, 5]}, {"A": [1]}, {"A": [30000, 30000, 1, 30000, 30000, 1, 30000, 30000, 1]}, {"A": [5, 5, 5, 5, 5, 0]}], "outputs": [0, 0.3333333333333335, 0, -9999.666666666668, -0.833333333333333], "max_gen_length": 128, "category": "data science", "name": "Diff mean and median", "description": "Calculate the difference between mean and median for an array.", "id": "115"}
116 |
--------------------------------------------------------------------------------
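Each record in mtpb.jsonl above pairs turn-by-turn prompt templates with per-test-case inputs and expected printed outputs. The following is a minimal sketch (not part of the repo) of how such a record might be consumed: placeholders are filled for a test case, the code generated for each turn is executed in a shared namespace, and the final printed value is compared to the expected output. The inline record and the stubbed "generated" completions are toy examples, not real model output.

```python
import io
import json
from contextlib import redirect_stdout

# Toy two-turn record in the same shape as the mtpb.jsonl lines above.
record = json.loads(
    '{"prompts": ["Assign a natural number {n} to a variable named num", '
    '"Print out num squared"], "inputs": [{"n": 12}], "outputs": [144]}'
)

for case, expected in zip(record["inputs"], record["outputs"]):
    # Fill the {n}-style placeholders for this test case.
    turns = [p.format(**case) for p in record["prompts"]]
    # A model would generate code for each turn; we stub perfect completions.
    generated = ["num = 12", "print(num ** 2)"]
    env = {}                      # namespace shared across all turns
    buf = io.StringIO()
    with redirect_stdout(buf):
        for code in generated:
            exec(code, env)       # execute each turn's code in order
    # Compare the captured stdout with the expected printed output.
    ok = buf.getvalue().strip() == str(expected)
    print(ok)
```

In the real harness this execution happens inside a sandboxed subprocess (see `reliability_guard` in mtpb_exec.py below), since model-generated code must never run unguarded in the host process.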
/codegen1/benchmark/mtpb_exec.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2022, salesforce.com, inc.
2 | # All rights reserved.
3 | # SPDX-License-Identifier: BSD-3-Clause
4 | # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
5 |
6 | """
7 | python3.9 -m venv .venv
8 | source .venv/bin/activate
9 | pip3 install --upgrade pip
10 | pip3 install --upgrade setuptools
11 |
12 | pip3 install numpy==1.21.2 psutil==5.9.4
13 |
14 | python3 mtpb_exec.py
15 | """
16 |
17 | import argparse
18 | import contextlib
19 | import glob
20 | import json
21 | import os
22 | import random
23 | import signal
24 | from time import time
25 |
26 | import numpy as np
27 |
28 | ########################################################################
29 | # util
30 | import platform
31 | import faulthandler
32 | import psutil
33 |
34 | # https://github.com/openai/human-eval/blob/master/human_eval/execution.py
35 | def reliability_guard(maximum_memory_bytes=None):
36 | """
37 | This disables various destructive functions and prevents the generated code
38 | from interfering with the test (e.g. fork bomb, killing other processes,
39 | removing filesystem files, etc.)
40 | WARNING
41 | This function is NOT a security sandbox. Untrusted code, including model-
42 | generated code, should not be blindly executed outside of one. See the
43 | Codex paper for more information about OpenAI's code sandbox, and proceed
44 | with caution.
45 | """
46 |
47 | if maximum_memory_bytes is not None:
48 | import resource
49 | resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))
50 | resource.setrlimit(resource.RLIMIT_DATA, (maximum_memory_bytes, maximum_memory_bytes))
51 | if not platform.uname().system == 'Darwin':
52 | resource.setrlimit(resource.RLIMIT_STACK, (maximum_memory_bytes, maximum_memory_bytes))
53 |
54 | faulthandler.disable()
55 |
56 | import builtins
57 | builtins.exit = None
58 | builtins.quit = None
59 |
60 | import os
61 | os.environ['OMP_NUM_THREADS'] = '1'
62 |
63 | os.kill = None
64 | os.system = None
65 | os.putenv = None
66 | os.remove = None
67 | os.removedirs = None
68 | os.rmdir = None
69 | os.fchdir = None
70 | os.setuid = None
71 | os.fork = None
72 | os.forkpty = None
73 | os.killpg = None
74 | os.rename = None
75 | os.renames = None
76 | os.truncate = None
77 | os.replace = None
78 | os.unlink = None
79 | os.fchmod = None
80 | os.fchown = None
81 | os.chmod = None
82 | os.chown = None
83 | os.chroot = None
84 | os.fchdir = None
85 | os.lchflags = None
86 | os.lchmod = None
87 | os.lchown = None
88 | os.getcwd = None
89 | os.chdir = None
90 |
91 | import shutil
92 | shutil.rmtree = None
93 | shutil.move = None
94 | shutil.chown = None
95 |
96 | import subprocess
97 | subprocess.Popen = None # type: ignore
98 |
99 | # __builtins__['help'] = None
100 |
101 | import sys
102 | sys.modules['ipdb'] = None
103 | sys.modules['joblib'] = None
104 | sys.modules['resource'] = None
105 | # sys.modules['psutil'] = None
106 | sys.modules['tkinter'] = None
107 |
108 | class print_time:
109 | def __init__(self, desc):
110 | self.desc = desc
111 |
112 | def __enter__(self):
113 | print(self.desc)
114 | self.t = time()
115 |
116 | def __exit__(self, type, value, traceback):
117 | print(f"{self.desc} took {time()-self.t:.02f}s")
118 |
119 |
120 | def set_env():
121 | os.environ["TOKENIZERS_PARALLELISM"] = "false"
122 |
123 |
124 | def set_seed(seed, deterministic=True):
125 | random.seed(seed)
126 | os.environ["PYTHONHASHSEED"] = str(seed)
127 |
128 |
129 | def cast(model, fp16=True):
130 | if fp16:
131 | model.half()
132 | return model
133 |
134 |
135 | def gs_copy(from_path, to_path):
136 | command = f"gsutil -m cp -r {from_path} {to_path}"
137 |
138 | import subprocess
139 |
140 | process = subprocess.Popen(
141 | command, shell=True, stdout=subprocess.PIPE, universal_newlines=True
142 | )
143 | print(process.stdout.readline())
144 |
145 |
146 | def write_jsonl(filename, data):
147 | with open(filename, "wb") as f:
148 | for x in data:
149 | f.write((json.dumps(x) + "\n").encode("utf-8"))
150 |
151 |
152 | def read_jsonl(filename):
153 | result = []
154 | with open(filename, "rb") as f:
155 |         for line in f:
156 |             result.append(json.loads(line))
157 |
158 | # add a test_id
159 | test_case_map = {t: i for i, t in enumerate(set([str(r['input']) for r in result]))}
160 | print(filename, len(test_case_map))
161 | for r in result:
162 | r['test_id'] = test_case_map[str(r['input'])]
163 | return result
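`write_jsonl`/`read_jsonl` implement a plain JSON-lines roundtrip, one object per line, encoded as UTF-8 bytes. A self-contained sketch of the same encode/decode cycle (without the `test_id` bookkeeping; the file path is made up for illustration):

```python
import json
import os
import tempfile

records = [{"id": 1, "input": [2, 3]}, {"id": 2, "input": [5]}]

# write one JSON object per line, then read them back
path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(path, "wb") as f:
    for x in records:
        f.write((json.dumps(x) + "\n").encode("utf-8"))

with open(path, "rb") as f:
    loaded = [json.loads(line) for line in f]
```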
164 |
165 |
166 | ########################################################################
167 | # runtime
168 |
169 |
170 | def rewrite_ast_with_print(code):
171 |
172 | import ast
173 |
174 | class TransformPrint(ast.NodeTransformer):
175 |
176 | def __init__(self, last_exp):
177 | self.last_exp = last_exp
178 |
179 | def visit(self, node):
180 | if node == self.last_exp:
181 | # print(node)
182 | new_node = node
183 | new_node.value = ast.Call(func=ast.Name(id='print', ctx=ast.Load()), args=[new_node.value], keywords=[])
184 |
185 | ast.copy_location(new_node, node)
186 | ast.fix_missing_locations(new_node)
187 |
188 | return new_node
189 | else:
190 | self.generic_visit(node)
191 | return node
192 |
193 |
194 | a = ast.parse(code)
195 |
196 | def ends_with_print(a):
197 | try:
198 | return a.body[-1].value.func.id.startswith('print')
199 |         except (AttributeError, IndexError):
200 |             return False
201 |
202 | last_expr = lambda a: a.body[-1]
203 |
204 | if not ends_with_print(a):
205 | ast_rewrite = TransformPrint(last_expr(a)).visit(a)
206 | code_rewrite = ast.unparse(ast_rewrite)
207 |
208 | return code_rewrite
209 | else:
210 | return code
211 |
212 |
213 | def try_rewrite_ast_with_print(code):
214 | try:
215 | return rewrite_ast_with_print(code)
216 | except Exception as e:
217 | print(e)
218 | return code
219 |
220 |
221 | def test_try_rewrite_ast_with_print():
222 |
223 | # (1)
224 |
225 | if True:
226 |
227 | code_without_print = \
228 | '''def f(a):
229 | return a ** 2
230 | f(2)'''
231 |
232 | code_with_print = \
233 | '''def f(a):
234 | return a ** 2
235 | print(f(2))'''
236 |
237 | code_rewrite = try_rewrite_ast_with_print(code_without_print)
238 | assert code_with_print == code_rewrite, f'{code_with_print}\n{code_rewrite}'
239 |
240 | code_rewrite = try_rewrite_ast_with_print(code_with_print)
241 | assert code_with_print == code_rewrite, f'{code_with_print}\n{code_rewrite}'
242 |
243 |
244 | # (2)
245 |
246 | if True:
247 |
248 | code_without_print = \
249 | '''df.groupby('gender')['alco'].value_counts()'''
250 |
251 | code_with_print = \
252 | '''print(df.groupby('gender')['alco'].value_counts())'''
253 |
254 | code_rewrite = try_rewrite_ast_with_print(code_without_print)
255 | assert code_with_print == code_rewrite, f'{code_with_print}\n{code_rewrite}'
256 |
257 | code_rewrite = try_rewrite_ast_with_print(code_with_print)
258 | assert code_with_print == code_rewrite, f'{code_with_print}\n{code_rewrite}'
259 |
260 |
261 | def do_exec(code, globals=None):
262 |     from io import StringIO
263 |     from contextlib import redirect_stdout
264 |     globals = {} if globals is None else globals  # avoid a shared mutable default
265 | f = StringIO()
266 | with redirect_stdout(f):
267 | try:
268 | exec(code, globals)
269 | except Exception as e:
270 | return False, str(e)
271 |
272 | return True, f.getvalue()
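The stdout-capturing pattern used by `do_exec` can be sketched on its own: `redirect_stdout` into a `StringIO` around `exec`, returning either the captured text or the error message. `run_snippet` below is an illustrative name, not part of this file:

```python
from io import StringIO
from contextlib import redirect_stdout

def run_snippet(code, env=None):
    # execute code, capturing anything it prints; env carries state across calls
    env = {} if env is None else env
    buf = StringIO()
    with redirect_stdout(buf):
        try:
            exec(code, env)
        except Exception as e:
            return False, str(e)
    return True, buf.getvalue()

env = {}
ok1, out1 = run_snippet("x = 21", env)      # no output, just binds x in env
ok2, out2 = run_snippet("print(x * 2)", env)  # prints "42\n" into the buffer
```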
273 |
274 |
275 | def test_print_stack_exec():
276 | def overload_builtin_print(globals):
277 | overload_print = """import builtins
278 | def print(*args, **kwargs):
279 | global print_stack
280 | print_stack.append({'args': args, 'kwargs': kwargs})
281 | return builtins.print(*args, **kwargs)
282 | """
283 | return do_exec(overload_print, globals)
284 |
285 | globals = {"print_stack": []}
286 | overload_success, overload_out = overload_builtin_print(globals)
287 | assert overload_success == True, overload_out
288 |
289 | do_exec("print('hello_0')", globals)
290 | print(globals["print_stack"])
291 |
292 | do_exec("print('hello_1')", globals)
293 | print(globals["print_stack"])
294 |
295 | globals["print_stack"] = []
296 | do_exec("print('hello_2')", globals)
297 | print(globals["print_stack"])
298 |
299 |
300 | def overload_builtin_print(globals):
301 | overload_print = """import builtins
302 | def print(*args, **kwargs):
303 | global print_stack
304 | print_stack.append({'args': args, 'kwargs': kwargs})
305 | return builtins.print(*args, **kwargs)
306 | """
307 | overload_success, overload_out = do_exec(overload_print, globals)
308 | assert overload_success == True, overload_out
309 |
310 | return globals
311 |
312 |
313 | class TimeoutException(Exception):
314 | pass
315 |
316 |
317 | @contextlib.contextmanager
318 | def time_limit(seconds: float):
319 | def signal_handler(signum, frame):
320 | raise TimeoutException("Timed out!")
321 |
322 | signal.setitimer(signal.ITIMER_REAL, seconds)
323 | signal.signal(signal.SIGALRM, signal_handler)
324 | try:
325 | yield
326 | finally:
327 | signal.setitimer(signal.ITIMER_REAL, 0)
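`time_limit` above is a SIGALRM-based timeout: `setitimer` arms a real-time timer, the handler raises inside the running frame, and the `finally` clause disarms the timer. The sketch below shows the same pattern end-to-end (like the original, it only works on the main thread of Unix systems):

```python
import contextlib
import signal

class TimeoutException(Exception):
    pass

@contextlib.contextmanager
def time_limit(seconds: float):
    # SIGALRM fires after `seconds` and the handler raises inside the with-block
    def handler(signum, frame):
        raise TimeoutException("Timed out!")
    signal.signal(signal.SIGALRM, handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer

timed_out = False
try:
    with time_limit(0.05):
        while True:  # busy loop that would otherwise never return
            pass
except TimeoutException:
    timed_out = True
```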
328 |
329 |
330 | def check_equivalence(o1, o2):
331 |
332 | # (1) pandas
333 |     if str(type(o1)) == "<class 'pandas.core.series.Series'>":
334 | o1 = o1.to_numpy()
335 | o2 = np.array(o2)
336 |
337 | # (2) tuples
338 | elif type(o1) == list:
339 | if len(o1) > 0 and type(o1[0]) == tuple:
340 | o1 = [list(t) for t in o1]
341 |
342 | # (3) numpy
343 | elif type(o1) in [np.uint8, np.int16, np.int32, np.int64]:
344 | o1 = int(o1)
345 |
346 | elif type(o1) in [np.float16, np.float32, np.float64]:
347 | o1 = float(o1)
348 |
349 | elif type(o1) in [np.bool_]:
350 | o1 = bool(o1)
351 |
352 | elif type(o1) == np.ndarray:
353 | o1 = o1.tolist()
354 |
355 | # handles type mismatches by casting to expected output type
356 | if type(o1) != type(o2):
357 | if type(o2) == bool:
358 | # note: bool("False") -> True. Instead compare in string space
359 | o1, o2 = str(o1), str(o2)
360 | else:
361 | o1 = type(o2)(o1)
362 |
363 | type_ = type(o1)
364 | if type_ in [str, int, set, list, bool, tuple]:
365 | return o1 == o2
366 | elif type_ == float:
367 | return abs(o1 - o2) < 1e-6
368 | elif type_ == np.ndarray:
369 | return np.allclose(o1, o2)
370 | else:
371 | raise ValueError(f"Unsupported type {type_}")
372 |
373 |
374 | ########################################################################
375 | # eval
376 |
377 | def estimate_pass_at_k(num_samples, num_correct, k):
378 | """
379 | Estimates pass@k of each problem and returns them in an array.
380 | """
381 |
382 | def estimator(n: int, c: int, k: int) -> float:
383 | """
384 | Calculates 1 - comb(n - c, k) / comb(n, k).
385 | """
386 | if n - c < k:
387 | return 1.0
388 | return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
389 |
390 | if isinstance(num_samples, int):
391 | num_samples_it = itertools.repeat(num_samples, len(num_correct))
392 | else:
393 | assert len(num_samples) == len(num_correct)
394 | num_samples_it = iter(num_samples)
395 |
396 | return np.array(
397 | [estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)]
398 | )
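The estimator computes the unbiased pass@k of the Codex paper, 1 - C(n-c, k)/C(n, k); the file evaluates it as a numerically stable product over `np.arange`. A stdlib sketch of the closed form it implements; for k=1 it reduces to the raw success rate c/n:

```python
from math import comb

def pass_at_k(n, c, k):
    # 1 - comb(n - c, k) / comb(n, k): probability that at least one of
    # k samples drawn from n (of which c are correct) passes
    if n - c < k:
        return 1.0  # fewer failures than draws, so a correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=5, c=2, k=1)  # reduces to c/n = 0.4
p5 = pass_at_k(n=5, c=2, k=5)  # k > n - c, so the result is 1.0
```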
399 |
400 |
401 | ########################################################################
402 | # benchmark
403 |
404 | def eval_problems(problems, verbose, ks=1):
405 |     reliability_guard(psutil.virtual_memory().total)
406 |
407 | def eval_problem(problem):
408 |
409 | globals = {"print_stack": [], "__name__": "__main__"}
410 | globals = overload_builtin_print(globals)
411 |
412 | test_success = False
413 | test_msg = ''
414 | print_output = ''
415 | std_out = ''
416 |
417 | try:
418 | with time_limit(20):
419 |
420 | completions = problem["completions"]
421 | gold_output = problem["gold_output"]
422 |
423 | completions_with_print = try_rewrite_ast_with_print(completions)
424 |
425 | success, std_out = do_exec(code=completions_with_print, globals=globals)
426 | print_output = globals["print_stack"][-1]["args"][0]
427 | test_success = check_equivalence(print_output, gold_output)
428 |
429 | except TimeoutException:
430 | test_msg = "Timed out"
431 | test_success = False
432 | except Exception as e:
433 | test_msg = str(e)
434 | test_success = False
435 |
436 | return test_success, test_msg, print_output, std_out
437 |
438 | n_total = 0
439 | n_success = 0
440 | pass_stats = {(p['id'], p['test_id']): [] for p in problems}
441 |
442 | for i, problem in enumerate(problems):
443 |
444 | test_success, test_msg, sys_output, std_out = eval_problem(problem)
445 |
446 | n_total += 1
447 | n_success += 1 if test_success else 0
448 | pass_stats[problem["id"], problem['test_id']].append(1 if test_success else 0)
449 |
450 | if verbose:
451 | print('=' * 40)
452 | print(f'{i}/{len(problems)} {problem["id"]} -> {test_success} ({test_msg}) [{std_out}]')
453 | print('-' * 40)
454 | print(f'{problem["gold_output"]}')
455 | print(f'{sys_output}')
456 | print('-' * 40)
457 | print(f'{problem["completions"]}')
458 | print('-' * 40)
459 | print(f'{try_rewrite_ast_with_print(problem["completions"])}')
460 | print()
461 | else:
462 | print(f'{i}/{len(problems)} {problem["id"]} -> {test_success} ({test_msg}) [{sys_output}=={problem["gold_output"]}] [{std_out}]')
463 |
464 | print("\nOverall stats")
465 | print(f'n_total={n_total} n_success={n_success}')
466 | print("-" * 40)
467 |
468 | pids = sorted(set(p[0] for p in pass_stats))
469 | for pid in pids:
470 | test_cases = [v for k, v in pass_stats.items() if k[0] == pid]
471 | print(f"{pid:15s}: {sum([i for t in test_cases for i in t]):3d} : {sum([len(t) for t in test_cases])}")
472 |
473 |
474 | ########################################################################
475 | # main
476 |
477 |
478 | def main():
479 |
480 | # (0) args
481 |
482 | parser = argparse.ArgumentParser()
483 | parser.add_argument("--samples-dir", type=str, default='./results/mtpb_sample.py/checkpoints/codegen-350M-mono/mtpb_sample.py_checkpoints/')
484 | parser.add_argument("--k", type=int, nargs="+", default=[1])
485 |
486 | args = parser.parse_args()
487 |
488 | # (1) env
489 |
490 | set_env()
491 |
492 | # (2) exec
493 |
494 | files = glob.glob(os.path.join(args.samples_dir, "*.jsonl"))
495 | all_problems = [p for f in files for p in read_jsonl(f)]
496 | eval_problems(all_problems, verbose=False, ks=args.k)
497 |
498 |
499 |
500 | if __name__ == "__main__":
501 | test_try_rewrite_ast_with_print()
502 | main()
503 |
--------------------------------------------------------------------------------
/codegen1/benchmark/mtpb_sample.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2022, salesforce.com, inc.
2 | # All rights reserved.
3 | # SPDX-License-Identifier: BSD-3-Clause
4 | # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
5 |
6 | """
7 | python3.9 -m venv .venv
8 | source .venv/bin/activate
9 | pip3 install --upgrade pip
10 | pip3 install --upgrade setuptools
11 |
12 | pip3 install torch==1.9.0 -f https://download.pytorch.org/whl/torch_stable.html
13 | pip3 install numpy==1.21.2 filelock==3.0.12 packaging==21.0 huggingface_hub==0.0.17 regex==2021.9.24 sacremoses==0.0.45 tokenizers==0.10.3
14 | pip3 install transformers==4.16.2
15 |
16 | wget -P checkpoints https://storage.googleapis.com/sfr-codegen-research/checkpoints/codegen-350M-mono.tar.gz && tar -xvf checkpoints/codegen-350M-mono.tar.gz -C checkpoints/
17 |
18 | python3 mtpb_sample.py
19 | """
20 |
21 | import argparse
22 | import json
23 | import os
24 | from pathlib import Path
25 | import random
26 | from time import time
27 |
28 | import torch
29 |
30 |
31 | ########################################################################
32 | # util
33 |
34 |
35 | class print_time:
36 | def __init__(self, desc):
37 | self.desc = desc
38 |
39 | def __enter__(self):
40 | print(self.desc)
41 | self.t = time()
42 |
43 | def __exit__(self, type, value, traceback):
44 | print(f"{self.desc} took {time()-self.t:.02f}s")
45 |
46 |
47 | def set_env():
48 | os.environ["TOKENIZERS_PARALLELISM"] = "false"
49 |
50 |
51 | def set_seed(seed, deterministic=True):
52 | random.seed(seed)
53 | os.environ["PYTHONHASHSEED"] = str(seed)
54 | torch.manual_seed(seed)
55 | if torch.cuda.is_available():
56 | torch.cuda.manual_seed(seed)
57 | torch.backends.cudnn.deterministic = deterministic
58 |         torch.backends.cudnn.benchmark = not deterministic
59 |
60 |
61 | def cast(model, fp16=True):
62 | if fp16:
63 | model.half()
64 | return model
65 |
66 |
67 | def write_jsonl(filename, data):
68 | if not os.path.exists(os.path.dirname(filename)):
69 | os.makedirs(os.path.dirname(filename))
70 |
71 | with open(filename, 'wb') as f:
72 | for x in data:
73 | f.write((json.dumps(x) + "\n").encode("utf-8"))
74 |
75 |
76 | ########################################################################
77 | # model
78 |
79 | from transformers import GPT2TokenizerFast
80 | from jaxformer.hf.codegen.modeling_codegen import CodeGenForCausalLM
81 |
82 |
83 | def create_model(ckpt, fp16=True):
84 | if fp16:
85 | return CodeGenForCausalLM.from_pretrained(ckpt, revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
86 | else:
87 | return CodeGenForCausalLM.from_pretrained(ckpt)
88 |
89 |
90 | def create_tokenizer():
91 | t = GPT2TokenizerFast.from_pretrained('gpt2')
92 | t.max_model_input_sizes['gpt2'] = 1e20
93 | return t
94 |
95 |
96 | def include_whitespace(t, n_min=2, n_max=20, as_special_tokens=False):
97 | t.add_tokens([' ' * n for n in reversed(range(n_min, n_max))], special_tokens=as_special_tokens)
98 | return t
99 |
100 |
101 | def include_tabs(t, n_min=2, n_max=20, as_special_tokens=False):
102 | t.add_tokens(['\t' * n for n in reversed(range(n_min, n_max))], special_tokens=as_special_tokens)
103 | return t
104 |
105 |
106 | def create_custom_gpt2_tokenizer():
107 | t = create_tokenizer()
108 | t = include_whitespace(t=t, n_min=2, n_max=32, as_special_tokens=False)
109 | t = include_tabs(t=t, n_min=2, n_max=10, as_special_tokens=False)
110 | t.padding_side = "left"
111 |     t.pad_token_id = 50256
112 | return t
113 |
114 |
115 | ########################################################################
116 | # sample
117 |
118 | def sample(
119 | device,
120 | model,
121 | tokenizer,
122 | prompt,
123 | pad_token_id,
124 | num_return_sequences=1,
125 | temp=0.2,
126 | top_p=0.95,
127 | max_length=2048,
128 | max_gen_length=128,
129 | ):
130 |
131 | input_ids = tokenizer(
132 | prompt,
133 | truncation=True,
134 | padding=True,
135 | return_tensors="pt",
136 | ).input_ids
137 |
138 | input_ids_len = input_ids.shape[1]
139 | assert input_ids_len < max_length
140 |
141 | with torch.no_grad():
142 | input_ids = input_ids.to(device)
143 | tokens = model.generate(
144 | input_ids,
145 | do_sample=True,
146 | num_return_sequences=num_return_sequences,
147 | temperature=temp,
148 | max_length=input_ids_len + max_gen_length,
149 | top_p=top_p,
150 | pad_token_id=pad_token_id,
151 | use_cache=True,
152 | )
153 | text = tokenizer.batch_decode(tokens[:, input_ids_len:, ...])
154 |
155 | return text
156 |
157 |
158 | def truncate(completion):
159 | import re
160 |
161 | def find_re(string, pattern, start_pos):
162 | m = pattern.search(string, start_pos)
163 | return m.start() if m else -1
164 |
165 | terminals = [re.compile(r, re.MULTILINE) for r in ['^#', re.escape('<|endoftext|>'), "^'''", '^"""', '\n\n\n']]
166 |
167 | prints = list(re.finditer('^print', completion, re.MULTILINE))
168 | if len(prints) > 1:
169 | completion = completion[:prints[1].start()]
170 |
171 | defs = list(re.finditer('^def', completion, re.MULTILINE))
172 | if len(defs) > 1:
173 | completion = completion[:defs[1].start()]
174 |
175 | start_pos = 0
176 |
177 | terminals_pos = [pos for pos in [find_re(completion, terminal, start_pos) for terminal in terminals] if pos != -1]
178 | if len(terminals_pos) > 0:
179 | return completion[:min(terminals_pos)]
180 | else:
181 | return completion
182 |
183 |
184 | def test_truncate():
185 |
186 | assert truncate('\nif len_a > len_b:\n result = a\nelse:\n result = b\n\n\n\n#') == '\nif len_a > len_b:\n result = a\nelse:\n result = b'
187 |
188 |
189 |
190 |
191 | ########################################################################
192 | # benchmark
193 |
194 | def create_problem_set(problems_path, problem_ids):
195 |
196 | problems = []
197 | p = Path(problems_path)
198 | with p.open("r") as f:
199 | for line in f:
200 | try:
201 | prob = json.loads(line)
202 | except Exception as e:
203 | print(p)
204 | print(line)
205 | raise e
206 | if not problem_ids:
207 | problems.append(prob)
208 | elif int(prob['id']) in problem_ids:
209 | problems.append(prob)
210 |
211 | return sorted(problems, key=lambda x: int(x["id"]))
212 |
213 |
214 | def sample_completions(
215 | sample,
216 | device,
217 | model,
218 | tokenizer,
219 | n,
220 | t,
221 | p,
222 | pad_token_id,
223 | max_length,
224 | set_rng_seed,
225 | out_file,
226 | batch_size,
227 | problem_set,
228 | max_gen_length=256,
229 | prefix = "# Import libraries.\n\nimport numpy as np\n\n"
230 | ):
231 |
232 | with print_time("sample completions"):
233 |
234 | wrap = lambda prompt: f'# {prompt}\n'
235 |
236 | for i, problem in enumerate(problem_set):
237 |
238 | if os.path.exists(out_file(problem['id'])):
239 | print(f'skipping problem {problem["id"]}')
240 | continue
241 |
242 | samples = []
243 |
244 | print('=' * 10)
245 | print(f'Problem {problem["id"]}')
246 | print('=' * 10)
247 |
248 | set_rng_seed()
249 |
250 | num_batches = n // batch_size
251 | remainder = n % batch_size
252 |
253 | for j in range(num_batches + 1 if remainder > 0 else num_batches):
254 |
255 | filled_prompts = [
256 | ([p.format(**input) for p in problem["prompts"]], input, output) for (input, output) in zip(problem["inputs"], problem["outputs"])
257 | ]
258 |
259 | for k, (prompts, input, output) in enumerate(filled_prompts):
260 |
261 | histories = [prefix for _ in range(batch_size if j != num_batches else remainder)]
262 | histories_full = [[prefix] for _ in range(batch_size if j != num_batches else remainder)]
263 |
264 | for l, prompt in enumerate(prompts):
265 |
266 | histories = [h + wrap(prompt) for h in histories]
267 | histories_full = [h + [wrap(prompt)] for h in histories_full]
268 |
269 | completions = sample(
270 | device,
271 | model,
272 | tokenizer,
273 | histories,
274 | num_return_sequences=1,
275 | top_p=p,
276 | temp=t,
277 | pad_token_id=pad_token_id,
278 | max_length=max_length,
279 | max_gen_length=problem.get("max_gen_length", max_gen_length),
280 | )
281 |
282 | histories = [h + f"{truncate(c)}\n\n" for h, c in zip(histories, completions)]
283 | histories_full = [h + [f"{truncate(c)}\n\n"] for h, c in zip(histories_full, completions)]
284 |
285 | print('-' * 10)
286 | print(l)
287 | print('-' * 10)
288 | print(histories[0])
289 | print('-' * 10)
290 |
291 | for history, history_full in zip(histories, histories_full):
292 | samples.append(
293 | {
294 | "id": problem["id"],
295 | "input": input,
296 | "gold_output": output,
297 | "completions": history,
298 | "prompts_completions": history_full
299 | }
300 | )
301 |
302 | write_jsonl(out_file(problem['id']), samples)
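The loop above implements MTPB's multi-turn chaining: each turn's prompt is commented into a running history, the model completes it, and the truncated completion is appended before the next turn. A model-free sketch of that accumulation, where `stub_sample` stands in for the real sampler:

```python
# stub_sample stands in for model sampling; it returns one fixed completion
# per history instead of calling a model
def stub_sample(histories):
    return [f"pass  # completion {i}" for i, _ in enumerate(histories)]

wrap = lambda prompt: f"# {prompt}\n"
prefix = "# Import libraries.\n\nimport numpy as np\n\n"
prompts = ["Assign 3 to x.", "Print x."]

histories = [prefix]  # one running history per batch element
for prompt in prompts:
    # comment the turn's prompt into each history, then fold the completion back in
    histories = [h + wrap(prompt) for h in histories]
    completions = stub_sample(histories)
    histories = [h + f"{c}\n\n" for h, c in zip(histories, completions)]
```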
303 |
304 |
305 |
306 | ########################################################################
307 | # main
308 |
309 | def main():
310 |
311 | # (0) params
312 | parser = argparse.ArgumentParser()
313 | parser.add_argument("--model", type=str, default="checkpoints/codegen-350M-mono")
314 | parser.add_argument("--device", type=str, default="cpu")
315 | parser.add_argument("--seed", type=int, default=16)
316 | parser.add_argument("--out", type=str, default="./results/{file}/{model}/{file}_{model}_{seed}_{p}_{t}_{n}_{batch_size}_{fp16}_[{problem_ids}].jsonl")
317 | parser.add_argument("--p", type=float, default=0.95)
318 | parser.add_argument("--t", type=float, default=0.2)
319 | parser.add_argument("--n", type=int, default=1)
320 | parser.add_argument("--max_length", type=int, default=2048)
321 | parser.add_argument("--batch-size", type=int, default=1)
322 | parser.add_argument("--fp16", type=bool, default=False)
323 | parser.add_argument("--problem-ids", nargs="+", type=int, default=[1])
324 | parser.add_argument("--problem-path", type=str, default="./mtpb.jsonl")
325 |
326 | args = parser.parse_args()
327 |
328 | out = lambda problem_id: args.out.format(file=os.path.basename(__file__), model=args.model, seed=args.seed, p=args.p, t=args.t, n=args.n, batch_size=args.batch_size, fp16=args.fp16, problem_ids=problem_id)
329 |
330 | device = torch.device(args.device)
331 | rng_deterministic = True
332 |
333 | # (1) env
334 |
335 | set_env()
336 |
337 | def bind_set_seed(seed=args.seed):
338 | set_seed(seed=seed, deterministic=rng_deterministic)
339 |
340 | # (2) load
341 |
342 | with print_time('loading parameters'):
343 | model = create_model(ckpt=args.model, fp16=args.fp16).to(device)
344 |
345 | with print_time("load tokenization"):
346 | tokenizer = create_custom_gpt2_tokenizer()
347 |
348 |
349 | # (3) sample
350 |
351 | with print_time("sampling"):
352 | problem_set = create_problem_set(args.problem_path, args.problem_ids)
353 |
354 | print(f'loaded {len(problem_set)} problems')
355 |
356 | sample_completions(
357 | sample=sample,
358 | device=device,
359 | model=model,
360 | tokenizer=tokenizer,
361 | n=args.n,
362 | t=args.t,
363 | p=args.p,
364 | pad_token_id=50256,
365 | max_length=args.max_length,
366 | set_rng_seed=bind_set_seed,
367 | out_file=out,
368 | batch_size=args.batch_size,
369 | problem_set=problem_set,
370 | )
371 |
372 |
373 | if __name__ == "__main__":
374 | test_truncate()
375 | main()
376 |
--------------------------------------------------------------------------------
/codegen1/jaxformer/hf/codegen/configuration_codegen.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2021 The EleutherAI and HuggingFace Teams. All rights reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | # Modified configuration implementation based on https://github.com/huggingface/transformers/blob/main/src/transformers/models/gptj/configuration_gptj.py
17 |
18 | from transformers.configuration_utils import PretrainedConfig
19 | from transformers.utils import logging
20 |
21 | logger = logging.get_logger(__name__)
22 |
23 |
24 | class CodeGenConfig(PretrainedConfig):
25 | model_type = "codegen"
26 |
27 | def __init__(
28 | self,
29 | vocab_size=50400,
30 | n_positions=2048,
31 | n_ctx=2048,
32 | n_embd=4096,
33 | n_layer=28,
34 | n_head=16,
35 | rotary_dim=64,
36 | n_inner=None,
37 | activation_function="gelu_new",
38 | resid_pdrop=0.0,
39 | embd_pdrop=0.0,
40 | attn_pdrop=0.0,
41 | layer_norm_epsilon=1e-5,
42 | initializer_range=0.02,
43 | scale_attn_weights=True,
44 | gradient_checkpointing=False,
45 | use_cache=True,
46 | bos_token_id=50256,
47 | eos_token_id=50256,
48 | **kwargs
49 | ):
50 | super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
51 |
52 | self.vocab_size = vocab_size
53 | self.n_ctx = n_ctx
54 | self.n_positions = n_positions
55 | self.n_embd = n_embd
56 | self.n_layer = n_layer
57 | self.n_head = n_head
58 | self.n_inner = n_inner
59 | self.rotary_dim = rotary_dim
60 | self.activation_function = activation_function
61 | self.resid_pdrop = resid_pdrop
62 | self.embd_pdrop = embd_pdrop
63 | self.attn_pdrop = attn_pdrop
64 | self.layer_norm_epsilon = layer_norm_epsilon
65 | self.initializer_range = initializer_range
66 | self.gradient_checkpointing = gradient_checkpointing
67 | self.scale_attn_weights = scale_attn_weights
68 | self.use_cache = use_cache
69 |
70 | self.bos_token_id = bos_token_id
71 | self.eos_token_id = eos_token_id
72 |
73 | @property
74 | def max_position_embeddings(self):
75 | return self.n_positions
76 |
77 | @property
78 | def hidden_size(self):
79 | return self.n_embd
80 |
81 | @property
82 | def num_attention_heads(self):
83 | return self.n_head
84 |
85 | @property
86 | def num_hidden_layers(self):
87 | return self.n_layer
88 |
--------------------------------------------------------------------------------
/codegen1/jaxformer/hf/codegen/modeling_codegen.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2021 The EleutherAI and HuggingFace Teams. All rights reserved.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 |
16 | # Modified forward-pass implementation based on https://github.com/huggingface/transformers/blob/main/src/transformers/models/gptj/modeling_gptj.py
17 |
18 | from typing import Tuple
19 |
20 | import numpy as np
21 |
22 | import torch
23 | import torch.utils.checkpoint
24 | from torch import nn
25 | from torch.nn import CrossEntropyLoss
26 |
27 | from transformers.activations import ACT2FN
28 | from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
29 | from transformers.modeling_utils import PreTrainedModel
30 | from transformers.utils import logging
31 | from transformers.utils.model_parallel_utils import assert_device_map, get_device_map
32 | from .configuration_codegen import CodeGenConfig
33 |
34 |
35 | logger = logging.get_logger(__name__)
36 |
37 |
38 | def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
39 | dim = x.shape[-1]
40 | if seq_len is None:
41 | seq_len = x.shape[seq_dim]
42 | inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))
43 | # original
44 | # sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(seq_len), inv_freq).to(x.device).float()
45 | # QHD fix onnx error by https://github.com/microsoft/onnxruntime/discussions/10121#discussioncomment-1987845
46 | sinusoid_inp = torch.einsum("i , j -> i j", torch.arange(seq_len).float(), inv_freq).to(x.device).float()
47 | return torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)
48 |
49 |
50 | def rotate_every_two(x):
51 | x1 = x[:, :, :, ::2]
52 | x2 = x[:, :, :, 1::2]
53 | x = torch.stack((-x2, x1), axis=-1)
54 | return x.flatten(-2) # in einsum notation: rearrange(x, '... d j -> ... (d j)')
55 |
56 |
57 | def apply_rotary_pos_emb(x, sincos, offset=0):
58 | sin, cos = map(lambda t: t[None, offset : x.shape[1] + offset, None, :].repeat_interleave(2, 3), sincos)
59 | # einsum notation for lambda t: repeat(t[offset:x.shape[1]+offset,:], "n d -> () n () (d j)", j=2)
60 | return (x * cos) + (rotate_every_two(x) * sin)
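Together, `fixed_pos_embedding`, `rotate_every_two`, and `apply_rotary_pos_emb` rotate each adjacent feature pair by a position-dependent angle, which leaves vector norms unchanged. A NumPy sketch of the same interleaved sin/cos scheme (illustrative only; shapes are simplified to `(seq, dim)` and the function names are this sketch's, not the file's):

```python
import numpy as np

def fixed_pos_embedding_np(seq_len, dim):
    # one rotation frequency per feature pair, as in the PyTorch version above
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(np.arange(seq_len), inv_freq)
    return np.sin(angles), np.cos(angles)

def rotate_every_two_np(x):
    # (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
    x1, x2 = x[..., ::2], x[..., 1::2]
    return np.stack((-x2, x1), axis=-1).reshape(x.shape)

def apply_rotary_np(x, sincos):
    # duplicate each sin/cos entry so it covers both halves of its feature pair
    sin, cos = (np.repeat(t, 2, axis=-1) for t in sincos)
    return x * cos + rotate_every_two_np(x) * sin

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
rotated = apply_rotary_np(x, fixed_pos_embedding_np(8, 16))
```

Because each pair undergoes a pure 2D rotation, the per-position norm of `rotated` matches that of `x`.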
61 |
62 |
63 | class CodeGenAttention(nn.Module):
64 | def __init__(self, config):
65 | super().__init__()
66 |
67 | max_positions = config.max_position_embeddings
68 | self.register_buffer(
69 | "bias",
70 | torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(
71 | 1, 1, max_positions, max_positions
72 | ),
73 | )
74 | self.register_buffer("masked_bias", torch.tensor(-1e9))
75 |
76 | self.attn_dropout = nn.Dropout(config.attn_pdrop)
77 | self.resid_dropout = nn.Dropout(config.resid_pdrop)
78 |
79 | self.embed_dim = config.hidden_size
80 | self.num_attention_heads = config.num_attention_heads
81 | self.head_dim = self.embed_dim // self.num_attention_heads
82 | if self.head_dim * self.num_attention_heads != self.embed_dim:
83 | raise ValueError(
84 | f"embed_dim must be divisible by num_attention_heads (got `embed_dim`: {self.embed_dim} and `num_attention_heads`: {self.num_attention_heads})."
85 | )
86 | self.scale_attn = torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype())
87 | self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False)
88 |
89 | self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
90 | self.rotary_dim = None
91 | if config.rotary_dim is not None:
92 | self.rotary_dim = config.rotary_dim
93 |
94 | def _split_heads(self, x, n_head, dim_head, mp_num):
95 | reshaped = x.reshape(x.shape[:-1] + (n_head//mp_num, dim_head))
96 | reshaped = reshaped.reshape(x.shape[:-2] + (-1, ) + reshaped.shape[-1:])
97 | return reshaped
98 |
99 | def _merge_heads(self, tensor, num_attention_heads, attn_head_size):
100 | """
101 |         Merges attn_head_size dim and num_attn_heads dim into the hidden dim
102 | """
103 | if len(tensor.shape) == 5:
104 | tensor = tensor.permute(0, 1, 3, 2, 4).contiguous()
105 | elif len(tensor.shape) == 4:
106 | tensor = tensor.permute(0, 2, 1, 3).contiguous()
107 | else:
108 | raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}")
109 | new_shape = tensor.size()[:-2] + (num_attention_heads * attn_head_size,)
110 | return tensor.view(new_shape)
111 |
112 | def _attn(
113 | self,
114 | query,
115 | key,
116 | value,
117 | attention_mask=None,
118 | head_mask=None,
119 | ):
120 |
121 | # compute causal mask from causal mask buffer
122 | query_length, key_length = query.size(-2), key.size(-2)
123 | causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length].to(torch.bool)
124 |
125 | # Keep the attention weights computation in fp32 to avoid overflow issues
126 | query = query.to(torch.float32)
127 | key = key.to(torch.float32)
128 |
129 | attn_weights = torch.matmul(query, key.transpose(-1, -2))
130 |
131 | attn_weights = attn_weights / self.scale_attn
132 | attn_weights = torch.where(causal_mask, attn_weights, self.masked_bias.to(attn_weights.dtype))
133 |
134 | if attention_mask is not None:
135 | # Apply the attention mask
136 | attn_weights = attn_weights + attention_mask
137 |
138 | attn_weights = nn.Softmax(dim=-1)(attn_weights)
139 | attn_weights = attn_weights.to(value.dtype)
140 | attn_weights = self.attn_dropout(attn_weights)
141 |
142 | # Mask heads if we want to
143 | if head_mask is not None:
144 | attn_weights = attn_weights * head_mask
145 |
146 | attn_output = torch.matmul(attn_weights, value)
147 |
148 | return attn_output, attn_weights
149 |
150 | def forward(
151 | self,
152 | hidden_states,
153 | attention_mask=None,
154 | layer_past=None,
155 | head_mask=None,
156 | use_cache=False,
157 | output_attentions=False,
158 | ):
159 |
160 | qkv = self.qkv_proj(hidden_states)
161 | # TODO(enijkamp): factor out number of logical TPU-v4 cores or make forward pass agnostic
162 | mp_num = 4
163 | qkv_split = qkv.reshape(qkv.shape[:-1] + (mp_num, -1))
164 |
165 | local_dim = self.head_dim * self.num_attention_heads // mp_num
166 | query, value, key = torch.split(qkv_split, local_dim, dim=-1)
167 | query = self._split_heads(query, self.num_attention_heads, self.head_dim, mp_num=mp_num)
168 | key = self._split_heads(key, self.num_attention_heads, self.head_dim, mp_num=mp_num)
169 |
170 | value = self._split_heads(value, self.num_attention_heads, self.head_dim, mp_num=mp_num)
171 | value = value.permute(0, 2, 1, 3)
172 |
173 | seq_len = key.shape[1]
174 | offset = 0
175 |
176 | if layer_past is not None:
177 | offset = layer_past[0].shape[-2]
178 | seq_len += offset
179 |
180 | if self.rotary_dim is not None:
181 | k_rot = key[:, :, :, : self.rotary_dim]
182 | k_pass = key[:, :, :, self.rotary_dim :]
183 |
184 | q_rot = query[:, :, :, : self.rotary_dim]
185 | q_pass = query[:, :, :, self.rotary_dim :]
186 |
187 | sincos = fixed_pos_embedding(k_rot, 1, seq_len=seq_len)
188 | k_rot = apply_rotary_pos_emb(k_rot, sincos, offset=offset)
189 | q_rot = apply_rotary_pos_emb(q_rot, sincos, offset=offset)
190 |
191 | key = torch.cat([k_rot, k_pass], dim=-1)
192 | query = torch.cat([q_rot, q_pass], dim=-1)
193 | else:
194 | sincos = fixed_pos_embedding(key, 1, seq_len=seq_len)
195 | key = apply_rotary_pos_emb(key, sincos, offset=offset)
196 | query = apply_rotary_pos_emb(query, sincos, offset=offset)
197 |
198 | key = key.permute(0, 2, 1, 3)
199 | query = query.permute(0, 2, 1, 3)
200 |
201 | if layer_past is not None:
202 | past_key = layer_past[0]
203 | past_value = layer_past[1]
204 | key = torch.cat((past_key, key), dim=-2)
205 | value = torch.cat((past_value, value), dim=-2)
206 |
207 | if use_cache is True:
208 | present = (key, value)
209 | else:
210 | present = None
211 |
212 | # compute self-attention: V x Softmax(QK^T)
213 | attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
214 |
215 | attn_output = self._merge_heads(attn_output, self.num_attention_heads, self.head_dim)
216 |
217 | attn_output = self.out_proj(attn_output)
218 | attn_output = self.resid_dropout(attn_output)
219 |
220 | outputs = (attn_output, present)
221 | if output_attentions:
222 | outputs += (attn_weights,)
223 |
224 | return outputs # a, present, (attentions)
225 |
226 |
227 | class CodeGenMLP(nn.Module):
228 | def __init__(self, intermediate_size, config): # in MLP: intermediate_size = 4 * embed_dim
229 | super().__init__()
230 | embed_dim = config.n_embd
231 |
232 | self.fc_in = nn.Linear(embed_dim, intermediate_size)
233 | self.fc_out = nn.Linear(intermediate_size, embed_dim)
234 |
235 | self.act = ACT2FN[config.activation_function]
236 | self.dropout = nn.Dropout(config.resid_pdrop)
237 |
238 | def forward(self, hidden_states):
239 | hidden_states = self.fc_in(hidden_states)
240 | hidden_states = self.act(hidden_states)
241 | hidden_states = self.fc_out(hidden_states)
242 | hidden_states = self.dropout(hidden_states)
243 | return hidden_states
244 |
245 |
246 | class CodeGenBlock(nn.Module):
247 | def __init__(self, config):
248 | super().__init__()
249 | inner_dim = config.n_inner if config.n_inner is not None else 4 * config.n_embd
250 | self.ln_1 = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
251 | self.attn = CodeGenAttention(config)
252 | self.mlp = CodeGenMLP(inner_dim, config)
253 |
254 | def forward(
255 | self,
256 | hidden_states,
257 | layer_past=None,
258 | attention_mask=None,
259 | head_mask=None,
260 | use_cache=False,
261 | output_attentions=False,
262 | ):
263 | residual = hidden_states
264 | hidden_states = self.ln_1(hidden_states)
265 | attn_outputs = self.attn(
266 | hidden_states,
267 | layer_past=layer_past,
268 | attention_mask=attention_mask,
269 | head_mask=head_mask,
270 | use_cache=use_cache,
271 | output_attentions=output_attentions,
272 | )
273 | attn_output = attn_outputs[0] # output_attn: a, present, (attentions)
274 | outputs = attn_outputs[1:]
275 |
276 | feed_forward_hidden_states = self.mlp(hidden_states)
277 | hidden_states = attn_output + feed_forward_hidden_states + residual
278 |
279 | if use_cache:
280 | outputs = (hidden_states,) + outputs
281 | else:
282 | outputs = (hidden_states,) + outputs[1:]
283 |
284 | return outputs # hidden_states, present, (attentions)
285 |
286 |
287 | class CodeGenPreTrainedModel(PreTrainedModel):
288 | """
289 | An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
290 | models.
291 | """
292 |
293 | config_class = CodeGenConfig
294 | base_model_prefix = "transformer"
295 | is_parallelizable = True
296 |
297 | def __init__(self, *inputs, **kwargs):
298 | super().__init__(*inputs, **kwargs)
299 |
300 | def _init_weights(self, module):
301 | """Initialize the weights."""
302 | if isinstance(module, (nn.Linear,)):
303 | # Slightly different from Mesh Transformer JAX which uses truncated_normal for initialization
304 | # cf https://github.com/pytorch/pytorch/pull/5617
305 | module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
306 | if module.bias is not None:
307 | module.bias.data.zero_()
308 | elif isinstance(module, nn.Embedding):
309 | module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
310 | if module.padding_idx is not None:
311 | module.weight.data[module.padding_idx].zero_()
312 | elif isinstance(module, nn.LayerNorm):
313 | module.bias.data.zero_()
314 | module.weight.data.fill_(1.0)
315 |
316 |
317 | class CodeGenModel(CodeGenPreTrainedModel):
318 | def __init__(self, config):
319 | super().__init__(config)
320 |
321 | self.embed_dim = config.n_embd
322 | self.vocab_size = config.vocab_size
323 | self.wte = nn.Embedding(config.vocab_size, self.embed_dim)
324 | self.drop = nn.Dropout(config.embd_pdrop)
325 | self.h = nn.ModuleList([CodeGenBlock(config) for _ in range(config.n_layer)])
326 | self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)
327 | self.rotary_dim = min(config.rotary_dim, config.n_ctx // config.num_attention_heads)
328 | self.init_weights()
329 |
330 | # Model parallel
331 | self.model_parallel = False
332 | self.device_map = None
333 |
334 |
335 | def parallelize(self, device_map=None):
336 | # Check validity of device_map
337 | self.device_map = (
338 | get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map
339 | )
340 | assert_device_map(self.device_map, len(self.h))
341 | self.model_parallel = True
342 | self.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + str(min(self.device_map.keys()))
343 | self.last_device = "cuda:" + str(max(self.device_map.keys()))
344 | self.wte = self.wte.to(self.first_device)
345 | # Load onto devices
346 | for k, v in self.device_map.items():
347 | for block in v:
348 | cuda_device = "cuda:" + str(k)
349 | self.h[block] = self.h[block].to(cuda_device)
350 | # ln_f to last
351 | self.ln_f = self.ln_f.to(self.last_device)
352 |
353 |
354 | def deparallelize(self):
355 | self.model_parallel = False
356 | self.device_map = None
357 | self.first_device = "cpu"
358 | self.last_device = "cpu"
359 | self.wte = self.wte.to("cpu")
360 | for index in range(len(self.h)):
361 | self.h[index] = self.h[index].to("cpu")
362 | self.ln_f = self.ln_f.to("cpu")
363 | torch.cuda.empty_cache()
364 |
365 | def get_input_embeddings(self):
366 | return self.wte
367 |
368 | def set_input_embeddings(self, new_embeddings):
369 | self.wte = new_embeddings
370 |
371 | def forward(
372 | self,
373 | input_ids=None,
374 | past_key_values=None,
375 | attention_mask=None,
376 | token_type_ids=None,
377 | position_ids=None,
378 | head_mask=None,
379 | inputs_embeds=None,
380 | use_cache=None,
381 | output_attentions=None,
382 | output_hidden_states=None,
383 | return_dict=None,
384 | ):
385 | output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
386 | output_hidden_states = (
387 | output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
388 | )
389 | use_cache = use_cache if use_cache is not None else self.config.use_cache
390 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
391 |
392 | if input_ids is not None and inputs_embeds is not None:
393 | raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
394 | elif input_ids is not None:
395 | input_shape = input_ids.size()
396 | input_ids = input_ids.view(-1, input_shape[-1])
397 | batch_size = input_ids.shape[0]
398 | elif inputs_embeds is not None:
399 | input_shape = inputs_embeds.size()[:-1]
400 | batch_size = inputs_embeds.shape[0]
401 | else:
402 | raise ValueError("You have to specify either input_ids or inputs_embeds")
403 |
404 | device = input_ids.device if input_ids is not None else inputs_embeds.device
405 |
406 | if token_type_ids is not None:
407 | token_type_ids = token_type_ids.view(-1, input_shape[-1])
408 |
409 | if position_ids is not None:
410 | position_ids = position_ids.view(-1, input_shape[-1])
411 |
412 | if past_key_values is None:
413 | past_length = 0
414 | past_key_values = tuple([None] * len(self.h))
415 | else:
416 | past_length = past_key_values[0][0].size(-2)
417 |
418 | if position_ids is None:
419 | position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
420 | position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
421 |
422 | # Attention mask.
423 | if attention_mask is not None:
424 | assert batch_size > 0, "batch_size has to be defined and > 0"
425 | attention_mask = attention_mask.view(batch_size, -1)
426 | # We create a 3D attention mask from a 2D tensor mask.
427 | # Sizes are [batch_size, 1, 1, to_seq_length]
428 | # So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
429 | # this attention mask is simpler than the triangular masking of causal attention
430 | # used in OpenAI GPT; we just need to prepare the broadcast dimension here.
431 | attention_mask = attention_mask[:, None, None, :]
432 |
433 | # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
434 | # masked positions, this operation will create a tensor which is 0.0 for
435 | # positions we want to attend and -10000.0 for masked positions.
436 | # Since we are adding it to the raw scores before the softmax, this is
437 | # effectively the same as removing these entirely.
438 | attention_mask = attention_mask.to(dtype=self.dtype) # fp16 compatibility
439 | attention_mask = (1.0 - attention_mask) * -10000.0
440 |
441 | # Prepare head mask if needed
442 | # 1.0 in head_mask indicates we keep the head
443 | # attention_probs has shape bsz x num_attention_heads x N x N
444 | # head_mask has shape n_layer x batch x num_attention_heads x N x N
445 | head_mask = self.get_head_mask(head_mask, self.config.n_layer)
446 |
447 | if inputs_embeds is None:
448 | inputs_embeds = self.wte(input_ids)
449 |
450 | hidden_states = inputs_embeds
451 |
452 | if token_type_ids is not None:
453 | token_type_embeds = self.wte(token_type_ids)
454 | hidden_states = hidden_states + token_type_embeds
455 |
456 | hidden_states = self.drop(hidden_states)
457 |
458 | output_shape = input_shape + (hidden_states.size(-1),)
459 |
460 | presents = () if use_cache else None
461 | all_self_attentions = () if output_attentions else None
462 | all_hidden_states = () if output_hidden_states else None
463 | for i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):
464 |
465 | # Model parallel
466 | if self.model_parallel:
467 | torch.cuda.set_device(hidden_states.device)
468 | # Ensure layer_past is on same device as hidden_states (might not be correct)
469 | if layer_past is not None:
470 | layer_past = tuple(past_state.to(hidden_states.device) for past_state in layer_past)
471 | # Ensure that attention_mask is always on the same device as hidden_states
472 | if attention_mask is not None:
473 | attention_mask = attention_mask.to(hidden_states.device)
474 | if isinstance(head_mask, torch.Tensor):
475 | head_mask = head_mask.to(hidden_states.device)
476 | if output_hidden_states:
477 | all_hidden_states = all_hidden_states + (hidden_states,)
478 |
479 | if getattr(self.config, "gradient_checkpointing", False) and self.training:
480 |
481 | if use_cache:
482 | logger.warning(
483 | "`use_cache=True` is incompatible with `config.gradient_checkpointing=True`. Setting "
484 | "`use_cache=False`..."
485 | )
486 | use_cache = False
487 |
488 | def create_custom_forward(module):
489 | def custom_forward(*inputs):
490 | # None for past_key_value
491 | return module(*inputs, use_cache, output_attentions)
492 |
493 | return custom_forward
494 |
495 | outputs = torch.utils.checkpoint.checkpoint(
496 | create_custom_forward(block),
497 | hidden_states,
498 | None,
499 | attention_mask,
500 | head_mask[i],
501 | )
502 | else:
503 | outputs = block(
504 | hidden_states,
505 | layer_past=layer_past,
506 | attention_mask=attention_mask,
507 | head_mask=head_mask[i],
508 | use_cache=use_cache,
509 | output_attentions=output_attentions,
510 | )
511 |
512 | hidden_states = outputs[0]
513 | if use_cache is True:
514 | presents = presents + (outputs[1],)
515 |
516 | if output_attentions:
517 | all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)
518 |
519 | # Model Parallel: If it's the last layer for that device, put things on the next device
520 | if self.model_parallel:
521 | for k, v in self.device_map.items():
522 | if i == v[-1] and "cuda:" + str(k) != self.last_device:
523 | hidden_states = hidden_states.to("cuda:" + str(k + 1))
524 |
525 | hidden_states = self.ln_f(hidden_states)
526 |
527 | hidden_states = hidden_states.view(*output_shape)
528 | # Add last hidden state
529 | if output_hidden_states:
530 | all_hidden_states = all_hidden_states + (hidden_states,)
531 |
532 | if not return_dict:
533 | return tuple(v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None)
534 |
535 | return BaseModelOutputWithPast(
536 | last_hidden_state=hidden_states,
537 | past_key_values=presents,
538 | hidden_states=all_hidden_states,
539 | attentions=all_self_attentions,
540 | )
541 |
542 |
543 | class CodeGenForCausalLM(CodeGenPreTrainedModel):
544 | _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.masked_bias", r"h\.\d+\.attn\.bias", r"lm_head\.weight"]
545 |
546 | def __init__(self, config):
547 | super().__init__(config)
548 | self.transformer = CodeGenModel(config)
549 | self.lm_head = nn.Linear(config.n_embd, config.vocab_size)
550 | self.init_weights()
551 |
552 | # Model parallel
553 | self.model_parallel = False
554 | self.device_map = None
555 |
556 | def parallelize(self, device_map=None):
557 | self.device_map = (
558 | get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))
559 | if device_map is None
560 | else device_map
561 | )
562 | assert_device_map(self.device_map, len(self.transformer.h))
563 | self.transformer.parallelize(self.device_map)
564 | self.lm_head = self.lm_head.to(self.transformer.first_device)
565 | self.model_parallel = True
566 |
567 | def deparallelize(self):
568 | self.transformer.deparallelize()
569 | self.transformer = self.transformer.to("cpu")
570 | self.lm_head = self.lm_head.to("cpu")
571 | self.model_parallel = False
572 | torch.cuda.empty_cache()
573 |
574 | def get_output_embeddings(self):
575 | return None
576 |
577 | def set_output_embeddings(self, new_embeddings):
578 | return
579 |
580 | def prepare_inputs_for_generation(self, input_ids, past=None, **kwargs):
581 | token_type_ids = kwargs.get("token_type_ids", None)
582 | # only keep the last token of input_ids if past is defined in kwargs
583 | if past:
584 | input_ids = input_ids[:, -1].unsqueeze(-1)
585 | if token_type_ids is not None:
586 | token_type_ids = token_type_ids[:, -1].unsqueeze(-1)
587 |
588 | attention_mask = kwargs.get("attention_mask", None)
589 | position_ids = kwargs.get("position_ids", None)
590 |
591 | if attention_mask is not None and position_ids is None:
592 | # create position_ids on the fly for batch generation
593 | position_ids = attention_mask.long().cumsum(-1) - 1
594 | position_ids.masked_fill_(attention_mask == 0, 1)
595 | if past:
596 | position_ids = position_ids[:, -1].unsqueeze(-1)
597 | else:
598 | position_ids = None
599 | return {
600 | "input_ids": input_ids,
601 | "past_key_values": past,
602 | "use_cache": kwargs.get("use_cache"),
603 | "position_ids": position_ids,
604 | "attention_mask": attention_mask,
605 | "token_type_ids": token_type_ids,
606 | }
607 |
608 | def forward(
609 | self,
610 | input_ids=None,
611 | past_key_values=None,
612 | attention_mask=None,
613 | token_type_ids=None,
614 | position_ids=None,
615 | head_mask=None,
616 | inputs_embeds=None,
617 | labels=None,
618 | use_cache=None,
619 | output_attentions=None,
620 | output_hidden_states=None,
621 | return_dict=None,
622 | ):
623 | r"""
624 | labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
625 | Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
626 | ``labels = input_ids``. Indices are selected in ``[-100, 0, ..., config.vocab_size]``. All labels set to
627 | ``-100`` are ignored (masked); the loss is only computed for labels in ``[0, ..., config.vocab_size]``.
628 | """
629 | return_dict = return_dict if return_dict is not None else self.config.use_return_dict
630 |
631 | transformer_outputs = self.transformer(
632 | input_ids,
633 | past_key_values=past_key_values,
634 | attention_mask=attention_mask,
635 | token_type_ids=token_type_ids,
636 | position_ids=position_ids,
637 | head_mask=head_mask,
638 | inputs_embeds=inputs_embeds,
639 | use_cache=use_cache,
640 | output_attentions=output_attentions,
641 | output_hidden_states=output_hidden_states,
642 | return_dict=return_dict,
643 | )
644 | hidden_states = transformer_outputs[0]
645 |
646 | # Set device for model parallelism
647 | if self.model_parallel:
648 | torch.cuda.set_device(self.transformer.first_device)
649 | hidden_states = hidden_states.to(self.lm_head.weight.device)
650 |
651 | # make sure sampling in fp16 works correctly and
652 | # compute loss in fp32 to match the mesh-tf version
653 | # https://github.com/EleutherAI/gpt-neo/blob/89ce74164da2fb16179106f54e2269b5da8db333/models/gpt2/gpt2.py#L179
654 | lm_logits = self.lm_head(hidden_states).to(torch.float32)
655 |
656 | loss = None
657 | if labels is not None:
658 | # Shift so that tokens < n predict n
659 | shift_logits = lm_logits[..., :-1, :].contiguous()
660 | shift_labels = labels[..., 1:].contiguous()
661 | # Flatten the tokens
662 | loss_fct = CrossEntropyLoss()
663 | loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
664 |
665 | loss = loss.to(hidden_states.dtype)
666 |
667 | if not return_dict:
668 | output = (lm_logits,) + transformer_outputs[1:]
669 | return ((loss,) + output) if loss is not None else output
670 |
671 | return CausalLMOutputWithPast(
672 | loss=loss,
673 | logits=lm_logits,
674 | past_key_values=transformer_outputs.past_key_values,
675 | hidden_states=transformer_outputs.hidden_states,
676 | attentions=transformer_outputs.attentions,
677 | )
678 |
679 | @staticmethod
680 | def _reorder_cache(past: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:
681 | """
682 | This function is used to re-order the :obj:`past_key_values` cache if
683 | :meth:`~transformers.PretrainedModel.beam_search` or :meth:`~transformers.PretrainedModel.beam_sample` is
684 | called. This is required to match :obj:`past_key_values` with the correct beam_idx at every generation step.
685 | """
686 | return tuple(
687 | tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
688 | for layer_past in past
689 | )
690 |
--------------------------------------------------------------------------------
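Editor's note: the attention forward pass above applies GPT-J-style rotary position embeddings via `fixed_pos_embedding` and `apply_rotary_pos_emb` (defined earlier in `modeling_codegen.py`). The sketch below is an illustrative NumPy reimplementation of that mechanism, not the repo's code: the real helpers operate on torch tensors and take slightly different arguments (e.g. `fixed_pos_embedding` receives a tensor rather than a dimension).

```python
import numpy as np

def fixed_pos_embedding(dim, seq_len):
    # inverse frequencies 1 / 10000^(2i/dim), one per pair of channels
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    sinusoid = np.einsum("i,j->ij", np.arange(seq_len, dtype=np.float64), inv_freq)
    return np.sin(sinusoid), np.cos(sinusoid)

def rotate_every_two(x):
    # (x0, x1, x2, x3, ...) -> (-x1, x0, -x3, x2, ...)
    x1 = x[..., ::2]
    x2 = x[..., 1::2]
    return np.stack((-x2, x1), axis=-1).reshape(x.shape)

def apply_rotary_pos_emb(x, sincos, offset=0):
    # x: [batch, seq, heads, rotary_dim]; duplicate sin/cos across channel pairs
    sin, cos = (np.repeat(t[offset : x.shape[1] + offset], 2, axis=-1)[None, :, None, :]
                for t in sincos)
    return x * cos + rotate_every_two(x) * sin
```

Each channel pair is rotated by a position-dependent angle, so the transform preserves vector norms, and the query-key dot product after rotation depends only on the relative offset between positions — which is why the code can rotate keys and queries with an `offset` when reusing a `layer_past` cache.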
/codegen1/jaxformer/hf/sample.py:
--------------------------------------------------------------------------------
1 | # Copyright (c) 2022, salesforce.com, inc.
2 | # All rights reserved.
3 | # SPDX-License-Identifier: BSD-3-Clause
4 | # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
5 |
6 | import os
7 | import re
8 | import time
9 | import random
10 | import argparse
11 |
12 | import torch
13 |
14 | from transformers import GPT2TokenizerFast
15 | from jaxformer.hf.codegen.modeling_codegen import CodeGenForCausalLM
16 |
17 |
18 |
19 | ########################################################################
20 | # util
21 |
22 |
23 | class print_time:
24 | def __init__(self, desc):
25 | self.desc = desc
26 |
27 | def __enter__(self):
28 | print(self.desc)
29 | self.t = time.time()
30 |
31 | def __exit__(self, type, value, traceback):
32 | print(f'{self.desc} took {time.time()-self.t:.02f}s')
33 |
34 |
35 | def set_env():
36 | os.environ['TOKENIZERS_PARALLELISM'] = 'false'
37 |
38 |
39 | def set_seed(seed, deterministic=True):
40 | random.seed(seed)
41 | os.environ['PYTHONHASHSEED'] = str(seed)
42 | torch.manual_seed(seed)
43 | if torch.cuda.is_available():
44 | torch.cuda.manual_seed(seed)
45 | torch.backends.cudnn.deterministic = deterministic
46 | torch.backends.cudnn.benchmark = not deterministic
47 | # torch.use_deterministic_algorithms(deterministic)
48 |
49 |
50 | def cast(model, fp16=True):
51 | if fp16:
52 | model.half()
53 | return model
54 |
55 |
56 |
57 | ########################################################################
58 | # model
59 |
60 |
61 | def create_model(ckpt, fp16=True):
62 | if fp16:
63 | return CodeGenForCausalLM.from_pretrained(ckpt, revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
64 | else:
65 | return CodeGenForCausalLM.from_pretrained(ckpt)
66 |
67 |
68 | def create_tokenizer():
69 | t = GPT2TokenizerFast.from_pretrained('gpt2')
70 | t.max_model_input_sizes['gpt2'] = 1e20
71 | return t
72 |
73 |
74 | def include_whitespace(t, n_min=2, n_max=20, as_special_tokens=False):
75 | t.add_tokens([' ' * n for n in reversed(range(n_min, n_max))], special_tokens=as_special_tokens)
76 | return t
77 |
78 |
79 | def include_tabs(t, n_min=2, n_max=20, as_special_tokens=False):
80 | t.add_tokens(['\t' * n for n in reversed(range(n_min, n_max))], special_tokens=as_special_tokens)
81 | return t
82 |
83 |
84 | def create_custom_gpt2_tokenizer():
85 | t = create_tokenizer()
86 | t = include_whitespace(t=t, n_min=2, n_max=32, as_special_tokens=False)
87 | t = include_tabs(t=t, n_min=2, n_max=10, as_special_tokens=False)
88 | return t
89 |
90 |
91 | ########################################################################
92 | # sample
93 |
94 | def sample(
95 | device,
96 | model,
97 | tokenizer,
98 | context,
99 | pad_token_id,
100 | num_return_sequences=1,
101 | temp=0.2,
102 | top_p=0.95,
103 | max_length_sample=128,
104 | max_length=2048
105 | ):
106 |
107 | input_ids = tokenizer(
108 | context,
109 | truncation=True,
110 | padding=True,
111 | max_length=max_length,
112 | return_tensors='pt',
113 | ).input_ids
114 |
115 | input_ids_len = input_ids.shape[1]
116 | assert input_ids_len < max_length
117 |
118 | with torch.no_grad():
119 | input_ids = input_ids.to(device)
120 | tokens = model.generate(
121 | input_ids,
122 | do_sample=True,
123 | num_return_sequences=num_return_sequences,
124 | temperature=temp,
125 | max_length=input_ids_len + max_length_sample,
126 | top_p=top_p,
127 | pad_token_id=pad_token_id,
128 | use_cache=True,
129 | )
130 | text = tokenizer.batch_decode(tokens[:, input_ids_len:])
131 |
132 | return text
133 |
134 |
135 | def truncate(completion):
136 |
137 | def find_re(string, pattern, start_pos):
138 | m = pattern.search(string, start_pos)
139 | return m.start() if m else -1
140 |
141 | terminals = [
142 | re.compile(r, re.MULTILINE)
143 | for r in
144 | [
145 | '^#',
146 | re.escape('<|endoftext|>'),
147 | "^'''",
148 | '^"""',
149 | '\n\n\n'
150 | ]
151 | ]
152 |
153 | prints = list(re.finditer('^print', completion, re.MULTILINE))
154 | if len(prints) > 1:
155 | completion = completion[:prints[1].start()]
156 |
157 | defs = list(re.finditer('^def', completion, re.MULTILINE))
158 | if len(defs) > 1:
159 | completion = completion[:defs[1].start()]
160 |
161 | start_pos = 0
162 |
163 | terminals_pos = [pos for pos in [find_re(completion, terminal, start_pos) for terminal in terminals] if pos != -1]
164 | if len(terminals_pos) > 0:
165 | return completion[:min(terminals_pos)]
166 | else:
167 | return completion
168 |
169 |
170 | def test_truncate():
171 |
172 | assert truncate('\nif len_a > len_b:\n result = a\nelse:\n result = b\n\n\n\n#') == '\nif len_a > len_b:\n result = a\nelse:\n result = b'
173 |
174 |
175 |
176 | ########################################################################
177 | # main
178 |
179 |
180 | def main():
181 |
182 | # (0) constants
183 |
184 | models_nl = ['codegen-350M-nl', 'codegen-2B-nl', 'codegen-6B-nl', 'codegen-16B-nl']
185 | models_pl = ['codegen-350M-multi', 'codegen-2B-multi', 'codegen-6B-multi', 'codegen-16B-multi', 'codegen-350M-mono', 'codegen-2B-mono', 'codegen-6B-mono', 'codegen-16B-mono']
186 | models = models_nl + models_pl
187 |
188 |
189 | # (1) params
190 |
191 | parser = argparse.ArgumentParser()
192 | parser.add_argument('--model', type=str, choices=models, default='codegen-350M-mono')
193 | parser.add_argument('--device', type=str, default='cuda:0')
194 | parser.add_argument('--rng-seed', type=int, default=42)
195 | parser.add_argument('--rng-deterministic', type=bool, default=True)
196 | parser.add_argument('--p', type=float, default=0.95)
197 | parser.add_argument('--t', type=float, default=0.2)
198 | parser.add_argument('--max-length', type=int, default=128)
199 | parser.add_argument('--batch-size', type=int, default=1)
200 | parser.add_argument('--no-fp16', action="store_true")
201 | parser.add_argument('--pad', type=int, default=50256)
202 | parser.add_argument('--context', type=str, default='def helloworld():')
203 | args = parser.parse_args()
204 |
205 |
206 | # (2) preamble
207 |
208 | set_env()
209 | set_seed(args.rng_seed, deterministic=args.rng_deterministic)
210 | device = torch.device(args.device)
211 |
212 | use_fp16 = True
213 | if (args.no_fp16 or device.type == "cpu"):
214 | use_fp16 = False
215 |
216 | if args.model.startswith("codegen-16B"):
217 | use_fp16 = True
218 |
219 | ckpt = f'./checkpoints/{args.model}'
220 |
221 |
222 | # (3) load
223 |
224 | with print_time('loading parameters'):
225 | model = create_model(ckpt=ckpt, fp16=use_fp16).to(device)
226 |
227 |
228 | with print_time('loading tokenizer'):
229 | if args.model in models_pl:
230 | tokenizer = create_custom_gpt2_tokenizer()
231 | else:
232 | tokenizer = create_tokenizer()
233 | tokenizer.padding_side = 'left'
234 | tokenizer.pad_token_id = args.pad
235 |
236 |
237 | # (4) sample
238 |
239 | with print_time('sampling'):
240 | completion = sample(device=device, model=model, tokenizer=tokenizer, context=args.context, pad_token_id=args.pad, num_return_sequences=args.batch_size, temp=args.t, top_p=args.p, max_length_sample=args.max_length)[0]
241 | truncation = truncate(completion)
242 |
243 | print('=' * 100)
244 | print(completion)
245 | print('=' * 100)
246 | print(args.context + truncation)
247 | print('=' * 100)
248 |
249 |
250 |
251 | if __name__ == '__main__':
252 | test_truncate()
253 | main()
254 | print('done.')
255 |
--------------------------------------------------------------------------------
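Editor's note: batched sampling works here because the sampler sets `tokenizer.padding_side = 'left'` and `prepare_inputs_for_generation` in the modeling file rebuilds position ids from the attention mask (`cumsum(-1) - 1`, then overwriting pad slots). A NumPy sketch of that arithmetic — `positions_from_mask` is a hypothetical name for illustration, not a function from this repo:

```python
import numpy as np

def positions_from_mask(attention_mask):
    # cumulative count of real tokens, shifted to start at 0;
    # padded slots get a dummy position (1), mirroring masked_fill_
    position_ids = attention_mask.cumsum(-1) - 1
    position_ids[attention_mask == 0] = 1
    return position_ids

mask = np.array([[0, 0, 1, 1, 1],   # left-padded sequence
                 [1, 1, 1, 1, 1]])  # full-length sequence
# positions_from_mask(mask) -> [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]
```

The left-padded row starts counting at its first real token, so rotary embeddings see the same positions a non-padded sequence would; the dummy positions on pad slots are harmless because those scores are masked out anyway.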
/codegen1/jaxformer/hf/train_deepspeed.py:
--------------------------------------------------------------------------------
1 | # Minimal example of training the 16B checkpoint on GPU with CPU offloading using DeepSpeed.
2 |
3 | '''
4 | apt install python3.8 python3.8-venv python3.8-dev
5 |
6 | python3.8 -m venv .venv
7 | source .venv/bin/activate
8 | pip install --upgrade pip setuptools
9 | pip install torch --extra-index-url https://download.pytorch.org/whl/cu113
10 | pip install transformers==4.21.1 datasets==1.16.1 deepspeed==0.7.0
11 |
12 | deepspeed --num_gpus=1 train_deepspeed.py
13 | '''
14 |
15 | ########################################################################################################
16 | ## imports
17 |
18 | import os
19 | import argparse
20 | import random
21 | import math
22 |
23 | from time import time
24 |
25 | import numpy as np
26 |
27 | import torch
28 |
29 | from transformers import AutoConfig, AutoModelForCausalLM
30 |
31 | import deepspeed
32 |
33 |
34 | ########################################################################################################
35 | ## args
36 |
37 | DEEPSPEED_CONFIG = \
38 | {
39 | 'fp16': {'enabled': True, 'loss_scale': 0, 'loss_scale_window': 1000, 'initial_scale_power': 12, 'hysteresis': 2, 'min_loss_scale': 1},
40 | 'optimizer': {'type': 'AdamW', 'params': {'lr': 1e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}},
41 | 'scheduler': {'type': 'WarmupLR', 'params': {'warmup_min_lr': 0, 'warmup_max_lr': 1e-05, 'warmup_num_steps': 100}},
42 | 'zero_optimization': {
43 | 'stage': 3,
44 | 'offload_optimizer': {'device': 'cpu', 'pin_memory': False},
45 | 'offload_param': {'device': 'cpu', 'pin_memory': False},
46 | 'overlap_comm': True,
47 | 'contiguous_gradients': True,
48 | 'sub_group_size': 1e9,
49 | 'reduce_bucket_size': 16777216,
50 | 'stage3_prefetch_bucket_size': 15099494.4,
51 | 'stage3_param_persistence_threshold': 40960,
52 | 'stage3_max_live_parameters': 1e9,
53 | 'stage3_max_reuse_distance': 1e9,
54 | 'stage3_gather_fp16_weights_on_model_save': True
55 | },
56 | 'train_batch_size': 32,
57 | 'train_micro_batch_size_per_gpu': 2,
58 | 'gradient_accumulation_steps': 16,
59 | 'gradient_clipping': 1.0,
60 | 'steps_per_print': 8,
61 | 'wall_clock_breakdown': False,
62 | 'compression_training': {'weight_quantization': {'shared_parameters': {}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {}, 'different_groups': {}}}
63 | }
64 |
65 |
66 | def create_args(args=argparse.Namespace()):
67 |
68 | args.seed = 42
69 |
70 | args.model = 'Salesforce/codegen-16B-mono'
71 |
72 | args.deepspeed_config = DEEPSPEED_CONFIG
73 |
74 | args.opt_steps_train = 1000
75 |
76 | return args
77 |
78 |
79 |
80 | ########################################################################################################
81 | ## train
82 |
83 | def train(args):
84 |
85 | #######################
86 | ## preamble
87 |
88 | set_seed(args.seed)
89 |
90 |
91 | #######################
92 | ## model
93 |
94 | print('initializing model')
95 |
96 | config = AutoConfig.from_pretrained(args.model)
97 | config.gradient_checkpointing = True
98 | config.use_cache = False
99 |
100 | model = AutoModelForCausalLM.from_pretrained(args.model, config=config)
101 |
102 | model.train()
103 | # TODO(enijkamp): we need to set this flag twice?
104 | model.gradient_checkpointing_enable()
105 |
106 |
107 | #######################
108 | ## deepspeed
109 |
110 | print('initializing deepspeed')
111 |
112 | model_parameters = list(filter(lambda p: p.requires_grad, model.parameters()))
113 | model_engine, optimizer, _, _ = deepspeed.initialize(config=args.deepspeed_config, model=model, model_parameters=model_parameters)
114 |
115 | torch.cuda.empty_cache()
116 |
117 |
118 | #######################
119 | ## train
120 |
121 | print('starting training')
122 |
123 | input_ids = torch.randint(low=0, high=10, size=[args.deepspeed_config['train_micro_batch_size_per_gpu'], 1024], dtype=torch.int64).cuda()
124 |
125 | for step in range(args.opt_steps_train+1):
126 |
127 | loss = model_engine(input_ids=input_ids, labels=input_ids).loss
128 |
129 | model_engine.backward(loss)
130 | model_engine.step()
131 |
132 | print(f'{step} {loss:8.3f}')
133 |
134 |
135 |
136 | ########################################################################################################
137 | ## preamble
138 |
139 | def set_gpus(gpu):
140 | torch.cuda.set_device(gpu)
141 |
142 |
143 | def set_seed(seed):
144 | os.environ['PYTHONHASHSEED'] = str(seed)
145 | random.seed(seed)
146 | np.random.seed(seed)
147 | torch.manual_seed(seed)
148 | if torch.cuda.is_available():
149 | torch.cuda.manual_seed(seed)
150 | torch.cuda.manual_seed_all(seed)
151 |
152 |
153 | def set_cuda(deterministic=True):
154 | if torch.cuda.is_available():
155 | torch.backends.cudnn.deterministic = deterministic
156 | torch.backends.cudnn.benchmark = not deterministic
157 |
158 |
159 | def get_exp_id(file):
160 | return os.path.splitext(os.path.basename(file))[0]
161 |
162 |
163 | def get_output_dir(exp_id):
164 | import datetime
165 | t = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
166 | output_dir = os.path.join('output/' + exp_id, t)
167 | return output_dir
168 |
169 |
170 | def copy_source(file, output_dir):
171 | import shutil
172 | shutil.copyfile(file, os.path.join(output_dir, os.path.basename(file)))
173 |
174 |
175 |
176 |
177 | ########################################################################################################
178 | ## main
179 |
180 | def main():
181 |
182 | # preamble
183 | exp_id = get_exp_id(__file__)
184 | output_dir = get_output_dir(exp_id)
185 |
186 | # args
187 | args = create_args()
188 | args.output_dir = output_dir
189 | args.exp_id = exp_id
190 |
191 | # output
192 | os.makedirs(args.output_dir, exist_ok=True)
193 | copy_source(__file__, args.output_dir)
194 |
195 | # train
196 | train(args=args)
197 |
198 |
199 |
200 | if __name__ == '__main__':
201 | main()
--------------------------------------------------------------------------------
/codegen1/requirements.txt:
--------------------------------------------------------------------------------
1 | --find-links https://download.pytorch.org/whl/torch_stable.html
2 | torch==1.13.1
3 | transformers==4.30.0
--------------------------------------------------------------------------------
/codegen2/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/codegen2/README.md:
--------------------------------------------------------------------------------
1 | # CodeGen2
2 |
3 | Official research release for the **CodeGen2** models (`1B`, `3.7B`, `7B`, `16B`) for **Program Synthesis** as presented in ICLR 2023:
4 |
5 | *Title*: [CodeGen2: Lessons for Training LLMs on Programming and Natural Languages](https://arxiv.org/abs/2305.02309)
6 |
7 | *Authors*: [Erik Nijkamp](https://enijkamp.github.io/)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en), [Silvio Savarese](https://scholar.google.com/citations?user=ImpbxLsAAAAJ&hl=en), and [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en) (* indicates equal contribution)
8 |
9 | ## Hugging Face Integration
10 |
11 | Model checkpoints are published on the Hugging Face Hub.
12 |
13 | * [CodeGen2-1B](https://huggingface.co/Salesforce/codegen2-1B)
14 | * [CodeGen2-3.7B](https://huggingface.co/Salesforce/codegen2-3_7B)
15 | * [CodeGen2-7B](https://huggingface.co/Salesforce/codegen2-7B)
16 | * [CodeGen2-16B](https://huggingface.co/Salesforce/codegen2-16B)
17 |
18 | Model cards outline how to use the model for causal and infill sampling.
19 |
20 | ## Sampling
21 |
22 | Program synthesis in the form of auto-regressive sampling can be performed as follows:
23 |
24 | ```python
25 | import torch
26 | from transformers import AutoTokenizer, AutoModelForCausalLM
27 |
28 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen2-7B")
29 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen2-7B", trust_remote_code=True, revision="main")
30 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
31 | sample = model.generate(**inputs, max_length=128)
32 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
33 | ```
34 |
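The model cards also cover infill sampling. As a minimal sketch, the infill prompt interleaves prefix and suffix with sentinel tokens (`<mask_1>`, `<|endoftext|>`, `<sep>`), following the scheme described on the CodeGen2 model cards; refer to the model cards for the authoritative format:

```python
def infill_prompt(prefix: str, suffix: str) -> str:
    # Prefix, mask sentinel, suffix, then EOS + <sep> + the same mask:
    # the model then generates the span that belongs at <mask_1>.
    return prefix + "<mask_1>" + suffix + "<|endoftext|>" + "<sep>" + "<mask_1>"

prompt = infill_prompt("def hello():\n    ", "\n    return name\n")
```

Per the model cards, the generated continuation is the infilled span, and decoding should be truncated at the `<eom>` token.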
35 | ## Citation
36 |
37 | ```bibtex
38 | @article{Nijkamp2023codegen2,
39 | title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
40 | author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
41 | journal={ICLR},
42 | year={2023}
43 | }
44 | ```
45 |
--------------------------------------------------------------------------------
/codegen2/requirements.txt:
--------------------------------------------------------------------------------
1 | transformers==4.30.0
2 |
--------------------------------------------------------------------------------
/codegen2/sample.py:
--------------------------------------------------------------------------------
1 | import torch
2 | from transformers import AutoTokenizer, AutoModelForCausalLM
3 | tokenizer = AutoTokenizer.from_pretrained("checkpoints/codegen2-6B")
4 | model = AutoModelForCausalLM.from_pretrained("checkpoints/codegen2-6B", torch_dtype=torch.float16, revision="sharded", trust_remote_code=True)
5 | inputs = tokenizer("# this function prints hello world", return_tensors="pt")
6 | sample = model.generate(**inputs, max_length=128)
7 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
8 |
--------------------------------------------------------------------------------
/codegen25/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/codegen25/README.md:
--------------------------------------------------------------------------------
1 | # CodeGen2.5
2 |
3 | Official research release for the **CodeGen2.5** models for **Program Synthesis**.
4 |
5 | Title: [**CodeGen2.5: Small, but mighty**](https://blog.salesforceairesearch.com/codegen25)
6 |
7 | Authors: [Erik Nijkamp](https://eriknijkamp.com)\*, [Hiroaki Hayashi](https://hiroakih.me/)\*, [Yingbo Zhou](https://scholar.google.com/citations?user=H_6RQ7oAAAAJ&hl=en), [Caiming Xiong](https://scholar.google.com/citations?user=vaSdahkAAAAJ&hl=en) (* equal contribution)
8 |
9 | ## Hugging Face Integration
10 |
11 | Model checkpoints are published on the Hugging Face Hub.
12 |
13 | * [CodeGen2.5-7B-multi](https://huggingface.co/Salesforce/codegen25-7b-multi) (Apache-2.0)
14 | * [CodeGen2.5-7B-mono](https://huggingface.co/Salesforce/codegen25-7b-mono) (Apache-2.0)
15 | * [CodeGen2.5-7B-instruct](https://huggingface.co/Salesforce/codegen25-7b-instruct) (*Research purposes only*)
16 |
17 | Model cards outline how to use the model for causal and infill sampling. Please refer to each model card for more details.
18 |
19 | The models are pre-trained on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), a programming language dataset developed by [BigCode](https://huggingface.co/bigcode).
20 |
21 | ## Requirements
22 |
23 | ```
24 | transformers>=4.29.2
25 | tiktoken==0.4.0
26 | ```
27 |
28 | ## Sampling
29 |
30 | Program synthesis in the form of auto-regressive sampling can be performed as follows:
31 |
32 | ```python
33 | from transformers import AutoTokenizer, AutoModelForCausalLM
34 |
35 | tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen25-7b-mono", trust_remote_code=True)
36 | model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen25-7b-mono")
37 | inputs = tokenizer("def hello_world():", return_tensors="pt")
38 | sample = model.generate(**inputs, max_length=128)
39 | print(tokenizer.decode(sample[0]))
40 | ```
41 |
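Greedy decoding can run past the target function. As a post-processing sketch, the completion can be cut at the first occurrence of a stop pattern, roughly analogous to the `truncate_before_pattern` argument accepted by the CodeGen tokenizers elsewhere in this repository (the helper and patterns below are illustrative, not part of the tokenizer API):

```python
import re

def truncate_before(text: str, patterns: list) -> str:
    # Cut the completion at the earliest match of any pattern (e.g. a new
    # top-level comment or a run of blank lines), keeping the text before it.
    cut = len(text)
    for p in patterns:
        m = re.search(p, text, flags=re.MULTILINE)
        if m:
            cut = min(cut, m.start())
    return text[:cut]

completion = "def hello_world():\n    print('hello')\n\n\n# next snippet"
print(truncate_before(completion, [r"\n\n^#", r"\n\n\n"]))
```

This keeps only the first function body, discarding the trailing comment block that the model generated beyond it.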
42 | ## Citation
43 |
44 | Please cite the CodeGen2 paper:
45 |
46 | ```bibtex
47 | @article{Nijkamp2023codegen2,
48 | title={CodeGen2: Lessons for Training LLMs on Programming and Natural Languages},
49 | author={Nijkamp, Erik and Hayashi, Hiroaki and Xiong, Caiming and Savarese, Silvio and Zhou, Yingbo},
50 | journal={ICLR},
51 | year={2023}
52 | }
53 | ```
54 |
--------------------------------------------------------------------------------
/codegen25/requirements.txt:
--------------------------------------------------------------------------------
1 | transformers>=4.29.2
2 | tiktoken==0.4.0
3 |
--------------------------------------------------------------------------------
/codegen25/sample.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoModelForCausalLM
2 |
3 | tokenizer = AutoTokenizer.from_pretrained("checkpoints/codegen25-7b-multi", trust_remote_code=True)
4 | model = AutoModelForCausalLM.from_pretrained("checkpoints/codegen25-7b-multi")
5 | inputs = tokenizer("def hello_world():", return_tensors="pt")
6 | sample = model.generate(**inputs, max_length=128)
7 | print(tokenizer.decode(sample[0], truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"]))
8 |
--------------------------------------------------------------------------------