├── .github
│   ├── FUNDING.yml
│   └── ISSUE_TEMPLATE
│       ├── bug_report.md
│       ├── custom.md
│       └── feature_request.md
├── CODE_OF_CONDUCT.md
├── LICENSE.txt
├── README.md
├── datasets
│   └── stack.csv
├── examples
│   ├── .ipynb_checkpoints
│   │   └── FewShot-checkpoint.ipynb
│   ├── CosineClassifierDemo.ipynb
│   ├── DLApproach.ipynb
│   ├── DLFewShot.ipynb
│   ├── FewShot.ipynb
│   ├── KNNClassifierDemo.ipynb
│   ├── datasets
│   │   └── stack.csv
│   └── sub.csv
├── fsText
│   ├── CosineClassifier.py
│   ├── KNNClassifier.py
│   ├── RFClassifier.py
│   ├── __init__.py
│   └── __pycache__
│       └── CosineClassifier.cpython-36.pyc
├── requirements.txt
├── resources
│   ├── images
│   │   ├── nlp_fs_4.png
│   │   ├── nlp_fs_6.png
│   │   ├── perf_1.png
│   │   └── perf_2.png
│   └── papers
│       ├── DataAugmentation
│       │   ├── 1804.08166.pdf
│       │   └── 1901.11196.pdf
│       └── FewShot
│           ├── 1710.10280.pdf
│           ├── 1804.02063.pdf
│           └── 1908.08788.pdf
├── setup.cfg
└── setup.py
/.github/FUNDING.yml:
--------------------------------------------------------------------------------
1 | # These are supported funding model platforms
2 |
3 | github: # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
4 | patreon: # Replace with a single Patreon username
5 | open_collective: # Replace with a single Open Collective username
6 | ko_fi: # Replace with a single Ko-fi username
7 | tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
8 | community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
9 | liberapay: # Replace with a single Liberapay username
10 | issuehunt: # Replace with a single IssueHunt username
11 | otechie: # Replace with a single Otechie username
12 | custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']
13 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/bug_report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 |
31 | **Smartphone (please complete the following information):**
32 | - Device: [e.g. iPhone6]
33 | - OS: [e.g. iOS8.1]
34 | - Browser [e.g. stock browser, safari]
35 | - Version [e.g. 22]
36 |
37 | **Additional context**
38 | Add any other context about the problem here.
39 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/custom.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Custom issue template
3 | about: Describe this issue template's purpose here.
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 |
11 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to making participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, sex characteristics, gender identity and expression,
9 | level of experience, education, socio-economic status, nationality, personal
10 | appearance, race, religion, or sexual identity and orientation.
11 |
12 | ## Our Standards
13 |
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 |
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 |
23 | Examples of unacceptable behavior by participants include:
24 |
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 | advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 | address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 | professional setting
33 |
34 | ## Our Responsibilities
35 |
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 |
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 |
46 | ## Scope
47 |
48 | This Code of Conduct applies both within project spaces and in public spaces
49 | when an individual is representing the project or its community. Examples of
50 | representing a project or community include using an official project e-mail
51 | address, posting via an official social media account, or acting as an appointed
52 | representative at an online or offline event. Representation of a project may be
53 | further defined and clarified by project maintainers.
54 |
55 | ## Enforcement
56 |
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at mael.fabien@gmail.com. All
59 | complaints will be reviewed and investigated and will result in a response that
60 | is deemed necessary and appropriate to the circumstances. The project team is
61 | obligated to maintain confidentiality with regard to the reporter of an incident.
62 | Further details of specific enforcement policies may be posted separately.
63 |
64 | Project maintainers who do not follow or enforce the Code of Conduct in good
65 | faith may face temporary or permanent repercussions as determined by other
66 | members of the project's leadership.
67 |
68 | ## Attribution
69 |
70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 | available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
72 |
73 | [homepage]: https://www.contributor-covenant.org
74 |
75 | For answers to common questions about this code of conduct, see
76 | https://www.contributor-covenant.org/faq
77 |
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
202 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # fsText : Few-Shot Text Classification
2 |
3 |
 4 | [PyPI](https://pypi.org/project/fsText/)
 5 | [Code of Conduct](.github/CODE_OF_CONDUCT.md)
 6 | [PRs Welcome](http://makeapullrequest.com)
7 |
 8 | 🚧 This library is currently a work in progress! 🚧
9 |
 10 | *Use Case*: A user has a column of short texts (e.g. user reviews) that are not labeled. The user hand-labels just a few texts of each class (i.e. few-shot), and the library provides methods that leverage pre-trained embeddings to generalize the classification to the whole dataset.
11 |
 12 | This library gathers several state-of-the-art techniques. The concepts behind each algorithm and its implementation are presented in the sections below.
13 |
14 | ## Table of Contents
15 |
16 | - [Installation](#Installation)
17 | - [With pip](#With-pip)
18 | - [From source](#From-source)
 19 | - [Implemented Models](#Implemented-Models)
20 | - [Getting started](#Getting-started)
21 | - [Preparing your data](#Preparing-your-data)
22 | - [Training models](#Training-models)
23 | - [Making predictions](#Making-predictions)
24 | - [Notebook Examples](#Notebook-Examples)
25 | - [Contributing](#Contributing)
26 | - [References](#References)
27 | - [LICENSE](#LICENSE)
 28 | - [Contacts and Contributors](#Contacts-and-contributors)
29 |
30 | ## Installation
31 |
32 | ### With pip
33 |
34 | ```shell
35 | pip install fsText
36 | ```
37 |
38 | ### From source
39 |
40 | ```shell
41 | git clone https://github.com/maelfabien/fsText.git
42 | cd fsText
43 | pip install -e .
44 | ```
45 |
46 | ## Implemented Models
47 |
48 |
49 | | Model | Status | Details | Reference Paper |
50 | | ----------------- | --------------------| -------------------- | -------------------- |
51 | | Word2Vec + Cosine Similarity | ✅ | [Article](https://maelfabien.github.io/machinelearning/NLP_5/) | [Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop](https://arxiv.org/abs/1804.02063) |
52 | | Word2Vec + Advanced Classifiers | 🚧 | [Article](https://maelfabien.github.io/machinelearning/NLP_6/) | [Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop](https://arxiv.org/abs/1804.02063) |
53 | | DistilBert + Advanced Classifier | 🚧 | [Article](https://maelfabien.github.io/machinelearning/NLP_7/) | --- |
54 | | Siamese Network | ❌ | [Article](https://data4thought.com/fewshot_learning_nlp.html) | --- |
55 | | Fine-Tuning Pre-trained Bert | ❌ | --- | [Improving Few-shot Text Classification via Pretrained Language Representations](https://arxiv.org/abs/1908.08788) |
56 |
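To give a rough intuition for the first approach in the table above (Word2Vec + cosine similarity), here is a minimal sketch, not the library's actual implementation: each class is represented by the centroid of the averaged word vectors of its few labeled examples, and a new text is assigned to the most cosine-similar centroid. The `word_vectors` lookup (token → pre-trained vector) and its dimensionality are assumptions.

```python
import numpy as np

def sentence_vector(text, word_vectors, dim=300):
    # Average the pre-trained vectors of the known tokens; zero vector if none are known.
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def class_centroids(texts, labels, word_vectors):
    # One centroid per class: the mean sentence vector of its few hand-labeled texts.
    centroids = {}
    for label in set(labels):
        vecs = [sentence_vector(t, word_vectors) for t, l in zip(texts, labels) if l == label]
        centroids[label] = np.mean(vecs, axis=0)
    return centroids

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def predict_one(text, centroids, word_vectors):
    # Assign the text to the class whose centroid is most cosine-similar.
    v = sentence_vector(text, word_vectors)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))
```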
57 | ## Getting started
58 |
59 | ### Preparing your data
60 |
 61 | We offer a text pre-processing pipeline as well as data augmentation techniques. To use `fsText`, create a pandas DataFrame with the following columns (a minimal construction example follows the table):
62 |
63 | | Text | Label |
64 | | ----------------- | --------------------|
65 | | First short text | Label of first text |
66 |
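For illustration, a minimal DataFrame with this layout could be built as follows (the texts and labels below are made up; only the `Text` and `Label` column names matter):

```python
import pandas as pd

df = pd.DataFrame({
    "Text": ["great product, fast delivery", "package never arrived", "love the new design"],
    "Label": ["positive", "negative", "positive"],
})

X_train, y_train = df["Text"], df["Label"].values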
67 | ### Training models
68 |
69 | Fit the cosine classifier on your annotated texts:
70 |
71 | ```python
72 | from fsText.Classifier import CosineClassifier
73 |
74 | clf = CosineClassifier()
75 | clf.fit(X_train, y_train)
76 | ```
77 |
 78 | Label encoding of `y_train` is handled automatically, so labels can take any form (strings, integers, etc.).
79 |
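For instance, fitting directly on string labels should work; whether `predict` returns the original strings or their encoded values depends on the implementation. A short sketch, assuming `fit` accepts a pandas Series of texts as in the examples notebooks:

```python
clf = CosineClassifier()
clf.fit(df["Text"], df["Label"].values)  # "positive"/"negative" strings are encoded internally
```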
80 | ### Making predictions
81 |
 82 | To get predictions on the rest of your unlabeled texts:
83 |
84 | ```python
85 | clf.predict(X_test)
86 | ```
87 |
 88 | To assess the accuracy of the predictions when you have a labeled test set:
89 |
90 | ```python
91 | from sklearn.metrics import accuracy_score
92 | accuracy_score(clf.predict(X_test), y_test)
93 | ```
94 |
95 | ## Notebook Examples
96 |
97 | We prepared some notebook examples under the [examples](examples) directory.
98 |
99 | | Notebook | Description |
100 | | --- | --- |
101 | | [1] CosineClassifierDemo | A simple demonstration of fsText Cosine Classifier + Word2Vec |
102 |
103 | ## Contributing
104 |
105 | Read our [Contributing Guidelines](.github/CONTRIBUTING.md).
106 |
107 | ## References
108 |
109 | | Type | Title | Author | Year |
110 | | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- | ---- |
111 | | :newspaper: Paper | [One-shot and few-shot learning of word embeddings](https://arxiv.org/abs/1710.10280) | Andrew K. Lampinen & James L. McClelland | 2018 |
112 | | :newspaper: Paper | [Few-Shot Text Classification with Pre-Trained Word Embeddings and a Human in the Loop](https://arxiv.org/abs/1804.02063) | Katherine Bailey, Sunny Chopra | 2018 |
113 | | :newspaper: Paper | [Improving Few-shot Text Classification via Pretrained Language Representations](https://arxiv.org/abs/1908.08788) | Ningyu Zhang, Zhanlin Sun, Shumin Deng, Jiaoyan Chen, Huajun Chen | 2019 |
114 |
115 | ## LICENSE
116 |
117 | [Apache-2.0](LICENSE)
118 |
119 | ## Contacts and contributors
120 |
121 |
122 | |
 122 | - andrelmfarias 💻
 123 | - mamrouch 💻
 124 | - maelfabien 💻
\n", 119 | " | Text | \n", 120 | "Label | \n", 121 | "
---|---|---|
1717 | \n", 126 | "Wordpress SEO Features | \n", 127 | "1 | \n", 128 | "
820 | \n", 131 | "How to use the jQuery Cycle Plugin with WordPr... | \n", 132 | "1 | \n", 133 | "
1905 | \n", 136 | "PHP/SQL/Wordpress: Group a user list by alphabet | \n", 137 | "1 | \n", 138 | "
1361 | \n", 141 | "How do I use underscore in a wordpress permalink | \n", 142 | "1 | \n", 143 | "
1123 | \n", 146 | "Wordpress register_activation_hook() + global ... | \n", 147 | "1 | \n", 148 | "
901 | \n", 151 | "How can I optimize a dynamic search query in O... | \n", 152 | "2 | \n", 153 | "
953 | \n", 156 | "How to handle line breaks in data for importin... | \n", 157 | "2 | \n", 158 | "
173 | \n", 161 | "Search All Fields In All Tables For A Specific... | \n", 162 | "2 | \n", 163 | "
592 | \n", 166 | "How do I close an OracleConnection in .NET | \n", 167 | "2 | \n", 168 | "
1155 | \n", 171 | "Oracle: How do I convert hex to decimal in Ora... | \n", 172 | "2 | \n", 173 | "
\n", 270 | " | Text | \n", 271 | "Label | \n", 272 | "
---|---|---|
0 | \n", 277 | "how do i fill a dataset or a datatable from a ... | \n", 278 | "18 | \n", 279 | "
1 | \n", 282 | "how do you page a collection with linq | \n", 283 | "18 | \n", 284 | "
2 | \n", 287 | "best subversion clients for windows vista bit | \n", 288 | "3 | \n", 289 | "
3 | \n", 292 | "best practice collaborative environment bin di... | \n", 293 | "3 | \n", 294 | "
4 | \n", 297 | "visual studio setup project per user registry ... | \n", 298 | "7 | \n", 299 | "
\n", 459 | " | Text | \n", 460 | "Label | \n", 461 | "
---|---|---|
1521 | \n", 466 | "how to get wordpress page id after looping posts | \n", 467 | "1 | \n", 468 | "
1737 | \n", 471 | "using wp query to pull content from a specific... | \n", 472 | "1 | \n", 473 | "
1740 | \n", 476 | "wordpress how to show just posts on main index | \n", 477 | "1 | \n", 478 | "
1660 | \n", 481 | "wordpress is it possible to make one particula... | \n", 482 | "1 | \n", 483 | "
1411 | \n", 486 | "exclude templates in wordpress page | \n", 487 | "1 | \n", 488 | "
1678 | \n", 491 | "wierd date and time formating in wordpress y m d | \n", 492 | "1 | \n", 493 | "
1626 | \n", 496 | "wordpress custom post type templates | \n", 497 | "1 | \n", 498 | "
1513 | \n", 501 | "inserting wordpress plugin content to posts | \n", 502 | "1 | \n", 503 | "
1859 | \n", 506 | "how can i delay one feed in wordpress but not ... | \n", 507 | "1 | \n", 508 | "
1072 | \n", 511 | "how can i remove jquery from the frontside of ... | \n", 512 | "1 | \n", 513 | "
1811 | \n", 516 | "wordpress calling recent posts widget via scri... | \n", 517 | "1 | \n", 518 | "
721 | \n", 521 | "debuging register activation hook in wordpress | \n", 522 | "1 | \n", 523 | "
1636 | \n", 526 | "testing pluggable function calls clashes for w... | \n", 527 | "1 | \n", 528 | "
1973 | \n", 531 | "would it be quicker to make wordpress theme di... | \n", 532 | "1 | \n", 533 | "
1938 | \n", 536 | "how to find and clean wordpress from script s ... | \n", 537 | "1 | \n", 538 | "
1899 | \n", 541 | "wordpress set post date | \n", 542 | "1 | \n", 543 | "
1280 | \n", 546 | "wordpress add comment like stackoverflow | \n", 547 | "1 | \n", 548 | "
1883 | \n", 551 | "wordpress nav not visible in pages like articl... | \n", 552 | "1 | \n", 553 | "
1761 | \n", 556 | "wordpress development | \n", 557 | "1 | \n", 558 | "
1319 | \n", 561 | "wordpress static pages how to embed content in... | \n", 562 | "1 | \n", 563 | "
1549 | \n", 566 | "filtering search results with wordpress | \n", 567 | "1 | \n", 568 | "
1174 | \n", 571 | "drop down js menu blinking in ie | \n", 572 | "1 | \n", 573 | "
1371 | \n", 576 | "wrap stray text in p tags | \n", 577 | "1 | \n", 578 | "
1527 | \n", 581 | "wordpress menu of categories | \n", 582 | "1 | \n", 583 | "
1210 | \n", 586 | "wordpress menu with superslide show | \n", 587 | "1 | \n", 588 | "
1235 | \n", 591 | "getting post information outside the wordpress... | \n", 592 | "1 | \n", 593 | "
872 | \n", 596 | "why cant i include a blog | \n", 597 | "1 | \n", 598 | "
1986 | \n", 601 | "d slideshow for wordpress and url file access ... | \n", 602 | "1 | \n", 603 | "
1902 | \n", 606 | "submit wordpress form programmatically | \n", 607 | "1 | \n", 608 | "
1947 | \n", 611 | "move wordpress from home web server to web ser... | \n", 612 | "1 | \n", 613 | "
... | \n", 616 | "... | \n", 617 | "... | \n", 618 | "
99 | \n", 621 | "how to use sqlab xpert tuning to tune sql for ... | \n", 622 | "2 | \n", 623 | "
54 | \n", 626 | "oracle connection problem on mac osx status fa... | \n", 627 | "2 | \n", 628 | "
923 | \n", 631 | "xaconnection performance in oracle g | \n", 632 | "2 | \n", 633 | "
978 | \n", 636 | "what is the pl sql api difference between orac... | \n", 637 | "2 | \n", 638 | "
504 | \n", 641 | "how do i call an oracle function from oci | \n", 642 | "2 | \n", 643 | "
63 | \n", 646 | "explain plan cost vs execution time | \n", 647 | "2 | \n", 648 | "
74 | \n", 651 | "user interface for creating oracle sql loader ... | \n", 652 | "2 | \n", 653 | "
518 | \n", 656 | "is it possible to refer to column names via bi... | \n", 657 | "2 | \n", 658 | "
1117 | \n", 661 | "can i run an arbitrary oracle sql script throu... | \n", 662 | "2 | \n", 663 | "
463 | \n", 666 | "oracle logon protocol o logon in g | \n", 667 | "2 | \n", 668 | "
255 | \n", 671 | "databases oracle | \n", 672 | "2 | \n", 673 | "
968 | \n", 676 | "resultset logic when selecting tables without ... | \n", 677 | "2 | \n", 678 | "
537 | \n", 681 | "oracle stored procedures sys refcursor and nhi... | \n", 682 | "2 | \n", 683 | "
235 | \n", 686 | "oracle express edition can not connect remotel... | \n", 687 | "2 | \n", 688 | "
1051 | \n", 691 | "performance of remote materialized views in or... | \n", 692 | "2 | \n", 693 | "
1065 | \n", 696 | "merge output cursor of sp into table | \n", 697 | "2 | \n", 698 | "
277 | \n", 701 | "oracle logon trigger not being fired | \n", 702 | "2 | \n", 703 | "
453 | \n", 706 | "understanding lob segments sys lob in oracle | \n", 707 | "2 | \n", 708 | "
838 | \n", 711 | "oracle database character set issue with the a... | \n", 712 | "2 | \n", 713 | "
709 | \n", 716 | "finding the days of the week within a date ran... | \n", 717 | "2 | \n", 718 | "
591 | \n", 721 | "how to convert sql server to oracle | \n", 722 | "2 | \n", 723 | "
48 | \n", 726 | "how can i avoid ta warning fom an unused param... | \n", 727 | "2 | \n", 728 | "
786 | \n", 731 | "oracle hierarchical query how to include top l... | \n", 732 | "2 | \n", 733 | "
318 | \n", 736 | "oracle record history using as of timestamp wi... | \n", 737 | "2 | \n", 738 | "
515 | \n", 741 | "how to call a function with rowtype parameter ... | \n", 742 | "2 | \n", 743 | "
33 | \n", 746 | "ssis oracle parameter mapping | \n", 747 | "2 | \n", 748 | "
434 | \n", 751 | "multi line pl sql command with net oraclecommand | \n", 752 | "2 | \n", 753 | "
577 | \n", 756 | "read write data from to a file in pl sql witho... | \n", 757 | "2 | \n", 758 | "
327 | \n", 761 | "oracle sql parsing a name string and convertin... | \n", 762 | "2 | \n", 763 | "
830 | \n", 766 | "oracle optimizing query involving date calcula... | \n", 767 | "2 | \n", 768 | "
100 rows × 2 columns
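The notebook cells below use `transform_sentence(x, model2)`, where both the helper and `model2` (a pre-trained Word2Vec model) are defined earlier in the notebook and are not part of this excerpt. As a hypothetical sketch of what such a helper typically does (an assumption, not the notebook's exact code), it averages the word vectors of a sentence:

```python
import numpy as np

def transform_sentence(text, model):
    # Assumed behaviour: mean of the Word2Vec vectors of the in-vocabulary tokens.
    vecs = [model.wv[w] for w in str(text).split() if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)
```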
\n", 772 | "" 773 | ], 774 | "text/plain": [ 775 | " Text Label\n", 776 | "1521 how to get wordpress page id after looping posts 1\n", 777 | "1737 using wp query to pull content from a specific... 1\n", 778 | "1740 wordpress how to show just posts on main index 1\n", 779 | "1660 wordpress is it possible to make one particula... 1\n", 780 | "1411 exclude templates in wordpress page 1\n", 781 | "1678 wierd date and time formating in wordpress y m d 1\n", 782 | "1626 wordpress custom post type templates 1\n", 783 | "1513 inserting wordpress plugin content to posts 1\n", 784 | "1859 how can i delay one feed in wordpress but not ... 1\n", 785 | "1072 how can i remove jquery from the frontside of ... 1\n", 786 | "1811 wordpress calling recent posts widget via scri... 1\n", 787 | "721 debuging register activation hook in wordpress 1\n", 788 | "1636 testing pluggable function calls clashes for w... 1\n", 789 | "1973 would it be quicker to make wordpress theme di... 1\n", 790 | "1938 how to find and clean wordpress from script s ... 1\n", 791 | "1899 wordpress set post date 1\n", 792 | "1280 wordpress add comment like stackoverflow 1\n", 793 | "1883 wordpress nav not visible in pages like articl... 1\n", 794 | "1761 wordpress development 1\n", 795 | "1319 wordpress static pages how to embed content in... 1\n", 796 | "1549 filtering search results with wordpress 1\n", 797 | "1174 drop down js menu blinking in ie 1\n", 798 | "1371 wrap stray text in p tags 1\n", 799 | "1527 wordpress menu of categories 1\n", 800 | "1210 wordpress menu with superslide show 1\n", 801 | "1235 getting post information outside the wordpress... 1\n", 802 | "872 why cant i include a blog 1\n", 803 | "1986 d slideshow for wordpress and url file access ... 1\n", 804 | "1902 submit wordpress form programmatically 1\n", 805 | "1947 move wordpress from home web server to web ser... 1\n", 806 | "... ... ...\n", 807 | "99 how to use sqlab xpert tuning to tune sql for ... 2\n", 808 | "54 oracle connection problem on mac osx status fa... 2\n", 809 | "923 xaconnection performance in oracle g 2\n", 810 | "978 what is the pl sql api difference between orac... 2\n", 811 | "504 how do i call an oracle function from oci 2\n", 812 | "63 explain plan cost vs execution time 2\n", 813 | "74 user interface for creating oracle sql loader ... 2\n", 814 | "518 is it possible to refer to column names via bi... 2\n", 815 | "1117 can i run an arbitrary oracle sql script throu... 2\n", 816 | "463 oracle logon protocol o logon in g 2\n", 817 | "255 databases oracle 2\n", 818 | "968 resultset logic when selecting tables without ... 2\n", 819 | "537 oracle stored procedures sys refcursor and nhi... 2\n", 820 | "235 oracle express edition can not connect remotel... 2\n", 821 | "1051 performance of remote materialized views in or... 2\n", 822 | "1065 merge output cursor of sp into table 2\n", 823 | "277 oracle logon trigger not being fired 2\n", 824 | "453 understanding lob segments sys lob in oracle 2\n", 825 | "838 oracle database character set issue with the a... 2\n", 826 | "709 finding the days of the week within a date ran... 2\n", 827 | "591 how to convert sql server to oracle 2\n", 828 | "48 how can i avoid ta warning fom an unused param... 2\n", 829 | "786 oracle hierarchical query how to include top l... 2\n", 830 | "318 oracle record history using as of timestamp wi... 2\n", 831 | "515 how to call a function with rowtype parameter ... 
2\n", 832 | "33 ssis oracle parameter mapping 2\n", 833 | "434 multi line pl sql command with net oraclecommand 2\n", 834 | "577 read write data from to a file in pl sql witho... 2\n", 835 | "327 oracle sql parsing a name string and convertin... 2\n", 836 | "830 oracle optimizing query involving date calcula... 2\n", 837 | "\n", 838 | "[100 rows x 2 columns]" 839 | ] 840 | }, 841 | "execution_count": 12, 842 | "metadata": {}, 843 | "output_type": "execute_result" 844 | } 845 | ], 846 | "source": [ 847 | "train" 848 | ] 849 | }, 850 | { 851 | "cell_type": "code", 852 | "execution_count": 13, 853 | "metadata": { 854 | "ExecuteTime": { 855 | "end_time": "2019-09-10T08:31:14.152530Z", 856 | "start_time": "2019-09-10T08:31:14.147953Z" 857 | } 858 | }, 859 | "outputs": [ 860 | { 861 | "data": { 862 | "text/plain": [ 863 | "(100, 2)" 864 | ] 865 | }, 866 | "execution_count": 13, 867 | "metadata": {}, 868 | "output_type": "execute_result" 869 | } 870 | ], 871 | "source": [ 872 | "train.shape" 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 14, 878 | "metadata": { 879 | "ExecuteTime": { 880 | "end_time": "2019-09-10T08:31:14.621358Z", 881 | "start_time": "2019-09-10T08:31:14.155615Z" 882 | } 883 | }, 884 | "outputs": [], 885 | "source": [ 886 | "X_train = train['Text']\n", 887 | "y_train = train['Label'].values\n", 888 | "X_test = test['Text']\n", 889 | "y_test = test['Label'].values\n", 890 | "\n", 891 | "X_train_mean = X_train.apply(lambda x : transform_sentence(x, model2))\n", 892 | "X_test_mean = X_test.apply(lambda x : transform_sentence(x, model2))\n", 893 | "\n", 894 | "X_train_mean = pd.DataFrame(X_train_mean)['Text'].apply(pd.Series)\n", 895 | "X_test_mean = pd.DataFrame(X_test_mean)['Text'].apply(pd.Series)" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": {}, 901 | "source": [ 902 | "# Data Augmentation" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "## Replace words" 910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 90, 915 | "metadata": { 916 | "ExecuteTime": { 917 | "end_time": "2019-09-09T14:58:59.333032Z", 918 | "start_time": "2019-09-09T14:58:59.298243Z" 919 | } 920 | }, 921 | "outputs": [], 922 | "source": [ 923 | "def get_synonyms(word):\n", 924 | " \n", 925 | " synonyms = set()\n", 926 | " \n", 927 | " for syn in wordnet.synsets(word): \n", 928 | " for l in syn.lemmas(): \n", 929 | " synonym = l.name().replace(\"_\", \" \").replace(\"-\", \" \").lower()\n", 930 | " synonym = \"\".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])\n", 931 | " synonyms.add(synonym) \n", 932 | " \n", 933 | " if word in synonyms:\n", 934 | " synonyms.remove(word)\n", 935 | " \n", 936 | " return list(synonyms)\n", 937 | "\n", 938 | "def synonym_replacement(words, n):\n", 939 | " \n", 940 | " words = words.split()\n", 941 | " \n", 942 | " new_words = words.copy()\n", 943 | " random_word_list = list(set([word for word in words if word not in stop_words]))\n", 944 | " random.shuffle(random_word_list)\n", 945 | " num_replaced = 0\n", 946 | " \n", 947 | " for random_word in random_word_list:\n", 948 | " synonyms = get_synonyms(random_word)\n", 949 | " \n", 950 | " if len(synonyms) >= 1:\n", 951 | " synonym = random.choice(list(synonyms))\n", 952 | " new_words = [synonym if word == random_word else word for word in new_words]\n", 953 | " #print(\"replaced\", random_word, \"with\", synonym)\n", 954 | " num_replaced += 1\n", 955 | " \n", 956 | " if 
num_replaced >= n: #only replace up to n words\n", 957 | " break\n", 958 | "\n", 959 | " sentence = ' '.join(new_words)\n", 960 | "\n", 961 | " return sentence\n", 962 | "\n", 963 | "def iterative_replace(df):\n", 964 | " \n", 965 | " df = df.reset_index().drop(['index'], axis=1)\n", 966 | " index_row = df.index\n", 967 | " df_2 = pd.DataFrame()\n", 968 | " \n", 969 | " for row in index_row:\n", 970 | " for k in range(1,6):\n", 971 | " df_2 = df_2.append({'Text':synonym_replacement(df.loc[row]['Text'], k), 'Label':df.loc[row]['Label']}, ignore_index=True)\n", 972 | " return df_2" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "metadata": {}, 978 | "source": [ 979 | "## Delete words" 980 | ] 981 | }, 982 | { 983 | "cell_type": "code", 984 | "execution_count": 91, 985 | "metadata": { 986 | "ExecuteTime": { 987 | "end_time": "2019-09-09T14:58:59.695868Z", 988 | "start_time": "2019-09-09T14:58:59.675230Z" 989 | } 990 | }, 991 | "outputs": [], 992 | "source": [ 993 | "def random_deletion(words, p):\n", 994 | "\n", 995 | " words = words.split()\n", 996 | " \n", 997 | " #obviously, if there's only one word, don't delete it\n", 998 | " if len(words) == 1:\n", 999 | " return words\n", 1000 | "\n", 1001 | " #randomly delete words with probability p\n", 1002 | " new_words = []\n", 1003 | " for word in words:\n", 1004 | " r = random.uniform(0, 1)\n", 1005 | " if r > p:\n", 1006 | " new_words.append(word)\n", 1007 | "\n", 1008 | " #if you end up deleting all words, just return a random word\n", 1009 | " if len(new_words) == 0:\n", 1010 | " rand_int = random.randint(0, len(words)-1)\n", 1011 | " return [words[rand_int]]\n", 1012 | "\n", 1013 | " sentence = ' '.join(new_words)\n", 1014 | " \n", 1015 | " return sentence\n", 1016 | "\n", 1017 | "def iterative_delete(df):\n", 1018 | " \n", 1019 | " df = df.reset_index().drop(['index'], axis=1)\n", 1020 | " index_row = df.index\n", 1021 | " df_2 = pd.DataFrame()\n", 1022 | " \n", 1023 | " for row in index_row:\n", 1024 | " df_2 = df_2.append({'Text':random_deletion(df.loc[row]['Text'], 0.25), 'Label':df.loc[row]['Label']}, ignore_index=True)\n", 1025 | " return df_2" 1026 | ] 1027 | }, 1028 | { 1029 | "cell_type": "markdown", 1030 | "metadata": {}, 1031 | "source": [ 1032 | "## Random Swap" 1033 | ] 1034 | }, 1035 | { 1036 | "cell_type": "code", 1037 | "execution_count": 92, 1038 | "metadata": { 1039 | "ExecuteTime": { 1040 | "end_time": "2019-09-09T14:59:00.168295Z", 1041 | "start_time": "2019-09-09T14:59:00.144499Z" 1042 | } 1043 | }, 1044 | "outputs": [], 1045 | "source": [ 1046 | "def random_swap(words, n):\n", 1047 | " \n", 1048 | " words = words.split()\n", 1049 | " new_words = words.copy()\n", 1050 | " \n", 1051 | " for _ in range(n):\n", 1052 | " new_words = swap_word(new_words)\n", 1053 | " \n", 1054 | " sentence = ' '.join(new_words)\n", 1055 | " \n", 1056 | " return sentence\n", 1057 | "\n", 1058 | "def swap_word(new_words):\n", 1059 | " \n", 1060 | " random_idx_1 = random.randint(0, len(new_words)-1)\n", 1061 | " random_idx_2 = random_idx_1\n", 1062 | " counter = 0\n", 1063 | " \n", 1064 | " while random_idx_2 == random_idx_1:\n", 1065 | " random_idx_2 = random.randint(0, len(new_words)-1)\n", 1066 | " counter += 1\n", 1067 | " \n", 1068 | " if counter > 3:\n", 1069 | " return new_words\n", 1070 | " \n", 1071 | " new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] \n", 1072 | " return new_words\n", 1073 | "\n", 1074 | "def iterative_swap(df):\n", 1075 | " \n", 1076 | " df = 
df.reset_index().drop(['index'], axis=1)\n", 1077 | " index_row = df.index\n", 1078 | " df_2 = pd.DataFrame()\n", 1079 | " for row in index_row:\n", 1080 | " df_2 = df_2.append({'Text':random_swap(df.loc[row]['Text'], 2), 'Label':df.loc[row]['Label']}, ignore_index=True)\n", 1081 | " return df_2" 1082 | ] 1083 | }, 1084 | { 1085 | "cell_type": "markdown", 1086 | "metadata": {}, 1087 | "source": [ 1088 | "## Random Insertion" 1089 | ] 1090 | }, 1091 | { 1092 | "cell_type": "code", 1093 | "execution_count": 93, 1094 | "metadata": { 1095 | "ExecuteTime": { 1096 | "end_time": "2019-09-09T14:59:01.181955Z", 1097 | "start_time": "2019-09-09T14:59:01.152899Z" 1098 | } 1099 | }, 1100 | "outputs": [], 1101 | "source": [ 1102 | "def random_insertion(words, n):\n", 1103 | " \n", 1104 | " words = words.split()\n", 1105 | " new_words = words.copy()\n", 1106 | " \n", 1107 | " for _ in range(n):\n", 1108 | " add_word(new_words)\n", 1109 | " \n", 1110 | " sentence = ' '.join(new_words)\n", 1111 | " return sentence\n", 1112 | "\n", 1113 | "def add_word(new_words):\n", 1114 | " \n", 1115 | " synonyms = []\n", 1116 | " counter = 0\n", 1117 | " \n", 1118 | " while len(synonyms) < 1:\n", 1119 | " random_word = new_words[random.randint(0, len(new_words)-1)]\n", 1120 | " synonyms = get_synonyms(random_word)\n", 1121 | " counter += 1\n", 1122 | " if counter >= 10:\n", 1123 | " return\n", 1124 | " \n", 1125 | " random_synonym = synonyms[0]\n", 1126 | " random_idx = random.randint(0, len(new_words)-1)\n", 1127 | " new_words.insert(random_idx, random_synonym)\n", 1128 | " \n", 1129 | "def iterative_insert(df):\n", 1130 | " \n", 1131 | " df = df.reset_index().drop(['index'], axis=1)\n", 1132 | " index_row = df.index\n", 1133 | " df_2 = pd.DataFrame()\n", 1134 | " \n", 1135 | " for row in index_row:\n", 1136 | " df_2 = df_2.append({'Text':random_insertion(df.loc[row]['Text'], 2), 'Label':df.loc[row]['Label']}, ignore_index=True)\n", 1137 | " \n", 1138 | " return df_2" 1139 | ] 1140 | }, 1141 | { 1142 | "cell_type": "markdown", 1143 | "metadata": {}, 1144 | "source": [ 1145 | "## Data Augmentation" 1146 | ] 1147 | }, 1148 | { 1149 | "cell_type": "code", 1150 | "execution_count": 94, 1151 | "metadata": { 1152 | "ExecuteTime": { 1153 | "end_time": "2019-09-09T14:59:04.924959Z", 1154 | "start_time": "2019-09-09T14:59:03.141470Z" 1155 | } 1156 | }, 1157 | "outputs": [ 1158 | { 1159 | "name": "stderr", 1160 | "output_type": "stream", 1161 | "text": [ 1162 | "/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:6: FutureWarning: Sorting because non-concatenation axis is not aligned. 
A future version\n", 1163 | "of pandas will change to not sort by default.\n", 1164 | "\n", 1165 | "To accept the future behavior, pass 'sort=False'.\n", 1166 | "\n", 1167 | "To retain the current behavior and silence the warning, pass 'sort=True'.\n", 1168 | "\n", 1169 | " \n" 1170 | ] 1171 | } 1172 | ], 1173 | "source": [ 1174 | "df_replace = iterative_replace(train)\n", 1175 | "df_delete = iterative_delete(train)\n", 1176 | "df_swap = iterative_swap(train)\n", 1177 | "df_insert = iterative_insert(train)\n", 1178 | "\n", 1179 | "train = pd.concat([train, df_replace, df_delete, df_swap, df_insert], axis=0).reset_index().drop(['index'], axis=1)" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "code", 1184 | "execution_count": 95, 1185 | "metadata": { 1186 | "ExecuteTime": { 1187 | "end_time": "2019-09-09T14:59:06.350551Z", 1188 | "start_time": "2019-09-09T14:59:05.829293Z" 1189 | } 1190 | }, 1191 | "outputs": [], 1192 | "source": [ 1193 | "X_train = train['Text']\n", 1194 | "y_train = train['Label'].values\n", 1195 | "X_test = test['Text']\n", 1196 | "y_test = test['Label'].values\n", 1197 | "\n", 1198 | "X_train_mean = X_train.apply(lambda x : transform_sentence(x, model2))\n", 1199 | "X_test_mean = X_test.apply(lambda x : transform_sentence(x, model2))\n", 1200 | "\n", 1201 | "X_train_mean = pd.DataFrame(X_train_mean)['Text'].apply(pd.Series)\n", 1202 | "X_test_mean = pd.DataFrame(X_test_mean)['Text'].apply(pd.Series)" 1203 | ] 1204 | }, 1205 | { 1206 | "cell_type": "markdown", 1207 | "metadata": {}, 1208 | "source": [ 1209 | "# Utilities in PyTorch " 1210 | ] 1211 | }, 1212 | { 1213 | "cell_type": "markdown", 1214 | "metadata": { 1215 | "ExecuteTime": { 1216 | "end_time": "2019-09-10T08:38:21.419881Z", 1217 | "start_time": "2019-09-10T08:38:21.415733Z" 1218 | } 1219 | }, 1220 | "source": [ 1221 | "## CNN Model" 1222 | ] 1223 | }, 1224 | { 1225 | "cell_type": "code", 1226 | "execution_count": 16, 1227 | "metadata": { 1228 | "ExecuteTime": { 1229 | "end_time": "2019-09-10T08:34:17.541159Z", 1230 | "start_time": "2019-09-10T08:34:17.426574Z" 1231 | } 1232 | }, 1233 | "outputs": [], 1234 | "source": [ 1235 | "class MaxPool(nn.Module):\n", 1236 | " def __init__(self, dim=1):\n", 1237 | " super(MaxPool, self).__init__()\n", 1238 | " self.dim = dim\n", 1239 | " \n", 1240 | " def forward(self, input):\n", 1241 | " return torch.max(input, self.dim)[0]\n", 1242 | "\n", 1243 | " def __repr__(self):\n", 1244 | " return self.__class__.__name__ +'('+ 'dim=' + str(self.dim) + ')'\n", 1245 | "\n", 1246 | "class View(nn.Module):\n", 1247 | " def __init__(self, *sizes):\n", 1248 | " super(View, self).__init__()\n", 1249 | " self.sizes_list = sizes\n", 1250 | "\n", 1251 | " def forward(self, input):\n", 1252 | " return input.view(*self.sizes_list)\n", 1253 | "\n", 1254 | " def __repr__(self):\n", 1255 | " return self.__class__.__name__ + ' (' \\\n", 1256 | " + 'sizes=' + str(self.sizes_list) + ')'\n", 1257 | "\n", 1258 | "class Transpose(nn.Module):\n", 1259 | " def __init__(self, dim1=0, dim2=1):\n", 1260 | " super(Transpose, self).__init__()\n", 1261 | " self.dim1 = dim1\n", 1262 | " self.dim2 = dim2\n", 1263 | "\n", 1264 | " def forward(self, input):\n", 1265 | " return input.transpose(self.dim1, self.dim2).contiguous()\n", 1266 | "\n", 1267 | " def __repr__(self):\n", 1268 | " return self.__class__.__name__ + ' (' \\\n", 1269 | " + 'between=' + str(self.dim1) + ',' + str(self.dim2) + ')'\n", 1270 | "\n", 1271 | "class CNNModel(nn.Module):\n", 1272 | " def __init__(self, vocab_size, num_labels, 
emb_size, w_hid_size, h_hid_size, win, batch_size,with_proj=False):\n", 1273 | " super(CNNModel, self).__init__()\n", 1274 | "\n", 1275 | " self.model = nn.Sequential()\n", 1276 | " self.model.add_module('transpose', Transpose())\n", 1277 | " self.embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_size)\n", 1278 | " self.model.add_module('emb', self.embed)\n", 1279 | " if with_proj:\n", 1280 | " self.model.add_module('view1', View(-1, emb_size))\n", 1281 | " self.model.add_module('linear1', nn.Linear(emb_size, w_hid_size))\n", 1282 | " self.model.add_module('relu1', nn.ReLU())\n", 1283 | " else:\n", 1284 | " w_hid_size = emb_size\n", 1285 | "\n", 1286 | " self.model.add_module('trans2', Transpose(1, 2))\n", 1287 | "\n", 1288 | " conv_nn = nn.Conv1d(w_hid_size, h_hid_size, win, padding=1)\n", 1289 | " self.model.add_module('conv', conv_nn)\n", 1290 | " self.model.add_module('relu2', nn.ReLU())\n", 1291 | "\n", 1292 | " self.model.add_module('max', MaxPool(2))\n", 1293 | "\n", 1294 | " self.model.add_module('view4', View(-1, h_hid_size))\n", 1295 | " self.model.add_module('linear2', nn.Linear(h_hid_size, num_labels))\n", 1296 | " self.model.add_module('softmax', nn.LogSoftmax())\n", 1297 | "\n", 1298 | "\n", 1299 | " def forward(self, x):\n", 1300 | "\n", 1301 | " output = self.model.forward(x)\n", 1302 | "\n", 1303 | " return output" 1304 | ] 1305 | }, 1306 | { 1307 | "cell_type": "markdown", 1308 | "metadata": { 1309 | "ExecuteTime": { 1310 | "end_time": "2019-09-10T08:38:21.419881Z", 1311 | "start_time": "2019-09-10T08:38:21.415733Z" 1312 | } 1313 | }, 1314 | "source": [ 1315 | "## Load BERT" 1316 | ] 1317 | }, 1318 | { 1319 | "cell_type": "code", 1320 | "execution_count": 17, 1321 | "metadata": { 1322 | "ExecuteTime": { 1323 | "end_time": "2019-09-10T08:36:59.732744Z", 1324 | "start_time": "2019-09-10T08:36:59.624085Z" 1325 | } 1326 | }, 1327 | "outputs": [ 1328 | { 1329 | "name": "stdout", 1330 | "output_type": "stream", 1331 | "text": [ 1332 | "Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.\n" 1333 | ] 1334 | } 1335 | ], 1336 | "source": [ 1337 | "dependencies = ['torch', 'tqdm', 'boto3', 'requests', 'regex']\n", 1338 | "\n", 1339 | "def bertTokenizer(*args, **kwargs):\n", 1340 | " tokenizer = BertTokenizer.from_pretrained(*args, **kwargs)\n", 1341 | " return tokenizer\n", 1342 | "\n", 1343 | "def bertModel(*args, **kwargs):\n", 1344 | " model = BertModel.from_pretrained(*args, **kwargs)\n", 1345 | " return model\n", 1346 | "\n", 1347 | "def bertForNextSentencePrediction(*args, **kwargs):\n", 1348 | " model = BertForNextSentencePrediction.from_pretrained(*args, **kwargs)\n", 1349 | " return model\n", 1350 | "\n", 1351 | "def bertForPreTraining(*args, **kwargs):\n", 1352 | " model = BertForPreTraining.from_pretrained(*args, **kwargs)\n", 1353 | " return model\n", 1354 | "\n", 1355 | "def bertForMaskedLM(*args, **kwargs):\n", 1356 | " model = BertForMaskedLM.from_pretrained(*args, **kwargs)\n", 1357 | " return model\n", 1358 | "\n", 1359 | "def bertForSequenceClassification(*args, **kwargs):\n", 1360 | " model = BertForSequenceClassification.from_pretrained(*args, **kwargs)\n", 1361 | " return model\n", 1362 | "\n", 1363 | "def bertForMultipleChoice(*args, **kwargs):\n", 1364 | " model = BertForMultipleChoice.from_pretrained(*args, **kwargs)\n", 1365 | " return model\n", 1366 | "\n", 1367 | "def bertForQuestionAnswering(*args, **kwargs):\n", 1368 | " model = BertForQuestionAnswering.from_pretrained(*args, **kwargs)\n", 1369 | " 
return model\n", 1370 | "\n", 1371 | "def bertForTokenClassification(*args, **kwargs):\n", 1372 | " model = BertForTokenClassification.from_pretrained(*args, **kwargs)\n", 1373 | " return model" 1374 | ] 1375 | }, 1376 | { 1377 | "cell_type": "markdown", 1378 | "metadata": {}, 1379 | "source": [ 1380 | "## Bert Fine-Tuning" 1381 | ] 1382 | }, 1383 | { 1384 | "cell_type": "code", 1385 | "execution_count": null, 1386 | "metadata": {}, 1387 | "outputs": [], 1388 | "source": [ 1389 | "import argparse\n", 1390 | "import logging\n", 1391 | "import os\n", 1392 | "import random\n", 1393 | "from io import open\n", 1394 | "\n", 1395 | "import numpy as np\n", 1396 | "import torch\n", 1397 | "from torch.utils.data import DataLoader, Dataset, RandomSampler\n", 1398 | "from torch.utils.data.distributed import DistributedSampler\n", 1399 | "from tqdm import tqdm, trange\n", 1400 | "\n", 1401 | "from pytorch_pretrained_bert.modeling import BertForPreTraining\n", 1402 | "from pytorch_pretrained_bert.tokenization import BertTokenizer\n", 1403 | "from pytorch_pretrained_bert.optimization import BertAdam, WarmupLinearSchedule\n", 1404 | "from torch.nn import CrossEntropyLoss\n", 1405 | "\n", 1406 | "\n", 1407 | "class BERTDataset(Dataset):\n", 1408 | " def __init__(self, corpus_path, tokenizer, seq_len, encoding=\"utf-8\", corpus_lines=None, on_memory=True):\n", 1409 | " self.vocab = tokenizer.vocab\n", 1410 | " self.tokenizer = tokenizer\n", 1411 | " self.seq_len = seq_len\n", 1412 | " self.on_memory = on_memory\n", 1413 | " self.corpus_lines = corpus_lines # number of non-empty lines in input corpus\n", 1414 | " self.corpus_path = corpus_path\n", 1415 | " self.encoding = encoding\n", 1416 | " self.current_doc = 0 # to avoid random sentence from same doc\n", 1417 | "\n", 1418 | " # for loading samples directly from file\n", 1419 | " self.sample_counter = 0 # used to keep track of full epochs on file\n", 1420 | " self.line_buffer = None # keep second sentence of a pair in memory and use as first sentence in next pair\n", 1421 | "\n", 1422 | " # for loading samples in memory\n", 1423 | " self.current_random_doc = 0\n", 1424 | " self.num_docs = 0\n", 1425 | " self.sample_to_doc = [] # map sample index to doc and line\n", 1426 | "\n", 1427 | " # load samples into memory\n", 1428 | " if on_memory:\n", 1429 | " self.all_docs = []\n", 1430 | " doc = []\n", 1431 | " self.corpus_lines = 0\n", 1432 | " with open(corpus_path, \"r\", encoding=encoding) as f:\n", 1433 | " for line in tqdm(f, desc=\"Loading Dataset\", total=corpus_lines):\n", 1434 | " line = line.strip()\n", 1435 | " if line == \"\":\n", 1436 | " self.all_docs.append(doc)\n", 1437 | " doc = []\n", 1438 | " #remove last added sample because there won't be a subsequent line anymore in the doc\n", 1439 | " self.sample_to_doc.pop()\n", 1440 | " else:\n", 1441 | " #store as one sample\n", 1442 | " sample = {\"doc_id\": len(self.all_docs),\n", 1443 | " \"line\": len(doc)}\n", 1444 | " self.sample_to_doc.append(sample)\n", 1445 | " doc.append(line)\n", 1446 | " self.corpus_lines = self.corpus_lines + 1\n", 1447 | "\n", 1448 | " # if last row in file is not empty\n", 1449 | " if self.all_docs[-1] != doc:\n", 1450 | " self.all_docs.append(doc)\n", 1451 | " self.sample_to_doc.pop()\n", 1452 | "\n", 1453 | " self.num_docs = len(self.all_docs)\n", 1454 | "\n", 1455 | " # load samples later lazily from disk\n", 1456 | " else:\n", 1457 | " if self.corpus_lines is None:\n", 1458 | " with open(corpus_path, \"r\", encoding=encoding) as f:\n", 1459 | " self.corpus_lines = 
0\n", 1460 | " for line in tqdm(f, desc=\"Loading Dataset\", total=corpus_lines):\n", 1461 | " if line.strip() == \"\":\n", 1462 | " self.num_docs += 1\n", 1463 | " else:\n", 1464 | " self.corpus_lines += 1\n", 1465 | "\n", 1466 | " # if doc does not end with empty line\n", 1467 | " if line.strip() != \"\":\n", 1468 | " self.num_docs += 1\n", 1469 | "\n", 1470 | " self.file = open(corpus_path, \"r\", encoding=encoding)\n", 1471 | " self.random_file = open(corpus_path, \"r\", encoding=encoding)\n", 1472 | "\n", 1473 | " def __len__(self):\n", 1474 | " # last line of doc won't be used, because there's no \"nextSentence\". Additionally, we start counting at 0.\n", 1475 | " return self.corpus_lines - self.num_docs - 1\n", 1476 | "\n", 1477 | " def __getitem__(self, item):\n", 1478 | " cur_id = self.sample_counter\n", 1479 | " self.sample_counter += 1\n", 1480 | " if not self.on_memory:\n", 1481 | " # after one epoch we start again from beginning of file\n", 1482 | " if cur_id != 0 and (cur_id % len(self) == 0):\n", 1483 | " self.file.close()\n", 1484 | " self.file = open(self.corpus_path, \"r\", encoding=self.encoding)\n", 1485 | "\n", 1486 | " t1, t2, is_next_label = self.random_sent(item)\n", 1487 | "\n", 1488 | " # tokenize\n", 1489 | " tokens_a = self.tokenizer.tokenize(t1)\n", 1490 | " tokens_b = self.tokenizer.tokenize(t2)\n", 1491 | "\n", 1492 | " # combine to one sample\n", 1493 | " cur_example = InputExample(guid=cur_id, tokens_a=tokens_a, tokens_b=tokens_b, is_next=is_next_label)\n", 1494 | "\n", 1495 | " # transform sample to features\n", 1496 | " cur_features = convert_example_to_features(cur_example, self.seq_len, self.tokenizer)\n", 1497 | "\n", 1498 | " cur_tensors = (torch.tensor(cur_features.input_ids),\n", 1499 | " torch.tensor(cur_features.input_mask),\n", 1500 | " torch.tensor(cur_features.segment_ids),\n", 1501 | " torch.tensor(cur_features.lm_label_ids),\n", 1502 | " torch.tensor(cur_features.is_next))\n", 1503 | "\n", 1504 | " return cur_tensors\n", 1505 | "\n", 1506 | " def random_sent(self, index):\n", 1507 | " \"\"\"\n", 1508 | " Get one sample from corpus consisting of two sentences. With prob. 50% these are two subsequent sentences\n", 1509 | " from one doc. 
With 50% the second sentence will be a random one from another doc.\n", 1510 | " :param index: int, index of sample.\n", 1511 | " :return: (str, str, int), sentence 1, sentence 2, isNextSentence Label\n", 1512 | " \"\"\"\n", 1513 | " t1, t2 = self.get_corpus_line(index)\n", 1514 | " t = random.random()\n", 1515 | " if t > 0.5:\n", 1516 | " label = 0\n", 1517 | " else:\n", 1518 | " t2 = self.get_random_line()\n", 1519 | " label = 1\n", 1520 | "\n", 1521 | " assert len(t1) > 0\n", 1522 | " assert len(t2) > 0\n", 1523 | " return t1, t2, label\n", 1524 | "\n", 1525 | " def get_corpus_line(self, item):\n", 1526 | " \"\"\"\n", 1527 | " Get one sample from corpus consisting of a pair of two subsequent lines from the same doc.\n", 1528 | " :param item: int, index of sample.\n", 1529 | " :return: (str, str), two subsequent sentences from corpus\n", 1530 | " \"\"\"\n", 1531 | " t1 = \"\"\n", 1532 | " t2 = \"\"\n", 1533 | " assert item < self.corpus_lines\n", 1534 | " if self.on_memory:\n", 1535 | " sample = self.sample_to_doc[item]\n", 1536 | " t1 = self.all_docs[sample[\"doc_id\"]][sample[\"line\"]]\n", 1537 | " t2 = self.all_docs[sample[\"doc_id\"]][sample[\"line\"]+1]\n", 1538 | " # used later to avoid random nextSentence from same doc\n", 1539 | " self.current_doc = sample[\"doc_id\"]\n", 1540 | " return t1, t2\n", 1541 | " else:\n", 1542 | " if self.line_buffer is None:\n", 1543 | " # read first non-empty line of file\n", 1544 | " while t1 == \"\" :\n", 1545 | " t1 = next(self.file).strip()\n", 1546 | " t2 = next(self.file).strip()\n", 1547 | " else:\n", 1548 | " # use t2 from previous iteration as new t1\n", 1549 | " t1 = self.line_buffer\n", 1550 | " t2 = next(self.file).strip()\n", 1551 | " # skip empty rows that are used for separating documents and keep track of current doc id\n", 1552 | " while t2 == \"\" or t1 == \"\":\n", 1553 | " t1 = next(self.file).strip()\n", 1554 | " t2 = next(self.file).strip()\n", 1555 | " self.current_doc = self.current_doc+1\n", 1556 | " self.line_buffer = t2\n", 1557 | "\n", 1558 | " assert t1 != \"\"\n", 1559 | " assert t2 != \"\"\n", 1560 | " return t1, t2\n", 1561 | "\n", 1562 | " def get_random_line(self):\n", 1563 | " \"\"\"\n", 1564 | " Get random line from another document for nextSentence task.\n", 1565 | " :return: str, content of one line\n", 1566 | " \"\"\"\n", 1567 | " # Similar to original tf repo: This outer loop should rarely go for more than one iteration for large\n", 1568 | " # corpora. 
However, just to be careful, we try to make sure that\n", 1569 | "        # the random document is not the same as the document we're processing.\n", 1570 | "        for _ in range(10):\n", 1571 | "            if self.on_memory:\n", 1572 | "                rand_doc_idx = random.randint(0, len(self.all_docs)-1)\n", 1573 | "                rand_doc = self.all_docs[rand_doc_idx]\n", 1574 | "                line = rand_doc[random.randrange(len(rand_doc))]\n", 1575 | "            else:\n", 1576 | "                rand_index = random.randint(1, self.corpus_lines if self.corpus_lines < 1000 else 1000)\n", 1577 | "                # rand_index = 892\n", 1578 | "                #pick random line\n", 1579 | "                for _ in range(rand_index):\n", 1580 | "                    line = self.get_next_line()\n", 1581 | "                \n", 1582 | "            #check if our picked random line is really from another doc like we want it to be\n", 1583 | "            if self.current_random_doc != self.current_doc:\n", 1584 | "                \n", 1585 | "                break\n", 1586 | "        # print(\"random Index:\", rand_index, line)\n", 1587 | "        return line\n", 1588 | "\n", 1589 | "    def get_next_line(self):\n", 1590 | "        \"\"\" Gets next line of random_file and starts over when reaching end of file\"\"\"\n", 1591 | "        try:\n", 1592 | "            line = next(self.random_file).strip()\n", 1593 | "\n", 1594 | "            #keep track of which document we are currently looking at to later avoid having the same doc as t1\n", 1595 | "            while line == \"\":\n", 1596 | "                self.current_random_doc = self.current_random_doc + 1\n", 1597 | "                line = next(self.random_file).strip()\n", 1598 | "        except StopIteration:\n", 1599 | "            self.random_file.close()\n", 1600 | "            self.random_file = open(self.corpus_path, \"r\", encoding=self.encoding)\n", 1601 | "            line = next(self.random_file).strip()\n", 1602 | "        \n", 1603 | "        return line\n", 1604 | "\n", 1605 | "\n", 1606 | "class InputExample(object):\n", 1607 | "    \"\"\"A single training/test example for the language model.\"\"\"\n", 1608 | "\n", 1609 | "    def __init__(self, guid, tokens_a, tokens_b=None, is_next=None, lm_labels=None):\n", 1610 | "        \"\"\"Constructs an InputExample.\n", 1611 | "        Args:\n", 1612 | "            guid: Unique id for the example.\n", 1613 | "            tokens_a: string. The untokenized text of the first sequence. For single\n", 1614 | "            sequence tasks, only this sequence must be specified.\n", 1615 | "            tokens_b: (Optional) string. The untokenized text of the second sequence.\n", 1616 | "            Must only be specified for sequence pair tasks.\n", 1617 | "            label: (Optional) string. The label of the example. 
This should be\n", 1618 | "        specified for train and dev examples, but not for test examples.\n", 1619 | "        \"\"\"\n", 1620 | "        self.guid = guid\n", 1621 | "        self.tokens_a = tokens_a\n", 1622 | "        self.tokens_b = tokens_b\n", 1623 | "        self.is_next = is_next  # nextSentence\n", 1624 | "        self.lm_labels = lm_labels  # masked words for language model\n", 1625 | "\n", 1626 | "\n", 1627 | "class InputFeatures(object):\n", 1628 | "    \"\"\"A single set of features of data.\"\"\"\n", 1629 | "\n", 1630 | "    def __init__(self, input_ids, input_mask, segment_ids, is_next, lm_label_ids):\n", 1631 | "        self.input_ids = input_ids\n", 1632 | "        self.input_mask = input_mask\n", 1633 | "        self.segment_ids = segment_ids\n", 1634 | "        self.is_next = is_next\n", 1635 | "        self.lm_label_ids = lm_label_ids\n", 1636 | "\n", 1637 | "logger = logging.getLogger(__name__)  # module-level logger used by random_word / convert_example_to_features\n", 1638 | "def random_word(tokens, tokenizer):\n", 1639 | "    \"\"\"\n", 1640 | "    Masking some random tokens for Language Model task with probabilities as in the original BERT paper.\n", 1641 | "    :param tokens: list of str, tokenized sentence.\n", 1642 | "    :param tokenizer: Tokenizer, object used for tokenization (we need its vocab here)\n", 1643 | "    :return: (list of str, list of int), masked tokens and related labels for LM prediction\n", 1644 | "    \"\"\"\n", 1645 | "    output_label = []\n", 1646 | "\n", 1647 | "    for i, token in enumerate(tokens):\n", 1648 | "        prob = random.random()\n", 1649 | "        # mask token with 15% probability\n", 1650 | "        if prob < 0.15:\n", 1651 | "            prob /= 0.15\n", 1652 | "\n", 1653 | "            # 80% randomly change token to mask token\n", 1654 | "            if prob < 0.8:\n", 1655 | "                tokens[i] = \"[MASK]\"\n", 1656 | "\n", 1657 | "            # 10% randomly change token to random token\n", 1658 | "            elif prob < 0.9:\n", 1659 | "                tokens[i] = random.choice(list(tokenizer.vocab.items()))[0]\n", 1660 | "\n", 1661 | "            # -> rest 10% randomly keep current token\n", 1662 | "\n", 1663 | "            # append current token to output (we will predict these later)\n", 1664 | "            try:\n", 1665 | "                output_label.append(tokenizer.vocab[token])\n", 1666 | "            except KeyError:\n", 1667 | "                # For unknown words (should not occur with BPE vocab)\n", 1668 | "                output_label.append(tokenizer.vocab[\"[UNK]\"])\n", 1669 | "                logger.warning(\"Cannot find token '{}' in vocab. 
Using [UNK] instead\".format(token))\n", 1670 | "        else:\n", 1671 | "            # no masking token (will be ignored by loss function later)\n", 1672 | "            output_label.append(-1)\n", 1673 | "\n", 1674 | "    return tokens, output_label\n", 1675 | "\n", 1676 | "\n", 1677 | "def convert_example_to_features(example, max_seq_length, tokenizer):\n", 1678 | "    \"\"\"\n", 1679 | "    Convert a raw sample (pair of sentences as tokenized strings) into a proper training sample with\n", 1680 | "    IDs, LM labels, input_mask, CLS and SEP tokens etc.\n", 1681 | "    :param example: InputExample, containing sentence input as strings and is_next label\n", 1682 | "    :param max_seq_length: int, maximum length of sequence.\n", 1683 | "    :param tokenizer: Tokenizer\n", 1684 | "    :return: InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)\n", 1685 | "    \"\"\"\n", 1686 | "    tokens_a = example.tokens_a\n", 1687 | "    tokens_b = example.tokens_b\n", 1688 | "    # Modifies `tokens_a` and `tokens_b` in place so that the total\n", 1689 | "    # length is less than the specified length.\n", 1690 | "    # Account for [CLS], [SEP], [SEP] with \"- 3\"\n", 1691 | "    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)\n", 1692 | "\n", 1693 | "    tokens_a, t1_label = random_word(tokens_a, tokenizer)\n", 1694 | "    tokens_b, t2_label = random_word(tokens_b, tokenizer)\n", 1695 | "    # concatenate lm labels and account for CLS, SEP, SEP\n", 1696 | "    lm_label_ids = ([-1] + t1_label + [-1] + t2_label + [-1])\n", 1697 | "\n", 1698 | "    # The convention in BERT is:\n", 1699 | "    # (a) For sequence pairs:\n", 1700 | "    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]\n", 1701 | "    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1\n", 1702 | "    # (b) For single sequences:\n", 1703 | "    #  tokens:   [CLS] the dog is hairy . [SEP]\n", 1704 | "    #  type_ids: 0     0   0   0  0     0 0\n", 1705 | "    #\n", 1706 | "    # Where \"type_ids\" are used to indicate whether this is the first\n", 1707 | "    # sequence or the second sequence. The embedding vectors for `type=0` and\n", 1708 | "    # `type=1` were learned during pre-training and are added to the wordpiece\n", 1709 | "    # embedding vector (and position vector). This is not *strictly* necessary\n", 1710 | "    # since the [SEP] token unambiguously separates the sequences, but it makes\n", 1711 | "    # it easier for the model to learn the concept of sequences.\n", 1712 | "    #\n", 1713 | "    # For classification tasks, the first vector (corresponding to [CLS]) is\n", 1714 | "    # used as the \"sentence vector\". Note that this only makes sense because\n", 1715 | "    # the entire model is fine-tuned.\n", 1716 | "    tokens = []\n", 1717 | "    segment_ids = []\n", 1718 | "    tokens.append(\"[CLS]\")\n", 1719 | "    segment_ids.append(0)\n", 1720 | "    for token in tokens_a:\n", 1721 | "        tokens.append(token)\n", 1722 | "        segment_ids.append(0)\n", 1723 | "    tokens.append(\"[SEP]\")\n", 1724 | "    segment_ids.append(0)\n", 1725 | "\n", 1726 | "    assert len(tokens_b) > 0\n", 1727 | "    for token in tokens_b:\n", 1728 | "        tokens.append(token)\n", 1729 | "        segment_ids.append(1)\n", 1730 | "    tokens.append(\"[SEP]\")\n", 1731 | "    segment_ids.append(1)\n", 1732 | "\n", 1733 | "    input_ids = tokenizer.convert_tokens_to_ids(tokens)\n", 1734 | "\n", 1735 | "    # The mask has 1 for real tokens and 0 for padding tokens. 
Only real\n", 1736 | " # tokens are attended to.\n", 1737 | " input_mask = [1] * len(input_ids)\n", 1738 | "\n", 1739 | " # Zero-pad up to the sequence length.\n", 1740 | " while len(input_ids) < max_seq_length:\n", 1741 | " input_ids.append(0)\n", 1742 | " input_mask.append(0)\n", 1743 | " segment_ids.append(0)\n", 1744 | " lm_label_ids.append(-1)\n", 1745 | "\n", 1746 | " assert len(input_ids) == max_seq_length\n", 1747 | " assert len(input_mask) == max_seq_length\n", 1748 | " assert len(segment_ids) == max_seq_length\n", 1749 | " assert len(lm_label_ids) == max_seq_length\n", 1750 | "\n", 1751 | " if example.guid < 5:\n", 1752 | " logger.info(\"*** Example ***\")\n", 1753 | " logger.info(\"guid: %s\" % (example.guid))\n", 1754 | " logger.info(\"tokens: %s\" % \" \".join(\n", 1755 | " [str(x) for x in tokens]))\n", 1756 | " logger.info(\"input_ids: %s\" % \" \".join([str(x) for x in input_ids]))\n", 1757 | " logger.info(\"input_mask: %s\" % \" \".join([str(x) for x in input_mask]))\n", 1758 | " logger.info(\n", 1759 | " \"segment_ids: %s\" % \" \".join([str(x) for x in segment_ids]))\n", 1760 | " logger.info(\"LM label: %s \" % (lm_label_ids))\n", 1761 | " logger.info(\"Is next sentence label: %s \" % (example.is_next))\n", 1762 | "\n", 1763 | " features = InputFeatures(input_ids=input_ids,\n", 1764 | " input_mask=input_mask,\n", 1765 | " segment_ids=segment_ids,\n", 1766 | " lm_label_ids=lm_label_ids,\n", 1767 | " is_next=example.is_next)\n", 1768 | " return features" 1769 | ] 1770 | }, 1771 | { 1772 | "cell_type": "markdown", 1773 | "metadata": { 1774 | "ExecuteTime": { 1775 | "end_time": "2019-09-10T08:38:21.419881Z", 1776 | "start_time": "2019-09-10T08:38:21.415733Z" 1777 | } 1778 | }, 1779 | "source": [ 1780 | "## Build Vocabulary" 1781 | ] 1782 | }, 1783 | { 1784 | "cell_type": "code", 1785 | "execution_count": null, 1786 | "metadata": {}, 1787 | "outputs": [], 1788 | "source": [ 1789 | "class MTLField(Field):\n", 1790 | "\n", 1791 | " def __init__(\n", 1792 | " self, **kwargs):\n", 1793 | " super(MTLField, self).__init__(**kwargs)\n", 1794 | "\n", 1795 | " def build_vocab(self, dataset_list, **kwargs):\n", 1796 | " ## Load BERT\n", 1797 | " counter = Counter()\n", 1798 | " sources = []\n", 1799 | " for arg in dataset_list:\n", 1800 | " if isinstance(arg, Dataset):\n", 1801 | " sources += [getattr(arg, name) for name, field in\n", 1802 | " arg.fields.items() if field is self]\n", 1803 | " else:\n", 1804 | " sources.append(arg)\n", 1805 | " for data in sources:\n", 1806 | " for x in data:\n", 1807 | " if not self.sequential:\n", 1808 | " x = [x]\n", 1809 | " counter.update(x)\n", 1810 | " specials = list(OrderedDict.fromkeys(\n", 1811 | " tok for tok in [self.pad_token, self.init_token, self.eos_token]\n", 1812 | " if tok is not None))\n", 1813 | " self.vocab = Vocab(counter, specials=specials, **kwargs)" 1814 | ] 1815 | }, 1816 | { 1817 | "cell_type": "markdown", 1818 | "metadata": {}, 1819 | "source": [ 1820 | "## MAML CNN Classifier" 1821 | ] 1822 | }, 1823 | { 1824 | "cell_type": "code", 1825 | "execution_count": null, 1826 | "metadata": {}, 1827 | "outputs": [], 1828 | "source": [ 1829 | "import sys, os, glob, random\n", 1830 | "import time\n", 1831 | "import parser\n", 1832 | "import torch\n", 1833 | "import torch.nn as nn\n", 1834 | "# from AdaAdam import AdaAdam\n", 1835 | "import torch.optim as OPT\n", 1836 | "import numpy as np\n", 1837 | "from copy import deepcopy\n", 1838 | "from tqdm import tqdm, trange\n", 1839 | "import logging\n", 1840 | "\n", 1841 | "from torchtext 
import data\n", 1842 | "import DataProcessing\n", 1843 | "from DataProcessing.MLTField import MTLField\n", 1844 | "from DataProcessing.NlcDatasetSingleFile import NlcDatasetSingleFile\n", 1845 | "from CNNModel import CNNModel\n", 1846 | "\n", 1847 | "\n", 1848 | "logger = logging.getLogger(__name__)\n", 1849 | "\n", 1850 | "logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',\n", 1851 | " datefmt = '%m/%d/%Y %H:%M:%S',\n", 1852 | " level = logging.INFO )\n", 1853 | "batch_size = 10\n", 1854 | "seed = 12345678\n", 1855 | "torch.manual_seed(seed)\n", 1856 | "Train = False\n", 1857 | "\n", 1858 | "\n", 1859 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n", 1860 | "n_gpu = torch.cuda.device_count()\n", 1861 | "random.seed(seed)\n", 1862 | "np.random.seed(seed)\n", 1863 | "torch.manual_seed(seed)\n", 1864 | "if n_gpu > 0:\n", 1865 | " torch.cuda.manual_seed_all(seed)\n", 1866 | "\n", 1867 | "def load_train_test_files(listfilename, test_suffix='.test'):\n", 1868 | " filein = open(listfilename, 'r')\n", 1869 | " file_tuples = []\n", 1870 | " task_classes = ['.t2', '.t4', '.t5']\n", 1871 | " for line in filein:\n", 1872 | " array = line.strip().split('\\t')\n", 1873 | " line = array[0]\n", 1874 | " for t_class in task_classes:\n", 1875 | " trainfile = line + t_class + '.train'\n", 1876 | " devfile = line + t_class + '.dev'\n", 1877 | " testfile = line + t_class + test_suffix\n", 1878 | " file_tuples.append((trainfile, devfile, testfile))\n", 1879 | " filein.close()\n", 1880 | " return file_tuples\n", 1881 | "\n", 1882 | "filelist = 'data/Amazon_few_shot/workspace.filtered.list'\n", 1883 | "targetlist = 'data/Amazon_few_shot/workspace.target.list'\n", 1884 | "workingdir = 'data/Amazon_few_shot'\n", 1885 | "emfilename = 'glove.6B.300d'\n", 1886 | "emfiledir = '..'\n", 1887 | "\n", 1888 | "datasets = []\n", 1889 | "list_datasets = []\n", 1890 | "\n", 1891 | "\n", 1892 | "file_tuples = load_train_test_files(filelist)\n", 1893 | "print(file_tuples)\n", 1894 | "\n", 1895 | "TEXT = MTLField(lower=True)\n", 1896 | "for (trainfile, devfile, testfile) in file_tuples:\n", 1897 | " print(trainfile, devfile, testfile)\n", 1898 | " LABEL1 = data.Field(sequential=False)\n", 1899 | " train1, dev1, test1 = NlcDatasetSingleFile.splits(\n", 1900 | " TEXT, LABEL1, path=workingdir, train=trainfile,\n", 1901 | " validation=devfile, test=testfile)\n", 1902 | " datasets.append((TEXT, LABEL1, train1, dev1, test1))\n", 1903 | " list_datasets.append(train1)\n", 1904 | " list_datasets.append(dev1)\n", 1905 | " list_datasets.append(test1)\n", 1906 | "\n", 1907 | "target_datasets = []\n", 1908 | "target_file = load_train_test_files(targetlist)\n", 1909 | "print(target_file)\n", 1910 | "\n", 1911 | "for (trainfile, devfile, testfile) in target_file:\n", 1912 | " print(trainfile, devfile, testfile)\n", 1913 | " LABEL2 = data.Field(sequential=False)\n", 1914 | " train2, dev2, test2 = NlcDatasetSingleFile.splits(TEXT, LABEL2, path=workingdir, \n", 1915 | " train=trainfile,validation=devfile, test=testfile)\n", 1916 | " target_datasets.append((TEXT, LABEL2, train2, dev2, test2))\n", 1917 | "\n", 1918 | " \n", 1919 | "\n", 1920 | "datasets_iters = []\n", 1921 | "for (TEXT, LABEL, train, dev, test) in datasets:\n", 1922 | " train_iter, dev_iter, test_iter = data.BucketIterator.splits(\n", 1923 | " (train, dev, test), batch_size=batch_size, device=device,shuffle=True)\n", 1924 | " train_iter.repeat = False\n", 1925 | " datasets_iters.append((train_iter, dev_iter, 
test_iter))\n", 1926 | "\n", 1927 | "fsl_ds_iters = []\n", 1928 | "for (TEXT, LABEL, train, dev, test) in target_datasets:\n", 1929 | " train_iter, dev_iter, test_iter = data.BucketIterator.splits(\n", 1930 | " (train,dev, test), batch_size=batch_size, device=device)\n", 1931 | " train_iter.repeat = False\n", 1932 | " fsl_ds_iters.append((train_iter, dev_iter, test_iter))\n", 1933 | "\n", 1934 | "num_batch_total = 0\n", 1935 | "for i, (TEXT, LABEL, train, dev, test) in enumerate(datasets):\n", 1936 | " # print('DATASET%d'%(i+1))\n", 1937 | " # print('train.fields', train.fields)\n", 1938 | " # print('len(train)', len(train))\n", 1939 | " # print('len(dev)', len(dev))\n", 1940 | " # print('len(test)', len(test))\n", 1941 | " # print('vars(train[0])', vars(train[0]))\n", 1942 | " num_batch_total += len(train) / batch_size\n", 1943 | "\n", 1944 | "TEXT.build_vocab(list_datasets, vectors = emfilename, vectors_cache = emfiledir)\n", 1945 | "# TEXT.build_vocab(list_dataset)\n", 1946 | "\n", 1947 | "# build the vocabulary\n", 1948 | "for taskid, (TEXT, LABEL, train, dev, test) in enumerate(datasets):\n", 1949 | " LABEL.build_vocab(train, dev, test)\n", 1950 | " LABEL.vocab.itos = LABEL.vocab.itos[1:]\n", 1951 | "\n", 1952 | " for k, v in LABEL.vocab.stoi.items():\n", 1953 | " LABEL.vocab.stoi[k] = v - 1\n", 1954 | "\n", 1955 | " # print vocab information\n", 1956 | " # print('len(TEXT.vocab)', len(TEXT.vocab))\n", 1957 | " # print('TEXT.vocab.vectors.size()', TEXT.vocab.vectors.size())\n", 1958 | "\n", 1959 | " # print(LABEL.vocab.itos)\n", 1960 | " # print(len(LABEL.vocab.itos))\n", 1961 | "\n", 1962 | " # print(len(LABEL.vocab.stoi))\n", 1963 | "fsl_num_tasks = 0\n", 1964 | "for taskid, (TEXT, LABEL, train, dev, test) in enumerate(target_datasets):\n", 1965 | " fsl_num_tasks += 1\n", 1966 | " LABEL.build_vocab(train, dev, test)\n", 1967 | " LABEL.vocab.itos = LABEL.vocab.itos[1:]\n", 1968 | " for k, v in LABEL.vocab.stoi.items():\n", 1969 | " LABEL.vocab.stoi[k] = v - 1\n", 1970 | "\n", 1971 | "nums_embed = len(TEXT.vocab)\n", 1972 | "dim_embed = 100\n", 1973 | "dim_w_hid = 200\n", 1974 | "dim_h_hid = 100\n", 1975 | "Inner_lr = 2e-6\n", 1976 | "Outer_lr = 1e-5\n", 1977 | "\n", 1978 | "n_labels = []\n", 1979 | "for (TEXT, LABEL, train, dev, test) in datasets:\n", 1980 | " n_labels.append(len(LABEL.vocab))\n", 1981 | "print(n_labels)\n", 1982 | "num_tasks = len(n_labels)\n", 1983 | "print(\"num_tasks\", num_tasks)\n", 1984 | "winsize = 3\n", 1985 | "num_labels = len(LABEL.vocab.itos)\n", 1986 | "model = CNNModel(nums_embed, num_labels, dim_embed, dim_w_hid, dim_h_hid, winsize, batch_size)\n", 1987 | "\n", 1988 | "print(\"GPU Device: \", device)\n", 1989 | "model.to(device)\n", 1990 | "print(model)\n", 1991 | "\n", 1992 | "criterion = nn.CrossEntropyLoss()\n", 1993 | "opt = OPT.Adam(model.parameters(), lr=Inner_lr)\n", 1994 | "Inner_epochs = 4\n", 1995 | "epochs = 2\n", 1996 | "\n", 1997 | "N_task = 5\n", 1998 | "\n", 1999 | "task_list = np.arange(num_tasks)\n", 2000 | "print(\"Total Batch: \", num_batch_total)\n", 2001 | "output_model_file = '/tmp/CNN_MAML_output'\n", 2002 | "if Train:\n", 2003 | " for t in trange(int(num_batch_total*epochs/Inner_epochs), desc=\"Iterations\"):\n", 2004 | " selected_task = np.random.choice(task_list, N_task,replace=False)\n", 2005 | " weight_before = deepcopy(model.state_dict())\n", 2006 | " update_vars = []\n", 2007 | " fomaml_vars = []\n", 2008 | " for task_id in selected_task:\n", 2009 | " # print(task_id)\n", 2010 | " (train_iter, dev_iter, test_iter) = 
datasets_iters[task_id]\n", 2011 | " train_iter.init_epoch()\n", 2012 | " model.train()\n", 2013 | " n_correct = 0\n", 2014 | " n_step = 0\n", 2015 | " for inner_iter in range(Inner_epochs):\n", 2016 | " batch = next(iter(train_iter))\n", 2017 | "\n", 2018 | " # print(batch.text)\n", 2019 | " # print(batch.label)\n", 2020 | " logits = model(batch.text)\n", 2021 | " loss = criterion(logits.view(-1, num_labels), batch.label.data.view(-1))\n", 2022 | " \n", 2023 | "\n", 2024 | " n_correct = (torch.max(logits, 1)[1].view(batch.label.size()).data == batch.label.data).sum()\n", 2025 | " n_step = batch.batch_size\n", 2026 | " loss.backward()\n", 2027 | " opt.step()\n", 2028 | " opt.zero_grad()\n", 2029 | " task_acc = 100.*n_correct/n_step\n", 2030 | " if t%10 == 0:\n", 2031 | " logger.info(\"Iter: %d, task id: %d, train acc: %f\", t, task_id, task_acc)\n", 2032 | " weight_after = deepcopy(model.state_dict())\n", 2033 | " update_vars.append(weight_after)\n", 2034 | " model.load_state_dict(weight_before)\n", 2035 | "\n", 2036 | " new_weight_dict = {}\n", 2037 | " for name in weight_before:\n", 2038 | " weight_list = [tmp_weight_dict[name] for tmp_weight_dict in update_vars]\n", 2039 | " weight_shape = list(weight_list[0].size())\n", 2040 | " stack_shape = [len(weight_list)] + weight_shape\n", 2041 | " stack_weight = torch.empty(stack_shape)\n", 2042 | " for i in range(len(weight_list)):\n", 2043 | " stack_weight[i,:] = weight_list[i] \n", 2044 | " new_weight_dict[name] = torch.mean(stack_weight, dim=0).cuda()\n", 2045 | " new_weight_dict[name] = weight_before[name]+(new_weight_dict[name]-weight_before[name])/Inner_lr*Outer_lr\n", 2046 | " model.load_state_dict(new_weight_dict)\n", 2047 | "\n", 2048 | "\n", 2049 | " torch.save(model.state_dict(), output_model_file)\n", 2050 | "\n", 2051 | "model.load_state_dict(torch.load(output_model_file))\n", 2052 | "logger.info(\"***** Running evaluation *****\")\n", 2053 | "fsl_task_list = np.arange(fsl_num_tasks)\n", 2054 | "weight_before = deepcopy(model.state_dict())\n", 2055 | "fsl_epochs = 3\n", 2056 | "Total_acc = 0\n", 2057 | "opt = OPT.Adam(model.parameters(), lr=3e-4)\n", 2058 | "\n", 2059 | "for task_id in fsl_task_list:\n", 2060 | " model.train()\n", 2061 | " (train_iter, dev_iter, test_iter) = fsl_ds_iters[task_id]\n", 2062 | " train_iter.init_epoch()\n", 2063 | " batch = next(iter(train_iter))\n", 2064 | " for i in range(fsl_epochs):\n", 2065 | " logits = model(batch.text)\n", 2066 | " loss = criterion(logits.view(-1, num_labels), batch.label.data.view(-1))\n", 2067 | " n_correct = (torch.max(logits, 1)[1].view(batch.label.size()).data == batch.label.data).sum()\n", 2068 | " n_size = batch.batch_size\n", 2069 | " train_acc = 100. 
* n_correct / n_size\n", 2070 | " loss = criterion(logits.view(-1, num_labels), batch.label.data.view(-1))\n", 2071 | " loss.backward()\n", 2072 | " opt.step()\n", 2073 | " opt.zero_grad()\n", 2074 | " logger.info(\" Task id: %d, fsl epoch: %d, Acc: %f, loss: %f\", task_id, i, train_acc, loss)\n", 2075 | "\n", 2076 | " model.eval()\n", 2077 | " test_iter.init_epoch()\n", 2078 | " n_correct = 0\n", 2079 | " n_size = 0\n", 2080 | " for test_batch_idx, test_batch in enumerate(test_iter):\n", 2081 | " with torch.no_grad():\n", 2082 | " logits = model(test_batch.text)\n", 2083 | " loss = criterion(logits.view(-1, num_labels), test_batch.label.data.view(-1))\n", 2084 | " n_correct += (torch.max(logits, 1)[1].view(test_batch.label.size()).data == test_batch.label.data).sum()\n", 2085 | " n_size += test_batch.batch_size\n", 2086 | " test_acc = 100.* n_correct/n_size\n", 2087 | " logger.info(\"FSL test Number: %d, Accuracy: %f\",n_size, test_acc)\n", 2088 | " Total_acc += test_acc\n", 2089 | " model.load_state_dict(weight_before)\n", 2090 | "\n", 2091 | "print(\"Mean Accuracy is : \", float(Total_acc)/fsl_num_tasks)\n" 2092 | ] 2093 | } 2094 | ], 2095 | "metadata": { 2096 | "kernelspec": { 2097 | "display_name": "Python 3", 2098 | "language": "python", 2099 | "name": "python3" 2100 | }, 2101 | "language_info": { 2102 | "codemirror_mode": { 2103 | "name": "ipython", 2104 | "version": 3 2105 | }, 2106 | "file_extension": ".py", 2107 | "mimetype": "text/x-python", 2108 | "name": "python", 2109 | "nbconvert_exporter": "python", 2110 | "pygments_lexer": "ipython3", 2111 | "version": "3.6.5" 2112 | }, 2113 | "latex_envs": { 2114 | "LaTeX_envs_menu_present": true, 2115 | "autoclose": false, 2116 | "autocomplete": true, 2117 | "bibliofile": "biblio.bib", 2118 | "cite_by": "apalike", 2119 | "current_citInitial": 1, 2120 | "eqLabelWithNumbers": true, 2121 | "eqNumInitial": 1, 2122 | "hotkeys": { 2123 | "equation": "Ctrl-E", 2124 | "itemize": "Ctrl-I" 2125 | }, 2126 | "labels_anchors": false, 2127 | "latex_user_defs": false, 2128 | "report_style_numbering": false, 2129 | "user_envs_cfg": false 2130 | } 2131 | }, 2132 | "nbformat": 4, 2133 | "nbformat_minor": 2 2134 | } 2135 | -------------------------------------------------------------------------------- /examples/KNNClassifierDemo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# K-NN Classifier - Example" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 13, 13 | "metadata": { 14 | "ExecuteTime": { 15 | "end_time": "2019-09-14T05:58:22.844762Z", 16 | "start_time": "2019-09-14T05:58:22.841213Z" 17 | } 18 | }, 19 | "outputs": [], 20 | "source": [ 21 | "import fsText" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 18, 27 | "metadata": { 28 | "ExecuteTime": { 29 | "end_time": "2019-09-14T05:59:21.382000Z", 30 | "start_time": "2019-09-14T05:59:21.375627Z" 31 | } 32 | }, 33 | "outputs": [ 34 | { 35 | "data": { 36 | "text/plain": [ 37 | "fsText.Classifier.CosineClassifier" 38 | ] 39 | }, 40 | "execution_count": 18, 41 | "metadata": {}, 42 | "output_type": "execute_result" 43 | } 44 | ], 45 | "source": [ 46 | "fsText.Classifier.CosineClassifier" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 17, 52 | "metadata": { 53 | "ExecuteTime": { 54 | "end_time": "2019-09-14T05:58:53.112061Z", 55 | "start_time": "2019-09-14T05:58:53.096141Z" 56 | } 57 | }, 58 | "outputs": 
[ 59 | { 60 | "ename": "ImportError", 61 | "evalue": "cannot import name 'RFClassifier'", 62 | "output_type": "error", 63 | "traceback": [ 64 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 65 | "\u001b[0;31mImportError\u001b[0m Traceback (most recent call last)", 66 | "\u001b[0;32m\n", 161 | " | Text | \n", 162 | "Label | \n", 163 | "
---|---|---|
3958 | \n", 168 | "WP_Insert_Post and GUID Issue [Wordpress] | \n", 169 | "1 | \n", 170 | "
2540 | \n", 173 | "How can I debug WordPress in IIS? | \n", 174 | "1 | \n", 175 | "
3594 | \n", 178 | "wordpress: how to get x youtube thumbnails of ... | \n", 179 | "1 | \n", 180 | "
3240 | \n", 183 | "Where to place a query to show only one post i... | \n", 184 | "1 | \n", 185 | "
3638 | \n", 188 | "Wordpress Blog RSS Feed Problems | \n", 189 | "1 | \n", 190 | "
3702 | \n", 193 | "Excluding one category in Wordpress | \n", 194 | "1 | \n", 195 | "
1720 | \n", 198 | "debuging \"register_activation_hook\" in wordpress | \n", 199 | "1 | \n", 200 | "
3622 | \n", 203 | "Wordpress \"Read more\" is not working | \n", 204 | "1 | \n", 205 | "
3895 | \n", 208 | "Create blog post simply and easily | \n", 209 | "1 | \n", 210 | "
3727 | \n", 213 | "Why is IE7 rendering these differently? | \n", 214 | "1 | \n", 215 | "
3900 | \n", 218 | "Using AJAX to load WordPress pages | \n", 219 | "1 | \n", 220 | "
3310 | \n", 223 | "WordPress Monthly Archive by Year | \n", 224 | "1 | \n", 225 | "
3521 | \n", 228 | "Is there an easier way to add menu items to a ... | \n", 229 | "1 | \n", 230 | "
538 | \n", 233 | "Wordpress: How can i move my index (blogpage) ... | \n", 234 | "1 | \n", 235 | "
3286 | \n", 238 | "wp_list_categories does not show current category | \n", 239 | "1 | \n", 240 | "
388 | \n", 243 | "What is the default URL for APEX for an Oracle... | \n", 244 | "2 | \n", 245 | "
2147 | \n", 248 | "Oracle DB: How can I write query ignoring case? | \n", 249 | "2 | \n", 250 | "
1332 | \n", 253 | "Oracle converts empty string to null but JPA d... | \n", 254 | "2 | \n", 255 | "
575 | \n", 258 | "Is there a way to do full text search of all o... | \n", 259 | "2 | \n", 260 | "
1943 | \n", 263 | "Faster 'select distinct thing_id,thing_name fr... | \n", 264 | "2 | \n", 265 | "
1564 | \n", 268 | "When did oracle start supporting \"top\": select... | \n", 269 | "2 | \n", 270 | "
2063 | \n", 273 | "Return an Oracle Ref Cursor to a SqlServer T-S... | \n", 274 | "2 | \n", 275 | "
2541 | \n", 278 | "How do I insert sysdate into a column using OD... | \n", 279 | "2 | \n", 280 | "
981 | \n", 283 | "Is it possible to kill a single query in oracl... | \n", 284 | "2 | \n", 285 | "
2282 | \n", 288 | "Calculate difference between 2 date / times in... | \n", 289 | "2 | \n", 290 | "
404 | \n", 293 | "Oracle date | \n", 294 | "2 | \n", 295 | "
2218 | \n", 298 | "Compare strings by their written representatio... | \n", 299 | "2 | \n", 300 | "
1081 | \n", 303 | "Allow Oracle User to connect from one IP addre... | \n", 304 | "2 | \n", 305 | "
2346 | \n", 308 | "What's the equivalent of Oracle's to_char in A... | \n", 309 | "2 | \n", 310 | "
871 | \n", 313 | "Oracle: how to use updateXML to update multipl... | \n", 314 | "2 | \n", 315 | "
338 | \n", 318 | "subversion diff including new files | \n", 319 | "3 | \n", 320 | "
663 | \n", 323 | "Unlocking a SVN working copy which has unversi... | \n", 324 | "3 | \n", 325 | "
898 | \n", 328 | "Windows Backup for SVN Repositories | \n", 329 | "3 | \n", 330 | "
2376 | \n", 333 | "Using svn:ignore to ignore everything but cert... | \n", 334 | "3 | \n", 335 | "
1323 | \n", 338 | "How can I only commit property changes without... | \n", 339 | "3 | \n", 340 | "
2098 | \n", 343 | "How to forbit subversion commits to svn:extern... | \n", 344 | "3 | \n", 345 | "
136 | \n", 348 | "How can I speed up SVN updates? | \n", 349 | "3 | \n", 350 | "
57 | \n", 353 | "Begining SVN | \n", 354 | "3 | \n", 355 | "
1013 | \n", 358 | "Is there any way to only update added files? | \n", 359 | "3 | \n", 360 | "
1137 | \n", 363 | "Change Revesion Number in Subversion, even if ... | \n", 364 | "3 | \n", 365 | "
2027 | \n", 368 | "svn create tag problem | \n", 369 | "3 | \n", 370 | "
708 | \n", 373 | "Is there some way to commit a file \"partially\"... | \n", 374 | "3 | \n", 375 | "
124 | \n", 378 | "Free Online SVN repositories | \n", 379 | "3 | \n", 380 | "
302 | \n", 383 | "Create a tag upon every build of the application? | \n", 384 | "3 | \n", 385 | "
958 | \n", 388 | "Subversion plugin to Visual Studio? | \n", 389 | "3 | \n", 390 | "
3200 | \n", 393 | "How to view error messages from ruby CGI app o... | \n", 394 | "4 | \n", 395 | "
2822 | \n", 398 | "PHP using too much memory | \n", 399 | "4 | \n", 400 | "
1831 | \n", 403 | "Getting Apache to execute command on every pag... | \n", 404 | "4 | \n", 405 | "
2901 | \n", 408 | "XAMPP: I edited PHP.ini, and now Apache crashes | \n", 409 | "4 | \n", 410 | "
195 | \n", 413 | "Restrict Apache to only allow access using SSL... | \n", 414 | "4 | \n", 415 | "
3623 | \n", 418 | "Adding slashes to the end of directories + mor... | \n", 419 | "4 | \n", 420 | "
689 | \n", 423 | "IIS equivalent of VirtualHost in Apache | \n", 424 | "4 | \n", 425 | "
2272 | \n", 428 | "Why do some page requests hang when fetching i... | \n", 429 | "4 | \n", 430 | "
1361 | \n", 433 | "Setting a header in apache | \n", 434 | "4 | \n", 435 | "
1874 | \n", 438 | "Apache Perl http headers problem | \n", 439 | "4 | \n", 440 | "
3444 | \n", 443 | ".htaccess mod-rewrite how to | \n", 444 | "4 | \n", 445 | "
2684 | \n", 448 | "Problem with URL rewriting for same .php page | \n", 449 | "4 | \n", 450 | "
299 | \n", 453 | "Top & httpd - demystifying what is actually ru... | \n", 454 | "4 | \n", 455 | "
2857 | \n", 458 | "Weird behaviour with two Trac instances under ... | \n", 459 | "4 | \n", 460 | "
3205 | \n", 463 | "defer processing during apache page render | \n", 464 | "4 | \n", 465 | "