├── CHANGES
├── CONTRIBUTORS
├── LICENSE
├── README.md
├── cyatp.py
├── data
│   ├── DATA_6_KLTQCP.json
│   ├── concept_map.json
│   ├── concept_text.json
│   ├── create_map.txt
│   └── crossword_puzzle_data.json
├── neo_db
│   ├── config.py
│   ├── create_graph.py
│   ├── cyatp.db
│   └── query_graph.py
├── requirements.txt
├── scripts
│   ├── get_5_data.py
│   ├── get_cross.py
│   ├── save_feedback.py
│   └── show_profile.py
├── static
│   ├── css
│   │   ├── bootstrap.css.map
│   │   ├── bootstrap.min.css
│   │   ├── bootstrap.min.css.map
│   │   ├── quiz_style.css
│   │   └── star-rating.css
│   ├── images
│   │   ├── concept_map.gif
│   │   ├── cyatp_screenshot.png
│   │   ├── learn.gif
│   │   ├── platform_architecture.png
│   │   ├── puzzle.gif
│   │   ├── quiz.gif
│   │   ├── share.png
│   │   ├── survey.gif
│   │   └── training_content_overview.png
│   └── js
│       ├── bootstrap.bundle.min.js
│       ├── bootstrap.bundle.min.js.map
│       ├── bootstrap.js
│       ├── bootstrap.js.map
│       ├── bootstrap.min.js
│       ├── bootstrap.min.js.map
│       ├── concept_map.json
│       ├── crossword_script.js
│       ├── echarts.min.js
│       ├── jquery-2.2.4.min.js
│       ├── jquery.crossword.js
│       ├── question_show.js
│       ├── star-rating.js
│       └── tags.js
├── templates
│   ├── 404_page.html
│   ├── back.html
│   ├── base.html
│   ├── index.html
│   ├── learn.html
│   ├── map.html
│   ├── puzzle.html
│   ├── quiz.html
│   ├── quiz_feedback.html
│   └── survey.html
└── training_content
    ├── content_guide.md
    ├── feedback
    │   ├── feedback_question.json
    │   └── feedback_survey_data.json
    └── generate_content
        ├── generate_cross.py
        ├── generate_learning_content_data.py
        └── generate_quiz
            ├── Future_Engineering.py
            ├── Generate_Choice.py
            ├── Generate_Question.py
            ├── Pickle_File.py
            ├── Predict_Answer.py
            ├── main-GQ.py
            └── requirements.txt
/CHANGES:
--------------------------------------------------------------------------------
1 |
2 | CyATP v1.0
3 | ------------
4 | * First public release of the CyATP cybersecurity awareness training
5 | platform, which includes pregenerated training content that can be
6 | used to learn about cybersecurity concepts. The quiz and crossword
7 | puzzle serious game functionality makes it possible to test and deepen
8 | one's security awareness knowledge.
9 |
--------------------------------------------------------------------------------
/CONTRIBUTORS:
--------------------------------------------------------------------------------
1 | This file includes the main contributors to the CyATP project.
2 |
3 | Initial implementation:
4 | Youmeizi Zeng
5 |
6 | Current maintainers:
7 | Razvan Beuran
8 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2021, Japan Advanced Institute of Science and Technology
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | 1. Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | 2. Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | 3. Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # CyATP: Cybersecurity Awareness Training Platform
3 |
4 | CyATP is a web platform for cybersecurity awareness training that
5 | makes use of Natural Language Generation (NLG) techniques to
6 | automatically generate the training content; the serious game approach
7 | is employed for learning purposes. Using this platform, learners can
8 | increase their security awareness knowledge and put it to use in their
9 | daily life. The included training content is comprised of about 2500
10 | computer security concepts and definitions, 278 quiz questions and 10
11 | crossword puzzles. CyATP is being developed by the Cyber Range
12 | Organization and Design
13 | ([CROND](https://www.jaist.ac.jp/misc/crond/index-en.html))
14 | NEC-endowed chair at the Japan Advanced Institute of Science and
15 | Technology ([JAIST](https://www.jaist.ac.jp/english/)) in Ishikawa,
16 | Japan.
17 |
18 | An overview of the CyATP architecture is provided in the figure
19 | below. Trainees use the web interface to access the **Concept Map**
20 | and **Learn Concepts** pages in order to find out about the security
21 | concepts they want to study. They can also use the **Take Quiz** and
22 | **Crossword Puzzle** pages to test and deepen their knowledge. The
23 | front end of the CyATP platform is developed using Bootstrap and
24 | jQuery, and the back end employs Flask and Neo4j.
25 |
26 |
27 |
28 | CyATP already includes some pregenerated awareness training
29 | content. To learn more about this content, and about how to add new
30 | training content to the database, see the [Training Content
31 | Guide](https://github.com/crond-jaist/CyATP/blob/master/training_content/content_guide.md).
32 |
33 |
34 | ## Prerequisites
35 |
36 | As we store some components of the training content in a Neo4j graph
37 | database, the following step must be carried out before using CyATP:
38 |
39 | * **Install the Neo4j database platform.** You can download the [Neo4j
40 | Community Edition](https://neo4j.com/download-center/#community)
41 | free of charge; we recommend Neo4j v4.0+. (Note that all versions
42 | of Neo4j require Java to be preinstalled.)
43 |
44 | The following optional step can also be performed:
45 |
46 | * **Create a virtual Python environment.** You can create an isolated
47 | Python environment and install packages into this virtual
48 | environment to avoid conflicts. This can be done using the tools
49 | `venv` for Python 3 or `virtualenv` for Python 2. For example, you
50 | can run the following commands to create and activate the
51 | `cyatp-env` virtual environment:
52 |
53 | ```
54 | $ python3 -m venv cyatp-env
55 | $ source cyatp-env/bin/activate
56 | ```
57 |
58 |
59 | ## Setup
60 |
61 | To set up CyATP, follow the steps below:
62 |
63 | 1. **Install the latest version of CyATP.** Use the
64 | [releases](https://github.com/crond-jaist/CyATP/releases) page to
65 | download the source code archive of the latest version of the training
66 | platform and extract it on your computer.
67 |
68 | 2. **Install the required Python libraries.** Go to the directory
69 | where CyATP is located and install the required third-party libraries
70 | by running the following command:
71 |
72 | ```
73 | $ sudo -H pip install -r requirements.txt
74 | ```
75 |
76 | 3. **Set up the Neo4j database.** Follow the next steps in a terminal
77 | window to install the CyATP training database content into Neo4j:
78 | 1. Enter the directory where Neo4j was installed (e.g.,
79 | `~/neo4j-community-4.1.1/`).
80 | 2. Stop the Neo4j database service:
81 | ```
82 | $ ./bin/neo4j stop
83 | ```
84 | 3. Copy the file `neo_db/cyatp.db` from the CyATP directory to the
85 | `bin/` directory in the Neo4j installation:
86 | ```
87 | $ cp <CyATP directory>/neo_db/cyatp.db bin/
88 | ```
89 | 4. Load the CyATP training database into Neo4j:
90 | ```
91 | $ ./bin/neo4j-admin load --from=bin/cyatp.db --database=neo4j --force
92 | ```
93 | 5. Restart the Neo4j database service:
94 | ```
95 | $ ./bin/neo4j restart
96 | ```
97 |
98 | ### Notes
99 |
100 | * After installing Neo4j, when you access http://localhost:7474 for
101 | the first time you will be asked to change the default database
102 | password. After you do that, make sure to also change the password
103 | included in file `neo_db/config.py` in the CyATP directory.
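
  For reference, the relevant part of `neo_db/config.py` is shown below
  (the password value is only a placeholder here; use the one you set in
  Neo4j):

  ```
  graph = Graph(
      "http://localhost:7474",
      username="neo4j",
      password="<your-new-password>"
  )
  ```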
104 |
105 | * After setting up the Neo4j database with CyATP training data, you
106 | can access http://localhost:7474 and run the following command to
107 | retrieve the existing data (the
108 | [Cypher](https://neo4j.com/developer/cypher/) query language is
109 | used):
110 |
111 | ```
112 | MATCH (n) RETURN n
113 | ```
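
  For instance, to display only a given concept together with its directly
  related concepts, a query such as the following can be used (in the
  pregenerated database, concept nodes carry the `Keyword` label and a
  `Name` property):

  ```
  MATCH (n:Keyword {Name: 'Computer security'})-[r]-(m) RETURN n, r, m
  ```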
114 |
115 |
116 | ## Quick Start
117 |
118 | In order to start the CyATP web server, use a terminal window to go to
119 | the project directory and execute the following command:
120 |
121 | ```
122 | $ env FLASK_APP=cyatp.py flask run
123 | ```
124 |
125 | By default the web interface of CyATP can be accessed only locally at
126 | http://127.0.0.1:5000/. The screenshot below displays the top page,
127 | which provides an overview of CyATP and the functions of each
128 | additional page.
129 |
130 |
131 |
132 | ### Configuration
133 |
134 | * If you want to change the default port used by CyATP (which is
135 | `5000`), use the option `-p` when running the program:
136 | ```
137 | $ env FLASK_APP=cyatp.py flask run -p <port_number>
138 | ```
139 |
140 | * To make the CyATP server publicly available, use the option
141 | `--host=0.0.0.0` when running the program:
142 | ```
143 | $ env FLASK_APP=cyatp.py flask run --host=0.0.0.0
144 | ```
145 |
146 | * For more details about setting up Flask applications, see this
147 | [tutorial](https://flask.palletsprojects.com/en/1.1.x/quickstart/). CyATP
148 | can also be deployed via a cloud service, such as Amazon Web
149 | Services (AWS), Microsoft Azure or Alibaba Cloud. For details, see
150 | the documentation on [Flask application
151 | deployment](https://flask.palletsprojects.com/en/1.1.x/deploying/).
152 |
153 |
154 | ## References
155 |
156 | For a research background regarding CyATP, please refer to the
157 | following document:
158 |
159 | * Y. Zeng, "Content Generation and Serious Game Implementation for
160 | Security Awareness Training", Master's thesis, March 2021.
161 | https://hdl.handle.net/10119/17105
162 |
163 | For a list of contributors to this project, check the file
164 | CONTRIBUTORS included with the source code.
165 |
--------------------------------------------------------------------------------
/cyatp.py:
--------------------------------------------------------------------------------
1 | from flask import Flask, request, jsonify, url_for, render_template
2 | from neo_db.query_graph import query
3 | from scripts.show_profile import get_keyword_profile
4 | from scripts.get_5_data import get_all_question
5 | from scripts.get_cross import get_crossword
6 | from scripts.save_feedback import save, save_survey
7 | import json
8 | app = Flask(__name__)
9 |
10 |
11 | # homepage
12 | @app.route('/', methods=['GET', 'POST'])
13 | @app.route('/index', methods=['GET', 'POST'])
14 | def index():
15 | return render_template('index.html')
16 |
17 | @app.errorhandler(404)
18 | def page_not_found(e):
19 | return render_template('404_page.html'), 404
20 |
21 |
22 | # concept map page
23 | @app.route('/map', methods=['GET', 'POST'])
24 | def map():
25 | return render_template('map.html')
26 |
27 | # learn page
28 | @app.route('/learn', methods=['GET', 'POST'])
29 | def learn():
30 | return render_template('learn.html')
31 |
32 | # search keyword information for learn page
33 | @app.route('/search_name', methods=['GET', 'POST'])
34 | def search_name():
35 | name = request.args.get('name')
36 | json_data = query(str(name))
37 | # print(json_data)
38 | return jsonify(json_data)
39 |
40 | # get keyword's text for learn page
41 | @app.route('/get_profile', methods=['GET', 'POST'])
42 | def get_profile():
43 | name = request.args.get('name')
44 | json_data = get_keyword_profile(name)
45 | #print(json_data)
46 | #print(jsonify(json_data))
47 | return jsonify(json_data)
48 |
49 | # quiz page
50 | @app.route('/game/quiz', methods=['GET', 'POST'])
51 | def quiz():
52 | return render_template('quiz.html')
53 |
54 | # get questions for quiz page
55 | @app.route('/get_question', methods=['GET', 'POST'])
56 | def get_question():
57 | with app.app_context():
58 | All_question = get_all_question()
59 | return jsonify(All_question)
60 |
61 | # quiz feedback
62 | @app.route('/game/feedback', methods=['GET', 'POST'])
63 | def game_feedback():
64 | temp_value = request.form.to_dict()
65 | final = []
66 | result = []
67 | for key in temp_value:
68 | if(key=='record'):
69 | temp_num = temp_value[key].split(',')
70 | for i in range(len(temp_num)):
71 | result.append(temp_num[i])
72 | else:
73 | temp_dic = {}
74 | temp_text = temp_value[key]
75 | temp_split = temp_text.split(',answer,')
76 | temp_dic.update({'Question': temp_split[0]})
77 | temp_dic.update({'Answer': temp_split[1]})
78 | final.append(temp_dic)
79 |
80 | for index, q_dic in enumerate(final):
81 | q_dic.update({'Record': result[index]})
82 | return render_template('quiz_feedback.html', data = final)
83 |
84 | # crossword puzzle page
85 | @app.route('/game/puzzle', methods=['GET', 'POST'])
86 | def puzzle():
87 | return render_template('puzzle.html')
88 |
89 | # get puzzle for crossword puzzle page
90 | @app.route('/get_cross', methods=['GET', 'POST'])
91 | def get_cross():
92 | puzzleData = get_crossword()
93 | return jsonify(puzzleData)
94 |
95 | # survey page
96 | survey_data = []
97 | @app.route('/survey', methods=['GET', 'POST'])
98 | def survey():
99 | if request.method == 'GET':
100 | return render_template('survey.html')
101 | if request.method == 'POST':
102 | survey_feedback = json.loads(request.get_data())
103 | # print(survey_feedback)
104 | # add feedback to survey_data, and save it
105 | survey_data.append(survey_feedback)
106 | save_survey(survey_data)
107 | return render_template('survey.html')
108 |
109 | data_list = []
110 | @app.route('/back', methods=['GET', 'POST'])
111 | def back():
112 | if request.method == 'POST':
113 | # get feedback user information
114 | info_ip = request.remote_addr
115 | info_platform = request.user_agent.platform
116 | info_brower = request.user_agent.browser
117 | info_bv = request.user_agent.version
118 | info_all = info_ip + ' ' + info_platform + ' ' + info_brower + ' ' + info_bv
119 |
120 | data = json.loads(request.get_data())
121 | data.update({'info': info_all})
122 | data_list.append(data)
123 | statues_data = save(data_list)
124 |
125 | return render_template('back.html', data_ = data_list)
126 |
127 | if request.method == 'GET':
128 | return render_template('back.html', data_=data_list)
129 |
130 |
131 | if __name__ == '__main__':
132 | app.config['JSON_AS_ASCII'] = False
133 | app.config['JSON_SORT_KEYS'] = False
134 | app.config['JSONIFY_MIMETYPE'] = "application/json;charset=utf-8"
135 | # app.debug = True
136 | app.debug = False
137 | app.run(host='0.0.0.0')
138 |
--------------------------------------------------------------------------------
/neo_db/config.py:
--------------------------------------------------------------------------------
1 | from py2neo import Graph
2 |
3 | graph = Graph(
4 | "http://localhost:7474",
5 | username="neo4j",
6 | password="123456"
7 | )
8 |
9 | CA_LIST = {"level0": 0, "level1": 1, "level2": 2, "level3": 3, "level4": 4, "level5": 5, "level6": 6, 'level7': 7}
10 |
--------------------------------------------------------------------------------
/neo_db/create_graph.py:
--------------------------------------------------------------------------------
1 | from py2neo import Graph, Node, Relationship
2 | from config import graph
3 |
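# NOTE: each line of ../data/create_map.txt is expected to contain five
# comma-separated fields: start concept, end concept, relationship,
# start concept level, end concept level
# (see the "Database Update" section of training_content/content_guide.md)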
4 | with open("../data/create_map.txt") as f:
5 | for line in f.readlines():
6 | rela_array = line.strip("\n").split(",")
7 | print(rela_array)
8 | graph.run("MERGE(p: Keyword{level:'%s',Name: '%s'})" % (rela_array[3], rela_array[0]))
9 | graph.run("MERGE(p: Keyword{level:'%s',Name: '%s'})" % (rela_array[4], rela_array[1]))
10 | graph.run(
11 | "MATCH(e: Keyword), (cc: Keyword) \
12 | WHERE e.Name='%s' AND cc.Name='%s'\
13 | CREATE(e)-[r:%s{relation: '%s'}]->(cc)\
14 | RETURN r" % (rela_array[0], rela_array[1], rela_array[2], rela_array[2])
15 |
16 | )
17 |
18 |
--------------------------------------------------------------------------------
/neo_db/cyatp.db:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/crond-jaist/CyATP/0d7d1ddcf5c7b22b26b698bd94637765e75302cf/neo_db/cyatp.db
--------------------------------------------------------------------------------
/neo_db/query_graph.py:
--------------------------------------------------------------------------------
1 | from neo_db.config import graph, CA_LIST
2 | import difflib
3 | from scripts.get_5_data import get_all_keywords
4 |
5 | # load all the keywords into a list;
6 | # if a keyword cannot be found, recommend similar keywords
7 | keywords = get_all_keywords()
8 |
9 | def query(name):
10 | # print(name)
11 | name_ = name.capitalize()
12 | data = graph.run(
13 | "match(p )-[r]->(n:Keyword{Name:'%s'}) return p.Name,r.relation,n.Name,p.level,n.level\
14 | Union all\
15 | match(p:Keyword{Name:'%s'}) -[r]->(n) return p.Name, r.relation, n.Name, p.level, n.level" % (name_, name_)
16 | )
17 | data = list(data)
18 | # print(data)
19 | # json_data = get_json_data(data)
20 | return get_json_data(data, name)
21 |
22 |
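# build the {'data': [...], 'links': [...]} structure used by the concept map
# visualization on the front end; node categories are looked up in CA_LIST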
23 | def get_json_data(data, name):
24 | json_data = {'data': [], "links": []}
25 | d = []
26 |
27 | if(data == []):
28 | try:
29 | close_item = difflib.get_close_matches(name, keywords, 5, cutoff = 0.5)
30 | # print(close_item)
31 | #query(close_item)
32 | except:
33 | close_item = []
34 |
35 | return close_item
36 | else:
37 | for i in data:
38 | #print(i["p.Name"], i["r.relation"], i["n.Name"], i["p.level"], i["n.level"])
39 | d.append(i['p.Name'] + "/" + i['p.level'])
40 | d.append(i['n.Name'] + "/" + i['n.level'])
41 | d = list(set(d))
42 | name_dict = {}
43 | count = 0
44 | for j in d:
45 | j_array = j.split("/")
46 |
47 | data_item = {}
48 | name_dict[j_array[0]] = count
49 | count += 1
50 | data_item['name'] = j_array[0]
51 | data_item['category'] = CA_LIST[j_array[1]]
52 | json_data['data'].append(data_item)
53 | for i in data:
54 | link_item = {}
55 |
56 | link_item['source'] = name_dict[i['p.Name']]
57 |
58 | link_item['target'] = name_dict[i['n.Name']]
59 | link_item['value'] = i['r.relation']
60 | json_data['links'].append(link_item)
61 | #print(json_data)
62 | return json_data
63 |
64 |
65 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | Flask==1.1.2
2 | py2neo==4.3.0
3 | Flask_Share==0.1.1
4 |
--------------------------------------------------------------------------------
/scripts/get_5_data.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 |
4 | filepath = os.getcwd()
5 | with open(os.path.join(filepath, 'data', 'DATA_6_KLTQCP.json'), 'r', encoding='utf-8') as f:
6 | data = json.load(f)
7 |
8 | def get_all_info(name):
9 | info = []
10 | try:
11 | for i in range(len(data)):
12 | if(data[i]['Keyword'] == name):
13 | info = data[i]
14 | except:
15 | info = 'could not find this information'
16 | return {'Keyword': [name], 'info': [info]}
17 |
18 | def get_all_keywords():
19 | keywords = []
20 | for i in range(len(data)):
21 | keywords.append(data[i]['Keyword'])
22 | return keywords
23 |
24 |
25 | def get_word_question(name):
26 | question = []
27 | try:
28 | for i in range(len(data)):
29 | if(data[i]['Keyword']==name):
30 | question = data[i]['Question']
31 | except:
32 | question = 'this keyword has no question'
33 | return {'Keyword': [name], 'question': [question]}
34 |
35 |
36 | def get_all_question():
37 | All_Question = []
38 | for i in range(len(data)):
39 | try:
40 | for q in data[i]['Questions']:
41 | All_Question.append(q)
42 | except:
43 | continue
44 | return All_Question
45 |
46 |
47 |
48 | def get_puzzle():
49 | All_puzzle = []
50 | for i in range(len(data)):
51 | try:
52 | for pu in data[i]['Puzzle']:
53 | All_puzzle.append(pu)
54 | except:
55 | continue
56 |
57 | return All_puzzle
58 |
59 |
--------------------------------------------------------------------------------
/scripts/get_cross.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from random import choice
4 | filepath = os.getcwd()
5 |
6 | def get_crossword():
7 | with open(os.path.join(filepath, 'data', 'crossword_puzzle_data.json'), 'r', encoding='utf-8') as f:
8 | cross_all = json.load(f)
9 | cross = choice(cross_all)
10 |
11 | tempx = []
12 | tempy = []
13 | final_cross =[]
14 | for i in cross:
15 | if(i['orientation']=='across'):
16 | tempx.append(i)
17 |
18 | else:
19 | tempy.append(i)
20 |
21 | for xitem in tempx:
22 | final_cross.append(xitem)
23 | for yitem in tempy:
24 | final_cross.append(yitem)
25 |
26 |
27 | return final_cross
28 |
--------------------------------------------------------------------------------
/scripts/save_feedback.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 |
4 | filepath = os.getcwd()
5 |
6 | def save(data):
7 | # if not exists feedback_question.json, generate it
8 | if not os.path.exists(os.path.join(filepath, 'training_content/feedback', 'feedback_question.json')):
9 | open(os.path.join(filepath, 'training_content/feedback', 'feedback_question.json'), 'w', encoding='utf8').write('[]')  # create the file with an empty JSON list so json.load() below succeeds
10 |
11 | with open(os.path.join(filepath, 'training_content/feedback', 'feedback_question.json'), 'r', encoding='utf8') as fp:
12 | file = json.load(fp)
13 |
14 | # if not repeat data, write it in the json file
15 | for i in range(len(data)):
16 | if data[i] not in file:
17 | file.append(data[i])
18 | with open(os.path.join(filepath, 'training_content/feedback', 'feedback_question.json'), 'w') as fp:
19 | json.dump(file, fp)
20 | return 'save ok'
21 |
22 | def save_survey(data):
23 | if not os.path.exists(os.path.join(filepath, 'training_content/feedback', 'feedback_survey_data.json')):
24 | open(os.path.join(filepath, 'training_content/feedback', 'feedback_survey_data.json'), 'w', encoding='utf8').write('[]')  # create the file with an empty JSON list so json.load() below succeeds
25 |
26 | with open(os.path.join(filepath, 'training_content/feedback', 'feedback_survey_data.json'), 'r', encoding='utf8') as fp:
27 | file = json.load(fp)
28 |
29 | for i in range(len(data)):
30 | if data[i] not in file:
31 | file.append(data[i])
32 | with open(os.path.join(filepath, 'training_content/feedback', 'feedback_survey_data.json'), 'w') as fp:
33 | json.dump(file, fp)
34 |
35 | return 'save survey ok'
36 |
--------------------------------------------------------------------------------
/scripts/show_profile.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 |
4 | filepath = os.getcwd()
5 |
6 | with open(os.path.join(filepath, 'data', 'concept_text.json'), 'r', encoding='utf-8') as f:
7 | data = json.load(f)
8 |
9 |
10 | def get_keyword_profile(o_name):
11 | s = ''
12 | name = o_name.replace(' ', '_')
13 | try:
14 | for i in data[name]:
15 | dir = {}
16 | st = "
CyATP (Cybersecurity Awareness Training Platform) is a learning tool for everyone who wants to gain cybersecurity awareness. This online platform is based on the concept of "serious games", which aims to increase learners' interest by letting them learn while playing games with serious content. CyATP includes a quiz and a crossword puzzle as cybersecurity awareness serious games.
24 |
The main features of the CyATP training platform are as follows:
25 |
26 |
27 |
Rich Content
The included training database already contains a lot of cybersecurity awareness content: 2640 cybersecurity concepts, 2315 concept definitions, 278 concept-based quiz questions (only for the 126 concepts on levels 0-2), and 28 clues for the crossword puzzle (also only for the 126 concepts on levels 0-2).
29 |
30 |
31 |
Convenience
This web-based application enables learners to study anywhere and at any time. No environment installation or configuration is needed; a regular web browser is enough. Operation is simple and the user interface is friendly.
33 |
34 |
35 |
Entertainment
36 | Given the serious game approach, there is a short feedback cycle for learners, via an immediate stream of rewards and penalties. In addition, a good balance is achieved between having fun and learning cybersecurity awareness content.
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
47 |
48 |
49 |
Concept Map
50 |
- Shows the cybersecurity concept map available in the database
51 |
- Related concepts are displayed for each clicked concept
52 |
- Use the top navigation bar to select the depth level of the concepts to be shown
53 |
- Click the top-right corner "Save" button to store the concepts of interest
- Click "Start Game", and some quiz questions will be randomly selected from the database. You have 30 seconds to answer each of them.
95 |
- During the quiz, you can see the number of remaining questions and your current points. When the game is over, click on the detail button to get feedback on your answers.
96 |
- If a question seems wrong or strange, you can click the "Feedback about question" button and tell us about it. We appreciate your feedback!
19 | {% if idata.Record =='0' %}
20 | Result - Wrong
21 | {% elif idata.Record =='1' %}
22 | Result - Correct
23 | {% elif idata.Record =='2' %}
24 | Result - Miss
25 | {% endif %}
26 |
27 | {% endfor %}
28 |
29 |
30 |
31 |
32 |
33 |
38 |
39 | {% endblock%}
--------------------------------------------------------------------------------
/templates/survey.html:
--------------------------------------------------------------------------------
1 | {% extends 'base.html' %}
2 |
3 | {%block head%}
4 | Survey
5 |
6 |
7 | {% endblock%}
8 |
9 | {% block content %}
10 |
11 |
12 |
13 |
166 |
167 | {% endblock %}
--------------------------------------------------------------------------------
/training_content/content_guide.md:
--------------------------------------------------------------------------------
1 |
2 | # Training Content Guide
3 |
4 | This file contains information about the training content included
5 | with CyATP, as well as details about the procedure of generating
6 | training content. The generated training content is stored partially
7 | in the Neo4j database (keywords and concept map) and partially in a
8 | JSON file named `data/DATA_6_KLTQCP.json` (concept definitions and
9 | quiz questions).
10 |
11 |
12 | ## Included Training Content
13 |
14 | CyATP includes training content that was generated using the
15 | methodology outlined in the figure below, as follows:
16 | * 2640 cybersecurity concepts extracted from DBpedia by using
17 | "computer security" as keyword
18 | * 2315 concept definitions extracted from Wikipedia for the above
19 | concepts (when available)
20 | * 278 quiz questions and 10 crossword puzzles with a total of 28 clues
21 | for the top 126 concepts, produced using Natural Language Generation
22 | (NLG) techniques
23 |
24 |
25 |
26 |
27 | ## Training Content Generation
28 |
29 | The generation of new/updated training content requires generating the
30 | following three types of data, as will be described below in
31 | detail:
32 | * Concept data
33 | * Quiz data
34 | * Crossword puzzle
35 |
36 | ### Concept Data
37 |
38 | Concept data is created by extracting keywords from the Linked Open
39 | Data (LOD) database DBpedia (see [this
40 | paper](http://hdl.handle.net/10119/15928) for details), then
41 | extracting definitions of those concepts from Wikipedia. To change the
42 | search keyword from "computer security" to something else, modify the
43 | value of the variable `keyword` located at the end of the file
44 | `training_content/generate_content/generate_learning_content_data.py`,
45 | then run that file to generate the new concept data file.
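
For example, the script can be run as follows (the generated data files are
written to the current working directory):
```
$ cd training_content/generate_content/
$ python generate_learning_content_data.py
```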
46 |
47 | ### Quiz Data
48 |
49 | To generate quiz question data, we trained the Naive Bayes Model using
50 | the [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/) dataset,
51 | then used the trained model to predict questions based on the concept
52 | definitions text. Finally, the
53 | [GENSIM](https://radimrehurek.com/gensim/) library is used to generate
54 | the proposed choices for each quiz question.
55 |
56 | To regenerate quiz question data if concept data was changed/updated,
57 | follow the steps below:
58 |
59 | 1. Extract the content of the `model_data.zip` archive provided as
60 | asset with the CyATP release into the directory
61 | `training_content/generate_content/generate_quiz/` inside the CyATP
62 | folder.
63 |
64 | 2. Install the additional third-party Python libraries that are
65 | required to run the models; these dependencies are specified in the
66 | file
67 | `training_content/generate_content/generate_quiz/requirements.txt`.
68 |
69 | 3. Run the following command to generate the quiz data:
70 | ```
71 | $ python main-GQ.py
72 | ```
73 |
74 | **NOTES**
75 | * If you want to save the quiz data in a file with a name different
76 | than the default, modify the parameter `text_data` in the file
77 | `main-GQ.py` to be the desired name of the data file.
78 | * If you also want to retrain the Naive Bayes Model, make sure to
79 | place it in the file named
80 | `training_content/generate_content/generate_quiz/Trained_Models/M_BernoulliNB+isotonic_smote.pkl`.
81 |
82 | ### Crossword Puzzle
83 |
84 | To generate the crossword puzzle, keywords that have clues are
85 | extracted from the JSON data and the script
86 | `training_content/generate_content/generate_cross.py` is used to
87 | generate the puzzle. To alter the input for the crossword puzzle
88 | generation, you can modify the parameter `All_data` in the mentioned
89 | file to include your own.
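
For example, the script can be run from its own directory, so that the
relative path to the `data/` directory resolves correctly:
```
$ cd training_content/generate_content/
$ python generate_cross.py
```
The resulting puzzle data is written to the file
`data/crossword_puzzle_data.json`.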
90 |
91 |
92 | ## Database Update
93 |
94 | Part of the generated training content needs to be put in the Neo4j
95 | database for performance reasons, so that the concept keywords and the
96 | relationships between them (the concept map) can be queried quickly. The
97 | necessary data is saved in the file `data/create_map.txt` when concept
98 | data is generated, and can be loaded into the CyATP Neo4j database
99 | file using the script `neo_db/create_graph.py`. The actual Neo4j
100 | database must then be updated as described at step "3. Set up the
101 | Neo4j database" of the `README.md` file.
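
For example, the script can be run from the `neo_db/` directory, where it
reads `../data/create_map.txt` and writes to the Neo4j instance configured
in `neo_db/config.py`:
```
$ cd neo_db/
$ python create_graph.py
```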
102 |
103 | If you want to update the content yourself, pay attention
104 | to the structure of the concept map data stored in the file
105 | `create_map.txt`. Thus, the 1st and 2nd fields contain the start and
106 | end concepts, respectively; the 3rd field contains the concept
107 | relationship; finally, the 4th and 5th fields contain the levels on
108 | which the start and end concepts are located, respectively.
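
As a purely hypothetical illustration (not an actual line from the included
data file), an entry following this structure could look like:
```
Malware,Computer security,level1,level1,level0
```
where the first two fields are the start and end concepts, the third is the
relationship label, and the last two are the corresponding levels.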
109 |
110 |
111 | ## File Overview
112 |
113 | The CyATP release contains a large number of files, and we provide an
114 | overview below to facilitate development and further extensions of
115 | CyATP.
116 | ```
117 | ├── data # Directory for training data
118 | │ ├── DATA_6_KLTQCP.json # Pre-generated training dataset
119 | │ ├── concept_map.json # Concept map data
120 | │ ├── concept_text.json # Concept text data
121 | │ ├── create_map.txt # Keywords and relationships
122 | │ ├── crossword_puzzle_data.json # Crossword puzzle data
123 | ├── neo_db # Directory for Neo4j related data
124 | │ ├── config.py # Script to configure the database
125 | │ ├── create_graph.py # Script to create the database
126 | │ ├── cyatp.db # Pre-generated database file
127 | │ └── query_graph.py # Script to query the database
128 | ├── scripts # Directory for other scripts
129 | │ ├── get_cross.py # Script to get crossword puzzle
130 | │ ├── get_5_data.py # Script to get information from dataset
131 | │ ├── save_feedback.py # Script to save trainee feedback
132 | │ └── show_profile.py # Script to show concept text
133 | ├── static # Directory for static web files
134 | ├── templates # Directory for HTML web files
135 | ├── training_content # Directory for content generation
136 | │ ├── feedback # Directory for storing trainee feedback
137 | │ ├── generate_content # Directory for generation scripts
138 | │ └── content_guide.md # Training content generation guide
139 | ```
140 |
141 | ### Question Feedback
142 |
143 | When you are taking a quiz, you will see a button named "Question
144 | Evaluation" in the bottom-right corner. Using this button, trainees
145 | can provide feedback about the quality of a quiz question at any
146 | time. This feedback data is saved in the file named
147 | `training_content/feedback/feedback_question.json`.
148 |
149 | ### Survey Data
150 |
151 | The CyATP website includes a **Survey** page on which we use the
152 | [System Usability Scale
153 | (SUS)](https://www.usability.gov/how-to-and-tools/methods/system-usability-scale.html)
154 | method to evaluate the CyATP platform. The survey data is saved in the
155 | file `training_content/feedback/feedback_survey_data.json`.
156 |
--------------------------------------------------------------------------------
/training_content/feedback/feedback_question.json:
--------------------------------------------------------------------------------
1 | [{"question": "_____ have historically been dormitories for sentries or guards, and places where sentries not posted to sentry posts wait \"on call\", but are more recently manned by a contracted security company.", "star": "4 Stars", "text": "nice!", "time": "2020-11-17T03:52:24.258Z", "info": "127.0.0.1 macos chrome 86.0.4240.193"}, {"question": "SSO is a subset of federated identity management, as it relates only to _____ and is understood on the level of technical interoperability and it would not be possible without some sort of federation.", "star": "5 Stars", "text": "good!", "time": "2020-11-17T07:51:58.824Z", "info": "127.0.0.1 macos chrome 86.0.4240.193"}, {"question": "_____ is a subset of data privacy.", "star": "1.5 Star", "text": "jlk", "time": "2020-11-17T07:53:49.067Z", "info": "127.0.0.1 macos chrome 86.0.4240.183"}, {"question": "_____ typically require a user to generate and remember one \"master\" password to unlock and access any information stored in their databases.", "star": "4.5 Stars", "text": "good!!!", "time": "2020-12-20T07:45:59.064Z", "info": "127.0.0.1 macos chrome 87.0.4280.88"}, {"question": "Since Turing first introduced his test, it has proven to be both highly influential and widely criticised, and it has become an important concept in the philosophy of _____..", "star": "4.5 Stars", "text": "mama..", "time": "2020-12-20T12:46:39.737Z", "info": "127.0.0.1 macos chrome 87.0.4280.88"}, {"question": "In cryptography, PKCS stands for _____", "star": "3.5 Stars", "text": "nice!!!", "time": "2020-12-20T13:02:35.534Z", "info": "127.0.0.1 macos chrome 87.0.4280.88"}]
--------------------------------------------------------------------------------
/training_content/feedback/feedback_survey_data.json:
--------------------------------------------------------------------------------
1 | [{"question01": 5, "question02": 2, "question03": 5, "question04": 2, "question05": 5, "question06": 1, "question07": 5, "question08": 1, "question09": 5, "question10": 1}, {"question01": 4, "question02": 2, "question03": 4, "question04": 2, "question05": 4, "question06": 2, "question07": 4, "question08": 2, "question09": 4, "question10": 2}, {"question01": 4, "question02": 1, "question03": 5, "question04": 2, "question05": 5, "question06": 2, "question07": 4, "question08": 1, "question09": 4, "question10": 1}, {"question01": 4, "question02": 2, "question03": 5, "question04": 1, "question05": 4, "question06": 2, "question07": 5, "question08": 2, "question09": 4, "question10": 1}]
--------------------------------------------------------------------------------
/training_content/generate_content/generate_cross.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import random, pprint, subprocess, tempfile, os, shutil, time, argparse
3 | import json
4 |
5 | # read words from a file
6 | def read_word_list(filename,word_len_limit):
7 | words = []
8 | with open(filename) as words_file:
9 |
10 | for line in words_file:
11 | if len(line)<=word_len_limit:
12 | words.append(line.strip())
13 | return words
14 |
15 | # check whether a word can be placed in the grid
16 | def is_valid(possibility, grid, words ,counts, length):
17 | i = possibility["location"][0]
18 | j = possibility["location"][1]
19 | word = possibility["word"]
20 | D = possibility["D"]
21 |
22 | # Is this word length over the grid
23 | if (D == "E" and j + len(word) > len(grid[0])) or (D == "S" and i + len(word) > len(grid)):
24 | return [False, []]
25 |
26 | for k, letter in enumerate(list(word)):
27 |         if D == "E":
28 | if grid[i][j+k] != 0 and grid[i][j+k] != letter:
29 | return [False, []]
30 |         if D == "S":
31 | if grid[i+k][j] != 0 and grid[i+k][j] != letter:
32 | return [False, []]
33 |
34 |
35 |     if D == "E":
36 | if j > 0 and grid[i][j-1] != 0:
37 | return [False, []]
38 | if j+len(word) < len(grid[0]) and grid[i][j+len(word)] != 0:
39 | return [False, []]
40 |     if D == "S":
41 | if i > 0 and grid[i-1][j] != 0:
42 | return [False, []]
43 | if i+len(word) < len(grid) and grid[i+len(word)][j] != 0:
44 | return [False, []]
45 |
46 | new_words = []
47 | for k, letter in enumerate(list(word)):
48 |         if D == "E":
49 | if grid[i][j+k] == 0 and (i > 0 and grid[i-1][j+k] != 0 or i < len(grid)-1 and grid[i+1][j+k]):
50 | poss_word = [letter]
51 | l = 1
52 | while i+l < len(grid[0]) and grid[i+l][j+k] != 0:
53 | poss_word.append(grid[i+l][j+k])
54 | l+=1
55 | l = 1
56 | while i-l > 0 and grid[i-l][j+k] != 0:
57 | poss_word.insert(0, grid[i-l][j+k])
58 | l+=1
59 | poss_word = ''.join(poss_word)
60 | if poss_word not in words:
61 | return [False, []]
62 | new_words.append({"D": "S", "word":poss_word, "location": [i-l+1,j+k]})
63 |
64 |         if D == "S":
65 | if grid[i+k][j] == 0 and (j > 0 and grid[i+k][j-1] != 0 or j < len(grid[0])-1 and grid[i+k][j+1]):
66 | poss_word = [letter]
67 | l = 1
68 | while j+l < len(grid) and grid[i+k][j+l] != 0:
69 | poss_word.append(grid[i+k][j+l])
70 | l+=1
71 | l = 1
72 | while j-l > 0 and grid[i+k][j-l] != 0:
73 | poss_word.insert(0, grid[i+k][j-l])
74 | l+=1
75 | poss_word = ''.join(poss_word)
76 | if poss_word not in words:
77 | return [False, []]
78 | new_words.append({"D": "E", "word":poss_word, "location": [i+k,j-l+1]})
79 |
80 | if len(words) == length:
81 | return [True, new_words]
82 | else:
83 | for k, letter in enumerate(list(word)):
84 |             if D == "E":
85 | if grid[i][j+k] == letter:
86 | return [True, new_words]
87 |             if D == "S":
88 | if grid[i+k][j] == letter:
89 | return [True, new_words]
90 |
91 | return [False, []]
92 |
93 | # add the word into the grid
94 | def add_word_to_grid(possibility, grid):
95 | """ Adds a possibility to the given grid, which is modified in-place.
96 | (see generate_grid)
97 | """
98 | # Import possibility to local vars, for clarity
99 | i = possibility["location"][0]
100 | j = possibility["location"][1]
101 | word = possibility["word"]
102 |
103 | # Word is left-to-right
104 | if possibility["D"] == "E":
105 | grid[i][j:len(list(word))+j] = list(word)
106 | # Word is top-to-bottom
107 | if possibility["D"] == "S":
108 | for index, a in enumerate(list(word)):
109 | grid[i+index][j] = a
110 |
111 | # print the grid
112 | def write_grid(grid, screen=False):
113 | print('\n')
114 | if screen is True:
115 | # Print grid to the screen
116 | for line in grid:
117 | for element in line:
118 | print(" {}".format(element), end="")
119 | print()
120 |
121 |
122 | # generate the grid
123 | def generate_grid_occupancy_max(words, dim, index,timeout=60, occ_goal=1):
124 | print("Generating {} grid with {} words.".format(dim, len(words)))
125 | temp_word = [word for word in words]
126 | start_time = time.time()
127 | occupancy = 0
128 | letter_counter = 0
129 | inter_counter = 1
130 | occupancy_max = 0
131 | occupancy_min = 1
132 | grid_with_max_occupancy = []
133 | while not(not words) and time.time() - start_time < timeout:
134 | added_words = []
135 | words = [word for word in temp_word]
136 | grid = [x[:] for x in [[0]*dim[1]]*dim[0]]
137 | try_state = True
138 | word_index = 0
139 |
140 | # first word is random, try to put the remaining words in the grid in order
141 | while word_index 0.5 else "E"}
150 | else:
151 | new = {"word": words[word_index],
152 | "location": [[counts % dim[0], counts // dim[1]] if counts (dim[0]*dim[1])-1 else "E"}
154 | valid, new_words = is_valid(new, grid, words, counts, len(temp_word))
155 | counts += 1
156 | if valid:
157 | word_index = 0
158 | add_word_to_grid(new, grid)
159 | added_words.append(new)
160 | for word in new_words:
161 | added_words.append(word)
162 | words.remove(new["word"])
163 | for word in new_words:
164 | words.remove(word["word"])
165 | if len(words) == 0:
166 | try_state = False
167 | else:
168 | word_index += 1
169 | occupancy = sum(x.count(0) for x in grid) / (dim[0]*dim[1])
170 | if occupancy < occupancy_min and occupancy > 0:
171 | occupancy_min = occupancy
172 | grid_with_min_occupancy = grid
173 | word_with_min_occupancy = added_words
174 | print("\rEmpty_rate: {:2.6f}. Min_Empty_rate: {:2.6f}. Number_of_attempts: {}".format(occupancy, occupancy_min, inter_counter), end='')
175 | inter_counter += 1
176 |
177 | # Report and return the grid
178 | # print("Built a grid of occupancy {}.".format(occupancy))
179 | return occupancy_min, occupancy, {"grid": grid, "words": added_words}, grid_with_min_occupancy, word_with_min_occupancy
180 |
181 |
182 |
183 |
184 |
185 |
186 |
187 | #----------------------------------------------------------------------------------------------
188 | # set the grid size
189 | dd = 15
190 | words = []
191 |
192 | filepath = os.getcwd()
193 | # Extract words with clue from the data
194 | with open(os.path.join(filepath, '../../data', 'DATA_6_KLTQCP.json'), 'r', encoding='utf-8') as f:
195 | All_data = json.load(f)
196 |
197 | for i in range(len(All_data)):
198 | try:
199 | if All_data[i]['Puzzle'] !=[]:
200 | for pu in All_data[i]['Puzzle']:
201 | words.append(pu['answer'])
202 | except:
203 | continue
204 | words = list(dict.fromkeys(words))
205 | print(len(words))
206 | print(words)
207 |
208 | #----------------------------------------------------------------------------------------------
209 |
210 | words, dim = [x for x in words if len(x) > 2], [dd, dd]
211 | O_max, O__current, grid, grid_with_min_occupancy, word_with_min_occupancy = generate_grid_occupancy_max(words, dim, 1)
212 |
213 | # 'E' means horizontal placement,'S' means vertical placement
214 | write_grid(grid_with_min_occupancy, screen=True)
215 | pprint.pprint(word_with_min_occupancy)
216 |
217 | weizhi = sorted(word_with_min_occupancy,key=lambda t: t['location'], reverse=False)
218 | print(weizhi)
219 |
220 | puzzledata = []
221 | position = 0
222 | for i in weizhi:
223 | for item in word_with_min_occupancy:
224 | if(i == item['location']):
225 |
226 | if (item['D'] == 'E'):
227 | item['D'] = 'across'
228 | else:
229 | item['D'] = 'down'
230 | temp = {}
231 |
232 | # x coordinate
233 | # startx = item['location'][0] + 1
234 | startx = item['location'][1] + 1
235 | # y coordinate
236 | # starty = item['location'][1] + 1
237 | starty = item['location'][0] + 1
238 |
239 | # word
240 | answer = item['word']
241 | # information
242 | for i_temp in range(len(All_data)):
243 | if ((All_data[i_temp]['Level'] < 3 )and (All_data[i_temp]['Puzzle']!=[])):
244 | for puzzle_info in All_data[i_temp]['Puzzle']:
245 | if (answer == puzzle_info['answer']):
246 | clue = str(puzzle_info['clue'])
247 | # clue = " "
248 | # Horizontal or vertical
249 | orientation = item['D']
250 | # position
251 | position = position + 1
252 |
253 | temp.update({'clue': clue})
254 | temp.update({'answer': answer})
255 | temp.update({'position': position})
256 | temp.update({'orientation': orientation})
257 | temp.update({'startx': startx})
258 | temp.update({'starty': starty})
259 | puzzledata.append(temp)
260 |
261 | print(temp)
262 |
263 | print(puzzledata)
264 |
265 |
266 |
267 | filename = 'crossword_puzzle_data.json'
268 | with open(os.path.join(filepath, '../../data', filename), 'w') as file_obj:
269 | json.dump(puzzledata, file_obj)
--------------------------------------------------------------------------------
/training_content/generate_content/generate_learning_content_data.py:
--------------------------------------------------------------------------------
1 | import re
2 | from SPARQLWrapper import SPARQLWrapper, JSON
3 | import json
4 | import wikipedia
5 | import time
6 | import requests
7 | import sys
8 | import importlib
9 | importlib.reload(sys)
10 |
11 |
12 | class Query_children:
13 | __doc__ = '''use SPARQL to query children from DBpedia'''
14 |
15 | def __init__(self, keyword):
16 | self.keyword = keyword
17 |
18 | def query_children(self, keyword):
19 | sparql = SPARQLWrapper("http://dbpedia.org/sparql")
20 | queryString = """
21 |         PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
22 |         PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
23 |         PREFIX category: <http://dbpedia.org/resource/Category:>
24 |
25 |
26 | SELECT DISTINCT ?child ?childlabel
27 | WHERE{
28 |         ?child skos:broader category:keyword ;
29 | rdfs:label ?childname.
30 | FILTER (LANG(?childname) = 'en')
31 | BIND (?childname AS ?childlabel)
32 | }
33 | """.replace("keyword", self.keyword)
34 | sparql.setQuery(queryString)
35 | sparql.setReturnFormat(JSON)
36 | results = sparql.query().convert()
37 | childlist = []
38 | for result in results["results"]["bindings"]:
39 | childlist.append(result["childlabel"]["value"].replace(" ", "_"))
40 | return childlist
41 |
42 |
43 |
44 |
45 | def getKeywordText(data_level):
46 | no_define_word = []
47 | for i in range(len(data_level)):
48 | keyword = data_level[i]['Keyword']
49 | try:
50 | temp = wikipedia.summary(keyword)
51 | text = re.sub(u"\\(.*?\\)", "", temp).replace('\n', '')
52 | data_level[i].update({'Text': text})
53 | print(i, keyword)
54 |
55 | except wikipedia.DisambiguationError as a:
56 | no_define_word.append(keyword)
57 |
58 | except wikipedia.PageError as e:
59 | no_define_word.append(keyword)
60 |
61 | except requests.exceptions.ConnectionError:
62 | time.sleep(5)
63 | print('continue..')
64 | continue
65 | except UnicodeEncodeError:
66 | continue
67 |
68 | filename = 'DATA_3_KLT.json'
69 | with open(filename, 'w') as file_obj:
70 | json.dump(data_level, file_obj)
71 | print("\tFile DATA_3_KLT.json generated.")
72 |
73 | filename = 'no_text_keyword.json'
74 | with open(filename, 'w') as file_obj:
75 | json.dump(no_define_word, file_obj)
76 | print("\tFile no_text_keyword.json generated.")
77 |
78 |
79 | def getContent(keyword):
80 |
81 | # 1. from DBpedia get keywords and edges
82 | # 2. generate concept map
83 | # 3. from wikipedia get texts
84 |
85 |     print("Start building the \"Computer Security\" concept map. Please wait...")
86 |
87 | # store concept in keys
88 | keys = []
89 | keys.append(keyword)
90 | # store relationship in edges
91 | edges = []
92 |
93 | # query children of the keyword, store in childlist
94 | children = Query_children(keyword)
95 | childlist = children.query_children(keyword)
96 | # print("childlist:", childlist)
97 |
98 | for child in childlist:
99 | if re.findall('\(.*?\)', child):
100 | continue
101 | elif re.findall('\'', child):
102 | continue
103 | elif re.findall('\+', child):
104 | continue
105 | elif re.findall('\/', child):
106 | continue
107 | elif re.findall('\&', child):
108 | continue
109 | elif re.findall('\.', child):
110 | continue
111 | elif re.findall('\!', child):
112 | continue
113 | elif re.findall(':', child):
114 | continue
115 | elif re.findall('\"', child):
116 | continue
117 | else:
118 | edges.append((child, keyword))
119 |
120 | # for i in range(len(childlist)) doesn't work
121 |     # since the length of childlist keeps changing
122 | i = 0
123 | while i < len(childlist):
124 | # print(i)
125 |
126 | if re.findall('\(.*?\)', childlist[i]):
127 | continue
128 | elif re.findall('\'', childlist[i]):
129 | continue
130 | elif re.findall('\+', childlist[i]):
131 | continue
132 |         elif re.findall('\/', childlist[i]):
133 |             continue
134 |         elif re.findall('\&', childlist[i]):
135 |             continue
136 |         elif re.findall('\.', childlist[i]):
137 |             continue
138 |         elif re.findall('\!', childlist[i]):
139 |             continue
140 |         elif re.findall(':', childlist[i]):
141 |             continue
142 |         elif re.findall('\"', childlist[i]):
143 |             continue
144 | else:
145 | child = childlist[i]
146 | if child not in keys:
147 | keys.append(child)
148 |
149 | # query children of the children
150 | grand_children = Query_children(child.replace(" ", "_"))
151 | results = grand_children.query_children(
152 | child.replace(" ", "_"))
153 | # print('reselts', child, results)
154 | for result in results:
155 | if re.findall('\(.*?\)', result):
156 | continue
157 | elif re.findall('\'', result):
158 | continue
159 | elif re.findall('\+', result):
160 | continue
161 | elif re.findall('\/', result):
162 | continue
163 | elif re.findall('\&', result):
164 | continue
165 | elif re.findall('\.', result):
166 | continue
167 | elif re.findall('\!', result):
168 | continue
169 | elif re.findall('\"', result):
170 | continue
171 | else:
172 | grandchild = result
173 | if grandchild not in childlist:
174 | if grandchild not in keys:
175 | # change the size of concept of here.len(keys) < total number + 1
176 | # !TODO improve the concept map size control method
177 | # - [x] 0: 1, 1
178 | # - [x] 1: 22, 23
179 | # - [x] 2 : 103, 126
180 | # - [x] 3 :205, 331
181 | # - [x] 4: 287, 618
182 | # - [x] 5: 266, 884
183 | # - [x] 6: 463, 1347
184 | # - [x] 7: 1293, 2640
185 | if len(keys) < 1348:
186 | childlist.append(grandchild)
187 | edges.append((grandchild, child))
188 | i = i + 1
189 | # print("edges", edges)
190 | # print("keys", keys)
191 | #print('Number ', len(keys), ' keyword was added.')
192 |     print('\tNumber of keywords added:', len(keys))
193 |
194 | data_KL = []
195 | for key_level in range(len(keys)):
196 | if key_level == 0:
197 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
198 | d2 = {"Level": 0}
199 | d1.update(d2)
200 | data_KL.append(d1)
201 |
202 | if key_level > 0 and key_level <23:
203 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
204 | d2 = {"Level": 1}
205 | d1.update(d2)
206 | data_KL.append(d1)
207 | if key_level >= 23 and key_level <126:
208 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
209 | d2 = {"Level": 2}
210 | d1.update(d2)
211 | data_KL.append(d1)
212 |
213 | if key_level >= 126 and key_level <331:
214 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
215 | d2 = {"Level": 3}
216 | d1.update(d2)
217 | data_KL.append(d1)
218 |
219 | if key_level >=331 and key_level <618:
220 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
221 | d2 = {"Level": 4}
222 | d1.update(d2)
223 | data_KL.append(d1)
224 |
225 | if key_level >=618 and key_level <884:
226 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
227 | d2 = {"Level": 5}
228 | d1.update(d2)
229 | data_KL.append(d1)
230 |
231 | if key_level >=884 and key_level <1347:
232 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
233 | d2 = {"Level": 6}
234 | d1.update(d2)
235 | data_KL.append(d1)
236 |
237 | if key_level >=1347 and key_level <2640:
238 | d1 = {"Keyword": keys[key_level].replace('_', ' ')}
239 | d2 = {"Level": 7}
240 | d1.update(d2)
241 | data_KL.append(d1)
242 |
243 |
244 | # save keyword and level information
245 |     filename = 'DATA_2_KL.json'
246 | with open(filename, 'w') as file_obj:
247 | json.dump(data_KL, file_obj)
248 | print('\tFile DATA_2_KL.json generated.')
249 |
250 | # generate computer_security.txt
251 | temp = []
252 | l0 = {"0": keys[:1]}
253 | temp.append(l0)
254 | l1 = {"1": keys[1:23]}
255 | temp.append(l1)
256 | l2 = {"2": keys[23:126]}
257 | temp.append(l2)
258 | l3 = {"3": keys[126:331]}
259 | temp.append(l3)
260 | l4 = {"4": keys[331:618]}
261 | temp.append(l4)
262 | l5 = {"5": keys[618:884]}
263 | temp.append(l5)
264 | l6 = {"6": keys[884:1347]}
265 | temp.append(l6)
266 | l7 = {"7": keys[1347:2640]}
267 | temp.append(l7)
268 | #print(temp)
269 | with open("Computer_security.txt", "w") as f:
270 | f.write(str(temp))
271 | print('\tFile Computer_security.txt generated.')
272 |
273 |
274 | all_KL = []
275 | for index, cont in enumerate(temp):
276 | for j in cont[str(index)]:
277 | d1 = {'name': j}
278 | d2 = {'category': index}
279 | d1.update(d2)
280 | all_KL.append(d1)
281 | nodes = {"data": all_KL}
282 |
283 | all_edges = []
284 | for index, cont in enumerate(edges):
285 | value = 'level' + str(index)
286 | dd = {"source": cont[1], "target": cont[0], "value": value}
287 | all_edges.append(dd)
288 |
289 | link = []
290 | for each_eage_index in range(len(all_edges)):
291 | source = all_edges[each_eage_index]['source'].replace("_", " ")
292 | target = all_edges[each_eage_index]['target'].replace("_", " ")
293 | value = all_edges[each_eage_index]['value'].replace("_", " ")
294 |
295 | for idd in range(len(data_KL)):
296 | if (source == data_KL[idd]['Keyword']):
297 | source = idd
298 | value = "level" + str(data_KL[idd]['Level'])
299 | break
300 |
301 | for ttt in range(len(data_KL)):
302 | if (target == data_KL[ttt]['Keyword']):
303 | target = ttt
304 | break
305 | sm_dic = {"source": source, "target": target, "value": value}
306 | link.append(sm_dic)
307 |
308 | Link = {"links": link}
309 |
310 | map = []
311 | map.append(nodes)
312 | map.append(Link)
313 |
314 |
315 | filename = 'concept_map.json'
316 | with open(filename, 'w') as file_obj:
317 | json.dump(map, file_obj)
318 |     print("\tFile concept_map.json generated.")
319 |
320 |     print("Start getting the text of each keyword...")
321 | getKeywordText(data_KL)
322 |     print("Content preparation finished.")
323 |
324 |
325 |
326 |
327 | if __name__ == '__main__':
328 | keyword = 'Computer_security'
329 | concept_map = getContent(keyword)
330 |
331 |
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/Future_Engineering.py:
--------------------------------------------------------------------------------
1 | from training_content.generate_content.generate_quiz.Pickle_File import loadPickle, dumpPickle
2 | import pandas as pd
3 | import spacy
4 | nlp = spacy.load("en_core_web_sm")
5 | from nltk.tokenize import word_tokenize
6 |
7 | # extract the answer
8 | def extractAnswers(qas, doc):
9 | answers = []
10 | senStart = 0
11 | senId = 0
12 | for sentence in doc.sents:
13 | senLen = len(sentence.text)
14 | for answer in qas:
15 | answerStart = answer['answers'][0]['answer_start']
16 | if (answerStart >= senStart and answerStart < (senStart + senLen)):
17 | answers.append({'sentenceId': senId, 'text': answer['answers'][0]['text']})
18 | senStart += senLen
19 | senId += 1
20 |
21 | return answers
22 |
23 |
24 | # Determine if the current word is the answer
25 | def tokenIsAnswer(token, sentenceId, answers):
26 | for i in range(len(answers)):
27 | if (answers[i]['sentenceId'] == sentenceId):
28 | if (answers[i]['text'] == token):
29 | return True
30 | return False
31 |
32 | def getNEStartIndexs(doc):
33 | neStarts = {}
34 | for ne in doc.ents:
35 | print(ne.start)
36 | neStarts[ne.start] = ne
37 | return neStarts
38 |
39 | # extract the start index of each noun chunk in the text
40 | def getNCStartIndexs(doc):
41 | neStarts = {}
42 | for ne in doc.noun_chunks:
43 | neStarts[ne.start] = ne
44 | return neStarts
45 |
46 |
47 | # get sentence start index
48 | def getSentenceStartIndexes(doc):
49 | senStarts = [] # sentence start position
50 | for sentence in doc.sents: # doc.sents: divide the sentences of the text into sentence by sentence
51 | senStarts.append(sentence[0].i) # add the each sentence start position to senStarts
52 | return senStarts
53 |
54 |
55 | # get the index of the sentence containing each word
56 | def getSentenceForWordPosition(wordPos, senStarts): # words position/sentence start index position
57 | if (len(senStarts) == 1):
58 | return 0
59 | for i in range(1, len(senStarts)):
60 | if (wordPos < senStarts[i] or wordPos == 0):
61 | return i - 1
62 | if (wordPos > senStarts[len(senStarts) - 1]):
63 | return len(senStarts) - 1
64 | if (wordPos == senStarts[i]):
65 | return i
66 |
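# Illustrative example (added note, not in the original code): with
# senStarts = [0, 5, 12], getSentenceForWordPosition(7, senStarts) returns 1,
# i.e. the word at token position 7 belongs to the second sentence.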
67 |
68 | # KMP algorithm
69 | def getNext(p, nexts):
70 | i = 0
71 | j = -1
72 | nexts[0] = -1
73 | while i < len(p):
74 | if j == -1 or list(p)[i] == list(p)[j]:
75 | i = i + 1
76 | j = j + 1
77 | nexts[i] = j
78 | else:
79 | j = nexts[j]
80 | return nexts
81 |
82 |
83 | # KMP algorithm
84 | def match(s, p, nexts):
85 |     if s is None or p is None:
86 | return -1
87 | slen = len(s)
88 | plen = len(p)
89 | if slen < plen:
90 | return -1
91 | i = 0
92 | j = 0
93 | while i < slen and j < plen:
94 | # print("i="+str(i)+","+"j="+str(j))
95 | if j == -1 or list(s)[i] == list(p)[j]:
96 | i = i + 1
97 | j = j + 1
98 | else:
99 | j = nexts[j]
100 | if j >= plen:
101 | return i - plen
102 | return -1
103 |
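# Worked example for the KMP helpers above (added note, not in the original code):
#   p = ['b', 'c']; nexts = getNext(p, [0] * (len(p) + 1))   # -> [-1, 0, 0]
#   match(['a', 'b', 'c', 'd'], p, nexts)                    # -> 1 (start index of p in s)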
104 | ######Training - Feature engineering#####
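# Note (added, inferred from processWordsData() below and the column list in main-GQ.py):
# each row appended to newWords during training appears to hold, in order,
# [Words, Is_Answer, TitleId, ParagrapghId, SentenceId, InSentencePosition,
#  Word_Count, NER, POS, TAG, DEP, Shape, Is_Alpha, Is_Stop].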
105 | # extract the noun chunks in the answers
106 | def addAnswersNC(df, newWords, titleId, paragraphId):
107 | text = df['data'][titleId]['paragraphs'][paragraphId]['context']
108 | qas = df['data'][titleId]['paragraphs'][paragraphId]['qas']
109 |
110 | doc = nlp(text) # use nlp processing the text
111 | answers = extractAnswers(qas, doc)
112 |
113 | answers_noun_chunks = []
114 | answers_noun_chunks_sentenceId = []
115 |
116 | for j in range(len(answers)):
117 | temp = nlp(answers[j]['text']).noun_chunks
118 | for i in temp:
119 | answers_noun_chunks.append(i)
120 | answers_noun_chunks_sentenceId.append(answers[j]['sentenceId'])
121 |
122 | i = 0
123 | while (i < len(answers_noun_chunks)):
124 |
125 | word = answers_noun_chunks[i]
126 | wordLen = word.end - word.start
127 | shape = ''
128 | for wordIndex in range(word.start, word.end):
129 | shape += (' ' + doc[wordIndex].shape_)
130 | currentSentence = answers_noun_chunks_sentenceId[i]
131 |
132 | sentence = []
133 | for Sentence in list(doc.sents):
134 | sentence.append(Sentence)
135 |
136 | sentence_token = word_tokenize(str(sentence[currentSentence]))
137 | word_token = word_tokenize(str(word))
138 |
139 | lens = len(word_token)
140 | nexts = [0] * (lens + 1)
141 | nexts = getNext(word_token, nexts)
142 | InSentencePosition = match(sentence_token, word_token, nexts)
143 | # print(InSentencePosition)
144 |
145 | nlpword = nlp(str(word))
146 |
147 | if (len(nlpword) == 1):
148 | newWords.append([word.text,
149 | True,
150 | titleId,
151 | paragraphId,
152 | currentSentence,
153 | InSentencePosition,
154 | wordLen,
155 | word.label_,
156 | nlpword[0].pos_,
157 | nlpword[0].tag_,
158 | nlpword[0].dep_,
159 | shape,
160 | nlpword[0].is_alpha,
161 | nlpword[0].is_stop])
162 |
163 | else:
164 | newWords.append([word.text,
165 | True,
166 | titleId,
167 | paragraphId,
168 | currentSentence,
169 | InSentencePosition,
170 | wordLen,
171 | word.label_,
172 | None,
173 | None,
174 | None,
175 | shape,
176 | False,
177 | False])
178 |
179 | i = i + 1
180 | return newWords
181 |
182 |
183 | ######training - feature engineering#####
184 | # extract the noun chunks and individual words from the text
185 | def addWordsForParagrapgh(df, newWords, titleId, paragraphId):
186 | text = df['data'][titleId]['paragraphs'][paragraphId]['context'] # paragraph
187 | qas = df['data'][titleId]['paragraphs'][paragraphId]['qas'] # answer and question
188 |
189 | doc = nlp(text)
190 | answers = extractAnswers(qas, doc)
191 |
192 | doc_NC = getNCStartIndexs(doc)
193 | senStarts = getSentenceStartIndexes(doc)
194 |
195 | i = 0
196 | while (i < len(doc)):
197 | if (i in doc_NC):
198 |
199 | word = doc_NC[i]
200 | currentSentence = getSentenceForWordPosition(word.start, senStarts)
201 | wordLen = word.end - word.start
202 | shape = ''
203 | for wordIndex in range(word.start, word.end):
204 | shape += (' ' + doc[wordIndex].shape_)
205 |
206 | sentence = []
207 | for Sentence in list(doc.sents):
208 | sentence.append(Sentence)
209 |
210 | # print(sentence)
211 | sentence_token = word_tokenize(str(sentence[currentSentence]))
212 | # print(sentence_token)
213 | word_token = word_tokenize(str(word))
214 | # print(word_token)
215 |
216 | lens = len(word_token)
217 | nexts = [0] * (lens + 1)
218 | nexts = getNext(word_token, nexts)
219 | InSentencePosition = match(sentence_token, word_token, nexts)
220 |
221 | newWords.append([word.text,
222 | tokenIsAnswer(word.text, currentSentence, answers),
223 | titleId,
224 | paragraphId,
225 | currentSentence,
226 | InSentencePosition,
227 | wordLen,
228 | word.label_,
229 | None,
230 | None,
231 | None,
232 | shape,
233 | False,
234 | False])
235 | i = doc_NC[i].end - 1
236 |
237 | if (doc[i].is_stop == False and doc[i].is_alpha == True):
238 | word = doc[i]
239 | currentSentence = getSentenceForWordPosition(i, senStarts)
240 | wordLen = 1
241 | sentence = []
242 | for Sentence in list(doc.sents):
243 | sentence.append(Sentence)
244 |
245 | # print(sentence)
246 | sentence_token = word_tokenize(str(sentence[currentSentence]))
247 | # print(sentence_token)
248 | if (str(word) in sentence_token):
249 | for t in range(len(sentence_token)):
250 | if (str(word) == sentence_token[t]):
251 | InSentencePosition = t
252 | else:
253 | InSentencePosition = -1
254 |
255 | newWords.append([word.text,
256 | tokenIsAnswer(word.text, currentSentence, answers),
257 | titleId,
258 | paragraphId,
259 | currentSentence,
260 | InSentencePosition,
261 | wordLen,
262 | None,
263 | word.pos_,
264 | word.tag_,
265 | word.dep_,
266 | word.shape_,
267 | True,
268 | False])
269 | i = i + 1
270 | return newWords
271 |
272 |
273 | ######predict - feature engineering#####
274 | # compute the features of each noun chunk
275 | def newAddNCForParagrapgh(newWords, text):
276 | doc = nlp(text)
277 | senStarts = getSentenceStartIndexes(doc)
278 | i = 0
279 |
280 | doc_NC = []
281 | for temp in doc.noun_chunks:
282 | doc_NC.append(temp)
283 | while (i < len(doc_NC)):
284 |
285 | word = doc_NC[i]
286 | currentSentence = getSentenceForWordPosition(word.start, senStarts)
287 | wordLen = word.end - word.start
288 | shape = ''
289 | for wordIndex in range(word.start, word.end):
290 | shape += (' ' + doc[wordIndex].shape_)
291 |
292 | sentence = []
293 | for Sentence in list(doc.sents):
294 | sentence.append(Sentence)
295 |
296 | # print(sentence)
297 | sentence_token = word_tokenize(str(sentence[currentSentence]))
298 | # print(sentence_token)
299 | word_token = word_tokenize(str(word))
300 | # print(word_token)
301 |
302 | lens = len(word_token)
303 | nexts = [0] * (lens + 1)
304 | nexts = getNext(word_token, nexts)
305 | InSentencePosition = match(sentence_token, word_token, nexts)
306 |
307 | newWords.append([word.text,
308 | InSentencePosition,
309 | wordLen,
310 | word.label_,
311 | None,
312 | None,
313 | None,
314 | False,
315 | False])
316 | i = i+1
317 | #print(i)
318 |
319 | return newWords
320 |
321 | ######predict - feature engineering#####
322 | # compute the features of each word
323 | def newAddWordForParagrapgh(newWords, text):
324 | doc = nlp(text)
325 | senStarts = getSentenceStartIndexes(doc)
326 | j = 0
327 | while(j < len(doc)):
328 | if (doc[j].is_stop == False and doc[j].is_alpha == True):
329 | word = doc[j]
330 |
331 | currentSentence = getSentenceForWordPosition(j, senStarts)
332 | wordLen = 1
333 |
334 | sentence = []
335 | for Sentence in list(doc.sents):
336 | sentence.append(Sentence)
337 |
338 | # print(sentence)
339 | sentence_token = word_tokenize(str(sentence[currentSentence]))
340 | # print(sentence_token)
341 |
342 | if (str(word) in sentence_token):
343 | for t in range(len(sentence_token)):
344 | if (str(word) == sentence_token[t]):
345 | InSentencePosition = t
346 | else:
347 | InSentencePosition = -1
348 |
349 | newWords.append([word.text,
350 | InSentencePosition,
351 | wordLen,
352 | None,
353 | word.pos_,
354 | word.tag_,
355 | word.dep_,
356 | True,
357 | False])
358 | j += 1
359 | return newWords
360 |
361 |
362 | def processWordsData(wordsDf):
363 |     # 1 - drop rows whose InSentencePosition is -1 (word not found in its sentence)
364 | df_1 = wordsDf[~wordsDf['InSentencePosition'].isin([-1])]
365 | df_1 = df_1.reset_index(drop=True)
366 | print(df_1['InSentencePosition'].describe())
367 | print(df_1.duplicated().value_counts())
368 |
369 |     # 2 - remove duplicate (Words, Is_Answer) rows, keeping the last occurrence
370 | cleandf = df_1.drop_duplicates(subset=['Words', 'Is_Answer'], keep='last')
371 | cleandf = cleandf.reset_index(drop=True)
372 |
373 | cleandf.to_csv('All_cleandf.csv', encoding='utf_8_sig')
374 | dumpPickle('All_cleandf.pkl', cleandf)
375 |
376 |     con = len(cleandf[cleandf['Is_Answer'] == True]) / len(cleandf)
377 |     print('Fraction of rows with Is_Answer == True after cleaning:', con)
378 |
379 | # 3 - save feature matrix df
380 | columnsToDrop = ['Words', 'TitleId', 'ParagrapghId', 'SentenceId', 'Shape']
381 | FeDf = cleandf.drop(columnsToDrop, axis=1)
382 |
383 | FeDf.replace([True, False, None], [1, 0, 0], inplace=True)
384 |
385 | return FeDf
386 |
387 |
388 | # one-hot-encoding
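# For example (added note): a 'POS' value of 'NOUN' becomes an indicator column
# 'POS_NOUN' set to 1 for that row; the original categorical column is dropped.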
389 | def oneHotEncoding(wordsDf):
390 | columnsToEncode = ['NER', 'POS', "TAG", 'DEP']
391 | for column in columnsToEncode:
392 | one_hot = pd.get_dummies(wordsDf[column])
393 | one_hot = one_hot.add_prefix(column + '_')
394 |
395 | wordsDf = wordsDf.drop(column, axis=1)
396 | wordsDf = wordsDf.join(one_hot)
397 |
398 | dumpPickle('All_featureDf.pkl', wordsDf)
399 | wordsDf.to_csv('All_featureDf.csv', encoding='utf_8_sig')
400 |
401 | return wordsDf
402 |
403 | # Align the columns with the feature table of the training set (missing columns are filled with 0)
404 | def asTrainingTable(wordsDf):
405 | df0 = loadPickle('./Trained_Models/All_featureDf.pkl')
406 | predictorColumns = list(df0.columns.values)
407 | predictorColumns.pop(0)
408 | w_Df = pd.DataFrame(columns=predictorColumns)
409 | for column in w_Df.columns:
410 | if (column in wordsDf.columns):
411 | w_Df[column] = wordsDf[column]
412 | else:
413 | w_Df[column] = 0
414 | Finished_fe_table = pd.DataFrame(w_Df)
415 | #print(Finished_fe_table)
416 | return w_Df, Finished_fe_table
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/Generate_Choice.py:
--------------------------------------------------------------------------------
1 | from training_content.generate_content.generate_quiz.Pickle_File import dumpPickle, loadPickle, pickleExists
2 | import spacy
3 | from sense2vec import Sense2Vec
4 | nlp = spacy.load("en_core_web_sm")
5 |
6 | # Load the pre-trained GloVe vectors (converted to word2vec format and cached as a pickle)
7 | modelName = './model_glove.pkl'
8 | if (pickleExists(modelName)):
9 | model_glove = loadPickle('model_glove.pkl')
10 | else:
11 | from gensim.models import KeyedVectors
12 | glove_file = './Out_Source/glove_6B/glove.6B.300d.txt'
13 | tmp_file = './Out_Source/glove_6B/word2vec-glove.6B.300d.txt'
14 | from gensim.scripts.glove2word2vec import glove2word2vec
15 | glove2word2vec(glove_file, tmp_file)
16 | model_glove = KeyedVectors.load_word2vec_format(tmp_file)
17 | dumpPickle('model_glove.pkl', model_glove)
18 |
19 | # generate distractor choices for a given answer
20 | def generateOtherChoice(answer, count):
21 | # make all word lowercase
22 | answer = str.lower(answer)
23 | closestWords = []
24 |
25 | try:
26 |
27 | doc1 = nlp((answer))
28 |
29 |         # if the answer is a single token, use the GloVe word vectors
30 | if (len(doc1) == 1):
31 | closestWords = model_glove.most_similar(positive=[doc1.text], topn=count)
32 |
33 |         # if the answer has more than one token, use Sense2Vec
34 | else:
35 | temp = doc1.text.replace(' ', '_') + '|NOUN'
36 |
37 | s2v = Sense2Vec().from_disk("./Out_Source/s2v_old")
38 | most_similar = s2v.most_similar(temp, n=count)
39 |
40 | for each_choice in most_similar:
41 | del_ = each_choice[0].replace('_', ' ')
42 | choice, sep, suffix = del_.partition('|')
43 | closestWords.append((choice, each_choice[1]))
44 |
45 |     except Exception:
46 | return []
47 |
48 | other_choice = list(map(lambda x: x[0], closestWords))[0:count]
49 |
50 | # Remove words that are the same as the answer
51 | temp = other_choice[:]
52 | for i in temp:
53 | choice = nlp(i)
54 | for token in choice:
55 | if (token.lemma_ in answer) or (answer in token.lemma_):
56 | other_choice.remove(i)
57 | break
58 |
59 |     # Lemmatize the options so that cognate words collapse to a single entry
60 | other_choice_ = []
61 | for j in other_choice:
62 | choice = nlp(j)
63 | if (len(choice) == 1):
64 | for token in choice:
65 | other_choice_.append(token.lemma_)
66 | else:
67 | tempword = ''
68 | for token in choice:
69 | tempword = tempword + token.lemma_ + ' '
70 | other_choice_.append(tempword.rstrip(' '))
71 |
72 |
73 | return list(set(other_choice_))
74 |
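# Hypothetical usage (added note; the actual output depends on the loaded vectors):
#   generateOtherChoice('malware', 4) might return lemmatized distractors such as
#   ['spyware', 'adware', 'rootkit'], with choices that duplicate the answer removed.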
75 | def addChoice(qaPairs, count):
76 | for qaPair in qaPairs:
77 | answer = qaPair['answer']
78 | distractors = generateOtherChoice(qaPair['answer'], count)
79 | if ((len(distractors) != 0) and (len(distractors) < 4)):
80 | for i in range(len(distractors)):
81 | second_choice_ = generateOtherChoice((distractors[i]), 2)
82 | if (second_choice_ != []):
83 | for each_second in second_choice_:
84 | distractors.append(each_second)
85 |
86 |         # remove choices that duplicate the answer
87 | temp = distractors[:]
88 | for i in temp:
89 | choice = nlp(i)
90 | for token in choice:
91 | if (token.lemma_ in answer) or (answer in token.lemma_):
92 | distractors.remove(i)
93 | break
94 |
95 | # Remove the cognate words in the options
96 | other_choice_ = []
97 | for j in distractors:
98 | choice = nlp(j)
99 | if (len(choice) == 1):
100 | for token in choice:
101 | other_choice_.append(token.lemma_)
102 | else:
103 | tempword = ''
104 | for token in choice:
105 | tempword = tempword + token.lemma_ + ' '
106 | other_choice_.append(tempword.rstrip(' '))
107 |
108 | qaPair['other_choice'] = distractors
109 |
110 | return qaPairs
111 |
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/Generate_Question.py:
--------------------------------------------------------------------------------
1 | import spacy
2 | nlp = spacy.load("en_core_web_sm")
3 |
4 |
5 | def blankAnswer(firstTokenIndex, lastTokenIndex, sentStart, sentEnd, doc):
6 | try:
7 | leftPartStart = doc[sentStart].idx
8 | leftPartEnd = doc[firstTokenIndex].idx
9 | rightPartStart = doc[lastTokenIndex].idx + len(doc[lastTokenIndex])
10 | rightPartEnd = doc[sentEnd - 1].idx + len(doc[sentEnd - 1])
11 | question = doc.text[leftPartStart:leftPartEnd] + '_____' + doc.text[rightPartStart:rightPartEnd]
12 | return question
13 | except IndexError:
14 | print('IndexError')
15 | pass
16 | except Exception as e:
17 | print(e)
18 |
19 |
20 |
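# Illustrative example (added note): for the text "Phishing is a type of attack."
# with the answer "Phishing" spanning token 0, blankAnswer() would produce the
# fill-in-the-blank question "_____ is a type of attack."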
21 | # One answer should only generate one question
22 | def addQuestions(answers, text):
23 | doc = nlp(text)
24 | currAnswerIndex = 0
25 | qaPair = []
26 |
27 | while (currAnswerIndex < len(answers)):
28 | answerDoc = nlp(answers[currAnswerIndex]['word'])
29 | # print(answerDoc)
30 | # answerIsFound = True
31 |
32 | for sent in doc.sents:
33 | for token in sent:
34 | answerIsFound = True
35 |
36 | # when answer count == 1
37 | if (len(answerDoc) == 1):
38 | if token.i >= len(doc) or doc[token.i].text.lower() != answerDoc[0].text.lower():
39 | answerIsFound = False
40 |
41 | if answerIsFound:
42 | question = blankAnswer(token.i, token.i + len(answerDoc) - 1, sent.start, sent.end, doc)
43 | #print(question)
44 | qaPair.append({'question': question, 'answer': answers[currAnswerIndex]['word']})
45 |
46 | # when answer count == 2
47 | elif(len(answerDoc)==2):
48 | if (token.i >= len(doc) or doc[token.i].text.lower() != answerDoc[0].text.lower()):
49 | answerIsFound = False
50 |
51 | if (token.i + 1 >= len(doc) or doc[token.i + 1].text.lower() != answerDoc[1].text.lower()):
52 | answerIsFound = False
53 |
54 | if answerIsFound:
55 | question = blankAnswer(token.i, token.i + len(answerDoc) - 1, sent.start, sent.end, doc)
56 | #print(question)
57 | qaPair.append({'question': question, 'answer': answers[currAnswerIndex]['word']})
58 |
59 | # when answer count > 2
60 | else:
61 | if (token.i >= len(doc) or doc[token.i].text.lower() != answerDoc[0].text.lower()):
62 | answerIsFound = False
63 |
64 | if (token.i + 2 >= len(doc) or doc[token.i + 2].text.lower() != answerDoc[2].text.lower()):
65 | answerIsFound = False
66 |
67 | if answerIsFound:
68 | question = blankAnswer(token.i, token.i + len(answerDoc) - 1, sent.start, sent.end, doc)
69 | #print(question)
70 | qaPair.append({'question': question, 'answer': answers[currAnswerIndex]['word']})
71 |
72 | currAnswerIndex += 1
73 |
74 | return qaPair
75 |
76 |
77 |
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/Pickle_File.py:
--------------------------------------------------------------------------------
1 | import _pickle as cPickle
2 | from pathlib import Path
3 |
4 | def dumpPickle(fileName, content):
5 | pickleFile = open(fileName, 'wb')
6 | cPickle.dump(content, pickleFile, -1)
7 | pickleFile.close()
8 |
9 |
10 | def loadPickle(fileName):
11 | file = open(fileName, 'rb')
12 | content = cPickle.load(file)
13 | file.close()
14 | return content
15 |
16 | def pickleExists(fileName):
17 | file = Path(str(fileName))
18 | if file.is_file():
19 | return True
20 | return False
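# Usage example (added note):
#   dumpPickle('example.pkl', {'a': 1})
#   data = loadPickle('example.pkl')      # -> {'a': 1}
#   pickleExists('example.pkl')           # -> True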
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/Predict_Answer.py:
--------------------------------------------------------------------------------
1 |
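# Added note on the expected inputs (based on how predict() is called in main-GQ.py):
#   clf       - fitted scikit-learn classifier loaded from a pickle
#   w_Df      - feature matrix aligned to the training columns (see asTrainingTable)
#   wordsDf   - original words DataFrame, used to look up the 'Words' column
#   NCnumber  - number of noun-chunk rows; Allnumber - total number of rows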
2 | def predict(clf, w_Df, wordsDf, NCnumber, Allnumber):
3 | # predict label
4 | pre_data = clf.predict(w_Df)
5 | # predict probability
6 | y_pred = clf.predict_proba(w_Df)
7 |
8 | labeledAnswers = []
9 | for i in range(len(pre_data)):
10 | labeledAnswers.append({'word': wordsDf.iloc[i]['Words'], 'pre_label': pre_data[i], 'prob': y_pred[i][0]})
11 |
12 |
13 | NClabeled = []
14 | Wlabeled = []
15 |
16 | for i in range(NCnumber):
17 | NClabeled.append(labeledAnswers[i])
18 |
19 | for i in range(NCnumber, Allnumber):
20 | Wlabeled.append(labeledAnswers[i])
21 |
22 | for i in range(Allnumber - NCnumber):
23 | if (Wlabeled[i]['pre_label'] == True):
24 | for j in range(NCnumber):
25 | if (Wlabeled[i]['word'] in NClabeled[j]['word']):
26 | Wlabeled.append({'word': NClabeled[j]['word'], 'pre_label': True,
27 | 'prob': max(Wlabeled[i]['prob'], NClabeled[j]['prob'])})
28 |
29 |
30 | # Sort
31 | paixu = sorted(Wlabeled, key=lambda labeledAnswers: labeledAnswers['prob'], reverse=True)
32 |
33 |
34 | # Deduplication
35 | dic = {}
36 | duplicate = []
37 | for word in paixu:
38 | if (word['pre_label'] == True):
39 | if word['word'] not in dic.keys():
40 | dic[word['word']] = word
41 | elif dic[word['word']]['prob'] < word['prob']:
42 | dic[word['word']] = word
43 | for value in dic.values():
44 | duplicate.append(value)
45 |
46 |     # If the predicted probability is at least 0.5, keep the word as a candidate answer
47 | final = []
48 | for i in duplicate:
49 | if (i['prob'] >= 0.5):
50 | final.append(i)
51 | return final
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |
60 |
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/main-GQ.py:
--------------------------------------------------------------------------------
1 | from training_content.generate_content.generate_quiz.Pickle_File import dumpPickle, loadPickle, pickleExists
2 | import pandas as pd
3 | from training_content.generate_content.generate_learning_content_data import getContent
4 | from training_content.generate_content.generate_quiz.Future_Engineering import newAddNCForParagrapgh, newAddWordForParagrapgh, oneHotEncoding, asTrainingTable
5 | from training_content.generate_content.generate_quiz.Predict_Answer import predict
6 | from training_content.generate_content.generate_quiz.Generate_Question import addQuestions
7 | from training_content.generate_content.generate_quiz.Generate_Choice import addChoice
8 | import json
9 | import spacy
10 | import os.path
11 | nlp = spacy.load("en_core_web_sm")
12 |
13 | # load glove pre-trained word vectors
14 | modelName = './model_glove.pkl'
15 | if (pickleExists(modelName) == False):
16 | from gensim.models import KeyedVectors
17 | glove_file = './Out_Source/glove_6B/glove.6B.300d.txt'
18 | tmp_file = './Out_Source/glove_6B/word2vec-glove.6B.300d.txt'
19 | from gensim.scripts.glove2word2vec import glove2word2vec
20 | glove2word2vec(glove_file, tmp_file)
21 | model_glove = KeyedVectors.load_word2vec_format(tmp_file)
22 | dumpPickle('model_glove.pkl', model_glove)
23 | else:
24 | model_glove = loadPickle('model_glove.pkl')
25 |
26 | # redefine the stop words in spaCy (also mark capitalized and upper-case variants)
27 | from spacy.lang.en.stop_words import STOP_WORDS
28 | for word in STOP_WORDS:
29 | for w in (word, word[0].capitalize(), word.upper()):
30 | lex = nlp.vocab[w]
31 | lex.is_stop = True
32 |
33 |
34 | # 1 load content text
35 | # 2 feature engineering
36 | # 3 training and processing
37 | # 4 generate question and answer
38 | # 5 construct the output data
39 |
40 | if __name__ =='__main__':
41 | text0_3 = []
42 | keyword = 'Computer_security'
43 |
44 | # load trained model
45 | clf = loadPickle('./Trained_Models/M_BernoulliNB+isotonic_smote.pkl')
46 | # load content text file
47 |     # generate the content file first if it does not exist yet
48 |     if not os.path.isfile('./Data_3_KLT.json'):
49 |         getContent(keyword)
50 |     with open('./Data_3_KLT.json', 'r', encoding='utf8') as fp:
51 |         text_data = json.load(fp)
52 |
53 | # extract all keywords' text
54 | for index in range(len(text_data)):
55 |         # extract the text of a certain level, here level 3
56 | if text_data[index]['Level'] == 3:
57 | # if text_data[index]['Level'] < 3:
58 | try:
59 | each_text = text_data[index]['Text']
60 | # delete the content in brackets and '\n'
61 | # text = re.sub(u"\\(.*?\\)", "", each_text).replace('\n', '')
62 |                 # skip entries whose text is empty
63 | if each_text != '':
64 | text0_3.append([index, each_text])
65 |             except KeyError:
66 |                 print(text_data[index]['Keyword'], 'has no text.')
67 | continue
68 |
69 |     print('\nLoading of text0_3 finished')
70 |     print('Total number of texts:', len(text0_3), '\n')
71 |
72 | for whole_number in range(len(text0_3)):
73 | text = text0_3[whole_number][1]
74 |         print('Text number', text0_3[whole_number][0], 'is being processed.')
75 | print('Keyword is ', text_data[text0_3[whole_number][0]]['Keyword'])
76 | ##################################
77 | #######Text Feature Engineering###
78 | ##################################
79 | words = []
80 | # Feature engineering for noun chunk in the text
81 | words = newAddNCForParagrapgh(words, text)
82 | NCnumber = len(words)
83 | # Feature engineering for word in the text
84 | words = newAddWordForParagrapgh(words, text)
85 | Allnumber = len(words)
86 |
87 |         # Define the feature engineering headers and table (wordsDf)
88 | wordColums = ['Words', 'InSentencePosition', 'Word_Count', 'NER', 'POS', 'TAG', 'DEP', 'Is_Alpha', 'Is_Stop']
89 | wordsDf = pd.DataFrame(words, columns=wordColums)
90 |
91 |         # one-hot encode the categorical columns of wordsDf
92 | wordsDf = oneHotEncoding(wordsDf)
93 |
94 | # Correspond to the feature table items of the training set
95 | w_Df, Finished_fe_table = asTrainingTable(wordsDf)
96 | #print('Feature engineer table', Finished_fe_table)
97 |
98 | ##################################
99 | ####### Predicted Answer #########
100 | ##################################
101 | # predict answer
102 | predict_answer = predict(clf, w_Df, wordsDf, NCnumber, Allnumber)
103 | print('predict answer: ', predict_answer)
104 | ##################################
105 | ####### Generate Question ########
106 | ##################################
107 | # generate question
108 | question_sets = addQuestions(predict_answer, text)
109 | #print(question_sets)
110 | #print(len(question_sets))
111 |
112 | ##################################
113 | ####### Generate other choice ####
114 | ##################################
115 | # generate other choice
116 | Allsets = addChoice(question_sets[:len(question_sets)], 4)
117 |
118 | completed_set = []
119 | for i in range(len(question_sets)):
120 | if (Allsets[i]['other_choice'] != [] and len(Allsets[i]['other_choice']) > 2):
121 | completed_set.append(Allsets[i])
122 |
123 | print(completed_set)
124 |         print(len(completed_set), 'questions were generated.')
125 | # words_questionsets = {'Question': completed_set}
126 |
127 | text_data[text0_3[whole_number][0]].update({'Questions': completed_set})
128 | print(text_data[text0_3[whole_number][0]])
129 |         print('number', text0_3[whole_number][0], 'questions added.\n')
130 |
131 | #print(text_data)
132 |
133 | # output the final quiz data
134 | filename = 'Data_level3_KLTQ.json'
135 | with open(filename, 'w') as file_obj:
136 | json.dump(text_data, file_obj)
137 |
--------------------------------------------------------------------------------
/training_content/generate_content/generate_quiz/requirements.txt:
--------------------------------------------------------------------------------
1 | tqdm==4.46.0
2 | numpy==1.18.1
3 | wikipedia==1.4.0
4 | pandas==1.0.3
5 | requests==2.23.0
6 | nltk==3.4.5
7 | imbalanced_learn==0.7.0
8 | gensim==3.8.3
9 | sense2vec==1.0.2
10 | spacy==2.3.2
11 | matplotlib==3.2.2
12 | scikit_learn==0.23.2
13 |
--------------------------------------------------------------------------------