31 |
32 | The solution is to append :code:`?charset=utf8mb4` to the database URL.
33 |
34 | So, if the database URL was:
35 |
36 | .. code-block:: python
37 |
38 |     f"mysql+mysqldb://{username}:{password}@{host}:{port}/{database}"
39 |
40 | then the new URL would be:
41 |
42 | .. code-block:: python
43 |
44 |     f"mysql+mysqldb://{username}:{password}@{host}:{port}/{database}?charset=utf8mb4"
45 |
46 | The database URL is the first argument passed to :code:`sqlalchemy.create_engine`:
47 |
48 | .. code-block:: python
49 |
50 |     import sqlalchemy
51 |
52 |     engine = sqlalchemy.create_engine(f"{dialect}+{driver}://{username}:{password}@{host}:{port}/{database}")
53 |
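The change described above can also be applied programmatically. The following is a minimal sketch using only the standard library; the helper name :code:`with_charset` and the example credentials are hypothetical and not part of :code:`bluesearch`. It sets the :code:`charset` query parameter while keeping any parameters that are already present.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_charset(db_url: str, charset: str = "utf8mb4") -> str:
    """Return db_url with the charset query parameter set (hypothetical helper)."""
    parts = urlsplit(db_url)
    # Keep existing query parameters and add or overwrite "charset".
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["charset"] = charset
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder credentials, host and database name for illustration only.
print(with_charset("mysql+mysqldb://user:secret@localhost:3306/mydb"))
```

The resulting string can then be passed to :code:`sqlalchemy.create_engine` as before.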
54 |
55 | DVC dataclasses issue
56 | ----------------------
57 |
58 | If the :code:`dataclasses` backport package is installed in a Python 3.7+
59 | environment, :code:`dvc pull` may fail with the following error:
60 |
61 | .. code-block:: text
62 |
63 |     AttributeError: module 'typing' has no attribute '_ClassVar'
64 |
65 | The solution is to uninstall the package :code:`dataclasses`:
66 |
67 | .. code-block:: bash
68 |
69 |     pip uninstall dataclasses
70 |
71 |
72 | DVC pull issue
73 | --------------
74 |
75 | When launching the :code:`mining_cache` or :code:`mining_server` entrypoints, or even
76 | simply running :code:`dvc pull`, one might run into the following error:
77 |
78 | .. code-block:: text
79 |
80 |     WARNING: Some of the cache files do not exist neither locally nor on remote.
81 |     Missing cache files:
82 |
83 | In this case, the solution is to go to the :code:`.dvc` directory
84 | and remove the file called :code:`config.local`:
85 |
86 | .. code-block:: bash
87 |
88 |     cd .dvc
89 |     rm config.local
90 |
91 | Running :code:`dvc pull` again should work after this.
92 |
--------------------------------------------------------------------------------
/docs/source/instructions.rst:
--------------------------------------------------------------------------------
1 | .. Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | Copyright (C) 2020 Blue Brain Project, EPFL.
3 | This program is free software: you can redistribute it and/or modify
4 | it under the terms of the GNU Lesser General Public License as published by
5 | the Free Software Foundation, either version 3 of the License, or
6 | (at your option) any later version.
7 | This program is distributed in the hope that it will be useful,
8 | but WITHOUT ANY WARRANTY; without even the implied warranty of
9 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
10 | GNU Lesser General Public License for more details.
11 | You should have received a copy of the GNU Lesser General Public License
12 | along with this program. If not, see .
13 |
14 | Instructions
15 | ============
16 |
17 | Installation
18 | ------------
19 | Before installation, please make sure you have a recent :code:`pip` installed (:code:`>=19.1`).
20 |
21 | Then you can install :code:`bluesearch` from PyPI:
22 |
23 | .. code-block:: bash
24 |
25 |     pip install bluesearch[data_and_models]
26 |
27 | You can also build from source if you prefer:
28 |
29 | .. code-block:: bash
30 |
31 |     pip install .[data_and_models]  # use -e for editable install
32 |
33 | NB: The optional dependencies installed with the :code:`[data_and_models]`
34 | option are only necessary if you want to run training or inference using
35 | :code:`dvc` and the models and scripts contained under :code:`data_and_models/`. If this is not
36 | the case, you can omit :code:`[data_and_models]` from the :code:`pip install` command.
37 |
38 |
39 | Generating docs
40 | ---------------
41 | All the versions of our documentation, both stable and latest,
42 | `can be found on Read the Docs `_.
43 |
44 |
45 | To generate the documentation manually, we use :code:`sphinx` with a custom BBP theme.
46 | Make sure to install the :code:`bluesearch` package with :code:`dev` extras to get
47 | the necessary dependencies.
48 |
49 | .. code-block:: bash
50 |
51 |     pip install -e .[dev]
52 |
53 | To generate the autodoc directives, run:
54 |
55 | .. code-block:: bash
56 |
57 |     cd docs
58 |     sphinx-apidoc -o source/api/ -f -e ../src/bluesearch/ ../src/bluesearch/entrypoint/*
59 |
60 | Note that this only needs to be rerun when there are new subpackages or modules.
61 |
62 | To generate the documentation, run:
63 |
64 | .. code-block:: bash
65 |
66 |     cd docs
67 |     make clean && make html
68 |
69 |
70 | Finally, one can also run the doctests:
71 |
72 | .. code-block:: bash
73 |
74 |     cd docs
75 |     make doctest
76 |
--------------------------------------------------------------------------------
/docs/source/logo/BlueBrainSearch_banner.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BlueBrain/Search/503fdf320a62ab2eb1f9a2a371600e4f1a38df62/docs/source/logo/BlueBrainSearch_banner.jpg
--------------------------------------------------------------------------------
/luigi.cfg:
--------------------------------------------------------------------------------
1 | ;Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | ;
3 | ;Copyright (C) 2020 Blue Brain Project, EPFL.
4 | ;
5 | ;This program is free software: you can redistribute it and/or modify
6 | ;it under the terms of the GNU Lesser General Public License as published by
7 | ;the Free Software Foundation, either version 3 of the License, or
8 | ;(at your option) any later version.
9 | ;
10 | ;This program is distributed in the hope that it will be useful,
11 | ;but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | ;MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | ;GNU Lesser General Public License for more details.
14 | ;
15 | ;You should have received a copy of the GNU Lesser General Public License
16 | ;along with this program. If not, see .
17 |
18 | [core]
19 | autoload_range = true
20 | log_level = INFO
21 | local_scheduler = True
22 |
23 | [GlobalParams]
24 | source = pubmed
25 |
26 | [DownloadTask]
27 | from_month = 2021-12
28 | to_month = 2022-02
29 | output_dir = luigi-pipeline
30 | identifier =
31 | ; an empty string is considered the default value
32 |
33 | [TopicExtractTask]
34 | mesh_topic_db = luigi-pipeline/mesh_topic_db.json
35 |
36 | [TopicFilterTask]
37 | filter_config = luigi-pipeline/filter-config.jsonl
38 |
39 | [ConvertPDFTask]
40 | grobid_host = 0.0.0.0
41 | grobid_port = 8070
42 |
43 | [AddTask]
44 | db_url = luigi-pipeline/my-db.db
45 | db_type = sqlite
46 |
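Luigi loads this file on its own. As a rough illustration of how the sections and keys above are laid out, the following sketch reads an excerpt of the same content with Python's standard :code:`configparser`. The excerpt is inlined so that the example is self-contained, and the variable names are assumptions made for illustration only.

```python
import configparser

# Excerpt of the configuration above, inlined so the sketch runs on its own.
LUIGI_CFG = """
[core]
log_level = INFO
local_scheduler = True

[DownloadTask]
from_month = 2021-12
to_month = 2022-02
output_dir = luigi-pipeline
identifier =
; an empty string is considered the default value

[AddTask]
db_url = luigi-pipeline/my-db.db
db_type = sqlite
"""

parser = configparser.ConfigParser()
parser.read_string(LUIGI_CFG)

download = parser["DownloadTask"]
# Values are plain strings; lines starting with ";" are treated as comments.
print(download["from_month"], download["to_month"])
# A key with nothing after "=" is read as an empty string.
print(repr(download["identifier"]))
# Boolean-like values can be converted explicitly.
print(parser.getboolean("core", "local_scheduler"))
```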
--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2020 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see .
17 |
18 | [build-system]
19 | requires = [
20 | "pip>=9",
21 | "setuptools>=45",
22 | "setuptools_scm[toml]>=3.4",
23 | "wheel",
24 | ]
25 | # This is pip's default value if the build-backend key is missing
26 | # Ref: https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/#fallback-behaviour
27 | # Tox with tox.isolated_build = true needs this key to be defined explicitly.
28 | # Setuptools instructs setting build-backend without __legacy__, ref:
29 | # https://setuptools.pypa.io/en/latest/build_meta.html#how-to-use-it
30 | build-backend = "setuptools.build_meta"
31 |
32 | [tool.black]
33 | extend-exclude = """data_and_models/pipelines/ner/transformers_vs_spacy/transformers/
34 | |data_and_models/pipelines/sentence_embedding/training_transformers/"""
35 |
--------------------------------------------------------------------------------
/requirements-data_and_models.txt:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2020 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see .
17 |
18 | PyYAML==5.4.1
19 | dvc[ssh]==2.5.4
20 | matplotlib==3.4.2
21 | scipy==1.7.0
22 | spacy_lookups_data==1.0.2
23 | srsly==2.4.1
24 | transformers==4.6.1
25 | typer==0.3.2
26 |
--------------------------------------------------------------------------------
/requirements-dev.txt:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2020 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see .
17 |
18 | Sphinx==4.1.1
19 | docker==5.0.0
20 | en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
21 | pytest-benchmark==3.4.1
22 | pytest-cov==2.12.1
23 | pytest==6.2.4
24 | responses==0.19.0
25 | sphinx-bluebrain-theme==0.2.4
26 | tox==3.24.0
27 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2020 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see .
17 |
18 | Flask==2.0.1
19 | SQLAlchemy[mysql,pymysql]==1.4.21
20 | boto3==1.20.16
21 | catalogue==2.0.4
22 | cryptography==3.4.7
23 | defusedxml==0.6.0
24 | google-cloud-storage==1.43.0
25 | h5py==3.3.0
26 | ipython==7.31.1
27 | ipywidgets==7.6.3
28 | jupyterlab==3.0.17
29 | langdetect==1.0.9
30 | luigi==3.0.3
31 | mashumaro==3.0
32 | numpy==1.21.0
33 | pandas==1.3.0
34 | pg8000==1.23.0
35 | python-dotenv==0.18.0
36 | requests==2.26.0
37 | scikit-learn==0.24.2
38 | sentence-transformers==2.0.0
39 | spacy==3.0.7
40 | spacy-transformers==1.0.3
41 | torch==1.9.0
42 | elasticsearch==8.3.3
--------------------------------------------------------------------------------
/screenshots/mining_widget_articles.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BlueBrain/Search/503fdf320a62ab2eb1f9a2a371600e4f1a38df62/screenshots/mining_widget_articles.png
--------------------------------------------------------------------------------
/screenshots/mining_widget_text.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BlueBrain/Search/503fdf320a62ab2eb1f9a2a371600e4f1a38df62/screenshots/mining_widget_text.png
--------------------------------------------------------------------------------
/screenshots/search_widget.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BlueBrain/Search/503fdf320a62ab2eb1f9a2a371600e4f1a38df62/screenshots/search_widget.png
--------------------------------------------------------------------------------
/src/bluesearch/__init__.py:
--------------------------------------------------------------------------------
1 | """bluesearch: a Python package for text mining on scientific use cases."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
20 | from bluesearch.version import __version__ # noqa
21 |
--------------------------------------------------------------------------------
/src/bluesearch/_css/__init__.py:
--------------------------------------------------------------------------------
1 | """CSS styling utilities."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
--------------------------------------------------------------------------------
/src/bluesearch/_css/style.py:
--------------------------------------------------------------------------------
1 | """CSS styling utilities."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
20 | import pkg_resources
21 |
22 |
23 | def get_css_style():
24 |     """Get content of CSS style sheet."""
25 |     css_file = pkg_resources.resource_filename(__name__, "stylesheet.css")
26 |     with open(css_file, "r") as f:
27 |         css_style = f.read()
28 |     return css_style
29 |
--------------------------------------------------------------------------------
/src/bluesearch/_css/stylesheet.css:
--------------------------------------------------------------------------------
1 | /*
2 | Blue Brain Search is a text mining toolbox focused on scientific use cases.
3 |
4 | Copyright (C) 2020 Blue Brain Project, EPFL.
5 |
6 | This program is free software: you can redistribute it and/or modify
7 | it under the terms of the GNU Lesser General Public License as published by
8 | the Free Software Foundation, either version 3 of the License, or
9 | (at your option) any later version.
10 |
11 | This program is distributed in the hope that it will be useful,
12 | but WITHOUT ANY WARRANTY; without even the implied warranty of
13 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
14 | GNU Lesser General Public License for more details.
15 |
16 | You should have received a copy of the GNU Lesser General Public License
17 | along with this program. If not, see .
18 | */
19 |
20 | /* search engine */
21 |
22 | .article_title {
23 | font-size: 17px;
24 | color: #1A0DAB;
25 | }
26 | .paragraph {
27 | font-size: 13px;
28 | color: #222;
29 | }
30 | .paragraph_emph {
31 | font-weight: bold;
32 | color: #000;
33 | }
34 | .metadata {
35 | font-size: 13px;
36 | color: #006621;
37 | }
38 |
39 | /* success */
40 | .bbs_success {
41 | color : #388E3B
42 | }
43 |
44 | /* warnings */
45 | .bbs_warning {
46 | color: #DDB62C
47 | }
48 |
49 | /* errors */
50 | .bbs_error {
51 | color: #E75C58
52 | }
53 |
54 | /* widgets buttons */
55 | .bbs_button {
56 | background-color: #3c96f3;
57 | color: #FFF;
58 | font-size: 150%;
59 | transition-duration: 0.2s;
60 | }
61 | .bbs_button:hover {
62 | background-color: #3176d2;
63 | }
64 |
65 | .jupyter-button:active, .jupyter-button.mod-active {
66 | color: #FFF;
67 | background-color: #3c96f3;
68 | }
69 | .jupyter-button:hover {
70 | color: #FFF;
71 | background-color: #3176d2;
72 | }
73 |
74 | /* attribute extraction */
75 |
76 | .number {
77 | display: inline-block;
78 | background: lightgreen;
79 | padding: 0.2em 0.5em;
80 | border-radius: 7px;
81 | }
82 | .unit {
83 | display: inline-block;
84 | background: pink;
85 | padding: 0.2em 0.5em;
86 | border-radius: 7px;
87 | }
88 | .quantityType {
89 | display: inline-block;
90 | background: yellow;
91 | font-variant:small-caps;
92 | padding: 0.2em 0.5em;
93 | border-radius: 7px;
94 | }
95 | .fixedWidth {
96 | width: 4px;
97 | text-align: justify;
98 | }
99 |
--------------------------------------------------------------------------------
/src/bluesearch/database/__init__.py:
--------------------------------------------------------------------------------
1 | """Embedding and Mining Databases."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
--------------------------------------------------------------------------------
/src/bluesearch/database/pdf.py:
--------------------------------------------------------------------------------
1 | # Copyright (C) 2020 Blue Brain Project, EPFL.
2 | #
3 | # This program is free software: you can redistribute it and/or modify
4 | # it under the terms of the GNU Lesser General Public License as published by
5 | # the Free Software Foundation, either version 3 of the License, or
6 | # (at your option) any later version.
7 | #
8 | # This program is distributed in the hope that it will be useful,
9 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
10 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
11 | # GNU Lesser General Public License for more details.
12 | #
13 | # You should have received a copy of the GNU Lesser General Public License
14 | # along with this program. If not, see .
15 | """Module for PDF conversion."""
16 | import requests
17 |
18 |
19 | def grobid_is_alive(host: str, port: int) -> bool:
20 |     """Test if the GROBID server is alive.
21 |
22 |     This server API is documented here:
23 |     https://grobid.readthedocs.io/en/latest/Grobid-service/#service-checks
24 |
25 |     Parameters
26 |     ----------
27 |     host
28 |         Host of the GROBID server.
29 |     port
30 |         Port of the GROBID server.
31 |
32 |     Returns
33 |     -------
34 |     bool
35 |         Whether the GROBID server is alive.
36 |     """
37 |     try:
38 |         response = requests.get(f"http://{host}:{port}/api/isalive")
39 |     except requests.RequestException:
40 |         return False
41 |
42 |     if response.content == b"true":
43 |         return True
44 |     else:
45 |         return False
46 |
47 |
48 | def grobid_pdf_to_tei_xml(pdf_content: bytes, host: str, port: int) -> str:
49 |     """Convert PDF file to TEI XML using GROBID server.
50 |
51 |     This function uses the GROBID API service to convert PDF to a TEI XML format.
52 |     In order to set up a GROBID server, follow the instructions from
53 |     https://grobid.readthedocs.io/en/latest/Grobid-docker/.
54 |
55 |     Parameters
56 |     ----------
57 |     pdf_content
58 |         PDF content.
59 |     host
60 |         Host of the GROBID server.
61 |     port
62 |         Port of the GROBID server.
63 |
64 |     Returns
65 |     -------
66 |     str
67 |         TEI XML parsing of the PDF content.
68 |     """
69 |     url = f"http://{host}:{port}/api/processFulltextDocument"
70 |     files = {"input": pdf_content}
71 |     headers = {"Accept": "application/xml"}
72 |     timeout = 60
73 |
74 |     response = requests.post(
75 |         url=url,
76 |         files=files,
77 |         headers=headers,
78 |         timeout=timeout,
79 |     )
80 |     response.raise_for_status()
81 |     return response.text
82 |
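A GROBID server can take a moment to start, so callers may want to poll the liveness check before sending PDFs. The following is a minimal sketch of such a retry loop. The helper name `wait_until_alive` and its parameters are hypothetical and not part of this module, and the check is passed in as a callable so that the example runs without a server.

```python
import time
from typing import Callable


def wait_until_alive(
    is_alive: Callable[[], bool], attempts: int = 5, delay: float = 1.0
) -> bool:
    """Call is_alive up to `attempts` times, sleeping `delay` seconds in between."""
    for attempt in range(attempts):
        if is_alive():
            return True
        # Do not sleep after the last failed attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Stand-in for `lambda: grobid_is_alive(host, port)` so the sketch runs offline.
responses = iter([False, False, True])
print(wait_until_alive(lambda: next(responses), attempts=5, delay=0.0))
```

With a real server, the callable would wrap `grobid_is_alive` with the host and port of the GROBID instance.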
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/__init__.py:
--------------------------------------------------------------------------------
1 | """Subpackage containing all the entry points."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/database/__init__.py:
--------------------------------------------------------------------------------
1 | """Subpackage for database creation."""
2 |
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/database/init.py:
--------------------------------------------------------------------------------
1 | """Initialization of the database."""
2 | import argparse
3 | import logging
4 |
5 | logger = logging.getLogger(__name__)
6 |
7 |
8 | def init_parser(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
9 |     """Initialise the argument parser for the init subcommand.
10 |
11 |     Parameters
12 |     ----------
13 |     parser
14 |         The argument parser to initialise.
15 |
16 |     Returns
17 |     -------
18 |     argparse.ArgumentParser
19 |         The initialised argument parser. The same object as the `parser`
20 |         argument.
21 |     """
22 |     parser.description = "Initialize a database."
23 |
24 |     parser.add_argument(
25 |         "db_url",
26 |         type=str,
27 |         help="""
28 |         The location of the database depending on the database type.
29 |
30 |         For MySQL and MariaDB the server URL should be provided, for SQLite the
31 |         location of the database file. Generally, the scheme part of
32 |         the URL should be omitted, e.g. for MySQL the URL should be
33 |         of the form 'my_sql_server.ch:1234/my_database' and for SQLite
34 |         of the form '/path/to/the/local/database.db'.
35 |         """,
36 |     )
37 |     parser.add_argument(
38 |         "--db-type",
39 |         default="sqlite",
40 |         type=str,
41 |         choices=("mariadb", "mysql", "postgres", "sqlite"),
42 |         help="Type of the database.",
43 |     )
44 |     return parser
45 |
46 |
47 | def run(
48 |     *,
49 |     db_url: str,
50 |     db_type: str,
51 | ) -> int:
52 |     """Initialize database.
53 |
54 |     Parameter description and potential defaults are documented inside of the
55 |     `init_parser` function.
56 |     """
57 |     logger.info("Importing dependencies")
58 |     import sqlalchemy
59 |
60 |     from bluesearch.entrypoint.database.schemas import schema_articles, schema_sentences
61 |
62 |     if db_type == "sqlite":
63 |         engine = sqlalchemy.create_engine(f"sqlite:///{db_url}")
64 |
65 |     elif db_type in {"mariadb", "mysql"}:
66 |         engine = sqlalchemy.create_engine(f"mysql+pymysql://{db_url}")
67 |
68 |     elif db_type == "postgres":
69 |         engine = sqlalchemy.create_engine(f"postgresql+pg8000://{db_url}")
70 |
71 |     else:
72 |         # This branch is never reached because of `choices` in `argparse`
73 |         raise ValueError(f"Unrecognized database type {db_type}")  # pragma: nocover
74 |
75 |     metadata = sqlalchemy.MetaData()
76 |
77 |     # Creation of the schema of the tables
78 |     schema_articles(metadata)
79 |     schema_sentences(metadata)
80 |
81 |     # Construction
82 |     with engine.begin() as connection:
83 |         metadata.create_all(connection)
84 |
85 |     logger.info("Initialization done")
86 |
87 |     return 0
88 |
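The mapping from `db_type` to the SQLAlchemy URL scheme used in `run` can be summarised in isolation. The following is a minimal sketch of that mapping only; the helper name `build_engine_url` is hypothetical and does not exist in this module, and no engine is created, so the example runs without SQLAlchemy installed.

```python
def build_engine_url(db_type: str, db_url: str) -> str:
    """Prefix db_url with the scheme matching db_type (hypothetical helper)."""
    prefixes = {
        "sqlite": "sqlite:///",
        "mariadb": "mysql+pymysql://",
        "mysql": "mysql+pymysql://",
        "postgres": "postgresql+pg8000://",
    }
    try:
        return prefixes[db_type] + db_url
    except KeyError:
        raise ValueError(f"Unrecognized database type {db_type}") from None


# Example values taken from the help text of the `db_url` argument.
print(build_engine_url("sqlite", "/path/to/the/local/database.db"))
print(build_engine_url("mysql", "my_sql_server.ch:1234/my_database"))
```

Note that an absolute SQLite path yields four slashes after the scheme, which is the form SQLAlchemy expects for absolute paths.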
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/database/parse_mesh_rdf.py:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2022 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see .
17 | """CLI sub-command for parsing MeSH RDF files."""
18 | from __future__ import annotations
19 |
20 | import argparse
21 | import gzip
22 | import json
23 | import logging
24 | import pathlib
25 |
26 | logger = logging.getLogger(__name__)
27 |
28 |
29 | def init_parser(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
30 |     """Initialise the argument parser for the parse-mesh-rdf subcommand.
31 |
32 |     Parameters
33 |     ----------
34 |     parser
35 |         The argument parser to initialise.
36 |
37 |     Returns
38 |     -------
39 |     argparse.ArgumentParser
40 |         The initialised argument parser. The same object as the `parser`
41 |         argument.
42 |     """
43 |     parser.description = "Parse a MeSH RDF file in N-Triples format."
44 |     parser.add_argument(
45 |         "mesh_nt_gz_file",
46 |         type=pathlib.Path,
47 |         help="""
48 |         Path to a "mesh*.nt.gz" file downloaded from
49 |         https://nlmpubs.nlm.nih.gov/projects/mesh/rdf/
50 |         """,
51 |     )
52 |     parser.add_argument(
53 |         "output_json_file",
54 |         type=pathlib.Path,
55 |         help="""
56 |         The output file for parsing results. The JSON file will contain a
57 |         flat dictionary with MeSH tree names as keys and corresponding topic
58 |         labels as values.
59 |         """,
60 |     )
61 |     return parser
62 |
63 |
64 | def run(*, mesh_nt_gz_file: pathlib.Path, output_json_file: pathlib.Path) -> int:
65 |     """Parse a MeSH RDF file to extract the topic tree structure.
66 |
67 |     See the `init_parser` function for more information on
68 |     the command and its parameters.
69 |     """
70 |     from bluesearch.database import mesh
71 |
72 |     if not mesh_nt_gz_file.exists():
73 |         logger.error(f"The file {mesh_nt_gz_file} does not exist.")
74 |         return 1
75 |     if not mesh_nt_gz_file.is_file():
76 |         logger.error(f"The path {mesh_nt_gz_file} must be a file.")
77 |         return 1
78 |
79 |     logger.info(f"Parsing the MeSH file {mesh_nt_gz_file.resolve().as_uri()}")
80 |     with gzip.open(mesh_nt_gz_file, "rt") as fh:
81 |         tree_number_to_label = mesh.parse_tree_numbers(fh)
82 |
83 |     logger.info(f"Saving results to {output_json_file.resolve().as_uri()}")
84 |     with open(output_json_file, "w") as fh:
85 |         json.dump(tree_number_to_label, fh)
86 |
87 |     logger.info("Done")
88 |     return 0
89 |
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/database/schemas.py:
--------------------------------------------------------------------------------
1 | """Module for defining SQL schemas."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see .
19 |
20 | from sqlalchemy import (
21 |     Boolean,
22 |     Column,
23 |     Date,
24 |     ForeignKey,
25 |     Integer,
26 |     MetaData,
27 |     String,
28 |     Table,
29 |     Text,
30 |     UniqueConstraint,
31 | )
32 |
33 |
34 | def schema_articles(metadata: MetaData) -> None:
35 |     """Add to the given 'metadata' the schema of the table 'articles'."""
36 |     Table(
37 |         "articles",
38 |         metadata,
39 |         Column("article_id", String(32), primary_key=True),
40 |         Column("doi", Text()),
41 |         Column("pmc_id", Text()),
42 |         Column("pubmed_id", Text()),
43 |         Column("title", Text()),
44 |         Column("authors", Text()),
45 |         Column("abstract", Text()),
46 |         Column("journal", Text()),
47 |         Column("publish_time", Date()),
48 |         Column("license", Text()),
49 |         Column("is_english", Boolean()),
50 |     )
51 |
52 |
53 | def schema_sentences(metadata: MetaData) -> None:
54 |     """Add to the given 'metadata' the schema of the table 'sentences'."""
55 |     Table(
56 |         "sentences",
57 |         metadata,
58 |         Column("sentence_id", Integer(), primary_key=True, autoincrement=True),
59 |         Column("section_name", Text()),
60 |         Column("text", Text()),
61 |         Column(
62 |             "article_id", String(32), ForeignKey("articles.article_id"), nullable=False
63 |         ),
64 |         Column("paragraph_pos_in_article", Integer(), nullable=False),
65 |         Column("sentence_pos_in_paragraph", Integer(), nullable=False),
66 |         UniqueConstraint(
67 |             "article_id",
68 |             "paragraph_pos_in_article",
69 |             "sentence_pos_in_paragraph",
70 |             name="sentence_unique_identifier",
71 |         ),
72 |         Column("is_bad", Boolean(), server_default="0"),
73 |     )
74 |
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/embedding_server.py:
--------------------------------------------------------------------------------
1 | """Entrypoint for launching an embedding server."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import logging
21 | import sys
22 |
23 | from bluesearch.embedding_models import get_embedding_model
24 | from bluesearch.entrypoint._helper import configure_logging, get_var, run_server
25 |
26 |
27 | def get_embedding_app():
28 |     """Construct the embedding flask app."""
29 |     from bluesearch.server.embedding_server import EmbeddingServer
30 |
31 |     # Read configuration
32 |     log_file = get_var("BBS_EMBEDDING_LOG_FILE", check_not_set=False)
33 |     log_level = get_var("BBS_EMBEDDING_LOG_LEVEL", logging.INFO, var_type=int)
34 |
35 |     # Configure logging
36 |     configure_logging(log_file, log_level)
37 |     logger = logging.getLogger(__name__)
38 |
39 |     logger.info(" Configuration ".center(80, "-"))
40 |     logger.info(f"log-file : {log_file}")
41 |     logger.info(f"log-level : {log_level}")
42 |     logger.info("-" * 80)
43 |
44 |     # Load embedding models
45 |     logger.info("Loading embedding models")
46 |     supported_models = ["SBERT", "SBioBERT", "BioBERT NLI+STS"]
47 |     embedding_models = {
48 |         model_name: get_embedding_model(model_name) for model_name in supported_models
49 |     }
50 |
51 |     # Create Server app
52 |     logger.info("Creating the server app")
53 |     embedding_app = EmbeddingServer(embedding_models)
54 |
55 |     return embedding_app
56 |
57 |
58 | def run_embedding_server():
59 |     """Run the embedding server."""
60 |     run_server(get_embedding_app, "embedding")
61 |
62 |
63 | if __name__ == "__main__":  # pragma: no cover
64 |     sys.exit(run_embedding_server())
65 |
--------------------------------------------------------------------------------
/src/bluesearch/entrypoint/search_server.py:
--------------------------------------------------------------------------------
1 | """The entrypoint script for the search server."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import logging
21 | import pathlib
22 | import sys
23 |
24 | import sqlalchemy
25 |
26 | from bluesearch.entrypoint._helper import configure_logging, get_var, run_server
27 |
28 |
29 | def get_search_app():
30 |     """Construct the search flask app."""
31 |     from bluesearch.server.search_server import SearchServer
32 |     from bluesearch.utils import H5
33 |
34 |     # Read configuration
35 |     log_file = get_var("BBS_SEARCH_LOG_FILE", check_not_set=False)
36 |     log_level = get_var("BBS_SEARCH_LOG_LEVEL", logging.INFO, var_type=int)
37 |
38 |     models_path = get_var("BBS_SEARCH_MODELS_PATH")
39 |     embeddings_path = get_var("BBS_SEARCH_EMBEDDINGS_PATH")
40 |     which_models = get_var("BBS_SEARCH_MODELS")
41 |
42 |     mysql_url = get_var("BBS_SEARCH_DB_URL")
43 |     mysql_user = get_var("BBS_SEARCH_MYSQL_USER")
44 |     mysql_password = get_var("BBS_SEARCH_MYSQL_PASSWORD")
45 |
46 |     # Configure logging
47 |     configure_logging(log_file, log_level)
48 |     logger = logging.getLogger(__name__)
49 |
50 |     logger.info(" Configuration ".center(80, "-"))
51 |     logger.info(f"log-file : {log_file}")
52 |     logger.info(f"log-level : {log_level}")
53 |     logger.info(f"models-path : {models_path}")
54 |     logger.info(f"embeddings-path : {embeddings_path}")
55 |     logger.info(f"which-models : {which_models}")
56 |     logger.info(f"mysql_url : {mysql_url}")
57 |     logger.info(f"mysql_user : {mysql_user}")
58 |     logger.info("mysql_password : ********")  # never write the secret to the log
59 |     logger.info("-" * 80)
60 |
61 |     # Initialize flask app
62 |     logger.info("Creating the Flask app")
63 |     models_path = pathlib.Path(models_path)
64 |     embeddings_path = pathlib.Path(embeddings_path)
65 |     engine_url = f"mysql://{mysql_user}:{mysql_password}@{mysql_url}"
66 |     engine = sqlalchemy.create_engine(engine_url, pool_recycle=14400)
67 |     models_list = [model.strip() for model in which_models.split(",")]
68 |     indices = H5.find_populated_rows(embeddings_path, models_list[0])
69 |
70 |     server_app = SearchServer(
71 |         models_path, embeddings_path, indices, engine, models_list
72 |     )
73 |     return server_app
74 |
75 |
76 | def run_search_server():
77 |     """Run the search server."""
78 |     run_server(get_search_app, "search")
79 |
80 |
81 | if __name__ == "__main__":  # pragma: no cover
82 |     sys.exit(run_search_server())
83 |
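The engine URL in `get_search_app` interpolates the credentials directly into the string, so a password containing characters such as `@` or `/` would produce a malformed URL. Below is a minimal sketch, using only the standard library, of how the credentials could be percent-encoded before being passed to `create_engine`, and how a placeholder could be logged instead of the secret. The helper names `build_engine_url` and `mask_password` are hypothetical and not part of bluesearch.

```python
from urllib.parse import quote_plus


def build_engine_url(user: str, password: str, host_and_db: str) -> str:
    """Build a MySQL URL, percent-encoding characters such as '@' or '/'.

    Hypothetical helper for illustration; `host_and_db` plays the role of
    the BBS_SEARCH_DB_URL value, e.g. "localhost:3306/bbs".
    """
    return f"mysql://{quote_plus(user)}:{quote_plus(password)}@{host_and_db}"


def mask_password(password: str) -> str:
    """Return a fixed-length placeholder that is safe to write to a log."""
    return "*" * 8 if password else "<not set>"


url = build_engine_url("guest", "p@ss/word", "localhost:3306/bbs")
print(url)
print(mask_password("p@ss/word"))
```

The fixed-length placeholder avoids leaking even the length of the password.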
--------------------------------------------------------------------------------
/src/bluesearch/k8s/__init__.py:
--------------------------------------------------------------------------------
1 | """Subpackage for Kubernetes related code."""
2 |
--------------------------------------------------------------------------------
/src/bluesearch/k8s/connect.py:
--------------------------------------------------------------------------------
1 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
2 | #
3 | # Copyright (C) 2020 Blue Brain Project, EPFL.
4 | #
5 | # This program is free software: you can redistribute it and/or modify
6 | # it under the terms of the GNU Lesser General Public License as published by
7 | # the Free Software Foundation, either version 3 of the License, or
8 | # (at your option) any later version.
9 | #
10 | # This program is distributed in the hope that it will be useful,
11 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
12 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13 | # GNU Lesser General Public License for more details.
14 | #
15 | # You should have received a copy of the GNU Lesser General Public License
16 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
17 | """Connect to Elasticsearch (ES)."""
18 | import logging
19 | import os
20 |
21 | import urllib3
22 | from dotenv import load_dotenv
23 | from elasticsearch import Elasticsearch
24 |
25 | load_dotenv()
26 | urllib3.disable_warnings()
27 |
28 | logger = logging.getLogger(__name__)
29 |
30 |
31 | def connect() -> Elasticsearch:
32 |     """Return a client connected to ES."""
33 |     client = Elasticsearch(
34 |         os.environ["ES_URL"],
35 |         basic_auth=("elastic", os.environ["ES_PASS"]),
36 |         verify_certs=False,
37 |     )
38 |
39 |     if not client.ping():
40 |         raise RuntimeError(f"Cannot connect to ES: {os.environ['ES_URL']}")
41 |
42 |     logger.info("Connected to ES")
43 |
44 |     return client
45 |
46 |
47 | if __name__ == "__main__":
48 |     connect()
49 |
--------------------------------------------------------------------------------
/src/bluesearch/mining/__init__.py:
--------------------------------------------------------------------------------
1 | """Subpackage for text mining."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
--------------------------------------------------------------------------------
/src/bluesearch/py.typed:
--------------------------------------------------------------------------------
1 | # Marker file for PEP 561.
2 |
--------------------------------------------------------------------------------
/src/bluesearch/server/__init__.py:
--------------------------------------------------------------------------------
1 | """Implementation of servers."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
--------------------------------------------------------------------------------
/src/bluesearch/server/invalid_usage_exception.py:
--------------------------------------------------------------------------------
1 | """Custom exceptions."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 |
21 | class InvalidUsage(Exception):
22 |     """An exception used in the REST API server.
23 |
24 |     The code was largely copied from
25 |     https://flask.palletsprojects.com/en/1.1.x/patterns/apierrors/
26 |     """
27 |
28 |     def __init__(self, message, status_code=None):
29 |         Exception.__init__(self)
30 |         self.message = message
31 |         if status_code is None:
32 |             self.status_code = 400
33 |         else:
34 |             self.status_code = status_code
35 |
36 |     def to_dict(self):
37 |         """Generate a dictionary."""
38 |         rv = {}
39 |         rv["message"] = self.message
40 |         return rv
41 |
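As a rough illustration of how an exception like `InvalidUsage` is typically consumed, the sketch below maps it to a JSON body and a status code using only the standard library. The class is a simplified stand-in so that the example runs on its own, and the `handle` helper and `failing_view` function are hypothetical; the actual servers may wire this up differently (for example through a Flask error handler).

```python
import json


class InvalidUsage(Exception):
    """Simplified stand-in for the class above so the sketch is self-contained."""

    def __init__(self, message, status_code=None):
        super().__init__(message)
        self.message = message
        self.status_code = 400 if status_code is None else status_code

    def to_dict(self):
        return {"message": self.message}


def handle(func):
    """Hypothetical helper: call `func` and map InvalidUsage to (body, status)."""
    try:
        return json.dumps(func()), 200
    except InvalidUsage as error:
        return json.dumps(error.to_dict()), error.status_code


def failing_view():
    """Hypothetical view that rejects the request."""
    raise InvalidUsage("The request has to be a JSON object.")


body, status = handle(failing_view)
print(status, body)
```

Defaulting the status code to 400 keeps the common "bad request" case short while still allowing callers to pass another code.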
--------------------------------------------------------------------------------
/src/bluesearch/widgets/__init__.py:
--------------------------------------------------------------------------------
1 | """Various widgets related to the BBS."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
--------------------------------------------------------------------------------
/tests/data/cord19_v35/document_parses/pmc_json/PMC7186928.xml.json:
--------------------------------------------------------------------------------
1 | {
2 | "paper_id": "PMC7186928",
3 | "metadata": {
4 | "title": "Will we see protection or reinfection in COVID-19?",
5 | "authors": [
6 | {
7 | "first": "Miyo",
8 | "middle": [],
9 | "last": "Ota",
10 | "suffix": "",
11 | "email": "sinai.immunology@gmail.com",
12 | "affiliation": {}
13 | }
14 | ]
15 | },
16 | "body_text": [
17 | {
18 | "text": "There is rising concern that patients who recover from COVID-19 may be at risk of reinfection. In this preprint, Bao et al. investigated acquired immunity to SARS-CoV-2 in rhesus macaques. Four rhesus monkeys were infected with SARS-CoV-2 and two were reinfected after confirmed recovery. After primary infection, viral replication was detected in the nose, pharynx, lungs and gut, with histopathological evidence of lung damage. Sera collected from recovered monkeys before reinfection exhibited neutralizing activity against SARS-CoV-2. Upon reinfection, viral replication was not detected in nasopharyngeal or anal swabs, and reinfected monkeys did not show any signs of COVID-19 disease recurrence. This suggests that immunity acquired following primary infection with SARS-CoV-2 may protect upon subsequent exposure to the virus.",
19 | "cite_spans": [],
20 | "section": "",
21 | "ref_spans": []
22 | }
23 | ],
24 | "ref_entries": {},
25 | "back_matter": [],
26 | "bib_entries": {
27 | "BIBREF0": {
28 | "title": "Reinfection could not occur in SARS-CoV-2 infected rhesus macaques",
29 | "authors": [
30 | {
31 | "first": "L",
32 | "middle": [],
33 | "last": "Bao",
34 | "suffix": ""
35 | }
36 | ],
37 | "year": 2020,
38 | "venue": "bioRxiv",
39 | "volume": "",
40 | "issn": "",
41 | "pages": null,
42 | "other_ids": {
43 | "DOI": [
44 | "10.1101/2020.03.13.990226"
45 | ]
46 | }
47 | }
48 | }
49 | }
--------------------------------------------------------------------------------
/tests/data/mining/eval/iob_punctuation_after.csv:
--------------------------------------------------------------------------------
1 | text,class_ann1,class_ann2,class_ann3
2 | Potato,B-VEGETABLE,B-VEGETABLE,B-VEGETABLE
3 | Solanum,B-VEGETABLE,I-VEGETABLE,B-VEGETABLE
4 | tuberosum,I-VEGETABLE,I-VEGETABLE,I-VEGETABLE
5 | is,O,O,O
6 | a,O,O,O
7 | vegetable,B-VEGETABLE,B-VEGETABLE,B-VEGETABLE
8 | Cherry,B-FRUIT,B-FRUIT,B-FRUIT
9 | tomato,I-FRUIT,I-FRUIT,I-FRUIT
10 | is,O,O,O
11 | technically,O,O,O
12 | a,O,O,O
13 | fruit,B-FRUIT,B-VEGETABLE,B-FRUIT
14 | but,O,O,O
15 | few,O,O,O
16 | know,O,O,O
17 | that,O,O,O
--------------------------------------------------------------------------------
/tests/data/mining/eval/iob_punctuation_before.csv:
--------------------------------------------------------------------------------
1 | text,class_ann1,class_ann2,class_ann3
2 | Potato,B-VEGETABLE,B-VEGETABLE,B-VEGETABLE
3 | (,B-VEGETABLE,I-VEGETABLE,I-VEGETABLE
4 | Solanum,I-VEGETABLE,I-VEGETABLE,B-VEGETABLE
5 | tuberosum,I-VEGETABLE,I-VEGETABLE,I-VEGETABLE
6 | ),I-VEGETABLE,O,I-VEGETABLE
7 | is,O,O,O
8 | a,O,O,O
9 | """",B-VEGETABLE,O,B-VEGETABLE
10 | vegetable,I-VEGETABLE,B-VEGETABLE,I-VEGETABLE
11 | """",I-VEGETABLE,I-VEGETABLE,B-FRUIT
12 | .,B-FRUIT,I-VEGETABLE,I-FRUIT
13 | """",I-FRUIT,B-FRUIT,I-FRUIT
14 | Cherry,I-FRUIT,I-FRUIT,I-FRUIT
15 | tomato,I-FRUIT,I-FRUIT,I-FRUIT
16 | """",O,I-FRUIT,O
17 | is,O,O,O
18 | technically,O,O,O
19 | a,O,O,O
20 | fruit,B-FRUIT,B-VEGETABLE,B-FRUIT
21 | ",",I-FRUIT,O,I-FRUIT
22 | but,O,O,O
23 | few,O,O,O
24 | know,O,O,O
25 | that,O,O,O
26 | .,O,O,O
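The `""""` and `","` entries in the `text` column above are CSV-quoted punctuation tokens. A minimal sketch with the standard `csv` module shows how they decode; the two-row `sample` string is an excerpt written for this illustration, not a file read from the repository.

```python
import csv
import io

# Two rows in the same layout as the fixture above (illustrative excerpt).
sample = (
    "text,class_ann1,class_ann2,class_ann3\n"
    '"""",B-VEGETABLE,O,B-VEGETABLE\n'
    '",",I-FRUIT,O,I-FRUIT\n'
)

rows = list(csv.DictReader(io.StringIO(sample)))
tokens = [row["text"] for row in rows]

# A quoted field holding a doubled quote decodes to one double-quote
# character, and a quoted comma decodes to a plain comma.
print(tokens)
```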
--------------------------------------------------------------------------------
/tests/data/mining/request/request.csv:
--------------------------------------------------------------------------------
1 | entity_type,property,property_type,property_value_type,ontology_source
2 | DISEASE,,,,UMLS
3 | CELL_TYPE,,,,
4 | CHEMICAL,,,,UMLS
5 | PROTEIN,,,,UMLS
6 | ORGAN,,,,UMLS
7 | DISEASE,,,,UMLS
8 | CHEMICAL,agonist_of,relation,PROTEIN,UMLS
9 | CHEMICAL,inhibitor_of,relation,PROTEIN,UMLS
10 | CHEMICAL,product_of,relation,PROTEIN,UMLS
11 | ORGANISM,,,,UMLS
12 | CHEMICAL,concentration,attribute,QuantitativeValue,UMLS
13 | DISEASE,is_hereditary,attribute,Boolean,
14 |
--------------------------------------------------------------------------------
/tests/data/pubmed_article_minimal.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 | 123456
4 |
5 |
6 |
7 |
8 | 2020
9 |
10 |
11 |
12 | Article Title
13 |
14 | 012-34
15 |
16 | eng
17 |
18 | Journal Article
19 | MeSH Publication Type
20 |
21 |
22 |
23 | Medline TA
24 |
25 |
26 |
27 |
--------------------------------------------------------------------------------
/tests/data/pubmed_articles.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | 123456
7 |
8 |
9 |
10 |
11 | 2020
12 |
13 |
14 |
15 | Article Title 1
16 |
17 | 012-34
18 |
19 | eng
20 |
21 | Journal Article
22 | MeSH Publication Type
23 |
24 |
25 |
26 | Medline TA 1
27 |
28 |
29 |
30 |
31 |
32 | 789123
33 |
34 |
35 |
36 |
37 | 2021
38 |
39 |
40 |
41 | Article Title 2
42 |
43 | 567-89
44 |
45 | eng
46 |
47 | Journal Article
48 | MeSH Publication Type
49 |
50 |
51 |
52 | Medline TA 2
53 |
54 |
55 |
56 |
57 |
--------------------------------------------------------------------------------
/tests/data/pubmed_download_index.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | Index of /pubmed/updatefiles
5 |
6 |
7 | Index of /pubmed/updatefiles
8 | Name Last modified Size
Parent Directory -
9 | README.txt 2020-12-14 08:15 4.0K
10 | pubmed21n1063.xml.gz 2020-12-14 14:10 67M
11 | pubmed21n1063.xml.gz.md5 2020-12-14 14:10 60
12 | pubmed21n1063_stats.html 2020-12-14 14:10 585
13 | pubmed21n1064.xml.gz 2020-12-14 14:10 53M
14 | pubmed21n1064.xml.gz.md5 2020-12-14 14:10 60
15 | pubmed21n1064_stats.html 2020-12-14 14:10 582
16 | pubmed21n1065.xml.gz 2020-12-14 14:10 12M
17 | pubmed21n1065.xml.gz.md5 2020-12-14 14:10 60
18 | pubmed21n1065_stats.html 2020-12-14 14:10 571
19 | pubmed21n1066.xml.gz 2020-12-15 14:04 64M
20 | pubmed21n1066.xml.gz.md5 2020-12-15 14:04 60
21 | pubmed21n1066_stats.html 2020-12-15 14:04 584
22 | pubmed21n1067.xml.gz 2020-12-15 14:04 7.7M
23 | pubmed21n1067.xml.gz.md5 2020-12-15 14:04 60
24 | pubmed21n1067_stats.html 2020-12-15 14:04 571
25 | pubmed21n1068.xml.gz 2020-12-16 14:02 51M
26 | pubmed21n1068.xml.gz.md5 2020-12-16 14:02 60
27 | pubmed21n1068_stats.html 2020-12-16 14:02 583
28 | pubmed21n1069.xml.gz 2020-12-17 14:02 61M
29 | pubmed21n1069.xml.gz.md5 2020-12-17 14:02 60
30 | pubmed21n1069_stats.html 2020-12-17 14:02 582
31 | pubmed21n1070.xml.gz 2020-12-18 14:04 57M
32 |
33 |
34 |
--------------------------------------------------------------------------------
/tests/unit/database/test_pdf.py:
--------------------------------------------------------------------------------
1 | import re
2 |
3 | import pytest
4 | import requests
5 | import responses
6 |
7 | from bluesearch.database.pdf import grobid_is_alive, grobid_pdf_to_tei_xml
8 |
9 |
10 | @responses.activate
11 | def test_conversion_pdf(monkeypatch):
12 |     """Test PDF conversion"""
13 |
14 |     responses.add(
15 |         responses.POST,
16 |         "http://fake_host:8888/api/processFulltextDocument",
17 |         body="body",
18 |     )
19 |
20 |     result = grobid_pdf_to_tei_xml(b"", host="fake_host", port=8888)
21 |     assert result == "body"
22 |     assert len(responses.calls) == 1
23 |
24 |
25 | @pytest.mark.parametrize(
26 |     ("body", "expected_result"),
27 |     (
28 |         ("true", True),
29 |         (requests.RequestException(), False),
30 |         ("false", False),
31 |         ("unknown", False),
32 |     ),
33 | )
34 | @responses.activate
35 | def test_grobid_is_alive(body, expected_result):
36 |     host = "host"
37 |     port = 12345
38 |     responses.add(
39 |         responses.GET,
40 |         re.compile(rf"http://{host}:{port}/.*"),
41 |         body=body,
42 |     )
43 |     assert grobid_is_alive(host, port) is expected_result
44 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/__init__.py:
--------------------------------------------------------------------------------
1 | """Collection of tests covering entrypoint functionalities."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/database/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BlueBrain/Search/503fdf320a62ab2eb1f9a2a371600e4f1a38df62/tests/unit/entrypoint/database/__init__.py
--------------------------------------------------------------------------------
/tests/unit/entrypoint/database/test_init.py:
--------------------------------------------------------------------------------
1 | import pathlib
2 |
3 | import sqlalchemy
4 |
5 | from bluesearch.entrypoint.database.parent import main
6 | from bluesearch.entrypoint.database.schemas import schema_articles, schema_sentences
7 |
8 |
9 | def test_sqlite(tmpdir):
10 |     tmpdir = pathlib.Path(str(tmpdir))
11 |     db_path = tmpdir / "database.db"
12 |
13 |     args_and_opts = [
14 |         "init",
15 |         str(db_path),
16 |         "--db-type=sqlite",
17 |     ]
18 |
19 |     assert not db_path.exists()
20 |
21 |     main(args_and_opts)
22 |
23 |     assert db_path.exists()
24 |
25 |     engine = sqlalchemy.create_engine(f"sqlite:///{db_path}")
26 |     metadata = sqlalchemy.MetaData(engine)
27 |     metadata.reflect(engine)
28 |     tables = metadata.sorted_tables
29 |
30 |     expected_metadata = sqlalchemy.MetaData()
31 |     schema_articles(expected_metadata)
32 |     schema_sentences(expected_metadata)
33 |     expected_tables = expected_metadata.sorted_tables
34 |
35 |     assert len(tables) == len(expected_tables)
36 |
37 |     for table, expected in zip(tables, expected_tables):
38 |         assert table.compare(expected)
39 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/database/test_parent.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | import logging
4 | import subprocess
5 |
6 | import pytest
7 |
8 | from bluesearch.entrypoint.database.parent import _setup_logging
9 |
10 |
11 | @pytest.mark.parametrize("command", ["add", "convert-pdf", "init", "parse"])
12 | def test_commands_work(command):
13 |     subprocess.run(["bbs_database", command, "--help"], check=True)
14 |
15 |
16 | def test_setup_logging(caplog):
17 |     def get_levels(loggers: dict[str, logging.Logger]) -> dict[str, int]:
18 |         """Get logging level for each logger."""
19 |         return {name: logger.getEffectiveLevel() for name, logger in loggers.items()}
20 |
21 |     caplog.set_level(logging.WARNING, logger="bluesearch")
22 |
23 |     all_loggers = logging.root.manager.loggerDict
24 |     bluesearch_loggers = {
25 |         k: v
26 |         for k, v in all_loggers.items()
27 |         if k.startswith("bluesearch") and isinstance(v, logging.Logger)
28 |     }
29 |     external_loggers = {
30 |         k: v
31 |         for k, v in all_loggers.items()
32 |         if not k.startswith("bluesearch") and isinstance(v, logging.Logger)
33 |     }
34 |
35 |     bluesearch_levels_before = get_levels(bluesearch_loggers)
36 |     external_levels_before = get_levels(external_loggers)
37 |
38 |     _setup_logging(logging.DEBUG)
39 |
40 |     bluesearch_levels_after = get_levels(bluesearch_loggers)
41 |     external_levels_after = get_levels(external_loggers)
42 |
43 |     assert set(bluesearch_levels_before.values()) == {logging.WARNING}
44 |     assert set(bluesearch_levels_after.values()) == {logging.DEBUG}
45 |
46 |     assert external_levels_before == external_levels_after
47 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/test__helper.py:
--------------------------------------------------------------------------------
1 | from __future__ import annotations
2 |
3 | import argparse
4 | from typing import Sequence
5 |
6 | import pytest
7 |
8 | from bluesearch.entrypoint._helper import parse_args_or_environment
9 |
10 |
11 | def test_parse_args_or_environment(monkeypatch):
12 |     parser = argparse.ArgumentParser()
13 |     parser.add_argument("--normal-arg")
14 |     parser.add_argument("--env-arg", default=argparse.SUPPRESS)
15 |     argv_value = "5"
16 |     env_value = "6"
17 |
18 |     # --env-arg not provided at all
19 |     argv: Sequence[str] = []
20 |     env_variable_names: dict[str, str] = {}
21 |     args = parse_args_or_environment(parser, env_variable_names, argv)
22 |     assert "normal_arg" in args.__dict__
23 |     assert "env_arg" not in args.__dict__
24 |
25 |     # --env-arg provided through the CLI
26 |     argv = ["--env-arg", argv_value]
27 |     env_variable_names = {}
28 |     args = parse_args_or_environment(parser, env_variable_names, argv)
29 |     assert "normal_arg" in args.__dict__
30 |     assert "env_arg" in args.__dict__
31 |     assert args.env_arg == argv_value
32 |
33 |     # --env-arg provided through the environment
34 |     argv = []
35 |     environ = {
36 |         "ENV_ARG": env_value,
37 |     }
38 |     monkeypatch.setattr("bluesearch.entrypoint._helper.os.environ", environ)
39 |     env_variable_names = {
40 |         "env_arg": "ENV_ARG",
41 |     }
42 |     args = parse_args_or_environment(parser, env_variable_names, argv)
43 |     assert "normal_arg" in args.__dict__
44 |     assert "env_arg" in args.__dict__
45 |     assert args.env_arg == env_value
46 |
47 |     # Check that CLI arguments take precedence over environment variables
48 |     argv = ["--env-arg", argv_value]
49 |     environ = {
50 |         "ENV_ARG": env_value,
51 |     }
52 |     monkeypatch.setattr("bluesearch.entrypoint._helper.os.environ", environ)
53 |     env_variable_names = {
54 |         "env_arg": "ENV_ARG",
55 |     }
56 |     args = parse_args_or_environment(parser, env_variable_names, argv)
57 |     assert "normal_arg" in args.__dict__
58 |     assert "env_arg" in args.__dict__
59 |     assert args.env_arg == argv_value
60 |
61 |     # Value specified neither through the CLI nor through the environment
62 |     argv = []
63 |     environ = {}
64 |     monkeypatch.setattr("bluesearch.entrypoint._helper.os.environ", environ)
65 |     env_variable_names = {
66 |         "env_arg": "ENV_ARG",
67 |     }
68 |     with pytest.raises(SystemExit) as pytest_wrapped_e:
69 |         parse_args_or_environment(parser, env_variable_names, argv)
70 |     assert pytest_wrapped_e.value.code == 1
71 |
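The behaviour exercised by this test (the CLI value wins, the environment is the fallback, and a value missing from both exits with code 1) can be sketched as follows. This is an illustrative reimplementation inferred from the assertions above, not the actual code of `bluesearch.entrypoint._helper`; the function name `parse_with_env_fallback` and its `environ` parameter are assumptions made for the example.

```python
import argparse
import os
import sys


def parse_with_env_fallback(parser, env_variable_names, argv=None, environ=None):
    """Parse `argv`, then fill missing arguments from environment variables.

    Hypothetical sketch: `env_variable_names` maps an argparse destination
    (e.g. "env_arg") to an environment variable name (e.g. "ENV_ARG").
    """
    environ = os.environ if environ is None else environ
    args = parser.parse_args(argv)
    for arg_name, env_name in env_variable_names.items():
        if arg_name in vars(args):
            continue  # a value given on the command line takes precedence
        if env_name not in environ:
            print(f"Neither the argument {arg_name} nor {env_name} is set",
                  file=sys.stderr)
            sys.exit(1)
        setattr(args, arg_name, environ[env_name])
    return args


parser = argparse.ArgumentParser()
# SUPPRESS keeps the attribute absent unless the option is actually given.
parser.add_argument("--env-arg", default=argparse.SUPPRESS)
names = {"env_arg": "ENV_ARG"}

from_cli = parse_with_env_fallback(parser, names, ["--env-arg", "5"], {"ENV_ARG": "6"})
from_env = parse_with_env_fallback(parser, names, [], {"ENV_ARG": "6"})
print(from_cli.env_arg, from_env.env_arg)
```

Using `argparse.SUPPRESS` as the default is what makes "not provided on the CLI" distinguishable from "provided with a default value".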
--------------------------------------------------------------------------------
/tests/unit/entrypoint/test_embedding_server.py:
--------------------------------------------------------------------------------
1 | """Collection of tests focusing on the `embedding_server` entrypoint."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import pathlib
21 | from unittest.mock import Mock
22 |
23 | from bluesearch.entrypoint.embedding_server import get_embedding_app
24 | from bluesearch.server.embedding_server import EmbeddingServer
25 |
26 |
27 | def test_environment_reading(monkeypatch, tmpdir):
28 |     tmpdir = pathlib.Path(str(tmpdir))
29 |     logfile = tmpdir / "log.txt"
30 |     logfile.touch()
31 |
32 |     fake_embedding_server_inst = Mock(spec=EmbeddingServer)
33 |     fake_embedding_server_class = Mock(return_value=fake_embedding_server_inst)
34 |
35 |     monkeypatch.setattr(
36 |         "bluesearch.server.embedding_server.EmbeddingServer",
37 |         fake_embedding_server_class,
38 |     )
39 |
40 |     # Mock all of our embedding models
41 |     embedding_models = ["SentTransformer"]
42 |
43 |     for model in embedding_models:
44 |         monkeypatch.setattr(f"bluesearch.embedding_models.{model}", Mock())
45 |
46 |     monkeypatch.setenv("BBS_EMBEDDING_LOG_FILE", str(logfile))
47 |
48 |     embedding_app = get_embedding_app()
49 |
50 |     assert embedding_app is fake_embedding_server_inst
51 |
52 |     args, _ = fake_embedding_server_class.call_args
53 |
54 |     assert len(args) == 1
55 |     assert isinstance(args[0], dict)
56 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/test_entrypoint_installation.py:
--------------------------------------------------------------------------------
1 | """Tests covering entrypoint installation."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import subprocess
21 |
22 | import pytest
23 |
24 |
25 | @pytest.mark.parametrize(
26 |     "entrypoint_name",
27 |     [
28 |         "bbs_database",
29 |         "compute_embeddings",
30 |         "create_database",
31 |         "create_mining_cache",
32 |         "embedding_server",
33 |         "mining_server",
34 |         "search_server",
35 |     ],
36 | )
37 | def test_entrypoint(entrypoint_name):
38 |     subprocess.run([entrypoint_name, "--help"], check=True)
39 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/test_mining_server.py:
--------------------------------------------------------------------------------
1 | """Collection of tests focused on the `mining_server`."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import pathlib
21 | from unittest.mock import Mock
22 |
23 | import pytest
24 |
25 | from bluesearch.entrypoint.mining_server import get_mining_app
26 |
27 |
28 | @pytest.mark.parametrize(
29 |     ("db_type", "sqlite_db_exists"),
30 |     (
31 |         ("sqlite", True),
32 |         ("sqlite", False),
33 |         ("mysql", False),
34 |         ("wrong", False),
35 |     ),
36 | )
37 | def test_send_through(
38 |     tmpdir, monkeypatch, db_type, sqlite_db_exists, entity_types, spacy_model_path
39 | ):
40 |     tmpdir = pathlib.Path(str(tmpdir))
41 |     logfile = tmpdir / "log.txt"
42 |     db_path = tmpdir / "something.db"
43 |
44 |     if sqlite_db_exists:
45 |         db_path.parent.mkdir(exist_ok=True, parents=True)
46 |         db_path.touch()
47 |
48 |     monkeypatch.setenv("BBS_MINING_LOG_FILE", str(logfile))
49 |     monkeypatch.setenv("BBS_MINING_DB_TYPE", db_type)
50 |     monkeypatch.setenv("BBS_MINING_DB_URL", str(db_path))
51 |     monkeypatch.setenv("BBS_MINING_MYSQL_USER", "some_user")
52 |     monkeypatch.setenv("BBS_MINING_MYSQL_PASSWORD", "some_pwd")
53 |     monkeypatch.setenv("BBS_DATA_AND_MODELS_DIR", str(spacy_model_path))
54 |
55 |     fake_sqlalchemy = Mock()
56 |     fake_mining_server_inst = Mock()
57 |     fake_mining_server_class = Mock(return_value=fake_mining_server_inst)
58 |
59 |     monkeypatch.setattr(
60 |         "bluesearch.server.mining_server.MiningServer", fake_mining_server_class
61 |     )
62 |     monkeypatch.setattr(
63 |         "bluesearch.entrypoint.mining_server.sqlalchemy", fake_sqlalchemy
64 |     )
65 |
66 |     if db_type not in {"mysql", "sqlite"}:
67 |         with pytest.raises(ValueError):
68 |             get_mining_app()
69 |     else:
70 |         mining_app = get_mining_app()
71 |
72 |         fake_mining_server_class.assert_called_once()
73 |         assert mining_app == fake_mining_server_inst
74 |
75 |         args, kwargs = fake_mining_server_class.call_args
76 |         assert not args
77 |         assert kwargs["connection"] == fake_sqlalchemy.create_engine.return_value
78 |         assert "ee" in kwargs["models_libs"]
79 |         assert isinstance(kwargs["models_libs"]["ee"], dict)
80 |         assert len(kwargs["models_libs"]["ee"]) == len(entity_types)
81 |
--------------------------------------------------------------------------------
/tests/unit/entrypoint/test_search_sever.py:
--------------------------------------------------------------------------------
1 | """Collection of tests focused on "search_server" entrypoint."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import pathlib
21 | from unittest.mock import Mock
22 |
23 | import numpy as np
24 | import pytest
25 |
26 | from bluesearch.entrypoint.search_server import get_search_app
27 | from bluesearch.server.search_server import SearchServer
28 |
29 |
30 | @pytest.mark.parametrize(
31 |     "embeddings_path,models,models_path",
32 |     [
33 |         ("path_1", ["A", "B", "C"], "path_a"),
34 |         ("path_2", ["X", "Y"], "path_b"),
35 |     ],
36 | )
37 | def test_send_through(tmpdir, monkeypatch, embeddings_path, models, models_path):
38 |     tmpdir = pathlib.Path(str(tmpdir))
39 |     logfile = tmpdir / "log.txt"
40 |
41 |     monkeypatch.setenv("BBS_SEARCH_LOG_FILE", str(logfile))
42 |     monkeypatch.setenv("BBS_SEARCH_MODELS_PATH", models_path)
43 |     monkeypatch.setenv("BBS_SEARCH_EMBEDDINGS_PATH", embeddings_path)
44 |     monkeypatch.setenv("BBS_SEARCH_MODELS", ",".join(models))
45 |     monkeypatch.setenv("BBS_SEARCH_DB_URL", "some_url")
46 |     monkeypatch.setenv("BBS_SEARCH_MYSQL_USER", "some_user")
47 |     monkeypatch.setenv("BBS_SEARCH_MYSQL_PASSWORD", "some_pwd")
48 |
49 |     fake_sqlalchemy = Mock()
50 |     fake_H5 = Mock()
51 |     fake_H5.find_populated_rows.return_value = np.arange(1, 11)
52 |     fake_search_server_inst = Mock(spec=SearchServer)
53 |     fake_search_server_class = Mock(return_value=fake_search_server_inst)
54 |
55 |     monkeypatch.setattr(
56 |         "bluesearch.entrypoint.search_server.sqlalchemy", fake_sqlalchemy
57 |     )
58 |     monkeypatch.setattr("bluesearch.utils.H5", fake_H5)
59 |     monkeypatch.setattr(
60 |         "bluesearch.server.search_server.SearchServer", fake_search_server_class
61 |     )
62 |
63 |     server_app = get_search_app()
64 |
65 |     # Checks
66 |     fake_search_server_class.assert_called_once()
67 |     fake_H5.find_populated_rows.assert_called_once()
68 |     fake_sqlalchemy.create_engine.assert_called_once()
69 |
70 |     assert server_app is fake_search_server_inst
71 |
72 |     args, kwargs = fake_search_server_class.call_args
73 |
74 |     assert args[0] == pathlib.Path(models_path)
75 |     assert args[1] == pathlib.Path(embeddings_path)
76 |     np.testing.assert_array_equal(args[2], np.arange(1, 11))
77 |     assert args[3] is fake_sqlalchemy.create_engine.return_value
78 |     assert args[4] == models
79 |
--------------------------------------------------------------------------------
/tests/unit/k8s/test_create_indices.py:
--------------------------------------------------------------------------------
1 | import pytest
2 |
3 | from bluesearch.k8s.create_indices import add_index, remove_index
4 |
5 |
6 | def test_create_and_remove_index(get_es_client):
7 |     client = get_es_client
8 |
9 |     if client is None:
10 |         pytest.skip("Elasticsearch is not available")
11 |
12 |     index = "test_index"
13 |
14 |     add_index(client, index)
15 |     remove_index(client, index)
16 |
--------------------------------------------------------------------------------
/tests/unit/server/__init__.py:
--------------------------------------------------------------------------------
1 | """Collection of tests covering server functionalities."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
--------------------------------------------------------------------------------
/tests/unit/server/test_embedding_server.py:
--------------------------------------------------------------------------------
1 | """Tests covering embedding server"""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | from unittest.mock import Mock
21 |
22 | import numpy as np
23 | import pytest
24 |
25 | from bluesearch.server.embedding_server import EmbeddingServer
26 |
27 |
28 | @pytest.fixture(scope="session")
29 | def embedding_client():
30 |     """Fixture to create a client for embedding_server."""
31 |
32 |     sbiobert = Mock()
33 |     sbiobert.preprocess.return_value = "This is a dummy sentence"
34 |     sbiobert.embed.return_value = np.ones((2,))
35 |     embedding_models = {"sbiobert": sbiobert}
36 |
37 |     embedding_server_app = EmbeddingServer(embedding_models=embedding_models)
38 |     embedding_server_app.config["TESTING"] = True
39 |     with embedding_server_app.test_client() as client:
40 |         yield client
41 |
42 |
43 | class TestEmbeddingServer:
44 |     def test_embedding_server_help(self, embedding_client):
45 |         response = embedding_client.post("/help")
46 |         assert response.status_code == 200
47 |         assert response.json["name"] == "EmbeddingServer"
48 |
49 |     def test_embedding_server_welcome(self, embedding_client):
50 |         response = embedding_client.get("/")
51 |         assert response.status_code == 200
52 |         response = embedding_client.post("/")
53 |         assert response.status_code == 405
54 |
55 |     def test_embedding_server_embed(self, embedding_client):
56 |         request_json = {"model": "sbiobert", "text": "hello"}
57 |         response = embedding_client.post("/v1/embed/json", json=request_json)
58 |         assert response.status_code == 200
59 |
60 |         request_json = {"model": "sbiobert"}
61 |         response = embedding_client.post("/v1/embed/json", json=request_json)
62 |         assert response.status_code == 400
63 |
64 |         request_json = {"model": "sbiobert", "text": "hello"}
65 |         response = embedding_client.post("/v1/embed/csv", json=request_json)
66 |         assert response.status_code == 200
67 |
68 |         request_json = {"model": "invalid_model", "text": "hello"}
69 |         response = embedding_client.post("/v1/embed/csv", json=request_json)
70 |         assert response.status_code == 400
71 |
72 |         response = embedding_client.post("/v1/embed/csv", data="not json")
73 |         assert response.status_code == 400
74 |
75 |         response = embedding_client.post("/v1/embed/invalid_format", data="not json")
76 |         assert response.status_code == 400
77 |
--------------------------------------------------------------------------------
/tests/unit/server/test_search_server.py:
--------------------------------------------------------------------------------
1 | """Tests covering the search server."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | from unittest.mock import Mock
21 |
22 | import numpy as np
23 | import pytest
24 |
25 | from bluesearch.server.search_server import SearchServer
26 | from bluesearch.utils import H5
27 |
28 |
29 | @pytest.fixture
30 | def search_client(
31 |     monkeypatch, embeddings_h5_path, fake_sqlalchemy_engine, test_parameters
32 | ):
33 |     """Fixture to create a client for search_server."""
34 |
35 |     fake_embedding_model = Mock()
36 |     fake_embedding_model.preprocess.return_value = "hello"
37 |     fake_embedding_model.embed.return_value = np.ones(
38 |         (test_parameters["embedding_size"],)
39 |     )
40 |
41 |     monkeypatch.setattr(
42 |         "bluesearch.server.search_server.get_embedding_model",
43 |         lambda *args, **kwargs: fake_embedding_model,
44 |     )
45 |
46 |     indices = H5.find_populated_rows(embeddings_h5_path, "SBioBERT")
47 |
48 |     search_server_app = SearchServer(
49 |         trained_models_path="",
50 |         embeddings_h5_path=embeddings_h5_path,
51 |         indices=indices,
52 |         connection=fake_sqlalchemy_engine,
53 |         models=["SBioBERT"],
54 |     )
55 |     search_server_app.config["TESTING"] = True
56 |     with search_server_app.test_client() as client:
57 |         yield client
58 |
59 |
60 | class TestSearchServer:
61 |     def test_search_server(self, search_client):
62 |         # Test the help request
63 |         response = search_client.post("/help")
64 |         assert response.status_code == 200
65 |         assert response.json["name"] == "SearchServer"
66 |
67 |         # Test a valid JSON request
68 |         k = 3
69 |         request_json = {"which_model": "SBioBERT", "k": k, "query_text": "hello"}
70 |         response = search_client.post("/", json=request_json)
71 |         assert response.status_code == 200
72 |         json_response = response.json
73 |         assert len(json_response["sentence_ids"]) == k
74 |         assert len(json_response["similarities"]) == k
75 |
76 |         # Test a non-JSON request
77 |         response = search_client.post("/", data="data is not a json")
78 |         assert response.status_code == 200
79 |         json_response = response.json
80 |         assert json_response["sentence_ids"] is None
81 |         assert json_response["similarities"] is None
82 |
--------------------------------------------------------------------------------
/tests/unit/test_fixtures.py:
--------------------------------------------------------------------------------
1 | """Collection of tests that make sure that fixtures are set up correctly.
2 |
3 | Notes
4 | -----
5 | The internals of fixtures might vary based on how conftest.py sets them up.
6 | The goal of these tests is to run simple sanity checks rather than detailed
7 | bookkeeping.
8 | """
9 |
10 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
11 | #
12 | # Copyright (C) 2020 Blue Brain Project, EPFL.
13 | #
14 | # This program is free software: you can redistribute it and/or modify
15 | # it under the terms of the GNU Lesser General Public License as published by
16 | # the Free Software Foundation, either version 3 of the License, or
17 | # (at your option) any later version.
18 | #
19 | # This program is distributed in the hope that it will be useful,
20 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
21 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
22 | # GNU Lesser General Public License for more details.
23 | #
24 | # You should have received a copy of the GNU Lesser General Public License
25 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
26 |
27 | import pandas as pd
28 | import pytest
29 | import sqlalchemy
30 | from sqlalchemy.exc import OperationalError, ProgrammingError
31 |
32 |
33 | def test_database(fake_sqlalchemy_engine, backend_database):
34 |     """Make sure the database tables are set up correctly."""
35 |     inspector = sqlalchemy.inspect(fake_sqlalchemy_engine)
36 |
37 |     for table_name in ["articles", "sentences", "mining_cache"]:
38 |         res = pd.read_sql("SELECT * FROM {}".format(table_name), fake_sqlalchemy_engine)
39 |
40 |         if table_name != "articles":
41 |             # MySQL reports 2 indexes on the sentences table (the article_id
42 |             # index plus the UNIQUE constraint), whereas SQLite reports only
43 |             # 1 (the article_id index).
44 |             assert len(inspector.get_indexes(table_name)) >= 1
45 |
46 |         assert len(res) > 0
47 |     if backend_database == "sqlite":
48 |         with pytest.raises(OperationalError):
49 |             fake_sqlalchemy_engine.execute("SELECT * FROM fake_table").all()
50 |     else:
51 |         with pytest.raises(ProgrammingError):
52 |             fake_sqlalchemy_engine.execute("SELECT * FROM fake_table").all()
53 |
54 |
55 | def test_h5(embeddings_h5_path):
56 |     assert embeddings_h5_path.is_file()
57 |
58 |
59 | def test_metadata(metadata_path):
60 |     """Make sure the metadata CSV is correct."""
61 |     df = pd.read_csv(str(metadata_path))
62 |
63 |     assert len(df) > 0
64 |
65 |
66 | def test_jsons(jsons_path):
67 |     """Make sure all JSON files are present."""
68 |     n_json_files = len(list(jsons_path.rglob("*.json")))
69 |
70 |     assert n_json_files > 0
71 |
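The comment in `test_database` notes that the two backends count indexes on the sentences table differently. The sketch below illustrates the SQLite side with the standard-library `sqlite3` module only. The table layout, including the `position` column and the index name, is a hypothetical simplification and not the project's actual schema. SQLite itself creates an automatic index for a UNIQUE constraint, and SQLAlchemy's SQLite inspector is understood to leave such automatic indexes out of `get_indexes` by default, which would explain the single index mentioned in the comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: one UNIQUE constraint plus one explicit index.
conn.execute(
    """
    CREATE TABLE sentences (
        sentence_id INTEGER PRIMARY KEY,
        article_id INTEGER,
        position INTEGER,
        UNIQUE (article_id, position)
    )
    """
)
conn.execute("CREATE INDEX article_id_index ON sentences (article_id)")

# Each row is (seq, name, unique, origin, partial).
# origin is "c" for CREATE INDEX and "u" for a UNIQUE constraint.
index_rows = conn.execute("PRAGMA index_list('sentences')").fetchall()
explicit = [row[1] for row in index_rows if row[3] == "c"]
implicit = [row[1] for row in index_rows if row[3] == "u"]
conn.close()
```

Since the number of reported indexes depends on the backend and on how the inspector filters them, asserting `>= 1` rather than an exact count keeps the test portable.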
--------------------------------------------------------------------------------
/tests/unit/widgets/test_mining_schema.py:
--------------------------------------------------------------------------------
1 | """Tests covering the MiningSchema class."""
2 |
3 | # Blue Brain Search is a text mining toolbox focused on scientific use cases.
4 | #
5 | # Copyright (C) 2020 Blue Brain Project, EPFL.
6 | #
7 | # This program is free software: you can redistribute it and/or modify
8 | # it under the terms of the GNU Lesser General Public License as published by
9 | # the Free Software Foundation, either version 3 of the License, or
10 | # (at your option) any later version.
11 | #
12 | # This program is distributed in the hope that it will be useful,
13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
15 | # GNU Lesser General Public License for more details.
16 | #
17 | # You should have received a copy of the GNU Lesser General Public License
18 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
19 |
20 | import pytest
21 |
22 | from bluesearch.widgets.mining_schema import MiningSchema
23 |
24 |
25 | def test_add_entity():
26 |     mining_schema = MiningSchema()
27 |
28 |     # Test adding entities
29 |     mining_schema.add_entity(
30 |         "CHEMICAL",
31 |         property_name="isChiral",
32 |         property_type="ATTRIBUTE",
33 |         property_value_type="BOOLEAN",
34 |         ontology_source="NCIT",
35 |     )
36 |     mining_schema.add_entity("DRUG")
37 |     assert len(mining_schema.schema_df) == 2
38 |
39 |     # Test warning upon adding a duplicate entity
40 |     with pytest.warns(UserWarning, match=r"already exists"):
41 |         mining_schema.add_entity("DRUG")
42 |
43 |
44 | def test_df(mining_schema_df):
45 |     # We won't be testing for duplicates in this test
46 |     mining_schema_df = mining_schema_df.drop_duplicates(ignore_index=True)
47 |
48 |     # Test adding from a dataframe
49 |     mining_schema = MiningSchema()
50 |     mining_schema.add_from_df(mining_schema_df)
51 |     # Make sure a copy is returned
52 |     assert mining_schema.df is not mining_schema.schema_df
53 |     # Check that all data was added
54 |     assert mining_schema.df.equals(mining_schema_df)
55 |
56 |     # Test missing entity_type
57 |     wrong_schema_df = mining_schema_df.drop("entity_type", axis=1)
58 |     mining_schema = MiningSchema()
59 |     with pytest.raises(ValueError, match=r"entity_type.* not found"):
60 |         mining_schema.add_from_df(wrong_schema_df)
61 |
62 |     # Test ignoring unknown columns
63 |     schema_df_new = mining_schema_df.drop_duplicates().copy()
64 |     schema_df_new["unknown_column"] = list(range(len(schema_df_new)))
65 |     mining_schema = MiningSchema()
66 |     with pytest.warns(UserWarning, match=r"column.* unknown_column"):
67 |         mining_schema.add_from_df(schema_df_new)
68 |     # Check that all data was added and the unknown column was ignored
69 |     assert mining_schema.df.equals(mining_schema_df)
70 |
--------------------------------------------------------------------------------