├── .flake8
├── .github
│   └── workflows
│       └── config.yaml
├── .gitignore
├── Dockerfile
├── Makefile
├── README.md
├── request_samples.json
├── requirements.txt
├── response_samples.json
├── setup.cfg
├── src
│   ├── README.md
│   ├── __init__.py
│   ├── app_main.py
│   ├── data_modules
│   │   ├── __init__.py
│   │   ├── get_all_data.py
│   │   ├── pg_ops.py
│   │   ├── process_csv_turkish_data.py
│   │   └── url_redirect.py
│   ├── eval_modules
│   │   ├── __init__.py
│   │   └── eval.py
│   ├── main.py
│   └── ml_modules
│       ├── __init__.py
│       ├── base_classifier.py
│       ├── bert_classifier.py
│       ├── gpt3_classifier.py
│       ├── intent_config
│       │   ├── GIYSI.txt
│       │   ├── KURTARMA.txt
│       │   ├── README.md
│       │   ├── YEMEK-SU.txt
│       │   └── __init__.py
│       ├── rule_based_clustering.py
│       └── run_zsc.py
└── tests
    ├── __init__.py
    └── ml
        ├── __init__.py
        └── classifiers
            ├── __init__.py
            └── test_models.py

--------------------------------------------------------------------------------
/.flake8:
--------------------------------------------------------------------------------
[flake8]
ignore = E741,W504,C812,W605,E266

--------------------------------------------------------------------------------
/.github/workflows/config.yaml:
--------------------------------------------------------------------------------
name: Python package

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9"]

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          if [ -f requirements.txt ]; then pip3 install -r requirements.txt; fi

      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names
          flake8 .

      - name: Test with pytest
        run: |
          pytest

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.env
indenv
intent-classification-v0/sample_data.csv
.vscode
# MacOS directory specific files
.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.pyc
intent-classification-v0/__pycache__

--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
# Use an official Python runtime as the base image
FROM python:3.9-buster

# Set the working directory to /app
WORKDIR /app

# Install required packages
RUN pip3 install --upgrade pip
COPY requirements.txt .
RUN pip3 install --upgrade-strategy=only-if-needed -r requirements.txt

# Make Python output unbuffered so server logs show up immediately
ENV PYTHONUNBUFFERED=1

# Expose port 8000 for incoming traffic
EXPOSE 8000

# Copy the current directory contents into the container at /app
# This should be at the end of the Dockerfile to take advantage of layer caching
COPY . /app

# Run the command to start the FastAPI server
CMD [ "python3", "src/app_main.py" ]

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
build-lint-test:
	docker build -t deprem-intent-classification .
	docker run --rm -it deprem-intent-classification sh -c "flake8 && pytest"

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Adding a new classifier

1. Create a new class under `ml_modules` that extends `BaseClassifier` (a minimal sketch follows this list).
2. Update `requirements.txt` if needed.
3. Import the new classifier in `tests/ml/classifiers/test_models.py` and update `setUpClass`.
4. Make sure the tests pass with `make build-lint-test`.
5. Push the new branch and open a PR.
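For step 1, a minimal sketch of what such a class might look like; the class name and keyword logic here are hypothetical, not part of the repo:

```python
# Hypothetical src/ml_modules/my_classifier.py -- illustrative only.
from typing import List

from src.ml_modules.base_classifier import BaseClassifier


class MyClassifier(BaseClassifier):
    """Toy classifier: flags every tweet that mentions 'yardim'."""

    def classify(self, text: str) -> List[str]:
        return ["KURTARMA"] if "yardim" in text.lower() else []

    def all_intents(self) -> List[str]:
        return ["KURTARMA"]
```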

# Intent Classification on Earthquake Data

**Project goal**: Labeling and analysis of earthquake-related data collected from Twitter. So far, intent classification has been done with a keyword-based model and a zero-shot classification model. The next goal is to build datasets with the help of the data teams and to cluster the tweets so they can be routed to the relevant aid units. For example: a tweet stating a need for food should be routed to the teams handling food transfers rather than to search and rescue.

The #depremadres data is available on Trello: 130k tweets.

**To do**

1. Analyze the earthquake data we already have.

   1. We have 2,232 tweets from the #depremadres hashtag.
   2. We have 130k earthquake-related tweets.
   3. Possible preprocessing improvements.
   4. Build the label list that best represents the data.
   5. Data exploration.

2. Plan what new data can be collected.
3. Get in touch with the data team to have the collected data annotated.
4. Design the supervised classification problem.

   1. Is a given tweet asking for help, or is it, for example, a political tweet?
   2. In which context a help tweet is asking for help: rubble rescue, food aid, medicine aid, and so on.
   3. More models are possible; try whatever can be developed based on the data analyses.

5. Identify the models that can be built around the problem.
6. Develop training and testing for the models.
7. Analyze the results.
8. Get in touch with the PM team to plan how the models can be used and integrated.

--------------------------------------------------------------------------------
/request_samples.json:
--------------------------------------------------------------------------------
[
  {
    "Full_text": "#depremadres @AFADBaskanlik @ahbap Deliklitaş Cd. No:106 https://t.co/egoOXKNALA ACİL!Sütçü imam mahallesi Deliklitaş caddesi hünkar apt no 106/05 Dulkadiroğlu Kahramanmaraş Serkan kol Şenay kol hacı kol enkaz altında canlı!",
    "Tweet_id": "1623120030782173184",
    "Geo_loc": "https://maps.app.goo.gl/Du2XcdvqPVorr35w8",
    "Rules": [
      "enkaz",
      "altında",
      "Kahramanmaraş"
    ]
  },
  {
    "Full_text": "#Adıyaman Merkez Alitaşı Mahallesi 1241 sokak No: 25/A ÇADIR YOK, ERZAK YOK İHTİYAÇLAR KARŞILANMIYOR ACİL YARDIM ❗❗#depremadıyaman #YardımCağrısı #TurkiyeDeprem #earthquake #babalatv #PrayForTurkey #depremAdres @DepremDairesi @BabalaTv @haluklevent @ahbap @ahbapacil https://t.co/REYSYHL5X7",
    "Tweet_id": "1623225686830743552",
    "Geo_loc": "none",
    "Rules": [
      "erzak",
      "ihtiyaç",
      "yardım"
    ]
  }
]

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiokafka==0.8.0
anyio==3.6.2
async-timeout==4.0.2
certifi==2022.12.7
charset-normalizer==3.0.1
click==8.1.3
colorama==0.4.6
contourpy==1.0.7
cycler==0.11.0
fastapi==0.90.1
fonttools==4.38.0
h11==0.14.0
httptools==0.5.0
idna==3.4
kafka-python==2.0.2
kiwisolver==1.4.4
matplotlib==3.6.3
packaging==23.0
pandas
Pillow==9.4.0
psycopg2==2.9.5
pydantic==1.10.4
pyparsing==3.0.9
python-dateutil==2.8.2
python-dotenv==0.21.1
pytz==2022.7.1
PyYAML==6.0
requests==2.28.2
six==1.16.0
sniffio==1.3.0
starlette==0.23.1
tqdm==4.64.1
typing_extensions==4.4.0
Unidecode==1.3.6
urllib3==1.26.14
uvicorn==0.20.0
watchfiles==0.18.1
websockets==10.4
numpy
# linting & test packages
pytest==7.2.1
pycodestyle==2.5.0
autopep8==1.4.4
autoflake==1.3
flake8-commas==2.0.0
flake8==3.7.8

--------------------------------------------------------------------------------
/response_samples.json:
--------------------------------------------------------------------------------
{
  "Tweet_id": "1623225686830743552",
  "Rules": [
    "erzak",
    "ihtiyaç",
    "yardım"
  ],
  "Full_text": "#Adıyaman Merkez Alitaşı Mahallesi 1241 sokak No: 25/A ÇADIR YOK, ERZAK YOK İHTİYAÇLAR KARŞILANMIYOR ACİL YARDIM #depremadıyaman #YardımCağrısı #TurkiyeDeprem #earthquake #babalatv #PrayForTurkey #depremAdres @DepremDairesi @BabalaTv @haluklevent @ahbap @ahbapacil https://t.co/REYSYHL5X7",
  "data": "KURTARMA,YEMEK-SU"
}

--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[flake8]
max-line-length=100

--------------------------------------------------------------------------------
/src/README.md:
--------------------------------------------------------------------------------
# deprem-intent-classification-v0

A simple rule-based codebase developed for intent analysis of tweets posted to the #depremadres hashtag on Twitter.

We develop it with a team gathered on the Discord channel of the [AcikKaynak](https://github.com/acikkaynak) organization.

Packages required to run the project:

psycopg2, dotenv, pandas, matplotlib

To run the project:

Credentials are required to access the Postgres table.
Get in touch with someone involved in running the project to obtain the Postgres access credentials.

```
pip install -r requirements.txt
python main.py
```


# Quality performance
Current performance of `RuleBasedClassifier` is as follows (but take the stats with a grain of salt, because the eval set is not perfect and doesn't have multilabels):

```
Intent: KURTARMA
==================================
KURTARMA Precision: 0.73
KURTARMA Recall: 0.87
KURTARMA F1: 0.79

Intent: YEMEK-SU
==================================
YEMEK-SU Precision: 0.44
YEMEK-SU Recall: 0.80
YEMEK-SU F1: 0.57

Intent: GIYSI
==================================
GIYSI Precision: 0.32
GIYSI Recall: 0.78
GIYSI F1: 0.45
```
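# Example request

A minimal client sketch against the FastAPI service started by `src/app_main.py`. This assumes the server is running locally on port 8000; the tweet text and the returned intent are illustrative:

```python
import requests

resp = requests.post(
    "http://localhost:8000/get_intents/",
    json={"text": "Enkaz altındayım lütfen yardım edin"},
)
print(resp.json())  # e.g. {"intents": ["KURTARMA"]}
```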
--------------------------------------------------------------------------------
/src/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/src/__init__.py

--------------------------------------------------------------------------------
/src/app_main.py:
--------------------------------------------------------------------------------
import argparse
from typing import List

import uvicorn
from aiokafka import AIOKafkaConsumer
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# ML modules
from src.ml_modules.bert_classifier import BertClassifier
from src.ml_modules.rule_based_clustering import RuleBasedClassifier

# import ml_modules.run_zsc as zsc

# Define command line arguments to control which classifiers to run.
parser = argparse.ArgumentParser()
parser.add_argument('--run_rule_based_classifier',
                    action=argparse.BooleanOptionalAction, default=True)
parser.add_argument('--run_bert_classifier',
                    action=argparse.BooleanOptionalAction, default=True)
args = parser.parse_args()

# Initialize classifiers
rule_based_classifier = None
if args.run_rule_based_classifier:
    rule_based_classifier = RuleBasedClassifier()

bert_classifier = None
if args.run_bert_classifier:
    bert_classifier = BertClassifier()

# Initialize fastapi.
app = FastAPI()


# Data models.
class Request(BaseModel):
    text: str


class Response(BaseModel):
    intents: List[str]


@app.post("/get_intents/")
async def get_intents(item: Request) -> Response:
    if not item.text:
        raise HTTPException(status_code=400, detail="Bad request, no text")

    try:
        intents = []

        if args.run_rule_based_classifier:
            assert rule_based_classifier
            intents.extend(rule_based_classifier.classify(item.text))

        if args.run_bert_classifier:
            assert bert_classifier
            intents.extend(bert_classifier.classify(item.text))

        # Remove duplicates.
        intents = list(set(intents))
        # Wrap in the response model so FastAPI can validate and serialize it
        # (returning the bare list would not match the declared Response type).
        return Response(intents=intents)

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


async def start_kafka_consumer(app):
    consumer = AIOKafkaConsumer(
        "your_topic",
        bootstrap_servers="localhost:9092",
        group_id="your_group_id",
        auto_offset_reset="earliest"
    )

    # Start consuming
    await consumer.start()

    try:
        # Poll for new messages
        async for msg in consumer:
            print(f"Consumed message: {msg.value}")
    finally:
        # Close the consumer
        await consumer.stop()


if __name__ == '__main__':
    # app.add_event_handler("startup", start_kafka_consumer)
    uvicorn.run(app, host="0.0.0.0", port=8000)
--------------------------------------------------------------------------------
/src/data_modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/src/data_modules/__init__.py

--------------------------------------------------------------------------------
/src/data_modules/get_all_data.py:
--------------------------------------------------------------------------------
import csv

from src.data_modules import pg_ops

conn = pg_ops.connect_to_db()

data = pg_ops.get_data(conn, 'tweets_depremaddress', [
    'id', 'full_text', 'tweet_id', 'geo_link'], '1=1')

# function to write data to csv file


def write_to_csv(data, filename):
    with open(filename, 'w', newline='') as csvfile:
        # creating a csv writer object
        csvwriter = csv.writer(csvfile)
        # writing the fields
        csvwriter.writerow(['id', 'full_text', 'tweet_id', 'geo_loc'])
        # writing the data rows
        csvwriter.writerows(data)

# write_to_csv(data, 'data.csv')

--------------------------------------------------------------------------------
/src/data_modules/pg_ops.py:
--------------------------------------------------------------------------------
import os
from typing import Any, List, Tuple

import psycopg2
from dotenv import load_dotenv


def get_data(conn: psycopg2.extensions.connection, table_name: str, column_names: List[str],
             condition: str) -> List[Tuple]:
    """ Get data from the table

    Args:
        conn: PostgreSQL connection object
        table_name: Table name to get data
        column_names: Name of the columns to select
        condition: WHERE condition

    Returns:
        Query results
    """
    cur = conn.cursor()
    query = "SELECT {} FROM {} WHERE {}".format(', '.join(column_names), table_name, condition)
    # query for filtering data with multiple clauses
    # query = "SELECT {} FROM {} WHERE is_done = True AND intent_result = ''".format(', '.join(column_names), table_name)  # noqa
    cur.execute(query)
    return cur.fetchall()


def update_data(conn: psycopg2.extensions.connection, table_name: str, column_name: str,
                new_value: Any, condition: str) -> None:
    """ Update data in the table

    Args:
        conn: PostgreSQL connection object
        table_name: Table name to update data
        column_name: Name of the column to update
        new_value: New value for the column
        condition: WHERE condition
    """

    cur = conn.cursor()
    query = "UPDATE {} SET {} = '{}' WHERE {}".format(table_name, column_name, new_value, condition)
    cur.execute(query)
    conn.commit()


def connect_to_db():
    """ Connect to the database

    Returns:
        PostgreSQL connection object
    """
    # Load environment variables
    load_dotenv(".env")

    host = os.getenv('HOST')
    database = os.getenv('DATABASE')
    user = os.getenv('USERNAME')
    password = os.getenv('PASSWORD')

    # Connect to the database
    conn = psycopg2.connect(
        host=host,
        database=database,
        user=user,
        password=password
    )
    return conn
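
# Illustrative sketch, not called anywhere in the repo: the helpers above
# build SQL with str.format, which is only safe for trusted, internal input.
# psycopg2 can bind values safely with %s placeholders; only values can be
# parameterized this way, not table/column names. The table and columns below
# come from get_all_data.py and the commented query above, and are otherwise
# hypothetical.
def update_intent_result_example(conn: psycopg2.extensions.connection,
                                 new_value: str, tweet_id: str) -> None:
    cur = conn.cursor()
    cur.execute(
        "UPDATE tweets_depremaddress SET intent_result = %s WHERE tweet_id = %s",
        (new_value, tweet_id),
    )
    conn.commit()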
--------------------------------------------------------------------------------
/src/data_modules/process_csv_turkish_data.py:
--------------------------------------------------------------------------------
import re

import pandas as pd
# from unicodedata import normalize
from unidecode import unidecode


def get_data(file_name):
    df = pd.read_csv(file_name, header=None)
    return df

# write pd dataframe to csv


def write_to_csv(df, file_name):
    df.to_csv(file_name, index=False)


def check_regex(full_text):
    full_text = re.sub(r'[^a-z\s]+', '', full_text, flags=re.IGNORECASE)
    return full_text


def remove_diacritics(text):
    # define the mapping from diacritic characters to non-diacritic characters
    mapping = {
        '\u00c7': 'C', '\u00e7': 'c',
        '\u011e': 'G', '\u011f': 'g',
        '\u0130': 'I', '\u0131': 'i',
        '\u015e': 'S', '\u015f': 's',
        '\u00d6': 'O', '\u00f6': 'o',
        '\u00dc': 'U', '\u00fc': 'u',
        '\u0152': 'OE', '\u0153': 'oe',
    }

    # replace each diacritic character with its non-diacritic counterpart
    text = ''.join(mapping.get(c, c) for c in text)

    return text


if __name__ == "__main__":
    df = get_data("deprem_convert_csv/v10.csv")
    # df.drop(index=df.index[1], inplace=True)  # drop first row
    print(df.head())
    for i in df.columns:
        df[i] = df[i].apply(lambda x: unidecode(x) if isinstance(x, str) else x)
    print(df.head())
    write_to_csv(df, "deprem_converted_csv/v10.csv")

--------------------------------------------------------------------------------
/src/data_modules/url_redirect.py:
--------------------------------------------------------------------------------
from typing import List, Union

import requests
from requests import Response


def chase_redirects(url: Union[List, str]) -> List[str]:
    _urls = []

    def _chase(url: str) -> str:
        resp: Response = requests.head(url)
        if 300 <= resp.status_code < 400:
            return resp.headers["location"]
        # Not a redirect: return the URL unchanged instead of None.
        return url

    if isinstance(url, list):
        for each in url:
            _urls.append(_chase(each))

    if isinstance(url, str):
        _urls.append(_chase(url))

    return _urls


# url = ["https://t.co/xRhOZeNJLe", "https://t.co/Dy33xpoxAV"]
# ret = chase_redirects(url)
# print(ret)

--------------------------------------------------------------------------------
/src/eval_modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/src/eval_modules/__init__.py

--------------------------------------------------------------------------------
/src/eval_modules/eval.py:
--------------------------------------------------------------------------------
"""Eval.

Using the eval.csv file, evaluate the performance of the classifier for each intent.

Usage:
    python3 -m src.eval_modules.eval

"""

import argparse

import pandas as pd

from src.ml_modules.bert_classifier import BertClassifier
from src.ml_modules.rule_based_clustering import (RuleBasedClassifier,
                                                  preprocess_tweet)

# Define command line arguments to control which classifiers to run.
parser = argparse.ArgumentParser()

# CSV file containing the eval data.
parser.add_argument('--eval_file', type=str, default="../data/eval.csv")

# Number of entries to use for evaluation. 0 means use all entries.
parser.add_argument('--max_num_entries', type=int, default=0)

parser.add_argument('--run_rule_based_classifier',
                    action=argparse.BooleanOptionalAction, default=True)
parser.add_argument('--run_bert_classifier',
                    action=argparse.BooleanOptionalAction, default=False)
args = parser.parse_args()


class ClassificationEval(object):
    def __init__(self, eval_frame, classifier_instance):
        self.classifier = classifier_instance
        self.eval_frame = eval_frame

    def __eval_fn(self, arg):
        def err():
            print(f"No function named 'classify()' found in {self.classifier}")
        func = getattr(self.classifier, 'classify', err)
        # TODO find a more robust way of getting at most one item and/or handling multiclass eval.
        return func(arg)

    def __all_intents_fn(self):
        def err():
            print(
                f"No function named 'all_intents()' found in {self.classifier}")
        func = getattr(self.classifier, 'all_intents', err)
        return func()

    def __prep_eval_frame(self, df):
        # Fill NaNs with empty strings to include no intent tweets.
        df = df.fillna("")
        df = df[df['label'].notna()]
        df = df[df['tweet_text'].notna()]
        df['tweet_text'] = df['tweet_text'].apply(preprocess_tweet)
        # Keep only the needed columns.
        df = df[['tweet_text', 'label']]
        # One hot encode the labels.
        for intent in self.__all_intents_fn():
            df[f'{intent}_golden'] = df['label'] == intent
        return df

    def __prep_classification_frame(self, df):
        """Only using tweet_text, return a one hot encoded prediction frame"""
        all_intents = self.__all_intents_fn()
        # classify() returns a list of predicted intents.
        df['predicted'] = df['tweet_text'].apply(self.__eval_fn)

        # Create a one hot encoded frame.
        for intent in all_intents:
            df[f'{intent}_pred'] = df['predicted'].apply(lambda x: intent in x)

        del df['predicted']
        return df

    def eval(self):
        df = self.__prep_eval_frame(self.eval_frame)
        df = self.__prep_classification_frame(df)

        for intent in self.__all_intents_fn():
            # Calculate false positives.
            df[f'{intent}_fp'] = df.apply(
                lambda x: not x[f'{intent}_golden'] and x[f'{intent}_pred'], axis=1)
            # Calculate false negatives.
            df[f'{intent}_fn'] = df.apply(
                lambda x: x[f'{intent}_golden'] and not x[f'{intent}_pred'], axis=1)
            # Calculate true positives.
            df[f'{intent}_tp'] = df.apply(
                lambda x: x[f'{intent}_golden'] and x[f'{intent}_pred'], axis=1)

        # Calculate metrics for each intent.
        for intent in self.__all_intents_fn():
            print(f"Intent: {intent}")
            print("==================================")
            # Calculate precision.
            precision = df[f'{intent}_tp'].sum(
            ) / (df[f'{intent}_tp'].sum() + df[f'{intent}_fp'].sum())
            print(f"{intent} Precision: {precision:.2f}")
            # Calculate recall.
            recall = df[f'{intent}_tp'].sum(
            ) / (df[f'{intent}_tp'].sum() + df[f'{intent}_fn'].sum())
            print(f"{intent} Recall: {recall:.2f}")
            # Calculate F1 score.
            f1 = 2 * (precision * recall) / (precision + recall)
            print(f"{intent} F1: {f1:.2f}")
            print("")

        return df


if __name__ == '__main__':
    eval_frame = pd.read_csv(args.eval_file)
    if args.max_num_entries:
        # Take the first N rows; plain indexing would select a column instead.
        eval_frame = eval_frame.head(args.max_num_entries)

    if args.run_rule_based_classifier:
        evaluator = ClassificationEval(eval_frame, RuleBasedClassifier())
        evaluator.eval()

    if args.run_bert_classifier:
        evaluator = ClassificationEval(eval_frame, BertClassifier())
        evaluator.eval()

--------------------------------------------------------------------------------
/src/main.py:
--------------------------------------------------------------------------------
"""
Command line version of the intent classification app.

Usage:
    python3 -m src.main --run_rule_based_classifier --run_bert_classifier --text='Yardim'
"""
import argparse

# ML modules
from src.ml_modules.rule_based_clustering import RuleBasedClassifier
from src.ml_modules.bert_classifier import BertClassifier

# Define command line arguments to control which classifiers to run.
parser = argparse.ArgumentParser()
parser.add_argument('--text', type=str, required=True)
parser.add_argument('--run_rule_based_classifier',
                    action=argparse.BooleanOptionalAction, default=True)
parser.add_argument('--run_bert_classifier',
                    action=argparse.BooleanOptionalAction, default=True)
args = parser.parse_args()

# Initialize classifiers
rule_based_classifier = None
if args.run_rule_based_classifier:
    rule_based_classifier = RuleBasedClassifier()

bert_classifier = None
if args.run_bert_classifier:
    bert_classifier = BertClassifier()


def run_classifiers(text):
    intents = []

    if args.run_rule_based_classifier:
        assert rule_based_classifier
        intents.extend(rule_based_classifier.classify(text))

    if args.run_bert_classifier:
        assert bert_classifier
        intents.extend(bert_classifier.classify(text))

    # Remove duplicates.
    intents = list(set(intents))
    return intents


if __name__ == '__main__':
    intents = run_classifiers(args.text)
    print(intents)
--------------------------------------------------------------------------------
/src/ml_modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/src/ml_modules/__init__.py

--------------------------------------------------------------------------------
/src/ml_modules/base_classifier.py:
--------------------------------------------------------------------------------
from abc import ABC, abstractmethod
from typing import List


class BaseClassifier(ABC):

    @abstractmethod
    def classify(self, text: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def all_intents(self) -> List[str]:
        raise NotImplementedError

--------------------------------------------------------------------------------
/src/ml_modules/bert_classifier.py:
--------------------------------------------------------------------------------
import logging
import os
from typing import List

import requests
from dotenv import load_dotenv

from src.ml_modules.base_classifier import BaseClassifier

load_dotenv(".env")

API_TOKEN = os.getenv("HF_HUB_TOKEN")
MODEL_NAME = "deprem-ml/multilabel_earthquake_tweet_intent_bert_base_turkish_cased"

# Add logging.
logging.basicConfig(level=logging.INFO)


class BertClassifier(BaseClassifier):
    """
    BERT-based classifier that uses the Hugging Face inference API to classify tweets.

    Example usage:
    >>> classifier = BertClassifier()
    >>> classifier.classify("Kırıkkale'de deprem oldu.")
    ["KURTARMA"]
    """

    def __init__(self, classification_threshold=0.5):
        # The score threshold to deem a label as positive.
        self.classification_threshold = classification_threshold

        self.headers = {"Authorization": f"Bearer {API_TOKEN}"}
        self.api_url = f"https://api-inference.huggingface.co/models/{MODEL_NAME}"

    def __query(self, text):
        response = requests.post(
            self.api_url, headers=self.headers, json={"inputs": text})

        # raise HTTPException if status code != 200
        response.raise_for_status()
        return response.json()

    def all_intents(self):
        """
        Returns list of all possible intents this classifier can classify.
        """
        return ["ALAKASIZ", "KURTARMA", "BARINMA", "SAGLIK", "LOJISTIK", "SU",
                "YAGMA", "YEMEK", "GIYSI", "ELEKTRONIK"]

    @property
    def none_label(self):
        return "ALAKASIZ"

    def classify(self, text: str) -> List[str]:
        """
        Classify the given text with the hosted BERT model.
        Args:
            text: The text to check.

        Returns:
            List of labels of the tweet, if any, ordered by score.
        """
        response = self.__query(text)
        logging.info(f"Response: {response}")
        labels = []

        if response:
            # Labels are returned as a list of lists.
            labels_and_scores = response[0]
            labels = [l_and_s["label"].upper()
                      for l_and_s in labels_and_scores
                      if l_and_s["score"] > self.classification_threshold
                      ]

        # if ALAKASIZ and another category survive the threshold filter,
        # remove ALAKASIZ (list.remove removes by value; list.pop expects
        # an index).
        if self.none_label in labels and len(labels) > 1:
            labels.remove(self.none_label)

        # Don't return set, as it will lose ordering.
        return labels


# If name main
if __name__ == "__main__":
    text = "Kırıkkale'de deprem oldu."
    labels = BertClassifier().classify(text)
    print(labels)
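
# For reference: classify() above expects the inference API to return a
# list-of-lists payload, matching the mocked responses in
# tests/ml/classifiers/test_models.py. The labels and scores below are
# illustrative, not real model output.
_EXAMPLE_RESPONSE = [
    [
        {"label": "KURTARMA", "score": 0.91},
        {"label": "ALAKASIZ", "score": 0.03},
    ]
]
# With the default classification_threshold of 0.5, a payload like this
# would yield ["KURTARMA"].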
--------------------------------------------------------------------------------
/src/ml_modules/gpt3_classifier.py:
--------------------------------------------------------------------------------
from typing import List

from src.ml_modules.base_classifier import BaseClassifier


class GPT3Classifier(BaseClassifier):

    def classify(self, text: str) -> List[str]:
        raise NotImplementedError

    def all_intents(self) -> List[str]:
        raise NotImplementedError

--------------------------------------------------------------------------------
/src/ml_modules/intent_config/GIYSI.txt:
--------------------------------------------------------------------------------
giysi talebi
giysi
battaniye
yagmurluk
kazak
corap
soguk
bot
isitici
cadir
hava
camasir
pijama
soguktan
yatak
sisme
bez
bezi
bebek bezi
soba
hijyen
temizlik
temizlik malzemesi
basortu
hijyen paketi
kar
hipotermi
donmak
yorgan

--------------------------------------------------------------------------------
/src/ml_modules/intent_config/KURTARMA.txt:
--------------------------------------------------------------------------------
enkaz
enkaz altinda ses
yardim
altinda
enkaz
gocuk
bina
YARDIM
acil
kat
ACIL
altindalar
enkazaltindayim
yardim
alinamiyor
Enkaz
yardimci
ENKAZ
saatlerdir
destek
altinda
enkazda
kurtarma
kurtarma calismasi
kurtarma talebi
ulasilamayan kisiler
ses
vinc
eskavator
projektor
sesler
kurtarilmayi
yaşında
blok
altında
apartmanı
Sitesi
ailesi
göçük
acilvinc
sesi
altındalar
Doktor

--------------------------------------------------------------------------------
/src/ml_modules/intent_config/README.md:
--------------------------------------------------------------------------------
# Intent config

Here we keep the configuration of the intents to classify.

`RuleBasedClassifier` in `rule_based_clustering.py` reads this directory and builds the intent-to-keyword mapping dynamically.

See the `RuleBasedClassifier` pydoc for detailed usage.

The configuration files are as follows:

- **file name**: `<INTENT-NAME>.txt`
- **file content**: a line-separated list of the keywords belonging to that intent

### Example
File name:

```
YEMEK-SU.txt
```

content:

```
gida talebi
gida
yemek
su
corba
...
```
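
As a sketch of how one of these keyword lines turns into a matcher; this mirrors the pattern construction in `RuleBasedClassifier`, and the keyword is just an example line from `YEMEK-SU.txt`:

```python
import re

keyword = "gida talebi"
# Same pattern shape RuleBasedClassifier compiles: the keyword must appear
# as a whole word/phrase, matched case-insensitively.
pattern = re.compile(f"(^|\\W){keyword}($|\\W)", re.IGNORECASE)

assert pattern.search("Acil gida talebi var")
assert not pattern.search("gidatalebine yardim")  # not a standalone phrase
```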
--------------------------------------------------------------------------------
/src/ml_modules/intent_config/YEMEK-SU.txt:
--------------------------------------------------------------------------------
gida talebi
gida
yemek
su
corba
yiyecek
icecek
acliktan
erzak

--------------------------------------------------------------------------------
/src/ml_modules/intent_config/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/src/ml_modules/intent_config/__init__.py

--------------------------------------------------------------------------------
/src/ml_modules/rule_based_clustering.py:
--------------------------------------------------------------------------------
import os
import re
from typing import Dict, List, Optional, Set, Tuple, Union

import matplotlib.pyplot as plt
import pandas as pd
from unidecode import unidecode

from src.ml_modules.base_classifier import BaseClassifier

# Directory of this file.
CUR_DIR = os.path.dirname(os.path.realpath(__file__))

# Directory that has .txt config files.
INTENT_CONFIG_DIR = os.path.join(CUR_DIR, "intent_config")


class RuleBasedClassifier(BaseClassifier):
    """
    Rule based classifier that uses regex patterns to classify tweets.

    It will read all the .txt files in the intent_config directory and use the
    file name as the intent name and the file contents as the keywords.

    The keywords are used to compile regex patterns that are used to classify
    the tweets.

    Example usage:
    >>> classifier = RuleBasedClassifier()
    >>> classifier.classify("Yardim edin")
    ["KURTARMA"]
    """

    def __init__(self, intent_config_dir=INTENT_CONFIG_DIR):
        self.intent_to_keywords = {}  # Will be loaded below.
        self.intent_to_patterns = {}  # Will be loaded below.
        self.__load_intent_configs(intent_config_dir)
        self.__compile_keywords_to_patterns()

    def __load_intent_configs(self, intent_config_dir):
        configs = [f for f in os.listdir(
            intent_config_dir) if f.endswith(".txt")]
        self.intent_to_config = {os.path.splitext(c)[0]: c for c in configs}
        for intent, config in self.intent_to_config.items():
            with open(os.path.join(intent_config_dir, config), "r") as f:
                self.intent_to_keywords[intent] = f.read().splitlines()

    def __compile_keywords_to_patterns(self):
        for intent, keywords in self.intent_to_keywords.items():
            self.intent_to_patterns[intent] = [re.compile(
                f"(^|\\W){k}($|\\W)", re.IGNORECASE) for k in keywords]

    def all_intents(self):
        """
        Returns list of all possible intents this classifier can classify.
        """
        return self.intent_to_patterns.keys()

    def classify(self, text: str) -> List[str]:
        """
        Check if the given text contains any of the keywords of any intent.
        Args:
            text: The text to check.

        Returns:
            List of labels of the tweet, if any (deduplicated).
        """
        intents = []
        for intent, patterns in self.intent_to_patterns.items():
            for pattern in patterns:
                if re.search(pattern, text):
                    intents.append(intent)
                    break  # No need to check other patterns for this intent.
        return list(set(intents))

def get_data(file_name):
    df = pd.read_csv(file_name)
    return df


def preprocess_tweet(text: str) -> str:
    """
    Preprocess the given text before inference.

    Right now only converts diacritics to ascii versions (turkish letters).
    Args:
        text: The text to normalize.

    Returns:
        Normalized text.
    """
    return unidecode(text)


def process_tweet(classifier, tweet: Tuple, plot_data: Dict) -> Tuple[Optional[Set[str]], Dict]:
    """
    Process the given tweet.
    Check if the tweet contains any of the keywords using rules.
    If it does, update the plot data and assign labels to the tweet.

    Args:
        tweet: The tweet to process. tweet[1] -> full_text
        plot_data: The plot data to update.

    Returns:
        The labels of the tweet. If the tweet does not contain any of the keywords, return None.
    """

    # normalize text to english characters
    tweet_normalized = preprocess_tweet(tweet[1])  # tweet[1] -> full_text

    # check if tweet contains any of the keywords
    labels = classifier.classify(tweet_normalized)
    if not labels:
        return None, plot_data

    plot_data = update_plot_data(plot_data, labels)
    return labels, plot_data


def process_tweet_stream(df):
    classifier = RuleBasedClassifier()
    plot_data = {}
    db_ready_data_list = []
    for _, row in df.iterrows():
        # process_tweet returns (labels, plot_data); the old code mistakenly
        # passed the classifier to list.append instead of to process_tweet.
        labels, plot_data = process_tweet(classifier, row, plot_data)
        db_ready_data_list.append(labels)
    return db_ready_data_list, plot_data


def update_plot_data(plot_data: Dict, labels: Union[Set[str], List[str]]) -> Dict:
    """
    Increment the count of the given labels in the plot data.
    Args:
        plot_data: The dictionary that holds INTENT - COUNT pairs.
        labels: The labels to increment the count of.

    Returns:
        The updated plot data.
    """
    for label in labels:
        if label in plot_data:
            plot_data[label] += 1
        else:
            plot_data[label] = 1
    return plot_data


def draw_plot(plot_data: Dict):
    """ Draw the plot with the given plot data.
    It draws label count of the tweets.

    Args:
        plot_data: The plot data to draw the plot with.

    Returns:
        None
    """
    plt.bar(plot_data.keys(), plot_data.values())
    plt.xlabel("Cluster Label")
    plt.ylabel("Tweet Count")
    plt.title("Tweet Count per Cluster Label")
    plt.show()


if __name__ == '__main__':
    data = get_data('sample_data.csv')
    processed_data, plot_data = process_tweet_stream(data)
    draw_plot(plot_data)
--------------------------------------------------------------------------------
/src/ml_modules/run_zsc.py:
--------------------------------------------------------------------------------
import json
import os
import time
from typing import Dict, Optional

import requests
from dotenv import load_dotenv

load_dotenv(".env")

API_TOKEN = os.getenv("HF_HUB_TOKEN")

headers = {"Authorization": f"Bearer {API_TOKEN}"}

API_URL = "https://api-inference.huggingface.co/models/emrecan/convbert-base-turkish-mc4-cased-allnli_tr"  # noqa


def query(payload: Dict) -> Optional[Dict]:
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)

    # Check status code
    if response.status_code != 200:
        print(
            f"Query: {payload} failed to run by returning code of {response.status_code}. Response: {response.text} ")  # noqa
        print("Trying again in 10 seconds...")
        # Wait 10 seconds and try again
        time.sleep(10)
        response = requests.request("POST", API_URL, headers=headers, data=data)
        if response.status_code != 200:
            return None

    return response.json()


def batch_query(data, candidate_labels):
    """
    Takes any iterable containing texts (e.g. a list).

    Parameters
    ----------
    data : Iterable
        Iterable containing the texts.
    candidate_labels : List
        The topics to classify into.

    Returns
    -------
    outputs
        List of JSON outputs; the JSON keys are sequence (the original input),
        labels (the predicted classes) and scores (the probability with which
        each class was predicted).
        # TODO: build a fallback mechanism based on the probabilities.
    """
    outputs = []
    if not candidate_labels:
        candidate_labels = ["battaniye", "yemek", "göçük"]
    for tweet in data:
        outputs.append(query(
            {
                "inputs": tweet,
                "parameters": {"candidate_labels": candidate_labels},
            }))
    return outputs
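
# Minimal usage sketch (assumes HF_HUB_TOKEN is set in .env; the tweets and
# candidate labels below are illustrative, not taken from the repo):
if __name__ == "__main__":
    example_tweets = ["Enkaz altinda sesler var", "Battaniye ve su lazim"]
    for output in batch_query(example_tweets, ["göçük", "battaniye", "yemek"]):
        print(output)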
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/tests/__init__.py

--------------------------------------------------------------------------------
/tests/ml/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/tests/ml/__init__.py

--------------------------------------------------------------------------------
/tests/ml/classifiers/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/acikyazilimagi/depremadres-intent-classification-v0/19b1c3e12fabc775f5999c3d01a10251e22a3d5c/tests/ml/classifiers/__init__.py

--------------------------------------------------------------------------------
/tests/ml/classifiers/test_models.py:
--------------------------------------------------------------------------------
from unittest import TestCase
from unittest.mock import Mock, patch

from src.ml_modules.bert_classifier import BertClassifier
from src.ml_modules.rule_based_clustering import RuleBasedClassifier


class TestClassifiers(TestCase):

    @classmethod
    def setUpClass(cls):
        cls.classifiers = [BertClassifier(), RuleBasedClassifier()]

    def test_classify_empty_case(self):
        with patch('src.ml_modules.bert_classifier.requests.post') as mock_post:
            mock_post.return_value = Mock()
            mock_post.return_value.json.return_value = [
                [
                    {
                        "label": "foo",
                        "score": 0.0,
                    },
                    {
                        "label": "bar",
                        "score": 0.0,
                    },
                ]
            ]
            mock_post.return_value.status_code = 200

            for classifier in self.classifiers:
                result = classifier.classify("")
                assert isinstance(result, list)
                assert len(result) == 0

    def test_classify_base(self):
        with patch('src.ml_modules.bert_classifier.requests.post') as mock_post:
            mock_post.return_value = Mock()
            mock_post.return_value.json.return_value = [
                [
                    {
                        "label": "foo",
                        "score": 0.7,
                    },
                    {
                        "label": "bar",
                        "score": 0.4,
                    },
                ]
            ]
            mock_post.return_value.status_code = 200

            for classifier in self.classifiers:
                result = classifier.classify(
                    "Enkaz altındayım lütfen yardım edin")
                assert isinstance(result, list)
                assert len(result) > 0
                for el in result:
                    assert isinstance(el, str)

--------------------------------------------------------------------------------