├── images
│   ├── allimagesHERE.md
│   ├── logo.png
│   └── banner.png
├── requirements.txt
├── model
│   └── yourGGUFhere.md
├── singleExtract.py
├── README.md
├── example_2BD1Q_log.txt
├── stapp.py
├── example_IME7D_log.txt
└── instrucitons.txt
/images/allimagesHERE.md:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/logo.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/requirements.txt
--------------------------------------------------------------------------------
/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/banner.png
--------------------------------------------------------------------------------
/model/yourGGUFhere.md:
--------------------------------------------------------------------------------
1 | ```
2 | mkdir model
3 | cd model
4 | wget https://huggingface.co/Felladrin/gguf-NuExtract-tiny/resolve/main/NuExtract-tiny.gguf -O NuExtract-tiny.gguf
5 | ```
6 | Note: `-O` is wget's output flag; on Windows PowerShell use `Invoke-WebRequest <url> -OutFile NuExtract-tiny.gguf` instead.
--------------------------------------------------------------------------------
/singleExtract.py:
--------------------------------------------------------------------------------
1 | 
2 | # https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract
3 | # https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion
4 | # 
https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction
5 | # https://huggingface.co/numind/NuExtract-tiny
6 | # GGUF repo https://huggingface.co/Felladrin/gguf-NuExtract-tiny
7 | # https://www.geeksforgeeks.org/json-loads-in-python/
8 | 
9 | import json
10 | import datetime
11 | import warnings
12 | warnings.filterwarnings(action='ignore')
13 | from llama_cpp import Llama
14 | from rich.console import Console
15 | console = Console(width=90)
16 | 
17 | print('loading NuExtract-tiny.gguf with LlamaCPP...')
18 | nCTX = 12000
19 | sTOPS = ['<|end-output|>']
20 | # generation settings (temperature, stop, max_tokens) are passed to create_completion below
21 | client = Llama(
22 |     model_path='model/NuExtract-tiny.gguf',
23 |     #n_gpu_layers=0,
24 |     n_ctx=nCTX,
25 |     verbose=False,
26 | )
27 | print('Done...')
28 | 
29 | prompt = """<|input|>\n### Template:
30 | {
31 |     "Car": {
32 |         "Name": "",
33 |         "Manufacturer": "",
34 |         "Designers": [],
35 |         "Number of units produced": ""
36 |     }
37 | }
38 | ### Text:
39 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
40 | 
41 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
42 | 
43 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
44 | <|output|>
45 | """
46 | 
47 | start = datetime.datetime.now()
48 | output = client.create_completion(
49 |     prompt=prompt,
50 |     temperature=0.1,
51 |     repeat_penalty=1.11,
52 |     stop=sTOPS,
53 |     max_tokens=500,
54 |     stream=False)
55 | delta = datetime.datetime.now() - start
56 | result = output['choices'][0]['text']
57 | console.print(result)
58 | console.print('---')
59 | console.print(f'Completed in {delta}')
60 | 
61 | # NuExtract usually emits double-quoted JSON already; this replace() only
62 | # rescues single-quoted output and breaks if the extracted text itself
63 | # contains an apostrophe (e.g. "Pomigliano d'Arco")
64 | adapter = result.replace("'", '"')
65 | final = json.loads(adapter)
66 | console.print('---')
67 | console.print(final)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AI-ExtractData-NuExtract-tiny
2 | NuExtract-tiny GGUF for data extraction in JSON format
3 | 
4 | 
5 | 
6 | 
7 | ### How to go from a JSON template and a source text to extracted JSON
8 | 
9 | ##### JSON schema template:
10 | ```
11 | {
12 | 
    "Car": {
13 |         "Name": "",
14 |         "Manufacturer": "",
15 |         "Designers": [],
16 |         "Number of units produced": ""
17 |     }
18 | }
19 | ```
20 | ##### original text
21 | ```
22 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, 
23 | introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured 
24 | through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license 
25 | to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
26 | 
27 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car 
28 | of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under 
29 | the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande 
30 | Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
31 | 
32 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 
33 | 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] 
34 | During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, 
35 | the Panda was named after Empanda, the Roman goddess and patroness of travelers. 
36 | 
37 | ```
38 | 
39 | ##### extracted JSON:
40 | ```
41 | {
42 |     'Car': {
43 |         'Name': 'Fiat Panda',
44 |         'Manufacturer': 'Fiat',
45 |         'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani'],
46 |         'Number of units produced': '7.8 million'
47 |     }
48 | }
49 | ```
50 | 
51 | ### Create venv and install packages
52 | ```
53 | python -m venv venv
54 | venv\Scripts\activate
55 | 
56 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0
57 | ```
58 | 
59 | ### Download the GGUF file
60 | Download `NuExtract-tiny.gguf` (fp16 quantization) from Hugging Face into the `model` subfolder.
61 | 
62 | GGUF repo: [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny)
63 | 
64 | 
65 | 
66 | Original model card: https://huggingface.co/numind/NuExtract-tiny
67 | ```
68 | NuExtract-tiny is a version of Qwen1.5-0.5B, fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.
69 | 
70 | Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.
71 | 
72 | Note: While this model provides good zero-shot performance, it is intended to be fine-tuned on a specific task (>=30 examples). 
73 | ``` 74 | 75 | Run everything with 76 | ``` 77 | python singleExtract.py 78 | ``` 79 | 80 | To run Streamlit interface, after downloading all images in the `images` subfolder: 81 | ``` 82 | streamlit run stapp.py 83 | ``` 84 | 85 | ### Hyperparameters 86 | ``` 87 | temperature=0.1, 88 | repeat_penalty= 1.11, 89 | stop=['<|end-output|>'], 90 | ``` 91 | 92 | --- 93 | 94 | ### Additional resources 95 | 96 | - [https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract](https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract) 97 | - [https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion) 98 | - [https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction](https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction) 99 | - [https://huggingface.co/numind/NuExtract-tiny](https://huggingface.co/numind/NuExtract-tiny) 100 | - GGUF repo [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny) 101 | - [https://www.geeksforgeeks.org/json-loads-in-python/](https://www.geeksforgeeks.org/json-loads-in-python/) 102 | 103 | 104 | --- 105 | 106 | 107 | 108 | -------------------------------------------------------------------------------- /example_2BD1Q_log.txt: -------------------------------------------------------------------------------- 1 | 2024-08-05 17:57:04.227098 2 | 3 | Your own LocalGPT with 🌀 NuExtract-tiny 4 | --- 5 | 🧠🫡: You are a helpful assistant. 6 | 🌀: How may I help you today? 
7 | ✨: <|input|> 8 | ### Template: 9 | { 10 | "Applicant": { 11 | "Name": "", 12 | "Surname": "", 13 | "email address":"", 14 | "phone number":"", 15 | "Date of Birth":"", 16 | "Nationality":"", 17 | "Country":"", 18 | "Languages": [] 19 | } 20 | } 21 | 22 | 23 | { 24 | "Applicant": { 25 | "Name": "", 26 | "Surname": "", 27 | "email address":"", 28 | "phone number":"", 29 | "Date of Birth":"" 30 | } 31 | } 32 | 33 | 34 | 35 | 36 | ### Text: 37 | CURRICULUM VITAE Name Fabio ElMassry 38 | Position Instrument & Control Engineer 39 | Date of Birth 05/04/1986 40 | Nationality Egyptian 41 | Status Married 42 | Languages English/Italian/Arabic 43 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 44 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 45 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 46 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 47 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 48 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 49 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 50 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 51 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 52 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 53 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 54 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 55 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 56 | graphic configuration, PLC programming and configuration. 57 | x Troubleshooting system and process alarms using DCS Honeywell 58 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 59 | <|output|> 60 | 61 | 🌀: { 62 | "Applicant":{ 63 | "Name":"Fabio ElMassry" 64 | "Surname":"" 65 | "email address":"fabioElMassry@yahoo.fr" 66 | "phone number":"+39685737365" 67 | "Date of Birth":"05/04/1986" 68 | } 69 | } 70 | 71 | --- 72 | 73 | { 74 | "Details": { 75 | "Nationality": "", 76 | "Country": "", 77 | "Languages": [] 78 | } 79 | } 80 | 81 | CURRICULUM VITAE Name Fabio ElMassry 82 | Position Instrument & Control Engineer 83 | Date of Birth 05/04/1986 84 | Nationality Egyptian 85 | Status Married 86 | Languages English/Italian/Arabic 87 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 88 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 89 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 90 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, 
Hysys.General: 91 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 92 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 93 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 94 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting & 95 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 96 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 97 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 98 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 99 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 100 | graphic configuration, PLC programming and configuration. 101 | x Troubleshooting system and process alarms using DCS Honeywell 102 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 
103 | 
104 | 
105 | {
106 | "Details":{
107 | "Nationality":"Egyptian"
108 | "Country":""
109 | "Languages":[
110 | 0:"English"
111 | 1:"Italian"
112 | 2:"Arabic"
113 | ]
114 | }
115 | }
116 | 
117 | 
118 | 
--------------------------------------------------------------------------------
/stapp.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from llama_cpp import Llama
3 | import warnings
4 | warnings.filterwarnings(action='ignore')
5 | import datetime
6 | import random
7 | import string
8 | from time import sleep
9 | import tiktoken
10 | import json
11 | 
12 | # for counting the tokens in the prompt and in the result
13 | #context_count = len(encoding.encode(yourtext))
14 | encoding = tiktoken.get_encoding("r50k_base")
15 | 
16 | nCTX = 12000
17 | sTOPS = ['<|end-output|>']
18 | modelname = "NuExtract-tiny"
19 | # Set the webpage title
20 | st.set_page_config(
21 |     page_title=f"Your LocalGPT ✨ with {modelname}",
22 |     page_icon="🌟",
23 |     layout="wide")
24 | 
25 | if "hf_model" not in st.session_state:
26 |     st.session_state.hf_model = "NuExtract-tiny"
27 | # Initialize chat history
28 | if "messages" not in st.session_state:
29 |     st.session_state.messages = []
30 | 
31 | if "repeat" not in st.session_state:
32 |     st.session_state.repeat = 1.35
33 | 
34 | if "temperature" not in st.session_state:
35 |     st.session_state.temperature = 0.1
36 | 
37 | if "maxlength" not in st.session_state:
38 |     st.session_state.maxlength = 500
39 | 
40 | if "speed" not in st.session_state:
41 |     st.session_state.speed = 0.0
42 | 
43 | def writehistory(filename,text):
44 |     with open(filename, 'a', encoding='utf-8') as f:
45 |         f.write(text)
46 |         f.write('\n')
47 | 
48 | 
49 | def genRANstring(n):
50 |     """
51 |     n = int number of char to randomize
52 |     """
53 |     N = n
54 |     res = ''.join(random.choices(string.ascii_uppercase +
55 |                                  string.digits, k=N))
56 |     return res
57 | 
58 | @st.cache_resource
59 | def create_chat():
60 |     # load the local NuExtract-tiny GGUF once; st.cache_resource keeps it across reruns
61 |     print('loading NuExtract-tiny.gguf with LlamaCPP...')
62 |     client = Llama(
63 |         model_path='model/NuExtract-tiny.gguf',
64 |         temperature=0.24,
65 |         n_ctx=nCTX,
66 |         max_tokens=600,
67 |         repeat_penalty=1.176,
68 |         stop=sTOPS,
69 |         verbose=False,
70 |     )
71 | 
72 |     return client
73 | 
74 | 
75 | # create the session states
76 | if "logfilename" not in st.session_state:
77 |     ## Logger file
78 |     logfile = f'{genRANstring(5)}_log.txt'
79 |     st.session_state.logfilename = logfile
80 |     #Write in the history the first 2 sessions
81 |     writehistory(st.session_state.logfilename,f'{str(datetime.datetime.now())}\n\nYour own LocalGPT with 🌀 {modelname}\n---\n🧠🫡: You are a helpful assistant.')
82 |     writehistory(st.session_state.logfilename,f'🌀: How may I help you today?')
83 | 
84 | 
85 | 
86 | ### START STREAMLIT UI
87 | # Create a header element
88 | st.image('images/banner.png',use_column_width=True)
89 | mytitle = f'> *Extract data with {modelname} into `JSON` format*'
90 | st.markdown(mytitle, unsafe_allow_html=True)
91 | #st.markdown('> Local Chat ')
92 | #st.markdown('---')
93 | 
94 | # CREATE THE SIDEBAR
95 | with st.sidebar:
96 |     st.image('images/logo.png', use_column_width=True)
97 |     st.session_state.temperature = st.slider('Temperature:', min_value=0.0, max_value=1.0, value=0.1, step=0.01)
98 |     st.session_state.maxlength = st.slider('Length reply:', min_value=150, max_value=2000,
99 |                                            value=500, step=50)
100 |     st.session_state.repeat = st.slider('Repeat Penalty:', min_value=0.0, max_value=2.0, value=1.11, step=0.02)
101 |     st.markdown(f"**Logfile**: {st.session_state.logfilename}")
102 |     statspeed = st.markdown(f'💫 speed: {st.session_state.speed} t/s')
103 |     btnClear = st.button("Clear History",type="primary", use_container_width=True)
104 | 
105 | llm = create_chat()
106 | 
107 | st.session_state.jsonformat = st.text_area('JSON Schema to be applied', value="", height=150,
108 |         placeholder='paste your JSON schema here', disabled=False, 
label_visibility="visible")
109 | st.session_state.origintext = st.text_area('Source Document', value="", height=150,
110 |         placeholder='paste your source text here', disabled=False, label_visibility="visible")
111 | extract_btn = st.button("Extract Data",type="primary", use_container_width=False)
112 | st.markdown('---')
113 | st.session_state.extractedJSON = st.empty()
114 | st.session_state.onlyJSON = st.empty()
115 | 
116 | 
117 | 
118 | if extract_btn:
119 |     prompt = f"""<|input|>\n### Template:
120 | {st.session_state.jsonformat}
121 | 
122 | ### Text:
123 | {st.session_state.origintext}
124 | <|output|>
125 | """
126 |     print(prompt)
127 |     with st.spinner("Thinking..."):
128 |         start = datetime.datetime.now()
129 |         output = llm.create_completion(
130 |             prompt=prompt,
131 |             temperature=st.session_state.temperature,   # wired to the sidebar slider
132 |             repeat_penalty=st.session_state.repeat,     # wired to the sidebar slider
133 |             stop=sTOPS,
134 |             max_tokens=st.session_state.maxlength,      # wired to the sidebar slider
135 |             stream=False)
136 | 
137 |         delta = datetime.datetime.now() - start
138 |         result = output['choices'][0]['text']
139 |         st.write(result)
140 |         adapter = result.replace("'",'"')  # fragile: fails if the extracted text contains an apostrophe
141 |         final = json.loads(adapter)
142 |         totalTokens = len(encoding.encode(prompt))+len(encoding.encode(result))
143 |         totalseconds = delta.total_seconds()
144 |         st.session_state.speed = totalTokens/totalseconds
145 |         statspeed.markdown(f'💫 speed: {st.session_state.speed:.2f} t/s')
146 |         totalstring = f"""GENERATED STRING
147 | 
148 | {result}
149 | ---
150 | 
151 | Generated in {delta}
152 | 
153 | ---
154 | 
155 | JSON FORMAT:
156 | """
157 |         with st.session_state.extractedJSON:
158 |             st.markdown(totalstring)
159 |         st.session_state.onlyJSON.json(final)
160 |         writehistory(st.session_state.logfilename,f'✨: {prompt}')
161 |         writehistory(st.session_state.logfilename,f'🌀: {result}')
162 |         writehistory(st.session_state.logfilename,f'---\n\n')
163 | 
164 | 
--------------------------------------------------------------------------------
/example_IME7D_log.txt:
--------------------------------------------------------------------------------
1 | 2024-08-06 
07:30:44.882140 2 | 3 | Your own LocalGPT with 🌀 NuExtract-tiny 4 | --- 5 | 🧠🫡: You are a helpful assistant. 6 | 🌀: How may I help you today? 7 | 8 | --- 9 | 10 | 11 | ✨: <|input|> 12 | ### Template: 13 | { 14 | "Applicant": { 15 | "Name": "", 16 | "Surname": "", 17 | "email address":"", 18 | "phone number":"", 19 | "Date of Birth":"" 20 | } 21 | } 22 | 23 | ### Text: 24 | CURRICULUM VITAE Name Fabio ElMassry 25 | Position Instrument & Control Engineer 26 | Date of Birth 05/04/1986 27 | Nationality Egyptian 28 | Status Married 29 | Languages English/Italian/Arabic 30 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 31 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 32 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 33 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 34 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 35 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 36 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 37 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 38 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 39 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 40 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 41 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 42 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 43 | graphic configuration, PLC programming and configuration. 44 | x Troubleshooting system and process alarms using DCS Honeywell 45 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 46 | <|output|> 47 | 48 | 🌀: { 49 | "Applicant": { 50 | "Name": "Fabio ElMassry", 51 | "Surname": "", 52 | "email address":"fabioElMassry@yahoo.fr", 53 | "phone number":"+39685737365", 54 | "Date of Birth": "05/04/1986" 55 | } 56 | } 57 | 58 | --- 59 | 60 | 61 | ✨: <|input|> 62 | ### Template: 63 | { 64 | "Details": { 65 | "Nationality": "", 66 | "Country": "", 67 | "Languages":"", 68 | "phone number": [] 69 | } 70 | } 71 | 72 | ### Text: 73 | CURRICULUM VITAE Name Fabio ElMassry 74 | Position Instrument & Control Engineer 75 | Date of Birth 05/04/1986 76 | Nationality Egyptian 77 | Status Married 78 | Languages English/Italian/Arabic 79 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 80 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 81 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & 
Microsoft Windows, Microsoft Office pack, 82 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 83 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 84 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 85 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 86 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting & 87 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 88 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 89 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 90 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 91 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 92 | graphic configuration, PLC programming and configuration. 93 | x Troubleshooting system and process alarms using DCS Honeywell 94 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 
95 | <|output|> 96 | 97 | 🌀: { 98 | "Details": { 99 | "Nationality": "Egyptian", 100 | "Country": "", 101 | "Languages": [ 102 | "English", 103 | "Italian", 104 | "Arabic" 105 | ], 106 | "phone number": [ 107 | "+39685737365" 108 | ] 109 | } 110 | } 111 | 112 | --- 113 | 114 | 115 | ✨: <|input|> 116 | ### Template: 117 | { 118 | "Details": { 119 | "Nationality": "", 120 | "Country": "", 121 | "Languages": [] 122 | } 123 | } 124 | 125 | ### Text: 126 | CURRICULUM VITAE Name Fabio ElMassry 127 | Position Instrument & Control Engineer 128 | Date of Birth 05/04/1986 129 | Nationality Egyptian 130 | Status Married 131 | Languages English/Italian/Arabic 132 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 133 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 134 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 135 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 136 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 137 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 138 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 139 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 140 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 141 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 142 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 143 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 144 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 145 | graphic configuration, PLC programming and configuration. 146 | x Troubleshooting system and process alarms using DCS Honeywell 147 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 148 | <|output|> 149 | 150 | 🌀: { 151 | "Details": { 152 | "Nationality": "Egyptian", 153 | "Country": "", 154 | "Languages": [ 155 | "English", 156 | "Italian", 157 | "Arabic" 158 | ] 159 | } 160 | } 161 | 162 | --- 163 | 164 | 165 | -------------------------------------------------------------------------------- /instrucitons.txt: -------------------------------------------------------------------------------- 1 | Felladrin/gguf-NuExtract-tiny 2 | https://huggingface.co/Felladrin/gguf-NuExtract-tiny/tree/main 3 | 4 | Original Repo 5 | https://huggingface.co/numind/NuExtract-tiny 6 | 7 | 8 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0 huggingface-hub 9 | pip install langchain langchain-community faiss-cpu duckduckgo-search newspaper3k 10 | pip install pymupdf4llm strip_markdown 11 | 12 | ``` 13 | >>> import strip_markdown 14 | >>> 15 | >>> TXT: str = strip_markdown.strip_markdown(MD: str) 16 | ``` 17 | 18 | 19 | 20 | MODEL CARD 21 | NuExtract_tiny is a version of Qwen1.5-0.5, fine-tuned on a private high-quality synthetic dataset for information extraction. 
To use the model, provide an input text (shorter than 2000 tokens) and a JSON template describing the information you need to extract.

Note: this model is purely extractive, so every piece of text it outputs is present verbatim in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.

Note: while this model provides good zero-shot performance, it is intended to be fine-tuned on a specific task (>= 30 examples).

We also provide a base (3.8B) and a large (7B) version of this model: NuExtract and NuExtract-large
https://huggingface.co/numind/NuExtract
https://huggingface.co/numind/NuExtract-large


llama_model_loader: loaded meta data with 27 key-value pairs and 290 tensors from model/NuExtract-tiny.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = qwen2
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = Qwen1.5 0.5B
llama_model_loader: - kv   3: general.organization str = Qwen
llama_model_loader: - kv   4: general.basename str = Qwen1.5
llama_model_loader: - kv   5: general.size_label str = 0.5B
llama_model_loader: - kv   6: general.license str = mit
llama_model_loader: - kv   7: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv   8: qwen2.block_count u32 = 24
llama_model_loader: - kv   9: qwen2.context_length u32 = 32768
llama_model_loader: - kv  10: qwen2.embedding_length u32 = 1024
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: model type = 0.5B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 463.99 M
llm_load_print_meta: model size = 885.22 MiB (16.00 BPW)
llm_load_print_meta: general.name = Qwen1.5 0.5B
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151646 '<|end-output|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'

'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|end-output|>
Using chat bos_token: <|endoftext|>


CODE SNIPPET
```
import json
from transformers import AutoModelForCausalLM, AutoTokenizer


def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""]):
    # Pretty-print the schema so the model sees a consistent template layout
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"

    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")

    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    return output.split("<|output|>")[1].split("<|end-output|>")[0]


model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)

model.to("cuda")
model.eval()

text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: https://github.com/mistralai/mistral-src
Webpage: https://mistral.ai/news/announcing-mistral-7b/"""

schema = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""])
print(prediction)
```


Zero configuration Local LLMs for everyone!

LM Studio: experience the magic of LLMs with Zero technical expertise
Your guide to Zero configuration Local LLMs on any computer.
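The CODE SNIPPET above ties prompt building to transformers, but the NuExtract prompt format itself (template, optional examples, text, then the `<|output|>` marker) is plain string assembly, so it can be reused with the GGUF build through llama-cpp-python. A minimal framework-free sketch; the function names are ours:

```python
import json


def build_nuextract_prompt(schema, text, examples=()):
    """Assemble the NuExtract prompt format without any framework dependency."""
    prompt = "<|input|>\n### Template:\n" + json.dumps(json.loads(schema), indent=4) + "\n"
    for ex in examples:
        prompt += "### Example:\n" + json.dumps(json.loads(ex), indent=4) + "\n"
    prompt += "### Text:\n" + text + "\n<|output|>\n"
    return prompt


def parse_nuextract_output(raw):
    """Drop everything from the end marker onward and load the emitted JSON."""
    payload = raw.split("<|end-output|>")[0].strip()
    return json.loads(payload)


schema = '{"Car": {"Name": "", "Manufacturer": ""}}'
prompt = build_nuextract_prompt(schema, "The Fiat Panda is a city car manufactured by Fiat since 1980.")
# `prompt` can be passed to llama_cpp's create_completion with stop=['<|end-output|>'],
# as in singleExtract.py, and the completion fed to parse_nuextract_output.
```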


https://medium.com/mlearning-ai/metadata-metamorphosis-from-plain-data-to-enhanced-insights-with-retrieval-augmented-generation-8d1a8d5a6061?sk=70e8abf76409be379bce7509d35afe05


On the command line, including multiple files at once
I recommend using the huggingface-hub Python library:

pip3 install huggingface-hub

Then you can download any individual model file to the current directory, at high speed, with a command like this:

huggingface-cli download TheBloke/Panda-7B-v0.1-GGUF panda-7b-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False


Example llama.cpp command
Make sure you are using llama.cpp from commit d0cee0d or later.

./main -ngl 35 -m panda-7b-v0.1.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"

Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change -c 32768 to the desired sequence length. For extended-sequence models (e.g. 8K, 16K, 32K), the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require far more resources, so you may need to reduce this value.

If you want to have a chat-style conversation, replace the -p argument with -i -ins


Due to the unstructured nature of human conversational language, the input to LLMs is conversational and unstructured, taking the form of prompt engineering.
The output of LLMs is likewise conversational and unstructured: a highly succinct form of natural language generation (NLG).
LLMs introduced functionality for fine-tuning and creating custom models, and fine-tuning was initially the primary approach to customising LLMs.
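The huggingface-hub library recommended above can also be driven from Python rather than the CLI. A sketch using hf_hub_download (the repo and filename follow the CLI example above; the helper name is ours):

```python
from huggingface_hub import hf_hub_download


def fetch_gguf(repo_id="TheBloke/Panda-7B-v0.1-GGUF",
               filename="panda-7b-v0.1.Q4_K_M.gguf",
               local_dir="."):
    # Downloads the single file at high speed (with resume support) and
    # returns its local path, mirroring the huggingface-cli command above.
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)

# path = fetch_gguf()  # uncomment to download the file (several GB) into the current directory
```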
This approach has fallen into disfavour for three reasons:

1. LLMs have both a generative and a predictive side, and the generative power is easier to leverage than the predictive power. When the generative side of an LLM is presented with contextual, concise and relevant data at inference time, hallucination is largely negated.
2. Fine-tuning LLMs involves training-data curation, transformation and LLM-related cost. Fine-tuned models are frozen at a definite timestamp and will still demand innovation around prompt creation and data presentation to the LLM.
3. When classifying text into pre-defined classes or intents, NLU still has an advantage with its built-in efficiencies.

I hasten to add that there have been significant advances in no-code and low-code UIs and in reducing fine-tuning costs. A prudent approach is a hybrid solution, drawing on the benefits of both fine-tuning and RAG.
The aim of fine-tuning an LLM is to engender more accurate and succinct reasoning and answers.
The proven mitigation for hallucination, where the LLM returns highly plausible but incorrect answers, is to supply highly relevant and contextual prompts at inference time and to ask the LLM to follow chain-of-thought reasoning.
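The inference-time grounding described above boils down to assembling a prompt from retrieved context plus the question. A minimal illustration (the function name and wording are ours; the retrieval step itself, e.g. a vector store, is out of scope here):

```python
def grounded_prompt(question, retrieved_chunks):
    """Build an inference-time prompt that grounds the LLM in retrieved context
    and asks for step-by-step (chain-of-thought) reasoning."""
    # Number the chunks so the model can cite which passage supports its answer
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer using ONLY the context below. "
        "Think step by step, and say 'I don't know' if the context is insufficient.\n\n"
        f"### Context:\n{context}\n\n### Question:\n{question}\n### Answer:\n"
    )


prompt = grounded_prompt(
    "When was the first-generation Fiat Panda introduced?",
    ["The first generation Panda, introduced in 1980, was designed by Giorgetto Giugiaro."],
)
```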




$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
pip install llama-cpp-python[server]==0.2.53
python -m llama_cpp.server --help
python -m llama_cpp.server --host 0.0.0.0 --model model/Quyen-Mini-v0.1.Q4_K_M.gguf --chat_format chatml --n_ctx 8196 --n_gpu_layers 25


python -m llama_cpp.server --host 0.0.0.0 --model model/llama-2-7b-chat.Q4_K_M.gguf --chat_format llama-2 --n_ctx 4096 --n_gpu_layers 33


Passing an unknown --chat_format (e.g. llama2 instead of llama-2) fails with:

llama_cpp.llama_chat_format.LlamaChatCompletionHandlerNotFoundException: Invalid chat handler: llama2 (valid formats: ['llama-2', 'alpaca', 'qwen', 'vicuna', 'oasst_llama', 'baichuan-2', 'baichuan', 'openbuddy', 'redpajama-incite', 'snoozy', 'phind', 'intel', 'open-orca', 'mistrallite', 'zephyr', 'pygmalion', 'chatml', 'mistral-instruct', 'chatglm3', 'openchat', 'saiga', 'gemma', 'functionary', 'functionary-v2', 'functionary-v1', 'chatml-function-calling'])


python -m llama_cpp.server --host 0.0.0.0 --model model/qwen1_5-4b-chat-q6_k.gguf --chat_format chatml --n_ctx 32768 --n_gpu_layers 41
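llama_cpp.server exposes an OpenAI-compatible REST API (by default on port 8000), so the servers started above can be queried with a standard chat-completions request. A sketch of the request body only; no live server is contacted here, and the user message is our own example:

```python
import json

# Chat-completions payload in the OpenAI-compatible shape that
# llama_cpp.server accepts at POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "model/qwen1_5-4b-chat-q6_k.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Extract the manufacturer from: the Fiat Panda is made by Fiat."},
    ],
    "temperature": 0.2,
    "max_tokens": 200,
}
body = json.dumps(payload)
# Send `body` with urllib.request or requests while the server is running.
```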