├── images
│   ├── allimagesHERE.md
│   ├── logo.png
│   └── banner.png
├── requirements.txt
├── model
│   └── yourGGUFhere.md
├── singleExtract.py
├── README.md
├── example_2BD1Q_log.txt
├── stapp.py
├── example_IME7D_log.txt
└── instrucitons.txt
/images/allimagesHERE.md:
--------------------------------------------------------------------------------
1 | 
2 | 
--------------------------------------------------------------------------------
/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/logo.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/requirements.txt
--------------------------------------------------------------------------------
/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/banner.png
--------------------------------------------------------------------------------
/model/yourGGUFhere.md:
--------------------------------------------------------------------------------
1 | ```
2 | mkdir model
3 | cd model
4 | wget https://huggingface.co/Felladrin/gguf-NuExtract-tiny/resolve/main/NuExtract-tiny.gguf -O NuExtract-tiny.gguf
5 | ```
6 | Note: `-O` is wget's output flag; on Windows PowerShell use `Invoke-WebRequest <url> -OutFile NuExtract-tiny.gguf` instead.
--------------------------------------------------------------------------------
/singleExtract.py:
--------------------------------------------------------------------------------
1 | 
2 | # https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract
3 | # https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion
4 | # 
https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction
5 | # https://huggingface.co/numind/NuExtract-tiny
6 | # GGUF repo https://huggingface.co/Felladrin/gguf-NuExtract-tiny
7 | # https://www.geeksforgeeks.org/json-loads-in-python/
8 | 
9 | import json
10 | import datetime
11 | import warnings
12 | warnings.filterwarnings(action='ignore')
13 | from llama_cpp import Llama
14 | from rich.console import Console
15 | console = Console(width=90)
16 | 
17 | print('loading NuExtract-tiny.gguf with LlamaCPP...')
18 | nCTX = 12000
19 | sTOPS = ['<|end-output|>']
20 | # generation settings (temperature, stop, max_tokens) are passed to create_completion below
21 | client = Llama(
22 |     model_path='model/NuExtract-tiny.gguf',
23 |     #n_gpu_layers=0,
24 |     n_ctx=nCTX,
25 |     verbose=False,
26 | )
27 | print('Done...')
28 | 
29 | prompt = """<|input|>\n### Template:
30 | {
31 |     "Car": {
32 |         "Name": "",
33 |         "Manufacturer": "",
34 |         "Designers": [],
35 |         "Number of units produced": ""
36 |     }
37 | }
38 | ### Text:
39 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
40 | 
41 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
42 | 
43 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
44 | <|output|>
45 | """
46 | 
47 | start = datetime.datetime.now()
48 | output = client.create_completion(
49 |     prompt=prompt,
50 |     temperature=0.1,
51 |     repeat_penalty=1.11,
52 |     stop=sTOPS,
53 |     max_tokens=500,
54 |     stream=False)
55 | delta = datetime.datetime.now() - start
56 | result = output['choices'][0]['text']
57 | console.print(result)
58 | console.print('---')
59 | console.print(f'Completed in {delta}')
60 | 
61 | # NuExtract usually emits double-quoted JSON already; this replace() only
62 | # rescues single-quoted output and breaks if the extracted text itself
63 | # contains an apostrophe (e.g. "Pomigliano d'Arco")
64 | adapter = result.replace("'", '"')
65 | final = json.loads(adapter)
66 | console.print('---')
67 | console.print(final)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AI-ExtractData-NuExtract-tiny
2 | NuExtract-tiny GGUF for data extraction in JSON format
3 | 
4 | 
5 | 
6 | 
7 | ### How to go from a JSON template and a source text to extracted JSON
8 | 
9 | ##### JSON schema template:
10 | ```
11 | {
12 | 
    "Car": {
13 |         "Name": "",
14 |         "Manufacturer": "",
15 |         "Designers": [],
16 |         "Number of units produced": ""
17 |     }
18 | }
19 | ```
20 | ##### original text
21 | ```
22 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, 
23 | introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured 
24 | through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license 
25 | to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
26 | 
27 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car 
28 | of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under 
29 | the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande 
30 | Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
31 | 
32 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 
33 | 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] 
34 | During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, 
35 | the Panda was named after Empanda, the Roman goddess and patroness of travelers. 
36 | 
37 | ```
38 | 
39 | ##### extracted JSON:
40 | ```
41 | {
42 |     'Car': {
43 |         'Name': 'Fiat Panda',
44 |         'Manufacturer': 'Fiat',
45 |         'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani'],
46 |         'Number of units produced': '7.8 million'
47 |     }
48 | }
49 | ```
50 | 
51 | ### Create venv and install packages
52 | ```
53 | python -m venv venv
54 | venv\Scripts\activate
55 | 
56 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0
57 | ```
58 | 
59 | ### Download the GGUF file
60 | Download `NuExtract-tiny.gguf` (fp16 quantization) from Hugging Face into the `model` subfolder.
61 | 
62 | GGUF repo: [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny)
63 | 
64 | 
65 | 
66 | Original model card: https://huggingface.co/numind/NuExtract-tiny
67 | ```
68 | NuExtract-tiny is a version of Qwen1.5-0.5B, fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.
69 | 
70 | Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.
71 | 
72 | Note: While this model provides good zero-shot performance, it is intended to be fine-tuned on a specific task (>=30 examples). 
73 | ``` 74 | 75 | Run everything with 76 | ``` 77 | python singleExtract.py 78 | ``` 79 | 80 | To run Streamlit interface, after downloading all images in the `images` subfolder: 81 | ``` 82 | streamlit run stapp.py 83 | ``` 84 | 85 | ### Hyperparameters 86 | ``` 87 | temperature=0.1, 88 | repeat_penalty= 1.11, 89 | stop=['<|end-output|>'], 90 | ``` 91 | 92 | --- 93 | 94 | ### Additional resources 95 | 96 | - [https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract](https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract) 97 | - [https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion) 98 | - [https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction](https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction) 99 | - [https://huggingface.co/numind/NuExtract-tiny](https://huggingface.co/numind/NuExtract-tiny) 100 | - GGUF repo [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny) 101 | - [https://www.geeksforgeeks.org/json-loads-in-python/](https://www.geeksforgeeks.org/json-loads-in-python/) 102 | 103 | 104 | --- 105 | 106 | 107 | 108 | -------------------------------------------------------------------------------- /example_2BD1Q_log.txt: -------------------------------------------------------------------------------- 1 | 2024-08-05 17:57:04.227098 2 | 3 | Your own LocalGPT with 🌀 NuExtract-tiny 4 | --- 5 | 🧠🫡: You are a helpful assistant. 6 | 🌀: How may I help you today? 
7 | ✨: <|input|> 8 | ### Template: 9 | { 10 | "Applicant": { 11 | "Name": "", 12 | "Surname": "", 13 | "email address":"", 14 | "phone number":"", 15 | "Date of Birth":"", 16 | "Nationality":"", 17 | "Country":"", 18 | "Languages": [] 19 | } 20 | } 21 | 22 | 23 | { 24 | "Applicant": { 25 | "Name": "", 26 | "Surname": "", 27 | "email address":"", 28 | "phone number":"", 29 | "Date of Birth":"" 30 | } 31 | } 32 | 33 | 34 | 35 | 36 | ### Text: 37 | CURRICULUM VITAE Name Fabio ElMassry 38 | Position Instrument & Control Engineer 39 | Date of Birth 05/04/1986 40 | Nationality Egyptian 41 | Status Married 42 | Languages English/Italian/Arabic 43 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 44 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 45 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 46 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 47 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 48 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 49 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 50 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 51 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 52 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 53 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 54 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 55 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 56 | graphic configuration, PLC programming and configuration. 57 | x Troubleshooting system and process alarms using DCS Honeywell 58 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 59 | <|output|> 60 | 61 | 🌀: { 62 | "Applicant":{ 63 | "Name":"Fabio ElMassry" 64 | "Surname":"" 65 | "email address":"fabioElMassry@yahoo.fr" 66 | "phone number":"+39685737365" 67 | "Date of Birth":"05/04/1986" 68 | } 69 | } 70 | 71 | --- 72 | 73 | { 74 | "Details": { 75 | "Nationality": "", 76 | "Country": "", 77 | "Languages": [] 78 | } 79 | } 80 | 81 | CURRICULUM VITAE Name Fabio ElMassry 82 | Position Instrument & Control Engineer 83 | Date of Birth 05/04/1986 84 | Nationality Egyptian 85 | Status Married 86 | Languages English/Italian/Arabic 87 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 88 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 89 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 90 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, 
Hysys.General: 91 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 92 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 93 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 94 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting & 95 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 96 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 97 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 98 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 99 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 100 | graphic configuration, PLC programming and configuration. 101 | x Troubleshooting system and process alarms using DCS Honeywell 102 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 
103 | 
104 | 
105 | {
106 | "Details":{
107 | "Nationality":"Egyptian"
108 | "Country":""
109 | "Languages":[
110 | 0:"English"
111 | 1:"Italian"
112 | 2:"Arabic"
113 | ]
114 | }
115 | }
116 | 
117 | 
118 | 
--------------------------------------------------------------------------------
/stapp.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from llama_cpp import Llama
3 | import warnings
4 | warnings.filterwarnings(action='ignore')
5 | import datetime
6 | import random
7 | import string
8 | from time import sleep
9 | import tiktoken
10 | import json
11 | 
12 | # for counting the tokens in the prompt and in the result
13 | #context_count = len(encoding.encode(yourtext))
14 | encoding = tiktoken.get_encoding("r50k_base")
15 | 
16 | nCTX = 12000
17 | sTOPS = ['<|end-output|>']
18 | modelname = "NuExtract-tiny"
19 | # Set the webpage title
20 | st.set_page_config(
21 |     page_title=f"Your LocalGPT ✨ with {modelname}",
22 |     page_icon="🌟",
23 |     layout="wide")
24 | 
25 | if "hf_model" not in st.session_state:
26 |     st.session_state.hf_model = "NuExtract-tiny"
27 | # Initialize chat history
28 | if "messages" not in st.session_state:
29 |     st.session_state.messages = []
30 | 
31 | if "repeat" not in st.session_state:
32 |     st.session_state.repeat = 1.35
33 | 
34 | if "temperature" not in st.session_state:
35 |     st.session_state.temperature = 0.1
36 | 
37 | if "maxlength" not in st.session_state:
38 |     st.session_state.maxlength = 500
39 | 
40 | if "speed" not in st.session_state:
41 |     st.session_state.speed = 0.0
42 | 
43 | def writehistory(filename,text):
44 |     with open(filename, 'a', encoding='utf-8') as f:
45 |         f.write(text)
46 |         f.write('\n')
47 | 
48 | 
49 | def genRANstring(n):
50 |     """
51 |     n = int number of char to randomize
52 |     """
53 |     N = n
54 |     res = ''.join(random.choices(string.ascii_uppercase +
55 |                                  string.digits, k=N))
56 |     return res
57 | 
58 | @st.cache_resource
59 | def create_chat():
60 |     # load the local NuExtract-tiny GGUF once; st.cache_resource keeps it across reruns
61 |     print('loading NuExtract-tiny.gguf with LlamaCPP...')
62 |     client = Llama(
63 |         model_path='model/NuExtract-tiny.gguf',
64 |         temperature=0.24,
65 |         n_ctx=nCTX,
66 |         max_tokens=600,
67 |         repeat_penalty=1.176,
68 |         stop=sTOPS,
69 |         verbose=False,
70 |     )
71 | 
72 |     return client
73 | 
74 | 
75 | # create the session states
76 | if "logfilename" not in st.session_state:
77 |     ## Logger file
78 |     logfile = f'{genRANstring(5)}_log.txt'
79 |     st.session_state.logfilename = logfile
80 |     #Write in the history the first 2 sessions
81 |     writehistory(st.session_state.logfilename,f'{str(datetime.datetime.now())}\n\nYour own LocalGPT with 🌀 {modelname}\n---\n🧠🫡: You are a helpful assistant.')
82 |     writehistory(st.session_state.logfilename,f'🌀: How may I help you today?')
83 | 
84 | 
85 | 
86 | ### START STREAMLIT UI
87 | # Create a header element
88 | st.image('images/banner.png',use_column_width=True)
89 | mytitle = f'> *Extract data with {modelname} into `JSON` format*'
90 | st.markdown(mytitle, unsafe_allow_html=True)
91 | #st.markdown('> Local Chat ')
92 | #st.markdown('---')
93 | 
94 | # CREATE THE SIDEBAR
95 | with st.sidebar:
96 |     st.image('images/logo.png', use_column_width=True)
97 |     st.session_state.temperature = st.slider('Temperature:', min_value=0.0, max_value=1.0, value=0.1, step=0.01)
98 |     st.session_state.maxlength = st.slider('Length reply:', min_value=150, max_value=2000,
99 |                                            value=500, step=50)
100 |     st.session_state.repeat = st.slider('Repeat Penalty:', min_value=0.0, max_value=2.0, value=1.11, step=0.02)
101 |     st.markdown(f"**Logfile**: {st.session_state.logfilename}")
102 |     statspeed = st.markdown(f'💫 speed: {st.session_state.speed} t/s')
103 |     btnClear = st.button("Clear History",type="primary", use_container_width=True)
104 | 
105 | llm = create_chat()
106 | 
107 | st.session_state.jsonformat = st.text_area('JSON Schema to be applied', value="", height=150,
108 |         placeholder='paste your JSON schema here', disabled=False, 
label_visibility="visible")
109 | st.session_state.origintext = st.text_area('Source Document', value="", height=150,
110 |         placeholder='paste your source text here', disabled=False, label_visibility="visible")
111 | extract_btn = st.button("Extract Data",type="primary", use_container_width=False)
112 | st.markdown('---')
113 | st.session_state.extractedJSON = st.empty()
114 | st.session_state.onlyJSON = st.empty()
115 | 
116 | 
117 | 
118 | if extract_btn:
119 |     prompt = f"""<|input|>\n### Template:
120 | {st.session_state.jsonformat}
121 | 
122 | ### Text:
123 | {st.session_state.origintext}
124 | <|output|>
125 | """
126 |     print(prompt)
127 |     with st.spinner("Thinking..."):
128 |         start = datetime.datetime.now()
129 |         output = llm.create_completion(
130 |             prompt=prompt,
131 |             temperature=st.session_state.temperature,   # wired to the sidebar slider
132 |             repeat_penalty=st.session_state.repeat,     # wired to the sidebar slider
133 |             stop=sTOPS,
134 |             max_tokens=st.session_state.maxlength,      # wired to the sidebar slider
135 |             stream=False)
136 | 
137 |         delta = datetime.datetime.now() - start
138 |         result = output['choices'][0]['text']
139 |         st.write(result)
140 |         adapter = result.replace("'",'"')  # fragile: fails if the extracted text contains an apostrophe
141 |         final = json.loads(adapter)
142 |         totalTokens = len(encoding.encode(prompt))+len(encoding.encode(result))
143 |         totalseconds = delta.total_seconds()
144 |         st.session_state.speed = totalTokens/totalseconds
145 |         statspeed.markdown(f'💫 speed: {st.session_state.speed:.2f} t/s')
146 |         totalstring = f"""GENERATED STRING
147 | 
148 | {result}
149 | ---
150 | 
151 | Generated in {delta}
152 | 
153 | ---
154 | 
155 | JSON FORMAT:
156 | """
157 |         with st.session_state.extractedJSON:
158 |             st.markdown(totalstring)
159 |         st.session_state.onlyJSON.json(final)
160 |         writehistory(st.session_state.logfilename,f'✨: {prompt}')
161 |         writehistory(st.session_state.logfilename,f'🌀: {result}')
162 |         writehistory(st.session_state.logfilename,f'---\n\n')
163 | 
164 | 
--------------------------------------------------------------------------------
/example_IME7D_log.txt:
--------------------------------------------------------------------------------
1 | 2024-08-06 
07:30:44.882140 2 | 3 | Your own LocalGPT with 🌀 NuExtract-tiny 4 | --- 5 | 🧠🫡: You are a helpful assistant. 6 | 🌀: How may I help you today? 7 | 8 | --- 9 | 10 | 11 | ✨: <|input|> 12 | ### Template: 13 | { 14 | "Applicant": { 15 | "Name": "", 16 | "Surname": "", 17 | "email address":"", 18 | "phone number":"", 19 | "Date of Birth":"" 20 | } 21 | } 22 | 23 | ### Text: 24 | CURRICULUM VITAE Name Fabio ElMassry 25 | Position Instrument & Control Engineer 26 | Date of Birth 05/04/1986 27 | Nationality Egyptian 28 | Status Married 29 | Languages English/Italian/Arabic 30 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 31 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 32 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 33 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 34 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 35 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 36 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 37 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 38 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 39 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 40 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 41 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 42 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 43 | graphic configuration, PLC programming and configuration. 44 | x Troubleshooting system and process alarms using DCS Honeywell 45 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 46 | <|output|> 47 | 48 | 🌀: { 49 | "Applicant": { 50 | "Name": "Fabio ElMassry", 51 | "Surname": "", 52 | "email address":"fabioElMassry@yahoo.fr", 53 | "phone number":"+39685737365", 54 | "Date of Birth": "05/04/1986" 55 | } 56 | } 57 | 58 | --- 59 | 60 | 61 | ✨: <|input|> 62 | ### Template: 63 | { 64 | "Details": { 65 | "Nationality": "", 66 | "Country": "", 67 | "Languages":"", 68 | "phone number": [] 69 | } 70 | } 71 | 72 | ### Text: 73 | CURRICULUM VITAE Name Fabio ElMassry 74 | Position Instrument & Control Engineer 75 | Date of Birth 05/04/1986 76 | Nationality Egyptian 77 | Status Married 78 | Languages English/Italian/Arabic 79 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 80 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 81 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & 
Microsoft Windows, Microsoft Office pack, 82 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 83 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 84 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 85 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 86 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting & 87 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 88 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 89 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 90 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 91 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 92 | graphic configuration, PLC programming and configuration. 93 | x Troubleshooting system and process alarms using DCS Honeywell 94 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 
95 | <|output|> 96 | 97 | 🌀: { 98 | "Details": { 99 | "Nationality": "Egyptian", 100 | "Country": "", 101 | "Languages": [ 102 | "English", 103 | "Italian", 104 | "Arabic" 105 | ], 106 | "phone number": [ 107 | "+39685737365" 108 | ] 109 | } 110 | } 111 | 112 | --- 113 | 114 | 115 | ✨: <|input|> 116 | ### Template: 117 | { 118 | "Details": { 119 | "Nationality": "", 120 | "Country": "", 121 | "Languages": [] 122 | } 123 | } 124 | 125 | ### Text: 126 | CURRICULUM VITAE Name Fabio ElMassry 127 | Position Instrument & Control Engineer 128 | Date of Birth 05/04/1986 129 | Nationality Egyptian 130 | Status Married 131 | Languages English/Italian/Arabic 132 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control 133 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration & 134 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack, 135 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General: 136 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas 137 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical 138 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering, 139 | revamping. 
I have a wide range experience in configuration, commissioning programming, troubleshooting & 140 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of 141 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria 142 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach, 143 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant 144 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check, 145 | graphic configuration, PLC programming and configuration. 146 | x Troubleshooting system and process alarms using DCS Honeywell 147 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems. 148 | <|output|> 149 | 150 | 🌀: { 151 | "Details": { 152 | "Nationality": "Egyptian", 153 | "Country": "", 154 | "Languages": [ 155 | "English", 156 | "Italian", 157 | "Arabic" 158 | ] 159 | } 160 | } 161 | 162 | --- 163 | 164 | 165 | -------------------------------------------------------------------------------- /instrucitons.txt: -------------------------------------------------------------------------------- 1 | Felladrin/gguf-NuExtract-tiny 2 | https://huggingface.co/Felladrin/gguf-NuExtract-tiny/tree/main 3 | 4 | Original Repo 5 | https://huggingface.co/numind/NuExtract-tiny 6 | 7 | 8 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0 huggingface-hub 9 | pip install langchain langchain-community faiss-cpu duckduckgo-search newspaper3k 10 | pip install pymupdf4llm strip_markdown 11 | 12 | ``` 13 | >>> import strip_markdown 14 | >>> 15 | >>> TXT: str = strip_markdown.strip_markdown(MD: str) 16 | ``` 17 | 18 | 19 | 20 | MODEL CARD 21 | NuExtract_tiny is a version of Qwen1.5-0.5, fine-tuned on a private high-quality synthetic dataset for information extraction. 
To use the model, provide an input text (shorter than 2000 tokens) and a JSON template describing the information you need to extract.

Note: this model is purely extractive, so every piece of text it outputs is present verbatim in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.

Note: while this model provides good zero-shot performance, it is intended to be fine-tuned on a specific task (>= 30 examples).

We also provide a base (3.8B) and a large (7B) version of this model: NuExtract and NuExtract-large
https://huggingface.co/numind/NuExtract
https://huggingface.co/numind/NuExtract-large


llama_model_loader: loaded meta data with 27 key-value pairs and 290 tensors from model/NuExtract-tiny.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = qwen2
llama_model_loader: - kv   1: general.type str = model
llama_model_loader: - kv   2: general.name str = Qwen1.5 0.5B
llama_model_loader: - kv   3: general.organization str = Qwen
llama_model_loader: - kv   4: general.basename str = Qwen1.5
llama_model_loader: - kv   5: general.size_label str = 0.5B
llama_model_loader: - kv   6: general.license str = mit
llama_model_loader: - kv   7: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv   8: qwen2.block_count u32 = 24
llama_model_loader: - kv   9: qwen2.context_length u32 = 32768
llama_model_loader: - kv  10: qwen2.embedding_length u32 = 1024
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 1024
llm_load_print_meta: model type = 0.5B
llm_load_print_meta: model ftype = BF16
llm_load_print_meta: model params = 463.99 M
llm_load_print_meta: model size = 885.22 MiB (16.00 BPW)
llm_load_print_meta: general.name = Qwen1.5 0.5B
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151646 '<|end-output|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'

'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
Using chat eos_token: <|end-output|>
Using chat bos_token: <|endoftext|>


CODE SNIPPET
```
import json
from transformers import AutoModelForCausalLM, AutoTokenizer


def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""]):
    # Pretty-print the schema so the model sees a consistent template layout
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"

    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    input_ids = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to("cuda")

    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    return output.split("<|output|>")[1].split("<|end-output|>")[0]


model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)

model.to("cuda")
model.eval()

text = """We introduce Mistral 7B, a 7-billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B - Instruct, that surpasses Llama 2 13B - chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: https://github.com/mistralai/mistral-src
Webpage: https://mistral.ai/news/announcing-mistral-7b/"""

schema = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""])
print(prediction)
```


Zero configuration Local LLMs for everyone!

LM Studio: experience the magic of LLMs with Zero technical expertise
Your guide to Zero configuration Local LLMs on any computer.
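The CODE SNIPPET above ties prompt building to transformers, but the NuExtract prompt format itself (template, optional examples, text, then the `<|output|>` marker) is plain string assembly, so it can be reused with the GGUF build through llama-cpp-python. A minimal framework-free sketch; the function names are ours:

```python
import json


def build_nuextract_prompt(schema, text, examples=()):
    """Assemble the NuExtract prompt format without any framework dependency."""
    prompt = "<|input|>\n### Template:\n" + json.dumps(json.loads(schema), indent=4) + "\n"
    for ex in examples:
        prompt += "### Example:\n" + json.dumps(json.loads(ex), indent=4) + "\n"
    prompt += "### Text:\n" + text + "\n<|output|>\n"
    return prompt


def parse_nuextract_output(raw):
    """Drop everything from the end marker onward and load the emitted JSON."""
    payload = raw.split("<|end-output|>")[0].strip()
    return json.loads(payload)


schema = '{"Car": {"Name": "", "Manufacturer": ""}}'
prompt = build_nuextract_prompt(schema, "The Fiat Panda is a city car manufactured by Fiat since 1980.")
# `prompt` can be passed to llama_cpp's create_completion with stop=['<|end-output|>'],
# as in singleExtract.py, and the completion fed to parse_nuextract_output.
```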


https://medium.com/mlearning-ai/metadata-metamorphosis-from-plain-data-to-enhanced-insights-with-retrieval-augmented-generation-8d1a8d5a6061?sk=70e8abf76409be379bce7509d35afe05


On the command line, including multiple files at once
I recommend using the huggingface-hub Python library:

pip3 install huggingface-hub

Then you can download any individual model file to the current directory, at high speed, with a command like this:

huggingface-cli download TheBloke/Panda-7B-v0.1-GGUF panda-7b-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False


Example llama.cpp command
Make sure you are using llama.cpp from commit d0cee0d or later.

./main -ngl 35 -m panda-7b-v0.1.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"

Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change -c 32768 to the desired sequence length. For extended-sequence models (e.g. 8K, 16K, 32K), the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require far more resources, so you may need to reduce this value.

If you want to have a chat-style conversation, replace the -p argument with -i -ins


Due to the unstructured nature of human conversational language, the input to LLMs is conversational and unstructured, taking the form of prompt engineering.
The output of LLMs is likewise conversational and unstructured: a highly succinct form of natural language generation (NLG).
LLMs introduced functionality for fine-tuning and creating custom models, and fine-tuning was initially the primary approach to customising LLMs.
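The huggingface-hub library recommended above can also be driven from Python rather than the CLI. A sketch using hf_hub_download (the repo and filename follow the CLI example above; the helper name is ours):

```python
from huggingface_hub import hf_hub_download


def fetch_gguf(repo_id="TheBloke/Panda-7B-v0.1-GGUF",
               filename="panda-7b-v0.1.Q4_K_M.gguf",
               local_dir="."):
    # Downloads the single file at high speed (with resume support) and
    # returns its local path, mirroring the huggingface-cli command above.
    return hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)

# path = fetch_gguf()  # uncomment to download the file (several GB) into the current directory
```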
This approach has fallen into disfavour for three reasons:

1. LLMs have both a generative and a predictive side, and the generative power is easier to leverage than the predictive power. When the generative side of an LLM is presented with contextual, concise and relevant data at inference time, hallucination is largely negated.
2. Fine-tuning LLMs involves training-data curation, transformation and LLM-related cost. Fine-tuned models are frozen at a definite timestamp and will still demand innovation around prompt creation and data presentation to the LLM.
3. When classifying text into pre-defined classes or intents, NLU still has an advantage with its built-in efficiencies.

I hasten to add that there have been significant advances in no-code and low-code UIs and in reducing fine-tuning costs. A prudent approach is a hybrid solution, drawing on the benefits of both fine-tuning and RAG.
The aim of fine-tuning an LLM is to engender more accurate and succinct reasoning and answers.
The proven mitigation for hallucination, where the LLM returns highly plausible but incorrect answers, is to supply highly relevant and contextual prompts at inference time and to ask the LLM to follow chain-of-thought reasoning.
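The inference-time grounding described above boils down to assembling a prompt from retrieved context plus the question. A minimal illustration (the function name and wording are ours; the retrieval step itself, e.g. a vector store, is out of scope here):

```python
def grounded_prompt(question, retrieved_chunks):
    """Build an inference-time prompt that grounds the LLM in retrieved context
    and asks for step-by-step (chain-of-thought) reasoning."""
    # Number the chunks so the model can cite which passage supports its answer
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer using ONLY the context below. "
        "Think step by step, and say 'I don't know' if the context is insufficient.\n\n"
        f"### Context:\n{context}\n\n### Question:\n{question}\n### Answer:\n"
    )


prompt = grounded_prompt(
    "When was the first-generation Fiat Panda introduced?",
    ["The first generation Panda, introduced in 1980, was designed by Giorgetto Giugiaro."],
)
```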




$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
pip install llama-cpp-python[server]==0.2.53
python -m llama_cpp.server --help
python -m llama_cpp.server --host 0.0.0.0 --model model/Quyen-Mini-v0.1.Q4_K_M.gguf --chat_format chatml --n_ctx 8196 --n_gpu_layers 25


python -m llama_cpp.server --host 0.0.0.0 --model model/llama-2-7b-chat.Q4_K_M.gguf --chat_format llama-2 --n_ctx 4096 --n_gpu_layers 33


Passing an unknown --chat_format (e.g. llama2 instead of llama-2) fails with:

llama_cpp.llama_chat_format.LlamaChatCompletionHandlerNotFoundException: Invalid chat handler: llama2 (valid formats: ['llama-2', 'alpaca', 'qwen', 'vicuna', 'oasst_llama', 'baichuan-2', 'baichuan', 'openbuddy', 'redpajama-incite', 'snoozy', 'phind', 'intel', 'open-orca', 'mistrallite', 'zephyr', 'pygmalion', 'chatml', 'mistral-instruct', 'chatglm3', 'openchat', 'saiga', 'gemma', 'functionary', 'functionary-v2', 'functionary-v1', 'chatml-function-calling'])


python -m llama_cpp.server --host 0.0.0.0 --model model/qwen1_5-4b-chat-q6_k.gguf --chat_format chatml --n_ctx 32768 --n_gpu_layers 41
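llama_cpp.server exposes an OpenAI-compatible REST API (by default on port 8000), so the servers started above can be queried with a standard chat-completions request. A sketch of the request body only; no live server is contacted here, and the user message is our own example:

```python
import json

# Chat-completions payload in the OpenAI-compatible shape that
# llama_cpp.server accepts at POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "model/qwen1_5-4b-chat-q6_k.gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Extract the manufacturer from: the Fiat Panda is made by Fiat."},
    ],
    "temperature": 0.2,
    "max_tokens": 200,
}
body = json.dumps(payload)
# Send `body` with urllib.request or requests while the server is running.
```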