├── images
│   ├── allimagesHERE.md
│   ├── logo.png
│   └── banner.png
├── model
│   └── yourGGUFhere.md
├── requirements.txt
├── singleExtract.py
├── stapp.py
├── README.md
├── example_2BD1Q_log.txt
├── example_IME7D_log.txt
└── instrucitons.txt
/images/allimagesHERE.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/logo.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/requirements.txt
--------------------------------------------------------------------------------
/images/banner.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/AI-ExtractData-NuExtract-tiny/main/images/banner.png
--------------------------------------------------------------------------------
/model/yourGGUFhere.md:
--------------------------------------------------------------------------------
1 | ```
2 | mkdir model
3 | cd model
4 | wget https://huggingface.co/Felladrin/gguf-NuExtract-tiny/resolve/main/NuExtract-tiny.gguf -OutFile NuExtract-tiny.gguf  # PowerShell (wget = Invoke-WebRequest); on Linux/macOS use -O instead of -OutFile
5 | ```
6 |
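For reference, the same file can be fetched without `wget`, using only Python's standard library. A minimal sketch; the download line is commented out because the file is roughly 885 MB:

```python
from pathlib import Path
from urllib.request import urlretrieve

repo = "Felladrin/gguf-NuExtract-tiny"
filename = "NuExtract-tiny.gguf"
url = f"https://huggingface.co/{repo}/resolve/main/{filename}"
dest = Path("model") / filename

# Uncomment to actually download the ~885 MB fp16 file:
# dest.parent.mkdir(exist_ok=True)
# urlretrieve(url, dest)
print(url)
```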
--------------------------------------------------------------------------------
/singleExtract.py:
--------------------------------------------------------------------------------
1 |
2 | # https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract
3 | # https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion
4 | # https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction
5 | # https://huggingface.co/numind/NuExtract-tiny
6 | # GGUF repo https://huggingface.co/Felladrin/gguf-NuExtract-tiny
7 | # https://www.geeksforgeeks.org/json-loads-in-python/
8 |
9 | import json
10 | from llama_cpp import Llama
11 | from rich.markdown import Markdown
12 | import warnings
13 | warnings.filterwarnings(action='ignore')
14 | import datetime
15 | from rich.console import Console
16 | console = Console(width=90)
17 |
18 | print('loading NuExtract-tiny.gguf with LlamaCPP...')
19 | nCTX = 12000
20 | sTOPS = ['<|end-output|>']
21 | client = Llama(
22 | model_path='model/NuExtract-tiny.gguf',
23 | #n_gpu_layers=0,
24 | temperature=0.24,
25 | n_ctx=nCTX,
26 | max_tokens=600,
27 | repeat_penalty=1.176,
28 | stop=sTOPS,
29 | verbose=False,
30 | )
31 | print('Done...')
32 |
33 | prompt = """<|input|>\n### Template:
34 | {
35 | "Car": {
36 | "Name": "",
37 | "Manufacturer": "",
38 | "Designers": [],
39 | "Number of units produced": ""
40 | }
41 | }
42 | ### Text:
43 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
44 |
45 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
46 |
47 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
48 | <|output|>
49 | """
50 |
53 |
54 | start = datetime.datetime.now()
55 |
56 |
57 | output = client.create_completion(
58 |     prompt=prompt,
59 |     temperature=0.1,
60 |     repeat_penalty=1.11,
61 |     stop=sTOPS,
62 |     max_tokens=500,
63 |     stream=False)
64 | delta = datetime.datetime.now() - start
65 | result = output['choices'][0]['text']
66 | console.print(result)
67 | console.print('---')
68 | console.print(f'Completed in {delta}')
69 |
70 | # NuExtract often emits single-quoted pseudo-JSON; normalize before parsing
71 | adapter = result.replace("'", '"')
76 | final = json.loads(adapter)
77 | console.print('---')
78 | console.print(final)
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AI-ExtractData-NuExtract-tiny
2 | NuExtract-tiny GGUF for data extraction in json format
3 |
4 |
5 |
6 |
7 | ### How to go from free text to structured JSON
8 |
9 | ##### JSON schema template:
10 | ```
11 | {
12 | "Car": {
13 | "Name": "",
14 | "Manufacturer": "",
15 | "Designers": [],
16 | "Number of units produced": ""
17 | }
18 | }
19 | ```
20 | ##### original text
21 | ```
22 | The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda,
23 | introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured
24 | through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license
25 | to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).
26 |
27 | The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car
28 | of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under
29 | the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande
30 | Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.
31 |
32 | In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early
33 | 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4]
34 | During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately,
35 | the Panda was named after Empanda, the Roman goddess and patroness of travelers.
36 |
37 | ```
38 |
39 | ##### extracted JSON:
40 | ```
41 | {
42 | 'Car': {
43 | 'Name': 'Fiat Panda',
44 | 'Manufacturer': 'Fiat',
45 | 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani'],
46 | 'Number of units produced': '7.8 million'
47 | }
48 | }
49 | ```
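Under the hood, the template and the source text are simply concatenated into NuExtract's `<|input|>` / `<|output|>` prompt format (see `singleExtract.py`). A minimal sketch of that assembly step; the `build_prompt` helper name is ours, not part of any API:

```python
import json

def build_prompt(template: dict, text: str) -> str:
    """Wrap a JSON template and source text in NuExtract's prompt markers."""
    schema = json.dumps(template, indent=4)
    return f"<|input|>\n### Template:\n{schema}\n### Text:\n{text}\n<|output|>\n"

template = {"Car": {"Name": "", "Manufacturer": "", "Designers": []}}
prompt = build_prompt(template, "The Fiat Panda is a city car manufactured by Fiat.")
print(prompt)
```

Generation is then stopped at the `<|end-output|>` token, which is why both scripts pass it in the `stop` list.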
50 |
51 | ### Create venv and install packages
52 | ```
53 | python -m venv venv
54 | venv\Scripts\activate
55 |
56 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0
57 | ```
58 |
59 | ### Download the GGUF file
60 | Download `NuExtract-tiny.gguf` (fp16 quantization) from Hugging Face into the `model` subfolder
61 |
62 | GGUF repo: [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny)
65 |
66 | Original model card: https://huggingface.co/numind/NuExtract-tiny
67 | ```
68 | NuExtract_tiny is a version of Qwen1.5-0.5, fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.
69 |
70 | Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.
71 |
72 | Note: While this model provides good 0 shot performance, it is intended to be fine-tuned on a specific task (>=30 examples).
73 | ```
74 |
75 | Run everything with
76 | ```
77 | python singleExtract.py
78 | ```
79 |
80 | To run Streamlit interface, after downloading all images in the `images` subfolder:
81 | ```
82 | streamlit run stapp.py
83 | ```
84 |
85 | ### Hyperparameters
86 | ```
87 | temperature=0.1,
88 | repeat_penalty= 1.11,
89 | stop=['<|end-output|>'],
90 | ```
91 |
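Note that the raw completion often comes back single-quoted (as in the extracted example above), which `json.loads` rejects. Both scripts swap quotes before parsing; a sketch of that step, with a fallback for output that is already valid JSON. The quote swap is naive and will break on values containing apostrophes (e.g. "Pomigliano d'Arco"):

```python
import json

def parse_extraction(raw: str) -> dict:
    """Parse NuExtract output, tolerating single-quoted pseudo-JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Same naive normalization used in singleExtract.py / stapp.py
        return json.loads(raw.replace("'", '"'))

print(parse_extraction("{'Car': {'Name': 'Fiat Panda'}}"))
```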
92 | ---
93 |
94 | ### Additional resources
95 |
96 | - [https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract](https://huggingface.co/learn/cookbook/en/information_extraction_haystack_nuextract)
97 | - [https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion](https://llama-cpp-python.readthedocs.io/en/stable/api-reference/#llama_cpp.Llama.create_completion)
98 | - [https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction](https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction)
99 | - [https://huggingface.co/numind/NuExtract-tiny](https://huggingface.co/numind/NuExtract-tiny)
100 | - GGUF repo [https://huggingface.co/Felladrin/gguf-NuExtract-tiny](https://huggingface.co/Felladrin/gguf-NuExtract-tiny)
101 | - [https://www.geeksforgeeks.org/json-loads-in-python/](https://www.geeksforgeeks.org/json-loads-in-python/)
102 |
103 |
104 | ---
105 |
106 |
107 |
108 |
--------------------------------------------------------------------------------
/example_2BD1Q_log.txt:
--------------------------------------------------------------------------------
1 | 2024-08-05 17:57:04.227098
2 |
3 | Your own LocalGPT with 🌀 NuExtract-tiny
4 | ---
5 | 🧠🫡: You are a helpful assistant.
6 | 🌀: How may I help you today?
7 | ✨: <|input|>
8 | ### Template:
9 | {
10 | "Applicant": {
11 | "Name": "",
12 | "Surname": "",
13 | "email address":"",
14 | "phone number":"",
15 | "Date of Birth":"",
16 | "Nationality":"",
17 | "Country":"",
18 | "Languages": []
19 | }
20 | }
21 |
22 |
23 | {
24 | "Applicant": {
25 | "Name": "",
26 | "Surname": "",
27 | "email address":"",
28 | "phone number":"",
29 | "Date of Birth":""
30 | }
31 | }
32 |
33 |
34 |
35 |
36 | ### Text:
37 | CURRICULUM VITAE Name Fabio ElMassry
38 | Position Instrument & Control Engineer
39 | Date of Birth 05/04/1986
40 | Nationality Egyptian
41 | Status Married
42 | Languages English/Italian/Arabic
43 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control
44 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration &
45 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack,
46 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General:
47 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas
48 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical
49 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering,
50 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting &
51 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of
52 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria
53 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach,
54 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant
55 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check,
56 | graphic configuration, PLC programming and configuration.
57 | x Troubleshooting system and process alarms using DCS Honeywell
58 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems.
59 | <|output|>
60 |
61 | 🌀: {
62 | "Applicant":{
63 | "Name":"Fabio ElMassry"
64 | "Surname":""
65 | "email address":"fabioElMassry@yahoo.fr"
66 | "phone number":"+39685737365"
67 | "Date of Birth":"05/04/1986"
68 | }
69 | }
70 |
71 | ---
72 |
73 | {
74 | "Details": {
75 | "Nationality": "",
76 | "Country": "",
77 | "Languages": []
78 | }
79 | }
80 |
81 | CURRICULUM VITAE Name Fabio ElMassry
82 | Position Instrument & Control Engineer
83 | Date of Birth 05/04/1986
84 | Nationality Egyptian
85 | Status Married
86 | Languages English/Italian/Arabic
87 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control
88 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration &
89 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack,
90 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General:
91 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas
92 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical
93 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering,
94 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting &
95 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of
96 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria
97 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach,
98 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant
99 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check,
100 | graphic configuration, PLC programming and configuration.
101 | x Troubleshooting system and process alarms using DCS Honeywell
102 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems.
103 |
104 |
105 | {
106 | "Details":{
107 | "Nationality":"Egyptian"
108 | "Country":""
109 | "Languages":[
110 | 0:"English"
111 | 1:"Italian"
112 | 2:"Arabic"
113 | ]
114 | }
115 | }
116 |
117 |
118 |
--------------------------------------------------------------------------------
/stapp.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | from llama_cpp import Llama
3 | import warnings
4 | warnings.filterwarnings(action='ignore')
5 | import datetime
6 | import random
7 | import string
8 | from time import sleep
9 | import tiktoken
10 | import json
11 |
12 | # for counting the tokens in the prompt and in the result
13 | #context_count = len(encoding.encode(yourtext))
14 | encoding = tiktoken.get_encoding("r50k_base")
15 |
16 | nCTX = 12000
17 | sTOPS = ['<|end-output|>']
18 | modelname = "NuExtract-tiny"
19 | # Set the webpage title
20 | st.set_page_config(
21 | page_title=f"Your LocalGPT ✨ with {modelname}",
22 | page_icon="🌟",
23 | layout="wide")
24 |
25 | if "hf_model" not in st.session_state:
26 | st.session_state.hf_model = "NuExtract-tiny"
27 | # Initialize chat history
28 | if "messages" not in st.session_state:
29 | st.session_state.messages = []
30 |
31 | if "repeat" not in st.session_state:
32 | st.session_state.repeat = 1.35
33 |
34 | if "temperature" not in st.session_state:
35 | st.session_state.temperature = 0.1
36 |
37 | if "maxlength" not in st.session_state:
38 | st.session_state.maxlength = 500
39 |
40 | if "speed" not in st.session_state:
41 | st.session_state.speed = 0.0
42 |
43 | def writehistory(filename, text):
44 |     with open(filename, 'a', encoding='utf-8') as f:
45 |         f.write(text)
46 |         f.write('\n')
48 |
49 | def genRANstring(n):
50 |     """
51 |     n = number of random characters to generate
52 |     """
53 |     return ''.join(random.choices(string.ascii_uppercase + string.digits, k=n))
57 |
58 | @st.cache_resource
59 | def create_chat():
60 |     # Load the local GGUF model with llama-cpp-python
62 | client = Llama(
63 | model_path='model/NuExtract-tiny.gguf',
64 | temperature=0.24,
65 | n_ctx=nCTX,
66 | max_tokens=600,
67 | repeat_penalty=1.176,
68 | stop=sTOPS,
69 | verbose=False,
70 | )
71 | print('loading NuExtract-tiny.gguf with LlamaCPP...')
72 | return client
73 |
74 |
75 | # Create the session state for the log file
76 | if "logfilename" not in st.session_state:
77 | ## Logger file
78 | logfile = f'{genRANstring(5)}_log.txt'
79 | st.session_state.logfilename = logfile
80 | #Write in the history the first 2 sessions
81 | writehistory(st.session_state.logfilename,f'{str(datetime.datetime.now())}\n\nYour own LocalGPT with 🌀 {modelname}\n---\n🧠🫡: You are a helpful assistant.')
82 | writehistory(st.session_state.logfilename,f'🌀: How may I help you today?')
83 |
84 |
85 |
86 | ### START STREAMLIT UI
87 | # Create a header element
88 | st.image('images/banner.png',use_column_width=True)
89 | mytitle = f'> *Extract data with {modelname} into `JSON` format*'
90 | st.markdown(mytitle, unsafe_allow_html=True)
91 | #st.markdown('> Local Chat ')
92 | #st.markdown('---')
93 |
94 | # CREATE THE SIDEBAR
95 | with st.sidebar:
96 | st.image('images/logo.png', use_column_width=True)
97 | st.session_state.temperature = st.slider('Temperature:', min_value=0.0, max_value=1.0, value=0.1, step=0.01)
98 | st.session_state.maxlength = st.slider('Length reply:', min_value=150, max_value=2000,
99 | value=500, step=50)
100 | st.session_state.repeat = st.slider('Repeat Penalty:', min_value=0.0, max_value=2.0, value=1.11, step=0.02)
101 | st.markdown(f"**Logfile**: {st.session_state.logfilename}")
102 | statspeed = st.markdown(f'💫 speed: {st.session_state.speed} t/s')
103 | btnClear = st.button("Clear History",type="primary", use_container_width=True)
104 |
105 | llm = create_chat()
106 |
107 | st.session_state.jsonformat = st.text_area('JSON Schema to be applied', value="", height=150,
108 | placeholder='here your schema', disabled=False, label_visibility="visible")
109 | st.session_state.origintext = st.text_area('Source Document', value="", height=150,
110 | placeholder='here your text', disabled=False, label_visibility="visible")
111 | extract_btn = st.button("Extract Data",type="primary", use_container_width=False)
112 | st.markdown('---')
113 | st.session_state.extractedJSON = st.empty()
114 | st.session_state.onlyJSON = st.empty()
115 |
116 |
117 |
118 | if extract_btn:
119 | prompt = f"""<|input|>\n### Template:
120 | {st.session_state.jsonformat}
121 |
122 | ### Text:
123 | {st.session_state.origintext}
124 | <|output|>
125 | """
126 | print(prompt)
127 | with st.spinner("Thinking..."):
128 | start = datetime.datetime.now()
129 | output = llm.create_completion(
130 |     prompt=prompt,
131 |     temperature=st.session_state.temperature,
132 |     repeat_penalty=st.session_state.repeat,
133 |     stop=sTOPS,
134 |     max_tokens=st.session_state.maxlength,
135 |     stream=False)
136 |
137 | delta = datetime.datetime.now() - start
138 | result = output['choices'][0]['text']
139 | st.write(result)
140 | adapter = result.replace("'",'"')
141 | final = json.loads(adapter)
142 | totalTokens = len(encoding.encode(prompt))+len(encoding.encode(result))
143 | totalseconds = delta.total_seconds()
144 | st.session_state.speed = totalTokens/totalseconds
145 | statspeed.markdown(f'💫 speed: {st.session_state.speed:.2f} t/s')
146 | totalstring = f"""GENERATED STRING
147 |
148 | {result}
149 | ---
150 |
151 | Generated in {delta}
152 |
153 | ---
154 |
155 | JSON FORMAT:
156 | """
157 | with st.session_state.extractedJSON:
158 | st.markdown(totalstring)
159 | st.session_state.onlyJSON.json(final)
160 | writehistory(st.session_state.logfilename,f'✨: {prompt}')
161 | writehistory(st.session_state.logfilename,f'🌀: {result}')
162 | writehistory(st.session_state.logfilename,f'---\n\n')
163 |
164 |
--------------------------------------------------------------------------------
/example_IME7D_log.txt:
--------------------------------------------------------------------------------
1 | 2024-08-06 07:30:44.882140
2 |
3 | Your own LocalGPT with 🌀 NuExtract-tiny
4 | ---
5 | 🧠🫡: You are a helpful assistant.
6 | 🌀: How may I help you today?
7 |
8 | ---
9 |
10 |
11 | ✨: <|input|>
12 | ### Template:
13 | {
14 | "Applicant": {
15 | "Name": "",
16 | "Surname": "",
17 | "email address":"",
18 | "phone number":"",
19 | "Date of Birth":""
20 | }
21 | }
22 |
23 | ### Text:
24 | CURRICULUM VITAE Name Fabio ElMassry
25 | Position Instrument & Control Engineer
26 | Date of Birth 05/04/1986
27 | Nationality Egyptian
28 | Status Married
29 | Languages English/Italian/Arabic
30 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control
31 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration &
32 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack,
33 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General:
34 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas
35 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical
36 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering,
37 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting &
38 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of
39 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria
40 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach,
41 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant
42 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check,
43 | graphic configuration, PLC programming and configuration.
44 | x Troubleshooting system and process alarms using DCS Honeywell
45 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems.
46 | <|output|>
47 |
48 | 🌀: {
49 | "Applicant": {
50 | "Name": "Fabio ElMassry",
51 | "Surname": "",
52 | "email address":"fabioElMassry@yahoo.fr",
53 | "phone number":"+39685737365",
54 | "Date of Birth": "05/04/1986"
55 | }
56 | }
57 |
58 | ---
59 |
60 |
61 | ✨: <|input|>
62 | ### Template:
63 | {
64 | "Details": {
65 | "Nationality": "",
66 | "Country": "",
67 | "Languages":"",
68 | "phone number": []
69 | }
70 | }
71 |
72 | ### Text:
73 | CURRICULUM VITAE Name Fabio ElMassry
74 | Position Instrument & Control Engineer
75 | Date of Birth 05/04/1986
76 | Nationality Egyptian
77 | Status Married
78 | Languages English/Italian/Arabic
79 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control
80 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration &
81 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack,
82 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General:
83 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas
84 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical
85 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering,
86 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting &
87 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of
88 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria
89 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach,
90 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant
91 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check,
92 | graphic configuration, PLC programming and configuration.
93 | x Troubleshooting system and process alarms using DCS Honeywell
94 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems.
95 | <|output|>
96 |
97 | 🌀: {
98 | "Details": {
99 | "Nationality": "Egyptian",
100 | "Country": "",
101 | "Languages": [
102 | "English",
103 | "Italian",
104 | "Arabic"
105 | ],
106 | "phone number": [
107 | "+39685737365"
108 | ]
109 | }
110 | }
111 |
112 | ---
113 |
114 |
115 | ✨: <|input|>
116 | ### Template:
117 | {
118 | "Details": {
119 | "Nationality": "",
120 | "Country": "",
121 | "Languages": []
122 | }
123 | }
124 |
125 | ### Text:
126 | CURRICULUM VITAE Name Fabio ElMassry
127 | Position Instrument & Control Engineer
128 | Date of Birth 05/04/1986
129 | Nationality Egyptian
130 | Status Married
131 | Languages English/Italian/Arabic
132 | Phone/ mail +39685737365/ fabioElMassry@yahoo.frEducation: 2002 – 2007 - Bachelor of Engineering degree in Automation EngineeringProfessional Certification: Specialized instrumentation engineering (Algerian Petroleum Institute),Trainings: ICSS (DCS, ESD, F&G) troubleshooting & maintenance, GE Control
133 | Turbine Mark VIe maintenance, PLC Programming, DCS configuration &
134 | Troubleshooting, CCTV & Access control (Germany), UPS maintenance.IT skills: Proficient user of LINUX & Microsoft Windows, Microsoft Office pack,
135 | Matlab, SQL, AutoCAD, PRIMAVIRA, C++, Hysys.General:
136 | Strong experience as instrumentation and control engineer at SONATRACH, the Algerian National Oil & Gas
137 | Company and its partners within Oil & Gas Processing Plants, Refineries, Petrochemical and Chemical
138 | Plants, and mining including EPC projects, plant maintenance, commissioning & startup, engineering,
139 | revamping. I have a wide range experience in configuration, commissioning programming, troubleshooting &
140 | diagnostic of ICSS, DCS, ESD, F&G, SCADA, PLC, GE Turbine control Mark V, VI and VIe, all kinds of
141 | instrument & machines and familiar with standards and codes IEC, ISA….Work experienceMay 2014 – Present: ENI-Sonatrach (Joint venture) Algeria
142 | Instrument & Control engineerWorked as Instrument & Control Engineer at Joint-Venture Eni-Sonatrach,
143 | Hassi Messoud, Algeria, on CPF, Central Processing facilities, plant
144 | including three Gas & Crude Treatment trains and utilities facilities.x Maintenance of ICSS system (DCS, ESD, Fire & gas), loop check,
145 | graphic configuration, PLC programming and configuration.
146 | x Troubleshooting system and process alarms using DCS Honeywell
147 | TPS/TDC3000, ESD FSC/ Triconex, F&G, BMS and BOP systems.
148 | <|output|>
149 |
150 | 🌀: {
151 | "Details": {
152 | "Nationality": "Egyptian",
153 | "Country": "",
154 | "Languages": [
155 | "English",
156 | "Italian",
157 | "Arabic"
158 | ]
159 | }
160 | }
161 |
162 | ---
163 |
164 |
165 |
--------------------------------------------------------------------------------
/instrucitons.txt:
--------------------------------------------------------------------------------
1 | Felladrin/gguf-NuExtract-tiny
2 | https://huggingface.co/Felladrin/gguf-NuExtract-tiny/tree/main
3 |
4 | Original Repo
5 | https://huggingface.co/numind/NuExtract-tiny
6 |
7 |
8 | pip install llama-cpp-python==0.2.85 tiktoken streamlit==1.36.0 huggingface-hub
9 | pip install langchain langchain-community faiss-cpu duckduckgo-search newspaper3k
10 | pip install pymupdf4llm strip_markdown
11 |
12 | ```
13 | >>> import strip_markdown
14 | >>> plain_text = strip_markdown.strip_markdown(markdown_text)  # markdown str in, plain text out
15 | ```
17 |
18 |
19 |
20 | MODEL CARD
21 | NuExtract_tiny is a version of Qwen1.5-0.5, fine-tuned on a private high-quality synthetic dataset for information extraction. To use the model, provide an input text (less than 2000 tokens) and a JSON template describing the information you need to extract.
22 |
23 | Note: This model is purely extractive, so all text output by the model is present as is in the original text. You can also provide an example of output formatting to help the model understand your task more precisely.
24 |
25 | Note: While this model provides good 0 shot performance, it is intended to be fine-tuned on a specific task (>=30 examples).
26 |
27 | We also provide a base (3.8B) and large(7B) version of this model: NuExtract and NuExtract-large
28 | https://huggingface.co/numind/NuExtract
29 | https://huggingface.co/numind/NuExtract-large
30 |
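The 2,000-token input limit above is worth checking before sending a document. A rough pre-flight check; the 0.75 words-per-token ratio is a common English-text heuristic, not an exact count (stapp.py uses tiktoken for real counts):

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~0.75 English words per token."""
    return int(len(text.split()) / 0.75)

doc = "some long document " * 100
if approx_tokens(doc) > 2000:
    print("too long for NuExtract-tiny; chunk or truncate the text first")
else:
    print("ok")
```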
31 |
32 | llama_model_loader: loaded meta data with 27 key-value pairs and 290 tensors from model/NuExtract-tiny.gguf (version GGUF V3 (latest))
33 | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
34 | llama_model_loader: - kv 0: general.architecture str = qwen2
35 | llama_model_loader: - kv 1: general.type str = model
36 | llama_model_loader: - kv 2: general.name str = Qwen1.5 0.5B
37 | llama_model_loader: - kv 3: general.organization str = Qwen
38 | llama_model_loader: - kv 4: general.basename str = Qwen1.5
39 | llama_model_loader: - kv 5: general.size_label str = 0.5B
40 | llama_model_loader: - kv 6: general.license str = mit
41 | llama_model_loader: - kv 7: general.languages arr[str,1] = ["en"]
42 | llama_model_loader: - kv 8: qwen2.block_count u32 = 24
43 | llama_model_loader: - kv 9: qwen2.context_length u32 = 32768
44 | llama_model_loader: - kv 10: qwen2.embedding_length u32 = 1024
45 | llm_load_print_meta: format = GGUF V3 (latest)
46 | llm_load_print_meta: arch = qwen2
47 | llm_load_print_meta: vocab type = BPE
48 | llm_load_print_meta: n_vocab = 151936
49 | llm_load_print_meta: n_merges = 151387
50 | llm_load_print_meta: vocab_only = 0
51 | llm_load_print_meta: n_ctx_train = 32768
52 | llm_load_print_meta: n_embd = 1024
53 | llm_load_print_meta: model type = 0.5B
54 | llm_load_print_meta: model ftype = BF16
55 | llm_load_print_meta: model params = 463.99 M
56 | llm_load_print_meta: model size = 885.22 MiB (16.00 BPW)
57 | llm_load_print_meta: general.name = Qwen1.5 0.5B
58 | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
59 | llm_load_print_meta: EOS token = 151646 '<|end-output|>'
60 | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
61 | llm_load_print_meta: LF token = 148848 'ÄĬ'
62 | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
63 | 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"}
64 | Available chat formats from metadata: chat_template.default
65 | Using gguf chat template: {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
66 | You are a helpful assistant<|im_end|>
67 | ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
68 | ' + message['content'] + '<|im_end|>' + '
69 | '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
70 | ' }}{% endif %}
71 | Using chat eos_token: <|end-output|>
72 | Using chat bos_token: <|endoftext|>
73 |
74 |
75 |
CODE SNIPPET
```
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""], device="cpu"):
    # Pretty-print the JSON template so it matches the format NuExtract expects.
    schema = json.dumps(json.loads(schema), indent=4)
    input_llm = "<|input|>\n### Template:\n" + schema + "\n"
    # Optional few-shot examples; empty strings are skipped.
    for i in example:
        if i != "":
            input_llm += "### Example:\n" + json.dumps(json.loads(i), indent=4) + "\n"

    input_llm += "### Text:\n" + text + "\n<|output|>\n"
    inputs = tokenizer(input_llm, return_tensors="pt", truncation=True, max_length=4000).to(device)

    output = tokenizer.decode(model.generate(**inputs)[0], skip_special_tokens=True)
    # The decoded string echoes the prompt; keep only what sits between the markers.
    return output.split("<|output|>")[1].split("<|end-output|>")[0]


model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract-tiny", trust_remote_code=True)

# Fall back to CPU when no CUDA device is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

text = """We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: https://github.com/mistralai/mistral-src
Webpage: https://mistral.ai/news/announcing-mistral-7b/"""

schema = """{
    "Model": {
        "Name": "",
        "Number of parameters": "",
        "Number of max token": "",
        "Architecture": []
    },
    "Usage": {
        "Use case": [],
        "Licence": ""
    }
}"""

prediction = predict_NuExtract(model, tokenizer, text, schema, example=["", "", ""], device=device)
print(prediction)
```
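The transformers snippet needs the full PyTorch stack; since this repo ships the GGUF build (see model/yourGGUFhere.md and singleExtract.py), the same prompt can be run through llama-cpp-python instead. A sketch under assumptions: the context size and generation settings are mine, and `<|end-output|>` is used as the stop word because the load log above reports it as the EOS token:

```python
def clean_output(raw):
    """Strip the <|end-output|> marker in case the model emits it anyway."""
    return raw.split("<|end-output|>")[0].strip()


def run_extraction(model_path, prompt):
    # Import kept local so clean_output stays usable without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)
    out = llm.create_completion(
        prompt,
        max_tokens=512,
        temperature=0.0,          # extraction should be deterministic
        stop=["<|end-output|>"],  # EOS token reported in the GGUF metadata above
    )
    return clean_output(out["choices"][0]["text"])
```

Usage would look like `run_extraction("model/NuExtract-tiny.gguf", prompt)` where `prompt` follows the `<|input|> ... <|output|>` layout of the snippet above.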
131 |
132 |
133 |
134 |
135 |
136 |
137 | Zero configuration Local LLMs for everyone!
138 |
139 | LM Studio: experience the magic of LLMs with Zero technical expertise
140 | Your guide to Zero configuration Local LLMs on any computer.
141 |
142 |
143 | https://medium.com/mlearning-ai/metadata-metamorphosis-from-plain-data-to-enhanced-insights-with-retrieval-augmented-generation-8d1a8d5a6061?sk=70e8abf76409be379bce7509d35afe05
144 |
145 |
146 | On the command line, including multiple files at once
147 | I recommend using the huggingface-hub Python library:
148 |
149 | pip3 install huggingface-hub
150 |
151 | Then you can download any individual model file to the current directory, at high speed, with a command like this:
152 |
153 | huggingface-cli download TheBloke/Panda-7B-v0.1-GGUF panda-7b-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
154 |
155 |
156 | Example llama.cpp command
157 | Make sure you are using llama.cpp from commit d0cee0d or later.
158 |
159 | ./main -ngl 35 -m panda-7b-v0.1.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"
160 |
161 | Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
162 |
163 | Change -c 32768 to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.
164 |
165 | If you want to have a chat-style conversation, replace the -p argument with -i -ins
166 |
167 |
168 |
169 |
170 |
171 | Because human conversational language is unstructured, the input to LLMs is conversational and unstructured, in the form of prompt engineering.
172 | The output of LLMs is likewise conversational and unstructured: a highly succinct form of natural language generation (NLG).
173 | LLMs introduced functionality to fine-tune and create custom models, and fine-tuning was initially the primary approach to customising LLMs.
174 | This approach has fallen into disfavour for three reasons:
175 | LLMs have both a generative and a predictive side, and the generative power is easier to leverage than the predictive power. If the generative side is presented with contextual, concise and relevant data at inference time, hallucination is largely mitigated.
176 | Fine-tuning LLMs involves training-data curation, transformation and LLM-related cost. Fine-tuned models are frozen at a definite time-stamp and still demand innovation around prompt creation and data presentation to the LLM.
177 | When classifying text based on pre-defined classes or intents, NLU still has an advantage with built-in efficiencies.
178 | I hasten to add that there have been significant advances in improving no-code to low-code UIs and in reducing fine-tuning costs. A prudent approach is a hybrid solution that draws on the benefits of both fine-tuning and RAG.
179 | The aim of fine-tuning LLMs is to engender more accurate and succinct reasoning and answers.
180 | Hallucination, where the LLM returns highly plausible but incorrect answers, remains one of the big problems with LLMs. The proven mitigation is to supply highly relevant, contextual prompts at inference time and to ask the LLM to follow chain-of-thought reasoning.
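The inference-time grounding described above reduces to a simple prompt-assembly step once retrieval has produced the relevant chunks. A sketch with illustrative wording and names (no specific RAG framework's API is assumed):

```python
def build_grounded_prompt(question, retrieved_chunks):
    """Assemble a RAG-style prompt: retrieved context first, then the question,
    with an explicit instruction to reason step by step over the context only."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using ONLY the context below. Think step by step.\n\n"
        "### Context:\n" + context + "\n\n"
        "### Question:\n" + question + "\nAnswer:"
    )


prompt = build_grounded_prompt(
    "What license is Mistral 7B released under?",
    ["Our models are released under the Apache 2.0 license."],
)
```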
181 |
182 |
183 |
184 |
185 | $env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
186 | pip install llama-cpp-python[server]==0.2.53
187 | python -m llama_cpp.server --help
188 | python -m llama_cpp.server --host 0.0.0.0 --model model/Quyen-Mini-v0.1.Q4_K_M.gguf --chat_format chatml --n_ctx 8192 --n_gpu_layers 25
189 |
190 |
191 | python -m llama_cpp.server --host 0.0.0.0 --model model/llama-2-7b-chat.Q4_K_M.gguf --chat_format llama-2 --n_ctx 4096 --n_gpu_layers 33
192 |
193 |
194 | llama_cpp.llama_chat_format.LlamaChatCompletionHandlerNotFoundException: Invalid chat handler: llama2 (valid formats: ['llama-2', 'alpaca', 'qwen', 'vicuna', 'oasst_llama', 'baichuan-2', 'baichuan', 'openbuddy', 'redpajama-incite', 'snoozy', 'phind', 'intel', 'open-orca', 'mistrallite', 'zephyr', 'pygmalion', 'chatml', 'mistral-instruct', 'chatglm3', 'openchat', 'saiga', 'gemma', 'functionary', 'functionary-v2', 'functionary-v1', 'chatml-function-calling'])
195 |
196 |
197 | python -m llama_cpp.server --host 0.0.0.0 --model model/qwen1_5-4b-chat-q6_k.gguf --chat_format chatml --n_ctx 32768 --n_gpu_layers 41
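Once one of the server commands above is running, llama_cpp.server exposes an OpenAI-compatible REST API (port 8000 by default). A minimal stdlib-only client sketch; the endpoint path and payload fields follow the OpenAI chat-completions shape, and the base URL here is an assumption:

```python
import json
import urllib.request


def build_chat_payload(user_message, temperature=0.2):
    """OpenAI-style chat payload accepted by the server's /v1/chat/completions endpoint."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }


def chat(base_url, user_message):
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("http://localhost:8000", "Say hello in one word."))
```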
198 |
199 |
--------------------------------------------------------------------------------