├── model │   └── yourGGUFhere.md ├── README.md ├── writer.py ├── GemmaWriter.py └── instructions.txt /model/yourGGUFhere.md: -------------------------------------------------------------------------------- 1 | HuggingFace Hub Model Card: https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF 2 | 3 | 4 | Model filename: Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Gemma-The-Writer-9B-GGUF on your Local PC 2 | A full-fledged writing assistant with LlamaCPP and Python on your local PC 3 | 4 | 5 |

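If you prefer to script the model download described in the steps below, a minimal sketch using the `huggingface_hub` package (an extra install, not listed in this repo's dependencies) could look like this:

```python
# Sketch: fetch the GGUF into the model/ subdirectory expected by writer.py.
# Assumes: pip install huggingface_hub  (not part of this repo's requirements)
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="DavidAU/Gemma-The-Writer-9B-GGUF",
    filename="Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf",
    local_dir="model",
)
```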
6 | Run a 9B-parameter writing assistant based on Gemma2 on your PC 7 | - No GPU is required 8 | - text interface in the terminal 9 | - speeds of up to 1.7 tokens per second 10 | - text streaming in the terminal 11 | - generation and speed KPIs 12 | - chat history saved to a log file 13 | 14 | 15 | ### Virtual environment and requirements 16 | Tested with Python 3.11 on Windows 11 17 | 18 | #### Create a `venv` 19 | ``` 20 | python -m venv venv 21 | venv\Scripts\activate 22 | ``` 23 | 24 | #### Install the dependencies 25 | ``` 26 | pip install llama_cpp_python==0.2.85 tiktoken 27 | ``` 28 | 29 | #### Download the model from the repo into a subdirectory called `model` 30 | - because of only 16 GB of RAM, the Q3 quantization is used here 31 | - REPO: [https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF](https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF) 32 | - FileName: Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf 33 | 34 | From the terminal run 35 | ``` 36 | python .\writer.py 37 | ``` 38 | 39 | -------------------------------------------------------------------------------- /writer.py: -------------------------------------------------------------------------------- 1 | # Chat with an intelligent assistant in your terminal 2 | # MODEL: https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF 3 | # model/Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf 4 | import sys 5 | from time import sleep 6 | import warnings 7 | warnings.filterwarnings(action='ignore') 8 | import datetime 9 | import random 10 | import string 11 | import tiktoken 12 | 13 | encoding = tiktoken.get_encoding("r50k_base") # token counter: len(encoding.encode(yourtext)) 14 | modelname = 'Gemma-The-Writer-9B' 15 | def countTokens(text): 16 | # reuse the module-level tiktoken encoding 17 | numoftokens = len(encoding.encode(text)) 18 | return numoftokens 19 | 20 | def writehistory(filename,text): 21 | with open(filename, 'a', encoding='utf-8') as f: 22 | f.write(text) 23 | f.write('\n') 24 | # the with-block closes the file automatically 25 | 26 | def genRANstring(n): 27 | """ 28 | n = int number of chars to randomize 29 | """ 30 | N = n 31 | res = ''.join(random.choices(string.ascii_uppercase + 32 | string.digits, k=N)) 33 | return res 34 | 35 | # create THE LOG FILE 36 | logfile = f'{genRANstring(5)}_log.txt' 37 | logfilename = logfile 38 | #Write the first 2 messages into the history log 39 | writehistory(logfilename,f'{str(datetime.datetime.now())}\n\nYour own LocalGPT with 💻 {modelname}\n---\n🧠🫡: You are a helpful assistant.') 40 | writehistory(logfilename,f'💻: How can I assist you today in writing?') 41 | 42 | print("\033[95;3;6m") 43 | print("1. Waiting 10 seconds for the API to load...") 44 | from llama_cpp import Llama 45 | llm = Llama( 46 | model_path='model/Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf', 47 | n_gpu_layers=0, 48 | temperature=1.1, 49 | top_p=0.5, 50 | n_ctx=8192, 51 | max_tokens=1500, 52 | repeat_penalty=1.178, 53 | stop=['<eos>'], 54 | verbose=False, 55 | ) 56 | print("2. Model Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf loaded with LlamaCPP...") 57 | print("\033[0m") #reset all 58 | history = [] 59 | print("\033[92;1m") 60 | print(f'📝Logfile: {logfilename}') 61 | counter = 1 62 | while True: 63 | # Reset history after 5 turns 64 | if counter > 5: 65 | history = [] 66 | userinput = "" 67 | print("\033[1;30m") #dark grey 68 | print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit!
to exit the chatroom:") 69 | print("\033[91;1m") #red 70 | lines = sys.stdin.readlines() 71 | for line in lines: 72 | userinput += line # readlines() already keeps the trailing newline 73 | if lines and "quit!" in lines[0].lower(): 74 | print("\033[0mBYE BYE!") 75 | break 76 | history.append({"role": "user", "content": userinput}) 77 | print("\033[92;1m") 78 | # Prepare the assistant message for the chat history 79 | new_message = {"role": "assistant", "content": ""} 80 | # Start the generation loop 81 | full_response = "" 82 | firstround = 0 # currently unused 83 | start = datetime.datetime.now() 84 | print("💻 > ", end="", flush=True) 85 | for chunk in llm.create_chat_completion( 86 | messages=history, 87 | temperature=1.2, 88 | repeat_penalty=1.178, 89 | stop=['<eos>'], 90 | max_tokens=1500, 91 | stream=True,): 92 | try: 93 | if chunk["choices"][0]["delta"]["content"]: 94 | print(chunk["choices"][0]["delta"]["content"], end="", flush=True) 95 | full_response += chunk["choices"][0]["delta"]["content"] 96 | except (KeyError, IndexError): # first/last stream chunks carry no "content" key 97 | pass 98 | new_message["content"] = full_response 99 | history.append(new_message) 100 | counter += 1 101 | delta = datetime.datetime.now() - start 102 | totalseconds = delta.total_seconds() 103 | output = full_response 104 | prompttokens = countTokens(userinput) 105 | assistanttokens = countTokens(output) 106 | totaltokens = prompttokens + assistanttokens 107 | speed = totaltokens/totalseconds 108 | genspeed = assistanttokens/totalseconds 109 | stats = f'''--- 110 | Prompt Tokens: {prompttokens} 111 | Output Tokens: {assistanttokens} 112 | TOTAL Tokens: {totaltokens} 113 | >>>⏱️Inference time: {delta} 114 | >>>🧮Inference speed: {speed:.3f} t/s 115 | >>>🏃‍♂️Generation speed: {genspeed:.3f} t/s 116 | 📝Logfile: {logfilename} 117 | 118 | ''' 119 | print(stats) 120 | writehistory(logfilename,f'''👨‍💻: {userinput} 121 | 💻 > {full_response} 122 | {stats} 123 | 📝Logfile: {logfilename}''') 124 | -------------------------------------------------------------------------------- /GemmaWriter.py: -------------------------------------------------------------------------------- 1 | # Chat with an intelligent assistant in your terminal 2 | # MODEL: https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF 3 | # model/Gemma-The-Writer-9B-D_AU-Q4_k_m.gguf 4 | import sys 5 | from time import sleep 6 | import warnings 7 | warnings.filterwarnings(action='ignore') 8 | import datetime 9 | import random 10 | import string 11 | import tiktoken 12 | import argparse 13 | 14 | #Add a GPU argument to the parser 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument("-g", "--gpu", action="store_true") 17 | 18 | args = parser.parse_args() 19 | GPU = args.gpu 20 | if GPU: 21 | ngpu_layers = 28 22 | print(f'Selected GPU: offloading {ngpu_layers} layers...') 23 | else: 24 | ngpu_layers = 0 25 | print('Loading Model on CPU only......') 26 | 27 | encoding = tiktoken.get_encoding("r50k_base") # token counter: len(encoding.encode(yourtext)) 28 | modelname = 'Gemma-The-Writer-9B' 29 | def countTokens(text): 30 | # reuse the module-level tiktoken encoding 31 | numoftokens = len(encoding.encode(text)) 32 | return numoftokens 33 | 34 | def writehistory(filename,text): 35 | with open(filename, 'a', encoding='utf-8') as f: 36 | f.write(text) 37 | f.write('\n') 38 | # the with-block closes the file automatically 39 | 40 | def genRANstring(n): 41 | """ 42 | n = int number of chars to randomize 43 | """ 44 | N = n 45 | res = ''.join(random.choices(string.ascii_uppercase + 46 | string.digits, k=N)) 47 | return res 48 | 49 | # create THE LOG FILE 50 | logfile = f'{genRANstring(5)}_log.txt' 51 |
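# a random 5-character prefix gives each chat session its own log file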
logfilename = logfile 52 | #Write the first 2 messages into the history log 53 | writehistory(logfilename,f'{str(datetime.datetime.now())}\n\nYour own LocalGPT with 💻 {modelname}\n---\n🧠🫡: You are a helpful assistant.') 54 | writehistory(logfilename,f'💻: How can I assist you today in writing?') 55 | 56 | print("\033[95;3;6m") 57 | print("1. Waiting 10 seconds for the API to load...") 58 | from llama_cpp import Llama 59 | llm = Llama( 60 | model_path='model/Gemma-The-Writer-9B-D_AU-Q4_k_m.gguf', 61 | n_gpu_layers=ngpu_layers, 62 | temperature=1.1, 63 | top_p=0.5, 64 | n_ctx=8192, 65 | max_tokens=1500, 66 | repeat_penalty=1.178, 67 | stop=['<eos>'], 68 | verbose=False, 69 | ) 70 | print("2. Model Gemma-The-Writer-9B-D_AU-Q4_k_m.gguf loaded with LlamaCPP...") 71 | print("\033[0m") #reset all 72 | history = [] 73 | print("\033[92;1m") 74 | print(f'📝Logfile: {logfilename}') 75 | counter = 1 76 | while True: 77 | # Reset history after 5 turns 78 | if counter > 5: 79 | history = [] 80 | userinput = "" 81 | print("\033[1;30m") #dark grey 82 | print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:") 83 | print("\033[91;1m") #red 84 | lines = sys.stdin.readlines() 85 | for line in lines: 86 | userinput += line # readlines() already keeps the trailing newline 87 | if lines and "quit!" in lines[0].lower(): 88 | print("\033[0mBYE BYE!") 89 | break 90 | history.append({"role": "user", "content": userinput}) 91 | print("\033[92;1m") 92 | # Prepare the assistant message for the chat history 93 | new_message = {"role": "assistant", "content": ""} 94 | # Start the generation loop 95 | full_response = "" 96 | firstround = 0 # currently unused 97 | start = datetime.datetime.now() 98 | print("💻 > ", end="", flush=True) 99 | for chunk in llm.create_chat_completion( 100 | messages=history, 101 | temperature=1.2, 102 | repeat_penalty=1.178, 103 | stop=['<eos>'], 104 | max_tokens=1500, 105 | stream=True,): 106 | try: 107 | if chunk["choices"][0]["delta"]["content"]: 108 | print(chunk["choices"][0]["delta"]["content"], end="", flush=True) 109 | full_response += chunk["choices"][0]["delta"]["content"] 110 | except (KeyError, IndexError): # first/last stream chunks carry no "content" key 111 | pass 112 | new_message["content"] = full_response 113 | history.append(new_message) 114 | counter += 1 115 | delta = datetime.datetime.now() - start 116 | totalseconds = delta.total_seconds() 117 | output = full_response 118 | prompttokens = countTokens(userinput) 119 | assistanttokens = countTokens(output) 120 | totaltokens = prompttokens + assistanttokens 121 | speed = totaltokens/totalseconds 122 | genspeed = assistanttokens/totalseconds 123 | stats = f'''--- 124 | Prompt Tokens: {prompttokens} 125 | Output Tokens: {assistanttokens} 126 | TOTAL Tokens: {totaltokens} 127 | >>>⏱️Inference time: {delta} 128 | >>>🧮Inference speed: {speed:.3f} t/s 129 | >>>🏃‍♂️Generation speed: {genspeed:.3f} t/s 130 | 📝Logfile: {logfilename} 131 | 132 | ''' 133 | print(stats) 134 | writehistory(logfilename,f'''👨‍💻: {userinput} 135 | 💻 > {full_response} 136 | {stats} 137 | 📝Logfile: {logfilename}''') 138 | -------------------------------------------------------------------------------- /instructions.txt: -------------------------------------------------------------------------------- 1 | MODEL: https://huggingface.co/DavidAU/Gemma-The-Writer-9B-GGUF 2 | 3 | 4 | Gemma-The-Writer-9B-GGUF 5 | 6 | This is a Gemma2 model merge of the top storytelling / writing models as noted at EQBench, tuned specifically for fiction, story, and writing. 7 | 8 | Due to the high stability and compressed nature of the model, you can also use it for general purposes, including roleplay.
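As the notes just below mention, the model requires the Gemma Instruct template. The `create_chat_completion()` calls in writer.py and GemmaWriter.py apply the chat template embedded in the GGUF automatically; only if you drop down to the raw completion API do you need to format the prompt yourself. A minimal sketch of that manual formatting (`llm` is the `Llama` instance from the scripts above; the turn markers are the standard Gemma2 ones):

```python
# Sketch: manual Gemma instruct formatting with the raw completion API.
# Only needed when bypassing create_chat_completion(), which applies the
# GGUF's built-in chat template for you.
prompt = (
    "<start_of_turn>user\n"
    "Write a six-word story about the sea.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
out = llm(prompt, max_tokens=64, stop=["<end_of_turn>", "<eos>"])
print(out["choices"][0]["text"])
```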
9 | 10 | This model requires the GEMMA Instruct template, and has an 8k context window that is extendable via RoPE to 32k or higher. 11 | 12 | Recommended: rep pen of 1.05 or higher, temp range 0-5. 13 | 14 | Example outputs below. 15 | 16 | 17 | llm_load_print_meta: model type = 9B 18 | llm_load_print_meta: model ftype = Q3_K - Small 19 | llm_load_print_meta: model params = 9.24 B 20 | llm_load_print_meta: model size = 4.03 GiB (3.75 BPW) 21 | llm_load_print_meta: general.name = GemmaTheWriter 22 | llm_load_print_meta: BOS token = 2 '<bos>' 23 | llm_load_print_meta: EOS token = 1 '<eos>' 24 | llm_load_print_meta: UNK token = 3 '<unk>' 25 | llm_load_print_meta: PAD token = 0 '<pad>' 26 | llm_load_print_meta: LF token = 227 '<0x0A>' 27 | llm_load_print_meta: EOT token = 107 '<end_of_turn>' 28 | 29 | 30 | 31 | Examples 32 | Example #1 at q8, Temp:1 33 | Science Fiction: The Last Transmission - Write a story that takes place entirely within a spaceship's cockpit as the sole surviving crew member attempts to send a final message back to Earth before the ship's power runs out. The story should explore themes of isolation, sacrifice, and the importance of human connection in the face of adversity. 800-1000 words. 34 | 35 | Example #2, Q8, rep pen 1.05, Temp:1 36 | In the style of Ernest Hemingway: Write a scene in a larger story about a soldier returning home from war. The scene is set in a bar. It should carry the jadedness of a seasoned combatant but with a wry humour. Write in Hemingway's voice, naturally: Use short, declarative sentences and simple, straightforward language. Include dialogue that is terse and revealing of character. Write with a keen sense of time and place. Capture realness and authenticity. 800-1000 words, past tense. 37 | 38 | Example #3 at q4km, Temp:1 39 | Start a 1000 word scene (vivid horror, 1st person, present tense, include thoughts) with: The skyscraper swayed, as she watched the window in front of her on the 21st floor explode... 40 | 41 | Example #4, Rep pen 1.1, Temp 1, Q8 42 | Epistolary Apocalyptic Survival: Write a story set in a post-apocalyptic world, told through a series of letters or messages left behind by survivors. The story should chronicle the challenges and triumphs of two small groups of survivors as they navigate the dangers of this new world and seek to rebuild their lives. The groups communicate through letters left in a drop box. The story proceeds as a new settlement is discovered. They are approached, and initially given entry on the ostensible promise of a trade deal, but things go sideways. Create distinct characters and write in their voice. The story will be set in New Zealand 30 years after a civilisational collapse. 800-1000 words. 43 | 44 | 45 | 46 | Example #1 at q8, Temp:1 47 | ASSISTANT> The red emergency lights pulsed like dying embers, casting long, grotesque shadows across Captain Elara Vance’s worn features. Her reflection, gaunt and haunted, stared back from the scratched transparisteel viewport, a grim testament to the desolate reality outside - an unending abyss of black velvet studded with indifferent stars. 48 | 49 | 50 | She ran a trembling hand through her tangled hair, its once vibrant auburn now dulled by despair and cosmic dust. The hum of failing life support was a mournful dirge, echoing the dying heartbeat of their vessel, the Icarus. A mere sliver of power remained, enough for one last transmission, a final desperate plea carried on the wings of radio waves back to Earth – her Earth.
51 | 52 | Three days ago, an EMP storm had ripped through them, turning the once humming marvel of interstellar engineering into a tomb adrift. Her crew, vibrant souls who had laughed at comet showers and toasted victories over rogue asteroids just weeks before, were now silent memories etched in her grief-stricken mind. The icy grip of solitude had tightened around her ever since, each passing hour an agonizing countdown to oblivion. 53 | 54 | She looked down at the console, its familiar layout a bittersweet comfort. Years spent mastering this symphony of circuits and controls had culminated in this – her final performance as conductor of hope. On the flickering screen, the message editor blinked insistently, beckoning her with hollow promises. Words danced on her mental canvas, yearning to be unleashed, but they refused to coalesce into anything coherent amidst the storm raging within her. 55 | 56 | Was it arrogance or naiveté that had led them to believe human resilience could conquer even the vast indifference of space? They’d been so sure – pioneers charting new frontiers, ambassadors of a species striving for the stars. Now, she was left with the bitter truth: humanity's reach often outpaced its grasp. 57 | 58 | Elara clenched her jaw, pushing down the rising tide of despair. This wasn't just about her anymore; it was about them – her crew, their families back on Earth who would never know what became of them. Their sacrifice wouldn’t be in vain if she could carry a sliver of their story back home. 59 | 60 | She began typing, each keystroke a battle against the encroaching silence. “This is Captain Elara Vance… Icarus… last transmission…” The words felt hollow, inadequate against the immensity of her loss. But she pressed on, weaving in fragments of their shared dreams – the first terraforming attempt on Proxima b, Dr. Chen’s infectious laughter during a simulated Martian storm, Liam's quiet poetry read under starlight filters. Each memory, a brushstroke painting a portrait of hope against the backdrop of tragedy. 61 | 62 | Hours bled into agonizing minutes as she poured her soul onto the digital canvas. The console hummed with strained energy, mirroring the tremor in her heart. With each dwindling byte, she felt a pang of guilt – was this enough? Would it pierce the cosmic veil and reach the expectant faces on Earth? 63 | 64 | Finally, the last words were typed: "We explored the universe, but found our greatest treasure within ourselves… remember us… remember hope.” She sent the transmission, her fingers lingering on the console as a final echo of their journey resonated through the empty ship. The lights flickered once more, then died, plunging the Icarus into an absolute silence that was both terrifying and strangely comforting – a universal lullaby bidding farewell. 65 | 66 | Elara leaned back against her chair, eyes fixed on the star-dusted void outside. A single tear rolled down her cheek, tracing a path through dust and grime. It wasn’t just about reaching Earth; it was about preserving their legacy, keeping their spirit alive even as the physical vessel crumbled around them. In that final transmission, in that whispered echo of hope carried on the currents of space, she had fulfilled her promise – not to return victorious, but to ensure they were never forgotten. 67 | 68 | 69 | The last captain of Icarus, adrift among stars, found solace in the bittersweet knowledge that even in oblivion, their story would live on. 
70 | 71 | --- 72 | Prompt Tokens: 70 73 | Output Tokens: 895 74 | TOTAL Tokens: 965 75 | --- 76 | >>>Inference speed: 1.640 t/s 77 | >>>Generation speed: 1.521 t/s 78 | 79 | 80 | --------------------------------------------------------------------------------
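The notes in instructions.txt above say the 8k context window is extendable via RoPE. In llama-cpp-python this is exposed through the `rope_freq_scale` argument of `Llama`; below is a minimal, untested sketch, where the 0.25 scale factor is my own assumption for a 4x linear stretch (8192/32768) rather than a setting from the model card:

```python
# Sketch: stretching the 8k-trained window toward 32k with linear RoPE scaling.
# rope_freq_scale=0.25 is an illustrative assumption (8192/32768), not a value
# recommended by the model card; quality may degrade at long range.
from llama_cpp import Llama

llm_long = Llama(
    model_path='model/Gemma-The-Writer-9B-D_AU-Q3_k_s.gguf',
    n_ctx=32768,            # requested context window
    rope_freq_scale=0.25,   # trained_ctx / requested_ctx
    n_gpu_layers=0,
    verbose=False,
)
```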