├── TableComparison.png
├── QWENsettingsNODRAFT.me
├── QWENsettingsDRAFT.me
├── Qwen2.5-7B-Instruct_CPPSERV-NODRAFT_5J885.csv
├── Qwen2.5-7B-Instruct_CPPSERV-SPECDECODE_1JGXF.csv
├── README.md
├── promptLibv2DRAFT.py
├── 102.QWEN2.5-instruct_LlamaCPPSERVER-NODRAFT_API_promptTest.py
├── 101.QWEN2.5-instruct_LlamaCPPSERVER-DRAFT_API_promptTest.py
├── Qwen2.5-7B-Instruct_CPPSERV-NODRAFT_5J885_log.txt
└── Qwen2.5-7B-Instruct_CPPSERV-SPECDECODE_1JGXF_log.txt
/TableComparison.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fabiomatricardi/llamacpp-speculative/main/TableComparison.png
--------------------------------------------------------------------------------
/QWENsettingsNODRAFT.me:
--------------------------------------------------------------------------------
1 | {
2 | "port" : 8001,
3 | "models": [
4 | {
5 | "model": "models/Qwen2.5-7B-Instruct-Q4_0.gguf",
6 | "model_alias": "QWEN2.5",
7 | "n_ctx": 8196
8 | }
9 | ]
10 | }
--------------------------------------------------------------------------------
/QWENsettingsDRAFT.me:
--------------------------------------------------------------------------------
1 | {
2 | "port" : 8001,
3 | "models": [
4 | {
5 | "model": "models/Qwen2.5-7B-Instruct-Q4_0.gguf",
6 | "model_alias": "QWEN2.5",
7 | "n_ctx": 8196,
8 | "draft_model" : "models/Qwen2.5-0.5B-Instruct-Q8_0.gguf",
9 | "draft_model_num_pred_tokens" : 2
10 | }
11 | ]
12 | }
--------------------------------------------------------------------------------
/Qwen2.5-7B-Instruct_CPPSERV-NODRAFT_5J885.csv:
--------------------------------------------------------------------------------
1 | #,TASK,GENSPEED,INFSPEED,TTFT,TIME
2 | 1,introduction,6.06,7.50,1.09,10.40
3 | 2,explain in one sentence,5.74,7.25,0.97,6.62
4 | 3,explain in three paragraphs,6.68,6.99,1.08,38.19
5 | 4,say 'I am ready',0.91,95.10,3.75,4.40
6 | 5,summarize,5.43,19.34,3.35,29.11
7 | 6,Summarize in two sentences,4.34,36.18,3.57,12.91
8 | 7,Write in a list the three main key points - format output,4.70,33.81,3.68,14.67
9 | 8,Table of Contents,4.53,31.53,3.70,15.67
10 | 9,RAG,4.55,29.55,3.72,17.60
11 | 10,Truthful RAG,0.71,108.37,3.77,4.25
12 | 11,write content from a reference,5.69,10.95,3.42,82.94
13 | 12,extract 5 topics,5.84,15.18,3.69,43.68
14 | 13,Creativity: 1000 words SF story,6.41,6.88,2.04,149.23
15 | 14,Reflection prompt,6.46,9.62,3.31,108.55
16 |
--------------------------------------------------------------------------------
/Qwen2.5-7B-Instruct_CPPSERV-SPECDECODE_1JGXF.csv:
--------------------------------------------------------------------------------
1 | ,,,,,,,,,,
2 | ,,WITH SPECULATIVE DECODING,,,,,NO DRAFT MODEL,,,
3 | #,TASK,GENSPEED,INFSPEED,TTFT,TIME,,GENSPEED,INFSPEED,TTFT,TIME
4 | 1,introduction,5.01,6.72,1.21,8.78,,6.06,7.5,1.09,10.4
5 | 2,explain in one sentence,4.4,5.56,1.15,8.64,,5.74,7.25,0.97,6.62
6 | 3,explain in three paragraphs,4.86,5.17,0.99,37.89,,6.68,6.99,1.08,38.19
7 | 4,say 'I am ready',0.9,94.3,3.8,4.43,,0.91,95.1,3.75,4.4
8 | 5,summarize,4.07,13.71,3.35,42.03,,5.43,19.34,3.35,29.11
9 | 6,Summarize in two sentences,3.25,27.14,3.64,17.2,,4.34,36.18,3.57,12.91
10 | 7,Write in a list the three main key points - format output,4.25,29.14,3.67,17.16,,4.7,33.81,3.68,14.67
11 | 8,Table of Contents,3.59,24.95,3.6,19.8,,4.53,31.53,3.7,15.67
12 | 9,RAG,3.66,23.76,3.84,21.89,,4.55,29.55,3.72,17.6
13 | 10,Truthful RAG,0.73,112.25,3.8,4.11,,0.71,108.37,3.77,4.25
14 | 11,write content from a reference,4.43,8.72,3.41,101.72,,5.69,10.95,3.42,82.94
15 | 12,extract 5 topics,4.54,12.84,3.72,49.15,,5.84,15.18,3.69,43.68
16 | 13,Creativity: 1000 words SF story,4.54,4.79,2.05,279.29,,6.41,6.88,2.04,149.23
17 | 14,Reflection prompt,5.06,7.21,3.19,159.71,,6.46,9.62,3.31,108.55
18 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # llamacpp-speculative decoding
2 | Comparison of the llama-cpp-python server (llama.cpp) with and without speculative decoding
3 |
4 | > Is there really no benefit?
5 |
6 | In my findings, without offloading any layers to the GPU, there is no increase in speed.
7 |
8 | ## Comparison table
9 |
10 | ![Comparison table](TableComparison.png)
11 |
12 |
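The takeaway above can be sanity-checked directly from the per-task times logged in Qwen2.5-7B-Instruct_CPPSERV-SPECDECODE_1JGXF.csv. A minimal helper (the function name is mine, not from the test scripts) that compares total generation time with and without the draft model:

```python
def slowdown(time_draft: float, time_nodraft: float) -> float:
    """Ratio of total time with the draft model vs without it.
    A value > 1.0 means speculative decoding was slower (CPU-only run)."""
    return round(time_draft / time_nodraft, 2)

# Task 13 ("Creativity: 1000 words SF story"): 279.29 s vs 149.23 s
print(slowdown(279.29, 149.23))  # -> 1.87: nearly twice as slow with the draft model

# Task 1 ("introduction"): 8.78 s vs 10.40 s, one of the few wins for the draft
print(slowdown(8.78, 10.40))  # -> 0.84
```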
13 | ## STACK
14 | - llama-cpp-python[server] version 0.3.2, which also supports Granite3, OLMo and MoE models
15 | - OpenAI Python library to access the endpoints
16 | - automatic prompt evaluation
17 | - server settings as per [documentation](https://llama-cpp-python.readthedocs.io/en/latest/server/)
18 |
19 | ## Dependencies
20 | - llama-cpp-python[server]==0.3.2
21 | - openai
22 | - tiktoken
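The list above maps to a single install command (assuming pip and, for llama-cpp-python builds, a working C/C++ toolchain):

```shell
pip install "llama-cpp-python[server]==0.3.2" openai tiktoken
```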
23 |
24 |
25 | ## How to use
26 | ### with speculative decoding
27 | From one terminal window, run:
28 | ```shell
29 | python -m llama_cpp.server --config_file .\QWENsettingsDRAFT.me
30 | ```
31 | Using the following server settings:
32 | ```json
33 | {
34 | "port" : 8001,
35 | "models": [
36 | {
37 | "model": "models/Qwen2.5-7B-Instruct-Q4_0.gguf",
38 | "model_alias": "QWEN2.5",
39 | "n_ctx": 8196,
40 | "draft_model" : "models/Qwen2.5-0.5B-Instruct-Q8_0.gguf",
41 | "draft_model_num_pred_tokens" : 2
42 | }
43 | ]
44 | }
45 | ```
46 | In another terminal window, run:
47 | ```shell
48 | python .\101.QWEN2.5-instruct_LlamaCPPSERVER-DRAFT_API_promptTest.py
49 | ```
50 |
51 | ### WITHOUT speculative decoding
52 | From one terminal window, run:
53 | ```shell
54 | python -m llama_cpp.server --config_file .\QWENsettingsNODRAFT.me
55 | ```
56 | Using the following server settings:
57 | ```json
58 | {
59 | "port" : 8001,
60 | "models": [
61 | {
62 | "model": "models/Qwen2.5-7B-Instruct-Q4_0.gguf",
63 | "model_alias": "QWEN2.5",
64 | "n_ctx": 8196
65 | }
66 | ]
67 | }
68 | ```
69 | In another terminal window, run:
70 | ```shell
71 | python .\102.QWEN2.5-instruct_LlamaCPPSERVER-NODRAFT_API_promptTest.py
72 | ```
73 |
74 |
75 |
--------------------------------------------------------------------------------
/promptLibv2DRAFT.py:
--------------------------------------------------------------------------------
1 | """
2 | V2 changes
3 | added Time To First Token in the statistics ttft
4 | added some more prompts in the catalog
5 | - say 'I am ready'
6 | - modified for Llama3.2-1b Write in a list the three main key points - format output
7 |
8 | 20240929 FAMA
9 | """
10 |
11 | import random
12 | import string
13 | import tiktoken
14 |
15 | def createCatalog():
16 | """
17 | Create a dictionary with
18 | 'task' : description of the NLP task in the prompt
19 | 'prompt' : the instruction prompt for the LLM
20 | """
21 | context = """One of the things everybody in the West knows about China is that it is not a democracy, and is instead a regime run with an iron fist by a single entity, the Chinese Communist Party, whose leadership rarely acts transparently, running the country without the need for primary elections, alternative candidacies, etc.
22 | In general, those of us who live in democracies, with relatively transparent electoral processes, tend to consider the Chinese system undesirable, little more than a dictatorship where people have no say in who governs them.
23 | That said, among the “advantages” of the Chinese system is that because the leadership never has to put its legitimacy to the vote, it can carry out very long-term planning in the knowledge that another administration isn’t going to come along and change those plans.
24 | Obviously, I put “advantages” in quotation marks because, as democrats, most of my readers would never be willing to sacrifice their freedom for greater planning, but there is no doubt that China, since its system works like this and its population seems to have accepted it for generations, intends to turn this into a comparative advantage, the term used in business when analyzing companies.
25 | It turns out that China’s capacity for long-term planning is achieving something unheard of in the West: it seems the country reached peak carbon dioxide and greenhouse gas emissions in 2023, and that the figures for 2024, driven above all by a determined increase in the installation of renewable energies, are not only lower, but apparently going to mark a turning point.
26 | China and India were until recently the planet’s biggest polluters, but they now offer a model for energy transition (there is still a long way to go; but we are talking about models, not a done deal).
27 | It could soon be the case that the so-called developing countries will be showing the West the way forward."""
28 | catalog = []
29 | prmpt_tasks = ["introduction",
30 | "explain in one sentence",
31 | "explain in three paragraphs",
32 | "say 'I am ready'",
33 | "summarize",
34 | "Summarize in two sentences",
35 | "Write in a list the three main key points - format output",
36 | "Table of Contents",
37 | "RAG",
38 | "Truthful RAG",
39 | "write content from a reference",
40 | "extract 5 topics",
41 | "Creativity: 1000 words SF story",
42 | "Reflection prompt"
43 | ]
44 | prmpt_coll = [
45 | """Hi there I am Fabio, a Medium writer. who are you?""",
46 | """explain in 1 sentence what is science.\n""",
47 | """explain only in 3 paragraphs what is artificial intelligence.\n""",
48 | f"""read the following text and when you are done say "I am ready".
49 |
50 | [text]
51 | {context}
52 | [end of text]
53 |
54 | """,
55 | f"""summarize the following text:
56 | [text]
57 | {context}
58 | [end of text]
59 |
60 | """,
61 | f"""Write a summary of the following text. Use only 2 sentences.
62 | [text]
63 | {context}
64 | [end of text]
65 |
66 | """,
67 | f"""1. extract the 3 key points from the provided text
68 | 2. format the output as a python list.
69 | [text]
70 | {context}
71 | [end of text]
72 | Return only the python list.
73 | """,
74 | f"""write a "table of contents" of the provided text. Be simple and concise.
75 | [text]
76 | {context}
77 | [end of text]
78 |
79 | "table of content":
80 | """,
81 | f"""Reply to the question using the provided context. If the answer is not contained in the text say "unanswerable".
82 | [context]
83 | {context}
84 | [end of context]
85 |
86 | question: what did China achieve with its long-term planning?
87 | answer:
88 | """,
89 | f"""Reply to the question only using the provided context. If the answer is not contained in the provided context say "unanswerable".
90 |
91 | question: who is Anne Frank?
92 |
93 | [context]
94 | {context}
95 | [end of context]
96 |
97 | Remember: if you cannot answer based on the provided context, say "unanswerable"
98 |
99 | answer:
100 | """,
101 |
102 | f"""Using the following text as a reference, write a 5-paragraph essay about "the benefits of China's economic model".
103 |
104 | [text]
105 | {context}
106 | [end of text]
107 | Remember: use the information provided and write exactly 5 paragraphs.
108 | """,
109 | f"""List 5 most important topics from the following text:
110 | [text]
111 | {context}
112 | [end of text]
113 | """,
114 | """Science Fiction: The Last Transmission - Write a story that takes place entirely within a spaceship's cockpit as the sole surviving crew member attempts to send a final message back to Earth before the ship's power runs out. The story should explore themes of isolation, sacrifice, and the importance of human connection in the face of adversity. 800-1000 words.
115 |
116 | """,
117 | """You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure:
118 | 1. Begin with a section.
119 | 2. Inside the thinking section:
120 | a. Briefly analyze the question and outline your approach.
121 | b. Present a clear plan of steps to solve the problem.
122 | c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps.
123 | 3. Include a section for each idea where you:
124 | a. Review your reasoning.
125 | b. Check for potential errors or oversights.
126 | c. Confirm or adjust your conclusion if necessary.
127 | 4. Be sure to close all reflection sections.
128 | 5. Close the thinking section with .
129 | 6. Provide your final answer in an