├── requirements.txt ├── assets ├── overview.png └── marqo_photo.jpeg ├── start_marqo.sh ├── .gitignore ├── main.py ├── news_summaries.txt ├── README.md └── news.py /requirements.txt: -------------------------------------------------------------------------------- 1 | marqo 2 | openai -------------------------------------------------------------------------------- /assets/overview.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iain-mackie/marqo-gpt3/HEAD/assets/overview.png -------------------------------------------------------------------------------- /assets/marqo_photo.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iain-mackie/marqo-gpt3/HEAD/assets/marqo_photo.jpeg -------------------------------------------------------------------------------- /start_marqo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | docker pull marqoai/marqo:0.0.6; 4 | docker rm -f marqo; 5 | docker run --name marqo -it --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:0.0.6 -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | pip-wheel-metadata/ 24 | share/python-wheels/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | MANIFEST 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .nox/ 44 | .coverage 45 | .coverage.* 46 | .cache 47 | nosetests.xml 48 | coverage.xml 49 | *.cover 50 | *.py,cover 51 | .hypothesis/ 52 | .pytest_cache/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | target/ 76 | 77 | # Jupyter Notebook 78 | .ipynb_checkpoints 79 | 80 | # IPython 81 | profile_default/ 82 | ipython_config.py 83 | 84 | # pyenv 85 | .python-version 86 | 87 | # pipenv 88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 91 | # install all needed dependencies. 92 | #Pipfile.lock 93 | 94 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 95 | __pypackages__/ 96 | 97 | # Celery stuff 98 | celerybeat-schedule 99 | celerybeat.pid 100 | 101 | # SageMath parsed files 102 | *.sage.py 103 | 104 | # Environments 105 | .env 106 | .venv 107 | env/ 108 | venv/ 109 | ENV/ 110 | env.bak/ 111 | venv.bak/ 112 | 113 | # Spyder project settings 114 | .spyderproject 115 | .spyproject 116 | 117 | # Rope project settings 118 | .ropeproject 119 | 120 | # mkdocs documentation 121 | /site 122 | 123 | # mypy 124 | .mypy_cache/ 125 | .dmypy.json 126 | dmypy.json 127 | 128 | # Pyre type checker 129 | .pyre/ 130 | 131 | .idea 132 | data/* -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | 2 | import marqo 3 | import json 4 | import openai 5 | 6 | from news import MARQO_DOCUMENTS 7 | 8 | # init GPT3 API 9 | openai.organization = None 10 | openai.api_key = None 11 | 12 | DOC_INDEX_NAME = 'news-index' 13 | output = './news_summaries.txt' 14 | queries = [ 15 | # question # date filter 16 | ('How is the US Midterm Election going?', None), 17 | ('How is COP27 progressing?', None), 18 | ('What is happening in business today?', '2022-11-09'), 19 | 20 | ] 21 | 22 | 23 | if __name__ == '__main__': 24 | 25 | print('Establishing connection to marqo client.') 26 | mq = marqo.Client(url='http://localhost:8882') 27 | 28 | ######################################################################### 29 | ######################### MARQO INDEXING ################################ 30 | ######################################################################### 31 | # mq.index(DOC_INDEX_NAME).delete() 32 | try: 33 | print(f'document index build: {mq.index(DOC_INDEX_NAME).get_stats()}') 34 | except KeyboardInterrupt: 35 | raise 36 | except: 37 | print('Indexing documents') 38 | mq.index(DOC_INDEX_NAME).add_documents(MARQO_DOCUMENTS) 39 | print('Done') 40 | 41 | 42 | 
######################################################################### 43 | ######################### GPT3 GENERATION ############################### 44 | ######################################################################### 45 | 46 | def get_no_context_prompt(question): 47 | """ GPT3 prompt without any context. """ 48 | return f'Question: {question}\n\nAnswer:' 49 | 50 | def get_context_prompt(question, context): 51 | """ GPT3 prompt with text-based context from marqo search. """ 52 | return f'Background: \n{context}\n\nQuestion: {question}\n\nAnswer:' 53 | 54 | def prompt_to_essay(prompt): 55 | """ Process GPT-3 prompt and clean the returned string. """ 56 | response = openai.Completion.create( 57 | engine="text-davinci-002", 58 | prompt=prompt, 59 | temperature=0.0, 60 | max_tokens=256, 61 | top_p=1.0, 62 | frequency_penalty=0.0, 63 | presence_penalty=0.0 64 | ) 65 | return response['choices'][0]['text'].strip().replace('\n', ' ') 66 | 67 | 68 | ######################################################################### 69 | ########################### EXPERIMENTS ################################ 70 | ######################################################################### 71 | 72 | # Write to news_summaries.txt for analysis. 73 | with open(output, 'w') as f_out: 74 | for question, date in queries: 75 | f_out.write('////////////////////////////////////////////////////////\n') 76 | f_out.write('////////////////////////////////////////////////////////\n') 77 | 78 | f_out.write(f'question: {question}, date filter: {date}\n') 79 | 80 | f_out.write('================= GPT3 NO CONTEXT ======================\n') 81 | # Build prompt without context. 
82 | prompt = get_no_context_prompt(question) 83 | f_out.write(f'Prompt: \n{prompt}\n') 84 | summary = prompt_to_essay(prompt) 85 | f_out.write(f'{summary}\n\n') 86 | 87 | f_out.write('================= GPT3 + Marqo =======================\n') 88 | # Query Marqo and set filters based on user query 89 | if isinstance(date, str): 90 | results = mq.index(DOC_INDEX_NAME).search(q=question, 91 | searchable_attributes=['Title', 'Description'], 92 | filter_string=f"date:{date}", 93 | limit=5) 94 | else: 95 | results = mq.index(DOC_INDEX_NAME).search(q=question, 96 | searchable_attributes=['Title', 'Description'], 97 | limit=5) 98 | 99 | # Build context using Marqo's highlighting functionality. 100 | context = '' 101 | for i, hit in enumerate(results['hits']): 102 | title = hit['Title'] 103 | text = hit['Description'] 104 | # for section, text in hit['_highlights'].items(): 105 | # context += text + '\n' 106 | context += f'Source {i}) {title} || {" ".join(text.split()[:60])}... \n' 107 | # Build prompt with Marqo context. 108 | prompt = get_context_prompt(question=question, context=context) 109 | f_out.write(f'Prompt: \n{prompt}\n') 110 | summary = prompt_to_essay(prompt) 111 | f_out.write(f'{summary}\n\n') 112 | 113 | -------------------------------------------------------------------------------- /news_summaries.txt: -------------------------------------------------------------------------------- 1 | //////////////////////////////////////////////////////// 2 | //////////////////////////////////////////////////////// 3 | question: How is the US Midterm Election going?, date filter: None 4 | ================= GPT3 NO CONTEXT ====================== 5 | Prompt: 6 | Question: How is the US Midterm Election going? 7 | 8 | Answer: 9 | The US Midterm Election is going well. 
10 | 11 | ================= GPT3 + Marqo ======================= 12 | Prompt: 13 | Background: 14 | Source 0) Georgia race goes to run-off as fight for US Senate neck-and-neck || Results are being declared in the US midterm elections, with control of Congress hanging in the balance. Republicans are likely to take control of the House of Representatives but the Senate fight is on a knife-edge. The race for the Senate seat in Georgia, which could determine the outcome, will not be decided until a runoff election on 6 December.... 15 | Source 1) US election results: When will we know who won? || As voters across the US wake up on Wednesday morning after the election, the results of the 2022 midterms are not yet completely clear - with officials across the country warning that elections may drag on for days or weeks. The expected delays are the result of a number of factors, including razor-thin margins between candidates, potentially contested elections, the... 16 | Source 2) Republican 'red wave' turns into a ripple, Georgia Senate headed to runoff || Republicans made modest gains in U.S. midterm elections but Democrats performed better than expected, as control of the Senate hinged on three races that remained too close to call on Wednesday afternoon. The Georgia U.S. Senate race between Democratic incumbent Raphael Warnock and Republican Herschel Walker will go to a Dec. 6 runoff, Edison Research projected. That means it could... 17 | Source 3) Control of U.S. Congress unclear as Republican 'red wave' fizzles || Republicans made modest gains in U.S. midterm elections but Democrats performed better than expected, as control of the Senate hinged on three races that remained too close to call on Wednesday afternoon. The Georgia U.S. Senate race between Democratic incumbent Raphael Warnock and Republican Herschel Walker will go to a Dec. 6 runoff, Edison Research projected. That means it could... 
18 | Source 4) Allianz beats quarterly profit expectations, posts rosier 2022 outlook || German insurer Allianz on Wednesday posted a better-than-expected 17% rise in third-quarter net profit, helped by strength at its property and casualty division, and gave a more optimistic full-year outlook.... 19 | 20 | 21 | Question: How is the US Midterm Election going? 22 | 23 | Answer: 24 | The US Midterm Election is going neck-and-neck, with the Republicans likely to take control of the House of Representatives, but the Senate fight is on a knife-edge. The race for the Senate seat in Georgia, which could determine the outcome, will not be decided until a runoff election on 6 December. 25 | 26 | //////////////////////////////////////////////////////// 27 | //////////////////////////////////////////////////////// 28 | question: How is COP27 progressing?, date filter: None 29 | ================= GPT3 NO CONTEXT ====================== 30 | Prompt: 31 | Question: How is COP27 progressing? 32 | 33 | Answer: 34 | The COP27 climate conference is progressing well. The conference is focused on finalizing the details of the Paris Agreement, which was reached in 2015. 35 | 36 | ================= GPT3 + Marqo ======================= 37 | Prompt: 38 | Background: 39 | Source 0) What is COP27 and why is it important? || World leaders are discussing action to tackle climate change at the COP27 climate summit in Egypt. It follows a year of climate-related disasters and broken temperature records. UK Prime Minister Rishi Sunak is attending, having previously said he would not. United Nations (UN) climate summits are held every year, for governments to agree steps to limit global temperature rises. They... 40 | Source 1) Show us the money: Developing world at COP27 seeks financing details || Finance took centre stage at the COP27 climate talks on Wednesday, with U.N. 
experts publishing a list of projects worth $120 billion that investors could back to help poorer countries cut emissions and adapt to the impacts of global warming. A $3 billion water transfer project between Lesotho and Botswana and a $10 million plan to improve the public water... 41 | Source 2) COP27: Time to pay the climate bill - vulnerable nations || Leaders of countries flooded or parched due to climate change are pleading at the COP27 summit for an urgent financial lifeline from richer nations. "We will not give up... the alternative consigns us to a watery grave," Bahamas Prime Minister Philip Davis said.Countries are meeting in Sharm el-Sheikh, Egypt to discuss next steps in curbing climate change. Front line nations... 42 | Source 3) COP27: Ukraine a reason to act fast on climate change - Rishi Sunak || The war in Ukraine is a reason to act faster to tackle climate change, UK Prime Minister Rishi Sunak has told the UN climate summit COP27. "Climate and energy security go hand-in-hand," he said in his first international appearance since taking office. Leaders from 120 countries are meeting in Sharm el-Sheikh, Egypt to discuss next steps in curbing climate change.... 43 | Source 4) COP27: UAE and Egypt agree to build one of world's biggest wind farms || The presidents of the United Arab Emirates (UAE) and Egypt witnessed the signing of an agreement on Tuesday to develop one of the world's largest onshore wind projects in Egypt, according to an official statement on the Gulf nation's state news agency. The memorandum of understanding was signed between the UAE's renewable energy firm Masdar alongside its joint venture with... 44 | 45 | 46 | Question: How is COP27 progressing? 47 | 48 | Answer: 49 | COP27 is progressing well, with leaders from around the world discussing ways to tackle climate change. One key issue that has been raised is the need for more financing to help developing countries cut emissions and adapt to the impacts of global warming. 
Another issue that has been discussed is the importance of increasing renewable energy production, with the UAE and Egypt agreeing to develop one of the world's largest onshore wind projects. 50 | 51 | //////////////////////////////////////////////////////// 52 | //////////////////////////////////////////////////////// 53 | question: What is happening in business today?, date filter: 2022-11-09 54 | ================= GPT3 NO CONTEXT ====================== 55 | Prompt: 56 | Question: What is happening in business today? 57 | 58 | Answer: 59 | There is a lot happening in business today. The economy is slowly recovering from the recession, and businesses are starting to invest again. However, there is still a lot of uncertainty in the business world, and many companies are struggling to stay afloat. 60 | 61 | ================= GPT3 + Marqo ======================= 62 | Prompt: 63 | Background: 64 | Source 0) M&S warns of 'gathering storm' as shoppers squeezed || Marks and Spencer has warned of a "gathering storm" of higher costs for retailers and pressure on household budgets as it reported a fall in profits for the first half of the year. The High Street giant said trading would become "more challenging" after it revealed its profits dropped by 24%. It said "all parts" of retail would be affected... 65 | Source 1) Facebook-owner Meta to cut 11,000 staff || Meta, which owns Facebook, Instagram and WhatsApp, has announced that it will cut 13% of its workforce. The first mass lay-offs in the firm's history will result in 11,000 employees, from a worldwide headcount of 87,000, losing their jobs. Meta chief executive Mark Zuckerberg said the cuts were "the most difficult changes we've made in Meta's history". The news follows... 
66 | Source 2) Allianz beats quarterly profit expectations, posts rosier 2022 outlook || German insurer Allianz on Wednesday posted a better-than-expected 17% rise in third-quarter net profit, helped by strength at its property and casualty division, and gave a more optimistic full-year outlook.... 67 | Source 3) Georgia race goes to run-off as fight for US Senate neck-and-neck || Results are being declared in the US midterm elections, with control of Congress hanging in the balance. Republicans are likely to take control of the House of Representatives but the Senate fight is on a knife-edge. The race for the Senate seat in Georgia, which could determine the outcome, will not be decided until a runoff election on 6 December.... 68 | Source 4) Tesla stock hits 2-year low after Musk sells $4 bln worth of shares || Tesla Inc (TSLA.O) shares slid to their lowest level in nearly two years on Wednesday after Chief Executive Elon Musk sold $3.95 billion worth of shares in the electric-vehicle maker. The shares were down 6.1% at $179.66 in afternoon trading. Musk's latest share sale fueled jitters about the fallout of his Twitter buy on the world's most valuable automaker, analysts... 69 | 70 | 71 | Question: What is happening in business today? 72 | 73 | Answer: 74 | There are a few major stories happening in business today. Firstly, Marks and Spencer has warned of a "gathering storm" of higher costs for retailers and pressure on household budgets. Secondly, Facebook-owner Meta is cutting 11,000 staff in its first mass lay-offs. And finally, Tesla stock has hit a 2-year low after CEO Elon Musk sold $4 billion worth of shares. 75 | 76 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Using Marqo to Power News Summarisation 2 | 3 | ## Who am I? 
4 | 5 | My name is Iain Mackie and I lead NLP investments at Creator Fund and previously worked in Quant trading. I am currently finishing my PhD in neural search systems at the University of Glasgow and recently won the $500k grand prize at the Alexa TaskBot Challenge.🤖 6 | 7 | Today we announce that Creator Fund has led a £660,000 Pre-Seed investment into Marqo. Marqo allows search engines to think like humans through neural search. The Melbourne-based company is led by two senior engineers from Amazon Robotics and Australia. 8 | 9 | ## What is neural search? 10 | 11 | Search, or information retrieval, is the study of retrieving relevant information given a user query. The most obvious example is Google, where a user inputs a query and Google will return a ranked list of web documents. Search is complex because: 12 |
29 |
30 |
31 | Marqo's open-source GitHub repository has reached 1.1k+ 🌟 in 6 weeks (top 10 trending libraries!). They have also launched a cloud beta that allows customers to pay-per-use and reduces costs due to pooled resources (join waiting list). Lastly, they are building a community of search enthusiasts tackling different problems (Slack).
32 |
33 | ## Topical news summarisation
34 |
35 | Now for the fun bit...
36 |
37 | I wanted to build a fun search application within minutes to show the ease and power of Marqo. I decided to build a news summarisation application that answers questions like "What is happening in business today?" by synthesising an example news corpus (link).
38 |
39 | The plan is to use Marqo's search to provide useful context for a generation algorithm; we use OpenAI's GPT3 API (link). This is more formally called "retrieval-augmented generation" and helps with generation tasks that require specific knowledge that the model has not seen during training, for example company-specific documents or news data that's "in the future" relative to the model's training data. Overview of what we're planning:
40 |
41 |
42 |
43 |
44 |
45 | Thus, we can see the problem when we solely ask GPT3, "What is happening in business today?" It does not know and thus generates a generic response:
46 | ```
47 | Question: What is happening in business today?
48 |
49 | Answer:
50 | There is a lot happening in business today. The economy is slowly recovering from the recession, and businesses are starting to invest again...
51 | ```
52 | In fact, anyone following the financial markets knows that the claims "economy is slowly recovering" and "businesses are starting to invest again" are completely wrong!
53 |
54 | To solve this, we need to start our Marqo docker container, which exposes the API on port 8882 that we'll interact with through the Python client during this demo:
55 | ```
56 | docker pull marqoai/marqo:0.0.6;
57 | docker rm -f marqo;
58 | docker run --name marqo -it --privileged -p 8882:8882 --add-host host.docker.internal:host-gateway marqoai/marqo:0.0.6
59 | ```
60 |
61 | Next, let's look at our example news documents corpus, which contains BBC and Reuters news content from the 8th and 9th of November 2022. Each document has an "_id" (the Marqo document identifier), the "date" the article was written, a "website" indicating the web domain, a "Title" for the headline, and a "Description" for the article body:
62 | ```
63 | [
64 | {
65 | '_id': '2',
66 | 'date': '2022-11-09',
67 | 'website': 'www.bbc.co.uk',
68 | 'Title': 'COP27: Time to pay the climate bill - vulnerable nations',
69 | 'Description': 'Leaders of countries flooded or parched due to climate change are pleading at the COP27 summit...'
70 | },...
71 | ]
72 | ```
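Before indexing, it can help to check that every document carries the five fields described above. The following is a minimal sketch, not part of the original repo: `REQUIRED_FIELDS`, `missing_fields` and `check_documents` are hypothetical helper names, and the field list is taken from the example document shown above.

```python
# Hypothetical helper (not from the repo): the field names come from the
# example document shown above.
REQUIRED_FIELDS = ("_id", "date", "website", "Title", "Description")


def missing_fields(doc):
    """Return the required fields that are absent or empty in one document."""
    return [field for field in REQUIRED_FIELDS if not doc.get(field)]


def check_documents(docs):
    """Map each incomplete document's _id (or its position) to its missing fields."""
    problems = {}
    for position, doc in enumerate(docs):
        missing = missing_fields(doc)
        if missing:
            key = doc.get("_id") or f"#{position}"
            problems[key] = missing
    return problems


example = {
    "_id": "2",
    "date": "2022-11-09",
    "website": "www.bbc.co.uk",
    "Title": "COP27: Time to pay the climate bill - vulnerable nations",
    "Description": "Leaders of countries flooded or parched due to climate change...",
}
print(check_documents([example]))  # an empty dict means nothing is missing
```

An empty result means every document has all five fields; anything else lists what to fix before calling `add_documents`.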
73 |
74 | We then index our news documents, and Marqo manages both the lexical index and the neural embeddings. By default, Marqo uses SBERT for neural text embedding and has complete OpenSearch lexical and metadata functionality natively.
75 | ```
76 | import marqo
77 | from news import MARQO_DOCUMENTS
78 | DOC_INDEX_NAME = 'news-index'
79 |
80 | print('Establishing connection to marqo client.')
81 | mq = marqo.Client(url='http://localhost:8882')
82 |
83 | print('Indexing documents')
84 | mq.index(DOC_INDEX_NAME).add_documents(MARQO_DOCUMENTS)
85 | ```
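Re-running the snippet above adds the documents again on every run. The repo's `main.py` avoids this by first calling `get_stats()` on the index and only indexing when that call fails. Below is a sketch of the same idea as a reusable function. `ensure_indexed` is a hypothetical name, and catching a broad `Exception` is an assumption, since the specific error raised for a missing index is not shown in the repo.

```python
def ensure_indexed(client, index_name, documents):
    """Add documents only when the index does not answer get_stats().

    `client` is expected to behave like marqo.Client: client.index(name)
    returns an object with get_stats() and add_documents(), as used in main.py.
    Returns True if documents were added, False if the index already existed.
    """
    index = client.index(index_name)
    try:
        index.get_stats()
    except Exception:
        # Assumption: any failure here means the index has not been built yet.
        # KeyboardInterrupt is not an Exception subclass, so Ctrl-C still works.
        index.add_documents(documents)
        return True
    return False


# Usage, assuming a running Marqo container and the repo's news.py:
#   import marqo
#   from news import MARQO_DOCUMENTS
#   mq = marqo.Client(url='http://localhost:8882')
#   ensure_indexed(mq, 'news-index', MARQO_DOCUMENTS)
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object when no Marqo container is running.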
86 |
87 | Now that we have indexed our news documents, we can simply use Marqo's Python search API to return relevant context for our GPT3 generation. For the query "q", we use the question, and we match news context against the "Title" and "Description" text. We also filter our documents for "today", which was '2022-11-09'.
88 | ```
89 | question = 'What is happening in business today?'
90 | date = '2022-11-09'
91 | results = mq.index(DOC_INDEX_NAME).search(
92 | q=question,
93 | searchable_attributes=['Title', 'Description'],
94 | filter_string=f"date:{date}",
95 | limit=5)
96 | ```
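The search results still need to be turned into text for the prompt. The sketch below follows what `main.py` does: the date filter is only applied when a date is given, and each hit becomes one `Source i) Title || first 60 words...` line. The helper names `build_search_kwargs` and `build_context` are hypothetical; the search arguments and the hit fields are the ones used in the repo.

```python
def build_search_kwargs(question, date=None, limit=5):
    """Assemble the keyword arguments for the search call shown above."""
    kwargs = {
        "q": question,
        "searchable_attributes": ["Title", "Description"],
        "limit": limit,
    }
    # Only filter by date when one is supplied, as main.py does.
    if isinstance(date, str):
        kwargs["filter_string"] = f"date:{date}"
    return kwargs


def build_context(hits, max_words=60):
    """Format hits as 'Source i) Title || first max_words words... ' lines."""
    lines = []
    for i, hit in enumerate(hits):
        snippet = " ".join(hit["Description"].split()[:max_words])
        lines.append(f"Source {i}) {hit['Title']} || {snippet}... \n")
    return "".join(lines)


# Usage, assuming `mq` and DOC_INDEX_NAME from the indexing step:
#   results = mq.index(DOC_INDEX_NAME).search(**build_search_kwargs(question, date))
#   context = build_context(results['hits'])
```

Truncating each article to a fixed number of words keeps the prompt short enough for the completion request; 60 is the value used in `main.py`.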
97 |
98 | Next, we insert Marqo's search results into the GPT3 prompt as context and try generating an answer again:
99 | ```
100 | Background:
101 | Source 0) M&S warns of 'gathering storm' as shoppers squeezed || Marks and Spencer has warned of a "gathering storm" of higher costs for retailers and pressure on household budgets as it reported a fall in profits for the first half of the year. The High Street giant said trading would become "more challenging" after it revealed its profits dropped by 24%. It said "all parts" of retail would be affected...
102 | Source 1) Facebook-owner Meta to cut 11,000 staff || Meta, which owns Facebook, Instagram and WhatsApp, has announced that it will cut 13% of its workforce. The first mass lay-offs in the firm's history will result in 11,000 employees, from a worldwide headcount of 87,000, losing their jobs. Meta chief executive Mark Zuckerberg said the cuts were "the most difficult changes we've made in Meta's history". The news follows...
103 | Source 2) Allianz beats quarterly profit expectations, posts rosier 2022 outlook || German insurer Allianz on Wednesday posted a better-than-expected 17% rise in third-quarter net profit, helped by strength at its property and casualty division, and gave a more optimistic full-year outlook....
104 | Source 3) Georgia race goes to run-off as fight for US Senate neck-and-neck || Results are being declared in the US midterm elections, with control of Congress hanging in the balance. Republicans are likely to take control of the House of Representatives but the Senate fight is on a knife-edge. The race for the Senate seat in Georgia, which could determine the outcome, will not be decided until a runoff election on 6 December....
105 | Source 4) Tesla stock hits 2-year low after Musk sells $4 bln worth of shares || Tesla Inc (TSLA.O) shares slid to their lowest level in nearly two years on Wednesday after Chief Executive Elon Musk sold $3.95 billion worth of shares in the electric-vehicle maker. The shares were down 6.1% at $179.66 in afternoon trading. Musk's latest share sale fueled jitters about the fallout of his Twitter buy on the world's most valuable automaker, analysts...
106 |
107 |
108 | Question: What is happening in business today?
109 |
110 | Answer:
111 | There are a few major stories happening in business today. Firstly, Marks and Spencer has warned of a "gathering storm" of higher costs for retailers and pressure on household budgets. Secondly, Facebook-owner Meta is cutting 11,000 staff in its first mass lay-offs. And finally, Tesla stock has hit a 2-year low after CEO Elon Musk sold $4 billion worth of shares.
112 | ```
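The prompt above can be assembled with a few lines of Python. This sketch reuses the prompt template and the output clean-up from `main.py`; `answer_with_context` and its `complete` parameter are hypothetical, introduced here so that any text-completion callable can be plugged in (in the repo, that role is played by `openai.Completion.create` with the `text-davinci-002` engine).

```python
def get_context_prompt(question, context):
    """Prompt template from main.py: background first, then the question."""
    return f"Background: \n{context}\n\nQuestion: {question}\n\nAnswer:"


def clean_completion(text):
    """Strip surrounding whitespace and flatten newlines, as main.py does."""
    return text.strip().replace("\n", " ")


def answer_with_context(question, context, complete):
    """Build the prompt, call `complete(prompt)` and tidy the returned text.

    `complete` is a hypothetical stand-in: any callable that takes a prompt
    string and returns the generated text.
    """
    prompt = get_context_prompt(question, context)
    return clean_completion(complete(prompt))


# Example with a dummy completion function in place of a real model call.
demo = answer_with_context(
    "What is happening in business today?",
    "Source 0) Example headline || Example body... \n",
    complete=lambda prompt: "\nAn example answer.\n",
)
print(demo)
```

Keeping the model call behind a plain callable makes it straightforward to swap the generation backend without touching the prompt logic.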
113 |
114 | Success! You'll notice that using Marqo to add relevant and temporally correct context means we can build a news summarisation application with ease. So instead of wrong and vague answers, we get factually-grounded summaries based on retrieved facts such as:
115 |